[FIX] Stabilizing the Validators' Block Miss

Update on Missed Blocks:
In response to the missed blocks issue, our technical team has been working diligently to identify and implement a solution. We understand the impact this has on the user experience and the broader dYdX community. The good news is that we may have found a solution that has allowed us to achieve near-zero block misses (as shown in the Appendix). However, transparency is paramount, and we want to be confident in our resolution before sharing it widely.

Our Approach:
Instead of rushing to conclusions, we are taking a cautious approach. We recognize that a thorough examination of our potential solution requires more data points and real-world testing. We are actively monitoring the situation, collecting additional data, and rigorously testing the proposed solution in different scenarios.

Conclusion:
In the spirit of transparency and continuous improvement, we acknowledge the challenges posed by the missed blocks issue. We are actively working on a solution and will keep the community updated as we gather more data and validate the effectiveness of our proposed fix. Your trust means everything to us, and we are committed to resolving this issue collaboratively. Together, we navigate challenges, learn, and emerge stronger as a community. Thank you for your patience and understanding as we work towards a seamless experience on dYdX v4.


Appendix:
The following chart shows the effect on missed blocks (apart from our reboot this morning, which appears as the only glitch in the data since then).


Thanks for reading,
pro-delegators-sign


Hi,

An overall improvement happened at the time Santorini joined the network with ~19% of the voting power.

The Santorini sign rate is not very good, with a lot of missing signatures (one of the worst uptimes in the active set).

My supposition is that Santorini adds a lot of latency to block signing, which gives more validators time to broadcast their signatures.
This results in a much better sign rate for all the validators.
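To make this supposition concrete, here is a minimal toy model in Python. It assumes a block is committed once validators holding strictly more than 2/3 of the voting power have signed, and that any signature arriving later is counted as missed. The validator names, voting powers, and latencies are hypothetical; only the 19% share comes from the observation above.

```python
# Toy model: a block is committed once validators holding strictly more than
# 2/3 of the total voting power have signed. All names, voting powers and
# latencies below are hypothetical, chosen only to illustrate the idea.

def quorum_time_ms(validators):
    """Earliest signature-arrival time at which >2/3 of voting power has signed."""
    total = sum(power for _, power, _ in validators)
    signed = 0
    for _, power, latency_ms in sorted(validators, key=lambda v: v[2]):
        signed += power
        if 3 * signed > 2 * total:
            return latency_ms
    raise ValueError("quorum is never reached")

def signers(validators):
    """Validators whose signature arrives no later than the quorum time."""
    cutoff = quorum_time_ms(validators)
    return [name for name, _, latency_ms in validators if latency_ms <= cutoff]

# (name, voting power, signature latency in ms)
before = [("fast_a", 35, 100), ("fast_b", 30, 150), ("slow_c", 16, 400)]
# A new validator joins with 19% of the voting power and a high latency.
after = before + [("large_slow", 19, 600)]

for label, vals in (("before", before), ("after", after)):
    print(label, quorum_time_ms(vals), signers(vals))
```

In this toy set, the two fast validators no longer exceed 2/3 on their own once the large validator joins, so the quorum has to wait for slow_c. slow_c starts signing, the block takes longer to commit, and the large validator itself still misses. Real consensus has more moving parts (rounds, timeouts, gossip), so treat this only as an illustration.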


source: Grafana

The downside of this latency is the block time of the chain, which went from 1.1 s to 1.4 s per block.
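For a sense of scale, the short Python calculation below converts the two block times into blocks per day and relative changes. It only uses the 1.1 s and 1.4 s figures quoted here; the helper names are mine.

```python
SECONDS_PER_DAY = 86_400

def blocks_per_day(block_time_s):
    """Number of blocks produced in a day at a constant block time."""
    return SECONDS_PER_DAY / block_time_s

def relative_change(old, new):
    """Change from old to new, as a fraction of old."""
    return (new - old) / old

old_block_time_s = 1.1
new_block_time_s = 1.4

block_time_change = relative_change(old_block_time_s, new_block_time_s)
throughput_change = relative_change(
    blocks_per_day(old_block_time_s), blocks_per_day(new_block_time_s)
)

print(f"block time change: {block_time_change:+.1%}")
print(f"blocks per day change: {throughput_change:+.1%}")
```

That works out to blocks that are roughly 27% longer, or roughly 21% fewer blocks per day.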


source: https://twitter.com/David_Crosnest/status/1726241853069041708

We need to keep in mind that the chain is almost always idle: no trading bots or market makers are active on it for now, and daily volume is about 160k.
What will happen when real activity hits these validators?


As previously mentioned, we require additional data points before reaching conclusive findings. However, one noteworthy observation is that the improvements do not seem to be linked to Santorini's impact on the set. The evidence against that explanation is that other validators have seen only a reduction in their missed blocks, whereas the implemented fix has brought our missed blocks down to zero.

The introduction of Santorini appears to have merely coincided with the timing of our modifications. Nevertheless, this is one of the reasons we exercise caution before publicly disclosing the proposed solution. It is imperative for us to eliminate any potential false assumptions. We appreciate the community’s understanding as we take a few more days to ensure precision. This preliminary statement is intended to keep you informed of our progress.

pro-delegators-sign


After two weeks of rigorous performance testing on our validator, we regrettably found the initial situation to be below acceptable standards for production based on our criteria. In response, we conducted production tests at two of our sites, one in Europe and the other in Asia, using the same type of machine: a bare metal server with an Intel Xeon E-2386G CPU, 64 GB DDR4 ECC RAM, and RAID 10 on NVMe drives.

During the one-week testing period at each site, we consistently observed unacceptable performance with the out-of-the-box validator configuration. After numerous configuration tests, we eventually reached an acceptable situation with the following adjustments in the config.toml file, mostly under the p2p section, plus one setting each under the mempool and consensus sections:

config.toml

[p2p]
max_num_inbound_peers = 60
max_num_outbound_peers = 60
flush_throttle_timeout = "10ms"
send_rate = 20480000
recv_rate = 20480000

[mempool]
version = "v0"

[consensus]
timeout_propose = "2s"

These changes were made to optimize gossip communication both in and out due to the increased number of inbound and outbound peers. Additionally, we reverted to the deprecated version “v0” for mempool. To accommodate the maximum block time, we adjusted the consensus timeout_propose to “2s”.
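To put the send_rate and recv_rate values in more familiar units, here is a small Python conversion. It assumes the rates are expressed in bytes per second and apply per peer connection, so the aggregate figure is a theoretical ceiling with all 120 peers saturated, not something we measured.

```python
BYTES_PER_MB = 1_000_000

def to_megabytes_per_s(rate_bytes_per_s):
    """Convert a rate in bytes/second to megabytes/second."""
    return rate_bytes_per_s / BYTES_PER_MB

def to_megabits_per_s(rate_bytes_per_s):
    """Convert a rate in bytes/second to megabits/second."""
    return rate_bytes_per_s * 8 / BYTES_PER_MB

send_rate = 20_480_000          # value from the config above
max_peers = 60 + 60             # inbound + outbound peer limits

per_peer_mbit = to_megabits_per_s(send_rate)
# Assumption: the limit is per connection, so this is only an upper bound.
ceiling_mbit = per_peer_mbit * max_peers

print(f"per peer: {to_megabytes_per_s(send_rate):.2f} MB/s = {per_peer_mbit:.2f} Mbit/s")
print(f"theoretical ceiling over {max_peers} peers: {ceiling_mbit:.1f} Mbit/s")
```

The point is simply that these limits sit well above what a single peer normally needs, so the network link of the server, rather than the configured rate, is likely to be the practical bound.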

To enhance security measures, we also bound the p2p port on our firewall. With this revised configuration, we successfully reduced our missing block rate by an impressive 75%. It’s essential to note that while these adjustments proved effective for our hardware setup, they may require fine-tuning based on different hardware solutions.

We share these conclusions not as an ultimate solution but as a foundational reference for the community to iterate upon. Our goal is to contribute to the broader validator set’s efficiency improvement. If you have any alternative suggestions or insights, we welcome your input. Please feel free to contact us with any further recommendations.


Follow-up on the subject of improving the sign rate on the network.

I’m continually monitoring overall performance and would like to share my findings here.

I have selected parts of the graphs to show more clearly the points I’d like to make.

On this screenshot I have marked two events:

  • the FlashCat tombstone.
  • the end of the vote on proposal 2 and the start of incentives.

FlashCat tombstone
With the FlashCat tombstone, we can observe the same type of events as those explained above.

FlashCat was a well-ranked validator with an average miss rate like many other validators (~ 80 misses over the observation window).

FlashCat’s tombstone should have created a voting power differential allowing more validators to sign.

If we zoom in, we can see the cause-and-effect relationship more clearly.
FlashCat is the red line, replaced by Blockscape,
and at 23:00 the Nocturnal Labs validator begins to sign.

The reality is different, as FlashCat’s tombstone has the effect of giving more voting power (in %) to Figment, which has an almost perfect sign rate.
Block consensus is then reached more easily, as sufficient voting power signs sooner, and slower validators lose sign rate.
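The shift in relative voting power can be sketched with a few lines of Python. The names and numbers below are hypothetical and are not the real FlashCat, Figment, or Blockscape figures; the sketch only shows how replacing a mid-sized validator with a small newcomer raises everyone else's share.

```python
def shares(powers):
    """Voting power of each validator as a fraction of the total."""
    total = sum(powers.values())
    return {name: power / total for name, power in powers.items()}

def replace_validator(powers, removed, added, added_power):
    """Return a new set where `removed` leaves and `added` joins."""
    updated = {n: p for n, p in powers.items() if n != removed}
    updated[added] = added_power
    return updated

# Hypothetical voting powers, for illustration only.
before = {"reliable_large": 30, "tombstoned": 10, "rest_of_set": 60}
after = replace_validator(before, "tombstoned", "newcomer", 1)

share_before = shares(before)["reliable_large"]
share_after = shares(after)["reliable_large"]
print(f"reliable_large: {share_before:.1%} -> {share_after:.1%}")
```

With a larger share held by a fast, reliable signer, the 2/3 threshold is reached earlier, which leaves less time for slower validators to get their signatures in.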

End of vote on proposal 2

We noticed an increase in network load shortly before the exchange incentives were introduced.
This shows an increase in chain activity (more transactions = more traffic).

This also has the effect of loading the network bandwidth of our servers a little more, which can lead to increased latency for signature broadcasts.

We can also see that some validators are having more difficulty.
This is true even for the validators with the best sign rates.


@David @Govmos

Thanks for continuously monitoring this and providing some thoughts.

What are some potential actions/outcomes from this that the dYdX community/other validators can take away to improve block time & config variables? Any suggestions for the hardware improvements that you mentioned above?
