[FIX] Stabilizing the Validators' Block Miss

Update on Missed Blocks:
In response to the missed blocks issue, our technical team has been working diligently to identify and implement a solution. We understand the impact this has on the user experience and the broader dYdX community. The good news is that we may have found a solution that brought us to near-zero block misses (as shown in the Appendix). However, transparency is paramount, and we want to be fully confident in our resolution before sharing it widely.

Our Approach:
Instead of rushing to conclusions, we are taking a cautious approach. A thorough examination of our potential solution requires more data points and real-world testing. We are actively monitoring the situation, collecting additional data, and rigorously testing the proposed solution in different scenarios.

Conclusion:
In the spirit of transparency and continuous improvement, we acknowledge the challenges posed by the missed blocks issue. We are actively working on a solution and will keep the community updated as we gather more data and validate the effectiveness of our proposed fix. Your trust means everything to us, and we are committed to resolving this issue collaboratively. Together, we navigate challenges, learn, and emerge stronger as a community. Thank you for your patience and understanding as we work towards a seamless experience on dYdX v4.


Appendix:
The following chart shows the effect on missed blocks. The only glitch in the data since the change is our reboot this morning.


Thanks for reading,
pro-delegators-sign

3 Likes

Hi,

An overall improvement happened at the time Santorini joined the network with ~19% of the voting power.

The Santorini sign rate is not very good, with a lot of missing signatures (one of the worst uptimes in the active set).

My hypothesis is that Santorini adds a lot of latency to block signing, which gives more validators time to broadcast their signatures.
This results in a much better sign rate for all the validators.


source: Grafana

The downside of this latency is the block rate of the chain: a change from 1.1 s to 1.4 s per block.


source: https://twitter.com/David_Crosnest/status/1726241853069041708

We need to keep in mind that the chain is almost always idle: no trading bots or market makers are active on it for now, and daily volume is about 160k.
What will happen when real activity hits these validators?

2 Likes

As previously mentioned, we require additional data points before reaching conclusive findings. However, one noteworthy observation is that the improvement does not seem to be linked to Santorini’s impact on the set. The evidence against that explanation is that other validators have only seen a reduction in their missed blocks, whereas the fix we implemented has brought our own missed blocks down to zero.

The introduction of Santorini appears to have merely coincided with the timing of our modifications. Nevertheless, this is one of the reasons we exercise caution before publicly disclosing the proposed solution. It is imperative for us to eliminate any potential false assumptions. We appreciate the community’s understanding as we take a few more days to ensure precision. This preliminary statement is intended to keep you informed of our progress.

pro-delegators-sign

2 Likes

After two weeks of rigorous performance testing on our validator, we regrettably found the initial situation to be below acceptable standards for production based on our criteria. In response, we conducted production tests at two of our sites, one in Europe and the other in Asia, using the same type of machine at each: a bare-metal server with an Intel Xeon E-2386G CPU, 64 GB DDR4 ECC RAM, and RAID 10 on NVMe drives.

During the one-week testing period at each site, we consistently observed unacceptable performance with the out-of-the-box validator configuration. Following numerous configuration tests, we eventually reached an acceptable situation with the following adjustments in the config.toml file, mostly under the p2p section:

config.toml

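# p2p section
[p2p]
max_num_inbound_peers = 60
max_num_outbound_peers = 60
flush_throttle_timeout = "10ms"
send_rate = 20480000
recv_rate = 20480000

# mempool section (mempool_version)
[mempool]
version = "v0"

# consensus section (consensus_timeout_propose)
[consensus]
timeout_propose = "2s"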

These changes were made to optimize gossip communication both in and out due to the increased number of inbound and outbound peers. Additionally, we reverted to the deprecated version “v0” for mempool. To accommodate the maximum block time, we adjusted the consensus timeout_propose to “2s”.

To enhance security measures, we also bound the p2p port on our firewall. With this revised configuration, we successfully reduced our missing block rate by an impressive 75%. It’s essential to note that while these adjustments proved effective for our hardware setup, they may require fine-tuning based on different hardware solutions.

We share these conclusions not as an ultimate solution but as a foundational reference for the community to iterate upon. Our goal is to contribute to the broader validator set’s efficiency improvement. If you have any alternative suggestions or insights, we welcome your input. Please feel free to contact us with any further recommendations.

3 Likes

Follow-up on the subject of improving the sign rate on the network.

I’m continually monitoring overall performance and would like to share my findings here.

I have selected portions of the graphs to show more clearly the points I’d like to make.

On this screenshot I made two marks:

  • the FlashCat tombstone.
  • the end of the vote on proposal 2 and the start of incentives.

FlashCat tombstone
We can observe the same kind of effect as described above with the FlashCat tombstone.

FlashCat was a well-ranked validator with an average miss rate like many other validators (~ 80 misses over the observation window).

FlashCat’s tombstone should have created a voting power differential allowing more validators to sign.

If we zoom in, we can see the cause-and-effect relationship more clearly:
FlashCat is the red line, replaced by Blockscape,
and at 23:00 the Nocturnal Labs validator begins to sign.

The reality is different, as FlashCat’s tombstone has the effect of giving more voting power (in %) to Figment, which has an almost perfect sign rate.
Block consensus is then reached sooner, once sufficient voting power has signed, and slower validators lose sign rate.
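
The two effects discussed in this thread (a slow, heavy validator improving everyone's sign rate, and a fast, heavy validator hurting the slow ones) can be illustrated with a toy model. The sketch below is a simplification, not the actual consensus code: it assumes a block is finalized once more than 2/3 of the voting power has signed, plus an optional grace window, and that any signature arriving later counts as a miss. Validator names, voting powers, arrival times, and the grace window are all made up.

```python
def missed_signers(validators, grace_ms=0):
    """validators: list of (name, voting_power, arrival_ms).

    Returns (commit_time_ms, names whose signature arrives too late).
    """
    total = sum(power for _, power, _ in validators)
    signed = 0
    commit_ms = None
    # Walk signatures in arrival order until more than 2/3 of the power has signed.
    for _, power, arrival in sorted(validators, key=lambda v: v[2]):
        signed += power
        if 3 * signed > 2 * total:
            commit_ms = arrival + grace_ms
            break
    if commit_ms is None:
        return None, []
    late = [name for name, _, arrival in validators if arrival > commit_ms]
    return commit_ms, late


# Hypothetical set: the heavy validator is slow, so the threshold is reached late
# and every other validator has time to sign.
slow_heavy = [("heavy", 40, 900), ("a", 20, 100), ("b", 20, 200), ("c", 10, 400), ("d", 10, 700)]
# Same set, but the heavy validator is fast: the threshold is reached early
# and the slower validators miss the block.
fast_heavy = [("heavy", 40, 50), ("a", 20, 100), ("b", 20, 200), ("c", 10, 400), ("d", 10, 700)]

print(missed_signers(slow_heavy))  # (900, [])
print(missed_signers(fast_heavy))  # (200, ['c', 'd'])
```

In this model the block time and the miss count pull in opposite directions, which matches the observation above that the better sign rate came with slower blocks.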

End of vote on proposal 2

We noticed an increase in network load shortly before the exchange incentives were introduced.
This shows an increase in chain activity (more transactions = more traffic).

This also has the effect of loading the network bandwidth of our servers a little more, which can lead to increased latency for signature broadcasts.

We can also see that some validators are having more difficulty.
This is also true for the validators with the best sign rates.

4 Likes

@David @Govmos

Thanks for continuously monitoring this and providing some thoughts.

What are some potential actions/outcomes from this that the dYdX community/other validators can take away to improve block time & config variables? Any suggestions for the hardware improvements that you mentioned above?

3 Likes

Hello @BritAus,

We’ve delved into the block timeout issue and conducted some quick experiments. To gain insights, we developed an application that generates latency graphs for each block and validator.

CONTEXT:

Our tests highlighted four relevant graphs:

  1. ‘Country proposer’: the country of the validator proposing the block.
  2. ‘Miss by country’: the number of validators missing a block, grouped by geographical location, using data from the observatory site.
  3. ‘Signers’: the number of validators signing the block (all 60 have signed).
  4. ‘Validator latency’: the time between a validator proposing a block and sending a pre-commit message.
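
For readers who want to reproduce something like the ‘Validator latency’ graph, here is a minimal sketch of one reading of item 4: the delay between the block proposal and each validator's pre-commit. The timestamps and validator names are invented, the function name precommit_latencies_ms is hypothetical, and this is not the code of the application mentioned above.

```python
from datetime import datetime, timedelta


def precommit_latencies_ms(proposal_time: str, precommit_times: dict[str, str]) -> dict[str, float]:
    """Milliseconds between the proposal and each validator's pre-commit."""
    proposed = datetime.fromisoformat(proposal_time)
    one_ms = timedelta(milliseconds=1)
    return {
        name: (datetime.fromisoformat(stamp) - proposed) / one_ms
        for name, stamp in precommit_times.items()
    }


# Invented timestamps for a single block.
latencies = precommit_latencies_ms(
    "2023-11-20T12:00:00.000+00:00",
    {
        "val-eu": "2023-11-20T12:00:00.180+00:00",
        "val-asia": "2023-11-20T12:00:00.420+00:00",
    },
)
print(latencies)  # {'val-eu': 180.0, 'val-asia': 420.0}
```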

A second tool measures latencies between different network nodes from our RPC nodes located in France and Singapore. All the latencies below have already been doubled to simulate real TCP communication.

Singapore

America
        CA min=230ms avg=233ms max=252ms peers=8
        US min=172ms avg=234ms max=285ms peers=17
        CL min=329ms avg=329ms max=329ms peers=1
Europe
        FI min=182ms avg=185ms max=190ms peers=26
        DE min=154ms avg=174ms max=250ms peers=45
        PL min=164ms avg=196ms max=253ms peers=7
        FR min=150ms avg=175ms max=288ms peers=12
        CZ min=194ms avg=219ms max=244ms peers=2
        NL min=160ms avg=176ms max=209ms peers=4
        CH min=163ms avg=178ms max=190ms peers=3
        GB min=155ms avg=202ms max=249ms peers=2
        AT min=152ms avg=152ms max=152ms peers=2
        IE min=162ms avg=177ms max=188ms peers=6
Asia
        JP min=68ms avg=74ms max=85ms peers=38
        SG min=0ms avg=0ms max=3ms peers=23
        KR min=79ms avg=92ms max=100ms peers=3
        HK min=37ms avg=37ms max=37ms peers=2
        IN min=39ms avg=60ms max=67ms peers=4
        TW min=49ms avg=49ms max=49ms peers=1
Australia
        AU min=118ms avg=118ms max=118ms peers=1

France

America
        CA min=78ms avg=81ms max=101ms peers=8
        US min=8ms avg=91ms max=145ms peers=17
        CL min=240ms avg=240ms max=240ms peers=1
Europe
        FI min=28ms avg=28ms max=40ms peers=26
        DE min=8ms avg=11ms max=25ms peers=45
        PL min=27ms avg=27ms max=28ms peers=7
        FR min=0ms avg=2ms max=10ms peers=12
        CZ min=17ms avg=17ms max=17ms peers=2
        NL min=6ms avg=8ms max=12ms peers=4
        CH min=13ms avg=15ms max=21ms peers=3
        GB min=3ms avg=3ms max=3ms peers=2
        AT min=22ms avg=22ms max=22ms peers=2
        IE min=14ms avg=17ms max=22ms peers=6
Asia
        JP min=217ms avg=231ms max=258ms peers=38
        SG min=150ms avg=174ms max=266ms peers=23
        KR min=250ms avg=268ms max=285ms peers=3
        HK min=238ms avg=251ms max=264ms peers=2
        IN min=145ms avg=199ms max=218ms peers=4
        TW min=241ms avg=241ms max=241ms peers=1
Australia
        AU min=292ms avg=292ms max=292ms peers=1
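
To compare the two vantage points more easily, the rows above can be reduced to one peer-weighted average per continent. The sketch below is only an illustration: it parses lines in the format shown above and weights each country's avg value by its peers count, on the assumption that peers is the number of nodes measured in that country. The function name is hypothetical, and the sample is a short excerpt of the France table; the full tables can be pasted in the same way.

```python
import re
from collections import defaultdict

# One data row, e.g. "DE min=8ms avg=11ms max=25ms peers=45".
ROW = re.compile(r"^\s*([A-Z]{2})\s+min=(\d+)ms\s+avg=(\d+)ms\s+max=(\d+)ms\s+peers=(\d+)\s*$")


def weighted_avg_by_continent(table: str) -> dict[str, float]:
    """Peer-weighted average latency (ms) for each continent heading."""
    sums = defaultdict(lambda: [0, 0])  # continent -> [sum(avg * peers), sum(peers)]
    continent = None
    for line in table.splitlines():
        match = ROW.match(line)
        if match:
            if continent is None:
                continue  # data row before any heading: skip it
            avg, peers = int(match.group(3)), int(match.group(5))
            sums[continent][0] += avg * peers
            sums[continent][1] += peers
        elif line.strip():
            continent = line.strip()  # any other non-empty line is treated as a heading
    return {name: total / peers for name, (total, peers) in sums.items() if peers}


# Excerpt of the France measurements above.
FRANCE_EXCERPT = """
Europe
        DE min=8ms avg=11ms max=25ms peers=45
        FR min=0ms avg=2ms max=10ms peers=12
Asia
        JP min=217ms avg=231ms max=258ms peers=38
        SG min=150ms avg=174ms max=266ms peers=23
"""

for name, value in weighted_avg_by_continent(FRANCE_EXCERPT).items():
    print(f"{name}: {value:.1f} ms")
# Europe: 9.1 ms
# Asia: 209.5 ms
```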

__

ANALYSIS:

After several days of monitoring, we can see that, in the majority of cases, everything runs smoothly. However, regularly (1 to 2% of the time), a large number of validators fail to sign a block. These blocks contain no unusual transactions, and the number of transactions is constant (between 4 and 5). What we do see is that the majority of validators who fail to sign are located on a different continent from the proposer, particularly when the proposer is in Japan or the UK.

European validators proposing a block see many Asian validators not signing, and vice versa. The default timeout of 1s for pre-vote or pre-commit may be too short given the latency table from France or Singapore.

__

CONCLUSIONS:

To minimize block misses, we propose the following measures:

  • Extend Consensus Round 0 windows: Increase both pre-vote and pre-commit timeouts to 1.25s (adding 250ms to the existing value). Consequently, raise timeout_prevote_delta and timeout_precommit_delta to 625ms (adding 125ms due to the extension of the pre-vote and pre-commit timeouts). These adjustments aim to facilitate better communication among validators during periods of high latency without significantly altering the final block time. This change should be applied across the entire validator network, possibly during an upcoming update.

  • Set Hardware Recommendations: For optimal reaction time, we recommend high-performance hardware, such as 8 cores / 16 threads, 32GB RAM, and a fast disk like NVMe or RAID NVMe. Regarding our dYdX validator, we made some modifications to the config.toml file (as explained earlier in this thread).

  • Distribute the Validator Set: Enhance the distribution of block proposals across a wider range of locations. Currently, 90% of proposal blocks originate from Asia or the UK. Various solutions can be explored, and we encourage the community to initiate discussions. Possible considerations include targeted Foundation delegations and setting geographically based vote power caps.

  • Establish a Transit Network: Create a dedicated “IP transit” network with institutional partners (he.net, cogentco.com, akamai, …) connecting diverse geographic areas of consensus. Prioritize access to submarine fiber-optic cables, boosting connectivity between key consensus locations. This strategic move aims to minimize average latency effectively.
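
The timeout arithmetic in the first bullet can be checked with a short sketch. It assumes the usual Tendermint/CometBFT rule that the wait in round r is the base timeout plus r times the delta, and that the current values are 1s and 500ms, as implied by the bullet (1.25s minus 250ms, 625ms minus 125ms). The function name is illustrative.

```python
def round_timeout_ms(base_ms: int, delta_ms: int, round_number: int) -> int:
    """Timeout for a given consensus round: base + round * delta."""
    return base_ms + delta_ms * round_number


current = {"base_ms": 1000, "delta_ms": 500}   # values implied by the proposal
proposed = {"base_ms": 1250, "delta_ms": 625}  # values proposed above

for r in range(3):
    before = round_timeout_ms(round_number=r, **current)
    after = round_timeout_ms(round_number=r, **proposed)
    print(f"round {r}: {before} ms -> {after} ms (+{after - before} ms)")
# round 0: 1000 ms -> 1250 ms (+250 ms)
# round 1: 1500 ms -> 1875 ms (+375 ms)
# round 2: 2000 ms -> 2500 ms (+500 ms)
```

Under these assumptions the extra 250ms in round 0 is on the order of one averaged France-to-Asia measurement from the tables above (for example, JP avg=231ms as seen from France).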

To mitigate block misses, we recommend implementing these measures gradually over time. The most immediate and cost-effective step is to extend the consensus round 0 windows and set minimum hardware recommendations, while concurrently addressing the geographic centralization of block production. In parallel, efforts should be directed toward a longer-term solution, such as the proposed “transit network.”

Open-sourcing tools

We’re willing to open source our Python exporter/dashboard for Grafana and our Cosmos-scanner if needed. Additionally, we can provide a dump of our investigation database to the foundation and the community.

__

On behalf of the entire team, thank you all for your attention.
pro-delegators-sign

6 Likes

Great analysis! Opensourcing this would be highly appreciated :clap: