Investigate impact of long blocking abci handlers on p2p network #10075

Open
mhofman opened this issue Sep 12, 2024 · 0 comments

mhofman commented Sep 12, 2024

What is the Problem Being Solved?

Under load, the cosmos and cosmic-swingset layers may block for a significant amount of time when processing blocks, in particular in the EndBlock (where swingset actually executes messages), and Commit (when the DB changes are large) phases.

It has been documented that such loads negatively impact the performance of the chain. One of the earliest reports is #5507, but more recently there have been numerous block production slowdowns where the chain appears to stall.

The surprising part is that during these slowdowns, our follower is able to execute and commit relatively quickly compared to the consensus block time: it might take 30s locally, yet the chain won't produce a new block for over 2 minutes. In theory, once 67% of the voting power has processed the block, the chain should be able to make forward progress. However, such a discrepancy in timing indicates something else may be at play here.

Furthermore, I have observed that in our instagoric networks, the non-primary nodes are not capable of making progress beyond the primary validator. This is possibly due to these nodes not having direct p2p connectivity to the rest of the network, but instead being connected through the primary, which is the one with public network connectivity.

There is a series of issues linked from cometbft/cometbft#3245 that seem to imply that the p2p layer depends on the layers above not blocking.

Description of the Design

Set up a 3-node chain as follows:

  • 2 large / overprovisioned nodes A & C, which together have >= 67% of the voting power
  • 1 resource constrained node B
  • A & C do not have direct connectivity, and can only connect to B
  • Place some load on the chain (real or synthetic)
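The setup above can be sanity-checked before running it. The following is a minimal sketch, not part of any existing tooling: the voting powers and the peer map are hypothetical values chosen only to satisfy the constraints in the list, and the quorum rule assumed is CometBFT's requirement of strictly more than 2/3 of the total voting power.

```python
from fractions import Fraction

# Hypothetical voting powers; only the combined share of A and C matters here.
VOTING_POWER = {"A": 40, "B": 20, "C": 40}

# Hypothetical p2p links: A and C each peer only with B.
PEERS = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}


def has_quorum(nodes, power=VOTING_POWER):
    """Return True if `nodes` hold strictly more than 2/3 of the total power."""
    total = sum(power.values())
    held = sum(power[n] for n in nodes)
    return Fraction(held, total) > Fraction(2, 3)


def directly_connected(a, b, peers=PEERS):
    """Return True if either node lists the other as a peer."""
    return b in peers[a] or a in peers[b]


if __name__ == "__main__":
    print("A+C quorum:", has_quorum({"A", "C"}))
    print("A-C direct link:", directly_connected("A", "C"))
```

With these example values, A and C together can form a quorum on their own while every message between them still has to pass through B, which is the condition the experiment needs.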

We should verify that A & C clearly commit/vote on the block much faster than B. In that case, we want to observe whether the chain makes forward progress as soon as A & C complete their block, or if B is somehow in the critical path.
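To make the two outcomes concrete, here is a rough sketch of what each hypothesis predicts for the time at which the chain can move on. It is a deliberately simplified model: it ignores consensus timeouts, gossip latency and multiple rounds, and the per-node durations and voting powers are made-up illustration values. The "relay blocked" case assumes that B forwards nothing until its own ABCI handlers return, which is exactly the behavior under investigation, not an established fact.

```python
def quorum_time(durations, power):
    """Earliest time at which the nodes that finished hold > 2/3 of total power."""
    total = sum(power.values())
    held = 0
    for node, t in sorted(durations.items(), key=lambda kv: kv[1]):
        held += power[node]
        if 3 * held > 2 * total:  # strictly more than 2/3, in integer arithmetic
            return t
    return None  # no quorum reachable


def relay_blocked_time(durations, power, relay):
    """Same as quorum_time, but progress also waits for the relay node to finish."""
    t = quorum_time(durations, power)
    return None if t is None else max(t, durations[relay])


if __name__ == "__main__":
    # Hypothetical block processing times in seconds and voting powers.
    durations = {"A": 5.0, "B": 30.0, "C": 5.0}
    power = {"A": 40, "B": 20, "C": 40}
    print("B not in critical path:", quorum_time(durations, power))
    print("B blocks relaying:", relay_blocked_time(durations, power, relay="B"))
```

Comparing the observed block interval against these two estimates would indicate which hypothesis better matches the chain's behavior.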

Security Considerations

None, investigation only

Scaling Considerations

The chain should be able to make forward progress without being impeded by slower nodes as long as 67% of the voting power has processed the block, regardless of the p2p topology of the nodes forming that voting power.

Test Plan

See above

Upgrade Considerations

None
