Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large reorg test #2497

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Large reorg test #2497

wants to merge 8 commits into from

Conversation

Tristan-Wilson
Copy link
Member

@Tristan-Wilson Tristan-Wilson commented Jul 16, 2024

If the expectedCounter field for this test is increased to 10_000 then it will fail with the following error:

WARN [07-16|20:35:55.452] Large chain reorg detected              
 number=2 hash=7f6bea..02fe3e drop=10000 dropfrom=b187f2..c3897a a
dd=0 addfrom=000000..000000                                       
INFO [07-16|20:35:55.648] Chain head was updated                  
 number=2 hash=7f6bea..02fe3e root=080b98..3b3b72 elapsed=422.2294
66ms age=1m46s                                                    
INFO [07-16|20:35:55.648] Trying to resequence messages           
 number=10000                                                     
ERROR[07-16|20:35:55.649] failed to re-sequence old user message r
emoved by reorg err="insert lock taken"                           
ERROR[07-16|20:35:55.669] Failed to journal state snapshot        
 err="snapshot [0x080b98823f4e081af621c256e1448a6f6a636e394f729553
6fb0fa64873b3b72] missing"                   

We are tracking resolution of this issue separately and can increase the reorg size once it's fixed.

@cla-bot cla-bot bot added the s Automatically added by the CLA bot if the creator of a PR is registered as having signed the CLA. label Jul 16, 2024
Copy link
Contributor

@diegoximenes diegoximenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is due to this insertionMutex being held here.

During the reorg:

  1. TransactionStreamer, holding insertionMutex lock, calls ExecutionEngine, which then adds old messages to a channel, so they can be resequenced asynchronously by ExecutionEngine. After that, and before releasing the lock, TransactionStreamer does more computations.
  2. Asynchronously, ExecutionEngine reads from this channel and calls TransactionStreamer, which expects that insertionMutex is free in order to succeed.

This issue is more likely to occur with larger reorgs, since step 1 will take more time to execute, which make step 2 more likely to start while step 1 is still executing.

Not sure if this issue is already tracked in Linear, but we can bring it up with the team.

system_tests/large_reorg_test.go Outdated Show resolved Hide resolved
system_tests/large_reorg_test.go Show resolved Hide resolved
Tristan-Wilson and others added 2 commits July 18, 2024 10:46
We are temporarily lowering the number of iterations for this test until
we can fix the underlying issues we discovered with it.
@Tristan-Wilson Tristan-Wilson marked this pull request as ready for review July 18, 2024 08:49
@Tristan-Wilson
Copy link
Member Author

Fixes NIT-2686

Copy link
Contributor

@diegoximenes diegoximenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is failing with path scheme :(

There are two issues that I am aware with it:

  • path scheme will only be able to handle reorgs that want to revert to a block before than HEAD-128 if ancients storage is enabled. Our tests don't enable it today, here. We should be using OpenDatabaseWithFreezer for this test instead. Not sure if there is an issue with using freezer in all Nitro's system tests 🤔, if not it will be simpler to adjust this :)
  • NewNodeBuilder(ctx).DefaultConfig currently sets gethexec.CachingConfig.StateHistory to 0. We should set to a different value for this test. As an example, if StateHistory is set to 2 then it will be possible to revert to HEAD-130 but not to HEAD-131.

But even if those issues are fixed right now, not sure if we should merge this PR without first solving the issue with re-sequencing old messages during a reorg.
Even TestReorgResequencing, that has a smaller reorg than the one that is being proposed in this PR, is flaky, sometimes failing with the same error that you added to this PR's description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
s Automatically added by the CLA bot if the creator of a PR is registered as having signed the CLA.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants