
Remove "dead" leaves #432

Closed
bkchr opened this issue May 10, 2021 · 17 comments · Fixed by #1559
Labels: J0-enhancement (An additional feature request.)

Comments

@bkchr (Member) commented May 10, 2021

When proposing new PoV blocks with a collator, it can happen that we get "stuck" because the relay chain, for whatever reason, doesn't include our blocks. After X blocks the collator will start failing to import new blocks, because we have imported too many leaves.

Using paritytech/substrate#8533 we will be able to delete "dead" leaves.

We will require some criteria to decide whether a leaf should be seen as "dead":

  1. A block that was not seconded can be removed directly.
  2. We have produced block X for relay chain block Y. Once we start producing for Y + N, we should be able to remove all blocks we have produced/imported for Y (a sketch follows this list).
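For illustration, here is a minimal sketch of how criterion 2 might be tracked, assuming a simple map from relay-parent number to the parachain blocks built on it; once authoring moves to a newer relay parent, everything recorded for older parents becomes a candidate for the leaf-removal API from paritytech/substrate#8533. The names (`ProducedBlocks`, `note_produced`, `take_dead`) are hypothetical, not part of the Cumulus API.

```rust
use std::collections::HashMap;

/// Stand-in for a real parachain block hash type.
type Hash = [u8; 32];

/// Hypothetical bookkeeping: which parachain blocks were built on which
/// relay-chain block number.
#[derive(Default)]
struct ProducedBlocks {
    by_relay_parent: HashMap<u32, Vec<Hash>>,
}

impl ProducedBlocks {
    /// Record a parachain block produced/imported for relay-chain block
    /// `relay_number`.
    fn note_produced(&mut self, relay_number: u32, para_hash: Hash) {
        self.by_relay_parent
            .entry(relay_number)
            .or_default()
            .push(para_hash);
    }

    /// Once we start authoring on relay-chain block `now`, everything built
    /// on an older relay parent is "dead" (criterion 2); the returned hashes
    /// would then be handed to the backend's leaf-removal API.
    fn take_dead(&mut self, now: u32) -> Vec<Hash> {
        let old: Vec<u32> = self
            .by_relay_parent
            .keys()
            .copied()
            .filter(|n| *n < now)
            .collect();
        old.into_iter()
            .flat_map(|n| self.by_relay_parent.remove(&n).unwrap_or_default())
            .collect()
    }
}
```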
@xlc (Contributor) commented Dec 13, 2021

I hope this can be done before Beyond the End of the Century 😅

@bkchr (Member, Author) commented Dec 13, 2021

Maybe :P (I propose that you google this term :P)

@xlc (Contributor) commented Mar 23, 2022

One of our testnet RPC nodes crashed again due to this, and it looks like the only solution is to purge the db and resync.

@davxy (Member) commented Jun 27, 2022

I think it could be a good idea to generalize the leaf-pruning feature to Substrate and not limit it to Cumulus.

The idea is to allow the user to programmatically set the lifetime of chain leaves, in order to optionally prune the leaves that are assumed to be "dead".

This pruning strategy can be enabled where it is most appropriate, for example to solve the currently observed behavior in Cumulus where new blocks are never accepted because we've saturated the maximum leaves-per-height limit (currently 32).

For the pruning criteria I see three options:

  1. allow the user to specify the pruning criterion (e.g. via some closure): more flexibility but less simplicity
  2. we decide what the pruning criterion is (e.g. leaf lifetime): more simplicity but less flexibility
  3. allow the user to specify the pruning criterion but also provide a default implementation (e.g. one that uses leaf lifetime); see the sketch after this list
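As a rough sketch of option 3, under the assumption that the criterion boils down to a per-leaf decision: a pluggable trait plus a lifetime-based default. The names (`PruneCriterion`, `LeafLifetime`) are invented for illustration and do not exist in Substrate.

```rust
use std::time::{Duration, Instant};

/// Hypothetical user-pluggable pruning criterion (option 3).
trait PruneCriterion {
    /// Should the leaf that was inserted at `inserted_at` be pruned?
    fn should_prune(&self, inserted_at: Instant) -> bool;
}

/// Default implementation: a leaf is considered "dead" once it has lived
/// longer than `lifetime` without gaining a child.
struct LeafLifetime {
    lifetime: Duration,
}

impl PruneCriterion for LeafLifetime {
    fn should_prune(&self, inserted_at: Instant) -> bool {
        inserted_at.elapsed() > self.lifetime
    }
}
```

Option 1's closure flavor falls out of the same shape, since any `Fn(Instant) -> bool` could implement the trait via a blanket impl.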

Pruning criterion using lifetime.

When a new leaf is added to the chain, we start some kind of timer in order to limit the leaf's lifespan.

The timer for a leaf 'L' is stopped if a child 'C' is added to 'L'; in that case a fresh timer is started for the new leaf 'C'.

If the lifetime of a leaf 'L' is over, then we remove 'L' and its ancestors, walking down towards one of the following nodes (assuming 'F' is the last finalized block); a sketch of this walk follows the list:

  1. If there are forks between 'L' and 'F', then we stop pruning nodes as soon as we reach the first fork point.
    Example:

    F - B0 - B1
         \ - B2 - L

    In this case, if the lifetime of 'L' is over, then we remove 'L' and 'B2'.
    We stop at 'B0' since there is another fork depending on it.

  2. If there are no forks between 'L' and 'F', then we stop pruning at 'F' (exclusive: 'F' itself is kept).
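A minimal sketch of the pruning walk these two rules describe, assuming a hypothetical in-memory view of the unfinalized tree with parent links and per-block child counts (none of these types match actual Substrate code):

```rust
use std::collections::HashMap;

/// Stand-in for a real block hash type.
type Hash = [u8; 32];

/// Hypothetical view of the unfinalized tree.
struct Tree {
    /// Child -> parent links.
    parent: HashMap<Hash, Hash>,
    /// Number of children per block, used to detect fork points.
    child_count: HashMap<Hash, usize>,
    /// Last finalized block 'F'.
    finalized: Hash,
}

/// Blocks to delete when the lifetime of `leaf` expires: walk towards 'F'
/// and stop (exclusive) at the first fork point or at 'F' itself.
fn blocks_to_prune(tree: &Tree, leaf: Hash) -> Vec<Hash> {
    let mut to_remove = vec![leaf];
    let mut current = leaf;
    while let Some(&parent) = tree.parent.get(&current) {
        // A parent with more than one child still serves another branch,
        // and the finalized block is never touched.
        if parent == tree.finalized
            || tree.child_count.get(&parent).copied().unwrap_or(0) > 1
        {
            break;
        }
        to_remove.push(parent);
        current = parent;
    }
    to_remove
}
```

On the example above, `blocks_to_prune` starting from 'L' returns ['L', 'B2']: the walk stops at 'B0' because its child count is 2, and it would stop at 'F' in the fork-free case.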

@skunert (Contributor) commented Jul 20, 2022

@davxy For the scenario described at the start of the issue, where we have reached the maximum allowed leaves at a given height, the timer-based solution you propose would be the same as just constantly deleting the oldest stale leaf, right? (because those will time out first)

@bkchr (Member, Author) commented Jul 20, 2022

I would first concentrate on Cumulus, aka parachains. For normal chains it cannot really happen that easily that you build too many blocks at the same level (yes, it is still possible, but much more unlikely than for parachains). We once saw this error on a Polkadot test net because the chain selection rule had gone apeshit, but that was a bug. This issue is also not really that much about leaves, but more about blocks at the same height; I just used the naming as it is used in Substrate.

@mn13 commented Nov 16, 2022

Equilibrium caught the same `TooManySiblingBlocks` error in moonbase-alphanet.
Restarting and reverting the nodes had no effect. We are now resyncing the nodes to check whether this resolves the stuck blocks.

@NunoAlexandre commented

@mn13 could you share how you fixed it? Facing the same issue also on Moonbase Alpha atm.

@bkchr (Member, Author) commented Jan 20, 2023

> @mn13 could you share how you fixed it? Facing the same issue also on Moonbase Alpha atm.

Resync, or you can also try to call revert.

@mn13 commented Jan 20, 2023

> @mn13 could you share how you fixed it? Facing the same issue also on Moonbase Alpha atm.

Actually we've not fixed that. What I've done is build the node with the dependencies from #1559, purge, and start a new chain. We've not tried a runtime upgrade yet.

@zoveress commented

We have applied v0.9.37, which includes this fix, and the problem has happened again. "revert" is not working at all, and I have a separate issue open for that. Has anyone managed to fix this and keep the data without purging the chain?

> > @mn13 could you share how you fixed it? Facing the same issue also on Moonbase Alpha atm.
>
> Actually we've not fixed that. What I've done is build the node with the dependencies from #1559, purge, and start a new chain. We've not tried a runtime upgrade yet.

@davxy (Member) commented Jan 31, 2023

@zoveress

  1. When do you receive the too-many-sibling-blocks error? After startup, and immediately while importing the "next" block?
  2. Can you share the logs somewhere, with `--log parachain=debug` enabled? From the beginning until you receive the error.
  3. The fix has been applied to the Cumulus codebase, not Polkadot. Are you using the latest master version to build your collator?

@zoveress commented

@davxy

Please see the answers below:

1: We had this issue twice in the past week. We updated one of the collators to v0.9.37 on Tuesday, which solved the problem temporarily, and then we updated the rest of the collators to v0.9.37 too.
It was all working fine until last Friday, when the error happened again, and I have been unable to fix it ever since. Basically, all 4 collators keep throwing this "State Database error: Too many sibling blocks inserted" error regardless.

2: I have the log for you; how would you like me to send it?

3: I believe we do use the latest, please see it here.

@zoveress commented

There has been a slight change. I deleted the DB for one of the nodes and waited for it to resync with the other nodes; then I found the following error message in the logs:

`Block import error: Potential long-range attack: block not in finalized chain.`

After that, this node started producing blocks and lost connection to all 3 other collators. So now I have deleted the DBs of the other 3 collators, and they are syncing from the working node. Hopefully this will work now.

@davxy (Member) commented Jan 31, 2023

> 2: I have the log for you; how would you like me to send it?

Wherever you like, I have no preference.

@zoveress commented

@davxy I have mailed a download link across.

@davxy (Member) commented Jan 31, 2023

Looking at your logs, I can't see any message from the monitor that should keep the number of blocks per level within the limit (it logs debug messages with target `parachain`).

In particular, when you start the node you should see something like:
`Restoring chain level monitor from last finalized block ...`
and
`Restored chain level monitor up to height ...`

Then, instead of the overflow error, you should see something like:
`Detected leaves overflow at height {number}, removing {remove_count} obsolete blocks`

I can't spot any of these messages in your logs.

Can you try to start the node and check that at least the `Restoring chain level monitor from last finalized block` message is printed? It should appear very close to the start (remember to enable `--log parachain=debug`).
