
Re-Genesis #7458

Open
sorpaas opened this issue Oct 29, 2020 · 23 comments

Comments

@sorpaas
Member

sorpaas commented Oct 29, 2020

This documents some notes and designs for a Re-Genesis process. Re-Genesis is basically the process of exporting the current chain state and creating a new chain that builds on it.

Rationale

The discussion started as an alternative to Swappable Consensus (#1304). Many of the consensus engines we have right now (like BABE) make assumptions about the chain state, block numbers, and other things, so swapping consensus directly would require heavy modification of the consensus engines themselves. In addition, custom migration code would have to be written individually for each possible swap.

Re-Genesis, on the contrary, is much simpler. If implemented with care, it can accomplish the same thing as Swappable Consensus. We do not need to modify existing consensus engines to remove their assumptions; we just need to make switching and restarting a runtime plus consensus engine combination fast.

Re-Genesis can also be used for other purposes that Swappable Consensus is not able to cover:

  • Replace faulty runtime upgrades.
  • As a hard fork process.
  • As a way to "squash" the chain and reduce syncing time.
  • Carry out stop-the-world migrations more smoothly and reliably.

Design

Choosing the Re-Genesis block

A Re-Genesis process divides a blockchain into eras. If a blockchain is in era N prior to Re-Genesis, it is in era N + 1 post Re-Genesis. In each era, block numbers start from 0, so we can refer to blocks as "era N block M".
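
For illustration, a minimal Rust sketch of one way to write down such a reference; the type and field names are hypothetical and not part of Substrate:

```rust
use std::fmt;

/// Hypothetical reference to a block across eras: "era N block M".
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct EraBlockRef {
    era: u32,
    number: u32,
}

impl fmt::Display for EraBlockRef {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "era {} block {}", self.era, self.number)
    }
}

fn main() {
    // For example, Kulupu's switch point would read "era 0 block 320000".
    let switch_point = EraBlockRef { era: 0, number: 320_000 };
    println!("{}", switch_point);
}
```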

The first question is how we choose the Re-Genesis block.

We could always choose the head block at a particular height, but that would not be reliable: there can be multiple such blocks at the same time, and if the state rebuilding process is heavy, allowing the chosen block to be switched around is an attack vector.

Instead, we define the Re-Genesis block as a finalized block at a particular height (for chains with finalization), or a block at a particular height with siblings of depth at least D (for chains with probabilistic finalization). This means that when switching from era N to era N + 1, beyond the Re-Genesis block the old era N chain will continue to build blocks and states, but those blocks and states will not be accounted for in the new era; they are only there to keep the probability of having multiple Re-Genesis blocks low.
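
A hedged sketch of this selection rule follows; the types and names are hypothetical (not Substrate APIs), and the probabilistic case is read as a simple confirmation-depth check on the candidate block:

```rust
/// How finality is established on the chain undergoing Re-Genesis (hypothetical).
enum Finality {
    /// Deterministic finality (e.g. GRANDPA): the latest finalized block number.
    Finalized { latest_finalized: u64 },
    /// Probabilistic finality: the best block number and the required depth D.
    Probabilistic { best_number: u64, required_depth: u64 },
}

/// Returns true once the block at `regenesis_height` can be treated as the
/// Re-Genesis block for the next era.
fn regenesis_block_fixed(regenesis_height: u64, finality: &Finality) -> bool {
    match finality {
        // Chains with finalization: the candidate height must be finalized.
        Finality::Finalized { latest_finalized } => *latest_finalized >= regenesis_height,
        // Chains with probabilistic finalization: the candidate must be buried
        // under at least `required_depth` blocks.
        Finality::Probabilistic { best_number, required_depth } => {
            best_number.saturating_sub(regenesis_height) >= *required_depth
        }
    }
}

fn main() {
    let grandpa = Finality::Finalized { latest_finalized: 320_123 };
    assert!(regenesis_block_fixed(320_000, &grandpa));

    let pow = Finality::Probabilistic { best_number: 320_010, required_depth: 100 };
    assert!(!regenesis_block_fixed(320_000, &pow));
}
```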

Stopping the old era chain

Having the old era N chain continue to build blocks and states is definitely not ideal, so we can add runtime support for stopping the old era chain. The chain stopping process consists of two steps (sketched in code below):

  • First, the chain state is frozen. No balances can be transferred, no new proposals can be submitted, the validator set is frozen, and no rewards will be issued. The existing validator set continues to build blocks.
  • Then, once a block at the Re-Genesis height is finalized, the runtime issues a setCode command with empty code, permanently shutting down the old chain.
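
Below is a minimal standalone sketch of this two-step shutdown as a state machine. It is not FRAME pallet code; the freeze height, phase names, and finality tracking are illustrative stand-ins for the actual runtime mechanics:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ShutdownPhase {
    /// Normal operation in the old era.
    Running,
    /// State frozen: no transfers, proposals, validator changes, or rewards;
    /// the existing validator set keeps authoring blocks.
    Frozen,
    /// `setCode` has been issued with empty code; the old chain is stopped for good.
    Stopped,
}

/// Hypothetical view of the old era chain during shutdown.
struct OldEraChain {
    phase: ShutdownPhase,
    /// Height at which the state freeze takes effect.
    freeze_height: u64,
    /// Height of the Re-Genesis block.
    regenesis_height: u64,
}

impl OldEraChain {
    /// Called for every imported block of the old era chain.
    fn on_block(&mut self, number: u64, latest_finalized: u64) {
        match self.phase {
            // Step 1: freeze the chain state.
            ShutdownPhase::Running if number >= self.freeze_height => {
                self.phase = ShutdownPhase::Frozen;
            }
            // Step 2: once a block at the Re-Genesis height is finalized,
            // issue `setCode` with empty code to shut the old chain down.
            ShutdownPhase::Frozen if latest_finalized >= self.regenesis_height => {
                self.phase = ShutdownPhase::Stopped;
            }
            _ => {}
        }
    }
}

fn main() {
    let mut chain = OldEraChain {
        phase: ShutdownPhase::Running,
        freeze_height: 319_000,
        regenesis_height: 320_000,
    };
    chain.on_block(319_000, 318_950);
    assert_eq!(chain.phase, ShutdownPhase::Frozen);
    chain.on_block(320_050, 320_001);
    assert_eq!(chain.phase, ShutdownPhase::Stopped);
}
```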

Starting the new era chain

Substrate users define their own migration script. The migration will obviously define the initial parameters of the new consensus engine. For the rest of the state, Substrate users can cherry-pick what they want and discard the rest -- either taking the full state over, or taking just the balances and other essential items.
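
As a rough illustration of the cherry-picking step, the sketch below filters an exported key/value state down to a set of kept prefixes and splices in fresh consensus parameters. The prefixes and entries shown are made up, not real Substrate storage keys:

```rust
use std::collections::BTreeMap;

type StorageKey = Vec<u8>;
type StorageValue = Vec<u8>;
type State = BTreeMap<StorageKey, StorageValue>;

/// Build the era N + 1 genesis state from the era N state at the Re-Genesis block.
fn build_new_genesis(
    old_state: &State,
    keep_prefixes: &[&[u8]],
    new_consensus_entries: State,
) -> State {
    // Keep only the cherry-picked storage prefixes from the old state.
    let mut genesis: State = old_state
        .iter()
        .filter(|(key, _)| keep_prefixes.iter().any(|p| key.starts_with(p)))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    // The migration always defines the initial parameters of the new
    // consensus engine, overriding anything carried over.
    genesis.extend(new_consensus_entries);
    genesis
}

fn main() {
    let mut old_state = State::new();
    old_state.insert(b"balances:alice".to_vec(), b"100".to_vec());
    old_state.insert(b"staking:era".to_vec(), b"42".to_vec());

    let mut consensus = State::new();
    consensus.insert(b"consensus:authorities".to_vec(), b"[...]".to_vec());

    let genesis = build_new_genesis(&old_state, &[b"balances:".as_slice()], consensus);
    assert!(genesis.contains_key(b"balances:alice".as_slice()));
    assert!(!genesis.contains_key(b"staking:era".as_slice()));
}
```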

After migration, this new state is then set as the genesis block state for era N + 1, and a new chain continues to function beyond this point.

We note that the difference between a Re-Genesis process and a completely new blockchain is that the genesis state for a Re-Genesis process is not known until the Re-Genesis block is identified.

Discussions

Light client

Light client implementations differ by consensus engine. As a result, whether using Swappable Consensus or Re-Genesis, they may not work across the era boundary. Substrate users may have to ask node users to manually switch light clients upon Re-Genesis.

Missed time

We note that the Re-Genesis process involves a stop-the-world migration. Even if that is fast, time has to be spent on the old era chain finalizing the Re-Genesis block in order to identify it. This results in a period during which no actual blocks with state are being built for the blockchain.

UX issues

Re-Genesis introduces a new concept called "era", and compared with Swappable Consensus, the new era's blocks start their numbers from 0 again. This is a UX issue that we should take care of.

Prior usages

The only real-world usage so far (relying on an ad-hoc Re-Genesis process) was Kulupu's era switch at era 0 block 320,000. The process was almost as described above, but everything was done manually (with a new node released after the Re-Genesis block).

Edgeware also considered Re-Genesis for its first runtime upgrade, but decided against it due to UX concerns.

@sorpaas
Member Author

sorpaas commented Oct 29, 2020

cc @andresilva

@Swader
Contributor

Swader commented Oct 29, 2020

What happens to past era extrinsics / events for the purpose of auditing (tax etc)? Can people still rebuild the previous era with archive nodes?

@sorpaas
Member Author

sorpaas commented Oct 29, 2020

@Swader They should always be able to do that.

Right now I'm thinking about each era using a different networking identifier and storage location for simplicity (that is, if we indeed decide to go in the Re-Genesis direction), but the UX can definitely be improved.

@Swader
Contributor

Swader commented Oct 29, 2020

The thing I'm wondering is, right now a full node can become an archive node without needing any communication from other nodes, just based on its extrinsics which it keeps no matter the pruning mode. A full node of era 1 will not be able to do that, presumably. Would this potentially cause an availability rift if no one were to be running a full node of era 0 any more?

@sorpaas
Member Author

sorpaas commented Oct 29, 2020

@Swader Yeah indeed. But the chance that not a single person runs an era 0 full node is quite slim, IMO.

@Swader
Contributor

Swader commented Oct 29, 2020

Agreed, just putting it out there since there is a chance.

I think this functionality is interesting, and I'd like to see it in Substrate. I don't think Polkadot would use this (because of the slight chance of missing past era availability), but I could definitely see Kusama undergo a new era launch every 5 million blocks or so 👍

@andresilva
Contributor

@Swader That is the same problem we will have when we implement warp syncing since nodes will stop downloading the history from before the snapshot point (or at least that was the case with our implementation in parity-ethereum). Normal node operation would still be to sync through all eras and import everything (potentially to different database locations on-disk but that's an implementation detail), so all the data would have the same availability guarantees it has today. The main driving point of this feature is as a potential implementation for swappable consensus, which we'd want to use in the future in Polkadot (e.g. for migrating from BABE to SASSAFRAS).

Light client

I think the light client would just have to start syncing from the latest era. I think this is OK since on PoS chains the light clients already cannot be trusted from genesis due to weak subjectivity.

Right now I'm thinking about each era using a different networking identifier

This might make it harder to allow serving clients on all eras, but I didn't check what changes would be needed on the networking side.

UX issues

I think ideally we'd want to avoid resetting the block numbers and just keep incrementing them across eras. From the client side this might be doable just by maintaining an offset. For the runtime, though, I'm not sure that is enough, since we might have state entries referencing block numbers from previous eras. I think we might need to remove the assumption that the genesis block is #0 and instead pick up the block number from the last era.
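
For the client-side offset idea, a tiny sketch (purely illustrative, not an existing Substrate API) could look like this:

```rust
/// Cumulative block counts of finished eras, kept by the client (hypothetical).
struct EraOffsets {
    /// `lengths[i]` is the number of blocks produced in era `i`.
    lengths: Vec<u64>,
}

impl EraOffsets {
    /// Translate "era `era`, local block `number`" into a single,
    /// monotonically increasing number for display purposes.
    fn global_number(&self, era: usize, number: u64) -> u64 {
        let offset: u64 = self.lengths[..era].iter().sum();
        offset + number
    }
}

fn main() {
    // Era 0 ended after 320000 blocks; era 1 is ongoing.
    let offsets = EraOffsets { lengths: vec![320_000] };
    assert_eq!(offsets.global_number(0, 1_234), 1_234);
    assert_eq!(offsets.global_number(1, 5), 320_005);
}
```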

@tomaka
Contributor

tomaka commented Oct 29, 2020

Ultimately the networking should be capable of "connecting" to multiple different chains (#3310), in other words of supporting multiple different chains/eras at the same time, provided each chain/era has a different protocolId.

If, however, we don't reset the block number to 0, no change is required on the networking side.

@apopiak
Contributor

apopiak commented Oct 30, 2020

Is the name "era" intentionally similar to staking eras? If not I would suggest different naming to avoid confusion.

@Swader
Contributor

Swader commented Oct 30, 2020

An eon is a unit that's bigger than an era and is composed of eras, so that sounds appropriate.

@jak-pan
Contributor

jak-pan commented Mar 29, 2021

We're in the same place now with HydraDX.

We've selected the default epoch length of 10 minutes from the Substrate repo, not realizing that it can have implications for network stability (i.e. no blocks for 10 minutes means a stalled network) and also for validator UX - getting kicked out of the set and losing nominations if offline for 10 minutes.

As the epoch length cannot be changed after the chain has started, we're now either forced to restart from #0 with the old state, or to risk the stalling for now, prepare for this migration, and restart after the fact.

The UX, however, is not ideal: this looks like a simple property change in the first place, but we're forced to upgrade all 200+ waiting validators (and even more nodes), make sure they purge their state, and either wait for them to re-indicate validation/nomination by purging the validator set state from storage, or risk starting the chain and trusting that they have done everything right on time.

Also, going back to 0 doesn't look good from a UX standpoint, since we're indeed continuing the chain.

@jordy25519
Contributor

FWIW, CENNZnet is in the same boat. We set up a system to move session keys to hot standby nodes in case a validator is detected restarting or stalled, etc.
This has kept the network running smoothly for the most part. The occasional slow block does tend to cause chaos for the network - emergency elections due to offline offences, etc.

With changes like this it seems possible to increase the epoch duration; maybe some one-off hack like setting a specific epoch will be required: #8072

@jak-pan
Contributor

jak-pan commented Mar 29, 2021

FWIW, CENNZnet is in the same boat. We set up a system to move session keys to hot standby nodes in case a validator is detected restarting or stalled, etc.
This has kept the network running smoothly for the most part. The occasional slow block does tend to cause chaos for the network - emergency elections due to offline offences, etc.

With changes like this it seems possible to increase the epoch duration; maybe some one-off hack like setting a specific epoch will be required: #8072

It's actually very good to hear that it's working for you and that there is a light at the end of the tunnel. I guess we could try to live with it, at least during the first part of the incentivized testnet. We've already postponed slashing during this phase to 27 days and plan to revert slashes automatically, so I guess we'll have larger validator turnover since they'll need to get re-elected often, but that's actually not bad for a testing phase.

@jak-pan
Contributor

jak-pan commented Apr 1, 2021

So we've come quite far with our re-genesis (galacticcouncil/hydration-node#191), but are now stuck at a chicken-and-egg problem here: polkadot-js/extension#687 (comment)

TL;DR: We either need to stop the chain until the extension is updated and then restart (which could still have problems), or deal with two separate instances of one chain, which is kind of a PITA since we already have quite a lot of users.

Does anybody have a better idea of how to tackle this problem?

@stale

stale bot commented Jul 7, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 7, 2021
@tomaka
Contributor

tomaka commented Jul 8, 2021

Issue still relevant and important.

@stale stale bot removed the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 8, 2021
@AurevoirXavier
Contributor

AurevoirXavier commented Apr 19, 2022

Looking forward to this.
And I think this is also pretty useful for the long-term testnet.

@rithythul

rithythul commented Jun 8, 2022

Has anyone tried this concept out yet?

We're facing an issue where the Substrate era does not end.

@AurevoirXavier
Contributor

Has anyone tried this concept out yet?

We're facing an issue where the Substrate era does not end.

You could take a look at https://github.com/darwinia-network/fork-off-substrate

@ggwpez
Member

ggwpez commented Jun 8, 2022

I tried out fork-off-substrate, but it's hitting JS limits: maxsam4/fork-off-substrate#87
Not sure if the Darwinia fork fixed that?

@rithythul

rithythul commented Jun 9, 2022

Thanks @AurevoirXavier, we tried it once but it didn't work.

Thanks so much for the help anyway.

@AurevoirXavier
Contributor

Thanks @AurevoirXavier, we tried it once but it didn't work.

Thanks so much for the help anyway.

Weird, our state is more than 1 GB.

@rithythul

So it works for you?
Did the era-not-ending issue happen to Darwinia before too?
If so, do you know the root causes?

In our case, we suspect it could be that the hardware specs are a bit low, and we only have around 20 nodes + 13 validators. But we're still figuring out the root causes to prevent the next issue.
