
State sync snapshotting #5689

Closed · erikgrinaker opened this issue Feb 21, 2020 · 7 comments · Fixed by #7166

@erikgrinaker (Contributor)

Summary

When implementing state sync, the Cosmos SDK must schedule, take, and prune periodic state snapshots as outlined in ADR-053.

This depends on cosmos/iavl#210.
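
To make the intended lifecycle concrete, here is a rough sketch of the schedule/take/prune cycle - hypothetical names throughout, not the final SDK API:

```go
// Sketch of the schedule/take/prune cycle described above. All names here
// (SnapshotManager, takeSnapshot, pruneSnapshots) are illustrative.
package snapshots

type SnapshotManager struct {
	interval   uint64 // take a snapshot every `interval` heights
	keepRecent uint32 // number of recent snapshots to retain
}

// OnCommit is invoked after each block commit. Snapshots run in the
// background so they do not block consensus.
func (m *SnapshotManager) OnCommit(height uint64) {
	if m.interval == 0 || height%m.interval != 0 {
		return
	}
	go func() {
		if err := m.takeSnapshot(height); err != nil {
			return // log and move on; a failed snapshot must not halt the node
		}
		m.pruneSnapshots(m.keepRecent)
	}()
}

func (m *SnapshotManager) takeSnapshot(height uint64) error {
	// Export application state at `height` into chunks (see ADR-053).
	return nil
}

func (m *SnapshotManager) pruneSnapshots(keep uint32) {
	// Delete all snapshots except the `keep` most recent ones.
}
```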


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alexanderbez
Contributor

alexanderbez commented Feb 21, 2020

How will this tie into the existing pruning business logic in the IAVL store? I imagine the IAVL APIs will change to support this as opposed to being set once during bootstrapping?

@erikgrinaker
Contributor Author

Yeah, so there's an unfortunate confusion of terms here. State sync snapshots and pruning don't really have anything to do with IAVL snapshots and pruning - they are different concepts. We've discussed this, and felt the contexts are distinct enough that reusing the terms wouldn't cause confusion, but I'd be happy to discuss it further.

The main interaction we'll need to be careful with is to prevent IAVL pruning from removing a version that is currently being snapshotted by the state sync logic.
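
One way to enforce that - purely a sketch, with hypothetical names - is to pin versions while they are being exported and have the pruning path consult the pin count:

```go
// Illustrative sketch: guard IAVL versions against pruning while a state
// sync snapshot is exporting them. All names here are hypothetical.
package snapshots

import "sync"

type versionGuard struct {
	mu     sync.Mutex
	pinned map[int64]int // version -> number of active exports
}

func (g *versionGuard) pin(version int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.pinned == nil {
		g.pinned = make(map[int64]int)
	}
	g.pinned[version]++
}

func (g *versionGuard) unpin(version int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.pinned[version]--; g.pinned[version] <= 0 {
		delete(g.pinned, version)
	}
}

// canPrune would be consulted by the pruning logic before deleting a version.
func (g *versionGuard) canPrune(version int64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.pinned[version] == 0
}
```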

@alexanderbez
Contributor

I see, thanks for the clarification. I'll need to brush up on state sync to fully understand the architecture and semantics WRT ABCI and the state machine.

@erikgrinaker
Contributor Author

Sure, I'd suggest reading over ADR-053, and I'll be happy to go over it with you as well.

@alexanderbez
Contributor

Reviewed the ADR. I have a more complete understanding of the protocol. I'm sure we'll work closely together in completing this.

Some initial questions:

> Deterministic: snapshots must be deterministic, and identical across all nodes - typically by taking a snapshot at given height intervals.

Why must snapshotting be identical across all nodes? Will this be a consensus parameter? The section Snapshot Scheduling seems to contradict this.

> Consistent: snapshots must be consistent, i.e. not affected by concurrent writes - typically by using a data store that supports versioning and/or snapshot isolation.

Curious how BadgerDB/BoltDB/LevelDB behave here. I imagine a snapshot at a committed height will acquire a read lock, while state transitions for the current, yet-to-be-committed block would acquire a write lock. Would this cause contention? Note that this also relates to the section on Asynchronous.

> Garbage collected: snapshots must be garbage collected periodically.

Is this a consensus parameter as well?

> The node switches to fast sync to catch up blocks that were committed while restoring the snapshot.

After snapshot restore is complete, correct?

alexanderbez self-assigned this Feb 24, 2020
@erikgrinaker
Contributor Author

> I'm sure we'll work closely together in completing this.

Yes, that would be great - I was thinking I could get started on this myself, since I'm already familiar with the problem domain, but would be happy to collaborate.

> > Deterministic: snapshots must be deterministic, and identical across all nodes - typically by taking a snapshot at given height intervals.
>
> Why must snapshotting be identical across all nodes? Will this be a consensus parameter? The section Snapshot Scheduling seems to contradict this.

A given snapshot (e.g. the snapshot at height 10000 in format 1) must be identical across nodes, since we may be fetching chunks from multiple peers and these must "fit together".

As for scheduling, nodes are in principle free to take snapshots at whatever times they want, but it may make sense for this to be a consensus parameter such that snapshots are available across as many peers as possible. One approach might be to make the snapshot interval a consensus parameter, but allow node operators to disable snapshots completely.
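
For illustration, the scheduling check itself could be as simple as this sketch (assuming a hypothetical snapshot-interval consensus parameter):

```go
// Illustrative only: if the interval is a consensus parameter, all nodes
// that have snapshots enabled take them at the same heights, so chunks
// fetched from different peers fit together.
package snapshots

func shouldSnapshot(height, snapshotInterval uint64) bool {
	// snapshotInterval == 0 means the operator has disabled snapshots.
	return snapshotInterval > 0 && height%snapshotInterval == 0
}
```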

> > Consistent: snapshots must be consistent, i.e. not affected by concurrent writes - typically by using a data store that supports versioning and/or snapshot isolation.
>
> Curious how BadgerDB/BoltDB/LevelDB behave here. I imagine a snapshot at a committed height will acquire a read lock, while state transitions for the current, yet-to-be-committed block would acquire a write lock. Would this cause contention? Note that this also relates to the section on Asynchronous.

IAVL already handles this, since it's versioned - we just grab an ImmutableTree and dump it. For non-IAVL apps, the easiest solution is to use a database that supports MVCC transactions - from a quick glance, BadgerDB and BoltDB appear to support MVCC, but LevelDB does not.
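
Roughly like this - a sketch only, since iavl's exact signatures vary by version, and kvPair/writeChunk are made-up helpers:

```go
// Sketch of the IAVL path: export a committed version via an ImmutableTree,
// which later writes to the MutableTree cannot affect. Exact iavl signatures
// vary by version; kvPair and writeChunk are hypothetical helpers.
package snapshots

import "github.com/cosmos/iavl"

type kvPair struct {
	key, value []byte
}

func snapshotVersion(tree *iavl.MutableTree, version int64) error {
	immutable, err := tree.GetImmutable(version) // read-only view of `version`
	if err != nil {
		return err
	}
	immutable.Iterate(func(key, value []byte) bool {
		writeChunk(kvPair{key: key, value: value}) // stream pairs into chunks
		return false // false = keep iterating
	})
	return nil
}

func writeChunk(kv kvPair) {
	// Serialize the pair into the current snapshot chunk.
}
```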

> > Garbage collected: snapshots must be garbage collected periodically.
>
> Is this a consensus parameter as well?

If we make the snapshot interval a consensus parameter, then I think this should be as well.
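
For example, a sketch of such a retention rule (the keep-recent count is hypothetical, not an agreed parameter):

```go
// Illustrative garbage collection: retain the `keepRecent` newest snapshot
// heights and delete the rest. deleteSnapshot is a hypothetical helper.
package snapshots

import "sort"

func pruneSnapshots(heights []uint64, keepRecent int) {
	if len(heights) <= keepRecent {
		return
	}
	// Sort descending so the newest heights come first.
	sort.Slice(heights, func(i, j int) bool { return heights[i] > heights[j] })
	for _, height := range heights[keepRecent:] {
		deleteSnapshot(height) // remove this snapshot's chunks and metadata
	}
}

func deleteSnapshot(height uint64) {
	// Delete stored chunks and metadata for the snapshot at `height`.
}
```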

> > The node switches to fast sync to catch up blocks that were committed while restoring the snapshot.
>
> After snapshot restore is complete, correct?

Correct.

@github-actions
Contributor

github-actions bot commented Jul 5, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
