This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Potential Database Corruption during sync #2603

Closed
tomusdrw opened this issue Oct 13, 2016 · 11 comments
Labels
F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust.
Comments

@tomusdrw
Collaborator

2016-10-12 23:29:58  Syncing #2422970 b8be…d6b2      1 blk/s    6 tx/s   0 Mgas/s       0+ 7245 Qed   #2430219    1/46/50 peers      2 GiB db    7 MiB chain   40 MiB queue   11 MiB sync
2016-10-12 23:30:02  Block import failed for #2422985 (843d…5b07)
Error: Trie(IncompleteDatabase(11b9caba988cd1aeefcc20ca0595f051064c70e7149a5a0670366c322268c310))
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 27: 27, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 83: 83, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 41: 41, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 47: 47, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 69: 69, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 61: 61, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 72: 72, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 5: 5, state = ChainHead
2016-10-12 23:30:04  Bad header 2423110 (b8e4…a224) from 48: 48, state = ChainHead
2016-10-12 23:30:06  Bad header 2423110 (b8e4…a224) from 37: 37, state = ChainHead
2016-10-12 23:30:07  Bad header 2423110 (b8e4…a224) from 2: 2, state = ChainHead
2016-10-12 23:30:08  Bad header 2423110 (b8e4…a224) from 57: 57, state = ChainHead
2016-10-12 23:30:08  Syncing #2422984 969c…b7f0      1 blk/s   15 tx/s   0 Mgas/s       0+    0 Qed   #2422983    3/39/50 peers      2 GiB db    8 MiB chain    2 KiB queue   11 MiB sync
2016-10-12 23:30:11  Bad header 2423110 (b8e4…a224) from 23: 23, state = ChainHead
thread 'IO Worker #1' panicked at 'Potential DB corruption encountered: Database missing expected key: 1e34…d51d', ethcore/src/state/mod.rs:645
...
error: Process didn't exit successfully: `target/release/parity` (signal: 11, SIGSEGV: invalid memory reference)

Enough disk space (20GB)
4GB RAM node

Running latest master via:
$ cargo run --release --no-default-features --bin parity -- --relay-set strict --force-sealing

@tomusdrw tomusdrw added Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. M4-core ⛓ Core client code / Rust. labels Oct 13, 2016
@rphmeier
Contributor

That's probably a RocksDB OOM issue, judging by the SIGSEGV.

@arkpar arkpar added F2-bug 🐞 The client fails to follow expected behavior. and removed Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. labels Oct 14, 2016
@arkpar
Copy link
Collaborator

arkpar commented Oct 14, 2016

Could not reproduce on my local VM (Ubuntu 14.04).
Reproduced on the DigitalOcean 4 GB machine (Ubuntu 15), though.

@pyskell
Contributor

pyskell commented Oct 18, 2016

Adding some more info to this based on the suggestion from @keorn

This doesn't seem to be tied specifically to many days of runtime: even after restarting Parity, or attempting to sync a fresh copy of the chain from the network, the same issue is encountered. So even a brand new machine running the latest version of Parity is unable to sync on either network. Even the newer parity restore <snapshot> does not work (my earlier comment was in error). The only thing that has worked is fully downloading another user's blockchain.
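(For context, parity restore is just a subcommand that imports a previously taken state snapshot; a minimal invocation, with a purely hypothetical filename for a snapshot produced beforehand on a healthy node, looks like this:)

$ parity restore /path/to/state.snapshot    # hypothetical path; imports the snapshot into the local DB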

While this seems to be triggered by a heavy stretch of blocks (around 2,420,000), possibly related to the recent exploit, it's important to note that it even failed to sync freshly from the network on a VPS with 16 GB of RAM and 8 CPUs (the Digital Ocean $160 droplet option). As such, this is a DoS for new nodes attempting to enter the network, even on more than capable machines, and it hints that the issue may not be tied solely to the intense computation required for the exploit blocks.

Also worth noting is that the panic/crash is immediate. So if I start parity to sync a fresh chain, let it crash at the problem block hours later, and then start it again, it will crash within about a second.

My output in particular differs a bit from the original commenter's so I've included it below:

thread 'IO Worker #2' panicked at 'Potential DB corruption encountered: Database missing expected key: 1348…1230', ethcore/src/state.rs:629
stack backtrace:
   1:     0x7f3f8de417b9 - <unknown>
   2:     0x7f3f8de4948c - <unknown>
   3:     0x7f3f8de48359 - <unknown>
   4:     0x7f3f8de48a48 - <unknown>
   5:     0x7f3f8de488a2 - <unknown>
   6:     0x7f3f8de48810 - <unknown>
   7:     0x7f3f8da7f5da - <unknown>
   8:     0x7f3f8da01a4f - <unknown>
   9:     0x7f3f8d9c3e50 - <unknown>
  10:     0x7f3f8da37461 - <unknown>
  11:     0x7f3f8da39837 - <unknown>
  12:     0x7f3f8d9ef69a - <unknown>
  13:     0x7f3f8d8ecab5 - <unknown>
  14:     0x7f3f8de50f76 - <unknown>
  15:     0x7f3f8d94da3e - <unknown>
  16:     0x7f3f8de46ff2 - <unknown>
  17:     0x7f3f8c5830a3 - start_thread
  18:     0x7f3f8cf9387c - clone
  19:                0x0 - <unknown>
2016-10-14 13:07:16  Finishing work, please wait...

I have a working copy of the blockchain here (courtesy of another user) if it can be of any use for debugging: full parity copy

This copy includes the problem blocks, but Parity doesn't need to process them, so the remaining blocks sync as normal.

@inmathwetrust

I am also affected by this issue as soon as I run the executable.
Running the client on Ubuntu 16.04.1 LTS.

Stage 3 block verification failed for #2422712 (a1b3…1ce4)
Error: Block(UnknownParent(1ec2be8ab88022c770b1e76ba0147c6e16e28d88e274947f038fdc1b54552f81))

Is there a workaround for this issue, or an ETA for a fix? Thanks.

@pyskell
Contributor

pyskell commented Oct 19, 2016

@inmathwetrust

You can download my copy at the "full parity node" link and copy the DB to your .parity folder.

Just two things to keep in mind:

  • This is for the ETC network, not Ethereum
  • Make sure you don't overwrite any keys you might have stored in your .parity folder
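A rough sketch of that copy, keeping both caveats in mind (the directory names below are placeholders/assumptions rather than the exact layout on every platform; stop Parity first and leave your keys directory untouched):

$ mv ~/.parity/<chain-db-dir> ~/.parity/<chain-db-dir>.broken    # set the corrupted chain DB aside
$ cp -r /path/to/downloaded/<chain-db-dir> ~/.parity/            # drop in the downloaded copy
$ ls ~/.parity/keys                                              # sanity check: your keys should still be there, untouched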

@arkpar
Collaborator

arkpar commented Oct 21, 2016

This should be fixed in 1.3.9. Please let us know if you see it again.

@kenzaka07

kenzaka07 commented Oct 23, 2016

Hi.

I was using Parity 1.3.9. Everything was going well, though syncing was slow, until it hit this issue and would not let me sync past block #2451318. Every time I restart Parity, it crashes again. This is the first time I have encountered such an issue since I started using 1.3.0, all the way through 1.3.9.

Please let me know what I should do. I am now behind the latest block because syncing has been slow recently.

2016-10-23 19:14:29  Starting Parity/v1.3.9-beta-e9987c4-20161021/x86_64-windows-msvc/rustc1.12.0
2016-10-23 19:14:29  Using state DB journalling strategy fast
2016-10-23 19:14:29  Configured for Frontier/Homestead using Ethash engine
2016-10-23 19:14:42  NAT mapped to external address 112.201.176.90:58848
2016-10-23 19:14:42  Public node URL: enode://fd8891a24d019c70283d26f53ada8ae04309f42c1478777a733d5061428216f788ed2783297da0328127445f2dd308c1122e307fae67e1613241c707eff8e172@112.201.176.90:58848+60778
2016-10-23 19:14:50  Syncing #2451318 dd33…ffe9      0 blk/s    0 tx/s   0 Mgas/s       0+    0 Qed   #2451318    5/ 5/25 peers     18 MiB db    8 KiB chain  0 bytes queue   11 KiB sync
2016-10-23 19:15:04  Syncing #2451318 dd33…ffe9      0 blk/s    0 tx/s   0 Mgas/s       0+    0 Qed   #2451318    1/ 3/25 peers     18 MiB db    8 KiB chain  0 bytes queue   19 KiB sync
2016-10-23 19:15:04  Syncing #2451318 dd33…ffe9      0 blk/s    0 tx/s   0 Mgas/s       0+    0 Qed   #2451318    1/ 3/25 peers     18 MiB db    8 KiB chain  0 bytes queue   19 KiB sync
2016-10-23 19:15:04  Syncing #2451318 dd33…ffe9      0 blk/s    0 tx/s   0 Mgas/s       0+    0 Qed   #2451318    1/ 3/25 peers     18 MiB db    8 KiB chain  0 bytes queue   19 KiB sync
2016-10-23 19:15:12  Syncing #2451318 dd33…ffe9      0 blk/s    0 tx/s   0 Mgas/s       0+    0 Qed   #2451318    4/ 5/25 peers     18 MiB db    8 KiB chain  0 bytes queue  130 KiB sync
thread 'Verifier #0' panicked at 'DB flush failed.: "Corruption: block checksum mismatch"', ../src/libcore\result.rs:788
stack backtrace:
   0:     0x7ff67bb1346e - <unknown>
   1:     0x7ff67bb11363 - <unknown>
   2:     0x7ff67bb11e2d - <unknown>
   3:     0x7ff67bb11c76 - <unknown>
   4:     0x7ff67bb11bd4 - <unknown>
   5:     0x7ff67bb11b6b - <unknown>
   6:     0x7ff67bb1edb5 - <unknown>
   7:     0x7ff67ba2419a - <unknown>
   8:     0x7ff67b768069 - <unknown>
   9:     0x7ff67b5c037f - <unknown>
  10:     0x7ff67b62039a - <unknown>
  11:     0x7ff67bb15631 - <unknown>
  12:     0x7ff67b6818cb - <unknown>
  13:     0x7ff67bb0f15e - <unknown>
  14:     0x7ffd1dc48363 - BaseThreadInitThunk
2016-10-23 19:15:19  Finishing work, please wait...
thread 'Verifier #1' panicked at 'DB flush failed.: "Corruption: block checksum mismatch"', ../src/libcore\result.rs:788
stack backtrace:
   0:     0x7ff67bb1346e - <unknown>
   1:     0x7ff67bb11363 - <unknown>
   2:     0x7ff67bb11e2d - <unknown>
   3:     0x7ff67bb11c76 - <unknown>
   4:     0x7ff67bb11bd4 - <unknown>
   5:     0x7ff67bb11b6b - <unknown>
   6:     0x7ff67bb1edb5 - <unknown>
   7:     0x7ff67ba2419a - <unknown>
   8:     0x7ff67b768069 - <unknown>
   9:     0x7ff67b5c037f - <unknown>
  10:     0x7ff67b62039a - <unknown>
  11:     0x7ff67bb15631 - <unknown>
  12:     0x7ff67b6818cb - <unknown>
  13:     0x7ff67bb0f15e - <unknown>
  14:     0x7ffd1dc48363 - BaseThreadInitThunk

@gavofyork
Contributor

gavofyork commented Oct 27, 2016

This is fixed in master (#2832) and will be fixed in the 1.3.10 stable release. Please test when those are released and reopen if the issue reappears.

@5chdn
Contributor

5chdn commented Jul 10, 2017

A user reported this issue with the latest beta, 1.6.8. Is this the very same issue?

[screenshot attachment: image001]

@arkpar
Collaborator

arkpar commented Jul 10, 2017

@5chdn probably not. Was there an out-of-memory or out-of-disk error on a prior run?

@5chdn
Contributor

5chdn commented Jul 11, 2017

@arkpar I can't tell; I was guiding him through accessing the node logs and this is the first time he has looked at them. We have now reset the db and it works.
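(For anyone else landing here: "reset the db" means wiping the chain database and re-syncing while keeping the keys. A minimal sketch, assuming the db kill subcommand is available in your Parity version; otherwise remove the chain directory under the Parity data dir by hand:)

$ parity db kill    # removes the blockchain database for the configured chain and forces a full re-sync; keys are kept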
