Compaction thread crash on sync writes #1422

ghost · 2016-10-26T17:24:48Z

Hi,

We have a use case of doing sync writes (each write ~ 1 Kb, SSD page size = 4Kb) with the following settings to alleviate stalling,

rocksdb_options_set_write_buffer_size(options,33554432);
env = rocksdb_create_default_env();
rocksdb_env_set_background_threads(env,2);
rocksdb_env_set_high_priority_background_threads(env,1);
rocksdb_options_set_max_background_compactions(options, 2);
rocksdb_options_set_max_background_flushes(options,1);
rocksdb_options_set_env(options,env);

Everything else is default (default read cache settings). After ~ 45 million rows were loaded, we got a crash on one of the compaction threads (the build was made from the current repository last week):

(gdb) bt
#0 0x00007f7f574a29e4 in UnalignedCopy64 (decompressor=0x7f7f45d23790, writer=0x7f7f45d237c0, uncompressed_len=) at snappy-stubs-internal.h:195
#1 IncrementalCopyFastPath (decompressor=0x7f7f45d23790, writer=0x7f7f45d237c0, uncompressed_len=) at snappy.cc:147
#2 AppendFromSelf (decompressor=0x7f7f45d23790, writer=0x7f7f45d237c0, uncompressed_len=) at snappy.cc:1209
#3 DecompressAllTagssnappy::SnappyArrayWriter (decompressor=0x7f7f45d23790, writer=0x7f7f45d237c0, uncompressed_len=) at snappy.cc:779
#4 snappy::InternalUncompressAllTagssnappy::SnappyArrayWriter (decompressor=0x7f7f45d23790, writer=0x7f7f45d237c0, uncompressed_len=) at snappy.cc:865
#5 0x00007f7f574a35e0 in InternalUncompresssnappy::SnappyArrayWriter (compressed=, uncompressed=) at snappy.cc:855
#6 snappy::RawUncompress (compressed=, uncompressed=) at snappy.cc:1234
#7 0x00007f7f574a3632 in snappy::RawUncompress (compressed=, n=, uncompressed=) at snappy.cc:1229
#8 0x00007f7f5795111d in Snappy_Uncompress (data=0x7f7efa0afea0 "\305\335-\220", n=215560, contents=0x7f7f45d24ee0, format_version=2, compression_dict=..., compression_type=, ioptions=...)

at ./util/compression.h:179

#9 rocksdb::UncompressBlockContentsForCompressionType (data=0x7f7efa0afea0 "\305\335-\220", n=215560, contents=0x7f7f45d24ee0, format_version=2, compression_dict=..., compression_type=,

ioptions=...) at table/format.cc:434

#10 0x00007f7f57951631 in rocksdb::UncompressBlockContents (data=Unhandled dwarf expression opcode 0xf3

) at table/format.cc:540
#11 0x00007f7f5795206c in rocksdb::ReadBlockContents (file=Unhandled dwarf expression opcode 0xf3

) at table/format.cc:388
#12 0x00007f7f5793a628 in rocksdb::(anonymous namespace)::ReadBlockFromFile (file=Unhandled dwarf expression opcode 0xf3

) at table/block_based_table_reader.cc:77
#13 0x00007f7f5793d38d in Create (this=Unhandled dwarf expression opcode 0xf3

) at table/block_based_table_reader.cc:196
#14 rocksdb::BlockBasedTable::CreateIndexReader (this=Unhandled dwarf expression opcode 0xf3

) at table/block_based_table_reader.cc:1743
#15 0x00007f7f579439d5 in rocksdb::BlockBasedTable::Open (ioptions=Unhandled dwarf expression opcode 0xf3

) at table/block_based_table_reader.cc:788
#16 0x00007f7f57938c5b in rocksdb::BlockBasedTableFactory::NewTableReader (this=Unhandled dwarf expression opcode 0xf3

) at table/block_based_table_factory.cc:59
#17 0x00007f7f578f31c5 in rocksdb::TableCache::GetTableReader (this=Unhandled dwarf expression opcode 0xf3

) at db/table_cache.cc:111
#18 0x00007f7f578f376d in rocksdb::TableCache::FindTable (this=0x1962cb0, env_options=..., internal_comparator=..., fd=..., handle=0x7f7f45d25678, no_io=false, record_read_stats=true, file_read_hist=0x19619e0,

skip_filters=false, level=-1, prefetch_index_and_filter_in_cache=true) at db/table_cache.cc:148

#19 0x00007f7f578f3d1d in rocksdb::TableCache::NewIterator (this=0x1962cb0, options=..., env_options=..., icomparator=..., fd=..., table_reader_ptr=0x0, file_read_hist=0x19619e0, for_compaction=false, arena=

0x0, skip_filters=false, level=-1, range_del_agg=0x0, is_range_del_only=false) at db/table_cache.cc:218

#20 0x00007f7f5786943c in rocksdb::CompactionJob::FinishCompactionOutputFile (this=0x7f7f45d26ca0, input_status=Unhandled dwarf expression opcode 0xf3

) at db/compaction_job.cc:1012
#21 0x00007f7f5786be23 in rocksdb::CompactionJob::ProcessKeyValueCompaction (this=0x7f7f45d26ca0, sub_compact=0x7f7ef89ea9f0) at db/compaction_job.cc:864
#22 0x00007f7f5786ca2f in rocksdb::CompactionJob::Run (this=0x7f7f45d26ca0) at db/compaction_job.cc:535
#23 0x00007f7f578943c8 in rocksdb::DBImpl::BackgroundCompaction (this=0x19510e0, made_progress=0x7f7f45d270fe, job_context=0x7f7f45d27120, log_buffer=0x7f7f45d27320, arg=0x0) at db/db_impl.cc:3616
#24 0x00007f7f578a4d68 in rocksdb::DBImpl::BackgroundCallCompaction (this=0x19510e0, arg=0x0) at db/db_impl.cc:3314
#25 0x00007f7f579c0771 in rocksdb::ThreadPoolImpl::BGThread (this=0x194ab40, thread_id=0) at util/threadpool_imp.cc:229
#26 0x00007f7f579c0853 in rocksdb::BGThreadWrapper (arg=0x1950b90) at util/threadpool_imp.cc:253
#27 0x00000034c0007851 in start_thread () from /lib64/libpthread.so.0
#28 0x00000034bf8e890d in clone () from /lib64/libc.so.6

Thanks,

Ethan.

The text was updated successfully, but these errors were encountered:

dhruba · 2016-10-28T16:07:12Z

A couple of questions:

Can you recreate this crash repeatedly on demand?
The crash happened when a bacjground thread was doing Compaction and was trying to read data from one of the source files in the compaction process. If you look at LOG file, you might be able to find the name of the sst files that were input to this compaction run. Once you find the sst file name, can you pl dump its contents via the ldb tool (https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)? This will verify whether the file contents are sane or if they are corrupted.
If the file contents are sane, then the problem is not with the data but rather the code. It coudl be either rocksdb code or your application code. Maybe you can build the rocksdb library with "valgrind" enabled and then rerun your test program that causes this problem.

ghost · 2016-10-28T16:30:24Z

@dhruba Thanks for reply. Unfortunately it isn't reproducible on demand. I am doing a series of stress tests and this one popped up in one of the tests, but it took many hours to show up.

I do have the coredump available so I can execute commands on it if it could help determine whether this was caused by our application code.

Thanks,

siying · 2016-11-04T23:51:47Z

The call stack is interesting. After finish writing one SST file, its index block cannot be correctly decompressed. The Snappy library crashes while trying to decompress the block.

It is very strange to me. In your core dump, does memory address 0x7f7efa0afea0 and following 215560 bytes valid (The parameter of frame 8)? Could Snappy crash just because the data is not right? If it is the case, is there a way that the data is corrupted in the device or the file system after being generated?

ghost · 2016-11-05T00:44:18Z

@siying Thanks for the follow up. Quick question.
If Snappy is not able to decompress this, could this be caused by memory corruption. In other words, could it be that bugs in our application (like buffer overflow) is corrupting that data?

siying · 2016-11-05T00:52:15Z

@ehamilto it is theoretically possible for sure. However, unless you tune RocksDB in a uncommon way, it is hard for me to believe as this is a common code path and almost every RocksDB instance will go through it.

ghost · 2016-11-05T01:07:59Z

@siying we have repeated the stress test that caused this several times and I haven't seen it again.
We have fixed several crashes in our app in between (although those crashes were caused by de-referencing bad pointers not memory corruption such as a buffer overflow). The only difference between the above settings and what we currently run is that now we run with a memtable size of 4 Mb. I will run a test with a memtable of 32 M in the next few days to see if I can reproduce it again.

gfosco · 2018-01-10T22:43:01Z

Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.

gfosco added the abandoned-or-aged-out label Jan 10, 2018

gfosco closed this as completed Jan 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compaction thread crash on sync writes #1422

Compaction thread crash on sync writes #1422

ghost commented Oct 26, 2016 •

edited by ghost

Loading

dhruba commented Oct 28, 2016

ghost commented Oct 28, 2016

siying commented Nov 4, 2016

ghost commented Nov 5, 2016 •

edited by ghost

Loading

siying commented Nov 5, 2016

ghost commented Nov 5, 2016

gfosco commented Jan 10, 2018

Compaction thread crash on sync writes #1422

Compaction thread crash on sync writes #1422

Comments

ghost commented Oct 26, 2016 • edited by ghost Loading

dhruba commented Oct 28, 2016

ghost commented Oct 28, 2016

siying commented Nov 4, 2016

ghost commented Nov 5, 2016 • edited by ghost Loading

siying commented Nov 5, 2016

ghost commented Nov 5, 2016

gfosco commented Jan 10, 2018

ghost commented Oct 26, 2016 •

edited by ghost

Loading

ghost commented Nov 5, 2016 •

edited by ghost

Loading