Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WaitForFlushMemTables finishing before result is committed to Manifest #10089

Open
daggarwal387 opened this issue Jun 1, 2022 · 7 comments · May be fixed by #10334
Open

WaitForFlushMemTables finishing before result is committed to Manifest #10089

daggarwal387 opened this issue Jun 1, 2022 · 7 comments · May be fixed by #10334
Labels

Comments

@daggarwal387
Copy link

daggarwal387 commented Jun 1, 2022

While using the ExportColumnFamily API I observed that the result is missing the newly flushed memtable. Upon further investigation, I found that WaitForFlushMemTables is finishing before the flush has actually committed the result to the Manifest. Following are the logs for the column family that experienced it(there are some additional logs added for debugging).
2022/05/31-16:28:39.080669 4f8a700 [ties/checkpoint/checkpoint_impl.cc:357] [0_0] export column family onto export directory /tmp/bringup/0/disks/0/scribe/newscribe/bucket_0/db_sender_sst_file 2022/05/31-16:28:39.080729 4f8a700 [_impl/db_impl_compaction_flush.cc:1310] [0_0] Manual flush start. 2022/05/31-16:28:39.095206 4f8a700 [_impl/db_impl_write.cc:1614] [0_0] New memtable created with log file: #113. Immutable memtables: 0. 2022/05/31-16:28:39.095227 4f8a700 [_impl/db_impl_compaction_flush.cc:1596] [0_0] Flush request queued. 2022/05/31-16:28:39.112425 7f5694e41700 (Original Log Time 2022/05/31-16:28:39.095402) [_impl/db_impl_compaction_flush.cc:2208] Calling FlushMemTableToOutputFile with column family [0_0], flush slots available 10, compaction slots available 1, flush slots scheduled 10, compaction slots scheduled 0 2022/05/31-16:28:39.112432 7f5694e41700 [ush_job.cc:322] [0_0] [JOB 320] Flushing memtable with next log file: 113 and mem_table_id: 10 2022/05/31-16:28:39.112450 7f5694e41700 [ush_job.cc:351] [0_0] [JOB 320] Level-0 flush table #114: started 2022/05/31-16:28:39.116592 4d70700 [_impl/db_impl_write.cc:1614] [0_0] New memtable created with log file: #115. Immutable memtables: 1. 2022/05/31-16:28:39.116620 4d70700 [_impl/db_impl_compaction_flush.cc:1596] [0_0] Flush request queued. 2022/05/31-16:28:39.127211 7f5696e45700 (Original Log Time 2022/05/31-16:28:39.116846) [_impl/db_impl_compaction_flush.cc:2208] Calling FlushMemTableToOutputFile with column family [0_0], flush slots available 10, compaction slots available 1, flush slots scheduled 10, compaction slots scheduled 0 2022/05/31-16:28:39.127218 7f5696e45700 [ush_job.cc:322] [0_0] [JOB 331] Flushing memtable with next log file: 115 and mem_table_id: 11 2022/05/31-16:28:39.127238 7f5696e45700 [ush_job.cc:351] [0_0] [JOB 331] Level-0 flush table #116: started 2022/05/31-16:28:39.130535 7f5694e41700 [ush_job.cc:392] [0_0] [JOB 320] Level-0 flush table #114: 235965 bytes OK 2022/05/31-16:28:39.140412 7f5696e45700 [ush_job.cc:392] [0_0] [JOB 331] Level-0 flush table #116: 2489 bytes OK 2022/05/31-16:28:39.154308 7f5696e45700 (Original Log Time 2022/05/31-16:28:39.150687) [_impl/db_impl_compaction_flush.cc:204] [0_0] Level summary: base level 4 level multiplier 8.00 max bytes base 4096000 files[1 0 0 0 1] max score 0.12 2022/05/31-16:28:39.154415 4f8a700 [_impl/db_impl_compaction_flush.cc:1866] [0_0]Flush finished. Num not flushed: 1, earliest_memtable_id_now: 11, mem_table_we_flushed: 10 2022/05/31-16:28:39.154423 4f8a700 [_impl/db_impl_compaction_flush.cc:1320] [0_0] Manual flush finished, status: OK 2022/05/31-16:28:39.154474 4f8a700 [ties/checkpoint/checkpoint_impl.cc:380] [0_0] HardLinking /000100.sst 2022/05/31-16:28:39.154551 4f8a700 [ties/checkpoint/checkpoint_impl.cc:380] [0_0] HardLinking /000090.sst 2022/05/31-16:28:39.166255 7f5694e41700 (Original Log Time 2022/05/31-16:28:39.137073) [mtable_list.cc:383] [0_0] Level-0 commit table #114 has started, mem_id: 10 2022/05/31-16:28:39.166260 7f5694e41700 (Original Log Time 2022/05/31-16:28:39.154210) [mtable_list.cc:439] [0_0] Level-0 commit table #114: memtable #1memtable_id: 10 done 2022/05/31-16:28:39.166263 7f5694e41700 (Original Log Time 2022/05/31-16:28:39.154215) [mtable_list.cc:383] [0_0] Level-0 commit table #116 has started, mem_id: 11 2022/05/31-16:28:39.166265 7f5694e41700 (Original Log Time 2022/05/31-16:28:39.166175) [mtable_list.cc:439] [0_0] Level-0 commit table #116: memtable #1memtable_id: 11 done 2022/05/31-16:28:39.166270 7f5694e41700 (Original Log Time 2022/05/31-16:28:39.166208) [_impl/db_impl_compaction_flush.cc:204] [0_0] Level summary: base level 4 level multiplier 8.00 max bytes base 4096000 files[3 0 0 0 1] max score 0.38 2022/05/31-16:28:39.166273 4f8a700 [ties/checkpoint/checkpoint_impl.cc:435] [0_0] Export succeeded. 2022/05/31-16:28:39.166309 4d70700 [_impl/db_impl_compaction_flush.cc:1866] [0_0]Flush finished. Num not flushed: 0, earliest_memtable_id_now: 18446744073709551615, mem_table_we_flushed: 11 2022/05/31-16:28:39.166321 4d70700 [_impl/db_impl_compaction_flush.cc:1443] [0_0] Manual compaction starting

So if you look here, the memtable with GetID() as 10 is the one that was selected for manual flush from ExportColumnFamily. During committing the logline "Level-0 commit table #114 has started, mem_id: 10" the memtable is at the back of the memlist but a few lines before it "Flush finished. Num not flushed: 1, earliest_memtable_id_now: 11, mem_table_we_flushed: 10" the GetEarliestMemTableID gave us 11.

Expected behavior

WaitForFlushMemTables should not finish before committing to manifest file.

Actual behavior

WaitForFlushMemTables is finishing before committing to manifest file.

Steps to reproduce the behavior

Trigger parallel flush on the same column family.
I am using 6.4.6 rocksdb version with some additional logging and attaching the complete LOG file also in case that helps.
LOG.zip

@akankshamahajan15
Copy link
Contributor

akankshamahajan15 commented Jun 7, 2022

@daggarwal387 Can you try with the latest RocksDB version. I couldn't find the exact PR with the fix but it shouldn't occur in the lastest version.

@akankshamahajan15 akankshamahajan15 added the waiting Waiting for a response from the issue creator. label Jun 7, 2022
@daggarwal387
Copy link
Author

daggarwal387 commented Jun 9, 2022

Upgrading the whole rocksdb version is not feasible for me as it'll take long to qualify and deploy. If you can point me to the PR that fixed it then it will be of immense help as I'll patch with that. I tried searching the latest repo commits with WaitForFlushMemTables but could not find the relevant PR. In case you were talking about #8046 then this is not the case because the logs show that the flush did commit to the Manifest a little later.

@riversand963
Copy link
Contributor

@daggarwal387 can you take a look at 5f2b661 and see if this fixes the issue?

@daggarwal387
Copy link
Author

daggarwal387 commented Jun 9, 2022

@riversand963 It will not because if you check the logs above, the flush actually did not see any issue and committed just a few milliseconds after coming out of WaitForFlushMemTables. 5f2b661 is same as #8046 that I mentioned in the previous comment.
2022/05/31-16:28:39.154415 4f8a700 [_impl/db_impl_compaction_flush.cc:1866] [0_0]Flush finished. Num not flushed: 1, earliest_memtable_id_now: 11, mem_table_we_flushed: 10
This log line shows that we did not exit the wait from that break condition as mentioned in 5f2b661 and we exited via the normal flow.

@akankshamahajan15
Copy link
Contributor

@daggarwal387 Can you share a unit test. I can check it with latest release to see if it has been fixed or not and can take from there.

@ajkr ajkr added question up-for-grabs Up for grabs and removed waiting Waiting for a response from the issue creator. labels Jun 21, 2022
@daggarwal387
Copy link
Author

daggarwal387 commented Jul 11, 2022

@akankshamahajan15 @ajkr parallel_flush_UT.patch.txt and Pull request(#10334)
Uploading the UT as a patch which reproduces the issue. This UT triggers two flushes, one with wait via the ExportColumnFamily API and another without wait. I have added a random delay to reproduce the race. So running FlushBeforeWritingManifestWithCheckpoint in a loop should quickly reproduce the failure.
Expected behavior:
ExportColumnFamily should always get an SST file as there was a write.
Actual behavior:
ExportColumnFamily sometimes don't get any SST file as the file was not committed to manifest.

Suspected Race:

  1. A write was done to column family and a flush with wait is triggered(let's call it Flush_A).
  2. A second write is issued to the column family which also triggers a flush(let's call it Flush_B) such that both flushes are running in parallel.
  3. Flush_B flushes its memtable before Flush_A and goes ahead with commit. This leads to Flush_A exiting the WaitForFlushMemTables before the commit to manifest happens.

Sample logs when failure happens:
`cohesity@u18-100911-01:/tmp/rocksdbtest-1000$ grep default db_flush_test_8868917373269466110/LOG

2022/07/10-22:23:55.424822 7f9bafde5240 [lumn_family.cc:481] --------------- Options for column family [default]:

2022/07/10-22:23:55.426174 7f9bafde5240 [rsion_set.cc:4421] Column family [default] (ID 0), log number is 0
** Compaction Stats [default] **
** Compaction Stats [default] **
** File Read Latency Histogram By Level [default] **
** Compaction Stats [default] **
** Compaction Stats [default] **
** File Read Latency Histogram By Level [default] **

2022/07/10-22:23:55.448325 7f9bafde5240 [ties/checkpoint/checkpoint_impl.cc:357] [default] export column family onto export directory /tmp/checkpoint1_1657517035448

2022/07/10-22:23:55.448421 7f9bafde5240 [_impl/db_impl_compaction_flush.cc:1320] [default] Manual flush start.

2022/07/10-22:23:55.448568 7f9bafde5240 [_impl/db_impl_write.cc:1614] [default] New memtable created with log file: #9. Immutable memtables: 0.

2022/07/10-22:23:55.448760 7f9b967fc700 (Original Log Time 2022/07/10-22:23:55.448679) [_impl/db_impl_compaction_flush.cc:2192] Calling FlushMemTableToOutputFile with column family [default], flush slots available 3, compaction slots available 1, flush slots scheduled 3, compaction slots scheduled 0

2022/07/10-22:23:55.448785 7f9b967fc700 [ush_job.cc:325] [default] [JOB 4] Flushing memtable with next log file: 9

2022/07/10-22:23:55.448818 7f9b967fc700 [ush_job.cc:354] [default] [JOB 4] Level-0 flush table #10: started

2022/07/10-22:23:55.449092 7f9b967fc700 [_impl/db_impl_compaction_flush.cc:1320] [default] Manual flush start.

2022/07/10-22:23:55.449270 7f9b967fc700 [_impl/db_impl_write.cc:1614] [default] New memtable created with log file: #11. Immutable memtables: 1.

2022/07/10-22:23:55.449406 7f9b967fc700 [_impl/db_impl_compaction_flush.cc:1331] [default] Manual flush finished, status: OK

2022/07/10-22:23:55.449422 7f9b95ffb700 (Original Log Time 2022/07/10-22:23:55.449386) [_impl/db_impl_compaction_flush.cc:2192] Calling FlushMemTableToOutputFile with column family [default], flush slots available 3, compaction slots available 1, flush slots scheduled 3, compaction slots scheduled 0

2022/07/10-22:23:55.449431 7f9b95ffb700 [ush_job.cc:325] [default] [JOB 6] Flushing memtable with next log file: 11

2022/07/10-22:23:55.449450 7f9b95ffb700 [ush_job.cc:354] [default] [JOB 6] Level-0 flush table #12: started

2022/07/10-22:23:55.455355 7f9b95ffb700 EVENT_LOG_v1 {"time_micros": 1657517035455330, "cf_name": "default", "job": 6, "event": "table_file_creation", "file_number": 12, "file_size": 812, "table_properties": {"data_size": 28, "index_size": 29, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 0, "raw_key_size": 11, "raw_average_key_size": 11, "raw_value_size": 1, "raw_average_value_size": 1, "num_data_blocks": 1, "num_entries": 1, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1657517035, "oldest_key_time": 1657517035, "file_creation_time": 1657517035}}

2022/07/10-22:23:55.455398 7f9b95ffb700 [ush_job.cc:397] [default] [JOB 6] Level-0 flush table #12: 812 bytes OK

2022/07/10-22:23:55.455372 7f9b967fc700 EVENT_LOG_v1 {"time_micros": 1657517035455334, "cf_name": "default", "job": 4, "event": "table_file_creation", "file_number": 10, "file_size": 812, "table_properties": {"data_size": 28, "index_size": 29, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 0, "raw_key_size": 11, "raw_average_key_size": 11, "raw_value_size": 1, "raw_average_value_size": 1, "num_data_blocks": 1, "num_entries": 1, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1657517035, "oldest_key_time": 1657517035, "file_creation_time": 1657517035}}

2022/07/10-22:23:55.455445 7f9b967fc700 [ush_job.cc:397] [default] [JOB 4] Level-0 flush table #10: 812 bytes OK

2022/07/10-22:23:55.627947 7f9b95ffb700 (Original Log Time 2022/07/10-22:23:55.627865) [_impl/db_impl_compaction_flush.cc:205] [default] Level summary: files[0 0 0 0 0 0 0] max score 0.00

2022/07/10-22:23:55.628103 7f9bafde5240 [_impl/db_impl_compaction_flush.cc:1331] [default] Manual flush finished, status: OK

2022/07/10-22:23:55.632167 7f9bafde5240 [ties/checkpoint/checkpoint_impl.cc:434] [default] Export succeeded.

2022/07/10-22:23:55.632225 7f9b967fc700 (Original Log Time 2022/07/10-22:23:55.541595) [mtable_list.cc:382] [default] Level-0 commit table #10 started

2022/07/10-22:23:55.632230 7f9b967fc700 (Original Log Time 2022/07/10-22:23:55.628007) [mtable_list.cc:430] [default] Level-0 commit table #10: memtable #1 done

2022/07/10-22:23:55.632233 7f9b967fc700 (Original Log Time 2022/07/10-22:23:55.628012) [mtable_list.cc:382] [default] Level-0 commit table #12 started

2022/07/10-22:23:55.632236 7f9b967fc700 (Original Log Time 2022/07/10-22:23:55.632161) [mtable_list.cc:430] [default] Level-0 commit table #12: memtable #1 done

2022/07/10-22:23:55.632241 7f9b967fc700 (Original Log Time 2022/07/10-22:23:55.632195) [_impl/db_impl_compaction_flush.cc:205] [default] Level summary: files[2 0 0 0 0 0 0] max score 0.50`

@daggarwal387 daggarwal387 linked a pull request Jul 11, 2022 that will close this issue
@daggarwal387
Copy link
Author

daggarwal387 commented Feb 18, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants