Segmentation fault on shutdown during optimization #1316
Comments
Hi 👋 Looking through the provided gdb log, it seems the error happened in the RocksDB code, not in Qdrant. I would like to reproduce this bug. Can you please provide more information about the environment where this happened?
Hi @bazhenov 👋 Thank you for your interest in this issue! I have not tried to reproduce this issue recently, but I used to be able to get it from time to time. All the gdb logs I got pointed to RocksDB; here are the ones I still have: I ran into this issue by running the following test on my laptop (Ubuntu):
If you get lucky, one of those nodes will crash during shutdown.
Here is what I've found so far. I was able to reproduce the issue and catch it under

The interesting things happen on the boundary between frames no. 2 and 3. Frame no. 3 points to

Then the call flow lands in

Looking through the RocksDB source, I've not been able to find where
Very happy you were able to reproduce the issue 👏 I started to think it was actually related to something on my local system. According to my core dumps and your investigation, it could be related to compaction. Maybe you could try to disable compaction somehow on our RocksDB setup to see if the bug disappears 🤔 We have the following configuration in

```rust
pub fn db_options() -> Options {
    let mut options: Options = Options::default();
    options.set_write_buffer_size(DB_CACHE_SIZE);
    options.create_if_missing(true);
    options.set_log_level(LogLevel::Error);
    options.set_recycle_log_file_num(2);
    options.set_max_log_file_size(DB_MAX_LOG_SIZE);
    options.create_missing_column_families(true);
    options.set_max_open_files(DB_MAX_OPEN_FILES as i32);
    #[cfg(debug_assertions)]
    {
        options.set_paranoid_checks(true);
    }
    options
}
```
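A hedged sketch of what that experiment could look like, assuming the rust-rocksdb `Options` API (`set_disable_auto_compactions` exists there in recent versions, though availability depends on the binding version in use; the function name `db_options_without_compaction` is hypothetical, for illustration only):

```rust
// Sketch of the suggested experiment: reuse db_options() from above, but turn
// off automatic background compactions. Flushes still run, so if the shutdown
// crash disappears with this configuration, compaction is the likely culprit.
pub fn db_options_without_compaction() -> Options {
    let mut options = db_options();
    options.set_disable_auto_compactions(true);
    options
}
```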
Ok, I think I know what's going on. TL;DR: there is a race condition between the main thread's shutdown and the background threads where RocksDB flush/compaction is executed.

Detailed description

RocksDB uses several static variables while the compaction job is running. When the process shuts down, the main thread runs cleanup of those statics; the issue happens when a compaction job is still running at that moment, because it addresses static variables which are not valid anymore. Every SIGSEGV instance I've faced had two threads interacting in the following way:
Here, thread 1 is the main thread running cleanup of static variables, and thread 9 is the flushing/compacting job facing an invalid pointer.
There is a mention that RocksDB should be fully closed before
What can be done?

IMO
I've tried to build a minimal change to the source code which can prove that the issue can be fixed that way. In the
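One possible shape for such a change, as a hedged sketch: RocksDB's C API exposes `rocksdb_cancel_all_background_work`, and the rust-rocksdb binding surfaces it as `DB::cancel_all_background_work`; calling it with `wait = true` before the handle is dropped should stop flush/compaction jobs before any static teardown begins (assuming the binding version in use exposes this method):

```rust
use rocksdb::{Options, DB};

// Sketch: stop RocksDB background jobs before shutdown so no flush/compaction
// thread is still running when process-exit cleanup of statics begins.
fn close_db_cleanly(db: DB) {
    // `true` = block until in-flight background jobs have finished.
    db.cancel_all_background_work(true);
    drop(db); // now safe: no background thread references DB state anymore
}

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/example-db").expect("open failed");
    db.put(b"k", b"v").expect("put failed");
    close_db_cleanly(db);
}
```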
Great investigation 👍 I can see that we are calling

Regarding the

To make sure I understand, why are those drop calls not enough in the case of a graceful shutdown? Regarding your current approach, are you trying to set a handler on the
I'm not sure how exactly this happens, but I saw using
This part I'm still trying to figure out. When
Not quite.
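To make the `drop` question above concrete, here is a minimal self-contained sketch (with a hypothetical `Toc` stand-in type, not Qdrant's actual `TableOfContent`): dropping one `Arc` clone only decrements the reference count, so the inner value's `Drop` implementation, and with it any RocksDB teardown, runs only when the last clone is released.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the table of contents that owns RocksDB handles.
struct Toc;

impl Drop for Toc {
    fn drop(&mut self) {
        println!("ToC dropped; RocksDB handles would be closed here");
    }
}

fn main() {
    let toc = Arc::new(Toc);
    let held_by_task = Arc::clone(&toc); // e.g. a clone captured by a spawned task

    drop(toc); // only decrements the count; the inner Toc is NOT dropped yet
    println!("strong count: {}", Arc::strong_count(&held_by_task)); // prints 1

    drop(held_by_task); // last clone released: Drop for Toc runs only now
}
```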
Ok. It seems to be working. I can force

```rust
// Busy-wait until this is the last Arc reference to the ToC, then drop it on
// the main thread so RocksDB is fully closed before static cleanup begins.
let mut toc_arc = toc_arc;
loop {
    match Arc::try_unwrap(toc_arc) {
        // We held the last reference: drop the ToC (and its RocksDB handles).
        Ok(toc) => {
            drop(toc);
            break;
        }
        // Another thread still holds a clone: take the Arc back and retry.
        Err(toc) => {
            toc_arc = toc;
            log::warn!("Waiting for ToC");
            thread::sleep(Duration::from_secs(1));
        }
    }
}
```

I abused a 3-node cluster as hard as I could, but no SIGSEGVs happened.
Nice progress 👍 Were you somehow able to track which

Feel free to create a draft PR to share your results with everyone, it does not have to be perfect right away :)
Yes,
Sure, I will.
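As an aside on the tracking question above, one illustrative approach (a hypothetical helper, not Qdrant's actual code) is to log `Arc::strong_count` around shutdown: a count above 1 means some task or thread still holds a clone, and `Arc::try_unwrap` will keep failing until it is released.

```rust
use std::sync::Arc;

// Hypothetical helper: report how many Arc clones of the ToC are still alive.
// Logging this around shutdown helps locate the spawn sites whose captured
// clones keep the ToC (and thus RocksDB) from being dropped.
fn report_toc_refs<T>(toc_arc: &Arc<T>) {
    log::warn!(
        "ToC strong references still alive: {}",
        Arc::strong_count(toc_arc)
    );
}
```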
Original Issue

The Qdrant process produces a segmentation fault when shutting down during index optimization. It does not always happen; it is rather rare.
Current Behavior
Logs of the node at the time.
I was able to analyze the core dump with gdb; the full log with all stack traces is here: core-dump-gdb.log

The main information is
Steps to Reproduce
Deploy a 3-node cluster and kill one node while it optimizes an HNSW index.