
Work around a deadlock in hts_tpool_process_flush. #1309

Merged 2 commits into samtools:develop on Jul 29, 2021

Conversation

jkbonfield
Contributor

The tpool_worker threads have a check on

&& q->qsize - q->n_output > p->tsize - p->nwaiting

This is to ensure that all waiting threads have sufficient room to write their results to the output queue, so if they all start up, there's room.

However, when flushing, we need to leave a bit more leeway. It's possible that we haven't yet consumed the results, so the output queue is full, meaning the flush isn't something that we can call and just wait on unless we have a separate thread that is able to drain the queue. (This is valid for SAM and BAM, but not CRAM where the result consumer is also the same thread that calls the flush).

Hts_tpool_process_flush attempts to fix this, albeit with an ironic comment of "Shouldn't be possible to get here, but just in case". It can, and it matters, but it's not sufficient. The condition was:

if (q->qsize < q->n_output + q->n_input + q->n_processing)

However, we can see from the tpool_worker check above that it uses "p->tsize - p->nwaiting", so it's not just a case of qsize vs n_output. Adding "p->tsize" to our qsize check avoids this potential deadlock.

(We may want to treat this as a temporary measure, because a caller may flush mid-way rather than only when it's about to tear everything down; still, an increased queue size isn't a major concern as it's simply a little bit more memory.)

As to how to trigger this, it was a 1 in 750 or so case from the htslib test:

./test_view -@4 -o seqs_per_slice=100 -o no_ref=1 -o multi_seq_per_slice=-1 -S -C multi_ref.tmp.sam

The deadlock occurs in cram_close, which initially calls cram_flush_container_mt and then moves on to hts_tpool_process_flush. In between these is the race condition that can lead to lots of waiting worker threads and hence the qsize being too small. We can trigger this bug 100% of the time by inserting a "sleep(1)" prior to the "if (fd->pool &&..." conditional. With the fix, it has survived 20,000 tests.

Arguably the bug is in cram_io.c and not the thread pool, but it's best to make the pool robust to poor usage, especially given there is no hts_tpool_process_flush call that has a timeout.

@daviesrob
Member

I've managed to reproduce this, albeit not with 100% reliability even with the strategically-placed sleep. I can confirm that the patch seems to work, although I wonder if the actual problem is in the comparison used in tpool_worker() (q->qsize - q->n_output > p->tsize - p->nwaiting).

One observation is that the worker isn't taking itself into account when checking the number of busy threads, because it hasn't incremented p->nwaiting yet. It is, however, not doing anything at that moment, so maybe the comparison should be q->qsize - q->n_output > p->tsize - p->nwaiting - 1, which does fix the bug in my limited testing.

Also I'm not sure if p->tsize - p->nwaiting is the right thing to compare against. It appeared in bd32ec6, which was designed to keep work on a smaller set of threads when there isn't much work available. This part makes the thread shut down early if the output queue is becoming full. Using q->qsize - q->n_output > q->n_processing would allow it to continue until all of the output slots are filled, and matches the condition used to start threads in wake_next_worker. This version also fixed the deadlock when I tried it.

I think this would be worth pursuing as threads stopping prematurely might be causing the thread pool to work less effectively than it should.

@jkbonfield
Contributor Author

I think this would be worth pursuing as threads stopping prematurely might be causing the thread pool to work less effectively than it should.

I've tried a variety of things and cannot find any evidence of this.

Do you have any example commands that aren't running efficiently? I know sort doesn't, but that's nothing to do with the thread pool. It's an issue of I/O contention and a non-pipelined process (all sort, all write, repeat... then all merge).

@jkbonfield
Contributor Author

jkbonfield commented Jul 28, 2021

After more experimentation on a machine with some frequency scaling, showing GHz, CPUs utilised, task-clock and wall-clock times for my PR and the q->n_processing variant, I see no real evidence of speed differences. This is -@32 BAM->SAM | md5sum (to get a bottleneck and limit the number of useful threads).

          GHz     CPUs    task-clock  wall-clock
jkb PR    3.5688  2.5505  85.7087     33.6056
rmd2      3.5313  2.5708  85.7184     33.3408

Almost no difference there, and probably within the error of measurement (10 trials).

With 8 threads instead, the figures are pretty much the same. That's where we previously saw the big difference: giving it many more threads than needed triggered a wall-clock slowdown as each thread then ran considerably slower (edit: although I cannot demonstrate that now either, other than very tiny changes).

I'm confident we're not reintroducing issues, so I'll probably go with the n_processing one and keep a belt-and-braces fudge to expand the queue size if we don't fit during close, just in case (it shouldn't trigger, but better safe than sorry).

jkbonfield and others added 2 commits July 29, 2021 11:46
There is a very rare deadlock in hts_tpool_process_flush.
The test harness step can trigger this:

    ./test_view -@4 -o seqs_per_slice=100 -o no_ref=1 -o multi_seq_per_slice=-1 -S -C multi_ref.tmp.sam

[This occurred for me about 1 in 750 times on one host.  Adding a sleep
in cram_io.c's cram_close function, just before the "if (fd->pool &&"
check, makes this trigger much more frequently.]

As an explanation: the tpool_worker threads have a check on

    && q->qsize - q->n_output > p->tsize - p->nwaiting

This was added in bd32ec6 and is designed to avoid CPU clock-scaling
issues when running many more threads than we can keep active.  To
avoid exhausting the input queue and alternating between many dormant
threads and active threads, we only take a job off the input queue if
we have more in the queue than the workers currently executing
something.

However, when flushing, we need to leave a bit more leeway.  It's
possible that we haven't yet consumed the results, so the output queue
is full, meaning the flush isn't something that we can call and just
wait on unless we have a separate thread that is able to drain the
queue.  (This is valid for SAM and BAM, but not CRAM where the
result consumer is also the same thread that calls the flush).

Hts_tpool_process_flush attempts to fix this, albeit with an ironic
comment of "Shouldn't be possible to get here, but just in case".  It
can, and it matters, but it's not sufficient.  The condition was:

    if (q->qsize < q->n_output + q->n_input + q->n_processing)

We can see from tpool_worker above however that it checks
"p->tsize - p->nwaiting", so it's not just a case of qsize vs n_output.
Adding "p->tsize" to our qsize check above avoids this potential
deadlock, but the main cause is elsewhere (spotted by Rob).

The primary issue is that the initial check in tpool_worker didn't
consider that the current worker thread isn't executing anything, so
"p->tsize - p->nwaiting - 1" is the correct assessment.  This does
also cure the deadlock, but @daviesrob suggested "q->n_processing"
(which can be anywhere from 0 to p->tsize - p->nwaiting - 1, hence also
fixing the same issue) is a better fix and also now matches the
hts_tpool_process_flush logic.

I've verified this with ~100,000 tests of the test_view command
above.  It's harder to state with certainty that it doesn't alter the
initial aim of bd32ec6 as more modern systems appear to be
considerably less affected by frequency scaling issues, but the
effect is still measurable, albeit now very slight (~5% wall-clock
time differences).  This change appears not to undo that improvement.

Co-authored-by: Rob Davies <rmd+git@sanger.ac.uk>
@daviesrob
Member

You accidentally included an htscodecs submodule update here. As it's unrelated to the bug-fix, I've split it out into its own commit.

@daviesrob daviesrob merged commit 43c127e into samtools:develop Jul 29, 2021