
Fix sarama consumer deadlock #2587

Merged
merged 3 commits into jaegertracing:master from fix-flaky-sarama-chan on Nov 2, 2020

Conversation

albertteoh
Contributor

@albertteoh albertteoh commented Oct 25, 2020

Signed-off-by: albertteoh <albert.teoh@logz.io>

Which problem is this PR solving?

Short description of the changes

Prevent a deadlock between the handleErrors() and Close() functions, which both try to acquire the partitionLock and access the per-partition WaitGroup, by:

  • Removing per-partition WaitGroups
  • Removing the need for acquiring the partitionLock within Close()

Additional improvements:

  • More correct ordering of close calls: ensure the underlying sarama consumer is closed first to prevent repartitioning, which could otherwise create new partition consumers that attempt to write to already closed channels.
  • Instead, Close() relies on Consumer-scoped WaitGroups to determine graceful closure of the message and error handlers. These WaitGroups are managed through goroutines monitoring the messagesDone and errorsDone channels, which the respective handler functions write to (see the sketch after this list).
  • Remove the redundant loop to close partition consumers which is already performed by closing the aforementioned underlying sarama consumer.
  • Additional diagnostic logging, mostly at debug level, to help pick up race-related errors.
  • Sarama tests use the New Consumer constructor to DRY up duplicated construction logic.
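For illustration, here is a minimal sketch of the close ordering and Consumer-scoped wait described above. It is not the actual Jaeger code; type and field names such as partitionSource, partitionConsumer, and doneWg are illustrative only.

package consumer

import (
    "sync"

    "github.com/Shopify/sarama"
    "go.uber.org/zap"
)

// partitionConsumer stands in for Jaeger's partition-consumer wrapper; it adds
// Partition() on top of the plain sarama interface for illustration only.
type partitionConsumer interface {
    sarama.PartitionConsumer
    Partition() int32
}

// partitionSource stands in for the underlying sarama(-cluster) consumer.
type partitionSource interface {
    Partitions() <-chan partitionConsumer
    Close() error
}

type Consumer struct {
    internalConsumer partitionSource
    doneWg           sync.WaitGroup // Consumer-scoped; replaces the per-partition WaitGroups
    logger           *zap.Logger
}

func (c *Consumer) Start() {
    go func() {
        for pc := range c.internalConsumer.Partitions() {
            c.doneWg.Add(2) // one for the message handler, one for the error handler
            go c.handleMessages(pc)
            go c.handleErrors(pc.Partition(), pc.Errors())
        }
    }()
}

// Close stops the source first, then waits for the handlers to drain.
// No partition lock is acquired here, which removes the deadlock with handleErrors().
func (c *Consumer) Close() error {
    // Closing the underlying consumer closes Partitions() and the per-partition
    // Messages()/Errors() channels, so no new partition consumers can appear
    // and the handler loops below terminate.
    err := c.internalConsumer.Close()
    c.doneWg.Wait()
    return err
}

func (c *Consumer) handleMessages(pc partitionConsumer) {
    defer c.doneWg.Done()
    for msg := range pc.Messages() {
        c.logger.Debug("Got message", zap.Int32("partition", msg.Partition))
    }
}

func (c *Consumer) handleErrors(partition int32, errChan <-chan *sarama.ConsumerError) {
    defer c.doneWg.Done()
    for err := range errChan {
        c.logger.Error("Error consuming from Kafka", zap.Error(err), zap.Int32("partition", partition))
    }
}

Closing the source before waiting is what guarantees the handlers' range loops terminate, so the WaitGroup cannot block forever.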

Testing

  • Local runs of the Sarama unit tests pass without failures.
  • Confirm this failure no longer occurs in Travis branch builds (in a fork) of the Sarama tests.
  • Local integration tests with Kafka, Cassandra, collector, agent, query and mock trace generator running:
    • Start ingester binary ./cmd/ingester/ingester-darwin-amd64 --log-level debug:
      • Confirm successful startup.
      • Confirm span data consumed and showing correctly in Jaeger UI.
      • Confirm metrics look fine: rate(jaeger_kafka_spans_written_total[5m]) and rate(jaeger_ingester_sarama_consumer_messages_total[5m]) more or less line up.
    • Manually send SIGINT with Ctrl+C to trigger shutdown:
      • Confirm shutdown completes without hanging.
      • Confirm no Kafka consumer related error messages or panics present.
      • Test under two conditions: when there is 0 lag, and when the consumer is lagging behind (terminating while busy consuming messages to catch potential race conditions).
    • Force a race between Consumer start and shutdown to test for potential race conditions:
      • Command: ./cmd/ingester/ingester-darwin-amd64 --log-level debug & pid1="$!"; sleep 0.1; kill $pid1
      • Confirm shutdown attempted before consumer starts.
      • Confirm graceful shutdown completes successfully without error.
      • Output:
{"level":"debug","ts":1603594054.544082,"caller":"consumer/deadlock_detector.go:149","msg":"Global deadlock detector disabled"}
{"level":"info","ts":1603594054.5441122,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1603594054.544123,"caller":"flags/service.go:154","msg":"Shutting down"}
{"level":"info","ts":1603594054.544128,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"unavailable"}
{"level":"info","ts":1603594054.544121,"caller":"consumer/consumer.go:85","msg":"Starting main loop"}
{"level":"debug","ts":1603594054.544126,"caller":"consumer/consumer.go:107","msg":"Waiting for done signals"}
{"level":"info","ts":1603594054.5441332,"caller":"consumer/consumer.go:118","msg":"Closing parent consumer"}
{"level":"debug","ts":1603594054.544141,"caller":"consumer/consumer.go:107","msg":"Waiting for done signals"}
{"level":"info","ts":1603594054.55716,"caller":"consumer/consumer.go:196","msg":"Starting error handler","partition":0}
{"level":"info","ts":1603594054.557175,"caller":"consumer/consumer.go:138","msg":"Starting message handler","partition":0}
{"level":"debug","ts":1603594054.561001,"caller":"consumer/consumer.go:121","msg":"Closing deadlock detector"}
{"level":"info","ts":1603594054.7042592,"caller":"consumer/consumer.go:206","msg":"Finished handling errors","partition":0}
{"level":"debug","ts":1603594054.704276,"caller":"consumer/consumer.go:124","msg":"Waiting for messages and errors to be handled"}
{"level":"debug","ts":1603594054.704284,"caller":"consumer/consumer.go:109","msg":"Received done signal","msg":"HandleErrors done"}
{"level":"debug","ts":1603594054.704291,"caller":"consumer/deadlock_detector.go:100","msg":"Partition deadlock detector disabled"}
{"level":"info","ts":1603594054.704301,"caller":"consumer/consumer.go:163","msg":"Message channel closed. ","partition":0}
{"level":"info","ts":1603594054.704475,"caller":"consumer/consumer.go:187","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1603594054.704489,"caller":"consumer/consumer.go:190","msg":"Closed partition consumer","partition":0}
{"level":"debug","ts":1603594054.704499,"caller":"consumer/consumer.go:109","msg":"Received done signal","msg":"HandleMessages done"}
{"level":"debug","ts":1603594054.704504,"caller":"consumer/consumer.go:128","msg":"Closing message and error done channels"}
{"level":"info","ts":1603594055.207753,"caller":"flags/service.go:162","msg":"Shutdown complete"}

@albertteoh albertteoh requested a review from a team as a code owner October 25, 2020 05:54
@codecov

codecov bot commented Oct 25, 2020

Codecov Report

Merging #2587 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #2587   +/-   ##
=======================================
  Coverage   95.06%   95.07%           
=======================================
  Files         209      209           
  Lines        9370     9364    -6     
=======================================
- Hits         8908     8903    -5     
+ Misses        387      385    -2     
- Partials       75       76    +1     
Impacted Files                                         | Coverage Δ
cmd/ingester/app/consumer/consumer.go                  | 100.00% <100.00%> (ø)
cmd/query/app/static_handler.go                        | 83.03% <0.00%> (-1.79%) ⬇️
plugin/storage/integration/integration.go              | 77.90% <0.00%> (+0.55%) ⬆️
...lugin/sampling/strategystore/adaptive/processor.go  | 100.00% <0.00%> (+0.92%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81761cd...b612f2c.

Member

@yurishkuro yurishkuro left a comment


I am concerned that this makes the logic even more complex by introducing additional signaling. My suggestion was to instead simplify it by reducing the # of goroutines and handling more events in a single select{}, i.e. messages, errors, deadlock events. The error handling is especially trivial (just a counter bump). Then perhaps with this refactoring the WaitGroup won't be necessary at all.

@albertteoh
Contributor Author

albertteoh commented Oct 25, 2020

I am concerned that this makes the logic even more complex by introducing additional signaling. My suggestion was to instead simplify it by reducing the # of goroutines and handling more events in a single select{}, i.e. messages, errors, deadlock events. The error handling is especially trivial (just a counter bump). Then perhaps with this refactoring the WaitGroup won't be necessary at all.

Yeah, good call, agree it's more complex than necessary.

I think we can do away completely with additional goroutines and channels/signalling that this PR introduced, though I think a waitgroup (at consumer scope) is still necessary as we still want some mechanism to block Close() until messages and errors have been flushed/handled. So in this case, as you pointed out, the errors handling would bump the waitgroup counter.

If we use a single select, IIUC, would it not require introducing new channels to listen on messages, error events? Or maybe I've misunderstood something.
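
For reference, a hypothetical sketch of the single-select shape being discussed, building on the Consumer sketch in the description above; processSpan, errCounter, and deadlockTicker are made-up names, not the actual Jaeger code:

// consumePartition handles messages, errors, and deadlock ticks for one
// partition in a single select loop, so no separate error-handler goroutine
// or extra signalling channels are needed.
func (c *Consumer) consumePartition(pc partitionConsumer) {
    defer c.doneWg.Done()
    for {
        select {
        case msg, ok := <-pc.Messages():
            if !ok {
                return // channel closed by internalConsumer.Close()
            }
            c.processSpan(msg) // hypothetical: hand the message to the span processor
        case err, ok := <-pc.Errors():
            if !ok {
                return
            }
            c.errCounter.Inc(1) // "just a counter bump"; hypothetical metric
            c.logger.Error("Error consuming from Kafka", zap.Error(err))
        case <-c.deadlockTicker.C: // hypothetical ticker from the deadlock detector
            c.logger.Warn("No progress on partition", zap.Int32("partition", pc.Partition()))
            return
        }
    }
}

With this shape, each partition needs only one goroutine, and Close() still just closes the underlying consumer and waits on the Consumer-scoped WaitGroup.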

albertteoh added 2 commits October 26, 2020 09:07
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
@@ -78,17 +79,11 @@ func (c *Consumer) Start() {
c.logger.Info("Starting main loop")
for pc := range c.internalConsumer.Partitions() {
c.partitionMapLock.Lock()
if p, ok := c.partitionIDToState[pc.Partition()]; ok {
Member


Are you sure this is safe to remove? From the comment it seems like a case where a partition gets re-assigned to the same service instance, so there's potentially another partition consumer for this partition ID already running (or, more likely, shutting down).

@vprithvi do you have any more color to add to this condition?

Contributor


We noticed the case that you described a couple of times, but weren't able to reproduce it in tests.
Since then, many things have changed with both Kafka and Sarama. I think we can remove this now in the interest of simplification and revisit if we see problems.

Member


How did it manifest when it happened? Committed offsets going backwards?

Contributor Author


IIUC, this happens when new messages arrive on the partition (because the Messages() channel isn't closed yet) after the partition consumer was closed (via the c.closePartition(p.partitionConsumer) call), causing a new partition consumer to be created.

This PR proposes to solve this by calling c.internalConsumer.Close() first within Close() which closes the Messages() channel, preventing new messages from arriving, and then closes the Partitions() channel, preventing new PartitionConsumers from being created.

func (c *Consumer) handleMessages(pc sc.PartitionConsumer) {
c.logger.Info("Starting message handler", zap.Int32("partition", pc.Partition()))
c.partitionMapLock.Lock()
c.partitionsHeld++
c.partitionsHeldGauge.Update(c.partitionsHeld)
wg := &c.partitionIDToState[pc.Partition()].wg
c.partitionMapLock.Unlock()
defer func() {
c.closePartition(pc)
Contributor


Is this closePartition still necessary provided internalConsumer.Close() closes all partitions?

Contributor Author


Yup, I think you're right, @vprithvi. 👍

However, closePartition() also increments a metrics counter, which is used in tests to detect whether the partition is closed; neither the Consumer nor the PartitionConsumer interface appears to offer an "OnClosed" callback or channel to signal successful closure of a partition.

We could instead replace c.closePartition(pc) with c.partitionMetrics(pc.Partition()).closeCounter.Inc(1) to bump the metric and remove the closePartition() function, if that sounds reasonable to you?
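
For illustration only, the deferred cleanup in handleMessages might then look roughly like this (a sketch, assuming the existing partitionMetrics helper and that internalConsumer.Close() performs the actual closing):

defer func() {
    // Bump the close metric that tests rely on; the partition consumer itself
    // is closed by internalConsumer.Close(), so no explicit closePartition() call.
    c.partitionMetrics(pc.Partition()).closeCounter.Inc(1)
}()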

yurishkuro
yurishkuro previously approved these changes Nov 2, 2020
@mergify mergify bot dismissed yurishkuro’s stale review November 2, 2020 03:22

Pull request has been modified.

@yurishkuro yurishkuro merged commit 2a0d8cc into jaegertracing:master Nov 2, 2020
@yurishkuro
Member

Thanks!

@albertteoh albertteoh deleted the fix-flaky-sarama-chan branch November 2, 2020 04:07
@albertteoh
Contributor Author

Thanks for the review, @yurishkuro & @vprithvi!


Successfully merging this pull request may close these issues.

Flaky test: TestSaramaConsumerWrapper_start_Messages