
Shutdown hang in ConsumerOffsetsMigrationTest test_migrating_consume_offsets failures=True.cpus=3 #5324

Closed
jcsp opened this issue Jul 4, 2022 · 4 comments · Fixed by #5730



jcsp commented Jul 4, 2022

There was a previous issue #4670 about a "failed to stop" in this test that was related to the failure injection, but the logs in this failure look like an authentic shutdown hang in redpanda -- we can see it getting the signal to stop, but remaining alive for some time. There is compaction going on at the time it is signaled to stop.

@bharathv

I did some initial log analysis; it looks like an issue in the cleanup of consensus state.

Sequence of events:

  • The shutdown request came in at 2022-07-04 07:02:24, but a single log was not shutting down; we can see that the log stays active with housekeeping activity beyond the test timeout of 60s, and it finally stops at 2022-07-04 07:04:41:
41076:INFO  2022-07-04 07:02:24,633 [shard 0] redpanda::main - application.cc:247 - Stopping...
.....

41160:TRACE 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:122 - waiting for {kafka_internal/group/6} compaction to finish
.....

41245:TRACE 2022-07-04 07:02:29,840 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313349840}, max_bytes:18446744073709551615, should_sanitize:false}
41247:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:290 - [{kafka/topic-khoyjdamnd/0}] time retention timestamp: {timestamp: 1656313349840}, first segment max timestamp: {timestamp: 1656918141945}
41248:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41251:TRACE 2022-07-04 07:02:40,168 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313360168}, max_bytes:18446744073709551615, should_sanitize:false}
....

41258:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}
41259:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:571 - [{kafka/topic-khoyjdamnd/0}] applying 'deletion' log cleanup policy with config: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}

.....

41318:DEBUG 2022-07-04 07:04:34,821 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41336:DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log {offsets:{start_offset:0, committed_offset:11949, committed_offset_term:1, dirty_offset:11949, dirty_offset_term:1, last_term_start_offset:0}, max_collectible_offset: {-9223372036854775808}, closed:0, config:{ntp: {kafka/topic-khoyjdamnd/0}, base_dir: /var/lib/redpanda/data, overrides: {compaction_strategy: {nullopt}, cleanup_policy_bitflags: {delete}, segment_size: {nullopt}, retention_bytes: {}, retention_time_ms: {}, recovery_enabled: false}, revision: 32, initial_revision: 32}, logs:{size: 1, [{offset_tracker:{term:1, base_offset:0, committed_offset:11949, dirty_offset:11949}, compacted_segment=0, finished_self_compaction=0, reader={/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.log, (199838 bytes)}, writer={no_of_chunks:64, closed:0, fallocation_offset:33554432, committed_offset:199838, bytes_flush_pending:0}, cache={cache_size=400}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.base_index, offsets:{0}, index:{header_bitflags:0, base_offset:{0}, max_offset:{11949}, base_timestamp:{timestamp: 1656918114419}, max_timestamp:{timestamp: 1656918141945}, index(6,6,6)}, step:32768, needs_persistence:1}}]}}
41337:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:122 - waiting for {kafka/topic-khoyjdamnd/0} compaction to finish
41338:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:124 - stopping {kafka/topic-khoyjdamnd/0} readers cache

It's just one ntp:

grep "closing log" redpanda.log  | cut -d '{' -f1
DEBUG 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:114 - closing log 
DEBUG 2022-07-04 07:02:25,257 [shard 2] storage - disk_log_impl.cc:114 - closing log 
DEBUG 2022-07-04 07:02:25,258 [shard 0] storage - disk_log_impl.cc:114 - closing log 
........... 
DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log  ntp ==> kafka/topic-khoyjdamnd/0

It took a lot longer for consensus to shut down (by then the leader had already moved on to a different node):

INFO  2022-07-04 07:02:25,255 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:226 - Stopping
DEBUG 2022-07-04 07:02:25,256 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 11949 aborted - shutting down
INFO  2022-07-04 07:02:26,811 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:526 - Node {id: {1}, revision: {32}} recovery cancelled (rpc::errc::client_request_timeout)
TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
WARN  2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
WARN  2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
INFO  2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}

I think the timestamps suggest that the delayed shutdown could be in one of the following two stop calls (event manager shutdown seems trivial to me):

ss::future<> consensus::stop() {
    vlog(_ctxlog.info, "Stopping");
    shutdown_input();
    for (auto& idx : _fstats) {
        idx.second.follower_state_change.broken();
    }
    return _event_manager.stop()
      .then([this] { return _append_requests_buffer.stop(); }) // <== suspect
      .then([this] { return _batcher.stop(); })                // <== suspect
      .then([this] { return _bg.close(); })
      .then([this] {

The following log line and its timestamp suggest that things were probably stuck in a flush, some interplay between the append_entries_buffer and consensus shutdown; exactly how is not obvious to me, though (a rough sketch of this kind of shutdown-ordering hazard follows the log line below).

INFO  2022-07-04 07:04:41,218 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] gate-closed, waiting to finish background requests
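
To make the suspected hazard concrete, here is a minimal, self-contained sketch in plain C++ (std::thread/std::condition_variable, not Redpanda or Seastar code; the type and member names below are made up for illustration) of how a stop() that waits for a request buffer to drain can hang when the loop that drains it has already been told to exit, so the wait is only released by some later, unrelated event. This is only an illustration of the ordering pattern, not a claim about the actual root cause here.

#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a request buffer that a background loop flushes.
struct request_buffer {
    std::mutex m;
    std::condition_variable cv;
    std::deque<int> pending;
    bool running = true;

    void enqueue(int r) {
        std::lock_guard<std::mutex> g(m);
        pending.push_back(r);
        cv.notify_all();
    }

    // Dispatch loop: flushes pending requests until told to stop.
    void dispatch_loop() {
        std::unique_lock<std::mutex> lk(m);
        while (running) {
            cv.wait(lk, [&] { return !running || !pending.empty(); });
            if (!running) {
                break;          // exits without flushing whatever is left
            }
            pending.clear();    // "flush"
        }
    }

    // Buggy stop(): tells the loop to exit, then waits for the buffer to
    // drain -- but nothing will drain it any more, so if a request raced in,
    // this wait only returns when something else clears the buffer.
    void stop() {
        std::unique_lock<std::mutex> lk(m);
        running = false;
        cv.notify_all();
        cv.wait(lk, [&] { return pending.empty(); });   // can hang here
    }
};

int main() {
    request_buffer buf;
    std::thread loop([&] { buf.dispatch_loop(); });

    buf.enqueue(1);                                 // flushed normally
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    buf.enqueue(2);                                 // may race with stop()
    std::thread stopper([&] { buf.stop(); });       // shutdown begins

    // Watchdog standing in for the unrelated event (here, the minutes of
    // housekeeping and the eventual leader step-down) that finally unblocks it.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    {
        std::lock_guard<std::mutex> g(buf.m);
        buf.pending.clear();
    }
    buf.cv.notify_all();

    stopper.join();
    loop.join();
    std::cout << "stopped\n";
}

Depending on timing, stop() either returns immediately (the loop flushed request 2 first) or sits in the final wait until the watchdog clears the buffer, which mirrors the two-minute gap between the "Stopping" line and the final close above.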

Finally it realizes it is not the leader and steps down; something seems to be messed up here, and it needs a closer look:

TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
INFO  2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}


andrwng commented Aug 17, 2022

Seen again here:

FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1 (1/24 runs)
failure at 2022-08-17T10:53:32.899Z: TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
in job https://buildkite.com/redpanda/redpanda/builds/14253#0182aaf0-b6fb-4499-a1df-8288ad33b894

Stack trace

====================================================================================================
test_id:    rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1
status:     FAIL
run time:   3 minutes 40.696 seconds


    TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_offsets_migration_test.py", line 164, in test_migrating_consume_offsets
    self.redpanda.restart_nodes(n, stop_timeout=60)
  File "/root/tests/rptest/services/redpanda.py", line 1544, in restart_nodes
    self.stop_node(node, timeout=stop_timeout)
  File "/root/tests/rptest/services/redpanda.py", line 1347, in stop_node
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda node docker-rp-5 failed to stop in 60 seconds


BenPope commented Oct 5, 2022
