
Shutdown hang in ConsumerOffsetsMigrationTest test_migrating_consume_offsets failures=True.cpus=3 #5324

Closed
jcsp opened this issue Jul 4, 2022 · 4 comments · Fixed by #5730



jcsp commented Jul 4, 2022

There was a previous issue #4670 about a "failed to stop" in this test that was related to the failure injection, but the logs in this failure look like an authentic shutdown hang in redpanda -- we can see it getting the signal to stop, but remaining alive for some time. There is compaction going on at the time it is signaled to stop.

@bharathv

I did some initial log analysis; it looks like an issue in the cleanup of consensus state.

Sequence of events:

  • The shutdown request came in at 2022-07-04 07:02:24, but a single log was not shutting down; we can see that the log stays active with housekeeping activity beyond the test timeout of 60s, and it finally stops at 2022-07-04 07:04:41:
41076:INFO  2022-07-04 07:02:24,633 [shard 0] redpanda::main - application.cc:247 - Stopping...
.....

41160:TRACE 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:122 - waiting for {kafka_internal/group/6} compaction to finish
.....

41245:TRACE 2022-07-04 07:02:29,840 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313349840}, max_bytes:18446744073709551615, should_sanitize:false}
41247:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:290 - [{kafka/topic-khoyjdamnd/0}] time retention timestamp: {timestamp: 1656313349840}, first segment max timestamp: {timestamp: 1656918141945}
41248:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41251:TRACE 2022-07-04 07:02:40,168 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313360168}, max_bytes:18446744073709551615, should_sanitize:false}
....

41258:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}
41259:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:571 - [{kafka/topic-khoyjdamnd/0}] applying 'deletion' log cleanup policy with config: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}

.....

41318:DEBUG 2022-07-04 07:04:34,821 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41336:DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log {offsets:{start_offset:0, committed_offset:11949, committed_offset_term:1, dirty_offset:11949, dirty_offset_term:1, last_term_start_offset:0}, max_collectible_offset: {-9223372036854775808}, closed:0, config:{ntp: {kafka/topic-khoyjdamnd/0}, base_dir: /var/lib/redpanda/data, overrides: {compaction_strategy: {nullopt}, cleanup_policy_bitflags: {delete}, segment_size: {nullopt}, retention_bytes: {}, retention_time_ms: {}, recovery_enabled: false}, revision: 32, initial_revision: 32}, logs:{size: 1, [{offset_tracker:{term:1, base_offset:0, committed_offset:11949, dirty_offset:11949}, compacted_segment=0, finished_self_compaction=0, reader={/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.log, (199838 bytes)}, writer={no_of_chunks:64, closed:0, fallocation_offset:33554432, committed_offset:199838, bytes_flush_pending:0}, cache={cache_size=400}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.base_index, offsets:{0}, index:{header_bitflags:0, base_offset:{0}, max_offset:{11949}, base_timestamp:{timestamp: 1656918114419}, max_timestamp:{timestamp: 1656918141945}, index(6,6,6)}, step:32768, needs_persistence:1}}]}}
41337:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:122 - waiting for {kafka/topic-khoyjdamnd/0} compaction to finish
41338:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:124 - stopping {kafka/topic-khoyjdamnd/0} readers cache

It's just one ntp:

grep "closing log" redpanda.log  | cut -d '{' -f1
DEBUG 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:114 - closing log 
DEBUG 2022-07-04 07:02:25,257 [shard 2] storage - disk_log_impl.cc:114 - closing log 
DEBUG 2022-07-04 07:02:25,258 [shard 0] storage - disk_log_impl.cc:114 - closing log 
........... 
DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log  ntp ==> kafka/topic-khoyjdamnd/0

It took a lot longer for consensus to shut down (by then the leader had already moved on to a different node):

INFO  2022-07-04 07:02:25,255 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:226 - Stopping
DEBUG 2022-07-04 07:02:25,256 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 11949 aborted - shutting down
INFO  2022-07-04 07:02:26,811 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:526 - Node {id: {1}, revision: {32}} recovery cancelled (rpc::errc::client_request_timeout)
TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
WARN  2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
WARN  2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
INFO  2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}

I think the timestamps suggest that the delayed shutdown could be in one of the following two stop calls (event manager shutdown seems trivial to me):

ss::future<> consensus::stop() {
    vlog(_ctxlog.info, "Stopping");
    shutdown_input();
    for (auto& idx : _fstats) {
        idx.second.follower_state_change.broken();
    }
    return _event_manager.stop()
      .then([this] { return _append_requests_buffer.stop(); }) // <== suspect
      .then([this] { return _batcher.stop(); })                // <== suspect
      .then([this] { return _bg.close(); })
      .then([this] {

The following log line and its timestamp suggest that things were probably stuck in a flush, some interplay between the append_entries_buffer and consensus shutdown; exactly how is not obvious to me, though (a rough sketch of this kind of shutdown-ordering hazard follows the log line below).

INFO  2022-07-04 07:04:41,218 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] gate-closed, waiting to finish background requests
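
To make the suspected hazard concrete, here is a minimal, self-contained sketch in plain C++ (std::thread/std::condition_variable, not Redpanda or Seastar code; the type and member names below are made up for illustration) of how a stop() that waits for a request buffer to drain can hang when the loop that drains it has already been told to exit, so the wait is only released by some later, unrelated event. This is only an illustration of the ordering pattern, not a claim about the actual root cause here.

#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a request buffer that a background loop flushes.
struct request_buffer {
    std::mutex m;
    std::condition_variable cv;
    std::deque<int> pending;
    bool running = true;

    void enqueue(int r) {
        std::lock_guard<std::mutex> g(m);
        pending.push_back(r);
        cv.notify_all();
    }

    // Dispatch loop: flushes pending requests until told to stop.
    void dispatch_loop() {
        std::unique_lock<std::mutex> lk(m);
        while (running) {
            cv.wait(lk, [&] { return !running || !pending.empty(); });
            if (!running) {
                break;          // exits without flushing whatever is left
            }
            pending.clear();    // "flush"
        }
    }

    // Buggy stop(): tells the loop to exit, then waits for the buffer to
    // drain -- but nothing will drain it any more, so if a request raced in,
    // this wait only returns when something else clears the buffer.
    void stop() {
        std::unique_lock<std::mutex> lk(m);
        running = false;
        cv.notify_all();
        cv.wait(lk, [&] { return pending.empty(); });   // can hang here
    }
};

int main() {
    request_buffer buf;
    std::thread loop([&] { buf.dispatch_loop(); });

    buf.enqueue(1);                                 // flushed normally
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    buf.enqueue(2);                                 // may race with stop()
    std::thread stopper([&] { buf.stop(); });       // shutdown begins

    // Watchdog standing in for the unrelated event (here, the minutes of
    // housekeeping and the eventual leader step-down) that finally unblocks it.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    {
        std::lock_guard<std::mutex> g(buf.m);
        buf.pending.clear();
    }
    buf.cv.notify_all();

    stopper.join();
    loop.join();
    std::cout << "stopped\n";
}

Depending on timing, stop() either returns immediately (the loop flushed request 2 first) or sits in the final wait until the watchdog clears the buffer, which mirrors the two-minute gap between the "Stopping" line and the final close above.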

Finally it realizes it is not the leader and steps down; something seems to be messed up here, and it needs a closer look:

TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
INFO  2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}


andrwng commented Aug 17, 2022

Seen again here:

FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1 (1/24 runs)
failure at 2022-08-17T10:53:32.899Z: TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
in job https://buildkite.com/redpanda/redpanda/builds/14253#0182aaf0-b6fb-4499-a1df-8288ad33b894

Stack trace

====================================================================================================
test_id:    rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1
status:     FAIL
run time:   3 minutes 40.696 seconds


    TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_offsets_migration_test.py", line 164, in test_migrating_consume_offsets
    self.redpanda.restart_nodes(n, stop_timeout=60)
  File "/root/tests/rptest/services/redpanda.py", line 1544, in restart_nodes
    self.stop_node(node, timeout=stop_timeout)
  File "/root/tests/rptest/services/redpanda.py", line 1347, in stop_node
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda node docker-rp-5 failed to stop in 60 seconds


BenPope commented Oct 5, 2022
