
Failure to become healthy after maintenance mode in UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback #5713

Closed
andrwng opened this issue Jul 28, 2022 · 3 comments · Fixed by #5859
andrwng commented Jul 28, 2022

CI failure: https://ci-artifacts.dev.vectorized.cloud/redpanda/018241dc-b092-405f-a8b4-72cf76eacffd/vbuild/ducktape/results/2022-07-27--001/report.html

Module: rptest.tests.upgrade_test
Class:  UpgradeWithWorkloadTest
Method: test_rolling_upgrade_with_rollback
Arguments:
{
  "upgrade_after_rollback": false
}

We time out after waiting 90 seconds for the cluster to be considered healthy according to the RedpandaService health status, which checks for under-replication. The check itself isn't perfect since it relies on the metrics endpoint, but 90 seconds seems like a considerable amount of time for it to keep failing.

[INFO  - 2022-07-28 00:32:19,477 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=False: Summary: TimeoutError('')
Traceback (most recent call last):                                                                   
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run   
    data = self.run_test()                                                                           
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)                                                     
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper        
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)                                
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped                                 
    r = f(self, *args, **kwargs)                                                                     
  File "/root/tests/rptest/tests/upgrade_test.py", line 158, in test_rolling_upgrade_with_rollback   
    self.redpanda.rolling_restart_nodes([first_node],                                                
  File "/root/tests/rptest/services/redpanda.py", line 1400, in rolling_restart_nodes                
    restarter.restart_nodes(nodes,                                                                   
  File "/root/tests/rptest/services/rolling_restarter.py", line 56, in restart_nodes                 
    wait_until(lambda: self.redpanda.healthy(),                                                      
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until      
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception                 
ducktape.errors.TimeoutError 
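
For reference, the wait that times out here (rolling_restarter.py line 56) is ducktape's standard polling helper wrapped around RedpandaService.healthy(). A minimal sketch of that pattern is below; the 90-second timeout and the error message are taken from the description above, and the exact behavior of healthy() (scraping the metrics endpoint for under-replicated partitions) is as described in this issue rather than copied from the source:

```python
from ducktape.utils.util import wait_until


def wait_for_healthy_cluster(redpanda, timeout_sec=90):
    # redpanda.healthy() is expected to return True only when the metrics
    # endpoint reports no under-replicated partitions; wait_until polls it
    # and raises ducktape's TimeoutError if it never flips to True.
    wait_until(lambda: redpanda.healthy(),
               timeout_sec=timeout_sec,
               backoff_sec=1,
               err_msg="Cluster failed to become healthy after restart")
```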
Lazin commented Aug 2, 2022

A variation of this test failed here as well - https://buildkite.com/redpanda/redpanda/builds/13417#018259f6-efcd-472c-a0eb-365a1e8b0411/480-9162 (upgrade_after_rollback=True)

andrwng commented Aug 2, 2022

I've been looking through logs, reproducing this pretty consistently over the span of 20-30 test repeats, and it doesn't look like anything interesting is going on. Replication and recovery occur concurrently (which I believe is expected), and eventually the recovery finishes, albeit after more than a minute.

My current suspicion is that this isn't a Redpanda bug but is instead caused by the fact that all our released binaries (those installed before upgrading) run in release mode, while our test binaries may be debug builds. Perhaps the leader and one follower are running release binaries and being produced to, we then attempt to catch up a follower running debug binaries, and because the debug binaries are a step slower than the release binaries, we end up staying under-replicated (recovering) for longer than expected. In the logs of one long-running test case, I see:

TRACE 2022-08-02 00:34:55,390 [shard 1] raft - [group_id:2, {kafka/topic/1}] consensus.cc:374 - Starting recovery process for {id: {1}, revision: {22}} - current reply: {node_id: {id: {1}, revision: {22}}, target_node_id{id: {3}, revision: {22}}, group: {2}, term:{1}, last_dirty_log_index:{13404}, last_flushed_log_index:{13405}, last_term_base_offset:{-9223372036854775808}, result: failure}
... periodically see logs about replicate_entries_stm ...
TRACE 2022-08-02 00:34:55,788 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{2}, commit_index:{13459}, term:{1}, prev_log_index:{13461}, prev_log_term:{1}}                                                                                                                                    
TRACE 2022-08-02 00:34:55,788 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:196 - Leader append result: {append_time:352492097, base_offset:{13462}, last_offset:{13462}, byte_size:75}                                                                                                                                                
TRACE 2022-08-02 00:34:55,788 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:248 - Skipping sending append request to {id: {1}, revision: {22}} - last sent offset: 13423, expected follower last offset: 13461
... ^ skipping replicating because recovery hasn't caught up the replica yet ...
TRACE 2022-08-02 00:34:55,809 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{2}, commit_index:{13461}, term:{1}, prev_log_index:{13463}, prev_log_term:{1}}                                                                                                                                    
TRACE 2022-08-02 00:34:55,809 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:196 - Leader append result: {append_time:352492117, base_offset:{13464}, last_offset:{13464}, byte_size:75}                                                                                                                                                
TRACE 2022-08-02 00:34:55,809 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{2}, commit_index:{13461}, term:{1}, prev_log_index:{13463}, prev_log_term:{1}} to {id: {1}, revision: {22}}
... ^ eventually we do send the entries successfully ...
TRACE 2022-08-02 00:36:01,818 [shard 1] raft - [group_id:2, {kafka/topic/1}] consensus.cc:263 - Append entries response: {node_id: {id: {1}, revision: {22}}, target_node_id{id: {3}, revision: {22}}, group: {2}, term:{1}, last_dirty_log_index:{20062}, last_flushed_log_index:{20062}, last_term_base_offset:{-9223372036854775808}, result: success}        
TRACE 2022-08-02 00:36:01,818 [shard 1] raft - [group_id:2, {kafka/topic/1}] consensus.cc:462 - Updated node {id: {1}, revision: {22}} match 20062 and next 20063 indices
TRACE 2022-08-02 00:36:01,818 [shard 1] raft - [follower: {id: {1}, revision: {22}}] [group_id:2, {kafka/topic/1}] - recovery_stm.cc:536 - Finished recovery

This looks to be the expected control path (at least as far as concurrent replicates and recovery are concerned); we just keep failing the check for whether recovery has completed for some time. It doesn't look like the recovery_stm and replicate_entries_stm interact with each other at all, and when one node is slow, it doesn't seem unreasonable that we never reach 100% recovered while there's an ongoing workload, since there's always a bit more to send over. Once we're able to use the replicate_entries_stm without skipping, I wonder if it makes sense to stop the recovery_stm somehow.

The strongest reasons I have for believing this is related to node slowness rather than a latent bug:

  • I'm unable to reproduce it when all binaries are running older released versions and instead of upgrading we just restart.
  • I'm unable to reproduce it when all binaries are running the current head debug version and instead of upgrading we just restart.
  • I am able to reproduce it when swapping in the current (built on dev) release version instead of starting out with older released versions, and "upgrading" one node to the current (built on dev) debug version.

Also cc @mmaslankaprv: it turns out this isn't related to the joint consensus improvements. I removed those commits locally and still ran into this.

ztlpn commented Aug 2, 2022

andrwng added a commit to andrwng/redpanda that referenced this issue Aug 5, 2022
Tests that run workloads through a rolling restart are susceptible to
flakiness when the node being restarted is known to be slower than the
others. This is typically the case when running mixed versions, where
one node runs locally-built debug binaries while the others run
downloaded release binaries. Upon returning from maintenance mode, we
wait for the number of recovering replicas to drop to zero, but even
the reduced workload we're running leaves the test flaky ~5% of the
time when run locally.

This commit significantly reduces the workload sent, while still
ensuring progress is made through the rolling restart. I ran
UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback 100 times
with this reduced workload and saw no failures.

Fixes redpanda-data#5713
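
For context on the approach the commit describes, the fix amounts to throttling the background produce rate so a slow (debug-build) follower can stay caught up while the restarted node recovers. The sketch below illustrates that idea only; the class, the send_fn callback, and the rate are illustrative assumptions, not the actual workload code changed by the referenced PR.

```python
import threading
import time


class ThrottledProducer(threading.Thread):
    """Background producer that sends at a deliberately low, fixed rate.

    send_fn is a placeholder for whatever client the test actually uses to
    produce records; the rate is an assumed value, chosen low enough that a
    debug-build follower can keep up during the rolling restart.
    """

    def __init__(self, send_fn, msgs_per_sec=10):
        super().__init__(daemon=True)
        self.send_fn = send_fn
        self.interval = 1.0 / msgs_per_sec
        self._stop = threading.Event()
        self.sent = 0

    def run(self):
        while not self._stop.is_set():
            self.send_fn(f"key-{self.sent}", f"value-{self.sent}")
            self.sent += 1
            time.sleep(self.interval)  # throttle: this is the "reduced workload"

    def stop(self):
        self._stop.set()
        self.join()
```

A test following this shape would still assert that self.sent has advanced across the restart, which preserves the "progress is still made" guarantee the commit message mentions while keeping recovery traffic small.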
This issue was closed.