
Failure to become healthy after maintenance mode in UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback #5713

Closed
andrwng opened this issue Jul 28, 2022 · 3 comments · Fixed by #5859
andrwng commented Jul 28, 2022

CI failure: https://ci-artifacts.dev.vectorized.cloud/redpanda/018241dc-b092-405f-a8b4-72cf76eacffd/vbuild/ducktape/results/2022-07-27--001/report.html

Module: rptest.tests.upgrade_test
Class:  UpgradeWithWorkloadTest
Method: test_rolling_upgrade_with_rollback
Arguments:
{
  "upgrade_after_rollback": false
}

We time out after waiting 90 seconds for the cluster to be considered healthy according to the RedpandaService health status, which checks for under-replication. The check itself isn't perfect since it relies on the metrics endpoint, but 90 seconds seems like a considerable amount of time for it to keep failing.

[INFO  - 2022-07-28 00:32:19,477 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=False: Summary: TimeoutError('')
Traceback (most recent call last):                                                                   
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run   
    data = self.run_test()                                                                           
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)                                                     
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper        
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)                                
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped                                 
    r = f(self, *args, **kwargs)                                                                     
  File "/root/tests/rptest/tests/upgrade_test.py", line 158, in test_rolling_upgrade_with_rollback   
    self.redpanda.rolling_restart_nodes([first_node],                                                
  File "/root/tests/rptest/services/redpanda.py", line 1400, in rolling_restart_nodes                
    restarter.restart_nodes(nodes,                                                                   
  File "/root/tests/rptest/services/rolling_restarter.py", line 56, in restart_nodes                 
    wait_until(lambda: self.redpanda.healthy(),                                                      
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until      
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception                 
ducktape.errors.TimeoutError 
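
For reference, the wait that times out here (rolling_restarter.py line 56) is ducktape's standard polling helper wrapped around RedpandaService.healthy(). A minimal sketch of that pattern is below; the 90-second timeout and the error message are taken from the description above, and the exact behavior of healthy() (scraping the metrics endpoint for under-replicated partitions) is as described in this issue rather than copied from the source:

```python
from ducktape.utils.util import wait_until


def wait_for_healthy_cluster(redpanda, timeout_sec=90):
    # redpanda.healthy() is expected to return True only when the metrics
    # endpoint reports no under-replicated partitions; wait_until polls it
    # and raises ducktape's TimeoutError if it never flips to True.
    wait_until(lambda: redpanda.healthy(),
               timeout_sec=timeout_sec,
               backoff_sec=1,
               err_msg="Cluster failed to become healthy after restart")
```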
Lazin commented Aug 2, 2022

A variation of this test failed here as well - https://buildkite.com/redpanda/redpanda/builds/13417#018259f6-efcd-472c-a0eb-365a1e8b0411/480-9162 (upgrade_after_rollback=True)

andrwng commented Aug 2, 2022

I've been looking through logs, reproducing this pretty consistently over the span of 20-30 test repeats, and it doesn't look like anything interesting is going on. Replication and recovery occur concurrently (which I believe is expected), and eventually the recovery finishes, albeit after more than a minute.

My current suspicion is that this isn't a Redpanda bug but is instead caused by the fact that all our released binaries (those installed before upgrading) run in release mode, while our test binaries may be debug builds. Perhaps the leader and one follower are running release binaries and being produced to, we then attempt to catch up a follower running debug binaries, and because the debug binaries are a step slower than the release binaries, we end up staying under-replicated (recovering) for longer than expected. In the logs of one long-running test case, I see:

TRACE 2022-08-02 00:34:55,390 [shard 1] raft - [group_id:2, {kafka/topic/1}] consensus.cc:374 - Starting recovery process for {id: {1}, revision: {22}} - current reply: {node_id: {id: {1}, revision: {22}}, target_node_id{id: {3}, revision: {22}}, group: {2}, term:{1}, last_dirty_log_index:{13404}, last_flushed_log_index:{13405}, last_term_base_offset:{-9223372036854775808}, result: failure}
... periodically see logs about replicate_entries_stm ...
TRACE 2022-08-02 00:34:55,788 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{2}, commit_index:{13459}, term:{1}, prev_log_index:{13461}, prev_log_term:{1}}                                                                                                                                    
TRACE 2022-08-02 00:34:55,788 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:196 - Leader append result: {append_time:352492097, base_offset:{13462}, last_offset:{13462}, byte_size:75}                                                                                                                                                
TRACE 2022-08-02 00:34:55,788 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:248 - Skipping sending append request to {id: {1}, revision: {22}} - last sent offset: 13423, expected follower last offset: 13461
... ^ skipping replicating because recovery hasn't caught up the replica yet ...
TRACE 2022-08-02 00:34:55,809 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{2}, commit_index:{13461}, term:{1}, prev_log_index:{13463}, prev_log_term:{1}}                                                                                                                                    
TRACE 2022-08-02 00:34:55,809 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:196 - Leader append result: {append_time:352492117, base_offset:{13464}, last_offset:{13464}, byte_size:75}                                                                                                                                                
TRACE 2022-08-02 00:34:55,809 [shard 1] raft - [group_id:2, {kafka/topic/1}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{2}, commit_index:{13461}, term:{1}, prev_log_index:{13463}, prev_log_term:{1}} to {id: {1}, revision: {22}}
... ^ eventually we do send the entries successfully ...
TRACE 2022-08-02 00:36:01,818 [shard 1] raft - [group_id:2, {kafka/topic/1}] consensus.cc:263 - Append entries response: {node_id: {id: {1}, revision: {22}}, target_node_id{id: {3}, revision: {22}}, group: {2}, term:{1}, last_dirty_log_index:{20062}, last_flushed_log_index:{20062}, last_term_base_offset:{-9223372036854775808}, result: success}        
TRACE 2022-08-02 00:36:01,818 [shard 1] raft - [group_id:2, {kafka/topic/1}] consensus.cc:462 - Updated node {id: {1}, revision: {22}} match 20062 and next 20063 indices
TRACE 2022-08-02 00:36:01,818 [shard 1] raft - [follower: {id: {1}, revision: {22}}] [group_id:2, {kafka/topic/1}] - recovery_stm.cc:536 - Finished recovery

This looks to be the expected control path (at least as far as concurrent replicates and recovery are concerned); we just keep failing the check for whether recovery has completed for some time. It doesn't look like the recovery_stm and replicate_entries_stm interact with each other at all, and when one node is slow, it doesn't seem unreasonable that we never reach 100% recovered while there's an ongoing workload, since there's always a bit more to send over. Once we're able to use the replicate_entries_stm without skipping, I wonder if it makes sense to stop the recovery_stm somehow.

The strongest reasons I have for believing this is related to node slowness rather than a latent bug:

  • I'm unable to reproduce it when all binaries are running older released versions and instead of upgrading we just restart.
  • I'm unable to reproduce it when all binaries are running the current head debug version and instead of upgrading we just restart.
  • I am able to reproduce it when swapping in the current (built on dev) release version instead of starting out with older released versions, and "upgrading" one node to the current (built on dev) debug version.

Also cc @mmaslankaprv: it turns out this isn't related to the joint consensus improvements. I removed those commits locally and still ran into this.

ztlpn commented Aug 2, 2022

andrwng added a commit to andrwng/redpanda that referenced this issue Aug 5, 2022
Tests that run workloads through a rolling restart are susceptible to
flakiness when the node being restarted is known to be slower than the
others. This is typically the case when running mixed versions, where
one node runs locally-built debug binaries while the others run
downloaded release binaries. Upon returning from maintenance mode, we
wait for the number of recovering replicas to drop to zero, but even
the reduced workload we're running leaves the test flaky ~5% of the
time when run locally.

This commit significantly reduces the workload sent, while still
ensuring progress is made through the rolling restart. I ran
UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback 100 times
with this reduced workload and saw no failures.

Fixes redpanda-data#5713
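
For context on the approach the commit describes, the fix amounts to throttling the background produce rate so a slow (debug-build) follower can stay caught up while the restarted node recovers. The sketch below illustrates that idea only; the class, the send_fn callback, and the rate are illustrative assumptions, not the actual workload code changed by the referenced PR.

```python
import threading
import time


class ThrottledProducer(threading.Thread):
    """Background producer that sends at a deliberately low, fixed rate.

    send_fn is a placeholder for whatever client the test actually uses to
    produce records; the rate is an assumed value, chosen low enough that a
    debug-build follower can keep up during the rolling restart.
    """

    def __init__(self, send_fn, msgs_per_sec=10):
        super().__init__(daemon=True)
        self.send_fn = send_fn
        self.interval = 1.0 / msgs_per_sec
        self._stop = threading.Event()
        self.sent = 0

    def run(self):
        while not self._stop.is_set():
            self.send_fn(f"key-{self.sent}", f"value-{self.sent}")
            self.sent += 1
            time.sleep(self.interval)  # throttle: this is the "reduced workload"

    def stop(self):
        self._stop.set()
        self.join()
```

A test following this shape would still assert that self.sent has advanced across the restart, which preserves the "progress is still made" guarantee the commit message mentions while keeping recovery traffic small.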
This issue was closed.