Failure to become healthy after maintenance mode in UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback
#5713
Variation of this test here - https://buildkite.com/redpanda/redpanda/builds/13417#018259f6-efcd-472c-a0eb-365a1e8b0411/480-9162 (upgrade_after_rollback=True)
I've been looking through logs, reproducing it pretty consistently over the span of 20-30 test repeats, and it doesn't look like anything interesting is going on. Replication and recovery occur concurrently (which I believe is expected), and eventually the recovery finishes, albeit after 1+ minute(s). My current suspicion is that this isn't a Redpanda bug and is instead caused by the fact that all our released binaries (those installed before upgrading) are built in release mode, while our test binaries may be debug builds. Perhaps the leader and one follower are running release binaries and are being produced to, we then attempt to catch up a follower running debug binaries, and all the while the debug binaries are a step slower than the release binaries, so we end up staying underreplicated (recovering) for longer than expected. In the logs of one long-running test case, I see:
This looks to be the expected control path (at least as far as concurrent replicates and recovery are concerned), but we're just missing a check for whether recovery has completed within some time bound. The strongest reason I have for believing this is related to node slowness rather than a latent bug:
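To make the missing check concrete, a wait of that shape might look like the sketch below. This is a minimal illustration under assumptions: `recovering_replica_count` is a hypothetical callable standing in for however the test actually measures recovery, and only ducktape's `wait_until` is a real API here.

```python
from ducktape.utils.util import wait_until


def wait_for_recovery(recovering_replica_count, timeout_sec=120):
    """Poll until no replicas report as recovering.

    recovering_replica_count: a zero-argument callable (hypothetical,
    supplied by the caller) that returns the current number of
    recovering replicas across the cluster.
    """
    wait_until(
        lambda: recovering_replica_count() == 0,
        timeout_sec=timeout_sec,
        backoff_sec=2,
        err_msg="replicas still recovering after maintenance mode",
    )
```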
Also cc @mmaslankaprv: it turns out this isn't related to the joint consensus improvements. I removed those commits locally and still ran into this.
Tests that run workloads through a rolling restart are susceptible to flakiness when the node being restarted is known to be slower than the others. This is typically the case when running in mixed versions, where one node runs locally-built debug binaries while the others run downloaded release binaries. Upon returning from maintenance mode, we attempt to wait for the number of recovering replicas to drop to zero, but even the reduced workload we're running leaves the test flaky ~5% of the time when run locally. This commit significantly reduces the workload sent, while still ensuring progress is made through the rolling restart. I ran UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback 100 times with this reduced workload and saw no failures. Fixes redpanda-data#5713
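As a rough illustration of the fix's shape (produce less per restart while still asserting forward progress), here is a sketch. The `redpanda`/`producer`/`consumer` helpers and the message counts are assumptions for illustration, not the actual test code; only ducktape's `wait_until` is a real API here.

```python
from ducktape.utils.util import wait_until

# Hypothetical knobs for illustration only; the real test's values differ.
MSGS_PER_RESTART = 100   # a small batch: enough to force replication,
MSG_SIZE_BYTES = 128     # not enough to overwhelm a debug build


def rolling_restart_with_light_workload(redpanda, producer, consumer):
    """Restart brokers one at a time, producing a small batch around each
    restart and asserting the consumer keeps making progress."""
    for node in redpanda.nodes:
        consumed_before = consumer.total_consumed()  # assumed helper

        redpanda.restart_node(node)  # assumed service helper

        # Light workload through the restarted node's partitions.
        producer.produce(MSGS_PER_RESTART, MSG_SIZE_BYTES)  # assumed helper

        wait_until(
            lambda before=consumed_before: consumer.total_consumed() > before,
            timeout_sec=90,
            backoff_sec=2,
            err_msg="no progress made through rolling restart",
        )
```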
CI failure: https://ci-artifacts.dev.vectorized.cloud/redpanda/018241dc-b092-405f-a8b4-72cf76eacffd/vbuild/ducktape/results/2022-07-27--001/report.html
We wait 90 seconds for the cluster to be considered healthy according to the `RedpandaService` health status (which checks for underreplication), and the wait times out. The check itself isn't perfect since it's reading the metrics endpoint, but 90 seconds seems like a considerable amount of time for it to fail.
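For reference, the shape of that health check might look something like the sketch below. This is a minimal illustration, not the actual `RedpandaService` implementation, and the metric name is an assumption; only ducktape's `wait_until` and the `requests` library are real APIs here.

```python
import requests
from ducktape.utils.util import wait_until

# NOTE: this metric name is an assumption for illustration; check the actual
# name exposed by the broker's metrics endpoint before relying on it.
UNDER_REPLICATED_METRIC = "vectorized_cluster_partition_under_replicated_replicas"


def under_replicated_total(metrics_url):
    """Sum an under-replication gauge from a Prometheus-format metrics page."""
    total = 0.0
    for line in requests.get(metrics_url, timeout=5).text.splitlines():
        if line.startswith(UNDER_REPLICATED_METRIC):
            total += float(line.rsplit(" ", 1)[-1])
    return total


def wait_until_healthy(metrics_urls, timeout_sec=90):
    """Wait for every broker to report zero under-replicated replicas."""
    wait_until(
        lambda: all(under_replicated_total(u) == 0 for u in metrics_urls),
        timeout_sec=timeout_sec,
        backoff_sec=2,
        err_msg="cluster did not become healthy (still under-replicated)",
    )
```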