I have a test in progress that does the following:
1. Perform a partial upgrade of node1.
2. Run the workload between the upgraded node1 and the non-upgraded node2.
3. Run the workload in the opposite direction.
4. Perform a partial rollback of node2 and repeat.
5. Proceed with a full upgrade and repeat.
Our workload in this case stops a node, drives traffic until segments are deleted, waits for the down node to be deemed under-replicated, starts the node back up, and waits for the log to recover (sketched below).
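For reference, the workload has roughly the following ducktape-style shape. `wait_until`, `stop_node`, and `start_node` are real ducktape/rptest calls, and `fully_replicated` appears in the traceback below; the other predicates, helpers, and timeouts are illustrative assumptions, not the exact rptest code:

```python
# Rough sketch of the recovery workload described above; hypothetical
# names are marked as such in the comments.
from ducktape.utils.util import wait_until

def run_recovery(self, acks, dst_node):
    # Take the destination node down so it falls behind the leader.
    self.redpanda.stop_node(dst_node)

    # Produce with the given acks setting until prefix truncation has
    # deleted segments the down node would otherwise catch up from.
    self.start_producer(acks=acks)                  # hypothetical helper
    wait_until(lambda: self.segments_deleted(),     # hypothetical predicate
               timeout_sec=120, backoff_sec=2)

    # Wait for the cluster to report the down node as under-replicated.
    wait_until(lambda: self.underreplicated(dst_node),  # hypothetical predicate
               timeout_sec=60, backoff_sec=2)

    # Bring the node back and wait for its log to fully recover; this is
    # the wait that times out in the failure below.
    self.redpanda.start_node(dst_node)
    wait_until(lambda: self.fully_replicated(self.redpanda.nodes),
               timeout_sec=90, backoff_sec=2)
```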
In this case, at step 4 after rolling back node2, the test never quiesces because the communication between node1 and node2 is botched:
On node1:
WARN 2022-07-19 05:37:41,835 [shard 0] cluster - health_monitor_backend.cc:576 - unable to get node health report from 2 - rpc::errc::exponential_backoff, marking node as down
INFO 2022-07-19 05:37:43,773 [shard 1] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2
INFO 2022-07-19 05:37:43,785 [shard 0] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2
WARN 2022-07-19 05:37:43,822 [shard 1] rpc - transport.h:177 - Protocol violation: request version rpc::transport_version::v2 incompatible with reply version rpc::transport_version::v0
INFO 2022-07-19 05:37:43,822 [shard 1] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:2700947381, got:299583078 - {version:40, header_checksum:2700947381, compression:26, payload_size:603981824, meta:13631489, correlation_id:33554688, payload_checksum:11168932305873935872}
WARN 2022-07-19 05:37:43,827 [shard 0] rpc - transport.h:177 - Protocol violation: request version rpc::transport_version::v2 incompatible with reply version rpc::transport_version::v0
INFO 2022-07-19 05:37:43,827 [shard 0] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:2700947381, got:3660059771 - {version:40, header_checksum:2700947381, compression:18, payload_size:3690989568, meta:8912896, correlation_id:33554688, payload_checksum:608830381705140224}
INFO 2022-07-19 05:37:43,978 [shard 1] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2
INFO 2022-07-19 05:37:43,982 [shard 0] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2
... repeats until test times out waiting to recover
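The pairing of the protocol violation with a header-checksum mismatch is consistent with the two sides framing RPC messages differently after the rollback: if a reader decodes a reply using the field offsets of a different transport version, every field past the divergence point, including the checksum and the implausible payload_size values above, comes out as garbage. Here is a toy illustration of that failure class in Python; this is not Redpanda's actual wire format:

```python
# Toy illustration (NOT Redpanda's wire format) of how a framing/version
# mismatch surfaces as checksum failures and nonsense field values.
import struct
import zlib

def encode_v0(payload: bytes) -> bytes:
    # Pretend v0 header: | payload_size: u32 | payload_crc: u32 |
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def decode_v2(buf: bytes) -> bytes:
    # Pretend v2 header: | version: u8 | payload_size: u32 | payload_crc: u32 |
    # One extra leading byte, so every later field is read one byte off.
    version, size, crc = struct.unpack_from("<BII", buf)
    print(f"version={version} payload_size={size} crc={crc}")  # all garbage
    payload = buf[9:9 + size]
    if zlib.crc32(payload) != crc:
        raise ValueError(f"mismatching checksums: expected {crc}, "
                         f"got {zlib.crc32(payload)}")
    return payload

decode_v2(encode_v0(b"hello"))  # raises: size/crc were read from shifted bytes
```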
On node2:
INFO 2022-07-19 05:37:43,765 [shard 1] cluster - controller_backend.cc:508 - partition operation {type: addition, ntp: {kafka/topic-vfjzkjerwk/0}, offset: 16, new_assignment: { id: 0, group_id: 1, replicas: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}, {node_id: 1, shard: 1}} }, previous_assignment: {nullopt}} finished
INFO 2022-07-19 05:37:43,765 [shard 0] cluster - feature_manager.cc:74 - Starting...
INFO 2022-07-19 05:37:43,766 [shard 0] redpanda::main - application.cc:1230 - Started RPC server listening at {host: docker-rp-2, port: 33145}
WARN 2022-07-19 05:37:43,766 [shard 0] admin_api_server - admin_server.cc:336 - Insecure Admin API listener on 192.168.64.9:9647, consider enabling `admin_api_require_auth`
INFO 2022-07-19 05:37:43,766 [shard 0] admin_api_server - admin_server.cc:111 - Started HTTP admin service listening at {{:{host: 127.0.0.1, port: 9644}}, {:{host: docker-rp-2, port: 9644}}, {iplistener:{host: 192.168.64.9, port: 9647}}}
WARN 2022-07-19 05:37:43,766 [shard 1] admin_api_server - admin_server.cc:336 - Insecure Admin API listener on 192.168.64.9:9647, consider enabling `admin_api_require_auth`
INFO 2022-07-19 05:37:43,766 [shard 1] admin_api_server - admin_server.cc:111 - Started HTTP admin service listening at {{:{host: 127.0.0.1, port: 9644}}, {:{host: docker-rp-2, port: 9644}}, {iplistener:{host: 192.168.64.9, port: 9647}}}
INFO 2022-07-19 05:37:43,766 [shard 0] redpanda::main - application.cc:1315 - Waiting for cluster membership
INFO 2022-07-19 05:37:43,767 [shard 0] redpanda::main - application.cc:1328 - Started Kafka API server listening at {{dnslistener:{host: docker-rp-2, port: 9092}}, {iplistener:{host: 192.168.64.9, port: 9093}}}
INFO 2022-07-19 05:37:43,767 [shard 0] redpanda::main - application.cc:1126 - Successfully started Redpanda!
WARN 2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}
... (previous line repeated 4 more times)
WARN 2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}
... (previous line repeated 6 more times)
WARN 2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}
... (previous line repeated 4 more times)
... continues until test times out waiting for recovery
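As a further (unconfirmed) observation pointing the same way: the bogus destination node ids in the append_entries warnings decompose into small integers shifted left by 16 bits, the kind of value a field read at the wrong byte offset produces:

```python
# Both bogus node ids are a small value shifted into high bytes, an
# observation consistent with misaligned decoding, not a confirmed
# decode of the actual payload.
for bogus_id in (1703936, 1310720):
    print(hex(bogus_id), bogus_id >> 16)  # 0x1a0000 26, 0x140000 20
```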
The test runner:
[INFO - 2022-07-19 05:39:16,713 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.prefix_truncate_recovery_test.PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade: FAIL: TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 191, in test_recover_during_upgrade
    MixedVersionWorkloadRunner.upgrade_with_workload(
  File "/root/tests/rptest/tests/upgrade_with_workload.py", line 58, in upgrade_with_workload
    workload_fn(node0, node1)
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 189, in _run_recovery
    self.run_recovery(acks=1, dst_node=dst_node)
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 143, in run_recovery
    wait_until(lambda: self.fully_replicated(self.redpanda.nodes),
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
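The empty TimeoutError('') is simply what ducktape's wait_until raises when the condition never becomes true and no err_msg was supplied, e.g.:

```python
# Minimal reproduction of the empty TimeoutError('') above: the predicate
# never returns True and no err_msg is supplied, so the exception message
# is empty.
from ducktape.utils.util import wait_until

wait_until(lambda: False,  # stands in for self.fully_replicated(...)
           timeout_sec=5,
           backoff_sec=1)  # raises ducktape.errors.TimeoutError('')
```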
Redpanda version: dev with #5488