
Rolling back a node to a version that doesn't have RPC adaptive encoding after a partial upgrade yields protocol violation #5506

Closed
andrwng opened this issue Jul 19, 2022 · 0 comments · Fixed by #5520
Labels: area/redpanda, kind/bug



andrwng commented Jul 19, 2022

Redpanda version: dev with #5488

I have a test in progress that does the following:

  1. Perform a partial upgrade of node1.
  2. Run the workload between the upgraded node1 and the non-upgraded node2.
  3. Run the workload in the opposite direction.
  4. Perform a partial rollback of node2 and repeat the workload.
  5. Proceed with a full upgrade and repeat the workload.

Our workload in this case stops a node, produces data until segments are deleted, waits for the down node to be deemed under-replicated, starts the node back up, and waits for the log to recover.
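
For reference, here is a minimal, illustrative sketch of the flow described above. The helper callables (`upgrade`, `rollback`, `run_workload`) are hypothetical stand-ins, not the real rptest/ducktape API the test uses:

```python
# Illustrative sketch of the test flow only; `upgrade`, `rollback`, and
# `run_workload` are hypothetical stand-ins for the real test helpers.
def mixed_version_upgrade_flow(node1, node2, upgrade, rollback, run_workload):
    upgrade(node1)                      # 1. partial upgrade of node1
    run_workload(src=node1, dst=node2)  # 2. upgraded node1 -> non-upgraded node2
    run_workload(src=node2, dst=node1)  # 3. opposite direction
    rollback(node2)                     # 4. partial rollback of node2 ...
    run_workload(src=node1, dst=node2)  #    ... and repeat the workload
    run_workload(src=node2, dst=node1)
    upgrade(node1)                      # 5. full upgrade of both nodes ...
    upgrade(node2)
    run_workload(src=node1, dst=node2)  #    ... and repeat once more
    run_workload(src=node2, dst=node1)
```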

In this case, at step 4, after rolling back node2, the test never quiesces because communication between node1 and node2 breaks down (a sketch of the suspected failure mode follows the traceback below):

On node1:

WARN  2022-07-19 05:37:41,835 [shard 0] cluster - health_monitor_backend.cc:576 - unable to get node health report from 2 - rpc::errc::exponential_backoff, marking node as down                                                                                                                                                                                 
INFO  2022-07-19 05:37:43,773 [shard 1] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2                                                                                                                                                                                                                                             
INFO  2022-07-19 05:37:43,785 [shard 0] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2                                                                                                                                                                                                                                             
WARN  2022-07-19 05:37:43,822 [shard 1] rpc - transport.h:177 - Protocol violation: request version rpc::transport_version::v2 incompatible with reply version rpc::transport_version::v0                                                                                                                                                                        
INFO  2022-07-19 05:37:43,822 [shard 1] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:2700947381, got:299583078 - {version:40, header_checksum:2700947381, compression:26, payload_size:603981824, meta:13631489, correlation_id:33554688, payload_checksum:11168932305873935872}                                                         
WARN  2022-07-19 05:37:43,827 [shard 0] rpc - transport.h:177 - Protocol violation: request version rpc::transport_version::v2 incompatible with reply version rpc::transport_version::v0                                                                                                                                                                        
INFO  2022-07-19 05:37:43,827 [shard 0] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:2700947381, got:3660059771 - {version:40, header_checksum:2700947381, compression:18, payload_size:3690989568, meta:8912896, correlation_id:33554688, payload_checksum:608830381705140224}                                                          
INFO  2022-07-19 05:37:43,978 [shard 1] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2                                                                                                                                                                                                                                             
INFO  2022-07-19 05:37:43,982 [shard 0] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2
... repeats until the test times out waiting for recovery

On node2:

INFO  2022-07-19 05:37:43,765 [shard 1] cluster - controller_backend.cc:508 - partition operation {type: addition, ntp: {kafka/topic-vfjzkjerwk/0}, offset: 16, new_assignment: { id: 0, group_id: 1, replicas: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}, {node_id: 1, shard: 1}} }, previous_assignment: {nullopt}} finished                             
INFO  2022-07-19 05:37:43,765 [shard 0] cluster - feature_manager.cc:74 - Starting...                                                                                                                                                                                                                                                                            
INFO  2022-07-19 05:37:43,766 [shard 0] redpanda::main - application.cc:1230 - Started RPC server listening at {host: docker-rp-2, port: 33145}                                                                                                                                                                                                                  
WARN  2022-07-19 05:37:43,766 [shard 0] admin_api_server - admin_server.cc:336 - Insecure Admin API listener on 192.168.64.9:9647, consider enabling `admin_api_require_auth`                                                                                                                                                                                    
INFO  2022-07-19 05:37:43,766 [shard 0] admin_api_server - admin_server.cc:111 - Started HTTP admin service listening at {{:{host: 127.0.0.1, port: 9644}}, {:{host: docker-rp-2, port: 9644}}, {iplistener:{host: 192.168.64.9, port: 9647}}}                                                                                                                   
WARN  2022-07-19 05:37:43,766 [shard 1] admin_api_server - admin_server.cc:336 - Insecure Admin API listener on 192.168.64.9:9647, consider enabling `admin_api_require_auth`                                                                                                                                                                                    
INFO  2022-07-19 05:37:43,766 [shard 1] admin_api_server - admin_server.cc:111 - Started HTTP admin service listening at {{:{host: 127.0.0.1, port: 9644}}, {:{host: docker-rp-2, port: 9644}}, {iplistener:{host: 192.168.64.9, port: 9647}}}                                                                                                                   
INFO  2022-07-19 05:37:43,766 [shard 0] redpanda::main - application.cc:1315 - Waiting for cluster membership                                                                                                                                                                                                                                                    
INFO  2022-07-19 05:37:43,767 [shard 0] redpanda::main - application.cc:1328 - Started Kafka API server listening at {{dnslistener:{host: docker-rp-2, port: 9092}}, {iplistener:{host: 192.168.64.9, port: 9093}}}                                                                                                                                              
INFO  2022-07-19 05:37:43,767 [shard 0] redpanda::main - application.cc:1126 - Successfully started Redpanda!                                                                                                                                                                                                                                                    
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}
... continues until the test times out waiting for recovery

The test runner:

[INFO  - 2022-07-19 05:39:16,713 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.prefix_truncate_recovery_test.PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade: FAIL: TimeoutError('')                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                               
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run                                                                                                                                                                                                                                                                
    data = self.run_test()                                                                                                                                                                                                                                                                                                                                       
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test                                                                                                                                                                                                                                                           
    return self.test_context.function(self.test)                                                                                                                                                                                                                                                                                                                 
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped                                                                                                                                                                                                                                                                                             
    r = f(self, *args, **kwargs)                                                                                                                                                                                                                                                                                                                                 
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 191, in test_recover_during_upgrade                                                                                                                                                                                                                                                     
    MixedVersionWorkloadRunner.upgrade_with_workload(                                                                                                                                                                                                                                                                                                            
  File "/root/tests/rptest/tests/upgrade_with_workload.py", line 58, in upgrade_with_workload                                                                                                                                                                                                                                                                    
    workload_fn(node0, node1)                                                                                                                                                                                                                                                                                                                                    
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 189, in _run_recovery                                                                                                                                                                                                                                                                   
    self.run_recovery(acks=1, dst_node=dst_node)                                                                                                                                                                                                                                                                                                                 
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 143, in run_recovery                                                                                                                                                                                                                                                                    
    wait_until(lambda: self.fully_replicated(self.redpanda.nodes),                                                                                                                                                                                                                                                                                               
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until                                                                                                                                                                                                                                                                   
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception                                                                                                                                                                                                                                                                          
ducktape.errors.TimeoutError
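
For illustration, the symptoms above look like node1 continuing to use a transport version (v2) negotiated before node2 was rolled back, and sending v2-encoded requests that the rolled-back, v0-only node2 cannot parse (hence the bogus node ids and checksum mismatches on node2, and the "Protocol violation" on node1 when the v0 reply comes back). A minimal sketch of that state mismatch, with hypothetical names that do not correspond to Redpanda's actual rpc classes:

```python
# Toy model of the suspected mismatch; `Peer` and `CachedTransport` are
# hypothetical and only illustrate the cached-version problem.
V0, V2 = 0, 2

class Peer:
    """A server that only understands one wire format (e.g. the rolled-back node2)."""
    def __init__(self, max_version):
        self.max_version = max_version

    def handle(self, request_version, payload):
        if request_version > self.max_version:
            # A v0-only binary has no notion of v2 framing, so instead of a
            # clean error it misparses the bytes (bogus node ids, checksum
            # mismatches) and replies in the only format it knows: v0.
            return V0, b"misparsed-reply"
        return request_version, b"ok"

class CachedTransport:
    """A client that keeps a per-connection negotiated version (e.g. node1)."""
    def __init__(self, negotiated=V2):  # cached from before the peer was rolled back
        self.negotiated = negotiated

    def send(self, peer, payload):
        reply_version, reply = peer.handle(self.negotiated, payload)
        if reply_version != self.negotiated:
            # Analogue of the node1 log line:
            # "Protocol violation: request version v2 incompatible with reply version v0"
            raise RuntimeError(
                f"Protocol violation: request v{self.negotiated} "
                f"incompatible with reply v{reply_version}")
        return reply

# node1 cached v2 for its connection to node2 before node2 became v0-only:
node1_to_node2 = CachedTransport(negotiated=V2)
try:
    node1_to_node2.send(Peer(max_version=V0), b"append_entries")
except RuntimeError as err:
    print(err)  # -> Protocol violation: request v2 incompatible with reply v0
```

If that is what is happening, the connection never recovers on its own because node1 keeps reusing the cached v2 setting on every retry, which matches the repeating "Closed unresponsive connection" / "Protocol violation" loop above.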