
Rolling back a node to a version that doesn't have RPC adaptive encoding after a partial upgrade yields protocol violation #5506

Closed
andrwng opened this issue Jul 19, 2022 · 0 comments · Fixed by #5520
Labels: area/redpanda, kind/bug



andrwng commented Jul 19, 2022

Redpanda version: dev with #5488

I have a test in progress that does the following:

  1. Perform a partial upgrade of node1.
  2. Run the workload between the upgraded node1 and the non-upgraded node2.
  3. Run the workload in the opposite direction.
  4. Perform a partial rollback of node2 and repeat the workload.
  5. Proceed with a full upgrade and repeat the workload.

Our workload in this case stops a node, produces data until segments are deleted, waits for the down node to be deemed under-replicated, starts the node back up, and waits for the log to recover.
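
For reference, here is a minimal, illustrative sketch of the flow described above. The helper callables (`upgrade`, `rollback`, `run_workload`) are hypothetical stand-ins, not the real rptest/ducktape API the test uses:

```python
# Illustrative sketch of the test flow only; `upgrade`, `rollback`, and
# `run_workload` are hypothetical stand-ins for the real test helpers.
def mixed_version_upgrade_flow(node1, node2, upgrade, rollback, run_workload):
    upgrade(node1)                      # 1. partial upgrade of node1
    run_workload(src=node1, dst=node2)  # 2. upgraded node1 -> non-upgraded node2
    run_workload(src=node2, dst=node1)  # 3. opposite direction
    rollback(node2)                     # 4. partial rollback of node2 ...
    run_workload(src=node1, dst=node2)  #    ... and repeat the workload
    run_workload(src=node2, dst=node1)
    upgrade(node1)                      # 5. full upgrade of both nodes ...
    upgrade(node2)
    run_workload(src=node1, dst=node2)  #    ... and repeat once more
    run_workload(src=node2, dst=node1)
```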

In this case, at step 4, after rolling back node2, the test never quiesces because communication between node1 and node2 breaks down (a sketch of the suspected failure mode follows the traceback below):

On node1:

WARN  2022-07-19 05:37:41,835 [shard 0] cluster - health_monitor_backend.cc:576 - unable to get node health report from 2 - rpc::errc::exponential_backoff, marking node as down                                                                                                                                                                                 
INFO  2022-07-19 05:37:43,773 [shard 1] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2                                                                                                                                                                                                                                             
INFO  2022-07-19 05:37:43,785 [shard 0] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2                                                                                                                                                                                                                                             
WARN  2022-07-19 05:37:43,822 [shard 1] rpc - transport.h:177 - Protocol violation: request version rpc::transport_version::v2 incompatible with reply version rpc::transport_version::v0                                                                                                                                                                        
INFO  2022-07-19 05:37:43,822 [shard 1] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:2700947381, got:299583078 - {version:40, header_checksum:2700947381, compression:26, payload_size:603981824, meta:13631489, correlation_id:33554688, payload_checksum:11168932305873935872}                                                         
WARN  2022-07-19 05:37:43,827 [shard 0] rpc - transport.h:177 - Protocol violation: request version rpc::transport_version::v2 incompatible with reply version rpc::transport_version::v0                                                                                                                                                                        
INFO  2022-07-19 05:37:43,827 [shard 0] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:2700947381, got:3660059771 - {version:40, header_checksum:2700947381, compression:18, payload_size:3690989568, meta:8912896, correlation_id:33554688, payload_checksum:608830381705140224}                                                          
INFO  2022-07-19 05:37:43,978 [shard 1] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2                                                                                                                                                                                                                                             
INFO  2022-07-19 05:37:43,982 [shard 0] r/heartbeat - heartbeat_manager.cc:208 - Closed unresponsive connection to 2
... repeats until the test times out waiting for recovery

On node2:

INFO  2022-07-19 05:37:43,765 [shard 1] cluster - controller_backend.cc:508 - partition operation {type: addition, ntp: {kafka/topic-vfjzkjerwk/0}, offset: 16, new_assignment: { id: 0, group_id: 1, replicas: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}, {node_id: 1, shard: 1}} }, previous_assignment: {nullopt}} finished                             
INFO  2022-07-19 05:37:43,765 [shard 0] cluster - feature_manager.cc:74 - Starting...                                                                                                                                                                                                                                                                            
INFO  2022-07-19 05:37:43,766 [shard 0] redpanda::main - application.cc:1230 - Started RPC server listening at {host: docker-rp-2, port: 33145}                                                                                                                                                                                                                  
WARN  2022-07-19 05:37:43,766 [shard 0] admin_api_server - admin_server.cc:336 - Insecure Admin API listener on 192.168.64.9:9647, consider enabling `admin_api_require_auth`                                                                                                                                                                                    
INFO  2022-07-19 05:37:43,766 [shard 0] admin_api_server - admin_server.cc:111 - Started HTTP admin service listening at {{:{host: 127.0.0.1, port: 9644}}, {:{host: docker-rp-2, port: 9644}}, {iplistener:{host: 192.168.64.9, port: 9647}}}                                                                                                                   
WARN  2022-07-19 05:37:43,766 [shard 1] admin_api_server - admin_server.cc:336 - Insecure Admin API listener on 192.168.64.9:9647, consider enabling `admin_api_require_auth`                                                                                                                                                                                    
INFO  2022-07-19 05:37:43,766 [shard 1] admin_api_server - admin_server.cc:111 - Started HTTP admin service listening at {{:{host: 127.0.0.1, port: 9644}}, {:{host: docker-rp-2, port: 9644}}, {iplistener:{host: 192.168.64.9, port: 9647}}}                                                                                                                   
INFO  2022-07-19 05:37:43,766 [shard 0] redpanda::main - application.cc:1315 - Waiting for cluster membership                                                                                                                                                                                                                                                    
INFO  2022-07-19 05:37:43,767 [shard 0] redpanda::main - application.cc:1328 - Started Kafka API server listening at {{dnslistener:{host: docker-rp-2, port: 9092}}, {iplistener:{host: 192.168.64.9, port: 9093}}}                                                                                                                                              
INFO  2022-07-19 05:37:43,767 [shard 0] redpanda::main - application.cc:1126 - Successfully started Redpanda!                                                                                                                                                                                                                                                    
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,803 [shard 1] raft - [group_id:1, {kafka/topic-vfjzkjerwk/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {16}}                                                                                                                   
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1310720}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}                                                                                                                       
WARN  2022-07-19 05:37:43,816 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.h:539 - received append_entries request addressed to different node: {id: {1703936}, revision: {0}}, current node: {id: {2}, revision: {0}}
... continues until the test times out waiting for recovery

The test runner:

[INFO  - 2022-07-19 05:39:16,713 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.prefix_truncate_recovery_test.PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade: FAIL: TimeoutError('')                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                               
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run                                                                                                                                                                                                                                                                
    data = self.run_test()                                                                                                                                                                                                                                                                                                                                       
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test                                                                                                                                                                                                                                                           
    return self.test_context.function(self.test)                                                                                                                                                                                                                                                                                                                 
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped                                                                                                                                                                                                                                                                                             
    r = f(self, *args, **kwargs)                                                                                                                                                                                                                                                                                                                                 
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 191, in test_recover_during_upgrade                                                                                                                                                                                                                                                     
    MixedVersionWorkloadRunner.upgrade_with_workload(                                                                                                                                                                                                                                                                                                            
  File "/root/tests/rptest/tests/upgrade_with_workload.py", line 58, in upgrade_with_workload                                                                                                                                                                                                                                                                    
    workload_fn(node0, node1)                                                                                                                                                                                                                                                                                                                                    
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 189, in _run_recovery                                                                                                                                                                                                                                                                   
    self.run_recovery(acks=1, dst_node=dst_node)                                                                                                                                                                                                                                                                                                                 
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 143, in run_recovery                                                                                                                                                                                                                                                                    
    wait_until(lambda: self.fully_replicated(self.redpanda.nodes),                                                                                                                                                                                                                                                                                               
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until                                                                                                                                                                                                                                                                   
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception                                                                                                                                                                                                                                                                          
ducktape.errors.TimeoutError
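
For illustration, the symptoms above look like node1 continuing to use a transport version (v2) negotiated before node2 was rolled back, and sending v2-encoded requests that the rolled-back, v0-only node2 cannot parse (hence the bogus node ids and checksum mismatches on node2, and the "Protocol violation" on node1 when the v0 reply comes back). A minimal sketch of that state mismatch, with hypothetical names that do not correspond to Redpanda's actual rpc classes:

```python
# Toy model of the suspected mismatch; `Peer` and `CachedTransport` are
# hypothetical and only illustrate the cached-version problem.
V0, V2 = 0, 2

class Peer:
    """A server that only understands one wire format (e.g. the rolled-back node2)."""
    def __init__(self, max_version):
        self.max_version = max_version

    def handle(self, request_version, payload):
        if request_version > self.max_version:
            # A v0-only binary has no notion of v2 framing, so instead of a
            # clean error it misparses the bytes (bogus node ids, checksum
            # mismatches) and replies in the only format it knows: v0.
            return V0, b"misparsed-reply"
        return request_version, b"ok"

class CachedTransport:
    """A client that keeps a per-connection negotiated version (e.g. node1)."""
    def __init__(self, negotiated=V2):  # cached from before the peer was rolled back
        self.negotiated = negotiated

    def send(self, peer, payload):
        reply_version, reply = peer.handle(self.negotiated, payload)
        if reply_version != self.negotiated:
            # Analogue of the node1 log line:
            # "Protocol violation: request version v2 incompatible with reply version v0"
            raise RuntimeError(
                f"Protocol violation: request v{self.negotiated} "
                f"incompatible with reply v{reply_version}")
        return reply

# node1 cached v2 for its connection to node2 before node2 became v0-only:
node1_to_node2 = CachedTransport(negotiated=V2)
try:
    node1_to_node2.send(Peer(max_version=V0), b"append_entries")
except RuntimeError as err:
    print(err)  # -> Protocol violation: request v2 incompatible with reply v0
```

If that is what is happening, the connection never recovers on its own because node1 keeps reusing the cached v2 setting on every retry, which matches the repeating "Closed unresponsive connection" / "Protocol violation" loop above.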