
RpkException in PartitionBalancerTest.test_maintenance_mode.kill_same_node=True #6352

Closed
r-vasquez opened this issue Sep 9, 2022 · 9 comments
Labels: ci-failure, kind/bug, pr-blocker, sev/high


r-vasquez commented Sep 9, 2022

Seen in: https://buildkite.com/redpanda/redpanda/builds/15168#01832300-bb99-4544-82aa-c885c3f12780

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_maintenance_mode.kill_same_node=True
status:     FAIL
run time:   1 minute 56.672 seconds

<NodeCrash docker-rp-9: ==1927==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f10c52d318f bp 0x7ffc14f30500 sp 0x7ffc14f30260 T0)
>
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 569, in test_maintenance_mode
    rpk.cluster_maintenance_enable(node, wait=True)
  File "/root/tests/rptest/clients/rpk.py", line 661, in cluster_maintenance_enable
    return self._execute(cmd)
  File "/root/tests/rptest/clients/rpk.py", line 637, in _execute
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0537a91f8006b8098-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/bin/rpk --api-urls docker-rp-9:9644,docker-rp-11:9644,docker-rp-12:9644,docker-rp-13:9644,docker-rp-14:9644 cluster maintenance enable 3 --wait returned 1, output: Request error, trying another node: Put "http://docker-rp-9:9644/v1/brokers/3/maintenance": dial tcp 172.16.16.26:9644: connect: connection refused
Request error, trying another node: request PUT http://docker-rp-14:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::disconnected_endpoint(node down)\", \"code\": 503}"
Request error, trying another node: request PUT http://docker-rp-11:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::exponential_backoff\", \"code\": 503}"
Request error, trying another node: request PUT http://docker-rp-13:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::exponential_backoff\", \"code\": 503}"
error: error enabling maintenance mode for node 3: request PUT http://docker-rp-12:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::exponential_backoff\", \"code\": 503}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 40, in wrapped
    self.redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 1241, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-9: ==1927==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f10c52d318f bp 0x7ffc14f30500 sp 0x7ffc14f30260 T0)
>

Similar to #6033 but it's a different method & error message.

@r-vasquez r-vasquez added kind/bug Something isn't working ci-failure labels Sep 9, 2022
@rystsov rystsov added the pr-blocker CI failures blocking a PR from being merged label Sep 16, 2022
@piyushredpanda piyushredpanda assigned dotnwat and BenPope and unassigned dotnwat Sep 22, 2022

BenPope commented Sep 27, 2022

The crash:

TRACE 2022-09-09 17:07:02,554 [shard 0] kafka - produce.cc:565 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=30000 topics={{name={topic-equzvyjuvs} partitions={{partition_index=26 records={batch 1404 v2_format true valid_crc true}}}}}}
TRACE 2022-09-09 17:07:02,554 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:374 - Replication result [offset: 680, term: 1, commit_idx: 680, current_term: 1]
TRACE 2022-09-09 17:07:02,554 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:407 - Replication success, last offset: 680, term: 1
TRACE 2022-09-09 17:07:02,555 [shard 1] storage - readers_cache.cc:101 - {kafka/topic-equzvyjuvs/26} - trying to get reader for: {start_offset:{675}, max_offset:{680}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-09-09 17:07:02,555 [shard 1] storage - readers_cache.cc:130 - {kafka/topic-equzvyjuvs/26} - reader cache miss for: {start_offset:{675}, max_offset:{680}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-09-09 17:07:02,555 [shard 0] r/heartbeat - heartbeat_manager.cc:316 - Heartbeat reply from node: 4 - {node_id: {id: {4}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,555 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:298 - Append entries response: {node_id: {id: {4}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,555 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:378 - Updated node {id: {4}, revision: {0}} last committed log index: 40
TRACE 2022-09-09 17:07:02,555 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:498 - Updated node {id: {4}, revision: {0}} match 40 and next 41 indices
TRACE 2022-09-09 17:07:02,556 [shard 0] r/heartbeat - heartbeat_manager.cc:316 - Heartbeat reply from node: 2 - {node_id: {id: {2}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,556 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:298 - Append entries response: {node_id: {id: {2}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,556 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:378 - Updated node {id: {2}, revision: {0}} last committed log index: 40
TRACE 2022-09-09 17:07:02,556 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:498 - Updated node {id: {2}, revision: {0}} match 40 and next 41 indices
TRACE 2022-09-09 17:07:02,557 [shard 1] storage - readers_cache.cc:75 - {kafka/topic-equzvyjuvs/26} - adding reader [0,680]
TRACE 2022-09-09 17:07:02,557 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] consensus.cc:298 - Append entries response: {node_id: {id: {5}, revision: {23}}, target_node_id{id: {1}, revision: {23}}, group: {27}, term:{1}, last_dirty_log_index:{680}, last_flushed_log_index:{680}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,557 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] consensus.cc:378 - Updated node {id: {5}, revision: {23}} last committed log index: 680
TRACE 2022-09-09 17:07:02,557 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] consensus.cc:498 - Updated node {id: {5}, revision: {23}} match 680 and next 681 indices
TRACE 2022-09-09 17:07:02,558 [shard 1] kafka - requests.cc:91 - [172.16.16.22:37588] processing name:offset_commit, key:8, version:7 for consumer-test_group-1, mem_units: 8414
TRACE 2022-09-09 17:07:02,558 [shard 1] kafka - offset_commit.cc:61 - Handling request {group_id={test_group} generation_id=1 member_id={consumer-test_group-1-9fe2778f-8001-4869-bf27-0909c7897b40} group_instance_id={nullopt} retention_time_ms=-1 topics={{name={topic-equzvyjuvs} partitions={{partition_index=14 committed_offset=234 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}, {partition_index=6 committed_offset=371 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}, {partition_index=9 committed_offset=131 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}, {partition_index=25 committed_offset=274 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}}}}}
TRACE 2022-09-09 17:07:02,559 [shard 1] kafka - [N:test_group S:Stable G:1] group.cc:1193 - Scheduling heartbeat expiration 30000 ms for {consumer-test_group-1-9fe2778f-8001-4869-bf27-0909c7897b40}
TRACE 2022-09-09 17:07:02,562 [shard 1] raft - [group_id:43, {kafka/__consumer_offsets/15}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{43}, commit_index:{441}, term:{1}, prev_log_index:{441}, prev_log_term:{1}, last_visible_index:{441}}
TRACE 2022-09-09 17:07:02,562 [shard 1] storage - readers_cache.cc:328 - {kafka/topic-equzvyjuvs/26} - removing reader: [0,680] lower_bound: 681
TRACE 2022-09-09 17:07:02,564 [shard 1] storage - storage_resources.cc:197 - stm_take_bytes 333 (current 5368586534)
TRACE 2022-09-09 17:07:02,565 [shard 1] storage - storage_resources.cc:175 - offset_translator_take_bytes 333 (current 5368586534)
TRACE 2022-09-09 17:07:02,565 [shard 1] storage - storage_resources.cc:186 - configuration_manager_take_bytes 333 (current 5368586534)
TRACE 2022-09-09 17:07:02,565 [shard 1] raft - [group_id:43, {kafka/__consumer_offsets/15}] replicate_entries_stm.cc:196 - Leader append result: {append_time:3604859, base_offset:{442}, last_offset:{445}, byte_size:333}
TRACE 2022-09-09 17:07:02,566 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{27}, commit_index:{680}, term:{1}, prev_log_index:{680}, prev_log_term:{1}, last_visible_index:{680}}
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1927==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f10c52d318f bp 0x7ffc14f30500 sp 0x7ffc14f30260 T0)
==1927==The signal is caused by a READ memory access.
==1927==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
TRACE 2022-09-09 17:07:02,569 [shard 1] raft - [group_id:6, {kafka/topic-equzvyjuvs/5}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{294}, term:{1}, prev_log_index:{294}, prev_log_term:{1}, last_visible_index:{294}}
TRACE 2022-09-09 17:07:02,569 [shard 1] raft - [group_id:6, {kafka/topic-equzvyjuvs/5}] consensus.cc:2588 - Follower commit index updated 294
TRACE 2022-09-09 17:07:02,569 [shard 1] raft - [group_id:43, {kafka/__consumer_offsets/15}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{43}, commit_index:{441}, term:{1}, prev_log_index:{441}, prev_log_term:{1}, last_visible_index:{441}} to {id: {3}, revision: {26}}
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.


BenPope commented Sep 27, 2022

Of note:

  • SEGV was on T0, which is shard 0.
  • First log line - [shard 0] kafka - produce.cc:565 - handling produce request ... topics={{name={topic-equzvyjuvs} partitions={{partition_index=26
  • Last log line - [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:186

So there was a cross-core operation for partition 26, and the crash occurred on the originating core.
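
To illustrate the hazard this implies — a minimal seastar-style sketch with hypothetical names, not Redpanda's actual produce path: shard 0 submits the replication step to the partition's home shard and resumes on shard 0 afterwards. If the continuation on the originating shard refers to state that was freed in the meantime, the resume is a use-after-free on shard 0, consistent with a SEGV on T0:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <memory>

struct produce_ctx { int partition = 26; };  // hypothetical per-request state

// BUG sketch: the continuation resumes on the originating shard (0) holding
// a raw pointer to state whose owner has already gone away.
seastar::future<> handle_produce_broken() {
    auto ctx = std::make_unique<produce_ctx>();
    produce_ctx* raw = ctx.get();
    // `ctx` is destroyed when this function returns; the `.then` below runs
    // later, back on shard 0, and dereferences freed memory.
    return seastar::smp::submit_to(1, [] {
        return seastar::make_ready_future<>();  // stand-in for raft replicate
    }).then([raw] {
        (void)raw->partition;                   // use-after-free on shard 0
    });
}

// FIX sketch: tie the state's lifetime to the continuation chain.
seastar::future<> handle_produce_fixed() {
    auto ctx = std::make_unique<produce_ctx>();
    return seastar::smp::submit_to(1, [] {
        return seastar::make_ready_future<>();
    }).then([ctx = std::move(ctx)] {
        (void)ctx->partition;                   // kept alive by the capture
    });
}
```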

@dotnwat dotnwat self-assigned this Oct 1, 2022
piyushredpanda commented

@BenPope: has this been hit again? Are there logs or anything else we should add here for the next time it's hit? If not, then I wonder what our options are for a next step.


BenPope commented Oct 6, 2022

I can only realistically search the dev branch. Of all the SEGV failures in the last 35 days, this is the only one I found that's the same, and it also contains no backtrace.

Almost all of the other ones are confined to a small window of the first 3 days of October and related to SI.

Issue #6391 mentions the same problem and has been closed as a duplicate of #5575, but #5575 does appear to be different from this one. I also think coredumps are the only reasonable way forward.
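
For the record: ASan suppresses kernel core dumps by default on 64-bit targets (the shadow mapping makes them enormous), so a core has to be opted into explicitly. A hypothetical launcher sketch — the binary path and flag values here are illustrative, not the CI's actual configuration:

```cpp
#include <cstdlib>
#include <sys/resource.h>
#include <unistd.h>

int main(int /*argc*/, char** argv) {
    // abort_on_error=1 turns the ASan report into an abort the kernel can
    // dump; disable_coredump=0 undoes ASan's 64-bit default of suppressing
    // cores due to the huge shadow mapping.
    setenv("ASAN_OPTIONS", "abort_on_error=1:disable_coredump=0", 1);

    // Remove the core size limit so the dump is not silently truncated.
    rlimit unlimited{RLIM_INFINITY, RLIM_INFINITY};
    setrlimit(RLIMIT_CORE, &unlimited);

    // Hypothetical install path; pass the original arguments through.
    execv("/opt/redpanda/bin/redpanda", argv);
    return 1;  // reached only if exec fails
}
```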


Lazin commented Oct 27, 2022

@BenPope BenPope added the sev/high loss of availability, pathological performance degradation, recoverable corruption label Nov 9, 2022

BenPope commented Nov 9, 2022

Probably a duplicate of #6973


abhijat commented Nov 10, 2022

I have been seeing a similar issue (maybe something different) while working on #7060 - it is reliably reproduced on CDT when running the debug build with a newly added test listed in that issue.

I have not been able to capture a core dump yet from the node where redpanda hits this error, but I will try to capture one.

AddressSanitizer:DEADLYSIGNAL
=================================================================
==51498==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f686df2209f bp 0x7ffdbbc2ec80 sp 0x7ffdbbc2e9e0 T0)
==51498==The signal is caused by a READ memory access.
==51498==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
TRACE 2022-11-05 07:20:14,046 [shard 3] storage - storage_resources.cc:202 - configuration_manager_take_bytes 27799771 += 786714 (current 1961145309)
TRACE 2022-11-05 07:20:14,046 [shard 1] storage - storage_resources.cc:215 - stm_take_bytes 6688052 += 917823 (current 2530399106)
TRACE 2022-11-05 07:20:14,047 [shard 3] raft - [group_id:94, {kafka/si_test_topic/93}] replicate_entries_stm.cc:196 - Leader append result: {append_time:68604536, base_offset:{216}, last_offset:{221}, byte_size:786714}
TRACE 2022-11-05 07:20:14,048 [shard 2] storage - storage_resources.cc:215 - stm_take_bytes 0 += 131169 (current 2526205552)
TRACE 2022-11-05 07:20:14,048 [shard 2] storage - readers_cache.cc:101 - {kafka/si_test_topic/20} - trying to get reader for: {start_offset:{226}, max_offset:{226}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-11-05 07:20:14,048 [shard 2] storage - readers_cache.cc:130 - {kafka/si_test_topic/20} - reader cache miss for: {start_offset:{226}, max_offset:{226}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}


abhijat commented Nov 10, 2022

I reran the test in CDT after rebasing on dev, hoping that the recent changes for lifetimes would improve the situation, but I still see this error, e.g.:

test_id:    rptest.scale_tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention
status:     FAIL
run time:   10 minutes 27.000 seconds


    <NodeCrash (ip-172-31-63-175,ip-172-31-54-63) ip-172-31-63-175: ==256048==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f14cda1709f bp 0x7ffc7d68e400 sp 0x7ffc7d68e160 T0)
>
Traceback (most recent call last):
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/cloud_retention_test.py", line 47, in test_cloud_retention
    producer.wait_for_acks(msg_count, timeout_sec=600, backoff_sec=5)
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 564, in wait_for_acks
    wait_until(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_crash()
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1369, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash (ip-172-31-63-175,ip-172-31-54-63) ip-172-31-63-175: ==256048==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f14cda1709f bp 0x7ffc7d68e400 sp 0x7ffc7d68e160 T0)
>


jcsp commented Nov 25, 2022

In the last 30 days we have not seen another case of PartitionBalancerTest.test_maintenance_mode.kill_same_node=True since the one that Evgeny linked 29 days ago (https://buildkite.com/redpanda/redpanda/builds/17411#018417da-00d8-435b-95c1-c905f841eec1).

I don't think we have any clues to work with here, and there have been fixes for general lifetime issues in coroutine lambdas since then (via #6973), so closing.
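
For context, the bug class addressed there is the well-known C++ pitfall that a coroutine lambda's captures live in the lambda object, not in the coroutine frame, so a temporary lambda's captures dangle after the first suspension point. A minimal seastar-flavoured sketch (hypothetical code, not the actual fix):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>
#include <string>

// BUG: captures are stored in the lambda object itself, which here is a
// temporary destroyed at the end of the full expression -- long before the
// coroutine body resumes after its first co_await. `msg` dangles from then on.
seastar::future<> broken(std::string msg) {
    return [msg]() -> seastar::future<> {
        co_await seastar::sleep(std::chrono::milliseconds(10));
        std::cout << msg << "\n";   // use-after-free
    }();
}

// FIX: make the enclosing function the coroutine, so the state lives in the
// coroutine frame, which stays alive until the future resolves.
seastar::future<> fixed(std::string msg) {
    co_await seastar::sleep(std::chrono::milliseconds(10));
    std::cout << msg << "\n";       // msg is part of this coroutine's frame
}
```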

@jcsp jcsp closed this as completed Nov 25, 2022