
RpkException in PartitionBalancerTest.test_maintenance_mode.kill_same_node=True #6352

Closed
r-vasquez opened this issue Sep 9, 2022 · 9 comments
Labels: ci-failure, kind/bug, pr-blocker, sev/high


r-vasquez commented Sep 9, 2022

Seen in: https://buildkite.com/redpanda/redpanda/builds/15168#01832300-bb99-4544-82aa-c885c3f12780

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_maintenance_mode.kill_same_node=True
status:     FAIL
run time:   1 minute 56.672 seconds

<NodeCrash docker-rp-9: ==1927==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f10c52d318f bp 0x7ffc14f30500 sp 0x7ffc14f30260 T0)
>
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 569, in test_maintenance_mode
    rpk.cluster_maintenance_enable(node, wait=True)
  File "/root/tests/rptest/clients/rpk.py", line 661, in cluster_maintenance_enable
    return self._execute(cmd)
  File "/root/tests/rptest/clients/rpk.py", line 637, in _execute
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0537a91f8006b8098-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/bin/rpk --api-urls docker-rp-9:9644,docker-rp-11:9644,docker-rp-12:9644,docker-rp-13:9644,docker-rp-14:9644 cluster maintenance enable 3 --wait returned 1, output: Request error, trying another node: Put "http://docker-rp-9:9644/v1/brokers/3/maintenance": dial tcp 172.16.16.26:9644: connect: connection refused
Request error, trying another node: request PUT http://docker-rp-14:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::disconnected_endpoint(node down)\", \"code\": 503}"
Request error, trying another node: request PUT http://docker-rp-11:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::exponential_backoff\", \"code\": 503}"
Request error, trying another node: request PUT http://docker-rp-13:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::exponential_backoff\", \"code\": 503}"
error: error enabling maintenance mode for node 3: request PUT http://docker-rp-12:9644/v1/brokers/3/maintenance failed: Service Unavailable, body: "{\"message\": \"Not ready: rpc::errc::exponential_backoff\", \"code\": 503}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 40, in wrapped
    self.redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 1241, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-9: ==1927==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f10c52d318f bp 0x7ffc14f30500 sp 0x7ffc14f30260 T0)
>

Similar to #6033 but it's a different method & error message.

@r-vasquez r-vasquez added kind/bug Something isn't working ci-failure labels Sep 9, 2022
@rystsov rystsov added the pr-blocker CI failures blocking a PR from being merged label Sep 16, 2022
@piyushredpanda piyushredpanda assigned dotnwat and BenPope and unassigned dotnwat Sep 22, 2022

BenPope commented Sep 27, 2022

The crash:

TRACE 2022-09-09 17:07:02,554 [shard 0] kafka - produce.cc:565 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=30000 topics={{name={topic-equzvyjuvs} partitions={{partition_index=26 records={batch 1404 v2_format true valid_crc true}}}}}}
TRACE 2022-09-09 17:07:02,554 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:374 - Replication result [offset: 680, term: 1, commit_idx: 680, current_term: 1]
TRACE 2022-09-09 17:07:02,554 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:407 - Replication success, last offset: 680, term: 1
TRACE 2022-09-09 17:07:02,555 [shard 1] storage - readers_cache.cc:101 - {kafka/topic-equzvyjuvs/26} - trying to get reader for: {start_offset:{675}, max_offset:{680}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-09-09 17:07:02,555 [shard 1] storage - readers_cache.cc:130 - {kafka/topic-equzvyjuvs/26} - reader cache miss for: {start_offset:{675}, max_offset:{680}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-09-09 17:07:02,555 [shard 0] r/heartbeat - heartbeat_manager.cc:316 - Heartbeat reply from node: 4 - {node_id: {id: {4}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,555 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:298 - Append entries response: {node_id: {id: {4}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,555 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:378 - Updated node {id: {4}, revision: {0}} last committed log index: 40
TRACE 2022-09-09 17:07:02,555 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:498 - Updated node {id: {4}, revision: {0}} match 40 and next 41 indices
TRACE 2022-09-09 17:07:02,556 [shard 0] r/heartbeat - heartbeat_manager.cc:316 - Heartbeat reply from node: 2 - {node_id: {id: {2}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,556 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:298 - Append entries response: {node_id: {id: {2}, revision: {0}}, target_node_id{id: {1}, revision: {0}}, group: {0}, term:{1}, last_dirty_log_index:{40}, last_flushed_log_index:{40}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,556 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:378 - Updated node {id: {2}, revision: {0}} last committed log index: 40
TRACE 2022-09-09 17:07:02,556 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:498 - Updated node {id: {2}, revision: {0}} match 40 and next 41 indices
TRACE 2022-09-09 17:07:02,557 [shard 1] storage - readers_cache.cc:75 - {kafka/topic-equzvyjuvs/26} - adding reader [0,680]
TRACE 2022-09-09 17:07:02,557 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] consensus.cc:298 - Append entries response: {node_id: {id: {5}, revision: {23}}, target_node_id{id: {1}, revision: {23}}, group: {27}, term:{1}, last_dirty_log_index:{680}, last_flushed_log_index:{680}, last_term_base_offset:{-9223372036854775808}, result: success}
TRACE 2022-09-09 17:07:02,557 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] consensus.cc:378 - Updated node {id: {5}, revision: {23}} last committed log index: 680
TRACE 2022-09-09 17:07:02,557 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] consensus.cc:498 - Updated node {id: {5}, revision: {23}} match 680 and next 681 indices
TRACE 2022-09-09 17:07:02,558 [shard 1] kafka - requests.cc:91 - [172.16.16.22:37588] processing name:offset_commit, key:8, version:7 for consumer-test_group-1, mem_units: 8414
TRACE 2022-09-09 17:07:02,558 [shard 1] kafka - offset_commit.cc:61 - Handling request {group_id={test_group} generation_id=1 member_id={consumer-test_group-1-9fe2778f-8001-4869-bf27-0909c7897b40} group_instance_id={nullopt} retention_time_ms=-1 topics={{name={topic-equzvyjuvs} partitions={{partition_index=14 committed_offset=234 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}, {partition_index=6 committed_offset=371 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}, {partition_index=9 committed_offset=131 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}, {partition_index=25 committed_offset=274 committed_leader_epoch=-1 commit_timestamp=-1 committed_metadata={}}}}}}
TRACE 2022-09-09 17:07:02,559 [shard 1] kafka - [N:test_group S:Stable G:1] group.cc:1193 - Scheduling heartbeat expiration 30000 ms for {consumer-test_group-1-9fe2778f-8001-4869-bf27-0909c7897b40}
TRACE 2022-09-09 17:07:02,562 [shard 1] raft - [group_id:43, {kafka/__consumer_offsets/15}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{43}, commit_index:{441}, term:{1}, prev_log_index:{441}, prev_log_term:{1}, last_visible_index:{441}}
TRACE 2022-09-09 17:07:02,562 [shard 1] storage - readers_cache.cc:328 - {kafka/topic-equzvyjuvs/26} - removing reader: [0,680] lower_bound: 681
TRACE 2022-09-09 17:07:02,564 [shard 1] storage - storage_resources.cc:197 - stm_take_bytes 333 (current 5368586534)
TRACE 2022-09-09 17:07:02,565 [shard 1] storage - storage_resources.cc:175 - offset_translator_take_bytes 333 (current 5368586534)
TRACE 2022-09-09 17:07:02,565 [shard 1] storage - storage_resources.cc:186 - configuration_manager_take_bytes 333 (current 5368586534)
TRACE 2022-09-09 17:07:02,565 [shard 1] raft - [group_id:43, {kafka/__consumer_offsets/15}] replicate_entries_stm.cc:196 - Leader append result: {append_time:3604859, base_offset:{442}, last_offset:{445}, byte_size:333}
TRACE 2022-09-09 17:07:02,566 [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{27}, commit_index:{680}, term:{1}, prev_log_index:{680}, prev_log_term:{1}, last_visible_index:{680}}
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1927==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f10c52d318f bp 0x7ffc14f30500 sp 0x7ffc14f30260 T0)
==1927==The signal is caused by a READ memory access.
==1927==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
TRACE 2022-09-09 17:07:02,569 [shard 1] raft - [group_id:6, {kafka/topic-equzvyjuvs/5}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{294}, term:{1}, prev_log_index:{294}, prev_log_term:{1}, last_visible_index:{294}}
TRACE 2022-09-09 17:07:02,569 [shard 1] raft - [group_id:6, {kafka/topic-equzvyjuvs/5}] consensus.cc:2588 - Follower commit index updated 294
TRACE 2022-09-09 17:07:02,569 [shard 1] raft - [group_id:43, {kafka/__consumer_offsets/15}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{43}, commit_index:{441}, term:{1}, prev_log_index:{441}, prev_log_term:{1}, last_visible_index:{441}} to {id: {3}, revision: {26}}
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.


BenPope commented Sep 27, 2022

Of note:

  • SEGV was on T0, which is shard 0.
  • First log line - [shard 0] kafka - produce.cc:565 - handling produce request ... topics={{name={topic-equzvyjuvs} partitions={{partition_index=26
  • Last log line - [shard 1] raft - [group_id:27, {kafka/topic-equzvyjuvs/26}] replicate_entries_stm.cc:186

So there was a cross-core operation for partition 26, and the crash occurred on the originating core.
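
To illustrate the hazard this implies — a minimal seastar-style sketch with hypothetical names, not Redpanda's actual produce path: shard 0 submits the replication step to the partition's home shard and resumes on shard 0 afterwards. If the continuation on the originating shard refers to state that was freed in the meantime, the resume is a use-after-free on shard 0, consistent with a SEGV on T0:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <memory>

struct produce_ctx { int partition = 26; };  // hypothetical per-request state

// BUG sketch: the continuation resumes on the originating shard (0) holding
// a raw pointer to state whose owner has already gone away.
seastar::future<> handle_produce_broken() {
    auto ctx = std::make_unique<produce_ctx>();
    produce_ctx* raw = ctx.get();
    // `ctx` is destroyed when this function returns; the `.then` below runs
    // later, back on shard 0, and dereferences freed memory.
    return seastar::smp::submit_to(1, [] {
        return seastar::make_ready_future<>();  // stand-in for raft replicate
    }).then([raw] {
        (void)raw->partition;                   // use-after-free on shard 0
    });
}

// FIX sketch: tie the state's lifetime to the continuation chain.
seastar::future<> handle_produce_fixed() {
    auto ctx = std::make_unique<produce_ctx>();
    return seastar::smp::submit_to(1, [] {
        return seastar::make_ready_future<>();
    }).then([ctx = std::move(ctx)] {
        (void)ctx->partition;                   // kept alive by the capture
    });
}
```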

@dotnwat dotnwat self-assigned this Oct 1, 2022
piyushredpanda commented

@BenPope: has this been hit again? Are there logs or anything else we should add here for the next time it's hit? If not, then I wonder what our options are for a next step.


BenPope commented Oct 6, 2022

I can only realistically search the dev branch. Of all the SEGV failures in the last 35 days, this is the only one I found that's the same, and it also contains no backtrace.

Almost all of the other ones are confined to a small window of the first 3 days of October and related to SI.

Issue #6391 mentions the same problem and has been closed as a duplicate of #5575, but #5575 does appear to be different from this one. I also think coredumps are the only reasonable way forward.
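
For the record: ASan suppresses kernel core dumps by default on 64-bit targets (the shadow mapping makes them enormous), so a core has to be opted into explicitly. A hypothetical launcher sketch — the binary path and flag values here are illustrative, not the CI's actual configuration:

```cpp
#include <cstdlib>
#include <sys/resource.h>
#include <unistd.h>

int main(int /*argc*/, char** argv) {
    // abort_on_error=1 turns the ASan report into an abort the kernel can
    // dump; disable_coredump=0 undoes ASan's 64-bit default of suppressing
    // cores due to the huge shadow mapping.
    setenv("ASAN_OPTIONS", "abort_on_error=1:disable_coredump=0", 1);

    // Remove the core size limit so the dump is not silently truncated.
    rlimit unlimited{RLIM_INFINITY, RLIM_INFINITY};
    setrlimit(RLIMIT_CORE, &unlimited);

    // Hypothetical install path; pass the original arguments through.
    execv("/opt/redpanda/bin/redpanda", argv);
    return 1;  // reached only if exec fails
}
```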


Lazin commented Oct 27, 2022

@BenPope BenPope added the sev/high loss of availability, pathological performance degradation, recoverable corruption label Nov 9, 2022

BenPope commented Nov 9, 2022

Probably a duplicate of #6973


abhijat commented Nov 10, 2022

I have been seeing a similar issue (maybe something different) while working on #7060 - it is reliably reproduced on CDT when running the debug build with a newly added test listed in that issue.

I have not been able to capture a core dump yet from the node where redpanda hits this error, but I will try to capture one.

AddressSanitizer:DEADLYSIGNAL
=================================================================
==51498==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f686df2209f bp 0x7ffdbbc2ec80 sp 0x7ffdbbc2e9e0 T0)
==51498==The signal is caused by a READ memory access.
==51498==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
TRACE 2022-11-05 07:20:14,046 [shard 3] storage - storage_resources.cc:202 - configuration_manager_take_bytes 27799771 += 786714 (current 1961145309)
TRACE 2022-11-05 07:20:14,046 [shard 1] storage - storage_resources.cc:215 - stm_take_bytes 6688052 += 917823 (current 2530399106)
TRACE 2022-11-05 07:20:14,047 [shard 3] raft - [group_id:94, {kafka/si_test_topic/93}] replicate_entries_stm.cc:196 - Leader append result: {append_time:68604536, base_offset:{216}, last_offset:{221}, byte_size:786714}
TRACE 2022-11-05 07:20:14,048 [shard 2] storage - storage_resources.cc:215 - stm_take_bytes 0 += 131169 (current 2526205552)
TRACE 2022-11-05 07:20:14,048 [shard 2] storage - readers_cache.cc:101 - {kafka/si_test_topic/20} - trying to get reader for: {start_offset:{226}, max_offset:{226}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-11-05 07:20:14,048 [shard 2] storage - readers_cache.cc:130 - {kafka/si_test_topic/20} - reader cache miss for: {start_offset:{226}, max_offset:{226}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}


abhijat commented Nov 10, 2022

I reran the test in CDT after rebasing on dev, hoping that the recent changes for lifetimes would improve the situation, but I still see this error, e.g.:

test_id:    rptest.scale_tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention
status:     FAIL
run time:   10 minutes 27.000 seconds


    <NodeCrash (ip-172-31-63-175,ip-172-31-54-63) ip-172-31-63-175: ==256048==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f14cda1709f bp 0x7ffc7d68e400 sp 0x7ffc7d68e160 T0)
>
Traceback (most recent call last):
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/cloud_retention_test.py", line 47, in test_cloud_retention
    producer.wait_for_acks(msg_count, timeout_sec=600, backoff_sec=5)
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 564, in wait_for_acks
    wait_until(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_crash()
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1369, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash (ip-172-31-63-175,ip-172-31-54-63) ip-172-31-63-175: ==256048==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f14cda1709f bp 0x7ffc7d68e400 sp 0x7ffc7d68e160 T0)
>


jcsp commented Nov 25, 2022

In the last 30 days we have not seen another case of PartitionBalancerTest.test_maintenance_mode.kill_same_node=True since the one that Evgeny linked 29 days ago (https://buildkite.com/redpanda/redpanda/builds/17411#018417da-00d8-435b-95c1-c905f841eec1).

I don't think we have any clues to work with here, and there have been fixes for general lifetime issues in coroutine lambdas since then (via #6973), so closing.
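
For context, the bug class addressed there is the well-known C++ pitfall that a coroutine lambda's captures live in the lambda object, not in the coroutine frame, so a temporary lambda's captures dangle after the first suspension point. A minimal seastar-flavoured sketch (hypothetical code, not the actual fix):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>
#include <string>

// BUG: captures are stored in the lambda object itself, which here is a
// temporary destroyed at the end of the full expression -- long before the
// coroutine body resumes after its first co_await. `msg` dangles from then on.
seastar::future<> broken(std::string msg) {
    return [msg]() -> seastar::future<> {
        co_await seastar::sleep(std::chrono::milliseconds(10));
        std::cout << msg << "\n";   // use-after-free
    }();
}

// FIX: make the enclosing function the coroutine, so the state lives in the
// coroutine frame, which stays alive until the future resolves.
seastar::future<> fixed(std::string msg) {
    co_await seastar::sleep(std::chrono::milliseconds(10));
    std::cout << msg << "\n";       // msg is part of this coroutine's frame
}
```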

@jcsp jcsp closed this as completed Nov 25, 2022