
CI Failure (transactions stuck in not_leader_for_partition loop) in CompactionE2EIdempotencyTest.test_basic_compaction #8486

Closed
rystsov opened this issue Jan 28, 2023 · 5 comments · Fixed by #8624
Labels: area/transactions, ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

rystsov commented Jan 28, 2023

https://buildkite.com/redpanda/redpanda/builds/22020#0185f9a4-0ade-4701-961d-4d098bc63128

Module: rptest.tests.compaction_e2e_test
Class:  CompactionE2EIdempotencyTest
Method: test_basic_compaction
Arguments:
{
  "initial_cleanup_policy": "compact",
  "workload": "TX"
}

It isn't enough to have the same stacktrace in report.txt; you need to look at the compacted verifier & redpanda logs to confirm the same root cause (see the comments below).

test_id:    rptest.tests.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.TX
status:     FAIL
run time:   1 minute 21.073 seconds

    TimeoutError("writes got stuck: hasn't written 100 records in 30s")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/compaction_e2e_test.py", line 122, in test_basic_compaction
    rw_verifier.ensure_progress(expect_progress, 30)
  File "/root/tests/rptest/services/compacted_verifier.py", line 77, in ensure_progress
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: writes got stuck: hasn't written 100 records in 30s
rystsov added the kind/bug and ci-failure labels Jan 28, 2023
rystsov self-assigned this Jan 28, 2023
rystsov added the sev/high label Jan 28, 2023

rystsov commented Jan 28, 2023

Setting sev/high because the issue leads to loss of availability

rystsov commented Jan 28, 2023

There are several failures in the linked build; here we look at the results/2023-01-28--001/CompactionE2EIdempotencyTest/test_basic_compaction/initial_cleanup_policy=compact.workload=Workload.TX/35 run.

Let's check the compacted verifier log (rw.log) to see which operation is failing.

We see that the test has three producers:

0	0	-1	start	producer
1	1	1	init
2	1	0	init
3	1	2	init

but at the end of the log only two are still active:

112120	61	0	w	42651
112121	61	1	w	42267
112122	61	0	w	42652
112123	61	1	w	42268

Using the ^\d+\s\d+\s2 regex, we find that the test is stuck at a write operation to the 2nd partition:

57277	26	2	w	17006
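
A minimal sketch of that per-partition scan (assuming the rw.log columns are op id, second, partition, op, value; the column meaning is inferred from the excerpts above, not a documented format):

import re

# Each rw.log line looks like "<op_id> <second> <partition> <op> [<value>]";
# the column semantics are an inference from the excerpts above.
line_re = re.compile(r"^(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)(?:\s+(\S+))?")

last_write = {}  # partition -> (op_id, value) of the last write seen

with open("rw.log") as f:
    for raw in f:
        m = line_re.match(raw.strip())
        if not m:
            continue
        op_id, _second, partition, op, value = m.groups()
        if op == "w":
            last_write[int(partition)] = (int(op_id), value)

# Partitions 0 and 1 keep advancing past op 112120, while partition 2 stops at
# op 57277 / value 17006 -- partition 2 is the stuck one.
for partition, (op_id, value) in sorted(last_write.items()):
    print(f"partition {partition}: last write op_id={op_id} value={value}")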

Looking at test_log.debug, we learn that the topic's name is topic-fcsstrzylp:

[DEBUG - 2023-01-28 19:13:25,344 - kafka_cli_tools - _execute - lineno:414]: Created topic topic-fcsstrzylp.

By grepping for kafka/topic-fcsstrzylp/2.+became in the redpanda logs we find that docker-rp-7 hosts the leader of the "faulty" partition.
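
For reference, a minimal sketch of that grep (the per-node layout redpanda_logs/docker-rp-*/redpanda.log is an illustrative assumption, not the actual test artifact layout):

import re
from pathlib import Path

# The pattern mirrors the grep above; swap it for the replicate_tx pattern
# used next to locate the produce handling on the same node.
pattern = re.compile(r"kafka/topic-fcsstrzylp/2.+became")

for log_path in Path("redpanda_logs").glob("docker-rp-*/redpanda.log"):
    for line in log_path.read_text(errors="replace").splitlines():
        if pattern.search(line):
            print(f"{log_path.parent.name}: {line}")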

By searching for kafka/topic-fcsstrzylp/2.+processing name:replicate_tx (the write operation) from the end of the log, we get:

TRACE 2023-01-28 19:14:27,780 [shard 1] kafka - handler.h:68 - [client_id: {producer-99236e36-a589-4787-99cf-7fa92a7dbf67}] handling produce request {transactional_id={{99236e36-a589-4787-99cf-7fa92a7dbf67}} acks=-1 timeout_ms=30000 topics={{name={topic-fcsstrzylp} partitions={{partition_index=2 records={batch 1106 v2_format true valid_crc true}}}}}}
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:2
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1397 - Status of the original attempt pid:{producer_identity: id=2, epoch=0} seq:17006 is unknown. Returning not_leader_for_partition to trigger retry.
TRACE 2023-01-28 19:14:27,780 [shard 1] kafka - request_context.h:179 - [172.16.16.5:52200] sending 0:produce for {producer-99236e36-a589-4787-99cf-7fa92a7dbf67}, response {responses={{name={topic-fcsstrzylp} partitions={{partition_index=2 error_code={ error_code: not_leader_for_partition [6] } base_offset=-9223372036854775808 log_append_time_ms={timestamp: missing} log_start_offset=-1 record_errors={} error_message={nullopt}}}}} throttle_time_ms=0}

and then we notice that the same thing happens over and over again (the not_leader_for_partition loop):

TRACE 2023-01-28 19:13:52,118 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:1
TRACE 2023-01-28 19:13:52,118 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
INFO  2023-01-28 19:13:52,118 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1431 - got raft::errc:4 on replicating tx data batch for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:13:55,079 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:2682 - Removing abort indexes 0 with offset < 0
TRACE 2023-01-28 19:13:55,130 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:2
TRACE 2023-01-28 19:13:55,130 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:13:55,130 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1397 - Status of the original attempt pid:{producer_identity: id=2, epoch=0} seq:17006 is unknown. Returning not_leader_for_partition to trigger retry.
...
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:2
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1397 - Status of the original attempt pid:{producer_identity: id=2, epoch=0} seq:17006 is unknown. Returning not_leader_for_partition to trigger retry.
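
In other words, the broker keeps remembering an unresolved attempt for seq 17006 from the previous term and keeps answering every retry with not_leader_for_partition. A simplified Python model of that decision (illustrative only, not the actual rm_stm.cc code; names and structure are assumptions):

seq_cache = {}  # producer id -> {"seq": last seq seen, "status": "unknown" | "resolved"}

def replicate_tx(pid, seq):
    entry = seq_cache.get(pid)
    if entry and entry["seq"] == seq and entry["status"] == "unknown":
        # The fate of the original attempt is not known, so the broker answers
        # not_leader_for_partition, expecting a retry to land on a leader that
        # can resolve it.
        return "not_leader_for_partition"
    # Otherwise accept the write and remember it as in flight ("unknown")
    # until replication completes.
    seq_cache[pid] = {"seq": seq, "status": "unknown"}
    return "ok"

# Term 1: the write for seq 17006 is accepted, but replication fails with
# raft::errc:4 and the entry is never marked resolved.
print(replicate_tx(pid=2, seq=17006))   # -> ok

# Term 2: the stale "unknown" entry survives re-election, so every retry of
# the same (pid, seq) takes the first branch: the not_leader_for_partition loop.
print(replicate_tx(pid=2, seq=17006))   # -> not_leader_for_partition
print(replicate_tx(pid=2, seq=17006))   # -> not_leader_for_partition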

rystsov commented Jan 28, 2023

It looks like we don't clean the cache on re-election, so we need to do the following (a sketch of the idea follows this list):

  • reset log_state::seq_table::cache on sync
  • do not fill cache on restoring from snapshot
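
In terms of the toy model above, the fix looks roughly like this (a hedged sketch only; the real change lives in rm_stm.cc, see the prototype PR linked below, and the function names here are made up):

def reset_cache_on_sync(seq_cache):
    # On syncing into a new term, everything the previous leader had in flight
    # is guaranteed to be resolved and visible, so drop the stale "unknown"
    # entries instead of letting them block retries forever.
    for pid in [p for p, e in seq_cache.items() if e["status"] == "unknown"]:
        del seq_cache[pid]

def load_from_snapshot(snapshot_seqs):
    # Rebuild the seq table from a snapshot without recreating "unknown"
    # in-flight entries -- they can never be resolved after a restart.
    return {pid: {"seq": seq, "status": "resolved"} for pid, seq in snapshot_seqs.items()}

With that reset in place, the retry of seq 17006 in term 2 no longer matches a stale "unknown" entry and is handled normally.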

rystsov commented Feb 4, 2023

prototype of the fix - #8624

rystsov added a commit to rystsov/redpanda that referenced this issue Feb 7, 2023
once a new leader is elected we need to remove info about unresolved inflight replication because it's guaranteed that all unresolved replication of the previous leader is resolved and the new leader sees all its written records

the residual info was causing the not_leader_for_partition loop

fix redpanda-data#8486
rystsov added a commit to rystsov/redpanda that referenced this issue Feb 9, 2023
andijcr pushed a commit to andijcr/redpanda that referenced this issue Feb 10, 2023

rystsov commented Feb 14, 2023

/backport v22.3.x

rystsov added a commit to rystsov/redpanda that referenced this issue Feb 15, 2023 (cherry picked from commit 95e809f)
rystsov added a commit to rystsov/redpanda that referenced this issue Feb 15, 2023 (cherry picked from commit 95e809f)
rystsov added a commit to rystsov/redpanda that referenced this issue Mar 3, 2023 (cherry picked from commit 95e809f)
This issue was closed.