
CI Failure (transactions stuck in not_leader_for_partition loop) in CompactionE2EIdempotencyTest.test_basic_compaction #8486

Closed
rystsov opened this issue Jan 28, 2023 · 5 comments · Fixed by #8624
Labels: area/transactions, ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

rystsov commented Jan 28, 2023

https://buildkite.com/redpanda/redpanda/builds/22020#0185f9a4-0ade-4701-961d-4d098bc63128

Module: rptest.tests.compaction_e2e_test
Class:  CompactionE2EIdempotencyTest
Method: test_basic_compaction
Arguments:
{
  "initial_cleanup_policy": "compact",
  "workload": "TX"
}

It isn't enough to have the same stacktrace in report.txt; you need to look at the compacted verifier & redpanda logs to confirm the same root cause (see the comments below).

test_id:    rptest.tests.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.TX
status:     FAIL
run time:   1 minute 21.073 seconds

    TimeoutError("writes got stuck: hasn't written 100 records in 30s")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/compaction_e2e_test.py", line 122, in test_basic_compaction
    rw_verifier.ensure_progress(expect_progress, 30)
  File "/root/tests/rptest/services/compacted_verifier.py", line 77, in ensure_progress
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: writes got stuck: hasn't written 100 records in 30s
rystsov added the kind/bug and ci-failure labels Jan 28, 2023
rystsov self-assigned this Jan 28, 2023
rystsov added the sev/high label Jan 28, 2023

rystsov commented Jan 28, 2023

Setting sev/high because the issue leads to loss of availability

rystsov commented Jan 28, 2023

There are several failures in the linked build; here we look at the results/2023-01-28--001/CompactionE2EIdempotencyTest/test_basic_compaction/initial_cleanup_policy=compact.workload=Workload.TX/35 run.

Let's check the compacted verifier log (rw.log) to see which operation is failing.

We see that the test has three producers:

0	0	-1	start	producer
1	1	1	init
2	1	0	init
3	1	2	init

but at the end of the log only two are still active:

112120	61	0	w	42651
112121	61	1	w	42267
112122	61	0	w	42652
112123	61	1	w	42268

Using the ^\d+\s\d+\s2 regex, we find that the test is stuck at a write operation to the 2nd partition:

57277	26	2	w	17006
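
A minimal sketch of that per-partition scan (assuming the rw.log columns are op id, second, partition, op, value; the column meaning is inferred from the excerpts above, not a documented format):

import re

# Each rw.log line looks like "<op_id> <second> <partition> <op> [<value>]";
# the column semantics are an inference from the excerpts above.
line_re = re.compile(r"^(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)(?:\s+(\S+))?")

last_write = {}  # partition -> (op_id, value) of the last write seen

with open("rw.log") as f:
    for raw in f:
        m = line_re.match(raw.strip())
        if not m:
            continue
        op_id, _second, partition, op, value = m.groups()
        if op == "w":
            last_write[int(partition)] = (int(op_id), value)

# Partitions 0 and 1 keep advancing past op 112120, while partition 2 stops at
# op 57277 / value 17006 -- partition 2 is the stuck one.
for partition, (op_id, value) in sorted(last_write.items()):
    print(f"partition {partition}: last write op_id={op_id} value={value}")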

Looking at test_log.debug, we learn that the topic's name is topic-fcsstrzylp:

[DEBUG - 2023-01-28 19:13:25,344 - kafka_cli_tools - _execute - lineno:414]: Created topic topic-fcsstrzylp.

By grepping for kafka/topic-fcsstrzylp/2.+became in the redpanda logs we find that docker-rp-7 hosts the leader of the "faulty" partition.
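
For reference, a minimal sketch of that grep (the per-node layout redpanda_logs/docker-rp-*/redpanda.log is an illustrative assumption, not the actual test artifact layout):

import re
from pathlib import Path

# The pattern mirrors the grep above; swap it for the replicate_tx pattern
# used next to locate the produce handling on the same node.
pattern = re.compile(r"kafka/topic-fcsstrzylp/2.+became")

for log_path in Path("redpanda_logs").glob("docker-rp-*/redpanda.log"):
    for line in log_path.read_text(errors="replace").splitlines():
        if pattern.search(line):
            print(f"{log_path.parent.name}: {line}")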

By searching for kafka/topic-fcsstrzylp/2.+processing name:replicate_tx (the write operation) from the end of the log, we get:

TRACE 2023-01-28 19:14:27,780 [shard 1] kafka - handler.h:68 - [client_id: {producer-99236e36-a589-4787-99cf-7fa92a7dbf67}] handling produce request {transactional_id={{99236e36-a589-4787-99cf-7fa92a7dbf67}} acks=-1 timeout_ms=30000 topics={{name={topic-fcsstrzylp} partitions={{partition_index=2 records={batch 1106 v2_format true valid_crc true}}}}}}
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:2
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1397 - Status of the original attempt pid:{producer_identity: id=2, epoch=0} seq:17006 is unknown. Returning not_leader_for_partition to trigger retry.
TRACE 2023-01-28 19:14:27,780 [shard 1] kafka - request_context.h:179 - [172.16.16.5:52200] sending 0:produce for {producer-99236e36-a589-4787-99cf-7fa92a7dbf67}, response {responses={{name={topic-fcsstrzylp} partitions={{partition_index=2 error_code={ error_code: not_leader_for_partition [6] } base_offset=-9223372036854775808 log_append_time_ms={timestamp: missing} log_start_offset=-1 record_errors={} error_message={nullopt}}}}} throttle_time_ms=0}

and then we notice that the same thing happens over and over again (the not_leader_for_partition loop):

TRACE 2023-01-28 19:13:52,118 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:1
TRACE 2023-01-28 19:13:52,118 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
INFO  2023-01-28 19:13:52,118 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1431 - got raft::errc:4 on replicating tx data batch for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:13:55,079 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:2682 - Removing abort indexes 0 with offset < 0
TRACE 2023-01-28 19:13:55,130 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:2
TRACE 2023-01-28 19:13:55,130 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:13:55,130 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1397 - Status of the original attempt pid:{producer_identity: id=2, epoch=0} seq:17006 is unknown. Returning not_leader_for_partition to trigger retry.
...
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1332 - processing name:replicate_tx pid:{producer_identity: id=2, epoch=0} in term:2
TRACE 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1355 - found tx_seq:1701 for pid:{producer_identity: id=2, epoch=0}
DEBUG 2023-01-28 19:14:27,780 [shard 0] tx - [{kafka/topic-fcsstrzylp/2}] - rm_stm.cc:1397 - Status of the original attempt pid:{producer_identity: id=2, epoch=0} seq:17006 is unknown. Returning not_leader_for_partition to trigger retry.
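
In other words, the broker keeps remembering an unresolved attempt for seq 17006 from the previous term and keeps answering every retry with not_leader_for_partition. A simplified Python model of that decision (illustrative only, not the actual rm_stm.cc code; names and structure are assumptions):

seq_cache = {}  # producer id -> {"seq": last seq seen, "status": "unknown" | "resolved"}

def replicate_tx(pid, seq):
    entry = seq_cache.get(pid)
    if entry and entry["seq"] == seq and entry["status"] == "unknown":
        # The fate of the original attempt is not known, so the broker answers
        # not_leader_for_partition, expecting a retry to land on a leader that
        # can resolve it.
        return "not_leader_for_partition"
    # Otherwise accept the write and remember it as in flight ("unknown")
    # until replication completes.
    seq_cache[pid] = {"seq": seq, "status": "unknown"}
    return "ok"

# Term 1: the write for seq 17006 is accepted, but replication fails with
# raft::errc:4 and the entry is never marked resolved.
print(replicate_tx(pid=2, seq=17006))   # -> ok

# Term 2: the stale "unknown" entry survives re-election, so every retry of
# the same (pid, seq) takes the first branch: the not_leader_for_partition loop.
print(replicate_tx(pid=2, seq=17006))   # -> not_leader_for_partition
print(replicate_tx(pid=2, seq=17006))   # -> not_leader_for_partition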

rystsov commented Jan 28, 2023

It looks like we don't clean the cache on re-election, so we need to do the following (a sketch of the idea follows this list):

  • reset log_state::seq_table::cache on sync
  • do not fill cache on restoring from snapshot
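
In terms of the toy model above, the fix looks roughly like this (a hedged sketch only; the real change lives in rm_stm.cc, see the prototype PR linked below, and the function names here are made up):

def reset_cache_on_sync(seq_cache):
    # On syncing into a new term, everything the previous leader had in flight
    # is guaranteed to be resolved and visible, so drop the stale "unknown"
    # entries instead of letting them block retries forever.
    for pid in [p for p, e in seq_cache.items() if e["status"] == "unknown"]:
        del seq_cache[pid]

def load_from_snapshot(snapshot_seqs):
    # Rebuild the seq table from a snapshot without recreating "unknown"
    # in-flight entries -- they can never be resolved after a restart.
    return {pid: {"seq": seq, "status": "resolved"} for pid, seq in snapshot_seqs.items()}

With that reset in place, the retry of seq 17006 in term 2 no longer matches a stale "unknown" entry and is handled normally.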

rystsov commented Feb 4, 2023

prototype of the fix - #8624

rystsov added a commit to rystsov/redpanda that referenced this issue Feb 7, 2023
once a new leader is elected we need to remove info about unresolved inflight replication because it's guaranteed that all unresolved replication of the previous leader is resolved and the new leader sees all its written records

the residual info was causing the not_leader_for_partition loop

fix redpanda-data#8486
rystsov added a commit to rystsov/redpanda that referenced this issue Feb 9, 2023
andijcr pushed a commit to andijcr/redpanda that referenced this issue Feb 10, 2023

rystsov commented Feb 14, 2023

/backport v22.3.x

rystsov added a commit to rystsov/redpanda that referenced this issue Feb 15, 2023 (cherry picked from commit 95e809f)
rystsov added a commit to rystsov/redpanda that referenced this issue Feb 15, 2023 (cherry picked from commit 95e809f)
rystsov added a commit to rystsov/redpanda that referenced this issue Mar 3, 2023 (cherry picked from commit 95e809f)
This issue was closed.