Consumer failed to consume up to offsets in SIPartitionMovementTest.test_shadow_indexing #4702

Closed
rystsov opened this issue May 11, 2022 · 20 comments
Labels: area/cloud-storage (Shadow indexing subsystem), area/tests, ci-failure, kind/bug (Something isn't working)

Comments

@rystsov
Contributor

rystsov commented May 11, 2022

https://buildkite.com/redpanda/redpanda/builds/9985#74311a35-dfef-42fb-a669-5e752e0248a0

Module: rptest.tests.partition_movement_test
Class:  SIPartitionMovementTest
Method: test_shadow_indexing
test_id:    rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing
status:     FAIL
run time:   2 minutes 1.397 seconds

    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 622, in test_shadow_indexing
    self.run_validation(enable_idempotence=False,
  File "/root/tests/rptest/tests/end_to_end.py", line 188, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 154, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.
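For reference, the timeout comes out of the end-to-end validation step: await_consumed_offsets polls until the consumer has reached every offset the producer got acked. A minimal sketch of that wait loop, assuming a hypothetical consumer.current_position(tp) helper (names are illustrative, not the exact rptest code):

from ducktape.utils.util import wait_until

def await_consumed_offsets(consumer, last_acked_offsets, timeout_sec=45):
    """Block until the consumer's position reaches the producer's last acked offsets."""
    def has_finished_consuming():
        # Every partition must have been consumed at least up to the offset
        # the producer last saw acknowledged.
        for tp, offset in last_acked_offsets.items():
            pos = consumer.current_position(tp)
            if pos is None or pos < offset:
                return False
        return True

    wait_until(
        has_finished_consuming,
        timeout_sec=timeout_sec,
        err_msg=lambda: "Consumer failed to consume up to offsets "
        f"{last_acked_offsets} after waiting {timeout_sec}s.")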
@rystsov rystsov added the kind/bug, area/tests, and ci-failure labels on May 11, 2022
@rystsov
Contributor Author

rystsov commented May 11, 2022

The logs also have this record:

[WARNING - 2022-05-11 07:08:19,191 - service_registry - stop_all - lineno:55]: Error stopping service <VerifiableConsumer-0-140554673463360: num_nodes: 1, nodes: ['docker-rp-12']>: VerifiableConsumer-0-140554673463360-worker-1: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
    self._worker(idx, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 266, in _worker
    handler.handle_records_consumed(event, self.logger)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 100, in handle_records_consumed
    raise AssertionError(msg)
AssertionError: Consumed from an unexpected offset (1879, 1881) for partition TopicPartition(topic='topic', partition=0)

It comes from

msg = "Consumed from an unexpected offset (%d, %d) for partition %s" % \

If we look at the verifiable_producer.stdout log we'll find:

{"timestamp":1652252831017,"name":"producer_send_success","key":null,"value":"1.1942","offset":1878,"topic":"topic","partition":0}
{"timestamp":1652252831043,"name":"producer_send_error","topic":"topic","key":null,"value":"1.1943","exception":"class org.apache.kafka.common.errors.TimeoutException","message":"The request timed out."}
...
{"timestamp":1652252836015,"name":"producer_send_error","topic":"topic","key":null,"value":"1.2454","exception":"class org.apache.kafka.common.errors.NotLeaderOrFollowerException","message":"For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition."}
{"timestamp":1652252836198,"name":"producer_send_success","key":null,"value":"1.2455","offset":1881,"topic":"topic","partition":0}

The record which should have been written at offset 1879 failed with an indecisive error (a timeout). The next successfully written offset is 1881. So it seems the test is wrong: it treats timeouts as definite errors and complains when it sees 1879 instead of 1881.
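One way a verifier can stay sound here (a sketch of the idea, not the rptest implementation): treat a timeout as "possibly written" rather than "failed", and have the consumer accept either outcome for those values.

from kafka.errors import KafkaTimeoutError

class AckTracker:
    """Illustrative producer-side bookkeeping; names are made up for this sketch."""

    def __init__(self):
        self.acked = {}             # value -> offset the broker confirmed
        self.maybe_written = set()  # values whose outcome is unknown

    def on_send_success(self, value, offset):
        self.acked[value] = offset

    def on_send_error(self, value, exc):
        if isinstance(exc, KafkaTimeoutError):
            # A timeout is indecisive: the broker may still have committed the
            # record (offset 1879 above), so the consumer must tolerate both
            # seeing this value and not seeing it.
            self.maybe_written.add(value)
        # Only a definite rejection means the value is guaranteed to be absent.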

@dimitriscruz
Contributor

Another instance

FAIL test: SIPartitionMovementTest.test_shadow_indexing (1/48 runs)
  failure at 2022-05-20T07:47:03.291Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3863} after waiting 45s.")
      in job https://buildkite.com/redpanda/redpanda/builds/10330#f402f79c-2888-4625-9c50-2db1d2799aa5

@piyushredpanda piyushredpanda assigned rystsov and unassigned LenaAn May 24, 2022
@BenPope
Member

BenPope commented Jun 14, 2022

@VladLazar
Contributor

@rystsov
Contributor Author

rystsov commented Jul 8, 2022

#5238 introduces a new workload & online verifier which isn't subject to such errors; once we switch to it, this issue will be fixed.

@bharathv
Contributor

@twmb
Contributor

twmb commented Aug 5, 2022

@BenPope
Member

BenPope commented Aug 8, 2022

@dotnwat
Member

dotnwat commented Aug 10, 2022

@abhijat
Contributor

abhijat commented Aug 17, 2022

@rystsov
Contributor Author

rystsov commented Aug 20, 2022

@jcsp
Contributor

jcsp commented Aug 20, 2022

When I looked into https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5 (technically #6076 but same failure mode as this issue) it looks a lot like the server really is leaving a gap in the kafka offsets. I don't think the consumer is paying any attention to what was/wasn't acked to the producer: it's just consuming the partition, and seeing the kafka offset jump forward around the same point in the log where we did a partition move.

It's not clear to me that this is valid behavior in redpanda.

@rystsov
Contributor Author

rystsov commented Aug 23, 2022

@rystsov
Contributor Author

rystsov commented Aug 23, 2022

it looks a lot like the server really is leaving a gap in the kafka offsets ... It's not clear to me that this is valid behavior in redpanda.

@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then the verifiable consumer will think there is a gap while in reality all the offsets are monotonic; when reality collides with those expectations we get "Consumer failed to consume up to offsets".

@jcsp
Contributor

jcsp commented Aug 23, 2022

@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then the verifiable consumer will think there is a gap while in reality all the offsets are monotonic; when reality collides with those expectations we get "Consumer failed to consume up to offsets".

What I saw in the consumer stdout was that its fetch responses were skipping offsets (this is in run https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5):

{"timestamp":1660999734132,"name":"records_consumed","count":2,"partitions":[{"topic":"topic","partition":0,"count":2,"minOffset":2975,"maxOffset":2976}]}
{"timestamp":1660999734153,"name":"offsets_committed","offsets":[{"topic":"topic","partition":0,"offset":2977}],"success":true}
{"timestamp":1660999738330,"name":"record_data","key":null,"value":"1.3444","topic":"topic","partition":0,"offset":2983}
{"timestamp":1660999738330,"name":"records_consumed","count":1,"partitions":[{"topic":"topic","partition":0,"count":1,"minOffset":2983,"maxOffset":2983}]}

This isn't code that checks the producer's claims: it's just consuming the topic naively and seeing a jump in the offsets returned by the server.

Looking on the server side, this jump corresponds with a hop between nodes.

The last contiguous read response was from docker-rp-5:

RedpandaService-0-140283699091472/docker-rp-5/redpanda.log:TRACE 2022-08-20 12:48:54,127 [shard 1] kafka - request_context.h:160 - [172.16.16.12:50054] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1073741824 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2977 last_stable_offset=2977 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 87}}}}}}}

Then on docker-rp-12 a few seconds later we see this:

TRACE 2022-08-20 12:48:58,208 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: offset_out_of_range [1] } high_watermark=-1 last_stable_offset=-1 log_start_offset=-1 aborted={nullopt} preferred_read_replica=-1 records={{size 0}}}}}}}
TRACE 2022-08-20 12:48:58,329 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2984 last_stable_offset=2984 log_start_offset=2983 aborted={nullopt} preferred_read_replica=-1 records={{size 74}}}}}}}

It looks like the log_start_offset got bumped up to 2983 during the movement. So now this looks like the server has prefix-truncated the log incorrectly.
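For anyone triaging similar runs, a quick way to spot such a jump from the verifiable consumer's stdout (a throwaway sketch that assumes the JSON-lines format shown above):

import json
import sys

def find_offset_gaps(stdout_path):
    """Report fetches that do not start right after the previous batch."""
    last_max = {}  # (topic, partition) -> maxOffset of the previous batch
    with open(stdout_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue
            if event.get("name") != "records_consumed":
                continue
            for p in event["partitions"]:
                key = (p["topic"], p["partition"])
                prev = last_max.get(key)
                if prev is not None and p["minOffset"] != prev + 1:
                    print(f"gap on {key}: jumped from {prev} to {p['minOffset']}")
                last_max[key] = p["maxOffset"]

if __name__ == "__main__":
    find_offset_gaps(sys.argv[1])

Against the log above this flags the hop from maxOffset 2976 to minOffset 2983.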

@rystsov
Contributor Author

rystsov commented Aug 23, 2022

Interesting. So we have several issues hidden behind the "Consumer failed to consume up to offsets" umbrella.

@mmaslankaprv
Member

mmaslankaprv commented Aug 25, 2022

The root cause in this issue is exactly the same as in #6076

@jcsp
Contributor

jcsp commented Aug 25, 2022

@Lazin #6229 also fixes this, right?

@Lazin Lazin closed this as completed Aug 25, 2022