Consumer failed to consume up to offsets in SIPartitionMovementTest.test_shadow_indexing #4702

Closed
rystsov opened this issue May 11, 2022 · 20 comments
Labels: area/cloud-storage (Shadow indexing subsystem), area/tests, ci-failure, kind/bug (Something isn't working)

Comments

@rystsov
Contributor

rystsov commented May 11, 2022

https://buildkite.com/redpanda/redpanda/builds/9985#74311a35-dfef-42fb-a669-5e752e0248a0

Module: rptest.tests.partition_movement_test
Class:  SIPartitionMovementTest
Method: test_shadow_indexing
test_id:    rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing
status:     FAIL
run time:   2 minutes 1.397 seconds

    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 622, in test_shadow_indexing
    self.run_validation(enable_idempotence=False,
  File "/root/tests/rptest/tests/end_to_end.py", line 188, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 154, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.
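For reference, the timeout comes out of the end-to-end validation step: await_consumed_offsets polls until the consumer has reached every offset the producer got acked. A minimal sketch of that wait loop, assuming a hypothetical consumer.current_position(tp) helper (names are illustrative, not the exact rptest code):

from ducktape.utils.util import wait_until

def await_consumed_offsets(consumer, last_acked_offsets, timeout_sec=45):
    """Block until the consumer's position reaches the producer's last acked offsets."""
    def has_finished_consuming():
        # Every partition must have been consumed at least up to the offset
        # the producer last saw acknowledged.
        for tp, offset in last_acked_offsets.items():
            pos = consumer.current_position(tp)
            if pos is None or pos < offset:
                return False
        return True

    wait_until(
        has_finished_consuming,
        timeout_sec=timeout_sec,
        err_msg=lambda: "Consumer failed to consume up to offsets "
        f"{last_acked_offsets} after waiting {timeout_sec}s.")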
@rystsov rystsov added the kind/bug, area/tests, and ci-failure labels on May 11, 2022
@rystsov
Contributor Author

rystsov commented May 11, 2022

The logs also have this record:

[WARNING - 2022-05-11 07:08:19,191 - service_registry - stop_all - lineno:55]: Error stopping service <VerifiableConsumer-0-140554673463360: num_nodes: 1, nodes: ['docker-rp-12']>: VerifiableConsumer-0-140554673463360-worker-1: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
    self._worker(idx, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 266, in _worker
    handler.handle_records_consumed(event, self.logger)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 100, in handle_records_consumed
    raise AssertionError(msg)
AssertionError: Consumed from an unexpected offset (1879, 1881) for partition TopicPartition(topic='topic', partition=0)

It comes from

msg = "Consumed from an unexpected offset (%d, %d) for partition %s" % \

If we look at the verifiable_producer.stdout log we'll find:

{"timestamp":1652252831017,"name":"producer_send_success","key":null,"value":"1.1942","offset":1878,"topic":"topic","partition":0}
{"timestamp":1652252831043,"name":"producer_send_error","topic":"topic","key":null,"value":"1.1943","exception":"class org.apache.kafka.common.errors.TimeoutException","message":"The request timed out."}
...
{"timestamp":1652252836015,"name":"producer_send_error","topic":"topic","key":null,"value":"1.2454","exception":"class org.apache.kafka.common.errors.NotLeaderOrFollowerException","message":"For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition."}
{"timestamp":1652252836198,"name":"producer_send_success","key":null,"value":"1.2455","offset":1881,"topic":"topic","partition":0}

The record which should have been written at offset 1879 failed with an indecisive error (a timeout). The next successfully written offset is 1881. So it seems the test is wrong: it treats timeouts as definite errors and complains when it sees 1879 instead of 1881.
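One way a verifier can stay sound here (a sketch of the idea, not the rptest implementation): treat a timeout as "possibly written" rather than "failed", and have the consumer accept either outcome for those values.

from kafka.errors import KafkaTimeoutError

class AckTracker:
    """Illustrative producer-side bookkeeping; names are made up for this sketch."""

    def __init__(self):
        self.acked = {}             # value -> offset the broker confirmed
        self.maybe_written = set()  # values whose outcome is unknown

    def on_send_success(self, value, offset):
        self.acked[value] = offset

    def on_send_error(self, value, exc):
        if isinstance(exc, KafkaTimeoutError):
            # A timeout is indecisive: the broker may still have committed the
            # record (offset 1879 above), so the consumer must tolerate both
            # seeing this value and not seeing it.
            self.maybe_written.add(value)
        # Only a definite rejection means the value is guaranteed to be absent.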

@dimitriscruz
Contributor

Another instance

FAIL test: SIPartitionMovementTest.test_shadow_indexing (1/48 runs)
  failure at 2022-05-20T07:47:03.291Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3863} after waiting 45s.")
      in job https://buildkite.com/redpanda/redpanda/builds/10330#f402f79c-2888-4625-9c50-2db1d2799aa5

@piyushredpanda piyushredpanda assigned rystsov and unassigned LenaAn May 24, 2022
@BenPope
Member

BenPope commented Jun 14, 2022

@VladLazar
Contributor

@rystsov
Contributor Author

rystsov commented Jul 8, 2022

#5238 introduces a new workload & online verifier which isn't subject to such errors; once we switch to it, this issue will be fixed.

@bharathv
Contributor

@twmb
Contributor

twmb commented Aug 5, 2022

@BenPope
Member

BenPope commented Aug 8, 2022

@dotnwat
Member

dotnwat commented Aug 10, 2022

@abhijat
Contributor

abhijat commented Aug 17, 2022

@rystsov
Contributor Author

rystsov commented Aug 20, 2022

@jcsp
Contributor

jcsp commented Aug 20, 2022

When I looked into https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5 (technically #6076 but same failure mode as this issue) it looks a lot like the server really is leaving a gap in the kafka offsets. I don't think the consumer is paying any attention to what was/wasn't acked to the producer: it's just consuming the partition, and seeing the kafka offset jump forward around the same point in the log where we did a partition move.

It's not clear to me that this is valid behavior in redpanda.

@rystsov
Contributor Author

rystsov commented Aug 23, 2022

@rystsov
Contributor Author

rystsov commented Aug 23, 2022

it looks a lot like the server really is leaving a gap in the kafka offsets ... It's not clear to me that this is valid behavior in redpanda.

@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then the verifiable consumer will think there is a gap while in reality all the offsets are monotonic; when reality collides with those expectations we get "Consumer failed to consume up to offsets".

@jcsp
Contributor

jcsp commented Aug 23, 2022

@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then the verifiable consumer will think there is a gap while in reality all the offsets are monotonic; when reality collides with those expectations we get "Consumer failed to consume up to offsets".

What I saw in the consumer stdout was that its fetch responses were skipping offsets (this is in run https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5):

{"timestamp":1660999734132,"name":"records_consumed","count":2,"partitions":[{"topic":"topic","partition":0,"count":2,"minOffset":2975,"maxOffset":2976}]}
{"timestamp":1660999734153,"name":"offsets_committed","offsets":[{"topic":"topic","partition":0,"offset":2977}],"success":true}
{"timestamp":1660999738330,"name":"record_data","key":null,"value":"1.3444","topic":"topic","partition":0,"offset":2983}
{"timestamp":1660999738330,"name":"records_consumed","count":1,"partitions":[{"topic":"topic","partition":0,"count":1,"minOffset":2983,"maxOffset":2983}]}

This isn't code that checks the producer's claims: it's just consuming the topic naively and seeing a jump in the offsets returned by the server.

Looking on the server side, this jump corresponds with a hop between nodes.

The last contiguous read response was from docker-rp-5:

RedpandaService-0-140283699091472/docker-rp-5/redpanda.log:TRACE 2022-08-20 12:48:54,127 [shard 1] kafka - request_context.h:160 - [172.16.16.12:50054] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1073741824 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2977 last_stable_offset=2977 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 87}}}}}}}

Then on docker-rp-12 a few seconds later we see this:

TRACE 2022-08-20 12:48:58,208 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: offset_out_of_range [1] } high_watermark=-1 last_stable_offset=-1 log_start_offset=-1 aborted={nullopt} preferred_read_replica=-1 records={{size 0}}}}}}}
TRACE 2022-08-20 12:48:58,329 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2984 last_stable_offset=2984 log_start_offset=2983 aborted={nullopt} preferred_read_replica=-1 records={{size 74}}}}}}}

It looks like the log_start_offset got bumped up to 2983 during the movement. So now this looks like the server has prefix-truncated the log incorrectly.
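For anyone triaging similar runs, a quick way to spot such a jump from the verifiable consumer's stdout (a throwaway sketch that assumes the JSON-lines format shown above):

import json
import sys

def find_offset_gaps(stdout_path):
    """Report fetches that do not start right after the previous batch."""
    last_max = {}  # (topic, partition) -> maxOffset of the previous batch
    with open(stdout_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue
            if event.get("name") != "records_consumed":
                continue
            for p in event["partitions"]:
                key = (p["topic"], p["partition"])
                prev = last_max.get(key)
                if prev is not None and p["minOffset"] != prev + 1:
                    print(f"gap on {key}: jumped from {prev} to {p['minOffset']}")
                last_max[key] = p["maxOffset"]

if __name__ == "__main__":
    find_offset_gaps(sys.argv[1])

Against the log above this flags the hop from maxOffset 2976 to minOffset 2983.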

@rystsov
Contributor Author

rystsov commented Aug 23, 2022

Interesting. So we have several issues hidden behind the "Consumer failed to consume up to offsets" umbrella.

@mmaslankaprv
Member

mmaslankaprv commented Aug 25, 2022

The root cause in this issue is exactly the same as in #6076

@jcsp
Contributor

jcsp commented Aug 25, 2022

@Lazin #6229 also fixes this, right?

@Lazin Lazin closed this as completed Aug 25, 2022