CI failure: SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0 #6076

Closed
andrwng opened this issue Aug 17, 2022 · 5 comments
Labels: area/cloud-storage, area/tests, ci-failure, kind/bug

Comments

andrwng (Contributor) commented Aug 17, 2022

FAIL test: SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0 (1/24 runs)
  failure at 2022-08-17T10:39:45.559Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=5): 22962, TopicPartition(topic='topic', partition=0): 15855, TopicPartition(topic='topic', partition=2): 22930, TopicPartition(topic='topic', partition=6): 22831, TopicPartition(topic='topic', partition=9): 23392, TopicPartition(topic='topic', partition=1): 22515, TopicPartition(topic='topic', partition=7): 22906, TopicPartition(topic='topic', partition=4): 22688, TopicPartition(topic='topic', partition=8): 22613, TopicPartition(topic='topic', partition=3): 22911} after waiting 45s, last committed offsets: {TopicPartition(topic='topic', partition=2): 18069, TopicPartition(topic='topic', partition=3): 18254, TopicPartition(topic='topic', partition=5): 18224, TopicPartition(topic='topic', partition=8): 17965, TopicPartition(topic='topic', partition=1): 17804, TopicPartition(topic='topic', partition=7): 18098, TopicPartition(topic='topic', partition=9): 18624, TopicPartition(topic='topic', partition=0): 12370, TopicPartition(topic='topic', partition=4): 17919, TopicPartition(topic='topic', partition=6): 18029}.")
      in job https://buildkite.com/redpanda/redpanda/builds/14253#0182aaf0-b6f8-46f5-a9b1-0b320479e88b

Stack trace:

====================================================================================================
test_id:    rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0
status:     FAIL
run time:   7 minutes 6.485 seconds


    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=5): 22962, TopicPartition(topic='topic', partition=0): 15855, TopicPartition(topic='topic', partition=2): 22930, TopicPartition(topic='topic', partition=6): 22831, TopicPartition(topic='topic', partition=9): 23392, TopicPartition(topic='topic', partition=1): 22515, TopicPartition(topic='topic', partition=7): 22906, TopicPartition(topic='topic', partition=4): 22688, TopicPartition(topic='topic', partition=8): 22613, TopicPartition(topic='topic', partition=3): 22911} after waiting 45s, last committed offsets: {TopicPartition(topic='topic', partition=2): 18069, TopicPartition(topic='topic', partition=3): 18254, TopicPartition(topic='topic', partition=5): 18224, TopicPartition(topic='topic', partition=8): 17965, TopicPartition(topic='topic', partition=1): 17804, TopicPartition(topic='topic', partition=7): 18098, TopicPartition(topic='topic', partition=9): 18624, TopicPartition(topic='topic', partition=0): 12370, TopicPartition(topic='topic', partition=4): 17919, TopicPartition(topic='topic', partition=6): 18029}.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 884, in test_cross_shard
    self.run_validation(enable_idempotence=False,
  File "/root/tests/rptest/tests/end_to_end.py", line 226, in run_validation
    self.run_consumer_validation(
  File "/root/tests/rptest/tests/end_to_end.py", line 237, in run_consumer_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 185, in await_consumed_offsets
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=5): 22962, TopicPartition(topic='topic', partition=0): 15855, TopicPartition(topic='topic', partition=2): 22930, TopicPartition(topic='topic', partition=6): 22831, TopicPartition(topic='topic', partition=9): 23392, TopicPartition(topic='topic', partition=1): 22515, TopicPartition(topic='topic', partition=7): 22906, TopicPartition(topic='topic', partition=4): 22688, TopicPartition(topic='topic', partition=8): 22613, TopicPartition(topic='topic', partition=3): 22911} after waiting 45s, last committed offsets: {TopicPartition(topic='topic', partition=2): 18069, TopicPartition(topic='topic', partition=3): 18254, TopicPartition(topic='topic', partition=5): 18224, TopicPartition(topic='topic', partition=8): 17965, TopicPartition(topic='topic', partition=1): 17804, TopicPartition(topic='topic', partition=7): 18098, TopicPartition(topic='topic', partition=9): 18624, TopicPartition(topic='topic', partition=0): 12370, TopicPartition(topic='topic', partition=4): 17919, TopicPartition(topic='topic', partition=6): 18029}.

Similar to, if not the same underlying issue as, #4702.
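For context, the check that timed out is the end-to-end offset validation in end_to_end.py: the test waits until the consumer's committed offsets catch up with the producer's last acked offsets. A minimal sketch of that pattern, assuming a committed_offsets() accessor on the consumer (hypothetical; the real test uses the verifiable consumer's internal state):

from ducktape.utils.util import wait_until

def await_consumed_offsets(consumer, last_acked_offsets, timeout_sec=45):
    """Block until the consumer has committed at least the producer's last
    acked offset for every topic-partition, or raise TimeoutError."""

    def consumed_everything():
        # committed_offsets() is a hypothetical accessor returning
        # {TopicPartition: last committed offset} for the consumer group.
        committed = consumer.committed_offsets()
        return all(committed.get(tp, -1) >= offset
                   for tp, offset in last_acked_offsets.items())

    wait_until(
        consumed_everything,
        timeout_sec=timeout_sec,
        backoff_sec=1,
        err_msg=lambda: (f"Consumer failed to consume up to offsets "
                         f"{last_acked_offsets} after waiting {timeout_sec}s"))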

jcsp (Contributor) commented Aug 25, 2022

https://buildkite.com/redpanda/redpanda/builds/14666#0182d1bc-2a4c-4d87-87ff-e5d806fa1173

  File "/root/tests/rptest/services/verifiable_consumer.py", line 101, in handle_records_consumed
    raise AssertionError(msg)
AssertionError: Consumed from an unexpected offset (5139, 5141) for partition TopicPartition(topic='topic', partition=0)
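This assertion comes from the verifiable consumer's per-partition bookkeeping: it tracks the next offset it expects and fails if a fetched record skips ahead. Roughly (a sketch of the invariant only, not the actual handle_records_consumed code; in the message above the consumer expected 5139 next but received 5141):

def on_record_consumed(expected_next_offset, record_offset, tp):
    """Enforce that consumed offsets are contiguous within a partition."""
    if record_offset != expected_next_offset:
        raise AssertionError(
            f"Consumed from an unexpected offset "
            f"({expected_next_offset}, {record_offset}) for partition {tp}")
    return record_offset + 1  # next offset we expect to see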

mmaslankaprv (Member) commented
This failure seems to be an actual issue.
The consumed messages are missing a few offsets that were produced:

>>> producer

{"timestamp":1661381360746,"name":"producer_send_success","key":null,"value":"1.70613","offset":5138,"topic":"topic","partition":0}
{"timestamp":1661381360750,"name":"producer_send_success","key":null,"value":"1.70616","offset":5139,"topic":"topic","partition":0} < missing
{"timestamp":1661381360750,"name":"producer_send_success","key":null,"value":"1.70617","offset":5140,"topic":"topic","partition":0} < missing

{"timestamp":1661381363463,"name":"producer_send_success","key":null,"value":"1.73326","offset":5141,"topic":"topic","partition":0}
{"timestamp":1661381363463,"name":"producer_send_success","key":null,"value":"1.73327","offset":5142,"topic":"topic","partition":0}
{"timestamp":1661381363463,"name":"producer_send_success","key":null,"value":"1.73328","offset":5143,"topic":"topic","partition":0}
{"timestamp":1661381363463,"name":"producer_send_success","key":null,"value":"1.73329","offset":5144,"topic":"topic","partition":0}
{"timestamp":1661381363463,"name":"producer_send_success","key":null,"value":"1.73330","offset":5145,"topic":"topic","partition":0}

>>> consumer

{"timestamp":1661381360747,"name":"record_data","key":null,"value":"1.70613","topic":"topic","partition":0,"offset":5138}
>>> GAP
{"timestamp":1661381363469,"name":"record_data","key":null,"value":"1.73326","topic":"topic","partition":0,"offset":5141}
{"timestamp":1661381363469,"name":"record_data","key":null,"value":"1.73327","topic":"topic","partition":0,"offset":5142}

The gap in offsets is caused by the fact that the log start offset changes after the move:

TRACE 2022-08-24 22:49:23,521 [shard 0] kafka - request_context.h:160 - [172.16.16.7:42688] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=5148 last_stable_offset=5148 log_start_offset=5141 aborted={nullopt} preferred_read_replica=-1 records={{size 89}}}}}}}

Look at partition_index=0: log_start_offset changes from 0 (which is expected, since we have SI in place) to 5141. Offsets prior to 5141 are not readable, since the log was truncated up to that point.
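The same jump is visible from an ordinary Kafka client: with shadow indexing healthy, the earliest readable offset of the partition should stay at 0 because older segments are served from cloud storage, but after the faulty move it comes back as 5141. A hedged sketch using kafka-python (the bootstrap address is a placeholder):

from kafka import KafkaConsumer, TopicPartition

# Placeholder bootstrap address; in the ducktape test this would be one of
# the Redpanda brokers in the cluster under test.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
tp = TopicPartition("topic", 0)

# With shadow indexing working, the earliest readable offset should remain 0
# (older data is served from the archive). Here it would report 5141,
# matching log_start_offset in the fetch response above.
earliest = consumer.beginning_offsets([tp])[tp]
print(f"earliest readable offset for {tp}: {earliest}")
consumer.close()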

It looks like Redpanda wasn't able to retrieve the data from cloud storage:

DEBUG 2022-08-24 22:49:21,192 [shard 0] cluster - partition_manager.cc:192 - Logs can't be downloaded because cloud storage is not configured. Continue creating {ntp: {kafka/topic/0}, base_dir: /var/lib/redpanda/data, overrides: {compaction_strategy: {nullopt}, cleanup_policy_bitflags: {delete}, segment_size: {nullopt}, retention_bytes: {}, retention_time_ms: {}, recovery_enabled: false}, revision: 15, initial_revision: 15} witout downloading the logs.
TRACE 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:106 - ::get segment file /var/lib/redpanda/data/kafka/topic/0_15/3976-8-v1.log, refcount=0
DEBUG 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:110 - Opening segment file /var/lib/redpanda/data/kafka/topic/0_15/3976-8-v1.log
TRACE 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:133 - ::put segment file /var/lib/redpanda/data/kafka/topic/0_15/3976-8-v1.log, refcount=1
DEBUG 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:137 - Closing segment file /var/lib/redpanda/data/kafka/topic/0_15/3976-8-v1.log
TRACE 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:106 - ::get segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log, refcount=0
DEBUG 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:110 - Opening segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log
TRACE 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:133 - ::put segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log, refcount=1
DEBUG 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:137 - Closing segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log
DEBUG 2022-08-24 22:49:21,193 [shard 0] storage - log_replayer.cc:95 - Recovering segment {offset_tracker:{term:9, base_offset:4597, committed_offset:5157, dirty_offset:5157}, compacted_segment=0, finished_self_compaction=0, generation={0}, reader={/var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log, (25344 bytes)}, writer=nullptr, cache={cache_size=0}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.base_index, offsets:{4597}, index:{header_bitflags:0, base_offset:{4597}, max_offset:{5157}, base_timestamp:{timestamp: 1661381354929}, max_timestamp:{timestamp: 1661381360745}, index(1,1,1)}, step:32768, needs_persistence:0}}
TRACE 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:106 - ::get segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log, refcount=0
DEBUG 2022-08-24 22:49:21,193 [shard 0] storage - segment_reader.cc:110 - Opening segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log
TRACE 2022-08-24 22:49:21,194 [shard 0] storage - segment_reader.cc:133 - ::put segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log, refcount=1
DEBUG 2022-08-24 22:49:21,194 [shard 0] storage - segment_reader.cc:137 - Closing segment file /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log
TRACE 2022-08-24 22:49:21,194 [shard 0] storage - segment.cc:301 - truncating segment /var/lib/redpanda/data/kafka/topic/0_15/4597-9-v1.log at 5157

mmaslankaprv (Member) commented
The same underlying cause as in #6147 and #4702.

jcsp (Contributor) commented Aug 25, 2022

@Lazin #6229 also fixes this, right?

Lazin (Contributor) commented Aug 25, 2022

@jcsp yep, I believe so

Lazin closed this as completed Aug 25, 2022