Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure in test_changing_topic_retention.test_changing_topic_retention.test_changing_topic_retention #3924

Closed
gousteris opened this issue Mar 2, 2022 · 16 comments · Fixed by #4010
Assignees
Labels
ci-failure kind/bug Something isn't working

Comments

@gousteris
Copy link
Contributor

bk ref: https://buildkite.com/redpanda/redpanda/builds/7714#a98f1f48-1546-4bec-bd00-00fd47d6359e

error:

test_id:    rptest.tests.retention_policy_test.RetentionPolicyTest.test_changing_topic_retention.property=retention.ms.acks=1
--
  | status:     FAIL
  | run time:   2 minutes 29.938 seconds
  |  
  |  
  | TimeoutError('Segments were not removed')
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
  | return self.test_context.function(self.test)
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 433, in wrapper
  | return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/retention_policy_test.py", line 63, in test_changing_topic_retention
  | wait_for_segments_removal(self.redpanda,
  | File "/root/tests/rptest/util.py", line 125, in wait_for_segments_removal
  | wait_until(done,
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
  | raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
  | ducktape.errors.TimeoutError: Segments were not removed
@gousteris gousteris added kind/bug Something isn't working ci-failure labels Mar 2, 2022
@gousteris
Copy link
Contributor Author

@jcsp
Copy link
Contributor

jcsp commented Mar 4, 2022

Possibly related to segments not being removed in topic deletion test #3884

jcsp added a commit to jcsp/redpanda that referenced this issue Mar 4, 2022
If something gets left behind we definitely want
to know what!

Related: redpanda-data#3924
Related: redpanda-data#3884
@jcsp
Copy link
Contributor

jcsp commented Mar 11, 2022

Caught a failure with the extra listing of files on failure here: https://buildkite.com/redpanda/vtools/builds/1724#f464055c-1abd-488f-aec4-597375d25fe4

Looks like something different to #3884, as in this test there are 10+ segments still on each node, so it's not some subtle race, the compaction just isn't happen (or isn't happening in the timescale the test expects)

@jcsp
Copy link
Contributor

jcsp commented Mar 11, 2022

The failure case is hitting the "bogus timestamp" code path:

docker_n_17/redpanda.log:WARN  2022-03-11 11:05:13,321 [shard 1] storage-gc - disk_log_impl.cc:182 - [{kafka/topic-bbamrurfno/0}] found segment with bogus max timestamp: {timestamp: 1651291555881} - {offset_tracker:{term:1, base_offset:0, committed_offset:1106, dirty_offset:1106}, compacted_segment=0, finished_self_compaction=0, reader={/var/lib/redpanda/data/kafka/topic-bbamrurfno/0_17/0-1-v1.log, (1144013 bytes)}, writer=nullptr, cache={cache_size=15}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/topic-bbamrurfno/0_17/0-1-v1.base_index, offsets:{0}, index:{header_bitflags:0, base_offset:{0}, max_offset:{1106}, base_timestamp:{timestamp: 1646996588988}, max_timestamp:{timestamp: 1651291555881}, index(15,15,15)}, step:32768, needs_persistence:0}}

@jcsp
Copy link
Contributor

jcsp commented Mar 11, 2022

2/3 nodes have "bogus timestamp" messages

The node that was handling incoming produce messages (docker_n_6) does not have any bogus timestamps reported, but it is also failing to delete segments, and is hanging on shutdown.

@jcsp
Copy link
Contributor

jcsp commented Mar 11, 2022

The bogus timestamps are about 56 days in the future compared with the actual test execution time, which doesn't leap out at me as being explainable by any internal maths errors. However, the timestamps on the redpanda log don't show any clock jumping around (which I would expect to see if the node's clock was crazy).

@jcsp
Copy link
Contributor

jcsp commented Mar 11, 2022

Here's another example, this time with bogus timestamp on one node only, ahead by about 46 days https://buildkite.com/redpanda/redpanda/builds/7997#a07f2622-e3f9-4099-a0ec-3baa953eb4c4

This coincidence of two quite rare failures happening within a day of each other makes me more suspicious of an environmental clock issue.

@andrewhsu
Copy link
Member

@gousteris
Copy link
Contributor Author

@andrewhsu
Copy link
Member

@gousteris
Copy link
Contributor Author

@andrewhsu
Copy link
Member

/backport v21.11.x

@NyaliaLui
Copy link
Contributor

@NyaliaLui NyaliaLui reopened this May 31, 2022
@ajfabbri
Copy link
Contributor

CDT failure here.

@jcsp
Copy link
Contributor

jcsp commented Jul 4, 2022

The CDT link above is for the _with_restart variant, which is #2406

@jcsp
Copy link
Contributor

jcsp commented Jul 4, 2022

No failures on dev of this test in the last 30 days + I'm inclined to think that if it was still problematic we'd be seeing it, because we are seeing failures in test_changing_topic_retention_with_restart

@jcsp jcsp closed this as completed Jul 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-failure kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants