Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: adjust condition to remove archiver #5934

Merged
merged 2 commits into from
Aug 10, 2022

Conversation

abhijat
Copy link
Contributor

@abhijat abhijat commented Aug 10, 2022

An archiver is removed in reconcile loop if it is considered to be stopped. The stop condition checks if the upload loop is stopped. with read replica there are two status variables which denote a stopped loop - one related to normal upload and other related to read replica.

This change adjusts the condition such that any of these two variables being true marks the archiver as stopped, as the two conditions start out being false and they turn to true in mutually exclusive code paths.

The archiver must be removed in order for a stopped upload loop to restart later. This can happen in a condition where a node lost leadership, archiver is stopped and it immediately regained leadership. It is then necessary for the archiver to have been removed correctly earlier for it to start again.

fixes #5928

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

None

Lazin
Lazin previously approved these changes Aug 10, 2022
LenaAn
LenaAn previously approved these changes Aug 10, 2022
Copy link
Contributor

@LenaAn LenaAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one small nit

src/v/archival/ntp_archiver_service.cc Show resolved Hide resolved
@mmedenjak mmedenjak added kind/bug Something isn't working area/cloud-storage Shadow indexing subsystem area/archival and removed area/redpanda labels Aug 10, 2022
@abhijat abhijat dismissed stale reviews from LenaAn and Lazin via f0a7e39 August 10, 2022 13:14
@abhijat abhijat requested review from Lazin and LenaAn August 10, 2022 13:14
LenaAn
LenaAn previously approved these changes Aug 10, 2022
Copy link
Contributor

@LenaAn LenaAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

VladLazar
VladLazar previously approved these changes Aug 10, 2022
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I was also looking at this the other day.

@abhijat
Copy link
Contributor Author

abhijat commented Aug 10, 2022

at least one test failure looks concerning:



test_id:    rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True
--
  | status:     FAIL
  | run time:   1 minute 41.920 seconds
  |  
  |  
  | AssertionError("produced and consumed messages differ, produced length: 14139, consumed length: 14132, first mismatch: produced: b'14772', consumed: b'14863' (offset: 15243)")
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
  | return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 292, in test_restore_with_aborted_tx
  | assert (not first_mismatch), (
  | AssertionError: produced and consumed messages differ, produced length: 14139, consumed length: 14132, first mismatch: produced: b'14772', consumed: b'14863' (offset: 15243)

will wait for debug build to finish and then trigger another run, to see if it may be related to this change.

EDIT: did not fail in debug, doing a repeatx5 to check what fails.

@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 10, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 10, 2022
@abhijat
Copy link
Contributor Author

abhijat commented Aug 10, 2022

most test failures are variants of

failures=False
failures=True

eg https://buildkite.com/redpanda/redpanda/builds/13929#0182885d-f1ef-4b0a-bb56-c7715aa0f404

couple of others which are mentioned on the PRs

test_id: rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1
 

an archiver is removed in reconcile loop if it is stopped. the stop
condition checks if the upload loop is stopped. with read replica there
are two status variables which denote a stopped loop. this change makes
the condition such that any of these two variables being true marks the
archiver as stopped, as the two conditions start out being false. they
turn to true in mutually exclusive code paths.

the archiver must be marked as removed in order for a stopped upload
loop to restart.
@abhijat abhijat dismissed stale reviews from VladLazar and LenaAn via a7ff954 August 10, 2022 18:40
@abhijat abhijat requested a review from LenaAn August 10, 2022 18:41
@abhijat abhijat requested a review from VladLazar August 10, 2022 18:41
Copy link
Contributor

@ajfabbri ajfabbri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for the fix.

@abhijat abhijat merged commit 56e47a5 into redpanda-data:dev Aug 10, 2022
@piyushredpanda
Copy link
Contributor

/backport 22.2.x

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/redpanda kind/bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

extreme_recovery_test.py fails to upload one manifest
10 participants