admin: read-after-write consistency for config status on leader node #5835

Merged
jcsp merged 3 commits into redpanda-data:dev from issue-5609-config-status-consistency on Aug 16, 2022

Conversation

jcsp
Contributor

@jcsp jcsp commented Aug 4, 2022

Cover letter

admin: read-after-write consistency for config status on leader node

Previously, after writing a config update, API clients could do
a /status query to the same node and not see any nodes (including
the leader that they just PUT to) reflect the new version.

With this change, if the client is talking to the controller leader,
it will reliably see the new config version reflected in the /status
result when querying the same node again after a PUT.
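The client-side sequence described above can be sketched as follows. This is a minimal illustration, not the actual admin API client: the `put_config` and `get_status` callables, the `FakeLeader` class, the property name, and the `node_id` / `config_version` field names are all assumptions made for the example, and the HTTP transport is left out entirely.

```python
from typing import Callable, Dict, List


def leader_sees_own_write(
    put_config: Callable[[Dict[str, object]], Dict[str, int]],
    get_status: Callable[[], List[Dict[str, int]]],
    leader_id: int,
    upsert: Dict[str, object],
) -> bool:
    """PUT a config change, then read /status from the same node.

    Returns True if the leader's own status entry already reports at least
    the version that the PUT returned.
    """
    new_version = put_config(upsert)["config_version"]
    for entry in get_status():
        if entry["node_id"] == leader_id:
            return entry["config_version"] >= new_version
    return False  # the leader did not appear in the status list


class FakeLeader:
    """In-memory stand-in for the controller leader (illustration only)."""

    def __init__(self, node_id: int, peer_ids: List[int]) -> None:
        self.node_id = node_id
        self.version = 0
        self.peer_ids = peer_ids

    def put(self, upsert: Dict[str, object]) -> Dict[str, int]:
        # Hypothetical response shape: the PUT returns the new version.
        self.version += 1
        return {"config_version": self.version}

    def status(self) -> List[Dict[str, int]]:
        # Peers may still lag; only the leader's own entry is assumed fresh.
        entries = [{"node_id": self.node_id, "config_version": self.version}]
        entries += [
            {"node_id": p, "config_version": self.version - 1}
            for p in self.peer_ids
        ]
        return entries


leader = FakeLeader(node_id=1, peer_ids=[2, 3])
print(leader_sees_own_write(leader.put, leader.status, 1, {"example_property": 42}))
```

Note that the check only looks at the leader's own entry: the guarantee described here does not extend to the entries for other nodes.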

This is a little subtle; later we should provide simpler rules
for this via a higher-level "wait for status updates" as part
of the PUT call itself: https://github.com/redpanda-data/redpanda/issues/5833

Related: #5609

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

  • none

This tests the new behaviour in the previous commit.
nicolaferraro
nicolaferraro previously approved these changes Aug 4, 2022
Member

@nicolaferraro nicolaferraro left a comment

Sounds good

@jcsp
Contributor Author

jcsp commented Aug 5, 2022

This had a couple of failures in ClusterConfigTest; I will need to look at whether those tests have incorrect assumptions or if something else is up.

Member

@dotnwat dotnwat left a comment

lgtm

Comment on lines +1125 to +1127
Clearly doing fast reads isn't a guarantee of strict consistency
rules, but it will detect violations on realistic timescales. This
test did fail in practice before the change to have /status return
Member

do i understand correctly that what you are saying here is that a read-your-own-write, provided by this patch, might not yet be replicated such that a write may appear to disappear under failure scenarios? if not, i guess i'm a bit confused about what is being said here.

Contributor Author

This comment is really about the test more than the main code. It points out that, for tests, doing reads after writes does not in itself prove read-after-write consistency (we might just get lucky), but in practice I have confidence in this test because it did fail when run against a Redpanda build without the change.
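To illustrate the point about fast reads, here is a toy sketch (not the ClusterConfigTest code) of a loop that writes and immediately reads back. A zero count does not prove consistency, but a non-zero count is a definite violation. `count_stale_reads`, `LaggingStatus`, and `ImmediateStatus` are hypothetical names, and the lagging model is deliberately deterministic so the difference is visible.

```python
from typing import Callable


def count_stale_reads(
    write: Callable[[], int], read: Callable[[], int], iterations: int
) -> int:
    """Write, read back immediately, and count reads older than the write."""
    stale = 0
    for _ in range(iterations):
        written = write()
        if read() < written:
            stale += 1
    return stale


class LaggingStatus:
    """Toy model of the old behaviour: a read lags one step behind a write."""

    def __init__(self) -> None:
        self.written = 0
        self.visible = 0

    def write(self) -> int:
        self.written += 1
        return self.written

    def read(self) -> int:
        seen = self.visible
        self.visible = self.written  # catch up only after serving the read
        return seen


class ImmediateStatus(LaggingStatus):
    """Toy model of the new behaviour: a read reflects the latest write."""

    def read(self) -> int:
        self.visible = self.written
        return self.visible


lagging = LaggingStatus()
fixed = ImmediateStatus()
print(count_stale_reads(lagging.write, lagging.read, 10))
print(count_stale_reads(fixed.write, fixed.read, 10))
```

In a real cluster the lag is timing-dependent rather than guaranteed, which is why such a test can only detect violations on realistic timescales rather than rule them out.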

Contributor Author

> might not yet be replicated such that a write may appear to disappear under failure scenarios?

No, the node where the config update has been applied will not rewind its view of its own status: the ack of PUT is a replicate_and_wait of the configuration delta. It isn't waiting for the configuration status to be persisted, but that's the thrust of the change in this PR: we will now have nodes report a non-persistent status for themselves if the persistent status hasn't advanced yet.
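As a rough sketch of that rule (the real change lives in the C++ admin server, so this is only a model), a node's reported status can be thought of as the persisted status table with the node's own entry overlaid by its locally applied version whenever the persisted entry is behind. The function name `status_view` and the plain `node_id -> config_version` mapping are assumptions for illustration.

```python
from typing import Dict


def status_view(
    self_id: int, persisted: Dict[int, int], applied_locally: int
) -> Dict[int, int]:
    """Return node_id -> config_version as reported by node `self_id`."""
    view = dict(persisted)  # copy; the persisted table itself is not modified
    # Overlay only this node's own entry, and only if persistence is behind.
    if view.get(self_id, -1) < applied_locally:
        view[self_id] = applied_locally
    return view


# Node 1 has applied version 4 locally, but its persisted status is still 3.
persisted_table = {1: 3, 2: 3, 3: 3}
print(status_view(1, persisted_table, applied_locally=4))
```

Entries for other nodes are passed through unchanged, so the overlay never makes a peer look newer than its persisted status.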

If we query another node, it is possible to see persistent
status updates for the nodes _other_ than the one we are
querying, plus a non-persistent update to the status of the node
we are querying, so that the result passes the version check.

Then if we query status on a different node a moment later,
we will see an older state for the node we first queried.

This only matters for tests that are actively trying to read the
status _again_ after wait_for_version_sync. wait_for_version_sync
was already correct inasmuch as, when it completes, the config has
been applied everywhere.
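
The scenario in this commit message can be modelled roughly as below. `poll_until_version_synced` is a hypothetical stand-in, not the actual wait_for_version_sync helper from the test suite, and the two status views are hard-coded to mimic one node reporting a non-persistent status for itself while another node still reports the older persisted entry.

```python
import time
from typing import Callable, Dict


def poll_until_version_synced(
    get_status: Callable[[], Dict[int, int]],
    version: int,
    timeout_s: float = 5.0,
    backoff_s: float = 0.05,
) -> None:
    """Poll one node's status until every entry reports at least `version`."""
    deadline = time.monotonic() + timeout_s
    while True:
        view = get_status()
        if view and all(v >= version for v in view.values()):
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status did not reach version {version}: {view}")
        time.sleep(backoff_s)


# Node 1 overlays its own non-persistent status, so its view passes the check.
view_from_node_1 = {1: 4, 2: 4, 3: 4}
# Node 2 only knows node 1's persisted status, which has not advanced yet.
view_from_node_2 = {1: 3, 2: 4, 3: 4}

poll_until_version_synced(lambda: view_from_node_1, version=4)  # returns at once
# A follow-up read through a different node can still show node 1 at version 3.
print(view_from_node_2[1])
```

Under these assumptions the sync check itself is sound, since node 1 really has applied version 4; only a test that re-reads status through a different node immediately afterwards would observe the older entry.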
@jcsp
Contributor Author

jcsp commented Aug 9, 2022

@dotnwat
Member

dotnwat commented Aug 16, 2022

restarted ci since the pr is a bit old, but otherwise looks good

@jcsp
Contributor Author

jcsp commented Aug 16, 2022

CI failures are:

@jcsp jcsp merged commit b40ed60 into redpanda-data:dev Aug 16, 2022
@jcsp jcsp deleted the issue-5609-config-status-consistency branch August 16, 2022 08:47
@jcsp
Contributor Author

jcsp commented Aug 16, 2022

/backport v22.2.x

@jcsp
Contributor Author

jcsp commented Aug 16, 2022

/backport v22.1.x

Labels
area/redpanda kind/enhance New feature or request
3 participants