storage: add metrics clear to reader cache probe #5939
We recently ran into a double registration issue where a reader cache was still alive even though stop() was called on it. This leads to a double registration situation because a new reader cache will attempt to register metrics again. Fixes redpanda-data#5938
Force-pushed from 81f9b1d to 9bd1c45.
Had to force-push because I forgot the clang linter.
LGTM pending clean CI.
LGTM
I am not sure about this. It may lead to assertion that
Actually it should not; we need to investigate further the
We recently ran into a double registration issue where a reader cache
was still alive even though stop() was called on it.
Do we have a description of the scenario that leads to the double registration?
Essentially, a partition is moved away and then back, but the object holding the probe isn't released (for some reason).
I posted this on the original Slack thread where the issue was mentioned:
Context
The exception is thrown from broker id 5.
Fix
The easy fix here is to clear the metrics as part of storage_cache::stop(). We should come back to this and figure out the lifetime.
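To make the failure mode and the fix concrete, here is a minimal, self-contained sketch. It does not use Redpanda's or Seastar's actual classes: `metrics_registry`, `cache_probe`, `reader_cache`, the `clear_on_stop` flag, and the partition name `kafka/topic/0` are all hypothetical stand-ins, assuming only that the registry rejects a name that is already registered.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a metrics registry that rejects duplicate names.
class metrics_registry {
public:
    void add(const std::string& name) {
        if (!_names.insert(name).second) {
            throw std::runtime_error("double registration: " + name);
        }
    }
    void remove(const std::string& name) { _names.erase(name); }

private:
    std::set<std::string> _names;
};

// Hypothetical probe: registers in setup_metrics(), deregisters in
// clear_metrics() or, at the latest, in its destructor.
class cache_probe {
public:
    cache_probe(metrics_registry& reg, std::string name)
      : _reg(reg), _name(std::move(name)) {}
    cache_probe(const cache_probe&) = delete;
    cache_probe& operator=(const cache_probe&) = delete;
    ~cache_probe() { clear_metrics(); }

    void setup_metrics() {
        _reg.add(_name);
        _registered = true;
    }
    void clear_metrics() {
        if (_registered) {
            _reg.remove(_name);
            _registered = false;
        }
    }

private:
    metrics_registry& _reg;
    std::string _name;
    bool _registered = false;
};

// Hypothetical cache. clear_on_stop toggles between the old behaviour
// (metrics live until destruction) and the fix (metrics cleared in stop()).
class reader_cache {
public:
    reader_cache(metrics_registry& reg, const std::string& ntp, bool clear_on_stop)
      : _probe(reg, ntp), _clear_on_stop(clear_on_stop) {
        _probe.setup_metrics();
    }
    void stop() {
        if (_clear_on_stop) {
            _probe.clear_metrics();
        }
    }

private:
    cache_probe _probe;
    bool _clear_on_stop;
};

// Returns true if a second cache for the same partition can be created
// while the first one has been stopped but not yet destroyed.
inline bool recreate_after_stop(bool clear_on_stop) {
    metrics_registry reg;
    reader_cache old_cache(reg, "kafka/topic/0", clear_on_stop);
    old_cache.stop(); // the old object is still alive after this call
    try {
        reader_cache new_cache(reg, "kafka/topic/0", clear_on_stop);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

In this sketch, `recreate_after_stop(false)` fails because the stopped cache still owns its registration, while `recreate_after_stop(true)` succeeds because `stop()` has already released it. The destructor still calls `clear_metrics()`, so clearing twice is harmless.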
I've looked through the code and logs, and it seems that the solution in this PR is fine. The partition is kept alive, for example, while collecting size statistics. It was released in the next reconciliation loop pass, so it wasn't permanently kept alive.
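The temporary lifetime extension described above can be illustrated with a small sketch. It uses `std::shared_ptr` as a stand-in for whatever shared pointer type actually holds the partition, and the `partition` struct, `stats_copy`, and `observe_lifetime` names are hypothetical. It only shows the general mechanism: a copy of the pointer held elsewhere delays destruction until that copy is released.

```cpp
#include <cassert>
#include <memory>

// Hypothetical partition object that counts how many instances are alive.
struct partition {
    explicit partition(int& live) : _live(live) { ++_live; }
    ~partition() { --_live; }
    partition(const partition&) = delete;
    partition& operator=(const partition&) = delete;

    int& _live;
};

struct lifetime_observation {
    int while_held;    // live partitions while another holder keeps a copy
    int after_release; // live partitions once that copy is dropped
};

inline lifetime_observation observe_lifetime() {
    int live = 0;
    auto owner = std::make_shared<partition>(live);

    // Another component (e.g. a statistics collector) takes a copy.
    std::shared_ptr<partition> stats_copy = owner;

    // The owner lets go, as when the partition is moved away.
    owner.reset();
    int while_held = live; // destructor has not run yet

    // The other holder finishes and releases its copy.
    stats_copy.reset();
    int after_release = live; // destructor has run now

    return {while_held, after_release};
}
```

Any cleanup that lives only in the destructor is therefore deferred for as long as such a copy exists, which is why clearing the metrics explicitly in `stop()` avoids the overlap.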
Had a look as well. It's hard to tell what keeps the partition
object alive. There are quite a few places where it's copied. I think this fix is fine.
/backport v22.2.x
Changes from force-push
9bd1c45
:Backport Required
UX changes
Release notes