Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: add metrics clear to reader cache probe #5939

Merged
merged 1 commit into from
Aug 11, 2022

Conversation

NyaliaLui
Copy link
Contributor

@NyaliaLui NyaliaLui commented Aug 10, 2022

Cover letter

We recently ran into a double registration issue where a reader cache
was still alive even though stop() was called on it. This leads to a
double registration situation because a new reader cache will
attempt to register metrics again.

Fixes #5938

Changes from force-push 9bd1c45:

  • Use the linter

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

  • none

We recently ran into a double registration issue where a reader cache
was still alive even though stop() was called on it. This leads to a
double registration situation because a new reader cache will
attempt to register metrics again.

Fixes redpanda-data#5938
@NyaliaLui
Copy link
Contributor Author

Had to force-push because I forgot clang linter

Copy link
Contributor

@ajfabbri ajfabbri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending clean CI.

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mmaslankaprv
Copy link
Member

I am not sure about this. it may lead to assertion that ntp was registered twice

@mmaslankaprv
Copy link
Member

I am not sure about this. it may lead to assertion that ntp was registered twice

actually it should not, we need to investigate further the log lifecycle issue

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently ran into a double registration issue where a reader cache
was still alive even though stop() was called on it.

do we have a description of what the scenario is that leads to the double registration?

@BenPope
Copy link
Member

BenPope commented Aug 10, 2022

Essentially a partition is moved away and then back, the object holding the probe isn't released (for some reason).

@VladLazar
Copy link
Contributor

I posted this on the original slack thread where the issue was mentioned:

Context

The exception is thrown from broker id 5.
The test does the following replica set moves before the exception is thrown:
{5, 1, 3} -> {4, 1, 3}
{4, 1, 3} -> {5, 4, 3}
The exception is thrown on step 2. A side effect of step 1 is that the segments belonging to the partition in question are removed from disk.
Issue
The double registration exception is thrown from storage::readers_cache which is owned by storage::disk_log_impl. storage::disk_log_impl is used via the shared pointer like wrapper storage::log. The problem is that after the removal of the segments the disk_log_impl object does not go out of scope for some reason (maybe there's still a storage::log holding on to it somewhere). This in turn means that the reader cache is still alive, but in a weird state as stop() was called on it as part of log_manager::remove. When the partition is moved back to broker 5 (step 2), the probe creation throws.

Fix

The easy fix here is to clear the metrics as part of storage_cache::stop(). We should come back to this and figure out the lifetime.

@mmaslankaprv
Copy link
Member

I've looked through the code and logs and it seems that the solution in this PR is fine. Partition is kept alive f.e. while collecting size statistics. It was released in next reconciliation loop pass, it wasn't permanently kept alive

Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look as well. It's hard to tell what keeps the partition object alive. There's quite a few places where it's copied. I think this fix is fine.

@NyaliaLui
Copy link
Contributor Author

CI failures are instances of
#5950
#5276

@NyaliaLui NyaliaLui merged commit a38aa4f into redpanda-data:dev Aug 11, 2022
@NyaliaLui NyaliaLui added the kind/backport PRs targeting a stable branch label Aug 11, 2022
@NyaliaLui
Copy link
Contributor Author

/backport v22.2.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda area/storage kind/backport PRs targeting a stable branch kind/bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

metrics double registration (storage_log_readers) in partition balancer test
6 participants