storage: add metrics clear to reader cache probe #5939

NyaliaLui · 2022-08-10T19:04:41Z

Cover letter

We recently ran into a double registration issue where a reader cache
was still alive even though stop() was called on it. This leads to a
double registration situation because a new reader cache will
attempt to register metrics again.

Fixes #5938

Changes from force-push 9bd1c45:

Use the linter

Backport Required

UX changes

none

Release notes

none

NyaliaLui · 2022-08-10T19:08:34Z

Had to force-push because I forgot to run the clang linter.

ajfabbri

LGTM pending clean CI.

BenPope

LGTM

mmaslankaprv · 2022-08-10T19:18:31Z

I am not sure about this. It may lead to an assertion that the ntp was registered twice.

mmaslankaprv · 2022-08-10T19:20:35Z

I am not sure about this. It may lead to an assertion that the ntp was registered twice.

Actually it should not. We need to investigate the log lifecycle issue further.

dotnwat

We recently ran into a double registration issue where a reader cache
was still alive even though stop() was called on it.

Do we have a description of the scenario that leads to the double registration?

BenPope · 2022-08-10T21:59:09Z

Essentially a partition is moved away and then back, but the object holding the probe isn't released (for some reason).

VladLazar · 2022-08-11T09:15:24Z

I posted this on the original slack thread where the issue was mentioned:

Context

The exception is thrown from broker id 5.
The test does the following replica set moves before the exception is thrown:
1. {5, 1, 3} -> {4, 1, 3}
2. {4, 1, 3} -> {5, 4, 3}
The exception is thrown on step 2. A side effect of step 1 is that the segments belonging to the partition in question are removed from disk.

Issue

The double registration exception is thrown from storage::readers_cache, which is owned by storage::disk_log_impl. storage::disk_log_impl is used via the shared-pointer-like wrapper storage::log. The problem is that after the removal of the segments, the disk_log_impl object does not go out of scope for some reason (maybe there's still a storage::log holding on to it somewhere). This in turn means that the reader cache is still alive, but in a weird state, as stop() was called on it as part of log_manager::remove. When the partition is moved back to broker 5 (step 2), the probe creation throws.

Fix

The easy fix here is to clear the metrics as part of readers_cache::stop(). We should come back to this and figure out the lifetime.
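A minimal sketch of the failure and the fix described above, in plain standard C++. It is not the Redpanda or Seastar code: metrics_registry, readers_cache_probe, the clear_on_stop flag, the metric name format, and move_away_and_back are all hypothetical stand-ins chosen for illustration. The idea is only that a stopped but still-alive cache keeps its metric name registered unless stop() clears it.

```cpp
#include <memory>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a metrics registry that rejects duplicate names.
class metrics_registry {
public:
    void add(const std::string& name) {
        if (!_names.insert(name).second) {
            throw std::runtime_error("double registration: " + name);
        }
    }
    void remove(const std::string& name) { _names.erase(name); }

private:
    std::set<std::string> _names;
};

// Hypothetical probe: registers one metric for a partition and can clear it.
class readers_cache_probe {
public:
    readers_cache_probe(metrics_registry& registry, std::string ntp)
      : _registry(registry)
      , _name("storage_log_readers{" + std::move(ntp) + "}") {
        _registry.add(_name); // throws if the name is already registered
        _registered = true;
    }
    readers_cache_probe(const readers_cache_probe&) = delete;
    ~readers_cache_probe() { clear(); }

    // Idempotent: safe to call from stop() and again from the destructor.
    void clear() {
        if (_registered) {
            _registry.remove(_name);
            _registered = false;
        }
    }

private:
    metrics_registry& _registry;
    std::string _name;
    bool _registered = false;
};

// Hypothetical cache; clear_on_stop toggles the behaviour this PR adds.
class readers_cache {
public:
    readers_cache(metrics_registry& registry, const std::string& ntp, bool clear_on_stop)
      : _probe(registry, ntp)
      , _clear_on_stop(clear_on_stop) {}

    void stop() {
        if (_clear_on_stop) {
            _probe.clear();
        }
    }

private:
    readers_cache_probe _probe;
    bool _clear_on_stop;
};

// Returns true if a new cache for the same partition can be created while the
// old, stopped cache object is still alive.
bool move_away_and_back(bool clear_on_stop) {
    metrics_registry registry;
    const std::string ntp = "kafka/topic/0";

    // Step 1: the partition is moved away; stop() runs, but the object survives.
    auto stale = std::make_unique<readers_cache>(registry, ntp, clear_on_stop);
    stale->stop();

    // Step 2: the partition is moved back and a new cache registers its metrics.
    try {
        readers_cache fresh(registry, ntp, clear_on_stop);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling move_away_and_back(false) models the behaviour before the fix (the second registration throws), and move_away_and_back(true) models clearing the metrics in stop(). Making clear() idempotent keeps the destructor safe when stop() has already run.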

mmaslankaprv · 2022-08-11T11:29:40Z

I've looked through the code and logs, and the solution in this PR seems fine. The partition is kept alive temporarily, for example while size statistics are being collected. It was released in the next reconciliation loop pass; it wasn't kept alive permanently.
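The temporary lifetime extension described here can be sketched with std::shared_ptr. This is an illustration under assumptions, not the actual partition or log code: fake_log, the event strings, and remove_while_referenced are hypothetical. It shows that an object can be stopped and dropped by its owner yet stay alive until another holder releases its copy.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical log object that records lifecycle events in an external list.
struct fake_log {
    explicit fake_log(std::vector<std::string>& events)
      : _events(events) {}
    ~fake_log() { _events.push_back("destroyed"); }
    void stop() { _events.push_back("stopped"); }

    std::vector<std::string>& _events;
};

// Returns the order of events when a second holder outlives the owner.
std::vector<std::string> remove_while_referenced() {
    std::vector<std::string> events;
    {
        // Reference held by the (hypothetical) manager.
        auto owner = std::make_shared<fake_log>(events);
        // Copy taken by some other task, e.g. one collecting size statistics.
        std::shared_ptr<fake_log> stats_ref = owner;

        owner->stop();
        owner.reset(); // the manager drops its reference on removal
        events.push_back("removed from manager");

        // The object is still alive here because stats_ref holds it.
        stats_ref.reset(); // released later, e.g. on the next loop pass
    }
    return events;
}
```

The destructor runs only after the last copy is released, so any cleanup that must happen at removal time (such as deregistering metrics) cannot rely on the destructor alone.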

VladLazar

Had a look as well. It's hard to tell what keeps the partition object alive. There's quite a few places where it's copied. I think this fix is fine.

NyaliaLui · 2022-08-11T12:45:45Z

CI failures are instances of
#5950
#5276

NyaliaLui · 2022-08-11T12:53:03Z

/backport v22.2.x

NyaliaLui added kind/bug, area/storage, area/redpanda labels Aug 10, 2022

NyaliaLui requested review from dotnwat, VladLazar and mmaslankaprv August 10, 2022 19:04

NyaliaLui self-assigned this Aug 10, 2022

NyaliaLui force-pushed the readers-probe-clear branch from 81f9b1d to 9bd1c45 Compare August 10, 2022 19:08

ajfabbri approved these changes Aug 10, 2022

BenPope approved these changes Aug 10, 2022

dotnwat reviewed Aug 10, 2022

VladLazar approved these changes Aug 11, 2022

mmaslankaprv approved these changes Aug 11, 2022

NyaliaLui merged commit a38aa4f into redpanda-data:dev Aug 11, 2022

NyaliaLui added the kind/backport label Aug 11, 2022

This was referenced Aug 11, 2022

[v22.2.x] metrics double registration (storage_log_readers) in partition balancer test #5961

Closed

[v22.2.x] storage: add metrics clear to reader cache probe #5962

Merged

BenPope mentioned this pull request Jan 4, 2023

Metrics double_registration (storage_log_written_bytes) #7983

Closed

NyaliaLui deleted the readers-probe-clear branch March 15, 2023 19:24
