Use BasicLifecycler for Compactor and autoforget unhealthy instances #3771

Merged: 4 commits merged into main from 56quarters/compactor-lifecycler on Dec 22, 2022

Conversation

@56quarters (Contributor) commented Dec 19, 2022

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com

What this PR does

Change the compactor to use `BasicLifecycler` instead of `Lifecycler` so that we can make use of the autoforget functionality. This works around an issue where ownership of tenants is retained by unhealthy instances.
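
For orientation, here is a minimal sketch of how an auto-forget delegate is typically chained into a `BasicLifecycler` using grafana/dskit's ring package. The initial state, token count, forget period, and ring key below are illustrative assumptions, not the PR's literal values:

```go
package compactor

import (
	"github.com/go-kit/log"
	"github.com/grafana/dskit/kv"
	"github.com/grafana/dskit/ring"
	"github.com/prometheus/client_golang/prometheus"
)

// newRingLifecycler sketches the delegate chain that enables auto-forget.
func newRingLifecycler(lifecyclerCfg ring.BasicLifecyclerConfig, ringCfg ring.Config,
	kvStore kv.Client, logger log.Logger, reg prometheus.Registerer) (*ring.BasicLifecycler, error) {

	// Hypothetical: forget an instance after 10 missed heartbeat timeouts.
	const autoForgetUnhealthyPeriods = 10

	var delegate ring.BasicLifecyclerDelegate
	delegate = ring.NewInstanceRegisterDelegate(ring.JOINING, 512) // initial state and token count are illustrative
	delegate = ring.NewLeaveOnStoppingDelegate(delegate, logger)
	// The auto-forget delegate removes ring entries whose heartbeat is older
	// than the retention period, so unhealthy compactors stop owning tenants.
	delegate = ring.NewAutoForgetDelegate(autoForgetUnhealthyPeriods*ringCfg.HeartbeatTimeout, delegate, logger)

	return ring.NewBasicLifecycler(lifecyclerCfg, "compactor", "compactor", kvStore, delegate, logger, reg)
}
```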

The behavior of compactors while starting changes slightly because of the use of `BasicLifecycler` instead of `Lifecycler` (see the sketch after this list).

  • `Lifecycler` didn't join the ring when starting, only after running, so the `wait_active_instance_timeout` was only needed while waiting for the instance to become active.
  • `BasicLifecycler` joins the ring while starting, so we need to apply the `wait_active_instance_timeout` when starting the lifecycler and waiting for it to become active in the ring, not just when waiting for the instance to become active.
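
A sketch of the resulting startup sequencing, assuming dskit-style `services` and `ring` helpers; the function name and parameters here are illustrative, not the PR's literal code:

```go
package compactor

import (
	"context"
	"time"

	"github.com/grafana/dskit/ring"
	"github.com/grafana/dskit/services"
	"github.com/pkg/errors"
)

// startAndWaitActive applies wait_active_instance_timeout to both phases:
// starting the lifecycler (which now joins the ring) and waiting until the
// instance is observed as ACTIVE in the ring.
func startAndWaitActive(ctx context.Context, lifecycler *ring.BasicLifecycler,
	ringClient *ring.Ring, waitActiveInstanceTimeout time.Duration) error {

	ctx, cancel := context.WithTimeout(ctx, waitActiveInstanceTimeout)
	defer cancel()

	// With BasicLifecycler, the instance registers in the ring during the
	// starting phase, so the timeout has to cover this call too...
	if err := services.StartAndAwaitRunning(ctx, lifecycler); err != nil {
		return errors.Wrap(err, "failed to start compactor ring lifecycler")
	}

	// ...and the wait for the instance to become ACTIVE.
	if err := ring.WaitInstanceState(ctx, ringClient, lifecycler.GetInstanceID(), ring.ACTIVE); err != nil {
		return errors.Wrap(err, "compactor instance did not become ACTIVE in the ring")
	}
	return nil
}
```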

Which issue(s) this PR fixes or relates to

See #3708
Fixes #1588

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters force-pushed the 56quarters/compactor-lifecycler branch from c8ec299 to 82b7b57 on December 19, 2022 19:10
@56quarters marked this pull request as ready for review December 19, 2022 19:42
@56quarters requested a review from a team as a code owner December 19, 2022 19:42
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@pracucci self-requested a review December 20, 2022 10:25
@pracucci (Collaborator) left a comment

Thanks for working on this! Overall LGTM. I just left a couple of comments.

Do you mind testing it in a dev cluster and running a rollout to ensure everything works as expected? Unfortunately, neither unit tests nor integration tests are very good at catching issues with ring initialization and state changes.

pkg/distributor/distributor.go (resolved)
return nil, nil, errors.Wrap(err, "failed to initialize compactors' lifecycler")
}

compactorsRing, err := ring.NewWithStoreClientAndStrategy(cfg.ToRingConfig(), "compactor", ringKey, kvStore, ring.NewDefaultReplicationStrategy(), prometheus.WrapRegistererWithPrefix("cortex_", reg), logger)
@pracucci (Collaborator) commented:

We're reusing `kvStore` for the ring client too. `kvStore` is created with the name `compactor-lifecycler`, which doesn't fit if we reuse it for the ring client as well. In the old implementation, the KV store used for the ring client was named `compactor-ring`. I see two options (option 1 is sketched after this list):

  1. Use two different KV stores, to keep the previous behaviour.
  2. Rename `kvStore` from `compactor-lifecycler` to just `compactor`; then please double-check whether there's any dashboard or alert to update.
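
A minimal sketch of option 1, assuming dskit's `kv` package, where `kv.RegistererWithKVName` tags each client's metrics with its name; the function and variable names are illustrative:

```go
package compactor

import (
	"github.com/go-kit/log"
	"github.com/grafana/dskit/kv"
	"github.com/grafana/dskit/ring"
	"github.com/pkg/errors"
	"github.com/prometheus/client_golang/prometheus"
)

// newKVStores sketches keeping two separately-named KV clients, matching the
// previous behaviour: one for the lifecycler, one for the ring client.
func newKVStores(cfg kv.Config, logger log.Logger, reg prometheus.Registerer) (kv.Client, kv.Client, error) {
	lifecyclerKV, err := kv.NewClient(cfg, ring.GetCodec(), kv.RegistererWithKVName(reg, "compactor-lifecycler"), logger)
	if err != nil {
		return nil, nil, errors.Wrap(err, "failed to create KV store client for compactor lifecycler")
	}

	ringKV, err := kv.NewClient(cfg, ring.GetCodec(), kv.RegistererWithKVName(reg, "compactor-ring"), logger)
	if err != nil {
		return nil, nil, errors.Wrap(err, "failed to create KV store client for compactor ring")
	}

	return lifecyclerKV, ringKV, nil
}
```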

@56quarters (Author) commented:

I will take a look at this.

@56quarters (Author) commented:

I'm going to go back to using two different KV stores and maintain the previous behavior, rather than have compactors use a different setup than other components.

@56quarters (Author) commented:

This also exposes a problem with the way the KV store is created for compactors (via this PR) and distributors: the `prometheus.Registerer` used for the KV store isn't prefixed with `cortex_`, so it emits metrics like `kv_request_duration_seconds_count`, unlike ingesters, which emit `cortex_kv_request_duration_seconds_count`.
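
For context, the usual fix for this (and, per its message, what the follow-up commit referenced at the end of this page does) is to wrap the registerer before passing it down. A minimal illustration using `client_golang`:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// withCortexPrefix wraps a registerer so that every metric registered
// through it gains the "cortex_" prefix, e.g.
// kv_request_duration_seconds_count becomes
// cortex_kv_request_duration_seconds_count.
func withCortexPrefix(reg prometheus.Registerer) prometheus.Registerer {
	return prometheus.WrapRegistererWithPrefix("cortex_", reg)
}
```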

pkg/compactor/compactor.go (outdated, resolved)
pkg/compactor/compactor.go (resolved)
pkg/compactor/compactor.go (outdated, resolved)
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@pracucci (Collaborator) left a comment

LGTM! I left a couple of minor comments.

@@ -445,6 +445,7 @@ func New(cfg Config, clientConfig ingester_client.Config, limits *validation.Ove

// newRingAndLifecycler creates a new distributor ring and lifecycler with all required lifecycler delegates
func newRingAndLifecycler(cfg RingConfig, instanceCount *atomic.Uint32, logger log.Logger, reg prometheus.Registerer) (*ring.Ring, *ring.BasicLifecycler, error) {
reg = prometheus.WrapRegistererWithPrefix("cortex_", reg)
@pracucci (Collaborator) commented:

This changes the metric names for the KV client used by the distributor. Two comments:

  1. Can you double-check the impact on dashboards and alerts?
  2. This should be mentioned in the CHANGELOG entry too.

@pracucci (Collaborator) commented:

> Can you double-check the impact on dashboards and alerts?

I checked, and we always query `cortex_kv_request_duration_seconds_*` in dashboards and alerts. This means this change is a fix, not a regression, which is good.

However, querying the metric name `kv_request_duration_seconds` (without the `cortex_` prefix), I found it's also exposed by the query-scheduler.

Suggestion: revert this change from this PR (keeping the "broken" version) and open a separate PR to fix it in both the distributor and the query-scheduler, so that the change is better tracked in the commit log.

@56quarters (Author) commented:

Reverted. I'll open a separate PR to fix this for the distributor and the query-scheduler.

pkg/compactor/compactor_test.go (resolved)
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters merged commit f851910 into main Dec 22, 2022
@56quarters deleted the 56quarters/compactor-lifecycler branch December 22, 2022 15:43
56quarters added a commit that referenced this pull request Dec 22, 2022
Fixes an issue with distributors and query-schedulers where they
were emitting metrics without the `cortex_` prefix, which is expected
in all our dashboards.

See #3771 (comment)

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this pull request Dec 22, 2022

* Make sure lifecycler metrics have "cortex_" prefix

See #3771 (comment)

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Development

Successfully merging this pull request may close these issues:

  • Auto-forget unhealthy compactors (#1588)

2 participants