Alternative to scaling down ingesters #6144

Closed
danielblando opened this issue Aug 7, 2024 · 3 comments · Fixed by #6163 or #6178

Comments

@danielblando
Contributor

danielblando commented Aug 7, 2024

Is your feature request related to a problem? Please describe.
Today, scaling down ingesters is a complicated and highly manual process. As described in the documentation, scaling down requires ensuring that blocks are flushed to storage and that queries read only from storage by setting querier.query-store-after to 0s. However, this approach is not suitable for all use cases: in some scenarios we want to keep using ingesters for querying as well, to improve request performance.

Describe the solution you'd like
Automating the scale down of ingesters is not a trivial task. It is desirable for ingesters to have a mechanism that allows users to scale them down gradually without missing data.

A proposed solution is to introduce a new state for ingesters called READONLY. In this state, ingesters cannot receive data, meaning all Push requests would fail, but they can still serve queries. Cortex would use the ring operations to filter ingesters by state, allowing the distributor and the querier/ruler to select the appropriate set of ingesters.

Write = NewOp([]InstanceState{ACTIVE}) // push path: only ACTIVE ingesters accept writes
Read = NewOp([]InstanceState{ACTIVE, PENDING, LEAVING, JOINING, READONLY}) // query path: READONLY ingesters remain queryable

To enable users to set an ingester to READONLY mode, ingesters would expose a new API that transitions them between READONLY and ACTIVE. Allowing an ingester to return to ACTIVE gives users a way to cancel a scale down if needed.

a.indexPage.AddLink(SectionDangerous, "/ingester/mode", "Change Ingester mode on ring")
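
For illustration, here is a minimal, hedged sketch of what such an endpoint could look like as a standalone program. The handler name, the "mode" form parameter, and the local readOnly flag are assumptions for this sketch; in Cortex the transition would go through the ring/lifecycler rather than a local flag, and the real API shape is not specified in this issue.

package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync/atomic"
)

// readOnly stands in for the proposed READONLY ring state in this sketch.
var readOnly atomic.Bool

func modeHandler(w http.ResponseWriter, r *http.Request) {
	mode := strings.ToUpper(r.FormValue("mode"))
	switch mode {
	case "READONLY":
		// Stop accepting Push requests; queries keep being served.
		readOnly.Store(true)
	case "ACTIVE":
		// Cancel the scale down: the ingester becomes writable again.
		readOnly.Store(false)
	default:
		http.Error(w, "unsupported mode: "+mode, http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "ingester mode set to %s\n", mode)
}

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if readOnly.Load() {
		// In READONLY mode all Push requests fail, as described above.
		http.Error(w, "ingester is read-only", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK) // a real ingester would ingest the samples here
}

func main() {
	http.HandleFunc("/ingester/mode", modeHandler)
	http.HandleFunc("/push", pushHandler)
	http.ListenAndServe(":8080", nil)
}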

Furthermore, to allow ingesters to be safely removed from the ring, they would also expose a new API that reports which blocks an ingester currently has loaded. The idea is that once an ingester has deleted all of its blocks, it can be stopped.

a.indexPage.AddLink(SectionDangerous, "/ingester/blocks", "List blocks on ingesters")
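
Continuing the hedged sketch above, the blocks endpoint could simply list whatever block directories are still on the ingester's local disk. localBlocksDir is a hypothetical path, and a real implementation would read from the ingester's TSDB instances instead; this only illustrates the proposed behavior.

// Additional imports for this handler: "encoding/json", "os".
const localBlocksDir = "/data/tsdb" // hypothetical local blocks directory

func blocksHandler(w http.ResponseWriter, r *http.Request) {
	entries, err := os.ReadDir(localBlocksDir)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	blocks := make([]string, 0, len(entries))
	for _, e := range entries {
		if e.IsDir() {
			blocks = append(blocks, e.Name())
		}
	}
	// An empty list means every block has been removed and the ingester can be stopped.
	json.NewEncoder(w).Encode(map[string][]string{"blocks": blocks})
}

It would be registered alongside the mode handler, e.g. http.HandleFunc("/ingester/blocks", blocksHandler).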

This approach introduces a new READONLY state for ingesters, enabling a controlled scale down process without data loss. Users can transition ingesters to READONLY mode, preventing new data ingestion while allowing queries on existing data. Once an ingester has deleted all its blocks, it can be safely stopped and removed from the ring.

Describe alternatives you've considered

  1. Using the LEAVING state as READONLY. This was discarded because the LEAVING state already carries multiple behaviors and assumptions about why the pod is in that state, which could make the code more confusing.

  2. Not having the /ingester/blocks endpoint and relying on the querier.query-store-after configuration to scale down ingesters. While this can still work, it adds complexity for the user, who would need to track elapsed time, ensure the configuration hasn't changed, and account for ingester failures when pushing blocks to storage.


@friedrichg
Member

Setting querier.query-store-after=0 is definitely a bad idea; I mentioned it in #5121.
I think this idea helps in that regard.

So if I understand correctly, you could have many ingesters in READONLY and get rid of them all at once. Is that it?

I think "/ingester/blocks" is probably not the right name. Something like local blocks or maybe /ingester/blocks?local=true, etc. This endpoint returns empty results when its ready to be deleted. right?

This requires more thought. I can see multiple edge cases and failure scenarios.

@danielblando
Contributor Author

Agreed, querier.query-store-after=0 is bad for performance and can, in some cases, lead to missing data depending on -querier.query-ingesters-within.

Correct, the idea is to have ingesters in READONLY mode so you can terminate them whenever you want. For example, if your querier.query-ingesters-within is 5h and you don't want to wait for the ingester to be empty, you could have a cron job that stops any READONLY ingester after 5h.

Basic usage would be:

T0: Ingesters 5, 6, 7 are set to READONLY
T1: Ingesters 5, 6, 7 no longer receive data, but still reply to queries
T2 (T1 + query-ingesters-within): Ingesters 5, 6, 7 no longer receive any requests and just hold a backup of the data
T3 (T1 + retention_period): all blocks are removed
T4: Ingesters 5, 6, 7 are removed from Cortex

Any time after T2, it would be safe to remove the ingesters without any service impact.
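
As a small illustration of the cron-job check described above (the timestamps and durations below are hypothetical example values, not part of the proposal):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical values: the ingester was switched to READONLY 6h ago (T1)
	// and -querier.query-ingesters-within is 5h, as in the example above.
	readOnlySince := time.Now().Add(-6 * time.Hour)
	queryIngestersWithin := 5 * time.Hour

	// Any time after T2 = T1 + query-ingesters-within the ingester no longer
	// serves reads or writes and can be terminated by the cron job.
	safeToTerminateAt := readOnlySince.Add(queryIngestersWithin)
	fmt.Println("safe to terminate:", time.Now().After(safeToTerminateAt))
}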

The idea of the /ingester/blocks endpoint is just a safer way to make sure no data is still loaded in that ingester. Sometimes, even after queries stop hitting the ingesters, it is good to keep a buffer of the data locally for retention_period. So this would mostly track whether, after retention_period, all blocks were indeed removed.
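
A hedged sketch of that check from the operator side, assuming the proposed endpoint returns a JSON body like {"blocks": [...]}; the real payload and the ingester address used here are not specified in this issue.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// ingesterIsEmpty reports whether the given ingester has no blocks loaded,
// i.e. whether it is safe to stop it and remove it from the ring.
func ingesterIsEmpty(addr string) (bool, error) {
	resp, err := http.Get("http://" + addr + "/ingester/blocks")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var body struct {
		Blocks []string `json:"blocks"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return len(body.Blocks) == 0, nil
}

func main() {
	empty, err := ingesterIsEmpty("ingester-5:8080") // hypothetical address
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("safe to terminate:", empty)
}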

@friedrichg
Member

@danielblando thanks for explaining more.
