
Unregister ingesters on shutdown unless an update is being rolled out #5901

Closed
liam-howe-maersk opened this issue Sep 1, 2023 · 26 comments · Fixed by #7739

Comments

@liam-howe-maersk

Is your feature request related to a problem? Please describe.

We run our Mimir stack on AKS spot nodes across 3 availability zones, and so far we have found this to be an acceptable solution despite the risk of node evictions and pods being shuffled among nodes. Part of the reason this has worked for us is the setting -ingester.unregister-on-shutdown=true: when an ingester is shut down because it is being moved to another node, it unregisters itself from the ring, and this causes no downtime since its series are sent to other ingesters in the zone.

However, we have observed that this has a negative effect during rollouts of updates to a zone, similar to what is discussed in grafana/helm-charts#1313. During a rollout, a zone sees its number of in-memory series double and lots of out-of-order samples ingested, both of which require more CPU and make our system unstable and susceptible to downtime during ingester rollouts. This is becoming more prevalent as our cluster grows, and we only expect to grow more; right now we have about 360,000,000 in-memory series across 3 zones and 330 ingesters, and this can double during rollouts.

Describe the solution you'd like

I believe that we could make our approach work if we were able to tell ingesters not to unregister themselves when an update is being rolled out to their zone, but other zones should continue to unregister themselves if they are shutdown for any reason. We use the rollout-operator to handle our rollouts across our 3 zones and so expect only 1 zone to be rolled out at a time.

Right now I believe that to update the config -ingester.unregister-on-shutdown to false we would need to roll out an update to the ingesters, which would again cause the problems we are already seeing. Some solution whereby we update the ingesters' configuration not to unregister themselves before the rollout-operator kills the pod should work, if this is possible.

Describe alternatives you've considered

If there were some way to update this unregister config on the fly, then we could script something to change this setting before we merge an update to our ingesters, but it feels like having this logic as part of the rollout-operator would be the smoothest option, in which case it may make more sense to move this issue to that repository.

@dimitarvdimitrov
Contributor

Are the ingesters coming up under different host names? That would prevent them from seamlessly rejoining the ring; they would instead join as brand-new replicas and trigger resharding.

@liam-howe-maersk
Author

@dimitarvdimitrov the ingesters always come back up with the same pod name, if I've understood the question correctly, e.g ingester-zone-a-0, ingester-zone-a-1, ingester-zone-a-2, ...

To give an example of what we are seeing, the chart below shows the number of in-memory active series in our ingesters per zone. When the number increases, that is when the rollout for that zone has started; when it goes back down, that is when the bi-hourly head block compaction/storage shipping happens.
[chart: in-memory active series per ingester zone during a rollout]

@dimitarvdimitrov
Contributor

Thanks. I was trying to understand the need for -ingester.unregister-on-shutdown=true

Part of the reason this has worked for us is the setting -ingester.unregister-on-shutdown=true: when an ingester is shut down because it is being moved to another node, it unregisters itself from the ring, and this causes no downtime since its series are sent to other ingesters in the zone.

An ingester shutting down shouldn't cause downtime if the ingesters in the other two zones are available. Typically each timeseries is sharded onto three ingesters, one from each zone. As long as two of these ingesters are available, writes and reads succeed and are consistent.

Because each timeseries in a write request is sharded individually, this effectively means that you need to have two completely available zones at any time to ensure availability. Is this not the case in your setup?
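For reference, the quorum rule being described can be sketched roughly like this (a simplified illustration, not Mimir's actual code):

```go
package main

import "fmt"

// quorum is the minimum number of successful zone-level responses needed,
// assuming one replica per zone and a replication factor equal to the
// number of zones (3 in this discussion).
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	const rf = 3
	// e.g. the replica in one zone is restarting, the other two answered.
	availableReplicas := 2

	if availableReplicas >= quorum(rf) {
		fmt.Println("write/read for this series succeeds: quorum reached")
	} else {
		fmt.Println("write/read for this series fails: quorum lost")
	}
}
```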

@liam-howe-maersk
Author

this effectively means that you need to have two completely available zones at a time to ensure availability. Is this not the case in your setup?

Unfortunately, because we are running on spot nodes, there is an increased likelihood that during a rollout we will have 2 zones with unavailable ingesters, one that is being rolled out and another that has had a node evicted. Even without roll outs we could see 2 nodes from different zones evicted at the same time.

So far we have managed to run this setup without the risk of this problem by having -ingester.unregister-on-shutdown=true, except we now have the side-effect mentioned above: we see this increase in memory and CPU usage, as well as out-of-order samples, during a rollout. I believe this side-effect would be resolved if we had a solution like the one mentioned in the issue, where we could tell ingesters not to unregister when an update is being rolled out to them but to unregister in all other cases.

@LasseHels
Contributor

@dimitarvdimitrov This is still causing us quite a bit of headache. Let me elaborate a bit on the problem described by Liam in earlier posts.

The Problem

We run ingesters across three availability zones in Azure. Each zone has its own Kubernetes StatefulSet (i.e., ingester-zone-a, ingester-zone-b and ingester-zone-c), and they all run on spot nodes.

Because we run on spot nodes, any ingester is at risk of being terminated at any time. If ingesters don't unregister from the ring upon being terminated, then spot node evictions in two different zones around the same time put us at risk of an incident. It is our understanding that both the metric write and read path need quorum, and that it takes a missing ingester in two out of three zones to take down either path.

To mitigate this risk, we set -ingester.ring.unregister-on-shutdown to true. Doing so solves our spot node problem, but introduces a new one: ingesters now unregister and distribute their memory series during a rollout, causing a dramatic increase in memory series across zones. The increase in memory series brings with it an increase in CPU usage that lasts until the next WAL block cut, after which memory series are brought back to normal:
[screenshot: in-memory series and CPU usage increase across a zone during an ingester rollout]

We were initially oblivious to this relationship between rollouts, memory series and CPU usage; it took several incidents for us to narrow down the problem.

Each of our 64-CPU nodes can easily handle three or four ingesters during normal operations (where each ingester consumes ~10 CPU), but struggles to keep up during rollouts, when each ingester's CPU usage increases to ~20. Because of this, we've had to overprovision ingesters to limit the number of ingesters that can be placed on a node. This overprovisioning incurs a fairly large additional cost to our infrastructure, and is something we'd like to do away with.

The Proposed Solution

We'd like to introduce a new ingester HTTP endpoint: /ingester/prepare-unregister. This endpoint can be used to control whether the ingester should unregister from the ring the next time it is stopped.

It supports three HTTP methods: GET, PUT and DELETE.

When called with the GET method, the endpoint returns the current unregister state in a JSON response body with a 200 status code:

{"unregister": true}

When called with the PUT method, the endpoint accepts a request body:

{"unregister": false}

The endpoint sets i.lifecycler.SetUnregisterOnShutdown() to the value passed in the request body. After doing so, it returns the current lifecycler unregister value with a 200 status code:

{"unregister": false}

Using a request body supports both of these use cases:

  1. Disabling unregister for an ingester that has it enabled by default.
  2. Enabling unregister for an ingester that has it disabled by default.

If the endpoint used a toggle, then the caller would need to know the value of the -ingester.ring.unregister-on-shutdown option passed to the ingester.

PUT is inconsistent with the /ingester/prepare-shutdown endpoint which uses POST, but potentially more semantically correct as it conveys the idempotent nature of the endpoint.

When called with the DELETE method, the endpoint sets the ingester's unregister state to what was passed via the -ingester.ring.unregister-on-shutdown option. After doing so, it returns the current unregister value with a 200 status code:

{"unregister": false}

The DELETE method is considered to "delete" any override, and can be used to ensure that the ingester's unregister state is set to the value that was set on ingester startup.

All three behaviours of the endpoint are idempotent.

Unhappy paths:

  • The endpoint is called with any method other than GET, PUT and DELETE. A 405 Method Not Allowed status code is returned with no response body.
  • The endpoint is called while the ingester's state is not Running. A 503 Service Unavailable status code is returned with no response body.
  • The endpoint is called with the GET or DELETE method and a request body. A 400 Bad Request status code is returned with no response body.
  • Marshaling JSON response body fails. A 500 Internal Server Error status code is returned with no response body. The error is logged internally at the Error level.
  • The endpoint is called with the PUT method and no request body. A 400 Bad Request status code is returned with no response body.
  • The endpoint is called with the PUT method and a request body that is not valid JSON. A 400 Bad Request status code is returned with no response body.
  • The endpoint is called with the PUT method and a request body that is valid JSON but contains invalid content. A 422 Unprocessable Content status code is returned with no response body.

We are unsure if the endpoint needs a file marker (similar to what the /ingester/prepare-shutdown endpoint uses). Our naive opinion is that such a marker is not needed for the /ingester/prepare-unregister endpoint.

The /ingester/prepare-unregister endpoint facilitates overriding the unregister state of ingesters. In our case, it would allow us to set -ingester.ring.unregister-on-shutdown=true as ingesters' default value, but then disable it when doing rollouts. The endpoint also supports the opposite use case; setting -ingester.ring.unregister-on-shutdown=false and then selectively enabling it.
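To make the shape of the handler concrete, here is a rough sketch of what we have in mind (the lifecycler interface and field names below are placeholders rather than Mimir's actual internals; the Running-state check and some of the unhappy paths are omitted for brevity):

```go
// Rough sketch of the proposed /ingester/prepare-unregister handler.
package ingester

import (
	"encoding/json"
	"net/http"
)

// unregisterLifecycler is an assumed interface over whatever Mimir exposes
// internally for controlling unregister-on-shutdown behaviour.
type unregisterLifecycler interface {
	ShouldUnregisterOnShutdown() bool
	SetUnregisterOnShutdown(enabled bool)
}

type prepareUnregisterHandler struct {
	lifecycler        unregisterLifecycler
	defaultUnregister bool // value of -ingester.ring.unregister-on-shutdown at startup
}

type unregisterBody struct {
	Unregister *bool `json:"unregister"`
}

func (h *prepareUnregisterHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPut:
		var body unregisterBody
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, "", http.StatusBadRequest) // missing or malformed JSON
			return
		}
		if body.Unregister == nil {
			http.Error(w, "", http.StatusUnprocessableEntity) // valid JSON, invalid content
			return
		}
		h.lifecycler.SetUnregisterOnShutdown(*body.Unregister)
	case http.MethodDelete:
		// "Delete" the override: fall back to the startup configuration.
		h.lifecycler.SetUnregisterOnShutdown(h.defaultUnregister)
	case http.MethodGet:
		// Fall through and return the current state below.
	default:
		http.Error(w, "", http.StatusMethodNotAllowed)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]bool{
		"unregister": h.lifecycler.ShouldUnregisterOnShutdown(),
	})
}
```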

With the Mimir team's blessing, we'd be happy to work on implementing this endpoint.

Alternative Solutions Considered

We looked into using Mimir's runtime configuration to dynamically set the unregister state of an ingester. We found a couple of problems with this:

  • Runtime configuration doesn't support the unregister-on-shutdown configuration option.
  • From what we can see, there is no API endpoint to set runtime configuration; the updating application would need access to the ingester's disk.
  • The updating application doesn't know the runtime config refresh interval of the ingester, and would have to continuously poll the /runtime_config endpoint to see if changes have been applied.

We've also considered updating the /ingester/prepare-shutdown endpoint to take a request body that specifies which of flush and unregister should be modified. We find this solution to be inelegant, and we also don't see a way to make this change in a backwards compatible manner. Adding a request body can be done while maintaining backwards compatibility (by treating the lack of a request body as an invocation of the original endpoint functionality), but the GET functionality would have to return information with a granularity other than set/unset, thus breaking backwards compatibility.

Another option is to move our workload off of spot nodes. Doing this would cost us several hundred thousand dollars per month. Ingesters make up the majority of our overall compute workload, so we'd still be looking at an additional cost of at least six figures even if we only moved ingesters.

@dimitarvdimitrov
Contributor

Thank you for the detailed response @LasseHels

I think we need to take a step back and look at the problem and why it exists.

Strong consistency

As you said, both the write and read path need quorum. Mimir is currently designed so that partial responses are not let through. I don't think there is a technical reason for this. The overall promise is that Mimir chooses consistency over partition tolerance.

Partial responses

What you are describing is being ok with partial responses when ingesters from multiple zones are unavailable. In other words we would want availability and partition tolerance over consistency.

The effects of this are several; I will mention some of them:

  • the float64 values you get from Mimir can be slightly skewed
  • or there can be missing data rows (series)
  • or missing data in some time range
  • due to all of the above, alert evaluations, for example, become unreliable

Problem

Then the problem is that Mimir only supports strong consistency and doesn't let clients decide whether they are ok with partial responses. I believe this is the problem you need solved, and unregister-on-shutdown is just one solution to it. Do you agree this is the problem we need to solve?

@LasseHels
Contributor

@dimitarvdimitrov

Then the problem is that Mimir only supports strong consistency and doesn't let clients decide whether they are ok with partial responses. I believe this is the problem you need solved, and unregister-on-shutdown is just one solution to it. Do you agree this is the problem we need to solve?

This is not quite how we see the issue.

I want to make sure that we mean the same thing when we refer to an "unavailable" ingester. Our understanding of an "unavailable" ingester is an ingester that fulfils these two criteria:

  1. It is the exclusive owner of memory series in the zone (in other words, it holds memory series that no other ingester in the zone holds).
  2. It is not able to serve reads and/or writes.

It is also our understanding that an ingester that unregisters from the ring on shutdown is not "unavailable" since it distributes its memory series to other ingesters in the zone (and is thus no longer the exclusive owner).

If this is correct, then we don't see that our solution proposal leads to unavailable ingesters (and thus partial responses):

  1. Under normal operations, terminated ingesters leave the ring and distribute their series to other ingesters in the zone. Partial responses do not occur since all memory series are accounted for in all zones.
  2. During ingester rollouts, terminated ingesters in the zone that is being rolled out don't leave the ring. This leaves memory series unaccounted for, but only in the zone that is being rolled out. Quorum can still be achieved since terminated (i.e., evicted) ingesters in the other two zones leave the ring on shutdown, ensuring that series in those zones are accounted for.

The second point assumes that there is some mechanism by which the read and write paths (which I guess would be distributor and querier, respectively?) are aware of the unavailable ingester(s) in the zone that is being rolled out, and that they will reach out to the available zones to achieve quorum.

Our platform is relied upon by hundreds of teams internally, and we cannot afford any of the listed effects of partial responses.

Thanks for working through this with us. We are constantly improving our understanding of Mimir, and conversations like these are helpful.

@dimitarvdimitrov
Contributor

dimitarvdimitrov commented Dec 14, 2023

It is also our understanding that an ingester that unregisters from the ring on shutdown is not "unavailable" since it distributes its memory series to other ingesters in the zone (and is thus no longer the exclusive owner).

This is not the case, unfortunately. The ingester does not transfer series which it has already ingested. Those stay exclusively in its own memory and/or disk. Redistribution of series (aka resharding) happens only after the ingester is unregistered from the ring and only applies to newly ingested data. See Ingesters failure and data loss

@LasseHels
Contributor

@dimitarvdimitrov

My previous message should have been a bit more clear: I am specifically talking about ingesters that exit gracefully. It looks like the Ingesters failure and data loss page only applies to ingesters that have exited abruptly (i.e., with a SIGKILL and not a SIGINT/SIGTERM).

As mentioned above, we also see an increase in memory series across a zone during rollouts. Our only explanation for this is that ingesters distribute their series to other ingesters in the zone on shutdown, and then re-read the same series from their WAL on start-up, leading to a net increase in series across the zone.

Are we sure that an ingester that is terminated gracefully with SIGINT or SIGTERM does not distribute its memory series to other ingesters in the zone?

@pstibrany
Member

Are we sure that an ingester that is terminated gracefully with SIGINT or SIGTERM does not distribute its memory series to other ingesters in the zone?

Yes, ingesters don't have such functionality in them. They used to do that in the old Cortex days with chunks storage, and in fact even when we were adding blocks storage into Cortex, ingesters were able to transfer state on shutdown, but this was removed long ago and was never part of Mimir.

@LasseHels
Contributor

@pstibrany Interesting. If that is the case, what could explain the dramatic increase in memory series we see during rollouts?
[screenshot: in-memory series increase across a zone during an ingester rollout]

@pstibrany
Member

@pstibrany Interesting. If that is the case, what could explain the dramatic increase in memory series we see during rollouts?

My understanding is that you use -ingester.unregister-on-shutdown=true, so your ingesters unregister from the ring when they are shutting down. This tells distributors to distribute incoming series to other ingesters (those that are still running and still have an entry in the ring). When your ingesters come back and rejoin the ring, they will again receive a portion of the series. However, the ingesters that didn't restart have already received some series, and they will only remove them from memory on the next head compaction -- which happens every 2h.

If your ingesters join the ring with new tokens each time, this causes series reshuffling. That's not great, because each series stays in an ingester's memory for up to 3h after the last sample was received.

Head compaction runs every 2h at odd hours (in UTC). For example, head compaction at 13:00 will take samples between 10:00 – 12:00, put them into a block, and then remove them from memory. If an ingester received the last sample for a series at 10:05, it will only remove this series from memory during the compaction at 13:00.
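A tiny sketch of that window arithmetic (simplified, assuming the default 2h block range aligned to even UTC hours and ignoring any configurable offsets):

```go
package main

import (
	"fmt"
	"time"
)

// compactionWindow returns the head block range that a compaction running at
// t would cut, under the simplified assumptions above.
func compactionWindow(t time.Time) (mint, maxt time.Time) {
	maxt = t.Truncate(2 * time.Hour) // 13:00 -> 12:00
	mint = maxt.Add(-2 * time.Hour)  // 12:00 -> 10:00
	return mint, maxt
}

func main() {
	compactionTime := time.Date(2024, 1, 1, 13, 0, 0, 0, time.UTC)
	mint, maxt := compactionWindow(compactionTime)
	fmt.Printf("compaction at %s covers %s – %s\n",
		compactionTime.Format("15:04"),
		mint.Format("15:04"),
		maxt.Format("15:04"))
	// prints: compaction at 13:00 covers 10:00 – 12:00
}
```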

@LasseHels
Contributor

@pstibrany

My understanding is that you use -ingester.unregister-on-shutdown=true

Correct.

This tells distributors to distribute incoming series to other ingesters (those that are still running and still have entry in the ring). When your ingesters come back and rejoin the ring, they will again receive portion of series. However those ingesters that didn't restart have already received some series, and they will only remove them from memory on next head compaction -- which happens every 2h.

I'm still not sure I understand how this can lead to an increase in memory series of more than 100% across a zone. If the rate of data ingested is more or less constant (which is the case for us), then I'd expect the amount of memory series in a zone to be similarly constant, even if the ingesters to which series are distributed changes a bit.

Are series duplicated (i.e., distributed to more than one ingester in the same zone) at any point during this process you described? That would explain an increase in memory series even if traffic is constant.

Head compaction runs every 2h at odd hours (in UTC). For example head compaction at 13:00 will take samples between 10:00 – 12:00, put them into a block, and then remove from memory. If ingester received last sample for series at 10:05, it will first remove this series from memory during compaction at 13:00.

We did not know that head compaction uses a maxt of (now - 1 hour). This has previously caused us some confusion during rollouts. Thanks for elaborating on this.

@dimitarvdimitrov
Contributor

Are series duplicated (i.e., distributed to more than one ingester in the same zone) at any point during this process you described? That would explain an increase in memory series even if traffic is constant.

Here's a scenario to help visualize how this can happen:

  1. ingester-1 is receiving samples for series up{pod="app-123"}; the ingester counts this series towards its cortex_ingester_memory_series
  2. ingester-2 is receiving samples for series up{pod="db-456"}; the ingester counts this series towards its cortex_ingester_memory_series
  3. ingester-1 unregisters from the ring and shuts down for a restart
  4. ingester-7 starts owning up{pod="app-123"}
  5. ingester-7 receives a sample for that series and keeps it in memory; it also increments its cortex_ingester_memory_series
  6. ingester-1 is done restarting and rejoins the ring
    • ingester-1 replays the WAL it has on disk, loads uncompacted series into memory, and updates cortex_ingester_memory_series; this metric still includes up{pod="app-123"} because its samples from the WAL have not been compacted yet
  7. ingester-1 generates new set of tokens
    • now ingester-1 owns up{pod="db-456"}
    • ingester-2 stops owning up{pod="db-456"}
  8. ingester-1 receives a sample for up{pod="db-456"}; adds it to the in-memory series and increments cortex_ingester_memory_series

In the end, the two series are accounted for in multiple ingesters:

  • up{pod="db-456"}
    • ingester-2 - from before the restart
    • ingester-1 - from after the restart
  • up{pod="app-123"}
    • ingester-1 - from before the restart
    • ingester-7 - from after the restart

You can avoid the resharding in step 7 by using -ingester.ring.tokens-file-path to store tokens on disk. But the resharding in step 4 will still happen.

This doesn't account for replication, but with replication the same thing simply happens $replication_factor times.
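To illustrate why ownership moves when tokens leave the ring, here is a toy sketch of token-based ownership (not Mimir's actual ring code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ringToken maps a token value to the ingester that registered it.
type ringToken struct {
	value uint32
	owner string
}

// ownerOf returns the ingester owning a series: the owner of the first token
// greater than or equal to the series' hash, wrapping around the ring.
func ownerOf(series string, tokens []ringToken) string {
	h := fnv.New32a()
	h.Write([]byte(series))
	hash := h.Sum32()

	sort.Slice(tokens, func(i, j int) bool { return tokens[i].value < tokens[j].value })
	for _, t := range tokens {
		if t.value >= hash {
			return t.owner
		}
	}
	return tokens[0].owner // wrap around
}

func main() {
	tokens := []ringToken{
		{value: 1 << 30, owner: "ingester-1"},
		{value: 2 << 30, owner: "ingester-2"},
		{value: 3 << 30, owner: "ingester-7"},
	}
	series := `up{pod="app-123"}`
	fmt.Println("owner before unregister:", ownerOf(series, tokens))

	// When an ingester unregisters, its tokens disappear from the ring; any
	// series that hashed into its token ranges now falls to the next token's
	// owner, while series owned by other ingesters keep their owner.
	withoutIngester1 := tokens[1:] // drop ingester-1's token
	fmt.Println("owner after ingester-1 unregisters:", ownerOf(series, withoutIngester1))
}
```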

@LasseHels
Contributor

@dimitarvdimitrov Thank you for the excellent explanation! I was mixing up samples and series, but with your example it makes perfect sense 👍.

Our newfound understanding that an ingester's series/samples are not distributed to other ingesters leads to another question: is -ingester.unregister-on-shutdown=true actually solving the problem that we think it is?

We've been enabling -ingester.unregister-on-shutdown=true thinking it would mitigate the effects of spot node evictions. We erroneously based this on evicted ingesters distributing their series/samples to other ingesters in the zone when they leave the ring.

It is now our understanding that even with -ingester.unregister-on-shutdown=true, some series are "lost"1 when an eviction happens. If an eviction happens in two different zones at once, and they happen to hit ingesters that have overlapping series, then we will be unable to achieve quorum for the overlapping series, as two out of three ingesters holding those series are now unavailable.

Do you agree with this?

It bears repeating that we are appreciative of the effort you are putting into this. It would have taken us figuratively forever to piece this together ourselves.

Footnotes

  1. "Lost" in the sense that the memory series of the evicted ingester are unavailable until the ingester restarts (and replays WAL). In other words, the series are only lost temporarily.

@dimitarvdimitrov
Contributor

is -ingester.unregister-on-shutdown=true actually solving the problem that we think it is?

It appears to be solving the quorum errors. But I think it creates another larger problem - that of (temporary) data loss and unreliable queries.

It is now our understanding that even with -ingester.unregister-on-shutdown=true, some series are "lost"1 when an eviction happens. If an eviction happens in two different zones at once, and they happen to hit ingesters that have overlapping series, then we will be unable to achieve quorum for the overlapping series, as two out of three ingesters holding those series are now unavailable.

That's right. Mimir will actually fail the query early when it detects that it cannot contact all registered instances in the ring. By unregistering ingesters from the ring, you hide from the querier the fact that there is data elsewhere (on some restarting ingesters) which is currently not accessible.

@LasseHels
Contributor

LasseHels commented Dec 22, 2023

@dimitarvdimitrov I previously wrote:

"Lost" in the sense that the memory series of the evicted ingester are unavailable until the ingester restarts (and replays WAL). In other words, the series are only lost temporarily.

On second thought, I wonder if this is the case.

Consider this example where ingesters do unregister on shutdown:

  1. Ingester a-1 owns series up{pod="app-123"} and has six samples.
  2. Ingester a-1 is terminated.
  3. Ingester a-1 leaves the ring and ownership of series up{pod="app-123"} is transferred to ingester a-2.
  4. Some time goes by, and a-2 now has three samples for the series.
  5. a-1 has replayed WAL and is accepting traffic again. Ingester a-2 is still the owner of series up{pod="app-123"}.
  6. A user queries up{pod="app-123"}. At this point, will queriers only reach out to a-2 since a-2 is the current owner of the series? In other words, even though a-1 has some samples on disk and in memory, those samples are "lost" since a-2 is now the owner of the series.

It is our understanding that each metric series has exactly one owner per zone at any time. When a-1 terminates, it hands over ownership of the series to a-2, but a-2 does not hand ownership back to a-1 once a-1 has restarted(?). If this is true, then any samples for the series that a-1 holds are lost (unless a-1 becomes the owner of the series again, which is unlikely to happen).

This would mean that split series ownership - even for a brief amount of time - would practically always lead to some amount of data loss?

@dimitarvdimitrov
Contributor

6. A user queries up{pod="app-123"}. At this point, will queriers only reach out to a-2 since a-2 is the current owner of the series? In other words, even though a-1 has some samples on disk and in memory, those samples are "lost" since a-2 is now the owner of the series.

Good point. This may happen if shuffle-sharding is enabled for the tenant. In other words, if a tenant is only on a-1 (later replaced by a-2), b-1, and c-1, then a-1 will not be reconsidered when queries come in and the shuffle-shard is a-2, b-1, and c-1.

If a tenant is already sharded to all ingesters (i.e. shuffle-sharding is disabled), then this problem doesn't exist because all ingesters a tenant shards to are queried for every query.

@LasseHels
Contributor

@dimitarvdimitrov

We've spent the last few weeks looking into our own Mimir setup, as well as how Mimir behaves in general. This analysis combined with your replies have given us new insight. It is now our understanding that:

  • A given metric series has exactly one owner in each zone at any time.
  • When an ingester leaves the ring, that ingester's tokens are also removed from the ring. Ownership of series previously owned by this ingester is transferred implicitly to other ingesters.
  • Ingester tokens are consistent across ring exits because we set -ingester.ring.tokens-file-path.
  • Metric samples are not distributed when an ingester leaves the ring. Once ingested, a metric sample stays with the same ingester until it is offloaded to backend storage (Azure storage account in our case).
  • Series ownership matters only on the write path.
  • All ingesters in all zones are queried for every query. The result of the query will be the combined samples returned by the first two1 zones to return2.

We now also understand that we are already in a state of potential data loss. With our current setup of ingesters leaving the ring on shutdown, a multi-zone node eviction will cause temporary3 data loss for any overlapping4 series.

Your first reply to my solution proposal was:

What you are describing is being ok with partial responses when ingesters from multiple zones are unavailable. In other words we would want availability and partition tolerance over consistency.

We didn't understand it at the time, but you hit the nail on the head. Multi-zone evictions in our system are rare, and it is okay for them to have some impact on the read path. Ideally, they should have no impact on the write path.

This is the conundrum as we see it (-ingester.ring.unregister-on-shutdown=true is our current setup):

  • With -ingester.ring.unregister-on-shutdown=false:
    • Single-zone eviction: No problem.
    • Multi-zone eviction: Guaranteed full loss of the read path for all series. Loss of the write path for overlapping series.
    • Rollout: No problem. If an eviction happens in a zone other than the one currently being rolled out, we will see the same issues as a multi-zone eviction.
  • With -ingester.ring.unregister-on-shutdown=true:
    • Single-zone eviction: No problem.
    • Multi-zone eviction: Temporary loss of data for overlapping series. No loss of the write path.
    • Rollout: Ingester CPU problems. If an eviction happens in a zone other than the one currently being rolled out, we will see the same issues as a multi-zone eviction.

The inability to dynamically set unregister-on-shutdown means that we are forced to choose one of these two bad options. The previously proposed API endpoint is our naive solution to this problem. With this endpoint, we can disable unregister-on-shutdown by default, and only enable it during evictions to ensure a stable read path.

What do you think of the proposed API endpoint solution, and can you think of more elegant solutions?

In our opinion, being able to run ingesters on spot nodes is a big win, and affords users of Mimir significant cost savings.

Footnotes

  1. We run a replication factor of three.

  2. We are on Mimir 2.9.x and don't yet have access to the minimize_ingester_requests configuration.

  3. The data loss is temporary since samples are recovered once the ingesters boot up again.

  4. I use "overlapping" to mean series that were owned by evicted ingesters in both zones. Series for whom only a single owner is evicted in a multi-zone eviction are unaffected.

@dimitarvdimitrov
Contributor

I think the endpoint is a reasonable middle ground. Ingesters will unregister from the ring by default. On rollouts some automation would invoke /ingester/prepare-unregister so that the next shutdown doesn't unregister the ingester from the ring. I think we can reuse the existing GET,POST,DELETE /ingester/prepare-shutdown to avoid creating too many endpoints that do similar things.

I'd like to hear others' thoughts on this first. @grafana/mimir-tech-leads, this looks like a slightly new approach to operating Mimir, but I think as long as the operators are ok with the tradeoffs it isn't strictly incompatible with Mimir's architecture. This comment gives a very good summary of the problem and this is a proposal to address the problem. What are your thoughts?

@LasseHels
Contributor

@dimitarvdimitrov Did you and the team have a chance to review the solution proposal?

On rollouts some automation would invoke /ingester/prepare-unregister so that the next shutdown doesn't unregister the ingester from the ring.

This is one option, but in our case, we would probably do the opposite: default to not leaving the ring on shutdown, and invoke /ingester/prepare-unregister to leave the ring on evictions.

Defaulting to leaving the ring is more tricky, since that requires us to invoke the endpoint during Mimir rollouts, but also in other cases like node restarts during a Kubernetes upgrade.

I think we can reuse the existing GET,POST,DELETE /ingester/prepare-shutdown to avoid creating too many endpoints that do similar things.

I thought about this as well and couldn't find an elegant way to do it in a backwards compatible manner. See #5901 (comment). Would be happy to give it another shot if the team thinks that using the existing endpoint is important.

@LasseHels
Contributor

@dimitarvdimitrov Progress on this issue has slowed. Anything we can do to get the ball rolling again? We'd be happy to chime in with any information that's required. We're also happy to do the implementation, and we'd like buy-in from the Mimir team before we start.

@dimitarvdimitrov
Contributor

Thank you for the ping and apologies for dropping the ball on this @LasseHels.

While I support the idea and see how it solves your problem, I still think it's a good idea to get more eyes on it before you kick off development. Especially because of the consistency tradeoff it makes, it is a potentially contentious change.

Do you think you can open a PR with a proposal doc similar to #5029? The idea is to condense the matter into a similar structure as the one linked, so that it's easy to grasp by someone who's unfamiliar with the issue. I think this will help get buy-in from other maintainers and allow us to move forward with the implementation. (It can also serve as a good documentation basis if you want to later integrate the changes into the rollout-operator.)

The doc doesn't need to go into implementation details or be overly verbose. Your previous comment serves largely the same purpose. Perhaps the changes I'd make are to update the problem description with the better understanding from this comment, and to add the example flow of how this endpoint will be used from this comment.

@LasseHels
Contributor

@dimitarvdimitrov Absolutely. I'll open a proposal pull request once my schedule clears a bit 👍.

@LasseHels
Contributor

Proposal document pull request opened: #7231.

@LasseHels
Contributor

Now that this issue has been closed, I figured I'd chime in with a final update for the record.

We've been running a custom Mimir image with a crude implementation of the /ingester/unregister-on-shutdown endpoint for a while now.

With the implementation of this endpoint, we've updated our ingesters to not leave the ring on shutdown. This fixes the memory series issue during rollouts.

When ingesters don't leave the ring on shutdown, a multi-zone node eviction would typically cause downtime. Using the endpoint, we now run a small service that listens for node eviction events. The service calls the /ingester/unregister-on-shutdown endpoint of each ingester on the node before it is evicted, so that those ingesters leave the ring on shutdown.
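For reference, the call our watcher service makes per ingester is essentially the following (a sketch only; the endpoint path matches our custom build and the request body follows the proposal above, while the ingester addresses and port are hypothetical):

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

// enableUnregister tells one ingester to leave the ring on its next shutdown.
func enableUnregister(ingesterURL string) error {
	body := bytes.NewBufferString(`{"unregister": true}`)
	req, err := http.NewRequest(http.MethodPut, ingesterURL+"/ingester/unregister-on-shutdown", body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d from %s", resp.StatusCode, ingesterURL)
	}
	return nil
}

func main() {
	// Hypothetical addresses of the ingesters running on the node being evicted.
	ingesters := []string{
		"http://ingester-zone-a-0:8080",
		"http://ingester-zone-b-3:8080",
	}
	for _, addr := range ingesters {
		if err := enableUnregister(addr); err != nil {
			log.Printf("failed to prepare %s: %v", addr, err)
		}
	}
}
```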

The trade-off here is that multi-zone evictions do still cause temporary data loss on the read path. All things considered, this is acceptable to us, particularly considering how infrequently multi-zone evictions happen.

The setup adds a fair amount of complexity, and the only reason we're doing it is that the cost savings are huge; the fact that ingester memory series no longer balloon during rollouts has allowed us to right-size ingesters. Doing so has saved us 60 Standard_D64ads_v5 nodes in Azure.

Thanks to the Mimir team for sponsoring this change 🎉
