Unregister ingesters on shutdown unless an update is being rolled out #5901
Comments
Are the ingesters coming up under different host names? That would prevent the ingesters from seamlessly rejoining the ring. Otherwise, they'd join as brand new replicas and trigger resharding.
@dimitarvdimitrov The ingesters always come back up with the same pod name, if I've understood the question correctly. To give an example of what we are seeing: charting the number of in-memory active series in our ingesters per zone, the number increases when the rollout for that zone starts, and it goes back down when the bihourly head block compaction/storage shipping happens.
Thanks. I was trying to understand the need for -ingester.unregister-on-shutdown=true.
An ingester shutting down shouldn't cause downtime if the ingesters in the other two zones are available. Typically each timeseries is sharded onto three ingesters - one from each zone. As long as two of these ingesters are available, the writes and reads succeed and are consistent. Because each timeseries in a write request is sharded individually, this effectively means that you need two completely available zones at a time to ensure availability. Is this not the case in your setup?
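To illustrate the zone-quorum behaviour described above, here is a minimal Go sketch (a simplified model, not Mimir's actual replication code): each series has one replica per zone, and an operation succeeds as long as at least two of the three per-zone replicas are reachable.

```go
package main

import "fmt"

// zoneReplicaUp reports, per zone, whether the ingester owning a given
// series in that zone is currently reachable.
type zoneReplicaUp map[string]bool

// quorumReached returns true when at least `needed` of the per-zone
// replicas for a series are available. With 3 zones and needed=2, one
// unavailable zone is tolerated; two unavailable zones are not.
func quorumReached(replicas zoneReplicaUp, needed int) bool {
	up := 0
	for _, ok := range replicas {
		if ok {
			up++
		}
	}
	return up >= needed
}

func main() {
	// One zone rolling out: quorum still holds.
	fmt.Println(quorumReached(zoneReplicaUp{"a": false, "b": true, "c": true}, 2)) // true

	// Rollout in zone a plus a spot eviction in zone b: quorum lost.
	fmt.Println(quorumReached(zoneReplicaUp{"a": false, "b": false, "c": true}, 2)) // false
}
```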
Unfortunately, because we are running on spot nodes, there is an increased likelihood that during a rollout we will have two zones with unavailable ingesters: one that is being rolled out and another that has had a node evicted. Even without rollouts we could see two nodes from different zones evicted at the same time. So far we have managed to run this setup without running into this problem by having ingesters unregister from the ring on shutdown.
@dimitarvdimitrov This is still causing us quite a bit of a headache. Let me elaborate a bit on the problem described by Liam in earlier posts.

The Problem

We run ingesters across three availability zones in Azure. Each zone has its own Kubernetes StatefulSet. Because we run on spot nodes, any ingester is at risk of being terminated at any time. If ingesters don't unregister from the ring upon being terminated, then spot node evictions in two different zones around the same time put us at risk of an incident. It is our understanding that both the metric write and read path need quorum, and that it takes a missing ingester in two out of three zones to take down either path. To mitigate this risk, we set -ingester.unregister-on-shutdown=true.

The downside is that unregistering also happens during rollouts, which reshards series within the zone and drives up in-memory series and CPU usage. We were initially oblivious to this relationship between rollouts, memory series and CPU usage; it took several incidents for us to narrow down the problem. Each of our 64 CPU nodes can easily handle three or four ingesters during normal operations (where each ingester will consume ~10 CPU), but struggles to keep up during rollouts when each ingester's CPU usage increases to ~20. Because of this, we've had to overprovision ingesters to limit the number of ingesters that can be placed on a node. This overprovisioning is incurring a fairly large additional cost to our infrastructure, and is something we'd like to do away with.

The Proposed Solution

We'd like to introduce a new ingester HTTP endpoint that controls whether the ingester unregisters from the ring on shutdown. It supports three HTTP methods: one that changes the unregister state, one that resets it, and one that reads it. The method that changes the state takes the desired value in the request body - {"unregister": true} to enable unregistering on shutdown, {"unregister": false} to disable it - and responds with the resulting state. Using a request body supports both of these use cases: if the endpoint used a toggle instead, the caller would need to know the current value of the state before calling it. All three behaviours of the endpoint are idempotent. A rough sketch of what such an endpoint could look like follows.
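A minimal sketch of such a handler, assuming Go and an in-memory flag; the path /ingester/unregister-on-shutdown and the HTTP verbs used here are illustrative assumptions for this proposal, not Mimir's actual API.

```go
// Hypothetical sketch of the proposed endpoint; not Mimir's actual API.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync/atomic"
)

type unregisterState struct {
	value atomic.Bool // whether the ingester should unregister from the ring on shutdown
}

type unregisterBody struct {
	Unregister bool `json:"unregister"`
}

// handler implements the three idempotent behaviours discussed above: one
// method reads the state, one sets it from the request body, and one resets it.
func (s *unregisterState) handler(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodGet:
		// Read-only: report the current state below.
	case http.MethodPut:
		// Declarative set: the caller states the desired value instead of toggling.
		var body unregisterBody
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		s.value.Store(body.Unregister)
	case http.MethodDelete:
		// Reset to the default of not unregistering.
		s.value.Store(false)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(unregisterBody{Unregister: s.value.Load()})
}

func main() {
	s := &unregisterState{}
	http.HandleFunc("/ingester/unregister-on-shutdown", s.handler) // hypothetical path
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```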
We are unsure if the endpoint needs a file marker so that the unregister state survives an ingester restart. With the Mimir team's blessing, we'd be happy to work on implementing this endpoint.

Alternative Solutions Considered

We looked into using Mimir's runtime configuration to dynamically set the unregister state of an ingester, but found a couple of problems with this approach.
We've also considered updating an existing ingester endpoint rather than adding a new one. Another option is to move our workload off of spot nodes. Doing this would cost us several hundred thousand dollars per month. Ingesters make up the majority of our overall compute workload, so we'd still be looking at an additional cost of at least six figures even if we only moved the ingesters.
Thank you for the detailed response @LasseHels. I think we need to take a step back and look at the problem and why it exists.

Strong consistency

As you said, both the write and read path need quorum. Mimir is currently designed so that partial responses are not let through. I don't think there is a technical reason for this. The overall promise is that Mimir chooses consistency over partition tolerance.

Partial responses

What you are describing is being ok with partial responses when ingesters from multiple zones are unavailable. In other words, we would want availability and partition tolerance over consistency. This has a number of effects, such as queries silently returning incomplete results.
Problem

Then the problem is that Mimir only supports strong consistency and doesn't let clients decide whether they are ok with partial responses. I believe this is the problem you need solved, and unregister-on-shutdown is just one solution to it. Do you agree this is the problem we need to solve?
This is not quite how we see the issue. I want to make sure that we mean the same thing when we refer to an "unavailable" ingester. Our understanding of an "unavailable" ingester is an ingester that fulfils these two criteria: it is the exclusive owner of one or more series, and it cannot be reached by the read and write paths.
It is also our understanding that an ingester that unregisters from the ring on shutdown is not "unavailable", since it distributes its memory series to other ingesters in the zone (and is thus no longer the exclusive owner). If this is correct, then we don't see that our solution proposal leads to unavailable ingesters (and thus partial responses): outside of rollouts, ingesters unregister on shutdown and so are not the exclusive owners of unreachable series, and during a rollout the read and write paths can still achieve quorum using the two zones that are not being rolled out.
The second point assumes that there is some mechanism by which the read and write paths (which I guess would be the querier and distributor, respectively?) are aware of the unavailable ingester(s) in the zone that is being rolled out, and that they will reach out to the available zones to achieve quorum. Our platform is relied upon by hundreds of teams internally, and we cannot afford any of the listed effects of partial responses. Thanks for working through this with us. We are constantly improving our understanding of Mimir, and conversations like these are helpful.
This is not the case, unfortunately. The ingester does not transfer series which it has already ingested. Those stay exclusively in its own memory and/or disk. Redistribution of series (aka resharding) happens only after the ingester is unregistered from the ring, and only applies to newly ingested data. See Ingesters failure and data loss.
My previous message should have been a bit more clear: I am specifically talking about ingesters that exit gracefully. It looks like the Ingesters failure and data loss page only applies to ingesters that have exited abruptly. As mentioned above, we also see an increase in memory series across a zone during rollouts. Our only explanation for this is that ingesters distribute their series to other ingesters in the zone on shutdown, and then re-read the same series from their WAL on start-up, leading to a net increase in series across the zone. Are we sure that an ingester that is terminated gracefully does not distribute its series to other ingesters in the zone?
Yes, ingesters don't have such functionality in them. They used to do that in the old Cortex days with chunks storage, and in fact even when we were adding blocks storage into Cortex, ingesters were able to transfer state on shutdown, but this has long since been removed, and was never part of Mimir.
@pstibrany Interesting. If that is the case, what could explain the dramatic increase in memory series we see during rollouts?
My understanding is that you use -ingester.unregister-on-shutdown=true. If your ingesters join the ring with new tokens each time, this causes series reshuffling. That's not great, because each series stays in the ingester's memory for up to 3h after the last sample was received. Head compaction runs every 2h at odd hours (in UTC). For example, the head compaction at 13:00 will take samples between 10:00 and 12:00, put them into a block, and then remove them from memory. If an ingester received the last sample for a series at 10:05, it will only remove this series from memory during the compaction at 13:00.
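As a worked example of the timing above, here is a small Go sketch (assuming, as described, that head compaction runs at odd UTC hours and only evicts samples older than the most recent 2h-aligned boundary): a series whose last sample arrived at 10:05 is only dropped from memory by the 13:00 compaction, so an idle series can linger for close to 3 hours.

```go
package main

import (
	"fmt"
	"time"
)

// removalTime returns the head-compaction run that finally drops a series
// from memory, assuming compactions run at odd UTC hours and each run only
// evicts samples older than the most recent 2h-aligned boundary (the
// compaction at 13:00 covers data up to 12:00).
func removalTime(lastSample time.Time) time.Time {
	t := lastSample.UTC().Truncate(time.Hour)
	for {
		t = t.Add(time.Hour)
		coversUpTo := t.Add(-time.Hour)
		// First odd-hour compaction whose covered window ends after the last sample.
		if t.Hour()%2 == 1 && coversUpTo.After(lastSample) {
			return t
		}
	}
}

func main() {
	last := time.Date(2024, 1, 1, 10, 5, 0, 0, time.UTC)
	fmt.Println(removalTime(last)) // 2024-01-01 13:00:00 +0000 UTC, ~3h after the last sample
}
```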
Correct.
I'm still not sure I understand how this can lead to an increase in memory series of more than 100% across a zone. If the rate of data ingested is more or less constant (which is the case for us), then I'd expect the amount of memory series in a zone to be similarly constant, even if the ingesters to which series are distributed changes a bit. Are series duplicated (i.e., distributed to more than one ingester in the same zone) at any point during this process you described? That would explain an increase in memory series even if traffic is constant.
We did not know that head compaction uses a maxt one hour before the time at which it runs (i.e. the most recent 2h-aligned boundary), rather than compacting everything up to the moment it runs.
Here's a scenario to help visualize how this can happen: an ingester shuts down and unregisters from the ring, so new samples for the series it owned are routed to other ingesters in the zone; when it comes back up, it joins with new tokens and replays its WAL, so the old copies of those series are back in its memory even though other ingesters now own and ingest them.
In the end the same series are accounted for in multiple ingesters at once, and the idle copies only disappear at the next head compaction, up to ~3h later.
You can avoid the resharding by having ingesters rejoin the ring with the same tokens (for example by persisting tokens across restarts). This doesn't account for replication, but with replication it kind of just happens.
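To make the resharding mechanics concrete, here is a toy Go model of a token ring (deliberately simplified; not Mimir's actual ring implementation, and ignoring zones and replication): a series belongs to the ingester owning the first token at or after the series' hash, so when an ingester rejoins with freshly generated tokens, ownership of some series moves even though the set of ingesters is unchanged.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// token pairs a ring position with the ingester that registered it.
type token struct {
	pos   uint32
	owner string
}

// newTokens registers n random tokens for an ingester, mimicking an
// ingester joining the ring without persisted tokens.
func newTokens(rng *rand.Rand, owner string, n int) []token {
	tokens := make([]token, n)
	for i := range tokens {
		tokens[i] = token{pos: rng.Uint32(), owner: owner}
	}
	return tokens
}

// ownerOf finds the ingester owning a series: the first token clockwise
// from the series' hash, wrapping around the ring.
func ownerOf(ring []token, series string) string {
	h := fnv.New32a()
	h.Write([]byte(series))
	pos := h.Sum32()
	i := sort.Search(len(ring), func(i int) bool { return ring[i].pos >= pos })
	return ring[i%len(ring)].owner
}

func main() {
	rng := rand.New(rand.NewSource(1))
	ring := append(append(newTokens(rng, "ingester-0", 4), newTokens(rng, "ingester-1", 4)...), newTokens(rng, "ingester-2", 4)...)
	sort.Slice(ring, func(i, j int) bool { return ring[i].pos < ring[j].pos })

	series := []string{`up{instance="a"}`, `up{instance="b"}`, `up{instance="c"}`}
	for _, s := range series {
		fmt.Println(s, "->", ownerOf(ring, s))
	}

	// ingester-0 restarts and rejoins with brand new tokens: ownership of some series moves.
	var kept []token
	for _, t := range ring {
		if t.owner != "ingester-0" {
			kept = append(kept, t)
		}
	}
	ring = append(kept, newTokens(rng, "ingester-0", 4)...)
	sort.Slice(ring, func(i, j int) bool { return ring[i].pos < ring[j].pos })

	fmt.Println("after ingester-0 rejoined with new tokens:")
	for _, s := range series {
		fmt.Println(s, "->", ownerOf(ring, s))
	}
}
```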
@dimitarvdimitrov Thank you for the excellent explanation! I was mixing up samples and series, but with your example it makes perfect sense 👍. Our newfound understanding that an ingester's series/samples are not distributed to other ingesters leads to another question: is unregistering on shutdown actually giving us what we think it is? We've been enabling it to avoid quorum errors when ingesters in more than one zone are down at the same time. It is now our understanding that even with unregistering enabled, the data an ingester has already ingested is unavailable while it is down, so queries over that window cannot be complete. Do you agree with this? It bears repeating that we are appreciative of the effort you are putting into this. It would have taken us figuratively forever to piece this together ourselves.
It appears to be solving the quorum errors. But I think it creates another larger problem - that of (temporary) data loss and unreliable queries.
That's right. Mimir will actually fail the query early when it detects that it cannot contact all registered instances in the ring. By unregistering ingesters from the ring you hide from the querier the fact that there is data elsewhere (on some restarting ingesters) which is currently not accessible.
@dimitarvdimitrov I previously wrote:
On second thought, I wonder if this is the case. Consider this example where ingesters do unregister on shutdown: an ingester leaves the ring, ownership of its series temporarily moves to other ingesters in the zone, and then the original ingester comes back and takes ownership again.
It is our understanding that each metric series has exactly one owner per zone at any time. When ownership moves like this, the samples written in the interim sit on an ingester that no longer owns the series. This would mean that split series ownership - even for a brief amount of time - would practically always lead to some amount of data loss?
Good point. This may happen if shuffle-sharding is enabled for the tenant. In other words, if a tenant is only on a subset of the ingesters, the querier may not query the ingester that still holds the old samples. If a tenant is already sharded to all ingesters (i.e. shuffle-sharding is disabled), then this problem doesn't exist, because all ingesters a tenant shards to are queried for every query.
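To illustrate why shuffle-sharding matters here, a toy Go sketch follows (the shard-selection logic is a stand-in, not Mimir's actual shuffle-sharding algorithm): when a tenant is restricted to a subset of ingesters, the querier only fans out to that subset, so samples stranded on an ingester outside the tenant's shard are never queried; with the shard size equal to the full ingester set, every ingester is queried and nothing can be hidden.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// tenantShard deterministically picks shardSize ingesters for a tenant by
// ranking ingesters on a per-tenant hash. This is only a stand-in for a
// real shuffle-sharding algorithm.
func tenantShard(tenant string, ingesters []string, shardSize int) []string {
	type ranked struct {
		name string
		key  uint32
	}
	rs := make([]ranked, 0, len(ingesters))
	for _, ing := range ingesters {
		h := fnv.New32a()
		h.Write([]byte(tenant + "/" + ing))
		rs = append(rs, ranked{name: ing, key: h.Sum32()})
	}
	sort.Slice(rs, func(i, j int) bool { return rs[i].key < rs[j].key })
	shard := make([]string, 0, shardSize)
	for _, r := range rs[:shardSize] {
		shard = append(shard, r.name)
	}
	return shard
}

func main() {
	ingesters := []string{"ingester-0", "ingester-1", "ingester-2", "ingester-3", "ingester-4", "ingester-5"}

	// With shuffle-sharding, queries for this tenant only reach its shard;
	// data held by an ingester outside the shard is invisible to the querier.
	fmt.Println(tenantShard("tenant-a", ingesters, 3))

	// Without shuffle-sharding (shard size == all ingesters), every ingester
	// is queried, so split series ownership cannot hide data.
	fmt.Println(tenantShard("tenant-a", ingesters, len(ingesters)))
}
```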
We've spent the last few weeks looking into our own Mimir setup, as well as how Mimir behaves in general. This analysis, combined with your replies, has given us new insight. It is now our understanding that ingesters never hand off series they have already ingested, that unregistering from the ring on shutdown is what reshards series and inflates in-memory series (and CPU usage) during rollouts, and that unregistering also hides an ingester's already-ingested data from the queriers until it comes back.
We now also understand that we are already in a state of potential data loss. With our current setup of ingesters leaving the ring on shutdown, a multi-zone node eviction will cause temporary data loss for any overlapping series. Your first reply to my solution proposal warned about exactly this.
We didn't understand it at the time, but you hit the nail on the head. Multi-zone evictions in our system are rare, and it is okay for them to have some impact on the read path. Ideally, they should have no impact on the write path. This is the conundrum as we see it: unregistering on shutdown protects the write path during evictions but inflates series and CPU during rollouts, while not unregistering keeps rollouts cheap but turns a multi-zone eviction into an availability problem.
The inability to dynamically set -ingester.unregister-on-shutdown is what forces us to pick one of these trade-offs globally. What do you think of the proposed API endpoint solution, and can you think of more elegant solutions? In our opinion, being able to run ingesters on spot nodes is a big win, and affords users of Mimir significant cost savings.
I think the endpoint is a reasonable middle ground. Ingesters will unregister from the ring by default. On rollouts, some automation would invoke the endpoint to disable unregistering before the zone is rolled out, and re-enable it afterwards. I'd like to hear others' thoughts on this first. @grafana/mimir-tech-leads, this looks like a slightly new approach to operating Mimir, but I think as long as the operators are ok with the tradeoffs it isn't strictly incompatible with Mimir's architecture. This comment gives a very good summary of the problem and this is a proposal to address the problem. What are your thoughts?
@dimitarvdimitrov Did you and the team have a chance to review the solution proposal?
This is one option, but in our case, we would probably do the opposite: default to not leaving the ring on shutdown, and invoke the endpoint to enable unregistering only when we know an ingester is about to be evicted. Defaulting to leaving the ring is more tricky, since that requires us to invoke the endpoint not only during Mimir rollouts, but also in other cases like node restarts during a Kubernetes upgrade.
I thought about this as well and couldn't find an elegant way to do it in a backwards compatible manner. See #5901 (comment). Would be happy to give it another shot if the team thinks that using the existing endpoint is important.
@dimitarvdimitrov Progress on this issue has slowed. Anything we can do to get the ball rolling again? We'd be happy to chime in with any information that's required. We're also happy to do the implementation, but we'd like buy-in from the Mimir team before we start.
Thank you for the ping and apologies for dropping the ball on this @LasseHels. While I support the idea and I see how it solves your problem, I still think it's a good idea to get more eyes on it before you kick off development. Especially because of the consistency tradeoff it makes, it is a potentially contentious change. Do you think you can open a PR with a proposal doc similar to #5029? The idea is to condense the matter into a similar structure to the one linked, so it's easy to grasp for someone who's unfamiliar with the issue. I think this will help get buy-in from other maintainers and allow us to move forward with the implementation. (It can also serve as a good documentation basis if you want to later integrate the changes into the rollout-operator.) The doc doesn't need to go into implementation details or be overly verbose. Your previous comment serves largely the same purpose. Perhaps the changes I'd make are to update the problem description with the better understanding from this comment, and to add the example flow of how this endpoint will be used from this comment.
@dimitarvdimitrov Absolutely. I'll open a proposal pull request once my schedule clears a bit 👍.
Proposal document pull request opened: #7231.
Now that this issue has been closed, I figured I'd chime in with a final update for the record. We've been running a custom Mimir image with a crude implementation of the proposed endpoint. With the implementation of this endpoint, we've updated our ingesters to not leave the ring on shutdown. This fixes the memory series issue during rollouts. When ingesters don't leave the ring on shutdown, a multi-zone node eviction would typically cause downtime. Using the endpoint, we can now run a small service that listens for node eviction events. The service calls the endpoint on each ingester running on the node that is about to be evicted, telling it to unregister from the ring when it shuts down. The trade-off here is that multi-zone evictions do still cause temporary data loss on the read path. All things considered, this is acceptable to us, particularly considering how infrequently multi-zone evictions happen. The setup adds a fair amount of complexity, and the only reason we're doing it is that the cost savings are huge; the fact that ingester memory series no longer balloon during rollouts has allowed us to right-size ingesters, and doing so has yielded significant savings. Thanks to the Mimir team for sponsoring this change 🎉
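For the record, the glue service looks roughly like the sketch below (the endpoint path, payload, and addresses are assumptions carried over from this thread rather than Mimir's actual API, and the Kubernetes event-watching part is omitted): on a node eviction event it asks every ingester on the affected node to unregister from the ring when it shuts down.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"net/http"
	"time"
)

// markForUnregister asks one ingester to unregister from the ring when it
// shuts down. The path and JSON body are hypothetical, mirroring the
// endpoint discussed in this issue.
func markForUnregister(ctx context.Context, client *http.Client, ingesterAddr string) error {
	url := fmt.Sprintf("http://%s/ingester/unregister-on-shutdown", ingesterAddr)
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, bytes.NewBufferString(`{"unregister": true}`))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s from %s", resp.Status, ingesterAddr)
	}
	return nil
}

// onNodeEviction would be called by whatever watches for spot-node eviction
// events (omitted here); it flips every ingester on the evicted node to
// unregister-on-shutdown before the node goes away.
func onNodeEviction(ingestersOnNode []string) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	client := &http.Client{Timeout: 5 * time.Second}
	for _, addr := range ingestersOnNode {
		if err := markForUnregister(ctx, client, addr); err != nil {
			log.Printf("failed to mark %s for unregister: %v", addr, err)
		}
	}
}

func main() {
	// Example invocation with placeholder addresses.
	onNodeEviction([]string{"ingester-zone-a-3:8080", "ingester-zone-a-7:8080"})
}
```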
Is your feature request related to a problem? Please describe.
We run our Mimir stack on AKS spot nodes across 3 availability zones, and so far we have found this to be an acceptable solution for us despite the risk of node evictions and pods being shuffled amongst nodes. Part of the reason this has worked for us is making use of the setting
-ingester.unregister-on-shutdown=true
so that when an ingester is shut down, because it is being moved to another node, it will unregister itself from the ring and not cause any downtime, as series are sent to other ingesters in the zone.

However, we have observed that this has a negative effect during the rollout of updates to a zone, similar to what is discussed in grafana/helm-charts#1313. During a rollout, a zone will see its number of in-memory series double and lots of out-of-order samples ingested, both of which require more CPU and make our system unstable and susceptible to downtime during ingester rollouts. This is becoming more prevalent as our cluster grows, and we only expect to grow more; right now we have about 360,000,000 in-memory series across 3 zones and 330 ingesters, and this can double during rollouts.
Describe the solution you'd like
I believe that we could make our approach work if we were able to tell ingesters not to unregister themselves when an update is being rolled out to their zone, while ingesters in other zones continue to unregister themselves if they are shut down for any reason. We use the rollout-operator to handle our rollouts across our 3 zones, and so expect only 1 zone to be rolled out at a time.
Right now I believe that to update the config
-ingester.unregister-on-shutdown
to false we would need to roll out an update to the ingesters, which would again cause the problems we are already seeing. Some solution whereby we update the ingesters' configuration to not unregister themselves before the rollout-operator kills the pod should work, if this is possible.

Describe alternatives you've considered
If there were some way to update this unregister config on the fly, then we could script something to change this setting before we merge an update to our ingesters, but it feels like having this logic as part of the rollout-operator would be the smoothest, in which case it may make more sense to move this issue to that repository.