Thanos sidecar is READY state when Prometheus is unavailable #5384
Comments
In such a situation, what happens if you curl it? |
@GiedriusS I have not checked yet, but I remember the situation at that moment: 3 of the 4 containers in the pod were available, and the Prometheus container was still replaying the WAL. I believe this is a well-known issue. In the past two weeks we have raised the readiness and startup probes to a failure threshold of 1000, because WAL replay takes a long time for us (usually more than 20 minutes). That change lets our Prometheus restart successfully, but I don't know whether it affected the Thanos Sidecar flow or not.
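For reference, probe tuning like the one described above might look roughly like the sketch below. This is a minimal, hypothetical fragment: the container name, period, and thresholds are assumptions, not values taken from this cluster. Prometheus does serve its readiness status on `/-/ready` (port 9090 by default):

```yaml
# Hypothetical sketch: give Prometheus plenty of time for WAL replay
# before the kubelet gives up on the pod. Values are assumptions.
containers:
  - name: prometheus
    ports:
      - containerPort: 9090
        name: web
    startupProbe:
      httpGet:
        path: /-/ready      # Prometheus readiness endpoint
        port: web
      periodSeconds: 15
      failureThreshold: 1000   # tolerates ~4h of WAL replay (15s x 1000)
    readinessProbe:
      httpGet:
        path: /-/ready
        port: web
      periodSeconds: 5
```

Because the startup probe gates the readiness probe, the readiness check only starts failing the pod once startup has succeeded, which is why a very high startup failure threshold helps long WAL replays.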
Hmm, I suddenly realized that a K8s Service actually has no capability for gRPC load balancing, as mentioned there. So if I want to make sure the traffic is distributed across the pods, with health checks to confirm a pod is READY, maybe I will need to handle it at the ingress layer (e.g. Linkerd or Nginx Ingress). I was thinking of using a traditional headless Service (clusterIP: None) as the usual approach, but it would not help in the case where one of the Prometheus pods is unavailable. |
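For comparison, the headless-Service idea mentioned above would look roughly like this sketch (the Service name and selector labels are assumptions; 10901 is the Thanos sidecar's default gRPC port). A headless Service gives Thanos Query one DNS record per pod, but plain DNS resolution by itself does not health-check the backends, which is the limitation described above:

```yaml
# Hypothetical sketch of a headless Service for sidecar gRPC discovery.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-sidecar-grpc   # assumed name
spec:
  clusterIP: None                 # headless: DNS returns pod IPs directly
  selector:
    app.kubernetes.io/name: prometheus   # assumed labels
  ports:
    - name: grpc
      port: 10901                 # Thanos sidecar default gRPC port
      targetPort: 10901
```

With this in place, Query-side endpoint health ultimately depends on the sidecar's own gRPC readiness, not on the Service, which is why the sidecar reporting READY while Prometheus is down matters.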
Hello 👋 Looks like there was no activity on this issue for the last two months.
I think Sidecar now becomes not ready in such a case. Should be fixed by #4939.
Hello everyone,
TL;DR: We are using the traditional combination, like everyone else: Prometheus plus the Thanos Sidecar. However, we ran into an issue when our Prometheus got OOM-killed: we expected Thanos Query not to route traffic to the unavailable Prometheus pod, but in fact the traffic was still forwarded as usual. I guess the main reason is that the Thanos Sidecar was still in the READY state at that moment.
I assumed this PR had been included since 0.23, and the version we are using is 0.25, so it should not be a problem.
The manifest details
Thanos-Query
Thanos-Store
K8s Service for Prometheus-K8s
Prometheus Pod (4 containers inside)
The actual problem: when Prometheus got OOM-killed, the Thanos Sidecar was still READY and traffic was still forwarded to that pod.
What I expect: if Prometheus gets killed, the Thanos Sidecar state should change to NOT READY to prevent traffic from being sent there.
Logs from Thanos Query when Prometheus down
As you can see, Thanos Query forwarded the traffic to the prometheus-main-1 pod, which was OOM-killed at that time.
Version
PrometheusOperator: v0.53.0
Prometheus: v2.33.5
Thanos Query / Sidecar: v0.25.0
K8s: 1.21.1