Thanos sidecar is READY state when Prometheus is unavailable #5384

Closed
toan-hf opened this issue May 24, 2022 · 6 comments
@toan-hf

toan-hf commented May 24, 2022

Hello everyone,

TL;DR: We are using the common Prometheus & Thanos sidecar combination. When our Prometheus got OOM-killed, we expected Thanos Query to stop routing traffic to the unavailable Prometheus pod. In fact, the traffic was still forwarded as usual. I guess the main reason is that the Thanos sidecar was still in the READY state at that moment.

I assumed this PR has been included since 0.23, and the version we are currently running is 0.25, so it should not be a problem.

The manifest details

Thanos-Query

    Container ID:  containerd://942e9480c2430a3328d9478598db27601f0c91f83526e0006245d9566401b6ee
    Image:         quay.io/thanos/thanos:v0.25.0
    Image ID:      quay.io/thanos/thanos@sha256:bc3657f2b793f2f482991807e5e5a637f1ae4f1c75fb58d563c18a447ea61b8b
    Ports:         10901/TCP, 9090/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      query
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:9090
      --log.level=info
      --log.format=logfmt
      --query.replica-label=prometheus_replica
      --query.replica-label=rule_replica
      --store=prometheus-k8s.prometheus-operator.svc.cluster.local:10901 #short term
      --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local:10901 #long term 
      

Thanos-Store

thanos-store:
    Container ID:  containerd://fe2a3839ff2356f937b0b5f6cf056d45ab8feb7988bb08a11c62dbd9035d3c36
    Image:         quay.io/thanos/thanos:v0.25.0
    Image ID:      quay.io/thanos/thanos@sha256:bc3657f2b793f2f482991807e5e5a637f1ae4f1c75fb58d563c18a447ea61b8b
    Ports:         10901/TCP, 10902/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      store
      --log.level=info
      --log.format=logfmt
      --data-dir=/var/thanos/store
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --objstore.config=$(OBJSTORE_CONFIG)
      --ignore-deletion-marks-delay=24h
      --index-cache-size=20GB
      --index-cache.config=
        "config":
          "addr": "REDACTED:6379"
          "db": 0
        "type": "REDIS"
      --store.caching-bucket.config=
        "config":
          "addr": "REDACTED:6379"
          "db": 1
        "type": "REDIS"

K8s Service for Prometheus-K8s

spec:
  clusterIP: 100.70.78.65
  clusterIPs:
  - 100.70.78.65
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: metrics
    port: 9090
    protocol: TCP
    targetPort: metrics
  - name: grpc
    port: 10901
    protocol: TCP
    targetPort: grpc
  selector:
    app.kubernetes.io/name: prometheus
  sessionAffinity: None
  type: ClusterIP

Prometheus pod, with 4 containers inside:

Containers:
  prometheus:
    Container ID:  containerd://48401e685253283d0b967e9b433eb9fb9c5b9b1e4208efbe71d06d4bad81f257
    Image:         prom/prometheus:v2.33.5
    ..
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=24h
      --web.enable-lifecycle
      --web.enable-admin-api
      --web.external-url=https://prometheus-main.domain
      --web.route-prefix=/
      --web.config.file=/etc/prometheus/web_config/web-config.yaml
      --storage.tsdb.max-block-duration=2h
      --storage.tsdb.min-block-duration=2h
    State:          Running
    ....
 config-reloader:
    Container ID:  containerd://c5a1e0dd480cca7ce4be2b132e461dc3f0df6f8b3951b272fdd2e227452d862f
    ....
  thanos-sidecar:
    Container ID:  containerd://85efbd7813b37f7c653c128238d0486d4db1487ee4a91115c2038e81a3879413
    Image:         quay.io/thanos/thanos:v0.25.0
    Image ID:      quay.io/thanos/thanos@sha256:bc3657f2b793f2f482991807e5e5a637f1ae4f1c75fb58d563c18a447ea61b8b
    Ports:         10902/TCP, 10901/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      sidecar
      --prometheus.url=http://localhost:9090/
      --grpc-address=:10901
      --http-address=:10902
      --objstore.config=$(OBJSTORE_CONFIG)
      --tsdb.path=/prometheus
      --log.level=info
    State:          Running
      Started:      Mon, 23 May 2022 13:22:31 +0200
    Ready:          True
   ...
  vault-agent:
    Container ID:  containerd://bfdc4bb9b055a83ec1e2cfe28fd8687e6cfadd3590424ad5cdf157bc438dfe7d
  ....
    Mounts:
      /etc/vault/config from vault-config (ro)
      /home/vault from vault-token (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c7fd2 (ro)

What actually happened: when Prometheus got OOM-killed, the Thanos sidecar was still Ready and traffic was still forwarded to that pod.

What I expect: if Prometheus gets killed, the Thanos sidecar state should change to NOT READY to prevent traffic from being sent there.
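
A quick way to see the mismatch during such an outage is to compare the two readiness endpoints directly. This is only a sketch: the namespace and ports are taken from the manifests above, the pod name matches the one in the screenshot below, and /-/ready is the standard readiness path exposed by both Prometheus and Thanos components.

    # Forward the Prometheus (9090) and sidecar HTTP (10902) ports of the affected pod
    kubectl -n prometheus-operator port-forward pod/prometheus-main-1 9090 10902 &

    # Prometheus readiness -- fails or returns non-200 while the container is down or replaying its WAL
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready

    # Thanos sidecar readiness -- in the reported case this still returns 200
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:10902/-/ready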

Logs from Thanos Query while Prometheus was down:

[screenshot: Thanos Query logs]

As you can see, Thanos Query forwarded traffic to the prometheus-main-1 pod, which was OOM-killed at that time.

Version
PrometheusOperator: v0.53.0
Prometheus: v2.33.5
Thanos Query / Sidecar: v0.25.0
K8s: 1.21.1

@GiedriusS
Member

In such a situation, what happens if you curl /api/v1/status/config on Prometheus when it goes down?
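
For reference, a sketch of that check from a workstation, using a port-forward to reach the Prometheus container (pod name and namespace are taken from the details above; the sidecar uses this same endpoint to read Prometheus' external labels):

    kubectl -n prometheus-operator port-forward pod/prometheus-main-1 9090 &

    # While Prometheus is down or still replaying its WAL, this is expected to
    # hang, be refused, or return an error instead of the JSON config
    curl -s http://localhost:9090/api/v1/status/config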

@toan-hf
Author

toan-hf commented May 25, 2022

@GiedriusS I have not checked it yet, but I remember the situation at that moment:

Total containers available: 3/4
The Prometheus container state was NOT READY

The Prometheus container was loading the WAL at that moment; I believe this is a very well-known issue.

In the past 2 weeks we have tweaked the readiness & startup probes to a failure threshold of 1000, because WAL replay takes quite long for us (usually more than 20 minutes). This lets Prometheus restart successfully, but I don't know whether it affects the Thanos sidecar flow or not.

    Readiness:    http-get http://:metrics/-/ready delay=0s timeout=3s period=5s #success=1 #failure=1000
    Startup:      http-get http://:metrics/-/ready delay=0s timeout=3s period=15s #success=1 #failure=1000
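
Expressed as a pod-spec snippet, those probe settings would look roughly like the sketch below (with the Prometheus Operator the probes are generated rather than edited by hand, so this is illustrative only):

    readinessProbe:
      httpGet:
        path: /-/ready
        port: metrics
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 1000
    startupProbe:
      httpGet:
        path: /-/ready
        port: metrics
      periodSeconds: 15
      timeoutSeconds: 3
      failureThreshold: 1000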

@toan-hf
Author

toan-hf commented May 25, 2022

Hmm, I suddenly realized that a K8s Service actually cannot load-balance gRPC traffic, as mentioned there.

So if I want traffic distributed across the pods and health-checked so that only READY pods receive it, I may need to handle it at the ingress/mesh layer (such as Linkerd or NGINX Ingress).

I was thinking of using a traditional headless service (clusterIP: None) the normal way, but it would not help when one of the Prometheus pods is unavailable.
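
For completeness, a headless variant would look roughly like this (a sketch; the service name prometheus-k8s-grpc is hypothetical, while the namespace and grpc port name mirror the Service shown above). With dnssrv+ Thanos Query dials each sidecar pod individually instead of going through a ClusterIP, but it will still only skip an endpoint if the sidecar itself reports as unhealthy or not ready, which is exactly the behaviour this issue is about.

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus-k8s-grpc        # hypothetical name
      namespace: prometheus-operator
    spec:
      clusterIP: None                  # headless: DNS returns per-pod SRV records
      selector:
        app.kubernetes.io/name: prometheus
      ports:
      - name: grpc
        port: 10901
        targetPort: grpc

    # Thanos Query flag, matching the dnssrv+ style already used for thanos-store:
    #   --store=dnssrv+_grpc._tcp.prometheus-k8s-grpc.prometheus-operator.svc.cluster.local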

@toan-hf
Author

toan-hf commented May 26, 2022

/api/v1/status/config:

[screenshot: /api/v1/status/config response]

@stale

stale bot commented Jul 31, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 31, 2022
@GiedriusS
Member

I think the sidecar now becomes not ready in such a case. This should be fixed by #4939.
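
If that is the case, the sidecar's own readiness endpoint could also be wired into a container readinessProbe so Kubernetes surfaces the state, roughly like the sketch below (port 10902 matches the --http-address flag above, and /-/ready is the standard Thanos component readiness path; with the Prometheus Operator this container spec is generated, so this is illustrative only):

    # on the thanos-sidecar container
    readinessProbe:
      httpGet:
        path: /-/ready
        port: 10902
      periodSeconds: 5
      failureThreshold: 3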
