Non deterministic deduplication on receive and sidecar #6705

Closed
abaguas opened this issue Sep 6, 2023 · 3 comments

Comments

@abaguas
Contributor

abaguas commented Sep 6, 2023

Thanos, Prometheus and Golang version used:

Query frontend, querier and receive on Thanos version 0.32.2
Sidecar on Thanos version 0.30.2

What happened:

The same query returned different results when executed at different times with deduplication enabled. This happens for queries on data in sidecars (v0.30.2) and in receives (v0.32.2).

What you expected to happen:

The same query always returns the same results.

How to reproduce it (as minimally and precisely as possible):

I attach two videos illustrating the problem:

1 - dedup_bug_sidecar_0_30_2.mov


Here the data is scraped by Prometheus HA (2 replicas) and queried on its Thanos sidecar.

The query `aggregator_unavailable_apiservice_total{cluster="osdp-prod-azu-switzerlandnorth-1",name="v1beta1.metrics.k8s.io"}` returns two series when deduplication is disabled, one for each Prometheus instance:
{prometheus_replica="prometheus-osdp-monitoring-prometheus-0"} 3
{prometheus_replica="prometheus-osdp-monitoring-prometheus-1"} 5

When querying the raw data for the past 30 minutes with deduplication enabled, the query returns 3 most of the time. Sometimes, however, it returns 5 until time X and 3 afterwards. This X is always around the top of the hour.
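
To make the symptom concrete, here is a minimal sketch of what deduplication has to do with the two series above: drop the replica label and collapse the series that become identical. This is not Thanos's implementation. The function names are hypothetical, and the fixed tie-break (lowest replica name wins) is an assumption chosen only to show what an order-independent choice would look like.

```python
from collections import defaultdict


def dedup_key(labels, replica_labels):
    """Return the label set without replica labels, as a hashable key."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in replica_labels))


def deduplicate(series, replica_labels):
    """Collapse series that differ only in their replica labels.

    `series` is a list of (labels, value) pairs. Picking the smallest
    replica name is an illustrative assumption, not Thanos's logic.
    """
    groups = defaultdict(list)
    for labels, value in series:
        replica_id = tuple(labels.get(name, "") for name in replica_labels)
        groups[dedup_key(labels, replica_labels)].append((replica_id, value))
    # min() over (replica_id, value) makes the result independent of input order.
    return {key: min(candidates)[1] for key, candidates in groups.items()}


series = [
    ({"name": "v1beta1.metrics.k8s.io",
      "prometheus_replica": "prometheus-osdp-monitoring-prometheus-1"}, 5),
    ({"name": "v1beta1.metrics.k8s.io",
      "prometheus_replica": "prometheus-osdp-monitoring-prometheus-0"}, 3),
]
result = deduplicate(series, ["prometheus_replica"])
```

With a rule like this, the deduplicated value for a given timestamp would be the same on every run, which is the behavior expected in this report.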

2 - dedup_bug_receive_0_32_2.mov


Here the data is scraped by Prometheus HA (2 replicas), remote written to Thanos receives (replication factor 2) and queried from there.

The query `aggregator_unavailable_apiservice_total{cluster="osse-prod-azu-eastus-1",name="v1beta1.custom.metrics.k8s.io"}` returns four series when deduplication is disabled, one for each Prometheus instance and receive replica combination:

{prometheus_replica="prometheus-osdp-monitoring-prometheus-0",receive_replica="thanos-receive-cloudinfrastructure-1"} 8
{prometheus_replica="prometheus-osdp-monitoring-prometheus-0",receive_replica="thanos-receive-cloudinfrastructure-3"} 8
{prometheus_replica="prometheus-osdp-monitoring-prometheus-1",receive_replica="thanos-receive-cloudinfrastructure-1"} 7
{prometheus_replica="prometheus-osdp-monitoring-prometheus-1",receive_replica="thanos-receive-cloudinfrastructure-3"} 7

As in the example above, the two Prometheus replicas have stored different values.

When querying the raw data for the past 30 minutes with deduplication enabled, the query returns 8 most of the time. Sometimes, however, it returns 8 until time X and 7 afterwards. Again, this X is always around the top of the hour.
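
For anyone who wants to check this without watching the videos, the comparison can be scripted. The sketch below assumes the querier is reachable at a base URL of your choosing and exposes the Prometheus-compatible `/api/v1/query_range` endpoint with the `dedup` parameter. The helper names are hypothetical. It runs the same deduplicated range query and reports the timestamps at which two runs disagree.

```python
import json
import urllib.parse
import urllib.request


def fetch_samples(base_url, query, start, end, step):
    """Run a deduplicated range query; return {timestamp: value} for the first series."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step, "dedup": "true"}
    )
    with urllib.request.urlopen(f"{base_url}/api/v1/query_range?{params}") as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    return {ts: val for ts, val in result[0]["values"]} if result else {}


def divergent_timestamps(run_a, run_b):
    """Return the sorted timestamps present in both runs whose values differ."""
    shared = run_a.keys() & run_b.keys()
    return sorted(ts for ts in shared if run_a[ts] != run_b[ts])
```

Calling `fetch_samples` twice, a few minutes apart, with the same absolute `start`, `end` and `step` keeps both runs on identical timestamps. If `divergent_timestamps` returns anything, the last entry approximates the time X described above.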

Anything else we need to know:

I can upgrade the sidecar to 0.32.2 if you would like. But I think showing the bug was there in 0.30.2 is still interesting since 0.31.0 was the source of some querying issues.

The querier deduplicates on the following labels:

    - --query.replica-label=prometheus_replica
    - --query.replica-label=receive_replica
    - --query.replica-label=tenant_id
    - --query.replica-label=thanos_ruler_replica
@MichaHoffmann
Contributor

Might be related to #6702 (comment)

@MichaHoffmann
Contributor

#6706 might be a fix.

@GiedriusS
Member

I think let's close it since #6706 was merged. Please reopen if this is still the case on latest main.
