Non deterministic deduplication on receive and sidecar #6705

abaguas · 2023-09-06T13:56:24Z

Thanos, Prometheus and Golang version used:

Query frontend, querier and receive on Thanos version 0.32.2
Sidecar on Thanos version 0.30.2

What happened:

With deduplication enabled, the same query returned different results when executed at different times. This happens for queries on data in sidecars (v0.30.2) and in receives (v0.32.2).

What you expected to happen:

The same query always returns the same results.

How to reproduce it (as minimally and precisely as possible):

I attach two videos illustrating the problem:

1 - dedup_bug_sidecar_0_30_2.mov

Here the data is scraped by Prometheus HA (2 replicas) and queried on its Thanos sidecar.

The query `aggregator_unavailable_apiservice_total{cluster="osdp-prod-azu-switzerlandnorth-1",name="v1beta1.metrics.k8s.io"}` returns two series when deduplication is disabled, one for each Prometheus instance:
{prometheus_replica="prometheus-osdp-monitoring-prometheus-0"} 3
{prometheus_replica="prometheus-osdp-monitoring-prometheus-1"} 5

When querying the raw data for the past 30 minutes with deduplication enabled, the query returns 3 most of the time. Sometimes, however, it returns 5 until time X and 3 afterwards. X is always around the top of the hour.

2 - dedup_bug_receive_0_32_2.mov

Here the data is scraped by Prometheus HA (2 replicas), remote written to Thanos receives (replication factor 2) and queried from there.

The query `aggregator_unavailable_apiservice_total{cluster="osse-prod-azu-eastus-1",name="v1beta1.custom.metrics.k8s.io"}` returns four series when deduplication is disabled, one for each combination of Prometheus instance and receive replica:

{prometheus_replica="prometheus-osdp-monitoring-prometheus-0",receive_replica="thanos-receive-cloudinfrastructure-1"} 8
{prometheus_replica="prometheus-osdp-monitoring-prometheus-0",receive_replica="thanos-receive-cloudinfrastructure-3"} 8
{prometheus_replica="prometheus-osdp-monitoring-prometheus-1",receive_replica="thanos-receive-cloudinfrastructure-1"} 7
{prometheus_replica="prometheus-osdp-monitoring-prometheus-1",receive_replica="thanos-receive-cloudinfrastructure-3"} 7

As in the example above, the two Prometheus replicas have stored different values.

When querying the raw data for the past 30 minutes with deduplication enabled, the query returns 8 most of the time. Sometimes, however, it returns 8 until time X and 7 afterwards. Again, X is always around the top of the hour.
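For anyone trying to reproduce either case without the videos, here is a minimal sketch of how two runs of the same range query could be compared to locate X. It assumes the querier's `/api/v1/query_range` endpoint with the `dedup` parameter; the base URL in the usage note and both helper names are hypothetical, and the samples are expected as `(timestamp, value)` pairs already extracted from the response.

```python
from urllib.parse import urlencode


def range_query_url(base, query, start, end, step, dedup):
    """Build a range-query URL for the querier with deduplication on or off.

    `base` is a placeholder for your own querier address.
    """
    params = {
        "query": query,
        "start": start,  # Unix seconds
        "end": end,      # Unix seconds
        "step": step,    # seconds
        "dedup": "true" if dedup else "false",
    }
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def divergent_range(run_a, run_b):
    """Return (first, last) timestamps at which two runs disagree, or None.

    Each run is a list of (timestamp, value) pairs for the same series.
    Timestamps present in only one run are ignored.
    """
    values_b = dict(run_b)
    diffs = sorted(
        ts for ts, value in run_a
        if ts in values_b and values_b[ts] != value
    )
    return (diffs[0], diffs[-1]) if diffs else None
```

Fetching `range_query_url("http://thanos-query:9090", ..., dedup=True)` twice and passing both sample lists to `divergent_range` gives `None` when the runs agree. When one run shows the other replica's value until X, the last divergent timestamp approximates X.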

Anything else we need to know:

I can upgrade the sidecar to 0.32.2 if you would like. But I think showing the bug was there in 0.30.2 is still interesting since 0.31.0 was the source of some querying issues.

On the querier, deduplication is configured on the following labels:

    - --query.replica-label=prometheus_replica
    - --query.replica-label=receive_replica
    - --query.replica-label=tenant_id
    - --query.replica-label=thanos_ruler_replica
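To make the effect of these flags concrete, here is a small illustration using the four receive series listed above. It is only a sketch of the grouping idea, not Thanos's actual deduplication code: series that differ only in replica labels collapse into one group, and the querier then has to pick between the differing values. The helper names are hypothetical.

```python
# Replica labels taken from the querier flags above.
REPLICA_LABELS = {
    "prometheus_replica",
    "receive_replica",
    "tenant_id",
    "thanos_ruler_replica",
}


def dedup_key(labels):
    """Identity of a series once replica labels are dropped."""
    return tuple(sorted(
        (k, v) for k, v in labels.items() if k not in REPLICA_LABELS
    ))


def group_replicas(series):
    """Map each deduplicated identity to the set of values its replicas report."""
    groups = {}
    for labels, value in series:
        groups.setdefault(dedup_key(labels), set()).add(value)
    return groups


# The four series from the receive example, with their reported values.
base = {
    "__name__": "aggregator_unavailable_apiservice_total",
    "cluster": "osse-prod-azu-eastus-1",
    "name": "v1beta1.custom.metrics.k8s.io",
}
prom = "prometheus-osdp-monitoring-prometheus-"
recv = "thanos-receive-cloudinfrastructure-"
series = [
    ({**base, "prometheus_replica": prom + "0", "receive_replica": recv + "1"}, 8),
    ({**base, "prometheus_replica": prom + "0", "receive_replica": recv + "3"}, 8),
    ({**base, "prometheus_replica": prom + "1", "receive_replica": recv + "1"}, 7),
    ({**base, "prometheus_replica": prom + "1", "receive_replica": recv + "3"}, 7),
]
groups = group_replicas(series)
```

All four series fall into a single group whose candidate values are 7 and 8, so the deduplicated result depends entirely on which replica is chosen at each point in time.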


MichaHoffmann · 2023-09-06T13:59:46Z

Might be related to #6702 (comment)

MichaHoffmann · 2023-09-06T19:11:03Z

#6706 might be a fix.

GiedriusS · 2023-10-14T10:32:40Z

I think let's close it since #6706 was merged. Please reopen if this is still the case on latest main.

douglascamata added the duplicate label Sep 19, 2023

GiedriusS closed this as completed Oct 14, 2023
