
query: auto-downsampling causes inaccurate output of metrics (inflated values) #922

Closed
ottoyiu opened this issue Mar 14, 2019 · 23 comments

@ottoyiu

ottoyiu commented Mar 14, 2019

Thanos, Prometheus and Golang version used

image: improbable/thanos:v0.3.1

and

image: improbable/thanos:v0.3.2

I was able to replicate it on both versions.

Prometheus 2.7.1

What happened
When --query.auto-downsampling is enabled on the query component, metrics beyond two days are inflated to multiples of the actual result. In this case, we've seen the metric values go 10x.

PromQL:

sum(dest_bps_in{hostname=~"$hostname", exported_namespace=~"$namespace"}) by (service_name, exported_namespace) * 8

Auto-downsampling enabled (grafana v5.3.4):
Screenshot-2019-3-14 Grafana - (k8s) BalanceD Service At A Glance(1)

Auto-downsampling disabled (Grafana v5.3.4) - these metrics are accurate:
Screenshot-2019-3-14 Grafana - (k8s) BalanceD Service At A Glance

Another one with auto-downsampling enabled (grafana v6.0.1):
Screenshot-2019-3-14 New dashboard - Grafana

What you expected to happen
Metrics to be accurate regardless of whether auto-downsampling is enabled or not.

How to reproduce it (as minimally and precisely as possible):

        - --retention.resolution-raw=30d
        - --retention.resolution-5m=90d
        - --retention.resolution-1h=365d

on compactor

  • Enable auto-downsampling (querier flag sketched below) and observe any metrics with a 30-day window in Grafana. Metrics are inaccurate, and when zooming back in to a smaller window, the metrics become accurate again.
  • Disable auto-downsampling and observe any metrics with a 30-day window in Grafana. Metrics are accurate.
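
For completeness, the querier side of this repro is just the flag already mentioned above (a minimal sketch; every other querier flag is left at its usual value):

        - --query.auto-downsampling

on querier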

Full logs to relevant components
thanos bucket inspect output

Logs

|            ULID            |        FROM         |        UNTIL        |     RANGE     |   UNTIL-COMP   |  #SERIES  |    #SAMPLES    |   #CHUNKS   | COMP-LEVEL | COMP-FAILED |                                                                   LABELS                                                                   | RESOLUTION |  SOURCE   |
|----------------------------|---------------------|---------------------|---------------|----------------|-----------|----------------|-------------|------------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| 01D5C9DF7VXBRQPCR8P9HF0ERH | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.639s | -152h8m55.639s | 1,562,415 | 44,642,000,050 | 373,976,728 | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5D42HA6ES2XJM5XNN0EZ7VT | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.663s | -152h8m55.663s | 1,562,599 | 44,651,399,075 | 373,977,040 | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5D8J10JXA8FTFRPGSA52CNN | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.639s | 47h51m4.361s   | 1,562,415 | 3,459,134,615  | 25,743,172  | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5DF724T11E1WX5W6SAPJJ4R | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.663s | 47h51m4.337s   | 1,562,599 | 3,460,647,340  | 25,743,356  | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5G8YD96MC0JG89X1BM60ANE | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | -8h0m0s        | 1,588,659 | 11,285,984,868 | 94,100,935  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5G9XG672QFRTWJ4C7CAMZNF | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | -8h0m0s        | 1,588,728 | 11,286,004,997 | 94,101,042  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5GAZWSJXNGCG44NVTPJKC6J | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | 192h0m0s       | 1,588,658 | 874,836,605    | 7,651,554   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5GCD7FRDZS5YBS78SGF4X0S | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | 192h0m0s       | 1,588,728 | 874,836,712    | 7,651,624   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5NE0GQPJXX86J8F1N64R0G5 | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,592,420 | 11,349,456,418 | 94,636,942  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5NEYKFR32V0Y4CNP4F20KGC | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,592,436 | 11,349,476,561 | 94,636,972  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5NFZS7T5KX10N84NVAGKTW2 | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,592,419 | 880,110,425    | 7,696,584   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5NHF3JXX7JZR52HSYQVW86B | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,592,435 | 880,110,409    | 7,696,600   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5TRKJ8ZTT4RA81F8X2H08HA | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,659,932 | 11,414,008,950 | 95,216,437  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5TSEMGHA3P1FCWC5P06J6QB | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,660,023 | 11,414,028,755 | 95,216,496  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5TTAE3NYKETYZPQ4B35G79P | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,659,871 | 885,270,232    | 7,788,901   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5TVKN3QK4FCMNZ14H9BC2HZ | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,659,962 | 885,270,286    | 7,788,992   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5VDBQD2YN5YSTETC1HFW89K | 12-03-2019 17:00:00 | 13-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,552,580 | 1,893,087,654  | 15,924,597  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5VDJSHWEKASM3Q7AG19GDA1 | 12-03-2019 17:00:00 | 13-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,552,545 | 1,892,299,071  | 15,924,289  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5W93Q11Y8XGN10D3P8W4AJG | 13-03-2019 01:00:00 | 13-03-2019 09:00:00 | 8h0m0s        | 32h0m0s        | 1,580,955 | 1,910,657,289  | 15,953,592  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5W9JYC4D4MRYN53F5NS8D55 | 13-03-2019 01:00:00 | 13-03-2019 09:00:00 | 8h0m0s        | 32h0m0s        | 1,580,949 | 1,910,653,297  | 15,953,550  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5X49TXE2FPRN1XKPAJCAZNB | 13-03-2019 09:00:00 | 13-03-2019 17:00:00 | 8h0m0s        | 32h0m0s        | 1,546,268 | 1,891,744,574  | 15,479,228  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5X4G6QTG00SBS74Z3ZS5WF6 | 13-03-2019 09:00:00 | 13-03-2019 17:00:00 | 8h0m0s        | 32h0m0s        | 1,546,270 | 1,891,253,611  | 15,479,221  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5XZRJBRTZ8ZRE6YQPDHXQNA | 13-03-2019 17:00:00 | 14-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,556,778 | 1,911,067,151  | 15,936,507  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5Y002A7A24Y55WWQ8V84Z7V | 13-03-2019 17:00:00 | 14-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,556,774 | 1,911,064,040  | 15,936,467  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5XXQPZJ96WTWG5Z7PZBBPDD | 14-03-2019 01:00:00 | 14-03-2019 03:00:00 | 2h0m0s        | 38h0m0s        | 1,544,246 | 477,747,855    | 3,981,488   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | sidecar   |
| 01D5XXQPZM56832FMDBFCQ3SPP | 14-03-2019 01:00:00 | 14-03-2019 03:00:00 | 2h0m0s        | 38h0m0s        | 1,544,238 | 477,748,831    | 3,981,484   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | sidecar   |
| 01D5Y4KE4GR3HDZAY8K8SBZH1X | 14-03-2019 03:00:00 | 14-03-2019 05:00:00 | 2h0m0s        | 38h0m0s        | 1,546,542 | 477,752,906    | 3,983,785   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | sidecar   |
| 01D5Y4KE4HXB5M4E85HK02N54P | 14-03-2019 03:00:00 | 14-03-2019 05:00:00 | 2h0m0s        | 38h0m0s        | 1,546,548 | 477,753,874    | 3,983,791   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | sidecar   |
| 01D5YBF5C7BN8DJ27C7493GRAJ | 14-03-2019 05:00:00 | 14-03-2019 07:00:00 | 2h0m0s        | 38h0m0s        | 1,544,508 | 477,740,036    | 3,981,739   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | sidecar   |
| 01D5YBF5CJPTW2X4G5KZ58NS14 | 14-03-2019 05:00:00 | 14-03-2019 07:00:00 | 2h0m0s        | 38h0m0s        | 1,544,510 | 477,740,921    | 3,981,741   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | sidecar   |


Anything else we need to know
Using Grafana v5.3.4 and v6.0.1. Could this be a Grafana bug?

@FUSAKLA
Member

FUSAKLA commented Mar 23, 2019

Hi, hmm, can you try this in the Thanos query UI directly? There is a select box where you can choose exactly what level of downsampling to display. So try explicitly raw data and the 5m resolution, possibly even the 1h.
This should rule out Grafana or anything else along the way (Trickster, possibly).

@ottoyiu
Author

ottoyiu commented Mar 26, 2019

@FUSAKLA the same thing happens if I specify 'max 5m resolution' in the Thanos query UI. The shape of the graph looks largely the same, except that it's 10x the value.

5m resolution:
1553561499_1751_25032019_2473x416

raw resolution:
1553561682_1754_25032019_2471x424

1h resolution:
1553561737_1755_25032019_2466x461

the 1h resolution results in a pretty broken graph, it seems.

@bwplotka
Member

Thanks for the report. I have seen a similar problem on our prod as well, but fortunately we can always fall back to raw data. I am pretty sure this path is not well tested, so some bug in choosing blocks might be happening.

High priority IMO.

@bwplotka
Member

Essentially there are a couple of things we need to take a closer look at:

cc @SuperQ, as I think I misled you on Slack. Max 5m / Max 1h should fall back to higher resolutions, but it seems not to work.

cc @mjd95

@mjd95
Contributor

mjd95 commented Mar 29, 2019

Thanks @bwplotka. I'll have a look into this - will first add the "Only 1h" and "Only 5m" features in the UI; agreed, this would be helpful for debugging.

@ivmaks

ivmaks commented Apr 11, 2019

@ottoyiu please compare these results:

sum(dest_bps_in{hostname=~"$hostname", exported_namespace=~"$namespace"}) by (service_name, exported_namespace) * 8
vs
sum(dest_bps_in{hostname="$hostname", exported_namespace="$namespace"}) by (service_name, exported_namespace) * 8

@ottoyiu
Author

ottoyiu commented Apr 12, 2019

@ivmaks the graph looks identical with or without the ~
without:
image

with:
Screenshot-2019-4-12 Thanos long term storage Prometheus solution(1)

@bwplotka
Member

bwplotka commented May 30, 2019

So we found and fixed an issue in the algorithm that chooses which blocks to use: #1146. If you had some blocks compacted and some not, it could incorrectly just return 0 results. This will fix:

the 1h resolution results in a pretty broken graph, it seems.

But the still-incorrect result is a different story. I am fairly sure the reason is that you use sum on downsampled data - what if you try sum_over_time? Sorry, this bit is definitely confusing even for someone who knows PromQL and how we downsample pretty well - we should document it better.
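
For illustration, here is the distinction in PromQL, using the query from this report (a sketch only - how each function maps onto the stored downsampled aggregates is my reading of the downsampling design, not a confirmed description of the internals):

# Aggregation across series at each evaluation step: at every point in time,
# sum the current value of all matching series. This is what the report uses.
sum(dest_bps_in{hostname=~"$hostname", exported_namespace=~"$namespace"}) by (service_name, exported_namespace) * 8

# Aggregation over time within each series: sum all samples of one series over
# the last 5 minutes. This is the question the per-window "sum" aggregate
# produced by downsampling is presumably meant to answer.
sum_over_time(dest_bps_in{hostname=~"$hostname", exported_namespace=~"$namespace"}[5m])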

@ottoyiu
Author

ottoyiu commented May 30, 2019

@bwplotka awesome to see #1146, will definitely try out v0.5.0-rc.0! Thank you for all the hard work, and to everyone involved.

I'm applying an instant sum because we have 3 load balancers and we want to see, at a given time, what the aggregate bytes per second is per service. I'm not sure if sum_over_time makes sense, since an over-time aggregation isn't really what we're looking for - unless I'm mistaken about what sum_over_time does.

I'll try to replicate this bug in v0.5.0 and see if something was changed that fixed it.

Edit: I can still replicate it on v0.5.0-rc.0, with and without the multiplier (times 8):
without:
1559253423_1457_30052019_2502x627

with:
1559253755_1502_30052019_2501x636

I tried using sum_over_time instead (I don't know how to use sum_over_time and still group by the two fields, so maybe this is not what you're looking for):
1559253524_1458_30052019_2497x621

@ryansheppard

We are also experiencing this bug. Our Prometheus instances are all set to scrape every 30 seconds. We noticed that with Max 5 Minute downsampling, we get a 10x increase in the value. With Max 1 Hour, we get a 12x increase over the 5 Minute downsampling, and ~120x over the raw data.

All of these values can be explained by the number of scrapes in the downsample window: we have 10 raw samples in a 5-minute window and 120 samples in a 1-hour window (rough arithmetic sketched after the screenshots below). Is the store trying to "smear" the values across the 5m/1h time frame to fill in the gaps, causing aggregations like sum to output the wrong result?

raw

5m

1h
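
A rough back-of-the-envelope check of that arithmetic (assuming, and it is only an assumption, that the per-window "sum" aggregate is what ends up being summed):

# scrape interval 30s:
#   5m window -> 300s / 30s  = 10 raw samples  -> per-window sum ~ 10x a single sample
#   1h window -> 3600s / 30s = 120 raw samples -> per-window sum ~ 120x a single sample
# which lines up with the observed 10x (Max 5m) and ~120x (Max 1h) inflation.
# Dividing the window sum by the window count should land back near the raw values
# (hypothetical sanity check; any metric name works, dest_bps_in is from this report):
sum_over_time(dest_bps_in[5m]) / count_over_time(dest_bps_in[5m])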

@alivespirit

Same thing here: the expression sum(elasticsearch_cluster_health_number_of_nodes)/count(elasticsearch_cluster_health_number_of_nodes) gives a value of 15 with raw data and 450 with downsampling.

@bwplotka
Member

Guys, is it reproducible with v0.5.0? (:

@alivespirit

alivespirit commented Jun 12, 2019

@bwplotka Actually, it is not! Just tested 0.5.0 in our setup and it works great. (I was mistaken - I had forgotten about sum.)

@bwplotka
Member

bwplotka commented Jun 12, 2019

We fixed a nasty bug (: Thanks to this: #1146

Thanks to everyone involved. Closing; we can reopen if anyone can repro it with v0.5.0 or the newest master.

@hhsnow

hhsnow commented Jun 12, 2019

@bwplotka, @ryansheppard's graphs are from 0.5.0-rc.0. We can try the v0.5.0 release today, but I don't see anything besides documentation changes.

@bwplotka bwplotka reopened this Jun 12, 2019
@therapy-lf

I have the same issue with v0.5.0

@bwplotka
Member

I think there are many things here. One was the bug with choosing the downsampled resolution: that one works now.

An additional one is with particular non-_over_time operators like sum; there is some confusion (or a bug) in handling those queries. We should look closely at that part of the code, between PromQL and the aggregations we fetch.

@vladvasiliu
Contributor

I don't think there's any bug in this, but there might be some confusion caused by the way Grafana and the Prometheus console work, namely that the graphs are sampled:

If the sample interval (= scrape interval) is smaller than the drawing interval (the $__interval variable), then samples in between are dropped.
This is as opposed to e.g. Kibana, which usually uses histograms, meaning that samples are summed.
That's why bars in Kibana go up as the selected time interval goes up, whereas in Grafana they stay the same. Basically, in Kibana requests per second become requests per minute, then requests per hour, etc.

In Thanos, down-sampled series are actually aggregated values over the interval. When you request a value from one such series, you get either an average over the down-sampling interval, or some aggregated value if there's a function involved (min/max/sum/count).

IMHO this sheds some light on what the querier returns.

This explains why sum(requests) is not the same as requests + stacking:
The first one sums a bunch of sums, whereas the latter sums a bunch of averages. Each average is sum / count (count being the number of scrapes per interval), which is exactly the difference seen in the examples above (sketched in PromQL below).

However, this kind of "bucketing" combined with the sampling in Grafana can yield surprising results, so care should be taken when building the queries.
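
To make that concrete (a PromQL sketch using the requests series from this comment; the pairing of the *_over_time functions with the stored min/max/sum/count aggregates is my reading of the design and should be treated as an assumption):

# Plain selector on downsampled data: each series returns its per-window average
# (sum / count), so stacking these in Grafana adds up averages.
requests

# sum() across series: per this discussion, the per-window "sum" aggregate is what
# gets added up, so the result is roughly count times larger than stacked averages.
sum(requests)

# Reconstructing the per-window average explicitly from the two aggregates,
# which should line up with the raw data again (and with avg_over_time(requests[5m])):
sum_over_time(requests[5m]) / count_over_time(requests[5m])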

@therapy-lf

@vladvasiliu I have the same issue in thanos-query ui, not just in grafana.

@vladvasiliu
Contributor

That's because the thanos-query UI works the same way. See below for examples.

It's a pretty generic tool that draws values for whatever time series you throw at it.

It doesn't always make sense to group the values when "zooming out". For example, if you have a time series for some status, like up, it wouldn't make sense to display "3" just because the UI step is 3 times the scraping period. The UI has no way of knowing what the value represents, so it goes with the safest route, which is sampling, i.e. dropping values.

In my opinion there should be broader documentation in Thanos about how this works and how it interacts with graphing tools. I think the most surprising things happen when the graphing tool uses a resolution in between downsampling intervals, say 20 minutes in the case of Thanos. If you sum your values, you'll get a partial sum for that period, which is weird to me.

The way to look at this is that downsampling loses resolution. Instead of five values (one every minute), you only get one every 5 minutes, and it isn't equal to any of them. (You actually get several: min, max, sum, count - see #813 - which lets you retain some idea of what the data distribution was.)

I think what's a bit confusing is that asking for just one sample gives an average, so the value isn't obviously wrong when compared to the raw data (though it should be noted that if the raw data is somewhat random, they don't match!).
People probably don't expect that summing across dimensions would actually sum different values than what's returned, without the sum, by an otherwise identical query.

The graphs below have samples scraped every minute. This is thanos-query UI v0.5.0.


Sampling and missing data: same series, different "zoom":
The first is over the last 12h. Notice the peak at 60.
Screenshot 2019-06-13 at 18 47 08

The second is over the last two days. Notice the maximum barely hits 30. There's information missing (focus on the graph that's present, the series was only created yesterday).
Screenshot 2019-06-13 at 18 48 05


Same series, 1 scrape per minute in raw data, downsamples are to 5 minutes. See how the shape of the curve changes with what is displayed.

  • 60s resolution, only raw data. Notice the peak above 8000.

60s - raw

  • 60s resolution, downsampled. The peak is gone, and there are 5m plateaus. This is a smoothed version, because it's the average for each plateau.

60s - ds

  • 60s resolution, downsampled, max. The peak is back, and the shape is roughly the same.

60s - ds - max

  • 300s raw. This is where it gets interesting. Notice the missing features. But there are no plateaus.

300s - raw

  • 300s downsampled. It's pretty much the same, just smoothed. Instead of sampling 1 of 5 values, it uses the average.

300s - ds

  • 300s downsampled, max. The peak is back.

300s - ds- max

Up until this point, all values are roughly the same as the original raw data. The next one is the confusing part. Note there's just one series, so using sum or avg on the raw data wouldn't change the values. But on the downsampled data it does, and it's... 5 times larger!
300s - ds - sum

The difference is the way those charts are read. The last one reads "in the interval between one point and the other, there were this many requests" - that's an aggregation. All the others read "at some time between the last two scrapes, there were this many requests per unit of time".

You'll have to check the series for what this unit is, and it's always the same. If the range is 5 minutes, you probably don't care. If it's a day, and you only had 2000 requests, it makes a big difference whether those were spread over the whole day or happened during a single second.

@jjneely
Contributor

jjneely commented Jun 20, 2019

So yeah, you've demonstrated spike erosion quite well here. It's definitely a well-understood side effect when you downsample by averaging, or when your graph display toolkit uses weighted averages to dynamically resize the graph.

Given that we have min, max, sum, count, and (therefore) average for each downsampled data point, I bet that we are using the sum value of the downsampled data point when we use the sum() function... which leads to these inflated results. However, when sum_over_time() is used, that is exactly what we want to do.

Given that spike erosion is usually controlled by choosing the downsampling aggregation function, do we need to expose how to select the min, max, sum, or count when working with downsampled data?

Another way to handle spike erosion is to use and aggregate histograms to build a quantile estimation. That too is going to require sum() over counter-type data, and will probably work best with max as the aggregation function.
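
If the *_over_time functions are indeed what routes a query to a particular stored aggregate (again an assumption based on this thread, not a verified fact), then a peak-preserving view of downsampled data would look something like this, next to the eroded averaged view:

# Keep the peaks instead of eroding them into averages:
max_over_time(dest_bps_in[1h])

# The averaged (spike-eroded) view for comparison:
avg_over_time(dest_bps_in[1h])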

@GiedriusS
Member

I believe this has been fixed in 0.6.0-rc.0. Please test.

@ryansheppard

Just rolled out 0.6.0-rc.0 for the querier and it looks good. Both sum and avg are returning what we expect with Max 5m Downsampling.
