This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Rollup indicator #1481

Merged
merged 12 commits into master from rollup-indicator-1091
Oct 7, 2019

Conversation


@Dieterbe Dieterbe commented Sep 25, 2019

fix #1091

Here's my vision:
For every series there is a SeriesSource. The SeriesSource describes how the data was generated, via SeriesSourceProperties (which storage-schemas rule applied to our data, which archive we read from, and what kind of normalization and runtime consolidation was applied), along with a count of series matching those properties.

In the simplest case (fetching one timeseries, and not running a function like sumSeries() or merging series because of an interval change), the source is a single set of properties with a count of 1.

Things get interesting (mergeSeries, sumSeries(), etc.) when we need to combine multiple series into one new output series. In all these cases we merge the SeriesSources together, adding up the counts per set of SeriesSourceProperties.

Ultimately then, for any series returned in the HTTP response body, we know whether it was composed of just one source timeseries or of several, and how they were generated.

Conceptually I think it's fairly simple. The main challenge is updating all processing functions accordingly (for now I've only done a few to showcase the idea), and making sure not to make any permanent changes to SeriesSource maps when we may reuse series in the processing pipeline.
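The merging idea can be sketched roughly like this (type and field names here are illustrative stand-ins, not this PR's exact API): combining sources means adding up counts per distinct set of properties.

```go
package main

import "fmt"

// SeriesSourceProperties is an illustrative stand-in for the properties
// described above (storage-schemas rule, archive, normalization, runtime
// consolidation); the names are assumptions for this sketch.
type SeriesSourceProperties struct {
	SchemaID   uint16 // storage-schemas rule the series corresponds to
	Archive    uint8  // archive that was read from
	AggNumNorm uint32 // aggNum used for normalization
	AggNumRC   uint32 // aggNum used for runtime consolidation
	Count      uint32 // number of series matching these properties
}

// propsKey identifies a set of properties irrespective of Count.
type propsKey struct {
	SchemaID   uint16
	Archive    uint8
	AggNumNorm uint32
	AggNumRC   uint32
}

// Merge combines two sources into one, adding up the counts per distinct
// set of properties, as when sumSeries() folds several inputs into one output.
func Merge(a, b []SeriesSourceProperties) []SeriesSourceProperties {
	counts := map[propsKey]uint32{}
	var order []propsKey
	for _, p := range append(append([]SeriesSourceProperties{}, a...), b...) {
		k := propsKey{p.SchemaID, p.Archive, p.AggNumNorm, p.AggNumRC}
		if _, seen := counts[k]; !seen {
			order = append(order, k)
		}
		counts[k] += p.Count
	}
	out := make([]SeriesSourceProperties, 0, len(order))
	for _, k := range order {
		out = append(out, SeriesSourceProperties{k.SchemaID, k.Archive, k.AggNumNorm, k.AggNumRC, counts[k]})
	}
	return out
}

func main() {
	a := []SeriesSourceProperties{{SchemaID: 0, Archive: 0, AggNumNorm: 1, Count: 1}}
	b := []SeriesSourceProperties{
		{SchemaID: 0, Archive: 0, AggNumNorm: 1, Count: 2}, // same properties: counts add up
		{SchemaID: 1, Archive: 1, AggNumNorm: 6, Count: 1}, // distinct properties: kept separately
	}
	merged := Merge(a, b)
	fmt.Println(len(merged), merged[0].Count, merged[1].Count) // prints: 2 3 1
}
```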

The output format for now looks like this (notice how we already had response-global metadata; now we're adding per-series metadata under "source").
Note that instead of (or in addition to) schema-id, I plan to return the actual storage-schemas rule being used.

~ ❯❯❯ http  'http://localhost:6060/render?target=metrictank.stats.docker-env.default.input.carbon.metricdata.received.counter32&from=-5s&meta=true'
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 399
Content-Type: application/json
Date: Wed, 25 Sep 2019 12:57:19 GMT
Vary: Accept-Encoding

{
    "meta": {
        "stats": {
            "executeplan.cache-hit-partial.count": 0,
            "executeplan.cache-hit.count": 0,
            "executeplan.cache-miss.count": 0,
            "executeplan.chunks-from-cache.count": 0,
            "executeplan.chunks-from-store.count": 0,
            "executeplan.chunks-from-tank.count": 1,
            "executeplan.get-targets.ms": 0,
            "executeplan.plan-run.ms": 0,
            "executeplan.points-fetch.count": 5,
            "executeplan.points-return.count": 5,
            "executeplan.prepare-series.ms": 0,
            "executeplan.resolve-series.ms": 0,
            "executeplan.series-fetch.count": 1
        }
    },
    "series": [
        {
            "datapoints": [
                [
                    8152,
                    1569416235
                ],
                [
                    8559,
                    1569416236
                ],
                [
                    8864,
                    1569416237
                ],
                [
                    9162,
                    1569416238
                ],
                [
                    9450,
                    1569416239
                ]
            ],
            "source": [
                {
                    "aggnum-norm": 1,
                    "aggnum-rc": 0,
                    "archive": 0,
                    "consolidate-normfetch": "AverageConsolidator",
                    "consolidate-rc": "NoneConsolidator",
                    "count": 1,
                    "schema-id": 0
                }
            ],
            "tags": {
                "name": "metrictank.stats.docker-env.default.input.carbon.metricdata.received.counter32"
            },
            "target": "metrictank.stats.docker-env.default.input.carbon.metricdata.received.counter32"
        }
    ],
    "version": "v0.1"
}

cc @torkelo


Dieterbe commented Sep 30, 2019

I have:

  • refactored it to bypass the msgp encoding problem: instead of a non-string-keyed map, we now use slices
  • cleaned up the code: instead of "source", it is now a meta section for the series
  • updated all expression functions up until per-second

left todo:

  • make sure it works across the cluster, and in a cluster while upgrading from an older version
  • update remaining expression functions and do some interesting tests with processing functions that adjust the meta sections
  • test for cases where series from different archives get used

feedback anyone? @torkelo @shanson7

The per-series metadata looks like this now.
Notes:

  • the format will be changed later to show the actual storage-schemas rule used
  • aggNum means the number of points consolidated into one point (if >1, it means runtime consolidation was applied)
	SchemaID              uint16                     // id of storage-schemas rule this series corresponds to
	Archive               uint8                      // which archive was being read from
	AggNumNorm            uint32                     // aggNum for normalization
	AggNumRC              uint32                     // aggNum for runtime consolidation
	ConsolidatorNormFetch consolidation.Consolidator // consolidator used for normalization and reading from store (if applicable)
	ConsolidatorRC        consolidation.Consolidator // consolidator used for runtime consolidation (if applicable)
	Count                 uint32                     // number of series corresponding to these properties


$ http  'http://localhost:6060/render?target=sum(metrictank.stats.docker-env.default.input.carbon.*.received.counter32)&from=-5s&meta=true'
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 400
Content-Type: application/json
Date: Mon, 30 Sep 2019 19:48:33 GMT
Vary: Accept-Encoding

{
    "meta": {
        "stats": {
            "executeplan.cache-hit-partial.count": 0,
            "executeplan.cache-hit.count": 0,
            "executeplan.cache-miss.count": 0,
            "executeplan.chunks-from-cache.count": 0,
            "executeplan.chunks-from-store.count": 0,
            "executeplan.chunks-from-tank.count": 3,
            "executeplan.get-targets.ms": 0,
            "executeplan.plan-run.ms": 0,
            "executeplan.points-fetch.count": 15,
            "executeplan.points-return.count": 15,
            "executeplan.prepare-series.ms": 0,
            "executeplan.resolve-series.ms": 1,
            "executeplan.series-fetch.count": 3
        }
    },
    "series": [
        {
            "datapoints": [
                [
                    200503,
                    1569872909
                ],
                [
                    204823,
                    1569872910
                ],
                [
                    209144,
                    1569872911
                ],
                [
                    213463,
                    1569872912
                ],
                [
                    217778,
                    1569872913
                ]
            ],
            "meta": [
                {
                    "aggnum-norm": 1,
                    "aggnum-rc": 0,
                    "archive": 0,
                    "consolidate-normfetch": "AverageConsolidator",
                    "consolidate-rc": "NoneConsolidator",
                    "count": 3,
                    "schema-id": 0
                }
            ],
            "target": "sumSeries(metrictank.stats.docker-env.default.input.carbon.*.received.counter32)"
        }
    ],
    "version": "v0.1"
}

@shanson7

What happens when the request gets proxied to graphite-web?

@Dieterbe

What happens when the request gets proxied to graphite-web?

graphite-web will not ask metrictank for metadata, so it won't get it, it won't add any by itself, and will return the traditional response format.

to make this work with graphite, we need to:

  • make graphite request meta-enabled data from metrictank
  • transport the metadata through its internal request path, updating it where necessary (e.g. normalize(), runtime consolidation, etc)
  • processing functions should update it like MT does (all functions, including the ones MT already supports, because to get a feature like this merged in graphite it should be implemented completely; also, MT proxies entire requests, so if unsupportedInMT(sumSeries(foo.*)) gets called, graphite will still execute the sumSeries)

This is a very significant amount of work. I would much rather spend our resources in porting over more graphite functions to metrictank.
Alternatively, we could probably simplify this feature to reduce the amount of work with Graphite (e.g. make it a simple global "we had to do normalization or runtime consolidation" flag), but that would still mean touching many functions, and would make this feature more of a one-off hack than a clean solution that integrates well with the rest of the stack.

@shanson7

I would much rather spend our resources in porting over more graphite functions to metrictank.

I agree with this sentiment (for a number of reasons) but I think it's worth thinking on.

Having &meta=true sometimes return the new format will be annoying for clients, since they will have to conditionally parse two different formats. Perhaps the grafana side needs to handle this anyway (since real graphite setups wouldn't support the "meta" concept).

Also, the rollup indicator will not be something that can be relied on, since it won't indicate whether the request was proxied (which itself is nice to know). Again, perhaps grafana could assume that the request was handled by graphite proper if the old format was returned.

I don't think that initial versions of this feature need to handle this, but perhaps there could be something that metrictank can do on the way back from graphite (since it is proxying the response back) to smooth some of this?

@Dieterbe

Having &meta=true sometimes return the new format will be annoying for clients, since they will have to conditionally parse two different formats. Perhaps the grafana side needs to handle this anyway (since real graphite setups wouldn't support the "meta" concept).

Yes. Note that this is already the case for the stats.

perhaps there could be something that metrictank can do on the way back from graphite (since it is proxying the response back) to smooth some of this?

Right. Do you mean it could modify the response and insert a meta section saying "sorry, we don't have any metadata"? In my view this is no better than grafana having an additional indicator state for "no metadata available".

@shanson7

do you mean it could modify the response

Yeah, something like that. It could indicate in a canonical format that the request was, in fact, proxied. It wouldn't really help with the rollup indication, but it would provide some context. I'm not saying it should be done this way, just trying to discuss options.

@Dieterbe Dieterbe force-pushed the rollup-indicator-1091 branch 2 times, most recently from e278eb4 to 409a490 Compare October 1, 2019 04:25
this allows us to represent it smaller in memory
properties regarding:
fetch, normalization, function processing, and runtime consolidation.
@Dieterbe Dieterbe force-pushed the rollup-indicator-1091 branch 2 times, most recently from 3af4ea8 to 08c858c Compare October 1, 2019 13:02

Dieterbe commented Oct 2, 2019

feedback from awoods:

  • most users don't need all this information (dieter counterpoint: the information seems cheap to collect, and is there for the few power users who can use it. grafana can provide a simple UI that just shows, for example, whether there was any runtime consolidation or not)
  • make the information easier to understand (e.g. mention interval changes rather than aggNum numbers, and structure the fields in a way that clarifies the processing that was applied)


Dieterbe commented Oct 3, 2019

  • make sure it works across cluster, and in cluster while upgrading from older version -> checked. counts get added up across cluster. older nodes obviously don't return metadata in render responses, and when their responses are aggregated by a newer node, their lack of metadata is treated as count=0
  • test for cases where series with different intervals get merged -> looks good. 2 sets of properties, differentiated by their aggnum-norm, and the consolidator respects consolidateBy (subject to availability of requested aggregate)
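The cross-cluster behaviour above can be sketched like this (names and types are illustrative, not the real cluster code): an older peer's response simply carries no metadata, which the merging node treats as an empty contribution (count 0).

```go
package main

import "fmt"

// entry is a pared-down per-series meta entry, keeping only the fields
// needed for this sketch; the names are assumptions for illustration.
type entry struct {
	AggNumNorm uint32
	Count      uint32
}

// mergeMeta folds per-series metadata from several cluster peers into one
// section, adding counts for matching properties. A pre-upgrade peer returns
// no metadata at all, which shows up here as a nil slice and contributes
// nothing (count 0).
func mergeMeta(peers ...[]entry) []entry {
	totals := map[uint32]uint32{}
	var order []uint32
	for _, meta := range peers { // a nil slice ranges zero times
		for _, e := range meta {
			if _, seen := totals[e.AggNumNorm]; !seen {
				order = append(order, e.AggNumNorm)
			}
			totals[e.AggNumNorm] += e.Count
		}
	}
	out := make([]entry, 0, len(order))
	for _, k := range order {
		out = append(out, entry{AggNumNorm: k, Count: totals[k]})
	}
	return out
}

func main() {
	newNode := []entry{{AggNumNorm: 1, Count: 2}, {AggNumNorm: 6, Count: 1}}
	oldNode := []entry(nil) // pre-upgrade node: no metadata in its response
	fmt.Println(mergeMeta(newNode, oldNode)) // prints: [{1 2} {6 1}]
}
```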

because our messagepack library doesn't like (de)serializing maps with
non-string keys. This is equivalent and avoids serialization and
interop issues with JSON.
* update docs to include tags and meta section
* fix a bunch of bugs where previously data wasn't being deeply copied
* cleanly copy the meta section where needed
* clean up code with some helper functions
* fix cases where QueryFrom/QueryTo was not being set
* simplify code
* don't like constructing new series listing all properties explicitly, because when we add a new series property, it can be forgotten or we must update all render functions

Dieterbe commented Oct 7, 2019

force push to take out this commit:

Refs: v0.13.0-44-g3a147df1
Author:     Dieter Plaetinck <[email protected]>
AuthorDate: Wed Sep 25 20:04:24 2019 +0200
Commit:     Dieter Plaetinck <[email protected]>
CommitDate: Tue Oct 1 08:36:22 2019 -0400

    DO NOT MERGE - to test half-upgraded cluster
---
 docker/docker-cluster/docker-compose.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docker/docker-cluster/docker-compose.yml b/docker/docker-cluster/docker-compose.yml
index dfa5ebee..4143c640 100644
--- a/docker/docker-cluster/docker-compose.yml
+++ b/docker/docker-cluster/docker-compose.yml
@@ -53,7 +53,7 @@ services:

   metrictank2:
     hostname: metrictank2
-    image: grafana/metrictank
+    image: grafana/metrictank:v0.13.0
     expose:
      - 6060
     volumes:
@@ -78,7 +78,7 @@ services:

   metrictank3:
     hostname: metrictank3
-    image: grafana/metrictank
+    image: grafana/metrictank:v0.13.0
     expose:
      - 6060
     volumes:

@Dieterbe Dieterbe changed the title from "WIP Rollup indicator" to "Rollup indicator" Oct 7, 2019
api/models/series.go (review comment, outdated, resolved)
Co-Authored-By: Anthony Woods <[email protected]>
@Dieterbe Dieterbe merged commit 61b5e89 into master Oct 7, 2019
@Dieterbe Dieterbe deleted the rollup-indicator-1091 branch October 7, 2019 19:13
@Dieterbe Dieterbe modified the milestones: sprint-2, sprint-1 Oct 7, 2019

Successfully merging this pull request may close these issues.

indication of returned data is rollup data
3 participants