This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

cleanup carbon metrics for out-of-order vs duplicate data points, cleaner names in sync with prom metrics #1288

Merged
merged 11 commits into master from differentiate_duplicate_from_too_old on Apr 23, 2019

Conversation

fkaleo (Contributor) commented Apr 17, 2019

New Carbon metric 'tank.discarded.new-value-for-timestamp'.
Prometheus metric 'discarded_samples_total' has a new reason 'new-value-for-timestamp'.
Test dashboard 'Fakemetrics - discarded samples' plots both.
To be consistent with the Prometheus metrics, the following carbon metrics were renamed:

  • 'tank.metrics_too_old' into 'tank.discarded.sample-out-of-order'
  • 'tank.add_to_closed_chunk' into 'tank.discarded.received-too-late'
  • 'input...invalid' into 'input...discarded.invalid'
  • 'input...unknown' into 'input...discarded.unknown'

Fixes #1201, fixes #1202, fixes #1203

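For context, here is a minimal sketch of the Prometheus side: the 'discarded_samples_total' counter vector gaining the 'new-value-for-timestamp' reason. This is not the PR's actual code; the namespace and help text are assumptions, and the carbon-side equivalent is a plain counter with the dotted name listed above.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// sketch only: a counter vector keyed by discard reason, mirroring
// 'discarded_samples_total'; Namespace and Help are assumptions, not
// metrictank's actual values
var discardedSamples = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "metrictank",
		Name:      "discarded_samples_total",
		Help:      "count of samples that were discarded, per reason",
	},
	[]string{"reason"},
)

func main() {
	prometheus.MustRegister(discardedSamples)

	// on rejecting a point because a different value already exists
	// for the same timestamp:
	discardedSamples.WithLabelValues("new-value-for-timestamp").Inc()

	fmt.Println(`incremented discarded_samples_total{reason="new-value-for-timestamp"}`)
}
```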
fkaleo changed the title from 'Use different metric counts for out-of-order vs duplicate data points:' to 'Use different metric counts for out-of-order vs duplicate data points' on Apr 17, 2019
fkaleo marked this pull request as ready for review on April 17, 2019 15:19
fkaleo (Contributor, Author) commented Apr 17, 2019

I'm not very happy with the new mdata/errors package. Going a little further, we could move the aggmetric errors into it and remove the conflicting 'errors'/'mdataerrors' package name.

The other solution I considered was having different error types/values for chunk and reorder_buffer, which would imply duplicating the switch statement in discardedMetricsInc. That duplication could be avoided with a common interface.
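A rough sketch of the shared-errors idea (all names here are hypothetical, not the merged code): sentinel error values live in one package, and a single switch maps them to discard reasons, so the chunk and the reorder buffer don't each need their own copy.

```go
package main

import (
	"errors"
	"fmt"
)

// hypothetical sentinel errors shared by the chunk and the reorder buffer,
// so a point rejected by either component carries the same error value
var (
	ErrMetricTooOld               = errors.New("point is too old")
	ErrMetricNewValueForTimestamp = errors.New("new value for existing timestamp")
)

// one classification switch instead of one per error source; this is the
// role discardedMetricsInc plays in the PR
func discardReason(err error) string {
	switch err {
	case ErrMetricTooOld:
		return "sample-out-of-order"
	case ErrMetricNewValueForTimestamp:
		return "new-value-for-timestamp"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(discardReason(ErrMetricTooOld)) // sample-out-of-order
}
```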

fkaleo requested a review from Dieterbe on April 18, 2019 07:05
fkaleo (Contributor, Author) commented Apr 18, 2019

Test with:

  • docker stack 'docker-dev-custom-cfg-kafka'
  • fakemetrics --kafka-mdm-addr localhost:9092 bad --duplicate
  • opening the Grafana dashboard 'Fakemetrics - discarded samples'

Dieterbe (Contributor)

I'm not very happy with the new mdata/errors package. Going a little further, we could move the aggmetric errors into it and remove the conflicting 'errors'/'mdataerrors' package name.

yeah, the dependencies in mdata are a bit intricate. i suspect the right solution is moving mdata/chunk into mdata, and then we can move the errors into mdata as well.
Sometimes trying to split things up into separate packages creates problems (e.g. you can't do circular imports). Since the chunk and the ROB are both things to which we can add points, and hence need to share some of the errors that can happen, they probably belong in the same package.

Let's think about a refactor later though. For now this will work.

    discardedNewValueForTimestamp.Inc()
default:
    reason = "unknown"
}
Dieterbe (Contributor):

from #1201:

Every point we receive should either be persisted or we should increment a counter to say we discarded

thus in the unknown case, we should also increment a carbon counter.

fkaleo (Author):

Shall we create a new catch-all sort of counter? tank.discarded.unknown perhaps?
(note that this case does not happen atm)
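Sketched, such a catch-all might look like this (a minimal stand-in using an atomic counter; in metrictank the counter would come from its stats package, and all names here are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// stand-in for the proposed catch-all carbon counter
// 'tank.discarded.unknown'; metrictank would use a counter from its stats
// package instead of a raw uint64
var discardedUnknown uint64

var errSampleOutOfOrder = errors.New("sample out of order") // hypothetical

func classifyAndCount(err error) string {
	switch {
	case errors.Is(err, errSampleOutOfOrder):
		return "sample-out-of-order"
	default:
		// per the review discussion: even unclassified discards get counted
		atomic.AddUint64(&discardedUnknown, 1)
		return "unknown"
	}
}

func main() {
	fmt.Println(classifyAndCount(errors.New("unexpected"))) // unknown
	fmt.Println(atomic.LoadUint64(&discardedUnknown))       // 1
}
```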

fkaleo (Author):

Done

Dieterbe (Contributor) commented Apr 18, 2019

I want a clear paraphrasing of all the conversations in all those previous tickets. Seems there were a couple of loose ends (e.g. new_value_for_timestamp or duplicate_ts; Florian suggested the former and that sounds good to me).

So here's my TLDR objective:

  1. with rob enabled A) reject any dupes (regardless if it's a dupe wrt the last point or a prior point) + increment tank.discarded.new_value_for_timestamp B) allow ooo if we can handle it and it's not a dupe, C) if too old to handle, reject and increment tank.discarded.too_old (replaces tank.metrics_too_old)
  2. without rob, A) and C) (note: requires change in chunk.Push)
  3. any reason why we don't ingest a point should have a corresponding something.discarded.reason carbon metric and prometheus metric, except for internal errors like decode errors
  4. let's improve consistency between carbon and prometheus with the metric naming

Florian, per 4), can you rename tank.metrics_too_old to tank.discarded.too_old ? also update the dashboards accordingly (no need to query for metrics_too_old, just query for discarded.*)
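For illustration, a rough sketch of rules 1A–1C under assumptions (hypothetical types and names; the reason strings follow this comment's proposal, whereas the PR ultimately named the too-old case tank.discarded.sample-out-of-order):

```go
package main

import "fmt"

// sketch of the intended reorder-buffer (ROB) admission rules from the list
// above; everything here is hypothetical, not metrictank's actual code
type rob struct {
	newest uint32          // newest timestamp accepted so far
	window uint32          // how far out of order we can still reorder
	seen   map[uint32]bool // timestamps we hold a value for (a real ROB prunes old ones)
}

// returns "" if the point is accepted, else the discard reason
func (r *rob) admit(ts uint32) string {
	switch {
	case r.seen[ts]:
		// 1A: dupe wrt the last point or any buffered prior point
		return "new-value-for-timestamp"
	case ts+r.window <= r.newest:
		// 1C: too old for the buffer to reorder
		return "too-old"
	default:
		// 1B: in order, or out of order but within the reorder window
		r.seen[ts] = true
		if ts > r.newest {
			r.newest = ts
		}
		return ""
	}
}

func main() {
	r := &rob{window: 60, seen: map[uint32]bool{}}
	fmt.Println(r.admit(100)) // accepted -> ""
	fmt.Println(r.admit(100)) // dupe -> "new-value-for-timestamp"
	fmt.Println(r.admit(50))  // within window -> ""
	fmt.Println(r.admit(200)) // accepted -> ""
	fmt.Println(r.admit(120)) // 120+60 <= 200 -> "too-old"
}
```

Without the ROB (case 2), only the dupe check against the last point and the too-old check apply, which is why chunk.Push needs a change to tell the two apart.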

Dieterbe (Contributor):

it should also be noted that metrics can be incorrectly classified as "too old" when "new value for ts" would have been more correct, e.g. for a ts older than the ROB's retention, or, when not using the ROB, for any ts older than the last one.
I think this is an important caveat that should be documented in the source code, as well as in the metric description (well, metrics2docs will take care of that).
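For illustration, the caveat could live in the metric's doc comment, which metrics2docs extracts into the docs. This is a sketch assuming metrictank's stats.NewCounter32 API and the final metric name; the variable name is hypothetical, not the PR's exact code.

```go
package mdata

import "github.com/grafana/metrictank/stats"

// metric tank.discarded.sample-out-of-order is points discarded because their
// timestamp is too old to be handled. caveat: a point may be classified as
// out-of-order when "new value for timestamp" would have been more correct,
// e.g. when its ts is older than the ROB's retention, or, without the ROB,
// when its ts is not newer than the last point.
var discardedSampleOutOfOrder = stats.NewCounter32("tank.discarded.sample-out-of-order")
```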

Dieterbe (Contributor) commented Apr 18, 2019

tank.add_to_closed_chunk should also be renamed, as should some of the metrics in NewDefaultHandler (invalid, unknown, ..)

Dieterbe (Contributor) left a review:

looks good, but needs a bit more work, see comments.

fkaleo (Contributor, Author) commented Apr 18, 2019

I want a clear paraphrasing of all the conversations in all those previous tickets. Seems there were a couple of loose ends (e.g. new_value_for_timestamp or duplicate_ts; Florian suggested the former and that sounds good to me).

So here's my TLDR objective:

  1. with rob enabled A) reject any dupes (regardless if it's a dupe wrt the last point or a prior point) + increment tank.discarded.new_value_for_timestamp B) allow ooo if we can handle it and it's not a dupe, C) if too old to handle, reject and increment tank.discarded.too_old (replaces tank.metrics_too_old)
  2. without rob, A) and C) (note: requires change in chunk.Push)
  3. any reason why we don't ingest a point should have a corresponding something.discarded.reason carbon metric and prometheus metric, except for internal errors like decode errors
  4. let's improve consistency between carbon and prometheus with the metric naming

Florian, per 4), can you rename tank.metrics_too_old to tank.discarded.too_old ? also update the dashboards accordingly (no need to query for metrics_too_old, just query for discarded.*)

Just to be clear: you would like the carbon metrics renamed as per this table? (current name in column 2, new name in column 3)
[screenshot: table of current and proposed carbon metric names]

As we are breaking compatibility with pre-existing metrics, I would suggest we go the full route and actually use the same names as the prometheus metrics, for example renaming tank.metrics_too_old to tank.discarded.sample-out-of-order, etc.
Also I think this renaming should be done in a separate PR, as it's very much independent from the new metric we are introducing here.

Dieterbe (Contributor):

yes.
and i think it should be in this PR, as all of this addresses what was discussed in #1201

},
{
  "refId": "F",
- "target": "alias(sumSeries(perSecond(metrictank.stats.$environment.$instance.tank.add_to_closed_chunk.counter32)), 'add-to-saved')"
+ "target": "alias(sumSeries(perSecond(metrictank.stats.$environment.$instance.tank.discarded.received-too-late.counter32)), 'add-to-saved')"
Dieterbe (Contributor) commented Apr 18, 2019:

ahh.. i love a clean dashboard json diff done like this.
was it a lot of work? I suspect you either did a manual json edit or had to do lots of git add -p or git checkout -p if you exported out of grafana.

fkaleo (Author):

This time I did a manual json edit. Last time I did an export from grafana and managed to get a decent diff.

Dieterbe (Contributor):

@fkaleo : this looks solid! nice work.

Dieterbe (Contributor) left a review:

👍

Dieterbe changed the title from 'Use different metric counts for out-of-order vs duplicate data points' to 'cleanup carbon metrics for out-of-order vs duplicate data points, cleaner names in sync with prom metrics' on Apr 23, 2019
Dieterbe merged commit 4c862d8 into master on Apr 23, 2019
Dieterbe deleted the differentiate_duplicate_from_too_old branch on April 23, 2019 20:51