This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

cleanup carbon metrics for out-of-order vs duplicate data points, cleaner names in sync with prom metrics #1288

Merged
merged 11 commits into master from differentiate_duplicate_from_too_old on Apr 23, 2019

Conversation

fkaleo (Contributor) commented Apr 17, 2019

New Carbon metric 'tank.discarded.new-value-for-timestamp'.
Prometheus metric 'discarded_samples_total' has a new reason 'new-value-for-timestamp'.
Test dashboard 'Fakemetrics - discarded samples' plots both.
To be consistent with the Prometheus metrics, the following carbon metrics were renamed:

  • 'tank.metrics_too_old' into 'tank.discarded.sample-out-of-order'
  • 'tank.add_to_closed_chunk' into 'tank.discarded.received-too-late'
  • 'input...invalid' into 'input...discarded.invalid'
  • 'input...unknown' into 'input...discarded.unknown'

Fixes #1201, fixes #1202, fixes #1203

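For context, here is a minimal sketch of the Prometheus side: the 'discarded_samples_total' counter vector gaining the 'new-value-for-timestamp' reason. This is not the PR's actual code; the namespace and help text are assumptions, and the carbon-side equivalent is a plain counter with the dotted name listed above.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// sketch only: a counter vector keyed by discard reason, mirroring
// 'discarded_samples_total'; Namespace and Help are assumptions, not
// metrictank's actual values
var discardedSamples = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "metrictank",
		Name:      "discarded_samples_total",
		Help:      "count of samples that were discarded, per reason",
	},
	[]string{"reason"},
)

func main() {
	prometheus.MustRegister(discardedSamples)

	// on rejecting a point because a different value already exists
	// for the same timestamp:
	discardedSamples.WithLabelValues("new-value-for-timestamp").Inc()

	fmt.Println(`incremented discarded_samples_total{reason="new-value-for-timestamp"}`)
}
```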
fkaleo changed the title from 'Use different metric counts for out-of-order vs duplicate data points:' to 'Use different metric counts for out-of-order vs duplicate data points' on Apr 17, 2019
fkaleo marked this pull request as ready for review on April 17, 2019 15:19
fkaleo (Contributor, Author) commented Apr 17, 2019

I'm not very happy with the new mdata/errors package. Going a little further, we could move the aggmetric errors into it and remove the conflicting 'errors'/'mdataerrors' package name.

The other solution I considered was having different error types/values for chunk and reorder_buffer, which would imply duplicating the switch statement in discardedMetricsInc. That duplication could be avoided with a common interface.
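A rough sketch of the shared-errors idea (all names here are hypothetical, not the merged code): sentinel error values live in one package, and a single switch maps them to discard reasons, so the chunk and the reorder buffer don't each need their own copy.

```go
package main

import (
	"errors"
	"fmt"
)

// hypothetical sentinel errors shared by the chunk and the reorder buffer,
// so a point rejected by either component carries the same error value
var (
	ErrMetricTooOld               = errors.New("point is too old")
	ErrMetricNewValueForTimestamp = errors.New("new value for existing timestamp")
)

// one classification switch instead of one per error source; this is the
// role discardedMetricsInc plays in the PR
func discardReason(err error) string {
	switch err {
	case ErrMetricTooOld:
		return "sample-out-of-order"
	case ErrMetricNewValueForTimestamp:
		return "new-value-for-timestamp"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(discardReason(ErrMetricTooOld)) // sample-out-of-order
}
```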

fkaleo requested a review from Dieterbe on April 18, 2019 07:05
fkaleo (Contributor, Author) commented Apr 18, 2019

Test with:

  • docker stack 'docker-dev-custom-cfg-kafka'
  • fakemetrics --kafka-mdm-addr localhost:9092 bad --duplicate
  • opening the Grafana dashboard 'Fakemetrics - discarded samples'

Dieterbe (Contributor)

I'm not very happy with the new mdata/errors package. Going a little further, we could move the aggmetric errors into it and remove the conflicting 'errors'/'mdataerrors' package name.

yeah, the dependencies in mdata are a bit intricate. i suspect the right solution is moving mdata/chunk into mdata, and then we can move the errors into mdata as well.
Sometimes trying to split things up into separate packages creates problems (e.g. you can't do circular imports). Since the chunk and the ROB are both things to which we can add points, and hence need to share some of the errors that can happen, they probably belong in the same package.

Let's think about a refactor later though. For now this will work.

    discardedNewValueForTimestamp.Inc()
default:
    reason = "unknown"
}
Dieterbe (Contributor):

from #1201:

Every point we receive should either be persisted or we should increment a counter to say we discarded

thus in the unknown case, we should also increment a carbon counter.

fkaleo (Author):

Shall we create a new catch-all sort of counter? tank.discarded.unknown perhaps?
(note that this case does not happen atm)
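Sketched, such a catch-all might look like this (a minimal stand-in using an atomic counter; in metrictank the counter would come from its stats package, and all names here are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// stand-in for the proposed catch-all carbon counter
// 'tank.discarded.unknown'; metrictank would use a counter from its stats
// package instead of a raw uint64
var discardedUnknown uint64

var errSampleOutOfOrder = errors.New("sample out of order") // hypothetical

func classifyAndCount(err error) string {
	switch {
	case errors.Is(err, errSampleOutOfOrder):
		return "sample-out-of-order"
	default:
		// per the review discussion: even unclassified discards get counted
		atomic.AddUint64(&discardedUnknown, 1)
		return "unknown"
	}
}

func main() {
	fmt.Println(classifyAndCount(errors.New("unexpected"))) // unknown
	fmt.Println(atomic.LoadUint64(&discardedUnknown))       // 1
}
```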

fkaleo (Author):

Done

Dieterbe (Contributor) commented Apr 18, 2019

I want a clear paraphrasing of all the conversations in all those previous tickets. Seems there were a couple of loose ends (e.g. new_value_for_timestamp or duplicate_ts; Florian suggested the former and that sounds good to me).

So here's my TLDR objective:

  1. with rob enabled A) reject any dupes (regardless if it's a dupe wrt the last point or a prior point) + increment tank.discarded.new_value_for_timestamp B) allow ooo if we can handle it and it's not a dupe, C) if too old to handle, reject and increment tank.discarded.too_old (replaces tank.metrics_too_old)
  2. without rob, A) and C) (note: requires change in chunk.Push)
  3. any reason why we don't ingest a point should have a corresponding something.discarded.reason carbon metric and prometheus metric, except for internal errors like decode errors
  4. let's improve consistency between carbon and prometheus with the metric naming

Florian, per 4), can you rename tank.metrics_too_old to tank.discarded.too_old ? also update the dashboards accordingly (no need to query for metrics_too_old, just query for discarded.*)
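For illustration, a rough sketch of rules 1A–1C under assumptions (hypothetical types and names; the reason strings follow this comment's proposal, whereas the PR ultimately named the too-old case tank.discarded.sample-out-of-order):

```go
package main

import "fmt"

// sketch of the intended reorder-buffer (ROB) admission rules from the list
// above; everything here is hypothetical, not metrictank's actual code
type rob struct {
	newest uint32          // newest timestamp accepted so far
	window uint32          // how far out of order we can still reorder
	seen   map[uint32]bool // timestamps we hold a value for (a real ROB prunes old ones)
}

// returns "" if the point is accepted, else the discard reason
func (r *rob) admit(ts uint32) string {
	switch {
	case r.seen[ts]:
		// 1A: dupe wrt the last point or any buffered prior point
		return "new-value-for-timestamp"
	case ts+r.window <= r.newest:
		// 1C: too old for the buffer to reorder
		return "too-old"
	default:
		// 1B: in order, or out of order but within the reorder window
		r.seen[ts] = true
		if ts > r.newest {
			r.newest = ts
		}
		return ""
	}
}

func main() {
	r := &rob{window: 60, seen: map[uint32]bool{}}
	fmt.Println(r.admit(100)) // accepted -> ""
	fmt.Println(r.admit(100)) // dupe -> "new-value-for-timestamp"
	fmt.Println(r.admit(50))  // within window -> ""
	fmt.Println(r.admit(200)) // accepted -> ""
	fmt.Println(r.admit(120)) // 120+60 <= 200 -> "too-old"
}
```

Without the ROB (case 2), only the dupe check against the last point and the too-old check apply, which is why chunk.Push needs a change to tell the two apart.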

Dieterbe (Contributor):

it should also be noted that metrics can be incorrectly classified as "too old" when "new value for ts" would have been more correct, e.g. for a ts older than the ROB's retention, or, when not using the ROB, for any ts older than the last one.
I think this is an important caveat that should be documented in the source code, as well as in the metric description (well, metrics2docs will take care of that).
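For illustration, the caveat could live in the metric's doc comment, which metrics2docs extracts into the docs. This is a sketch assuming metrictank's stats.NewCounter32 API and the final metric name; the variable name is hypothetical, not the PR's exact code.

```go
package mdata

import "github.com/grafana/metrictank/stats"

// metric tank.discarded.sample-out-of-order is points discarded because their
// timestamp is too old to be handled. caveat: a point may be classified as
// out-of-order when "new value for timestamp" would have been more correct,
// e.g. when its ts is older than the ROB's retention, or, without the ROB,
// when its ts is not newer than the last point.
var discardedSampleOutOfOrder = stats.NewCounter32("tank.discarded.sample-out-of-order")
```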

Dieterbe (Contributor) commented Apr 18, 2019

tank.add_to_closed_chunk should also be renamed, as should some of the metrics in NewDefaultHandler (invalid, unknown, ..)

Dieterbe (Contributor) left a review:

looks good, but needs a bit more work, see comments.

fkaleo (Contributor, Author) commented Apr 18, 2019

I want a clear paraphrasing of all the conversations in all those previous tickets. Seems there were a couple of loose ends (e.g. new_value_for_timestamp or duplicate_ts; Florian suggested the former and that sounds good to me).

So here's my TLDR objective:

  1. with rob enabled A) reject any dupes (regardless if it's a dupe wrt the last point or a prior point) + increment tank.discarded.new_value_for_timestamp B) allow ooo if we can handle it and it's not a dupe, C) if too old to handle, reject and increment tank.discarded.too_old (replaces tank.metrics_too_old)
  2. without rob, A) and C) (note: requires change in chunk.Push)
  3. any reason why we don't ingest a point should have a corresponding something.discarded.reason carbon metric and prometheus metric, except for internal errors like decode errors
  4. let's improve consistency between carbon and prometheus with the metric naming

Florian, per 4), can you rename tank.metrics_too_old to tank.discarded.too_old ? also update the dashboards accordingly (no need to query for metrics_too_old, just query for discarded.*)

Just to be clear: you would like the carbon metrics renamed as per this table? (current name in column 2, new name in column 3)
[screenshot: table of current and proposed carbon metric names]

As we are breaking compatibility with pre-existing metrics, I would suggest we go the full route and actually use the same names as the prometheus metrics, for example renaming tank.metrics_too_old to tank.discarded.sample-out-of-order, etc.
Also I think this renaming should be done in a separate PR, as it's very much independent from the new metric we are introducing here.

Dieterbe (Contributor):

yes.
and i think it should be in this PR, as all of this addresses what was discussed in #1201

},
{
  "refId": "F",
- "target": "alias(sumSeries(perSecond(metrictank.stats.$environment.$instance.tank.add_to_closed_chunk.counter32)), 'add-to-saved')"
+ "target": "alias(sumSeries(perSecond(metrictank.stats.$environment.$instance.tank.discarded.received-too-late.counter32)), 'add-to-saved')"
Dieterbe (Contributor) commented Apr 18, 2019:

ahh.. i love a clean dashboard json diff done like this.
was it a lot of work? I suspect you either did a manual json edit or had to do lots of git add -p or git checkout -p if you exported out of grafana.

fkaleo (Author):

This time I did a manual json edit. Last time I did an export from grafana and managed to get a decent diff.

Dieterbe (Contributor):

@fkaleo : this looks solid! nice work.

Dieterbe (Contributor) left a review:

👍

Dieterbe changed the title from 'Use different metric counts for out-of-order vs duplicate data points' to 'cleanup carbon metrics for out-of-order vs duplicate data points, cleaner names in sync with prom metrics' on Apr 23, 2019
Dieterbe merged commit 4c862d8 into master on Apr 23, 2019
Dieterbe deleted the differentiate_duplicate_from_too_old branch on April 23, 2019 20:51