Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce -validation.create-grace-period in the ingester too #5712

Merged
merged 6 commits into from
Aug 16, 2023

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Aug 10, 2023

What this PR does

In this PR I propose to enforce the configured -validation.create-grace-period in the ingester too. The reason is that we've experienced an issue where a sample with a timestamp very far in the future has reached the ingester, and so screwed up the TSDB. We're still investigating the root cause (apparently looks like caused by few bit flipped) but in the meanwhile I would like to enforce the -validation.create-grace-period in the ingester too.

Note to reviewers:

  • I removed "0 to disable" from the -validation.create-grace-period CLI flag description because it wasn't true. 0 didn't disabled it. It's always enforced.
  • I noticed that -validation.create-grace-period is not checked against exemplar timestamp in the distributor. I will do it in a follow up PR.
  • I also manually tested it, running a modified version of load-generator to generate samples in the future and commenting the sample validation code run in the distributor. The samples with timestamp in the future reached ingesters and ingesters rejected them.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@@ -907,6 +912,8 @@ func (i *Ingester) pushSamplesToAppender(userID string, timeseries []mimirpb.Pre

// Return true if handled as soft error, and we can ingest more series.
handleAppendError := func(err error, timestamp int64, labels []mimirpb.LabelAdapter) bool {
stats.failedSamplesCount++
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: I moved it here because it was always incremented right before calling handleAppendError().

@@ -101,6 +102,11 @@ func (id ID) LabelValue() string {
return strings.ReplaceAll(string(id), "-", "_")
}

// Error implements error.
func (id ID) Error() string {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: made it implement error interface so that I can use it in the ingester's handleAppendError().

@@ -199,7 +199,7 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) {
f.IntVar(&l.MaxMetadataLength, maxMetadataLengthFlag, 1024, "Maximum length accepted for metric metadata. Metadata refers to Metric Name, HELP and UNIT. Longer metadata is dropped except for HELP which is truncated.")
f.IntVar(&l.MaxNativeHistogramBuckets, maxNativeHistogramBucketsFlag, 0, "Maximum number of buckets per native histogram sample. 0 to disable the limit.")
_ = l.CreationGracePeriod.Set("10m")
f.Var(&l.CreationGracePeriod, creationGracePeriodFlag, "Controls how far into the future incoming samples are accepted compared to the wall clock. Any sample with timestamp `t` will be rejected if `t > (now + validation.create-grace-period)`. Also used by query-frontend to avoid querying too far into the future. 0 to disable.")
f.Var(&l.CreationGracePeriod, creationGracePeriodFlag, fmt.Sprintf("Controls how far into the future incoming samples are accepted compared to the wall clock. Any sample will be rejected if its timestamp is greater than '(now + -%s)'. This configuration is enforced in the distributor, ingester and query-frontend (to avoid querying too far into the future).", creationGracePeriodFlag))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: removed the backticks usage because they're not rendered as intended in our doc. The entire auto-generated doc is within triple-backtick quotes, so the single backtick is just rendered as a backtick.

@pracucci pracucci marked this pull request as ready for review August 11, 2023 08:21
@pracucci pracucci requested review from a team as code owners August 11, 2023 08:21
Copy link
Contributor

@jhesketh jhesketh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm :-)

docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good, but I wonder if this is the right approach. Bit flips may happen for other parts of the request (labels, sample value), and we don't protect against those.

I am also curious where exactly do we think that "bit flip" happened? At TCP level packets are checksummed, so I guess we can rule that out. Is it in distributor, after passing the checks? Is it in ingester after receiving the request? How can we protect against that?

pkg/util/validation/limits.go Outdated Show resolved Hide resolved
@pstibrany
Copy link
Member

One more note: query-frontend still checks if grace period is greater than zero. Shall we remove that check and always adjust "end" time?

// Enforce the max end time.
creationGracePeriod := validation.LargestPositiveNonZeroDurationPerTenant(tenantIDs, l.CreationGracePeriod)
if creationGracePeriod > 0 {
maxEndTime := util.TimeToMillis(time.Now().Add(creationGracePeriod))
if r.GetEnd() > maxEndTime {
// Replace the end time in the request.
level.Debug(log).Log(
"msg", "the end time of the query has been manipulated because of the 'creation grace period' setting",
"original", util.FormatTimeMillis(r.GetEnd()),
"updated", util.FormatTimeMillis(maxEndTime),
"creationGracePeriod", creationGracePeriod)
r = r.WithStartEnd(r.GetStart(), maxEndTime)
}
}

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
…eadUpUntilNowMinusActiveSeriesMetricsIdleTimeout

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci
Copy link
Collaborator Author

pracucci commented Aug 16, 2023

Change looks good, but I wonder if this is the right approach. Bit flips may happen for other parts of the request (labels, sample value), and we don't protect against those.

Definitely. This is just an easy (and inexpensive) validation we can do in ingesters, to avoid screwing TSDB with a sample very far in the future. It's not intended to prevent any invalid series / sample case.

I am also curious where exactly do we think that "bit flip" happened? At TCP level packets are checksummed, so I guess we can rule that out. Is it in distributor, after passing the checks? Is it in ingester after receiving the request? How can we protect against that?

Despite several engineers spent days on it, we still don't have any clue on the actual root cause, but just theories. So I can't answer these questions. The new check introduced in this PR should also help to understand if the corruption happens before or after appending samples to TSDB.

@pracucci
Copy link
Collaborator Author

One more note: query-frontend still checks if grace period is greater than zero. Shall we remove that check and always adjust "end" time?

I will double check why we did the check in the first place (I'm wondering if was done to offer a smoother config transition) and eventually open a PR.

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the enforce-creation-grace-period-in-the-ingester-too branch from 9151b3e to 4655e73 Compare August 16, 2023 08:22
@pracucci pracucci enabled auto-merge (squash) August 16, 2023 08:25
@pracucci pracucci merged commit 2fa109f into main Aug 16, 2023
30 checks passed
@pracucci pracucci deleted the enforce-creation-grace-period-in-the-ingester-too branch August 16, 2023 08:36
@pstibrany
Copy link
Member

While investigating #5642, I've realised that if corruption happens in the WAL and ingester restarts and replays the WAL, then this safety measure will not be applied. (However that doesn't seem to be what happened in #5642)

@pracucci
Copy link
Collaborator Author

One more note: query-frontend still checks if grace period is greater than zero. Shall we remove that check and always adjust "end" time?

See: #5829

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants