Prevent ingesting samples older than past_grace_period #8262

Merged
merged 15 commits into from
Jun 6, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
* [FEATURE] Query-frontend, querier: new experimental `/cardinality/active_native_histogram_metrics` API to get active native histogram metric names with statistics about active native histogram buckets. #7982 #7986 #8008
* [FEATURE] Alertmanager: Added `-alertmanager.max-silences-count` and `-alertmanager.max-silence-size-bytes` to set limits on per tenant silences. Disabled by default. #6898
* [FEATURE] Ingester: add experimental support for the server-side circuit breakers when writing to ingesters. This can be enabled using `-ingester.circuit-breaker.enabled` option. Further `-ingester.circuit-breaker.*` options for configuring circuit-breaker are available. Added metrics `cortex_ingester_circuit_breaker_results_total`, `cortex_ingester_circuit_breaker_transitions_total` and `cortex_ingester_circuit_breaker_current_state`. #8180
* [FEATURE] Distributor, ingester: add new setting `-validation.past-grace-period` to limit how old (based on the wall clock minus OOO window) the ingested samples can be. The default 0 value disables this limit. #8262
* [ENHANCEMENT] Reduced memory allocations in functions used to propagate contextual information between gRPC calls. #7529
* [ENHANCEMENT] Distributor: add experimental limit for exemplars per series per request, enabled with `-distributor.max-exemplars-per-series-per-request`, the number of discarded exemplars are tracked with `cortex_discarded_exemplars_total{reason="too_many_exemplars_per_series_per_request"}` #7989 #8010
* [ENHANCEMENT] Store-gateway: merge series from different blocks concurrently. #7456
Expand Down
13 changes: 12 additions & 1 deletion cmd/mimir/config-descriptor.json
Original file line number Diff line number Diff line change
Expand Up @@ -3436,13 +3436,24 @@
"kind": "field",
"name": "creation_grace_period",
"required": false,
"desc": "Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + grace_period)'. This configuration is enforced in the distributor and ingester.",
"desc": "Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester.",
"fieldValue": null,
"fieldDefaultValue": 600000000000,
"fieldFlag": "validation.create-grace-period",
"fieldType": "duration",
"fieldCategory": "advanced"
},
{
"kind": "field",
"name": "past_grace_period",
"required": false,
"desc": "Controls how far into the past incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is lower than '(now - OOO window - past_grace_period)'. This configuration is enforced in the distributor and ingester. 0 to disable.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "validation.past-grace-period",
"fieldType": "duration",
"fieldCategory": "advanced"
},
{
"kind": "field",
"name": "enforce_metadata_metric_name",
Expand Down
4 changes: 3 additions & 1 deletion cmd/mimir/help-all.txt.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -2876,7 +2876,7 @@ Usage of ./cmd/mimir/mimir:
-usage-stats.installation-mode string
Installation mode. Supported values: custom, helm, jsonnet. (default "custom")
-validation.create-grace-period duration
Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
-validation.enforce-metadata-metric-name
Enforce every metadata has a metric name. (default true)
-validation.max-label-names-per-series int
Expand All @@ -2889,6 +2889,8 @@ Usage of ./cmd/mimir/mimir:
Maximum length accepted for metric metadata. Metadata refers to Metric Name, HELP and UNIT. Longer metadata is dropped except for HELP which is truncated. (default 1024)
-validation.max-native-histogram-buckets int
Maximum number of buckets per native histogram sample. 0 to disable the limit.
-validation.past-grace-period duration
Controls how far into the past incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is lower than '(now - OOO window - past_grace_period)'. This configuration is enforced in the distributor and ingester. 0 to disable.
-validation.reduce-native-histogram-over-max-buckets
Whether to reduce or reject native histogram samples with more buckets than the configured limit. (default true)
-validation.separate-metrics-group-label string
Expand Down
11 changes: 9 additions & 2 deletions docs/sources/mimir/configure/configuration-parameters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -3073,11 +3073,18 @@ The `limits` block configures default and per-tenant limits imposed by component

# (advanced) Controls how far into the future incoming samples and exemplars are
# accepted compared to the wall clock. Any sample or exemplar will be rejected
# if its timestamp is greater than '(now + grace_period)'. This configuration is
# enforced in the distributor and ingester.
# if its timestamp is greater than '(now + creation_grace_period)'. This
# configuration is enforced in the distributor and ingester.
# CLI flag: -validation.create-grace-period
[creation_grace_period: <duration> | default = 10m]

# (advanced) Controls how far into the past incoming samples and exemplars are
# accepted compared to the wall clock. Any sample or exemplar will be rejected
# if its timestamp is lower than '(now - OOO window - past_grace_period)'. This
# configuration is enforced in the distributor and ingester. 0 to disable.
# CLI flag: -validation.past-grace-period
[past_grace_period: <duration> | default = 0s]

# (advanced) Enforce every metadata has a metric name.
# CLI flag: -validation.enforce-metadata-metric-name
[enforce_metadata_metric_name: <boolean> | default = true]
Expand Down
25 changes: 25 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -1698,6 +1698,31 @@ On a per-tenant basis, you can fine tune the tolerance by configuring the `creat
Only series with invalid samples are skipped during the ingestion. Valid samples within the same request are still ingested.
{{< /admonition >}}

### err-mimir-too-far-in-past

This non-critical error occurs when Mimir rejects a sample because its timestamp is too far in the past compared to the wall clock.

How it **works**:

- The distributor or the ingester enforces a lower limit on the timestamp of incoming samples. This limit protects the system from potential abuse or mistakes.
- The lower limit is defined by the current wall clock minus the `out_of_order_time_window` and minus the `past_grace_period` settings.
- The samples that are too far in the past are not ingested.

How to **fix** it:

- Check whether the incoming samples are really expected to have timestamps that old.
- If the timestamps are correct, increase the `past_grace_period` setting, or set it to 0 to disable the limit.

{{< admonition type="note" >}}
Only the invalid samples are skipped during the ingestion. Valid samples within the same request are still ingested.
{{< /admonition >}}

### err-mimir-exemplar-too-far-in-past

This non-critical error occurs when Mimir rejects an exemplar because its timestamp is too far in the past compared to the wall clock.

See [`err-mimir-too-far-in-past`](#err-mimir-too-far-in-past) for more details and how to fix it.

### err-mimir-exemplar-labels-missing

This non-critical error occurs when Mimir receives a write request that contains an exemplar without a label that identifies the related metric.
Expand Down
4 changes: 4 additions & 0 deletions pkg/distributor/distributor.go
Original file line number Diff line number Diff line change
Expand Up @@ -1020,6 +1020,10 @@ func (d *Distributor) prePushValidationMiddleware(next PushFunc) PushFunc {
var minExemplarTS int64
if earliestSampleTimestampMs != math.MaxInt64 {
minExemplarTS = earliestSampleTimestampMs - 5*time.Minute.Milliseconds()

if d.limits.PastGracePeriod(userID) > 0 {
minExemplarTS = max(minExemplarTS, now.Add(-d.limits.PastGracePeriod(userID)).Add(-d.limits.OutOfOrderTimeWindow(userID)).UnixMilli())
}
}

// Enforce the creation grace period on exemplars too.
Expand Down
16 changes: 16 additions & 0 deletions pkg/distributor/distributor_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -6762,6 +6762,7 @@ func TestDistributorValidation(t *testing.T) {
labels [][]mimirpb.LabelAdapter
samples []mimirpb.Sample
exemplars []*mimirpb.Exemplar
limits func(limits *validation.Limits)
expectedErr *status.Status
}{
"validation passes": {
Expand Down Expand Up @@ -6799,6 +6800,18 @@ func TestDistributorValidation(t *testing.T) {
expectedErr: status.New(codes.FailedPrecondition, fmt.Sprintf(sampleTimestampTooNewMsgFormat, future, "testmetric")),
},

"validation does not fail for samples from the past without past_grace_period setting": {
labels: [][]mimirpb.LabelAdapter{{{Name: "foo", Value: "bar"}, {Name: labels.MetricName, Value: "testmetric"}}},
samples: []mimirpb.Sample{{TimestampMs: int64(past), Value: 1}},
},

"validation fails for samples from the past": {
labels: [][]mimirpb.LabelAdapter{{{Name: labels.MetricName, Value: "testmetric"}, {Name: "foo", Value: "bar"}}},
samples: []mimirpb.Sample{{TimestampMs: int64(past), Value: 4}},
limits: func(limits *validation.Limits) { limits.PastGracePeriod = model.Duration(now.Sub(past) / 2) },
expectedErr: status.New(codes.FailedPrecondition, fmt.Sprintf(sampleTimestampTooOldMsgFormat, past, "testmetric")),
},

"exceeds maximum labels per series": {
labels: [][]mimirpb.LabelAdapter{{{Name: labels.MetricName, Value: "testmetric"}, {Name: "foo", Value: "bar"}, {Name: "foo2", Value: "bar2"}}},
samples: []mimirpb.Sample{{
Expand Down Expand Up @@ -6875,6 +6888,9 @@ func TestDistributorValidation(t *testing.T) {
limits.CreationGracePeriod = model.Duration(2 * time.Hour)
limits.MaxLabelNamesPerSeries = 2
limits.MaxGlobalExemplarsPerUser = 10
if tc.limits != nil {
tc.limits(&limits)
}

ds, _, _, _ := prepare(t, prepConfig{
numIngesters: 3,
Expand Down
23 changes: 23 additions & 0 deletions pkg/distributor/validate.go
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ var (
reasonInvalidNativeHistogramSchema = globalerror.InvalidSchemaNativeHistogram.LabelValue()
reasonDuplicateLabelNames = globalerror.SeriesWithDuplicateLabelNames.LabelValue()
reasonTooFarInFuture = globalerror.SampleTooFarInFuture.LabelValue()
reasonTooFarInPast = globalerror.SampleTooFarInPast.LabelValue()

// Discarded exemplars reasons.
reasonExemplarLabelsMissing = globalerror.ExemplarLabelsMissing.LabelValue()
Expand Down Expand Up @@ -84,6 +85,10 @@ var (
"received a sample whose timestamp is too far in the future, timestamp: %d series: '%.200s'",
validation.CreationGracePeriodFlag,
)
sampleTimestampTooOldMsgFormat = globalerror.SampleTooFarInPast.MessageWithPerTenantLimitConfig(
"received a sample whose timestamp is too far in the past, timestamp: %d series: '%.200s'",
validation.PastGracePeriodFlag,
)
exemplarEmptyLabelsMsgFormat = globalerror.ExemplarLabelsMissing.Message(
"received an exemplar with no valid labels, timestamp: %d series: %s labels: %s",
)
Expand All @@ -108,8 +113,10 @@ var (
// sampleValidationConfig helps with getting required config to validate sample.
type sampleValidationConfig interface {
CreationGracePeriod(userID string) time.Duration
PastGracePeriod(userID string) time.Duration
MaxNativeHistogramBuckets(userID string) int
ReduceNativeHistogramOverMaxBuckets(userID string) bool
OutOfOrderTimeWindow(userID string) time.Duration
}

// sampleValidationMetrics is a collection of metrics used during sample validation.
Expand All @@ -124,6 +131,7 @@ type sampleValidationMetrics struct {
invalidNativeHistogramSchema *prometheus.CounterVec
duplicateLabelNames *prometheus.CounterVec
tooFarInFuture *prometheus.CounterVec
tooFarInPast *prometheus.CounterVec
}

func (m *sampleValidationMetrics) deleteUserMetrics(userID string) {
Expand All @@ -138,6 +146,7 @@ func (m *sampleValidationMetrics) deleteUserMetrics(userID string) {
m.invalidNativeHistogramSchema.DeletePartialMatch(filter)
m.duplicateLabelNames.DeletePartialMatch(filter)
m.tooFarInFuture.DeletePartialMatch(filter)
m.tooFarInPast.DeletePartialMatch(filter)
}

func (m *sampleValidationMetrics) deleteUserMetricsForGroup(userID, group string) {
Expand All @@ -151,6 +160,7 @@ func (m *sampleValidationMetrics) deleteUserMetricsForGroup(userID, group string
m.invalidNativeHistogramSchema.DeleteLabelValues(userID, group)
m.duplicateLabelNames.DeleteLabelValues(userID, group)
m.tooFarInFuture.DeleteLabelValues(userID, group)
m.tooFarInPast.DeleteLabelValues(userID, group)
}

func newSampleValidationMetrics(r prometheus.Registerer) *sampleValidationMetrics {
Expand All @@ -165,6 +175,7 @@ func newSampleValidationMetrics(r prometheus.Registerer) *sampleValidationMetric
invalidNativeHistogramSchema: validation.DiscardedSamplesCounter(r, reasonInvalidNativeHistogramSchema),
duplicateLabelNames: validation.DiscardedSamplesCounter(r, reasonDuplicateLabelNames),
tooFarInFuture: validation.DiscardedSamplesCounter(r, reasonTooFarInFuture),
tooFarInPast: validation.DiscardedSamplesCounter(r, reasonTooFarInPast),
}
}

Expand Down Expand Up @@ -211,6 +222,12 @@ func validateSample(m *sampleValidationMetrics, now model.Time, cfg sampleValida
return fmt.Errorf(sampleTimestampTooNewMsgFormat, s.TimestampMs, unsafeMetricName)
}

if cfg.PastGracePeriod(userID) > 0 && model.Time(s.TimestampMs) < now.Add(-cfg.PastGracePeriod(userID)).Add(-cfg.OutOfOrderTimeWindow(userID)) {
m.tooFarInPast.WithLabelValues(userID, group).Inc()
unsafeMetricName, _ := extract.UnsafeMetricNameFromLabelAdapters(ls)
return fmt.Errorf(sampleTimestampTooOldMsgFormat, s.TimestampMs, unsafeMetricName)
}

return nil
}

Expand All @@ -224,6 +241,12 @@ func validateSampleHistogram(m *sampleValidationMetrics, now model.Time, cfg sam
return false, fmt.Errorf(sampleTimestampTooNewMsgFormat, s.Timestamp, unsafeMetricName)
}

if cfg.PastGracePeriod(userID) > 0 && model.Time(s.Timestamp) < now.Add(-cfg.PastGracePeriod(userID)).Add(-cfg.OutOfOrderTimeWindow(userID)) {
m.tooFarInPast.WithLabelValues(userID, group).Inc()
unsafeMetricName, _ := extract.UnsafeMetricNameFromLabelAdapters(ls)
return false, fmt.Errorf(sampleTimestampTooOldMsgFormat, s.Timestamp, unsafeMetricName)
}

if s.Schema < mimirpb.MinimumHistogramSchema || s.Schema > mimirpb.MaximumHistogramSchema {
m.invalidNativeHistogramSchema.WithLabelValues(userID, group).Inc()
return false, fmt.Errorf(invalidSchemaNativeHistogramMsgFormat, s.Schema)
Expand Down
8 changes: 8 additions & 0 deletions pkg/distributor/validate_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -398,6 +398,14 @@ func (c sampleValidationCfg) CreationGracePeriod(_ string) time.Duration {
return 0
}

func (c sampleValidationCfg) PastGracePeriod(_ string) time.Duration {
return 0
}

func (c sampleValidationCfg) OutOfOrderTimeWindow(_ string) time.Duration {
return 0
}

func (c sampleValidationCfg) MaxNativeHistogramBuckets(_ string) int {
return c.maxNativeHistogramBuckets
}
Expand Down
4 changes: 0 additions & 4 deletions pkg/frontend/querymiddleware/limits.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,10 +75,6 @@ type Limits interface {
// OutOfOrderTimeWindow returns the out-of-order time window for the user.
OutOfOrderTimeWindow(userID string) time.Duration

// CreationGracePeriod returns the time interval to control how far into the future
// incoming samples are accepted compared to the wall clock.
CreationGracePeriod(userID string) time.Duration

// NativeHistogramsIngestionEnabled returns whether to ingest native histograms in the ingester
NativeHistogramsIngestionEnabled(userID string) bool

Expand Down
3 changes: 3 additions & 0 deletions pkg/ingester/errors.go
Original file line number Diff line number Diff line change
Expand Up @@ -195,6 +195,9 @@ func newExemplarMissingSeriesError(timestamp model.Time, seriesLabels, exemplarL
func newExemplarTimestampTooFarInFutureError(timestamp model.Time, seriesLabels, exemplarLabels []mimirpb.LabelAdapter) exemplarError {
return newExemplarError(globalerror.ExemplarTooFarInFuture, "received an exemplar whose timestamp is too far in the future", timestamp, seriesLabels, exemplarLabels)
}
func newExemplarTimestampTooFarInPastError(timestamp model.Time, seriesLabels, exemplarLabels []mimirpb.LabelAdapter) exemplarError {
return newExemplarError(globalerror.ExemplarTooFarInPast, "received an exemplar whose timestamp is too far in the past", timestamp, seriesLabels, exemplarLabels)
}

// tsdbIngestExemplarErr is an ingesterError indicating a problem with an exemplar.
type tsdbIngestExemplarErr struct {
Expand Down
16 changes: 16 additions & 0 deletions pkg/ingester/ingester.go
Original file line number Diff line number Diff line change
Expand Up @@ -1384,7 +1384,11 @@ func (i *Ingester) pushSamplesToAppender(userID string, timeseries []mimirpb.Pre
var (
nativeHistogramsIngestionEnabled = i.limits.NativeHistogramsIngestionEnabled(userID)
maxTimestampMs = startAppend.Add(i.limits.CreationGracePeriod(userID)).UnixMilli()
minTimestampMs = int64(math.MinInt64)
)
if i.limits.PastGracePeriod(userID) > 0 {
minTimestampMs = startAppend.Add(-i.limits.PastGracePeriod(userID)).Add(-i.limits.OutOfOrderTimeWindow(userID)).UnixMilli()
}

var builder labels.ScratchBuilder
var nonCopiedLabels labels.Labels
Expand Down Expand Up @@ -1452,6 +1456,9 @@ func (i *Ingester) pushSamplesToAppender(userID string, timeseries []mimirpb.Pre
if s.TimestampMs > maxTimestampMs {
handleAppendError(globalerror.SampleTooFarInFuture, s.TimestampMs, ts.Labels)
continue
} else if s.TimestampMs < minTimestampMs {
handleAppendError(globalerror.SampleTooFarInPast, s.TimestampMs, ts.Labels)
continue
}

// If the cached reference exists, we try to use it.
Expand Down Expand Up @@ -1492,6 +1499,9 @@ func (i *Ingester) pushSamplesToAppender(userID string, timeseries []mimirpb.Pre
if h.Timestamp > maxTimestampMs {
handleAppendError(globalerror.SampleTooFarInFuture, h.Timestamp, ts.Labels)
continue
} else if h.Timestamp < minTimestampMs {
handleAppendError(globalerror.SampleTooFarInPast, h.Timestamp, ts.Labels)
continue
}

if h.IsFloatHistogram() {
Expand Down Expand Up @@ -1560,6 +1570,12 @@ func (i *Ingester) pushSamplesToAppender(userID string, timeseries []mimirpb.Pre
return newExemplarTimestampTooFarInFutureError(model.Time(ex.TimestampMs), ts.Labels, ex.Labels)
})
continue
} else if ex.TimestampMs < minTimestampMs {
stats.failedExemplarsCount++
updateFirstPartial(nil, func() softError {
return newExemplarTimestampTooFarInPastError(model.Time(ex.TimestampMs), ts.Labels, ex.Labels)
})
continue
}

e := exemplar.Exemplar{
Expand Down
4 changes: 4 additions & 0 deletions pkg/ingester/tenants_http.go
Original file line number Diff line number Diff line change
Expand Up @@ -82,13 +82,17 @@ func (i *Ingester) TenantsHandler(w http.ResponseWriter, req *http.Request) {
s := tenantStats{}
s.Tenant = t
s.Blocks = len(db.Blocks())
minMillis := db.Head().MinTime()
s.MinTime = formatMillisTime(db.Head().MinTime())
maxMillis := db.Head().MaxTime()
s.MaxTime = formatMillisTime(maxMillis)

if maxMillis-nowMillis > i.limits.CreationGracePeriod(t).Milliseconds() {
s.Warning = "TSDB Head max timestamp too far in the future"
}
if i.limits.PastGracePeriod(t) > 0 && nowMillis-minMillis > (i.limits.PastGracePeriod(t)+i.limits.OutOfOrderTimeWindow(t)).Milliseconds() {
s.Warning = "TSDB Head min timestamp too far in the past"
}

tss = append(tss, s)
}
Expand Down
2 changes: 2 additions & 0 deletions pkg/util/globalerror/user.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ const (
SeriesWithDuplicateLabelNames ID = "duplicate-label-names"
SeriesLabelsNotSorted ID = "labels-not-sorted"
SampleTooFarInFuture ID = "too-far-in-future"
SampleTooFarInPast ID = "too-far-in-past"
MaxSeriesPerMetric ID = "max-series-per-metric"
MaxMetadataPerMetric ID = "max-metadata-per-metric"
MaxSeriesPerUser ID = "max-series-per-user"
Expand Down Expand Up @@ -68,6 +69,7 @@ const (
SampleDuplicateTimestamp ID = "sample-duplicate-timestamp"
ExemplarSeriesMissing ID = "exemplar-series-missing"
ExemplarTooFarInFuture ID = "exemplar-too-far-in-future"
ExemplarTooFarInPast ID = "exemplar-too-far-in-past"

StoreConsistencyCheckFailed ID = "store-consistency-check-failed"
BucketIndexTooOld ID = "bucket-index-too-old"
Expand Down