Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable -ingester.error-sample-rate by default and promote from experimental to advanced #7807

Merged
merged 3 commits into from
Apr 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@

* [CHANGE] Ingester: `/ingester/flush` endpoint is now only allowed to execute only while the ingester is in `Running` state. The 503 status code is returned if the endpoint is called while the ingester is not in `Running` state. #7486
* [CHANGE] Distributor: Include label name in `err-mimir-label-value-too-long` error message: #7740
* [CHANGE] Ingester: enabled 1 out 10 errors log sampling by default. All the discarded samples will still be tracked by the `cortex_discarded_samples_total` metric. The feature can be configured via `-ingester.error-sample-rate` (0 to log all errors). #7807
* [FEATURE] Continuous-test: now runable as a module with `mimir -target=continuous-test`. #7747
* [FEATURE] Store-gateway: Allow specific tenants to be enabled or disabled via `-store-gateway.enabled-tenants` or `-store-gateway.disabled-tenants` CLI flags or their corresponding YAML settings. #7653
* [FEATURE] New `-<prefix>.s3.bucket-lookup-type` flag configures lookup style type, used to access bucket in s3 compatible providers. #7684
Expand Down
4 changes: 2 additions & 2 deletions cmd/mimir/config-descriptor.json
Original file line number Diff line number Diff line change
Expand Up @@ -3037,10 +3037,10 @@
"required": false,
"desc": "Each error will be logged once in this many times. Use 0 to log all of them.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldDefaultValue": 10,
"fieldFlag": "ingester.error-sample-rate",
"fieldType": "int",
"fieldCategory": "experimental"
"fieldCategory": "advanced"
},
{
"kind": "field",
Expand Down
2 changes: 1 addition & 1 deletion cmd/mimir/help-all.txt.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -1324,7 +1324,7 @@ Usage of ./cmd/mimir/mimir:
-ingester.client.tls-server-name string
Override the expected name on the server certificate.
-ingester.error-sample-rate int
[experimental] Each error will be logged once in this many times. Use 0 to log all of them.
Each error will be logged once in this many times. Use 0 to log all of them. (default 10)
-ingester.ignore-series-limit-for-metric-names string
Comma-separated list of metric names, for which the -ingester.max-global-series-per-metric limit will be ignored. Does not affect the -ingester.max-global-series-per-user limit.
-ingester.instance-limits.max-inflight-push-requests int
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1177,10 +1177,10 @@ instance_limits:
# CLI flag: -ingester.limit-inflight-requests-using-grpc-method-limiter
[limit_inflight_requests_using_grpc_method_limiter: <boolean> | default = true]

# (experimental) Each error will be logged once in this many times. Use 0 to log
# all of them.
# (advanced) Each error will be logged once in this many times. Use 0 to log all
# of them.
# CLI flag: -ingester.error-sample-rate
[error_sample_rate: <int> | default = 0]
[error_sample_rate: <int> | default = 10]

# (deprecated) When enabled only gRPC errors will be returned by the ingester.
# CLI flag: -ingester.return-only-grpc-errors
Expand Down
52 changes: 52 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -1487,6 +1487,10 @@ sum is a regular float number.
The series containing such samples are skipped during ingestion, and valid series within the same request are ingested.
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, invalid native histogram errors are logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-native-histogram-count-not-big-enough

This non-critical error occures when Mimir receives a write request that contains a sample that is a native histogram
Expand All @@ -1497,6 +1501,10 @@ that the overall sum is not a float number (NaN).
The series containing such samples are skipped during ingestion, and valid series within the same request are ingested.
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, invalid native histogram errors are logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-native-histogram-negative-bucket-count

This non-critical error occures when Mimir receives a write request that contains a sample that is a native histogram
Expand All @@ -1506,6 +1514,10 @@ where some bucket count is negative.
The series containing such samples are skipped during ingestion, and valid series within the same request are ingested.
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, invalid native histogram errors are logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-native-histogram-span-negative-offset

This non-critical error occures when Mimir receives a write request that contains a sample that is a native histogram
Expand All @@ -1515,6 +1527,10 @@ where a bucket span has a negative offset.
The series containing such samples are skipped during ingestion, and valid series within the same request are ingested.
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, invalid native histogram errors are logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-native-histogram-spans-buckets-mismatch

This non-critical error occures when Mimir receives a write request that contains a sample that is a native histogram
Expand All @@ -1524,6 +1540,10 @@ where the number of bucket counts does not agree with the number of buckets enco
The series containing such samples are skipped during ingestion, and valid series within the same request are ingested.
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, invalid native histogram errors are logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-label-invalid

This non-critical error occurs when Mimir receives a write request that contains a series with an invalid label name.
Expand Down Expand Up @@ -1580,6 +1600,10 @@ On a per-tenant basis, you can fine tune the tolerance by configuring the `creat
Only series with invalid samples are skipped during the ingestion. Valid samples within the same request are still ingested.
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-exemplar-too-far-in-future

This non-critical error occurs when Mimir receives a write request that contains an exemplar whose timestamp is in the future compared to the current "real world" time.
Expand Down Expand Up @@ -1770,6 +1794,10 @@ How to **fix** it:
- Ensure the actual number of series written by the affected tenant is legit.
- Consider increasing the per-tenant limit by using the `-ingester.max-global-series-per-user` option (or `max_global_series_per_user` in the runtime configuration).

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-max-series-per-metric

This error occurs when the number of in-memory series for a given tenant and metric name exceeds the configured limit.
Expand All @@ -1787,6 +1815,10 @@ How to **fix** it:
- Consider increasing the per-tenant limit by using the `-ingester.max-global-series-per-metric` option.
- Consider excluding specific metric names from this limit's check by using the `-ingester.ignore-series-limit-for-metric-names` option (or `max_global_series_per_metric` in the runtime configuration).

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-max-metadata-per-user

This non-critical error occurs when the number of in-memory metrics with metadata for a given tenant exceeds the configured limit.
Expand All @@ -1804,6 +1836,10 @@ How to **fix** it:
- Check the current number of metric names for the affected tenant, running the instant query `count(count by(__name__) ({__name__=~".+"}))`. Alternatively, you can get the cardinality of `__name__` label calling the API endpoint `/api/v1/cardinality/label_names`.
- Consider increasing the per-tenant limit setting to a value greater than the number of unique metric names returned by the previous query.

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-max-metadata-per-metric

This non-critical error occurs when the number of different metadata for a given metric name exceeds the configured limit.
Expand All @@ -1822,6 +1858,10 @@ How to **fix** it:
- If the different metadata is unexpected, consider fixing the discrepancy in the instrumented applications.
- If the different metadata is expected, consider increasing the per-tenant limit by using the `-ingester.max-global-series-per-metric` option (or `max_global_metadata_per_metric` in the runtime configuration).

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-max-chunks-per-query

This error occurs when execution of a query exceeds the limit on the number of series chunks fetched.
Expand Down Expand Up @@ -1962,6 +2002,10 @@ How it **works**:
If the out-of-order sample ingestion is enabled, then this error is similar to `err-mimir-sample-out-of-order` below with a difference that the sample is older than the out-of-order time window as it relates to the latest sample for that particular time series or the TSDB.
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-sample-out-of-order

This error occurs when the ingester rejects a sample because another sample with a more recent timestamp has already been ingested.
Expand All @@ -1983,6 +2027,10 @@ Common **causes**:
You can learn more about out of order samples in Prometheus, in the blog post [Debugging out of order samples](https://www.robustperception.io/debugging-out-of-order-samples/).
{{< /admonition >}}

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-sample-duplicate-timestamp

This error occurs when the ingester rejects a sample because it is a duplicate of a previously received sample with the same timestamp but different value in the same time series.
Expand All @@ -1992,6 +2040,10 @@ Common **causes**:
- Multiple endpoints are exporting the same metrics, or multiple Prometheus instances are scraping different metrics with identical labels.
- Prometheus relabelling has been configured and it causes series to clash after the relabelling. Check the error message for information about which series has received a duplicate sample.

{{< admonition type="note" >}}
When `-ingester.error-sample-rate` is configured to a value greater than `0`, this error is logged only once every `-ingester.error-sample-rate` times.
{{< /admonition >}}

### err-mimir-exemplar-series-missing

This error occurs when the ingester rejects an exemplar because its related series has not been ingested yet.
Expand Down
4 changes: 2 additions & 2 deletions pkg/ingester/ingester.go
Original file line number Diff line number Diff line change
Expand Up @@ -200,7 +200,7 @@ type Config struct {

LimitInflightRequestsUsingGrpcMethodLimiter bool `yaml:"limit_inflight_requests_using_grpc_method_limiter" category:"deprecated"` // TODO Remove the configuration option in Mimir 2.14, keeping the same behavior as if it's enabled.

ErrorSampleRate int64 `yaml:"error_sample_rate" json:"error_sample_rate" category:"experimental"`
ErrorSampleRate int64 `yaml:"error_sample_rate" json:"error_sample_rate" category:"advanced"`

DeprecatedReturnOnlyGRPCErrors bool `yaml:"return_only_grpc_errors" json:"return_only_grpc_errors" category:"deprecated"`

Expand Down Expand Up @@ -233,7 +233,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet, logger log.Logger) {
f.Uint64Var(&cfg.ReadPathMemoryUtilizationLimit, "ingester.read-path-memory-utilization-limit", 0, "Memory limit, in bytes, for CPU/memory utilization based read request limiting. Use 0 to disable it.")
f.BoolVar(&cfg.LogUtilizationBasedLimiterCPUSamples, "ingester.log-utilization-based-limiter-cpu-samples", false, "Enable logging of utilization based limiter CPU samples.")
f.BoolVar(&cfg.LimitInflightRequestsUsingGrpcMethodLimiter, "ingester.limit-inflight-requests-using-grpc-method-limiter", true, "When enabled, in-flight write requests limit is checked as soon as the gRPC request is received, before the request is decoded and parsed.")
f.Int64Var(&cfg.ErrorSampleRate, "ingester.error-sample-rate", 0, "Each error will be logged once in this many times. Use 0 to log all of them.")
f.Int64Var(&cfg.ErrorSampleRate, "ingester.error-sample-rate", 10, "Each error will be logged once in this many times. Use 0 to log all of them.")
f.BoolVar(&cfg.UseIngesterOwnedSeriesForLimits, "ingester.use-ingester-owned-series-for-limits", false, "When enabled, only series currently owned by ingester according to the ring are used when checking user per-tenant series limit.")
f.BoolVar(&cfg.UpdateIngesterOwnedSeries, "ingester.track-ingester-owned-series", false, "This option enables tracking of ingester-owned series based on ring state, even if -ingester.use-ingester-owned-series-for-limits is disabled.")
f.DurationVar(&cfg.OwnedSeriesUpdateInterval, "ingester.owned-series-update-interval", 15*time.Second, "How often to check for ring changes and possibly recompute owned series as a result of detected change.")
Expand Down
Loading