Ingester: split push and read circuit breakers #8315

Merged on Jun 11, 2024 (12 commits)
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -25,7 +25,7 @@
* [FEATURE] mimirtool: Add `runtime-config verify` sub-command, for verifying Mimir runtime config files. #8123
* [FEATURE] Query-frontend, querier: new experimental `/cardinality/active_native_histogram_metrics` API to get active native histogram metric names with statistics about active native histogram buckets. #7982 #7986 #8008
* [FEATURE] Alertmanager: Added `-alertmanager.max-silences-count` and `-alertmanager.max-silence-size-bytes` to set limits on per tenant silences. Disabled by default. #6898
* [FEATURE] Ingester: add experimental support for the server-side circuit breakers when writing to and reading from ingesters. This can be enabled using `-ingester.circuit-breaker.enabled` option. Further `-ingester.circuit-breaker.*` options for configuring circuit-breaker are available. Added metrics `cortex_ingester_circuit_breaker_results_total`, `cortex_ingester_circuit_breaker_transitions_total` and `cortex_ingester_circuit_breaker_current_state`. #8180 #8285
* [FEATURE] Ingester: add experimental support for the server-side circuit breakers when writing to and reading from ingesters. This can be enabled using `-ingester.push-circuit-breaker.enabled` and `-ingester.read-circuit-breaker.enabled` options. Further `-ingester.push-circuit-breaker.*` and `-ingester.read-circuit-breaker.*` options for configuring circuit-breaker are available. Added metrics `cortex_ingester_circuit_breaker_results_total`, `cortex_ingester_circuit_breaker_transitions_total` and `cortex_ingester_circuit_breaker_current_state`. #8180 #8285 #8315
* [FEATURE] Distributor, ingester: add new setting `-validation.past-grace-period` to limit how old (based on the wall clock minus OOO window) the ingested samples can be. The default 0 value disables this limit. #8262
* [ENHANCEMENT] Distributor: add metrics `cortex_distributor_samples_per_request` and `cortex_distributor_exemplars_per_request` to track samples/exemplars per request. #8265
* [ENHANCEMENT] Reduced memory allocations in functions used to propagate contextual information between gRPC calls. #7529
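
For anyone updating an existing deployment, the rename described in the Ingester circuit-breaker entry above can be summarized as a mapping from old flag names to new ones. The sketch below is illustrative only: `renameFlag` is a hypothetical helper, not part of Mimir or mimirtool, and it assumes that every shared `-ingester.circuit-breaker.*` option now exists once per path (push and read), with the two timeout flags becoming `request-timeout` under their respective prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// oldPrefix is the shared flag namespace that this PR removes.
const oldPrefix = "-ingester.circuit-breaker."

// renameFlag returns the replacement flag name(s) for an old flag name.
// Hypothetical helper for illustration only; it is not part of Mimir.
func renameFlag(old string) []string {
	if !strings.HasPrefix(old, oldPrefix) {
		return []string{old} // not affected by the split
	}
	suffix := strings.TrimPrefix(old, oldPrefix)
	switch suffix {
	case "push-timeout":
		return []string{"-ingester.push-circuit-breaker.request-timeout"}
	case "read-timeout":
		return []string{"-ingester.read-circuit-breaker.request-timeout"}
	default:
		// Assumption: every other option now exists once per path.
		return []string{
			"-ingester.push-circuit-breaker." + suffix,
			"-ingester.read-circuit-breaker." + suffix,
		}
	}
}

func main() {
	for _, old := range []string{
		"-ingester.circuit-breaker.enabled",
		"-ingester.circuit-breaker.push-timeout",
		"-ingester.circuit-breaker.read-timeout",
	} {
		fmt.Printf("%s => %s\n", old, strings.Join(renameFlag(old), ", "))
	}
}
```

The helper handles flag names only; values in the `-flag=value` form would need to be split off before the lookup.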
106 changes: 91 additions & 15 deletions cmd/mimir/config-descriptor.json
@@ -3143,7 +3143,7 @@
},
{
"kind": "block",
"name": "circuit_breaker",
"name": "push_circuit_breaker",
"required": false,
"desc": "",
"blockEntries": [
@@ -3154,7 +3154,7 @@
"desc": "Enable circuit breaking when making requests to ingesters",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "ingester.circuit-breaker.enabled",
"fieldFlag": "ingester.push-circuit-breaker.enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
@@ -3165,7 +3165,7 @@
"desc": "Max percentage of requests that can fail over period before the circuit breaker opens",
"fieldValue": null,
"fieldDefaultValue": 10,
"fieldFlag": "ingester.circuit-breaker.failure-threshold-percentage",
"fieldFlag": "ingester.push-circuit-breaker.failure-threshold-percentage",
"fieldType": "int",
"fieldCategory": "experimental"
},
@@ -3176,7 +3176,7 @@
"desc": "How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures",
"fieldValue": null,
"fieldDefaultValue": 100,
"fieldFlag": "ingester.circuit-breaker.failure-execution-threshold",
"fieldFlag": "ingester.push-circuit-breaker.failure-execution-threshold",
"fieldType": "int",
"fieldCategory": "experimental"
},
@@ -3187,7 +3187,7 @@
"desc": "Moving window of time that the percentage of failed requests is computed over",
"fieldValue": null,
"fieldDefaultValue": 60000000000,
"fieldFlag": "ingester.circuit-breaker.thresholding-period",
"fieldFlag": "ingester.push-circuit-breaker.thresholding-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
@@ -3198,7 +3198,7 @@
"desc": "How long the circuit breaker will stay in the open state before allowing some requests",
"fieldValue": null,
"fieldDefaultValue": 10000000000,
"fieldFlag": "ingester.circuit-breaker.cooldown-period",
"fieldFlag": "ingester.push-circuit-breaker.cooldown-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
@@ -3209,31 +3209,107 @@
"desc": "How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.circuit-breaker.initial-delay",
"fieldFlag": "ingester.push-circuit-breaker.initial-delay",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "push_timeout",
"name": "request_timeout",
"required": false,
"desc": "The maximum length of time an ingester's Push request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"desc": "The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"fieldValue": null,
"fieldDefaultValue": 2000000000,
"fieldFlag": "ingester.circuit-breaker.push-timeout",
"fieldFlag": "ingester.push-circuit-breaker.request-timeout",
"fieldType": "duration",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
"fieldDefaultValue": null
},
{
"kind": "block",
"name": "read_circuit_breaker",
"required": false,
"desc": "",
"blockEntries": [
{
"kind": "field",
"name": "enabled",
"required": false,
"desc": "Enable circuit breaking when making requests to ingesters",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "ingester.read-circuit-breaker.enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_threshold_percentage",
"required": false,
"desc": "Max percentage of requests that can fail over period before the circuit breaker opens",
"fieldValue": null,
"fieldDefaultValue": 10,
"fieldFlag": "ingester.read-circuit-breaker.failure-threshold-percentage",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_execution_threshold",
"required": false,
"desc": "How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures",
"fieldValue": null,
"fieldDefaultValue": 100,
"fieldFlag": "ingester.read-circuit-breaker.failure-execution-threshold",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "thresholding_period",
"required": false,
"desc": "Moving window of time that the percentage of failed requests is computed over",
"fieldValue": null,
"fieldDefaultValue": 60000000000,
"fieldFlag": "ingester.read-circuit-breaker.thresholding-period",
"fieldType": "duration",
"fieldCategory": "experiment"
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "read_timeout",
"name": "cooldown_period",
"required": false,
"desc": "The maximum length of time an ingester's read-path request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"desc": "How long the circuit breaker will stay in the open state before allowing some requests",
"fieldValue": null,
"fieldDefaultValue": 10000000000,
"fieldFlag": "ingester.read-circuit-breaker.cooldown-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "initial_delay",
"required": false,
"desc": "How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.read-circuit-breaker.initial-delay",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "request_timeout",
"required": false,
"desc": "The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"fieldValue": null,
"fieldDefaultValue": 30000000000,
"fieldFlag": "ingester.circuit-breaker.read-timeout",
"fieldFlag": "ingester.read-circuit-breaker.request-timeout",
"fieldType": "duration",
"fieldCategory": "experiment"
"fieldCategory": "experimental"
}
],
"fieldValue": null,
Expand Down
44 changes: 28 additions & 16 deletions cmd/mimir/help-all.txt.tmpl
@@ -1307,22 +1307,6 @@ Usage of ./cmd/mimir/mimir:
After what time a series is considered to be inactive. (default 10m0s)
-ingester.active-series-metrics-update-period duration
How often to update active series metrics. (default 1m0s)
-ingester.circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.circuit-breaker.push-timeout duration
The maximum length of time an ingester's Push request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 2s)
-ingester.circuit-breaker.read-timeout duration
The maximum length of time an ingester's read-path request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 30s)
-ingester.circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.client.backoff-max-period duration
Maximum delay when backing off. (default 10s)
-ingester.client.backoff-min-period duration
@@ -1417,8 +1401,36 @@ Usage of ./cmd/mimir/mimir:
[experimental] Non-zero value enables out-of-order support for most recent samples that are within the time window in relation to the TSDB's maximum time, i.e., within [db.maxTime-timeWindow, db.maxTime]). The ingester will need more memory as a factor of rate of out-of-order samples being ingested and the number of series that are getting out-of-order samples. If query falls into this window, cached results will use value from -query-frontend.results-cache-ttl-for-out-of-order-time-window option to specify TTL for resulting cache entry.
-ingester.owned-series-update-interval duration
[experimental] How often to check for ring changes and possibly recompute owned series as a result of detected change. (default 15s)
-ingester.push-circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.push-circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.push-circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.push-circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.push-circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.push-circuit-breaker.request-timeout duration
[experimental] The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 2s)
-ingester.push-circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.rate-update-period duration
Period with which to update the per-tenant ingestion rates. (default 15s)
-ingester.read-circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.read-circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.read-circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.read-circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.read-circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.read-circuit-breaker.request-timeout duration
[experimental] The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 30s)
-ingester.read-circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.read-path-cpu-utilization-limit float
[experimental] CPU utilization limit, as CPU cores, for CPU/memory utilization based read request limiting. Use 0 to disable it.
-ingester.read-path-memory-utilization-limit uint
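
The help text above describes how `failure-threshold-percentage` and `failure-execution-threshold` interact within the `thresholding-period`. A minimal sketch of that decision follows. It is a plain-Go illustration of the described semantics, not the implementation used by Mimir; `windowStats` and `shouldOpen` are hypothetical names, and treating the threshold as inclusive (the breaker opens once the failure percentage reaches the configured value) is an assumption.

```go
package main

import "fmt"

// windowStats holds request counts observed within one thresholding period.
// Hypothetical type for illustration.
type windowStats struct {
	executions uint
	failures   uint
}

// shouldOpen reports whether the breaker would open for the given window.
// It requires at least failureExecutionThreshold executions before the
// failure percentage is considered at all.
func shouldOpen(s windowStats, failureThresholdPercentage, failureExecutionThreshold uint) bool {
	if s.executions == 0 || s.executions < failureExecutionThreshold {
		return false // too few requests to judge the failure rate
	}
	// Integer form of failures/executions >= percentage/100.
	// Assumption: the comparison is inclusive.
	return s.failures*100 >= failureThresholdPercentage*s.executions
}

func main() {
	// Defaults from the help text: 10 percent, 100 executions.
	fmt.Println(shouldOpen(windowStats{executions: 50, failures: 50}, 10, 100))  // false
	fmt.Println(shouldOpen(windowStats{executions: 200, failures: 19}, 10, 100)) // false
	fmt.Println(shouldOpen(windowStats{executions: 200, failures: 20}, 10, 100)) // true
}
```

The first call shows why the execution threshold exists: with only 50 requests in the window, even a 100% failure rate does not open the breaker.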
4 changes: 0 additions & 4 deletions cmd/mimir/help.txt.tmpl
@@ -389,10 +389,6 @@ Usage of ./cmd/mimir/mimir:
Print basic help.
-help-all
Print help, also including advanced and experimental parameters.
-ingester.circuit-breaker.push-timeout duration
The maximum length of time an ingester's Push request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 2s)
-ingester.circuit-breaker.read-timeout duration
The maximum length of time an ingester's read-path request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 30s)
-ingester.max-global-metadata-per-metric int
The maximum number of metadata per metric, across the cluster. 0 to disable.
-ingester.max-global-metadata-per-user int
22 changes: 14 additions & 8 deletions docs/sources/mimir/configure/about-versioning.md
@@ -118,14 +118,20 @@ The following features are currently experimental:
- `-ingester.use-ingester-owned-series-for-limits`
- `-ingester.owned-series-update-interval`
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.circuit-breaker.enabled`
- `-ingester.circuit-breaker.failure-threshold-percentage`
- `-ingester.circuit-breaker.failure-execution-threshold`
- `-ingester.circuit-breaker.thresholding-period`
- `-ingester.circuit-breaker.cooldown-period`
- `-ingester.circuit-breaker.initial-delay`
- `-ingester.circuit-breaker.push-timeout`
- `-ingester.circuit-breaker.read-timeout`
- `-ingester.push-circuit-breaker.enabled`
- `-ingester.push-circuit-breaker.failure-threshold-percentage`
- `-ingester.push-circuit-breaker.failure-execution-threshold`
- `-ingester.push-circuit-breaker.thresholding-period`
- `-ingester.push-circuit-breaker.cooldown-period`
- `-ingester.push-circuit-breaker.initial-delay`
- `-ingester.push-circuit-breaker.request-timeout`
- `-ingester.read-circuit-breaker.enabled`
- `-ingester.read-circuit-breaker.failure-threshold-percentage`
- `-ingester.read-circuit-breaker.failure-execution-threshold`
- `-ingester.read-circuit-breaker.thresholding-period`
- `-ingester.read-circuit-breaker.cooldown-period`
- `-ingester.read-circuit-breaker.initial-delay`
- `-ingester.read-circuit-breaker.request-timeout`
- Ingester client
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.client.circuit-breaker.enabled`