Ingester: split push and read circuit breakers (#8315)
* Ingester: splitting push and read circuit breakers

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Improving TestIngester_StartReadRequest

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Updating documentation

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing lint issues

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing documentation issues

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Do not call tryAcquirePermit() on the push cb from tryReadAcquirePermit()

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Rename prCircuitBreaker into ingesterCircuitBreaker

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Rename label name path to request_type and label value write to push

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Do not allow acquiring permit if cbs are not active

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

---------

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
(cherry picked from commit ee069d2)
duricanikolic authored and grafanabot committed Jun 12, 2024
1 parent 6dc5a84 commit da18517
Showing 12 changed files with 1,420 additions and 386 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -25,7 +25,7 @@
* [FEATURE] mimirtool: Add `runtime-config verify` sub-command, for verifying Mimir runtime config files. #8123
* [FEATURE] Query-frontend, querier: new experimental `/cardinality/active_native_histogram_metrics` API to get active native histogram metric names with statistics about active native histogram buckets. #7982 #7986 #8008
* [FEATURE] Alertmanager: Added `-alertmanager.max-silences-count` and `-alertmanager.max-silence-size-bytes` to set limits on per tenant silences. Disabled by default. #6898
* [FEATURE] Ingester: add experimental support for the server-side circuit breakers when writing to and reading from ingesters. This can be enabled using `-ingester.circuit-breaker.enabled` option. Further `-ingester.circuit-breaker.*` options for configuring circuit-breaker are available. Added metrics `cortex_ingester_circuit_breaker_results_total`, `cortex_ingester_circuit_breaker_transitions_total` and `cortex_ingester_circuit_breaker_current_state`. #8180 #8285
* [FEATURE] Ingester: add experimental support for the server-side circuit breakers when writing to and reading from ingesters. This can be enabled using `-ingester.push-circuit-breaker.enabled` and `-ingester.read-circuit-breaker.enabled` options. Further `-ingester.push-circuit-breaker.*` and `-ingester.read-circuit-breaker.*` options for configuring circuit-breaker are available. Added metrics `cortex_ingester_circuit_breaker_results_total`, `cortex_ingester_circuit_breaker_transitions_total` and `cortex_ingester_circuit_breaker_current_state`. #8180 #8285 #8315
* [FEATURE] Distributor, ingester: add new setting `-validation.past-grace-period` to limit how old (based on the wall clock minus OOO window) the ingested samples can be. The default 0 value disables this limit. #8262
* [ENHANCEMENT] Distributor: add metrics `cortex_distributor_samples_per_request` and `cortex_distributor_exemplars_per_request` to track samples/exemplars per request. #8265
* [ENHANCEMENT] Reduced memory allocations in functions used to propagate contextual information between gRPC calls. #7529
106 changes: 91 additions & 15 deletions cmd/mimir/config-descriptor.json
@@ -3143,7 +3143,7 @@
},
{
"kind": "block",
"name": "circuit_breaker",
"name": "push_circuit_breaker",
"required": false,
"desc": "",
"blockEntries": [
@@ -3154,7 +3154,7 @@
"desc": "Enable circuit breaking when making requests to ingesters",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "ingester.circuit-breaker.enabled",
"fieldFlag": "ingester.push-circuit-breaker.enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
@@ -3165,7 +3165,7 @@
"desc": "Max percentage of requests that can fail over period before the circuit breaker opens",
"fieldValue": null,
"fieldDefaultValue": 10,
"fieldFlag": "ingester.circuit-breaker.failure-threshold-percentage",
"fieldFlag": "ingester.push-circuit-breaker.failure-threshold-percentage",
"fieldType": "int",
"fieldCategory": "experimental"
},
@@ -3176,7 +3176,7 @@
"desc": "How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures",
"fieldValue": null,
"fieldDefaultValue": 100,
"fieldFlag": "ingester.circuit-breaker.failure-execution-threshold",
"fieldFlag": "ingester.push-circuit-breaker.failure-execution-threshold",
"fieldType": "int",
"fieldCategory": "experimental"
},
@@ -3187,7 +3187,7 @@
"desc": "Moving window of time that the percentage of failed requests is computed over",
"fieldValue": null,
"fieldDefaultValue": 60000000000,
"fieldFlag": "ingester.circuit-breaker.thresholding-period",
"fieldFlag": "ingester.push-circuit-breaker.thresholding-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
@@ -3198,7 +3198,7 @@
"desc": "How long the circuit breaker will stay in the open state before allowing some requests",
"fieldValue": null,
"fieldDefaultValue": 10000000000,
"fieldFlag": "ingester.circuit-breaker.cooldown-period",
"fieldFlag": "ingester.push-circuit-breaker.cooldown-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
@@ -3209,31 +3209,107 @@
"desc": "How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.circuit-breaker.initial-delay",
"fieldFlag": "ingester.push-circuit-breaker.initial-delay",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "push_timeout",
"name": "request_timeout",
"required": false,
"desc": "The maximum length of time an ingester's Push request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"desc": "The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"fieldValue": null,
"fieldDefaultValue": 2000000000,
"fieldFlag": "ingester.circuit-breaker.push-timeout",
"fieldFlag": "ingester.push-circuit-breaker.request-timeout",
"fieldType": "duration",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
"fieldDefaultValue": null
},
{
"kind": "block",
"name": "read_circuit_breaker",
"required": false,
"desc": "",
"blockEntries": [
{
"kind": "field",
"name": "enabled",
"required": false,
"desc": "Enable circuit breaking when making requests to ingesters",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "ingester.read-circuit-breaker.enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_threshold_percentage",
"required": false,
"desc": "Max percentage of requests that can fail over period before the circuit breaker opens",
"fieldValue": null,
"fieldDefaultValue": 10,
"fieldFlag": "ingester.read-circuit-breaker.failure-threshold-percentage",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_execution_threshold",
"required": false,
"desc": "How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures",
"fieldValue": null,
"fieldDefaultValue": 100,
"fieldFlag": "ingester.read-circuit-breaker.failure-execution-threshold",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "thresholding_period",
"required": false,
"desc": "Moving window of time that the percentage of failed requests is computed over",
"fieldValue": null,
"fieldDefaultValue": 60000000000,
"fieldFlag": "ingester.read-circuit-breaker.thresholding-period",
"fieldType": "duration",
"fieldCategory": "experiment"
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "read_timeout",
"name": "cooldown_period",
"required": false,
"desc": "The maximum length of time an ingester's read-path request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"desc": "How long the circuit breaker will stay in the open state before allowing some requests",
"fieldValue": null,
"fieldDefaultValue": 10000000000,
"fieldFlag": "ingester.read-circuit-breaker.cooldown-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "initial_delay",
"required": false,
"desc": "How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.read-circuit-breaker.initial-delay",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "request_timeout",
"required": false,
"desc": "The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors.",
"fieldValue": null,
"fieldDefaultValue": 30000000000,
"fieldFlag": "ingester.circuit-breaker.read-timeout",
"fieldFlag": "ingester.read-circuit-breaker.request-timeout",
"fieldType": "duration",
"fieldCategory": "experiment"
"fieldCategory": "experimental"
}
],
"fieldValue": null,
44 changes: 28 additions & 16 deletions cmd/mimir/help-all.txt.tmpl
@@ -1307,22 +1307,6 @@ Usage of ./cmd/mimir/mimir:
After what time a series is considered to be inactive. (default 10m0s)
-ingester.active-series-metrics-update-period duration
How often to update active series metrics. (default 1m0s)
-ingester.circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.circuit-breaker.push-timeout duration
The maximum length of time an ingester's Push request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 2s)
-ingester.circuit-breaker.read-timeout duration
The maximum length of time an ingester's read-path request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 30s)
-ingester.circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.client.backoff-max-period duration
Maximum delay when backing off. (default 10s)
-ingester.client.backoff-min-period duration
@@ -1417,8 +1401,36 @@ Usage of ./cmd/mimir/mimir:
[experimental] Non-zero value enables out-of-order support for most recent samples that are within the time window in relation to the TSDB's maximum time, i.e., within [db.maxTime-timeWindow, db.maxTime]). The ingester will need more memory as a factor of rate of out-of-order samples being ingested and the number of series that are getting out-of-order samples. If query falls into this window, cached results will use value from -query-frontend.results-cache-ttl-for-out-of-order-time-window option to specify TTL for resulting cache entry.
-ingester.owned-series-update-interval duration
[experimental] How often to check for ring changes and possibly recompute owned series as a result of detected change. (default 15s)
-ingester.push-circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.push-circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.push-circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.push-circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.push-circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.push-circuit-breaker.request-timeout duration
[experimental] The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 2s)
-ingester.push-circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.rate-update-period duration
Period with which to update the per-tenant ingestion rates. (default 15s)
-ingester.read-circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.read-circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.read-circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.read-circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.read-circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.read-circuit-breaker.request-timeout duration
[experimental] The maximum duration of an ingester's request before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 30s)
-ingester.read-circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.read-path-cpu-utilization-limit float
[experimental] CPU utilization limit, as CPU cores, for CPU/memory utilization based read request limiting. Use 0 to disable it.
-ingester.read-path-memory-utilization-limit uint
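
The help text describes two thresholds that interact: `failure-execution-threshold` sets how many requests must have run within `thresholding-period` before the breaker is eligible to open, and `failure-threshold-percentage` sets the failure rate that trips it. The following Go sketch shows that decision for counts already gathered over one window. The function `shouldOpen` is hypothetical, and treating the percentage as an inclusive bound (open when the rate reaches the threshold) is an assumption, not something the help text states.

```go
package main

import "fmt"

// shouldOpen reports whether a breaker would trip for the counts observed
// in one thresholding period. Hypothetical helper for illustration.
// Assumption: the breaker trips when the failure percentage reaches or
// exceeds failureThresholdPercentage.
func shouldOpen(failures, executions, failureThresholdPercentage, failureExecutionThreshold uint) bool {
	// Not enough traffic in the window: the breaker is not eligible to open.
	if executions == 0 || executions < failureExecutionThreshold {
		return false
	}
	// Compare failures/executions against the percentage without division.
	return failures*100 >= failureThresholdPercentage*executions
}

func main() {
	// Defaults from the help text: 10 percent, 100 executions.
	fmt.Println(shouldOpen(9, 50, 10, 100))   // too few executions
	fmt.Println(shouldOpen(9, 100, 10, 100))  // 9% failure rate
	fmt.Println(shouldOpen(10, 100, 10, 100)) // 10% failure rate
}
```

With the defaults, a window holding fewer than 100 requests never opens the breaker regardless of how many of them failed.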
4 changes: 0 additions & 4 deletions cmd/mimir/help.txt.tmpl
@@ -389,10 +389,6 @@ Usage of ./cmd/mimir/mimir:
Print basic help.
-help-all
Print help, also including advanced and experimental parameters.
-ingester.circuit-breaker.push-timeout duration
The maximum length of time an ingester's Push request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 2s)
-ingester.circuit-breaker.read-timeout duration
The maximum length of time an ingester's read-path request can last before it triggers a circuit breaker. This configuration is used for circuit breakers only, and its timeouts aren't reported as errors. (default 30s)
-ingester.max-global-metadata-per-metric int
The maximum number of metadata per metric, across the cluster. 0 to disable.
-ingester.max-global-metadata-per-user int
22 changes: 14 additions & 8 deletions docs/sources/mimir/configure/about-versioning.md
@@ -118,14 +118,20 @@ The following features are currently experimental:
- `-ingester.use-ingester-owned-series-for-limits`
- `-ingester.owned-series-update-interval`
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.circuit-breaker.enabled`
- `-ingester.circuit-breaker.failure-threshold-percentage`
- `-ingester.circuit-breaker.failure-execution-threshold`
- `-ingester.circuit-breaker.thresholding-period`
- `-ingester.circuit-breaker.cooldown-period`
- `-ingester.circuit-breaker.initial-delay`
- `-ingester.circuit-breaker.push-timeout`
- `-ingester.circuit-breaker.read-timeout`
- `-ingester.push-circuit-breaker.enabled`
- `-ingester.push-circuit-breaker.failure-threshold-percentage`
- `-ingester.push-circuit-breaker.failure-execution-threshold`
- `-ingester.push-circuit-breaker.thresholding-period`
- `-ingester.push-circuit-breaker.cooldown-period`
- `-ingester.push-circuit-breaker.initial-delay`
- `-ingester.push-circuit-breaker.request-timeout`
- `-ingester.read-circuit-breaker.enabled`
- `-ingester.read-circuit-breaker.failure-threshold-percentage`
- `-ingester.read-circuit-breaker.failure-execution-threshold`
- `-ingester.read-circuit-breaker.thresholding-period`
- `-ingester.read-circuit-breaker.cooldown-period`
- `-ingester.read-circuit-breaker.initial-delay`
- `-ingester.read-circuit-breaker.request-timeout`
- Ingester client
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.client.circuit-breaker.enabled`