Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributor: Return 529 for ingestion rate limit when serviceOverloadErrorEnabled #6549

Merged
merged 7 commits into from
Nov 14, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,7 @@
* [ENHANCEMENT] Server: Add `-ingester.client.report-grpc-codes-in-instrumentation-label-enabled` CLI flag to specify whether gRPC status codes should be used in `status_code` label of `cortex_ingester_client_request_duration_seconds` metric. It defaults to false, meaning that successful and erroneous gRPC status codes are represented with `2xx` and `error` respectively. #6562
* [ENHANCEMENT] Server: Add `-server.http-log-closed-connections-without-response-enabled` option to log details about connections to HTTP server that were closed before any data was sent back. This can happen if client doesn't manage to send complete HTTP headers before timeout. #6612
* [ENHANCEMENT] Query-frontend: include length of query, time since the earliest and latest points of a query in "query stats" logs. Time parameters (start/end/time) are always formatted as RFC3339 now. #6473
* [BUGFIX] Distributor: return server overload error in the event of exceeding the ingestion rate limit. #6549
* [BUGFIX] Ring: Ensure network addresses used for component hash rings are formatted correctly when using IPv6. #6068
* [BUGFIX] Query-scheduler: don't retain connections from queriers that have shut down, leading to gradually increasing enqueue latency over time. #6100 #6145
* [BUGFIX] Ingester: prevent query logic from continuing to execute after queries are canceled. #6085
Expand Down
2 changes: 1 addition & 1 deletion cmd/mimir/config-descriptor.json
Original file line number Diff line number Diff line change
Expand Up @@ -3218,7 +3218,7 @@
"kind": "field",
"name": "service_overload_status_code_on_rate_limit_enabled",
"required": false,
"desc": "If enabled, rate limit errors will be reported to the client with HTTP status code 529 (Service is overloaded). If disabled, status code 429 (Too Many Requests) is used.",
"desc": "If enabled, rate limit errors will be reported to the client with HTTP status code 529 (Service is overloaded). If disabled, status code 429 (Too Many Requests) is used. Enabling -distributor.retry-after-header.enabled before utilizing this option is strongly recommended as it helps prevent premature request retries by the client.",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "distributor.service-overload-status-code-on-rate-limit-enabled",
Expand Down
2 changes: 1 addition & 1 deletion cmd/mimir/help-all.txt.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -1212,7 +1212,7 @@ Usage of ./cmd/mimir/mimir:
-distributor.ring.store string
Backend storage to use for the ring. Supported values are: consul, etcd, inmemory, memberlist, multi. (default "memberlist")
-distributor.service-overload-status-code-on-rate-limit-enabled
[experimental] If enabled, rate limit errors will be reported to the client with HTTP status code 529 (Service is overloaded). If disabled, status code 429 (Too Many Requests) is used.
[experimental] If enabled, rate limit errors will be reported to the client with HTTP status code 529 (Service is overloaded). If disabled, status code 429 (Too Many Requests) is used. Enabling -distributor.retry-after-header.enabled before utilizing this option is strongly recommended as it helps prevent premature request retries by the client.
-distributor.write-requests-buffer-pooling-enabled
[experimental] Enable pooling of buffers used for marshaling write requests.
-enable-go-runtime-metrics
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2919,7 +2919,10 @@ The `limits` block configures default and per-tenant limits imposed by component

# (experimental) If enabled, rate limit errors will be reported to the client
# with HTTP status code 529 (Service is overloaded). If disabled, status code
# 429 (Too Many Requests) is used.
# 429 (Too Many Requests) is used. Enabling
# -distributor.retry-after-header.enabled before utilizing this option is
# strongly recommended as it helps prevent premature request retries by the
# client.
# CLI flag: -distributor.service-overload-status-code-on-rate-limit-enabled
[service_overload_status_code_on_rate_limit_enabled: <boolean> | default = false]

Expand Down
7 changes: 1 addition & 6 deletions pkg/distributor/push.go
Original file line number Diff line number Diff line change
Expand Up @@ -160,12 +160,7 @@ func toHTTPStatus(ctx context.Context, pushErr error, limits *validation.Overrid
switch distributorErr.errorCause() {
case mimirpb.BAD_DATA:
return http.StatusBadRequest
case mimirpb.INGESTION_RATE_LIMITED:
// Return a 429 here to tell the client it is going too fast.
// Client may discard the data or slow down and re-send.
// Prometheus v2.26 added a remote-write option 'retry_on_http_429'.
return http.StatusTooManyRequests
case mimirpb.REQUEST_RATE_LIMITED:
case mimirpb.INGESTION_RATE_LIMITED, mimirpb.REQUEST_RATE_LIMITED:
serviceOverloadErrorEnabled := false
userID, err := tenant.TenantID(ctx)
if err == nil {
Expand Down
6 changes: 6 additions & 0 deletions pkg/distributor/push_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -913,6 +913,12 @@ func TestHandler_ToHTTPStatus(t *testing.T) {
expectedHTTPStatus: http.StatusTooManyRequests,
expectedErrorMsg: ingestionRateLimitedErr.Error(),
},
"an ingestionRateLimitedError with serviceOverloadErrorEnabled gets translated into an HTTP 529": {
err: ingestionRateLimitedErr,
serviceOverloadErrorEnabled: true,
expectedHTTPStatus: StatusServiceOverloaded,
expectedErrorMsg: ingestionRateLimitedErr.Error(),
},
"a DoNotLogError of an ingestionRateLimitedError gets translated into an HTTP 429": {
err: middleware.DoNotLogError{Err: ingestionRateLimitedErr},
expectedHTTPStatus: http.StatusTooManyRequests,
Expand Down
2 changes: 1 addition & 1 deletion pkg/util/validation/limits.go
Original file line number Diff line number Diff line change
Expand Up @@ -205,7 +205,7 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) {
_ = l.CreationGracePeriod.Set("10m")
f.Var(&l.CreationGracePeriod, CreationGracePeriodFlag, "Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + grace_period)'. This configuration is enforced in the distributor, ingester and query-frontend (to avoid querying too far into the future).")
f.BoolVar(&l.EnforceMetadataMetricName, "validation.enforce-metadata-metric-name", true, "Enforce every metadata has a metric name.")
f.BoolVar(&l.ServiceOverloadStatusCodeOnRateLimitEnabled, "distributor.service-overload-status-code-on-rate-limit-enabled", false, "If enabled, rate limit errors will be reported to the client with HTTP status code 529 (Service is overloaded). If disabled, status code 429 (Too Many Requests) is used.")
f.BoolVar(&l.ServiceOverloadStatusCodeOnRateLimitEnabled, "distributor.service-overload-status-code-on-rate-limit-enabled", false, "If enabled, rate limit errors will be reported to the client with HTTP status code 529 (Service is overloaded). If disabled, status code 429 (Too Many Requests) is used. Enabling -distributor.retry-after-header.enabled before utilizing this option is strongly recommended as it helps prevent premature request retries by the client.")

f.IntVar(&l.MaxGlobalSeriesPerUser, MaxSeriesPerUserFlag, 150000, "The maximum number of in-memory series per tenant, across the cluster before replication. 0 to disable.")
f.IntVar(&l.MaxGlobalSeriesPerMetric, MaxSeriesPerMetricFlag, 0, "The maximum number of in-memory series per metric name, across the cluster before replication. 0 to disable.")
Expand Down
Loading