Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mixin(Store): handle ResourceExhausted as a non-server error #6218

Merged
merged 13 commits into from
Mar 30, 2023

Conversation

douglascamata
Copy link
Contributor

@douglascamata douglascamata commented Mar 16, 2023

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Stopped treating ResourceExchausted as a server error. It's an error being used whenever a request is limited. Because of this, it shouldn't help trigger the ThanosStoreGrpcErrorRate alert.

Verification

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata douglascamata changed the title Store: ResourceExchausted is not a server error Store: handle ResourceExchausted as not a server error Mar 16, 2023
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@fpetkovski
Copy link
Contributor

Thanks for the fix. Looks like the docs job failure is related to the changes.

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata
Copy link
Contributor Author

@fpetkovski fixed in aa2ade6, thanks for the heads up.

Copy link
Contributor

@fpetkovski fpetkovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to have a panel for client errors as well? My only concern is that we have no good way to monitor this status code now.

@douglascamata
Copy link
Contributor Author

douglascamata commented Mar 17, 2023

@fpetkovski the dashboard is whole different problem that I plan to tackle soon. In this PR I only change alerts and rules for this reason.

We have a dashboard widget that shows different error timeseries (see below), but all labelled as "error". It's a pattern that repeats in grpc and http error charts. It's not that easy to fix it without breaking something else because there is a lot of reuse of functions (qpsErrTotalPanel, grpcErrorsPanel, and httpErrorsPanel) for this. My target is to make all these dashboards show a series for each error, properly labelled with their grpc_code or code (when applicable).

"expr": "sum by (job) (rate(http_requests_total{job=~\"$job\", handler=\"receive\",code=~\"5..\"}[$interval])) / sum by (job) (rate(http_requests_total{job=~\"$job\", handler=\"receive\"}[$interval]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "error",
"step": 10

matej-g
matej-g previously approved these changes Mar 17, 2023
Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it makes no sense to include this as server error, although I'm also wary of #6218 (review) (but sounds like @douglascamata has a plan 😎).

examples/alerts/alerts.md Outdated Show resolved Hide resolved
Co-authored-by: Matej Gera <38492574+matej-g@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata
Copy link
Contributor Author

@matej-g @fpetkovski if I cannot find a good way of fixing all the widgets at once and for all, I will definitely patch out the gRPC errors widget to not account for ResourceExhausted and add a separate widget for it. Our team use the mixin dashboards, so we have lots of interest in keeping them in good shape. 👍

matej-g
matej-g previously approved these changes Mar 17, 2023
Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some remaining nits 👍

examples/alerts/alerts.yaml Outdated Show resolved Hide resolved
mixin/alerts/store.libsonnet Outdated Show resolved Hide resolved
mixin/runbook.md Outdated Show resolved Hide resolved
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata
Copy link
Contributor Author

@matej-g fixed all the suggestions myself in e9e0245 to use the docs/examples automatic generation to ensure nothing is forgotten. 👍

@douglascamata
Copy link
Contributor Author

@matej-g FYI I opened #6231 to take care of make the error types more evident in gRPC (and HTTP) error widgets. 👍

matej-g
matej-g previously approved these changes Mar 22, 2023
…usted-fix

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata
Copy link
Contributor Author

@matej-g finally got the build to be all green, without any intermittent failure. PTAL. 🙇

CHANGELOG.md Outdated Show resolved Hide resolved
Co-authored-by: Philip Gough <pgough@redhat.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata douglascamata changed the title Store: handle ResourceExchausted as not a server error Store: handle ResourceExhausted as a non-server error Mar 28, 2023
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
matej-g
matej-g previously approved these changes Mar 29, 2023
Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit for changelog entry, otherwise good to go 🎸

CHANGELOG.md Outdated Show resolved Hide resolved
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata douglascamata changed the title Store: handle ResourceExhausted as a non-server error mixin(Store): handle ResourceExhausted as a non-server error Mar 29, 2023
@douglascamata douglascamata requested review from matej-g and fpetkovski and removed request for matej-g and fpetkovski March 29, 2023 13:52
@matej-g matej-g enabled auto-merge (squash) March 30, 2023 08:19
@matej-g matej-g merged commit 15dcfdb into thanos-io:main Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants