Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTLP/HTTP: Retry or no on status code 401, 403? #2915

Closed
newly12 opened this issue Nov 1, 2022 · 15 comments · Fixed by #3028
Closed

OTLP/HTTP: Retry or no on status code 401, 403? #2915

newly12 opened this issue Nov 1, 2022 · 15 comments · Fixed by #3028
Labels
spec:protocol Related to the specification/protocol directory triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal

Comments

@newly12
Copy link
Contributor

newly12 commented Nov 1, 2022

Is your feature request related to a problem? Please describe.
401, 403 means auth failure, typically retry should not be attempted.

Describe the solution you'd like
add 401, 403 as PermanentClientFailure here

@bogdandrutu
Copy link
Member

cc @tigrannajaryan

@tigrannajaryan
Copy link
Member

I think this needs to be discussed first.

An argument can be made that only responses that indicate that the request data is wrong should be permanent (since retrying it will never succeed). Everything else seems useful to retry using our usual backoff strategy. For example auth failures may be because of misconfiguration on the server side which can be fixed over time and a retry may succeed. It is not obvious to me that dropping the data is more useful than retrying it in this case.

@tigrannajaryan
Copy link
Member

I think this discussion belongs to the spec, moving there.

@tigrannajaryan tigrannajaryan transferred this issue from open-telemetry/opentelemetry-collector Nov 3, 2022
@tigrannajaryan tigrannajaryan changed the title otlphttp: do not retry on status code 401, 403 OTLP/HTTP: Retry or no on status code 401, 403? Nov 3, 2022
@tigrannajaryan
Copy link
Member

@open-telemetry/specs-approvers any thoughts on this?

@tigrannajaryan tigrannajaryan added spec:protocol Related to the specification/protocol directory needs discussion Need more information before all suitable labels can be applied labels Nov 3, 2022
@jack-berg
Copy link
Member

Related to #2217.

Without clarification from the spec, java decided that 429, 502, 503, and 504 are retryable, as seen here. We've left the retry feature in java unstable due to this lack of specificity (and also #1742) which really is a shame because retry is super important for production use cases.

@yurishkuro
Copy link
Member

  • Why would this be different from gRPC exporter implementation?
  • IIRC, in gRPC the retry is an instruction returned by the server, why can't be the case with HTTP?

@jack-berg
Copy link
Member

Why would this be different from gRPC exporter implementation?

gRPC retry is largely based on the grpc status code (not http status code) returned.

IIRC, in gRPC the retry is an instruction returned by the server, why can't be the case with HTTP?

gRPC clients can dictate a retry policy as described here.

@yurishkuro
Copy link
Member

gRPC retry is largely based on the grpc status code (not http status code) returned.

Doesn't answer my question - the status codes are conceptually similar, and the approach should be the same across transports.

gRPC clients can dictate a retry policy as described here.

Also doesn't answer my question. Yes, the client can do it, but why make it a client (distributed) problem instead of a server (centralized) problem? Server should know better anyway if the request is retryable.

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Nov 3, 2022

For the context, the spec today says:

  • 400 Bad Data is non-retryable
  • 429 and 503 should be retried
  • All other responses "should be treated according to HTTP specification". This part is unclear since HTTP does not always define whether a client should retry when it receives a particular response code.

For OTLP/gRCP we have a more explicit table that lists every possible gRPC response code and tells whether it is retryable.

We can introduce a similar table for OTLP/HTTP and borrow from OTLP/gRPC's equivalent lines (e.g. UNAUTHENTICATED in gRPC corresponds to HTTP 401.

I still wanted to discuss this to make sure we agree this is the right approach. I am not certain that it is.

@jack-berg
Copy link
Member

I still wanted to discuss this to make sure we agree this is the right approach. I am not certain that it is.

Do you anticipate some sort of downside from being explicit about which are retryable?

@tigrannajaryan
Copy link
Member

I still wanted to discuss this to make sure we agree this is the right approach. I am not certain that it is.

Do you anticipate some sort of downside from being explicit about which are retryable?

No. We should be explicit. I had some imaginary situations in my mind that after thinking through I no longer think should be considered.

@gouthamve
Copy link
Member

Also see #2993

@reyang
Copy link
Member

reyang commented Dec 1, 2022

@alanwest @mic-max FYI. Related to https://github.com/open-telemetry/opentelemetry-dotnet/pulls?q=is%3Apr+OTLP+retry.

@newly12
Copy link
Contributor Author

newly12 commented Dec 7, 2022

Thanks for moving the issue to the right place. any idea how this can be moved forward?

For example auth failures may be because of misconfiguration on the server side which can be fixed over time and a retry may succeed.

IHMO this may or may not be the case, i.e. if the server failed to authenticate the request because of external dependencies, say the auth server for example, it should be a 5xx error i.e. 502?

@tigrannajaryan tigrannajaryan added triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal and removed needs discussion Need more information before all suitable labels can be applied labels Dec 7, 2022
@tigrannajaryan
Copy link
Member

PRs to fix this are welcome. I think we can mirror what we do for OTLP/gRPC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:protocol Related to the specification/protocol directory triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal
Projects
None yet
8 participants