Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguity around which connection errors are retriable #171

Open
paddybyers opened this issue Nov 4, 2023 · 10 comments
Open

Ambiguity around which connection errors are retriable #171

paddybyers opened this issue Nov 4, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@paddybyers
Copy link
Member

A recent issue in ably-cocoa highlighted the problem that a specific TLS-related connection error response was being treated as fatal instead of retriable. If this error occurs in a connection attempt, the connection would then move to failed.

However, there are valid arguments for some TLS failures not being retried. Suppose for example that a platform only supports TLS 1.1, and the Ably endpoint only admits TLS 1.2+, then you're going to get a TLS handshake error. If you keep retrying in that situation, then it's still never going to succeed in connecting. So there might be an argument to detect that specific case, and not attempt retrying; however, the risk with that is that you also exclude things that you really do want to retry, which is the case we have in the issue linked. So we have to err on the side of retrying in any situation where the possible underlying cause is ambiguous, and then we have to try to eliminate the possibility of things like protocol incompatibility by other means.

Similar arbitrary decisions as to what is retried are made in ably-java. The spec is currently ambiguous about what errors should be retried. The language of RTN14d is also quite confusing. It attempts to be prescriptive about what’s a recoverable error “(ie a network failure, a timeout such as RTN14c, or a disconnected response, other than a token failure RTN14b)“, whilst not really describing what should be considered a “network failure”.

@paddybyers paddybyers added the bug Something isn't working label Nov 4, 2023
Copy link

sync-by-unito bot commented Nov 4, 2023

➤ Automation for Jira commented:

The link to the corresponding Jira issue is https://ably.atlassian.net/browse/SDK-3927

@paddybyers
Copy link
Member Author

Internal discussion: https://github.com/ably/realtime/issues/5283

@paddybyers
Copy link
Member Author

cc @lawrence-forooghian

lawrence-forooghian added a commit to ably/ably-cocoa that referenced this issue Nov 6, 2023
RTN14d says that a transport error should be considered recoverable if
it is

> a network failure, a timeout such as RTN14c, or a disconnected
> response, other than a token failure RTN14b)

However, it does not define what it means by a "network failure",
leading to each platform having to provide its own interpretation.

In particular, a client recently told us that they’ve been seeing lots
of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate
some sort of failure to perform an SSL handshake. We don’t understand
the cause of this issue but we noticed that it was causing the client to
transition to the FAILED state.

Speaking to Paddy, he said that this error should be provoking a
connection retry, not failure. And more broadly, he indicated that,
basically, _all_ transport errors should be considered recoverable. So,
that’s what we do here. He’s raised a specification issue [1] for us to
properly specify this and to decide if there are any nuances to
consider, but is keen for us to implement this broad behaviour in
ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort
errors.

Resolves #1817.

[1] ably/specification#171
@SimonWoolf
Copy link
Member

The language of RTN14d is also quite confusing. It attempts to be prescriptive about what’s a recoverable error “(ie a network failure, a timeout such as RTN14c, or a disconnected response, other than a token failure RTN14b)“, whilst not really describing what should be considered a “network failure”.

This should definitely be improved, yes. The spec is clearer about what failures should be retried to a fallback host: RSC15l, including RSC15l1: "host unresolvable or unreachable". I'd have thought TLS issues definitely fall under that. (That's a spec item about the rest api but is incorporated into the realtime api by RTN17f, which adds DISCONNECTED with 5xx).

The set of failures that should be retried to a fallback host is by definition a subset of the set of failures that should be retried. So perhaps RTN14d should reference RTN17f, and additionally list any conditions that should result in a retry but without changing the endpoint endpoint (which I guess think is just 4xx disconnecteds?)

@paddybyers
Copy link
Member Author

So we're clear that the philosophy intended in the spec, perhaps able to be improved, is correct:

If Ably specifically gives you an error response, and that error implies that you shouldn't retry, then you respect that. In every other situation, you retry. So:

  • any error response that is not from an Ably realtime endpoint (which is determined by having an Ably error code, either in the response body or a header, or is a ProtocolMessage containing an error), is retriable;
  • an error response that is from an Ably endpoint:
    • is a token error if it has a code in the token range, and should be retried after a token renewal sequence;
    • otherwise, if it has a 4xx status code, is non-retriable
    • otherwise, is retriable.

This will mean that some connection problems (such as the TLS protocol version incompatibility example above) will give rise to indefinite doomed retries, but this is an accepted penalty for giving ourselves the best chance of connecting after transient errors.

There are also some responses from non-realtime elements of the infrastructure that we want clients to respect, such as rate-limiting responses from the WAF, but there might be others. We need to decide how to address these in the spec.

lawrence-forooghian added a commit to ably/ably-cocoa that referenced this issue Nov 6, 2023
RTN14d says that a transport error should be considered recoverable if
it is

> a network failure, a timeout such as RTN14c, or a disconnected
> response, other than a token failure RTN14b)

However, it does not define what it means by a "network failure",
leading to each platform having to provide its own interpretation.

In particular, a client recently told us that they’ve been seeing lots
of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate
some sort of failure to perform an SSL handshake. We don’t understand
the cause of this issue but we noticed that it was causing the client to
transition to the FAILED state.

Speaking to Paddy, he said that this error should be provoking a
connection retry, not failure. And more broadly, he indicated that,
basically, _all_ transport errors should be considered recoverable. So,
that’s what we do here. He’s raised a specification issue [1] for us to
properly specify this and to decide if there are any nuances to
consider, but is keen for us to implement this broad behaviour in
ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort
errors.

Resolves #1817.

[1] ably/specification#171
@lawrence-forooghian
Copy link
Collaborator

lawrence-forooghian commented Nov 6, 2023

@paddybyers I think I mentioned in our Slack chat that it wasn't clear to me what it meant by "an error response" in the context of a Realtime client using a WebSocket transport. Does it imply that the client needs to pay attention to the response code and headers received in the WebSocket handshake?

@lawrence-forooghian
Copy link
Collaborator

(@paddybyers when I mentioned this to you, you rephrased the logic in terms of WebSocket API close and error events.)

@paddybyers
Copy link
Member Author

it wasn't clear to me what it meant by "an error response" in the context of a Realtime client using a WebSocket transport. Does it imply that the client needs to pay attention to the response code and headers received in the WebSocket handshake?

I attempted to reflect that in what I wrote above, "or is a ProtocolMessage containing an error".

If realtime is trying to tell you something, and you're a ws connection, then it does it by sending a ProtocolMessage.

If you get a ws-level event (close, error), without having received a prior ProtocolMessage, then you have to treat that as a retriable error.

A ProtocolMessage whose action is error, and that does not have a channel, is treated as a connection-level error indication, and the rules above apply, ie if msg.error.statusCode is 4xx then it's non-retriable.

A ProtocolMessage whose action is error and does have a channel, is treated as a channel-level error indication.

A ProtoclMessage whose action is disconnected is by definition retriable.

@lawrence-forooghian
Copy link
Collaborator

A ProtocolMessage whose action is error, and that does not have a channel, is treated as a connection-level error indication, and the rules above apply, ie if msg.error.statusCode is 4xx then it's non-retriable.

A ProtocolMessage whose action is error and does have a channel, is treated as a channel-level error indication.

A ProtoclMessage whose action is disconnected is by definition retriable.

We already have spec points that describe expected behaviour upon receiving ERROR and DISCONNECTED ProtocolMessages. Do you believe that the spec needs expanding in this sense beyond what's already there?

@paddybyers
Copy link
Member Author

We already have spec points that describe expected behaviour upon receiving ERROR and DISCONNECTED ProtocolMessages. Do you believe that the spec needs expanding in this sense beyond what's already there?

Without trawling through the detail now, I think the answer is:

  • the handling of disconnections and declined connections, in the case that there is a ProtocolMessage, is well defined;
  • the possible ambiguity of what do do with an error or close event, when there was no ProtocolMessage, is what we need to check

lawrence-forooghian added a commit to ably/ably-cocoa that referenced this issue Nov 6, 2023
RTN14d says that a transport error should be considered recoverable if
it is

> a network failure, a timeout such as RTN14c, or a disconnected
> response, other than a token failure RTN14b)

However, it does not define what it means by a "network failure",
leading to each platform having to provide its own interpretation.

In particular, a client recently told us that they’ve been seeing lots
of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate
some sort of failure to perform an SSL handshake. We don’t understand
the cause of this issue but we noticed that it was causing the client to
transition to the FAILED state.

Speaking to Paddy, he said that this error should be provoking a
connection retry, not failure. And more broadly, he indicated that,
basically, _all_ transport errors should be considered recoverable. So,
that’s what we do here. He’s raised a specification issue [1] for us to
properly specify this and to decide if there are any nuances to
consider, but is keen for us to implement this broad behaviour in
ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort
errors.

Resolves #1817.

[1] ably/specification#171
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

No branches or pull requests

3 participants