Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic 502 errors for HTTP/2 POST/PUT requests after upgrade to 2.8.x #571

Closed
plowin opened this issue Nov 20, 2023 · 4 comments
Closed

Comments

@plowin
Copy link
Contributor

plowin commented Nov 20, 2023

Description

This was observed in Cloud Foundry landscapes with the following ingress path: Client -> LB -> HAProxy -> gorouter -> app. On the frontend, HTTP/2 is enabled and apps are free to choose between HTTP/1.1 and HTTP/2. If the speaks H/2, H/2 is used towards gorouter who might to a downgrade and opens several HTTP/1.1 connection towards the respective app for a paralllel H/2 request at the frontend. With the HAProxy upgrade from 2.7.10 to 2.8.3, we started observing a failure rate of ~1% for POST and PUT requests.

Initial Analysis

In the platform logs, we observed differing status codes for HAProxy and Gorouter respectively. If gorouter logs 200 response, HAProxy might log a 502 for the same request. HAProxy also logs termination state SH--, from the docs:

     SH   The server aborted before sending its full HTTP response headers, or
          it crashed while processing the request. Since a server aborting at
          this moment is very rare, it would be wise to inspect its logs to
          control whether it crashed and why. The logged request may indicate a
          small set of faulty requests, demonstrating bugs in the application.
          Sometimes this might also be caused by an IDS killing the connection
          between HAProxy and the server.

The issue could be reproduced with 100 req/s, e.g. via:

hey -q 2 -z 1m -h2 -m POST -d "some body" https://some-test-app.<ls-domain>

Explanation: execute HTTP/2 POST requests with the given string as body for 1 minute with 2 requests per worker per second using the default worker pool of 50. This results in 100 requests per second using 50 connections for 60s = 6000 requests (if everything goes well, might be a bit less as the program hard-stops after one minute).

Countermeasures

Do not move to 2.8.x if you run a similar setup. We keep maintaining the 2.7.y-line as well.

@maxmoehl
Copy link
Member

So we went through three separate stages (HAProxy version; status code; termination state: explanation):

  1. 2.7.X; 200; ----: Known-good behaviour.
  2. 2.8.3; 502; SH--: HAProxy started reporting errors due to backend issues although the backend sent a valid response code.
  3. 2.8.4; 200; SD--: Fix applied, error reporting still broken.

1. Known Good

With 2.7 we didn't observe any particular issues.

2. Broken Behaviour

With 2.8 some of the scenarios broke. We suspect that this broken behaviour was introduced with haproxy/haproxy@f2b02cf but we don't know for sure. The observed behaviour was that for 0.1% to 1% of requests with a body the backend responded with a 200 but HAProxy detected some broken backend state and responded with a 502.

My current understanding is that this was previously masked because of the way the states were checked. With 2.8 there was an over-haul of error checking. The specific case is if HAProxy still has data to send (the request body) but the server already responded with a header frame which had the STREAM_END flag set. HAProxy considered this an error but the spec does allow for this, the server simply does not expect a body so it immediately sends a response and the client does not have to send all of the body.

3. Fix Applied

With haproxy/haproxy@d3e379b HAProxy still processes the response even if the server already closed the stream as the closing of the stream and the response can be a single frame.


There's probably a lot more details to this but for me this explanation is sound enough to accept that it's fixed with 2.8.4. The error reporting is still broken but that seems to depend on a pending refactor.

@domdom82
Copy link
Contributor

domdom82 commented Nov 20, 2023

I'd say for the purpose of this issue it should be sufficient to bump to 2.8.4 and close it as the traffic is no longer hampered. We may or may not decide to roll out the change or wait for a "better" fix. However, upstream clients may still want to use 2.8 and should not be forced to wait because of a seemingly visual-only error.

@plowin
Copy link
Contributor Author

plowin commented Nov 24, 2023

Fixed in https://github.com/cloudfoundry/haproxy-boshrelease/releases/tag/v13.3.0%2B2.8.4.

As mentioned above, the new version might contain some misleading 200s with termination state SD-- which will hopefully soon be fixed.

Edit: Adding additional feedback from the HAProxy community

So the fix to delay the server abort detection to
be able to send the response to the client was pushed to 2.8.x. However, the "SD--"
termination state is a side effect of the current design and will require an
important refactoring to be fixed. Not this specific case of course. It is only
a small piece of a bigger problem.
We will work on it for sure. But I don't really know when it will be
available. Hopefully for the 3.0, but not sure. And unfortunately I doubt it
will ever been backported to 2.8.x. Maybe some fixes will be backportable. Hard
to know what for now.

@plowin plowin closed this as completed Nov 24, 2023
@capflam
Copy link

capflam commented Dec 19, 2023

Just a little update about this issue. A recent fix was pushed to haproxy-3.0-DEV. It may help to limit false SD-- reports in logs: haproxy/haproxy@d9eb6d6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants