Sporadic 502 errors for HTTP/2 POST/PUT requests after upgrade to 2.8.x #571

plowin · 2023-11-20T06:49:21Z

Description

This was observed in Cloud Foundry landscapes with the following ingress path: Client -> LB -> HAProxy -> gorouter -> app. On the frontend, HTTP/2 is enabled and apps are free to choose between HTTP/1.1 and HTTP/2. If the speaks H/2, H/2 is used towards gorouter who might to a downgrade and opens several HTTP/1.1 connection towards the respective app for a paralllel H/2 request at the frontend. With the HAProxy upgrade from 2.7.10 to 2.8.3, we started observing a failure rate of ~1% for POST and PUT requests.

Initial Analysis

In the platform logs, we observed differing status codes for HAProxy and Gorouter respectively. If gorouter logs 200 response, HAProxy might log a 502 for the same request. HAProxy also logs termination state SH--, from the docs:

     SH   The server aborted before sending its full HTTP response headers, or
          it crashed while processing the request. Since a server aborting at
          this moment is very rare, it would be wise to inspect its logs to
          control whether it crashed and why. The logged request may indicate a
          small set of faulty requests, demonstrating bugs in the application.
          Sometimes this might also be caused by an IDS killing the connection
          between HAProxy and the server.

The issue could be reproduced with 100 req/s, e.g. via:

hey -q 2 -z 1m -h2 -m POST -d "some body" https://some-test-app.<ls-domain>

Explanation: execute HTTP/2 POST requests with the given string as body for 1 minute with 2 requests per worker per second using the default worker pool of 50. This results in 100 requests per second using 50 connections for 60s = 6000 requests (if everything goes well, might be a bit less as the program hard-stops after one minute).

Countermeasures

Do not move to 2.8.x if you run a similar setup. We keep maintaining the 2.7.y-line as well.

The text was updated successfully, but these errors were encountered:

maxmoehl · 2023-11-20T12:56:58Z

So we went through three separate stages (HAProxy version; status code; termination state: explanation):

2.7.X; 200; ----: Known-good behaviour.
2.8.3; 502; SH--: HAProxy started reporting errors due to backend issues although the backend sent a valid response code.
2.8.4; 200; SD--: Fix applied, error reporting still broken.

1. Known Good

With 2.7 we didn't observe any particular issues.

2. Broken Behaviour

With 2.8 some of the scenarios broke. We suspect that this broken behaviour was introduced with haproxy/haproxy@f2b02cf but we don't know for sure. The observed behaviour was that for 0.1% to 1% of requests with a body the backend responded with a 200 but HAProxy detected some broken backend state and responded with a 502.

My current understanding is that this was previously masked because of the way the states were checked. With 2.8 there was an over-haul of error checking. The specific case is if HAProxy still has data to send (the request body) but the server already responded with a header frame which had the STREAM_END flag set. HAProxy considered this an error but the spec does allow for this, the server simply does not expect a body so it immediately sends a response and the client does not have to send all of the body.

3. Fix Applied

With haproxy/haproxy@d3e379b HAProxy still processes the response even if the server already closed the stream as the closing of the stream and the response can be a single frame.

There's probably a lot more details to this but for me this explanation is sound enough to accept that it's fixed with 2.8.4. The error reporting is still broken but that seems to depend on a pending refactor.

domdom82 · 2023-11-20T13:46:07Z

I'd say for the purpose of this issue it should be sufficient to bump to 2.8.4 and close it as the traffic is no longer hampered. We may or may not decide to roll out the change or wait for a "better" fix. However, upstream clients may still want to use 2.8 and should not be forced to wait because of a seemingly visual-only error.

plowin · 2023-11-24T12:31:23Z

Fixed in https://github.com/cloudfoundry/haproxy-boshrelease/releases/tag/v13.3.0%2B2.8.4.

As mentioned above, the new version might contain some misleading 200s with termination state SD-- which will hopefully soon be fixed.

Edit: Adding additional feedback from the HAProxy community

So the fix to delay the server abort detection to
be able to send the response to the client was pushed to 2.8.x. However, the "SD--"
termination state is a side effect of the current design and will require an
important refactoring to be fixed. Not this specific case of course. It is only
a small piece of a bigger problem.
We will work on it for sure. But I don't really know when it will be
available. Hopefully for the 3.0, but not sure. And unfortunately I doubt it
will ever been backported to 2.8.x. Maybe some fixes will be backportable. Hard
to know what for now.

capflam · 2023-12-19T06:56:12Z

Just a little update about this issue. A recent fix was pushed to haproxy-3.0-DEV. It may help to limit false SD-- reports in logs: haproxy/haproxy@d9eb6d6

plowin closed this as completed Nov 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sporadic 502 errors for HTTP/2 POST/PUT requests after upgrade to 2.8.x #571

Sporadic 502 errors for HTTP/2 POST/PUT requests after upgrade to 2.8.x #571

plowin commented Nov 20, 2023

maxmoehl commented Nov 20, 2023

domdom82 commented Nov 20, 2023 •

edited

Loading

plowin commented Nov 24, 2023 •

edited

Loading

capflam commented Dec 19, 2023 •

edited

Loading

Sporadic 502 errors for HTTP/2 POST/PUT requests after upgrade to 2.8.x #571

Sporadic 502 errors for HTTP/2 POST/PUT requests after upgrade to 2.8.x #571

Comments

plowin commented Nov 20, 2023

Description

Initial Analysis

Countermeasures

maxmoehl commented Nov 20, 2023

1. Known Good

2. Broken Behaviour

3. Fix Applied

domdom82 commented Nov 20, 2023 • edited Loading

plowin commented Nov 24, 2023 • edited Loading

capflam commented Dec 19, 2023 • edited Loading

domdom82 commented Nov 20, 2023 •

edited

Loading

plowin commented Nov 24, 2023 •

edited

Loading

capflam commented Dec 19, 2023 •

edited

Loading