
rpc/transport: release caller units when timeout occurs #6738

Merged: 1 commit into redpanda-data:dev on Oct 18, 2022

Conversation

@mmaslankaprv (Member) commented Oct 12, 2022

Cover letter

When a request is completed with a timeout we must release the caller-passed semaphore units associated with the request. Otherwise the units may be held in the request queue long enough to outlive the caller, causing a use-after-free error when they are finally released.

Fixes: #6711
Fixes: #5261

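For context, a minimal sketch of the idea behind the fix; the names queued_request and on_request_timeout are hypothetical, not the actual transport types. The point is that the queue entry owns the caller's seastar::semaphore_units and must drop them in the timeout path, while the caller is still alive:

    #include <seastar/core/semaphore.hh>
    #include <optional>

    struct queued_request {                              // hypothetical type
        // caller-passed units; holding them here is what can outlive the caller
        std::optional<seastar::semaphore_units<>> units;
    };

    void on_request_timeout(queued_request& r) {         // hypothetical hook
        // ~semaphore_units returns the capacity to the caller's semaphore.
        // Releasing here, before the caller's future is completed, ensures
        // the queue never holds a reference into state the caller may
        // destroy once it observes the timeout.
        r.units.reset();
    }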

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x


Release notes

Bug Fixes

  • Fix a heap use-after-free during redpanda shutdown

@dotnwat (Member) previously approved these changes Oct 12, 2022, commenting:
yeh this looks correct to me

@dotnwat (Member) commented Oct 12, 2022:

looks like maybe there are more UAF issues?

INFO  2022-10-12 14:00:00,572 [shard 1] rpc - transport.cc:163 - Request timeout to {host: docker-rp-22, port: 33145}, correlation id: 656 (1 in flight)
=================================================================
==3826==ERROR: AddressSanitizer: heap-use-after-free on address 0x608000d6c650 at pc 0x56324e614bc2 bp 0x7f249a951050 sp 0x7f249a951048
READ of size 8 at 0x608000d6c650 thread T1 (reactor-1)
INFO  2022-10-12 14:00:00,572 [shard 0] cluster - controller_backend.cc:683 - [{kafka/topic-pcymajrjrp/20}] result: Timeout occurred while processing request operation: {type: update, revision: 209, 

@@ -163,6 +164,13 @@ transport::make_response_handler(netbuf& b, const rpc::client_opts& opts) {
_probe.request_timeout();
_correlations.erase(it);
}
/*
A Contributor commented on the diff:

Don't we release the units in L239? I added a test for this in the last patch ..

// Verify that the resources are released correctly after timeout.

or is this an edge case (race) where a timeout kicks in before the dispatch fiber is scheduled?

@andrwng (Contributor) commented Oct 12, 2022:

It looks like it's possible that we begin calling send after the call to fail_outstanding_futures() since that only calls shutdown() which doesn't close the gate. Maybe our failure-mode calls to fail_outstanding_futures() should actually be calls to stop()?
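To illustrate the distinction being raised, a hedged sketch; connection and its members are assumptions, not the actual transport API:

    #include <seastar/core/gate.hh>
    #include <seastar/core/future.hh>

    struct connection {                  // hypothetical stand-in for transport
        seastar::gate _gate;

        void shutdown() { /* abort the socket; the gate stays open */ }

        seastar::future<> stop() {
            shutdown();
            return _gate.close();        // waits for in-flight work, then
                                         // rejects anything new
        }

        seastar::future<> send() {
            // After stop(), entering the gate throws gate_closed_exception,
            // so a late send cannot slip in. After shutdown() alone it can.
            return seastar::with_gate(_gate, [] {
                return seastar::make_ready_future<>();
            });
        }
    };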

A Contributor also asked:

Also, around the other call to _requests_queue.erase() we explicitly move the resource_units and let them leave scope. Do we have to do that here?
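For reference, the pattern in question looks roughly like this; a fragment under assumed names, with resource_units and _requests_queue taken from the discussion above:

    // Take the units out of the queue entry, erase the entry, and let
    // scope exit release the units.
    auto it = _requests_queue.find(correlation_id);
    if (it != _requests_queue.end()) {
        auto units = std::move(it->second.resource_units);
        _requests_queue.erase(it);
        // `units` is destroyed at the end of this scope, returning
        // capacity to the caller's semaphore
    }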

@mmaslankaprv (Member, Author) replied:

If we call send on an output stream after shutdown it will always throw, as the underlying fd is closed. The edge case is that the request times out and returns to the caller, after which the caller can continue and be destroyed. The timeout may fire before we dispatch the send, so the units would otherwise stay in the _requests_queue.
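Laid out as a timeline, a sketch of the sequence described above, not actual code:

    // caller:    acquires semaphore units, enqueues the request in
    //            _requests_queue, waits on the response future with a timeout
    // timer:     timeout fires -> the caller's future completes with a
    //            timeout error
    // caller:    returns and may be destroyed, along with its semaphore
    // transport: the dispatch fiber finally runs; the queued entry still
    //            holds the caller's units -> releasing them now touches
    //            freed memory (heap-use-after-free)
    //
    // The fix erases the queue entry and releases the units in the timeout
    // path itself, so the dispatch fiber never sees them.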

@mmaslankaprv (Member, Author) commented:

ci failure: #5575

@bharathv (Contributor) previously approved these changes Oct 13, 2022, commenting:
makes sense.

@dotnwat (Member) commented Oct 13, 2022:

/ci-repeat 10
debug
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py
tests/rptest/tests/compaction_end_to_end_test.py

@dotnwat (Member) commented Oct 14, 2022:

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes
status:     FAIL
run time:   5 minutes 17.580 seconds

<BadLogLines nodes=docker-rp-2(1) example="redpanda: /vectorized/include/seastar/core/future.hh:648: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 55, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1367, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-2(1) example="redpanda: /vectorized/include/seastar/core/future.hh:648: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.">


The commit message:

When a request is completed with a timeout we must release the
caller-passed semaphore units associated with the request. Otherwise the
units may be held in the request queue long enough to outlive the caller,
causing a use-after-free error when they are finally released.

Fixes: redpanda-data#6711

Signed-off-by: Michal Maslanka <michal@redpanda.com>

@dotnwat (Member) commented Oct 17, 2022:

/ci-repeat 5
debug
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py
tests/rptest/tests/compaction_end_to_end_test.py

@mmaslankaprv (Member, Author) commented:

the CI failure is: #6614

@piyushredpanda piyushredpanda merged commit 27afe92 into redpanda-data:dev Oct 18, 2022
@mmedenjak added the kind/bug (Something isn't working) and ci-failure labels Oct 19, 2022
@jcsp (Contributor) commented Oct 25, 2022:

/backport v22.2.x

@BenPope (Member) commented Nov 16, 2022:

/backport v22.1.x
