
rpc/transport: release caller units when timeout occurs #6738

Merged: 1 commit into redpanda-data:dev on Oct 18, 2022

Conversation

@mmaslankaprv (Member) commented Oct 12, 2022

Cover letter

When a request is completed with a timeout we must release the caller-passed semaphore units associated with the request. Otherwise the units may be held in the request queue long enough to outlive the caller, causing a use-after-free error when they are finally released.

Fixes: #6711
Fixes: #5261

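For context, a minimal sketch of the idea behind the fix; the names queued_request and on_request_timeout are hypothetical, not the actual transport types. The point is that the queue entry owns the caller's seastar::semaphore_units and must drop them in the timeout path, while the caller is still alive:

    #include <seastar/core/semaphore.hh>
    #include <optional>

    struct queued_request {                              // hypothetical type
        // caller-passed units; holding them here is what can outlive the caller
        std::optional<seastar::semaphore_units<>> units;
    };

    void on_request_timeout(queued_request& r) {         // hypothetical hook
        // ~semaphore_units returns the capacity to the caller's semaphore.
        // Releasing here, before the caller's future is completed, ensures
        // the queue never holds a reference into state the caller may
        // destroy once it observes the timeout.
        r.units.reset();
    }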

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x


Release notes

Bug Fixes

  • Fix a heap use-after-free during redpanda shutdown

@dotnwat (Member) previously approved these changes Oct 12, 2022, commenting:
yeh this looks correct to me

@dotnwat (Member) commented Oct 12, 2022:

looks like maybe there are more UAF issues?

INFO  2022-10-12 14:00:00,572 [shard 1] rpc - transport.cc:163 - Request timeout to {host: docker-rp-22, port: 33145}, correlation id: 656 (1 in flight)
=================================================================
==3826==ERROR: AddressSanitizer: heap-use-after-free on address 0x608000d6c650 at pc 0x56324e614bc2 bp 0x7f249a951050 sp 0x7f249a951048
READ of size 8 at 0x608000d6c650 thread T1 (reactor-1)
INFO  2022-10-12 14:00:00,572 [shard 0] cluster - controller_backend.cc:683 - [{kafka/topic-pcymajrjrp/20}] result: Timeout occurred while processing request operation: {type: update, revision: 209, 

@@ -163,6 +164,13 @@ transport::make_response_handler(netbuf& b, const rpc::client_opts& opts) {
_probe.request_timeout();
_correlations.erase(it);
}
/*
A Contributor commented on the diff:

Don't we release the units in L239? I added a test for this in the last patch ..

// Verify that the resources are released correctly after timeout.

or is this an edge case (race) where a timeout kicks in before the dispatch fiber is scheduled?

@andrwng (Contributor) commented Oct 12, 2022:

It looks like it's possible that we begin calling send after the call to fail_outstanding_futures() since that only calls shutdown() which doesn't close the gate. Maybe our failure-mode calls to fail_outstanding_futures() should actually be calls to stop()?
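To illustrate the distinction being raised, a hedged sketch; connection and its members are assumptions, not the actual transport API:

    #include <seastar/core/gate.hh>
    #include <seastar/core/future.hh>

    struct connection {                  // hypothetical stand-in for transport
        seastar::gate _gate;

        void shutdown() { /* abort the socket; the gate stays open */ }

        seastar::future<> stop() {
            shutdown();
            return _gate.close();        // waits for in-flight work, then
                                         // rejects anything new
        }

        seastar::future<> send() {
            // After stop(), entering the gate throws gate_closed_exception,
            // so a late send cannot slip in. After shutdown() alone it can.
            return seastar::with_gate(_gate, [] {
                return seastar::make_ready_future<>();
            });
        }
    };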

A Contributor also asked:

Also, around the other call to _requests_queue.erase() we explicitly move the resource_units and let them leave scope. Do we have to do that here?
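For reference, the pattern in question looks roughly like this; a fragment under assumed names, with resource_units and _requests_queue taken from the discussion above:

    // Take the units out of the queue entry, erase the entry, and let
    // scope exit release the units.
    auto it = _requests_queue.find(correlation_id);
    if (it != _requests_queue.end()) {
        auto units = std::move(it->second.resource_units);
        _requests_queue.erase(it);
        // `units` is destroyed at the end of this scope, returning
        // capacity to the caller's semaphore
    }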

@mmaslankaprv (Member, Author) replied:

If we call send on an output stream after shutdown it will always throw, as the underlying fd is closed. The edge case is that the request times out and returns to the caller, after which the caller can continue and be destroyed. The timeout may fire before we dispatch the send, so the units would otherwise stay in the _requests_queue.
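Laid out as a timeline, a sketch of the sequence described above, not actual code:

    // caller:    acquires semaphore units, enqueues the request in
    //            _requests_queue, waits on the response future with a timeout
    // timer:     timeout fires -> the caller's future completes with a
    //            timeout error
    // caller:    returns and may be destroyed, along with its semaphore
    // transport: the dispatch fiber finally runs; the queued entry still
    //            holds the caller's units -> releasing them now touches
    //            freed memory (heap-use-after-free)
    //
    // The fix erases the queue entry and releases the units in the timeout
    // path itself, so the dispatch fiber never sees them.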

@mmaslankaprv (Member, Author) commented:

ci failure: #5575

@bharathv (Contributor) previously approved these changes Oct 13, 2022, commenting:
makes sense.

@dotnwat (Member) commented Oct 13, 2022:

/ci-repeat 10
debug
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py
tests/rptest/tests/compaction_end_to_end_test.py

@dotnwat (Member) commented Oct 14, 2022:

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes
status:     FAIL
run time:   5 minutes 17.580 seconds

<BadLogLines nodes=docker-rp-2(1) example="redpanda: /vectorized/include/seastar/core/future.hh:648: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 55, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1367, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-2(1) example="redpanda: /vectorized/include/seastar/core/future.hh:648: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.">


The commit message:

When a request is completed with a timeout we must release the
caller-passed semaphore units associated with the request. Otherwise the
units may be held in the request queue long enough to outlive the caller,
causing a use-after-free error when they are finally released.

Fixes: redpanda-data#6711

Signed-off-by: Michal Maslanka <michal@redpanda.com>

@dotnwat (Member) commented Oct 17, 2022:

/ci-repeat 5
debug
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py
tests/rptest/tests/compaction_end_to_end_test.py

@mmaslankaprv (Member, Author) commented:

the CI failure is: #6614

@piyushredpanda piyushredpanda merged commit 27afe92 into redpanda-data:dev Oct 18, 2022
@mmedenjak added the kind/bug (Something isn't working) and ci-failure labels Oct 19, 2022
@jcsp (Contributor) commented Oct 25, 2022:

/backport v22.2.x

@BenPope (Member) commented Nov 16, 2022:

/backport v22.1.x
