
net: add conn_quota for connection count limits #3901

Merged — 11 commits into redpanda-data:dev on Mar 14, 2022

Conversation

@jcsp (Contributor) commented on Feb 25, 2022

Cover letter

RFC: #3744

The main complexity here is the cross-shard coordination of limits that apply at a per-node level. Each client IP has a "home" shard where their home_allowance lives, and other shards can "borrow" some tokens in their remote_allowance. To avoid starvation when many tokens have been borrowed, there is a "reclaim" mechanism whereby the home shard reclaims all borrowed tokens, and sets a flag on the remote shards to dispatch any freed tokens back to the home shard.

In normal operation where the limit is comfortably high enough for the workload, all shards that are handling connections will end up with a few tokens in their borrowed pool, and service incoming requests without having to do cross-shard communication. Cross-shard communication only kicks in when there is some pressure for the tokens.
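
To make the shape of that scheme concrete, here is a minimal single-threaded sketch of the token accounting (illustrative only; the real conn_quota is a sharded Seastar service, and the struct names here merely echo the terms used above):

```cpp
#include <cstdint>

// Illustrative model of the per-IP token scheme described above; in the
// real service, "cross-shard" operations are submit_to() calls to the
// client IP's home shard.
struct home_allowance {
    uint32_t available = 0;      // tokens the home shard can still hand out
    bool reclaim_active = false; // remote shards must return freed tokens
};

struct remote_allowance {
    uint32_t borrowed = 0; // tokens this shard may spend without cross-shard IO
};

// Fast path on a non-home shard: spend a previously borrowed token.
// Returns false if a cross-shard borrow from the home shard is needed.
bool try_take_local(remote_allowance& r) {
    if (r.borrowed == 0) {
        return false;
    }
    --r.borrowed;
    return true;
}

// When a connection closes on a non-home shard: if the home shard has set
// the reclaim flag, dispatch the freed token home instead of re-pooling it.
void release_local(remote_allowance& r, home_allowance& home) {
    if (home.reclaim_active) {
        ++home.available; // in the real code, a message to the home shard
    } else {
        ++r.borrowed;
    }
}
```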

This PR does not add the "overrides" variant of the config for setting limits on individual addresses; that is future work.

These settings are similar to Kafka's max.connections.per.ip and max.connections.

Connection limits are only applied to the Kafka protocol server, not the internal RPC server.

Fixes #3698

Release notes

Features

  • It is now possible to impose limits on the number of client connections. Cluster configuration properties kafka_connections_max and kafka_connections_max_per_ip are added to control this behavior. By default both properties are switched off (i.e. the connection count is unlimited). These limits apply on a per-node basis. Note that the number of connections is not exactly equal to the number of clients: clients typically open 2-3 connections each.
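
A minimal sketch of the semantics those two properties imply, assuming only the "unset means unlimited" behavior from the release note (the struct and admit() helper are hypothetical, not Redpanda's configuration code):

```cpp
#include <cstdint>
#include <optional>

// Hypothetical model: both limits are optional, and an unset property
// means "no limit". Both are evaluated per node, not cluster-wide.
struct connection_limits {
    std::optional<uint32_t> kafka_connections_max;        // total per node
    std::optional<uint32_t> kafka_connections_max_per_ip; // per client IP

    // Admit a new connection only if every configured limit still has room.
    bool admit(uint32_t node_count, uint32_t ip_count) const {
        if (kafka_connections_max && node_count >= *kafka_connections_max) {
            return false;
        }
        if (
          kafka_connections_max_per_ip
          && ip_count >= *kafka_connections_max_per_ip) {
            return false;
        }
        return true;
    }
};
```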

@jcsp force-pushed the connection-count-limits branch 5 times, most recently from 489b0f3 to 32ed576 on March 2, 2022
@jcsp marked this pull request as ready for review on March 3, 2022
@jcsp requested a review from VadimPlh on March 3, 2022

@dotnwat (Member) left a comment:


Overall this is pretty great. I think there might be some memory-safety issues in the conn_quota service, but that's about it.

Review threads (since resolved) on src/v/net/server.cc, src/v/net/conn_quota.h, and src/v/net/conn_quota.cc.
@jcsp requested a review from dotnwat on March 10, 2022

@jcsp (Contributor, Author) commented on Mar 11, 2022

The CI failure was known issue #3924.


@dotnwat (Member) left a comment:


The changes for semaphore lifetime look good. One question about a scenario where safety looks like it depends on capture-list destruction order. The "f test" commit subject probably needs changing.
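
For context on that destruction-order concern, a small standalone example (not Redpanda code): captures become unnamed members of the closure type, and the standard leaves their declaration order, and hence their destruction order, unspecified, so correctness must not depend on one capture outliving another:

```cpp
#include <cstdio>

// Prints a message on destruction so the teardown order is visible.
struct tracer {
    const char* name;
    ~tracer() { std::printf("destroying %s\n", name); }
};

int main() {
    // `a` and `b` are stored as members of the closure object. Which one
    // is destroyed first when `f` goes out of scope is unspecified, so
    // code like "keep X alive until after Y is dropped" cannot rely on
    // capture order alone.
    auto f = [a = tracer{"a"}, b = tracer{"b"}] {};
    (void)f;
}
```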

Review threads (since resolved) on src/v/net/conn_quota.cc and src/v/net/tests/conn_quota_test.cc.
@jcsp (Contributor, Author) commented on Mar 14, 2022

This had a rather unlucky arm64 run with three (!) unrelated test failures.

A review thread on this hunk:

```cpp
    })
    .finally([u = std::move(units)] {})
    // Keep allowance alive until after units are dropped
    .finally([allowance, u = std::move(units)] {});
```

@dotnwat (Member) commented:
Did you intend to remove the units capture in the second finally block? It should be a no-op, but clangd should also warn here about use-after-move.

@jcsp (Contributor, Author) replied:

Yes, this was a mistake. Fixed.
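
To illustrate the bug pattern the reviewer flagged (a standalone example, not the PR's code): moving the same variable into two successive capture lists leaves the second lambda holding a moved-from object.

```cpp
#include <cassert>
#include <memory>

int main() {
    auto units = std::make_unique<int>(42);
    auto first = [u = std::move(units)] { assert(u && *u == 42); };
    // `units` was already moved from, so this capture holds a null
    // pointer; this is the use-after-move clangd would warn about.
    auto second = [u = std::move(units)] { assert(u == nullptr); };
    first();
    second();
}
```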

Commit messages:

  • This helper was previously only used in a test that didn't set a non-empty default value, but it shouldn't have been ignoring the value passed to the constructor.
  • These settings will enable limiting the number of concurrent connections per broker, or per client IP per broker.
  • This is just a pass-through to the equivalent semaphore method. It's useful when calling from a context where you know you won't sleep. (See the sketch after this list.)
  • This is the logic for tracking node-wide allowances for how many connections may be opened, in total or per client IP.
  • This is the first unit test definition in src/v/net, so it comes with a new CMakeLists etc.
  • This counts how many connections we dropped for hitting a connection count limit. I'm about to use it in connection_limits_test.
  • This is a ducktape test exercising the new configurable connection count limits.
  • This fixes the potential crash from clear() getting called in the background while reclaim_to or cancel_reclaim_to was holding the mutex in home_allowance. remote_allowance doesn't need this special treatment.
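
A sketch of what the non-sleeping semaphore pass-through mentioned in that list plausibly looks like, assuming it forwards to seastar's semaphore::try_wait() (the wrapper name here is hypothetical, not the PR's actual method):

```cpp
#include <seastar/core/semaphore.hh>

// Hypothetical wrapper: acquire `units` without ever suspending.
// seastar::semaphore::try_wait() takes the units immediately when they
// are available and returns false otherwise, so this is safe to call
// from contexts that must not sleep.
inline bool try_take(seastar::semaphore& sem, size_t units = 1) {
    return sem.try_wait(units);
}
```
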
@jcsp merged commit 46e7515 into redpanda-data:dev on Mar 14, 2022
@jcsp deleted the connection-count-limits branch on March 14, 2022

Merging this pull request closed: kafka: support dynamic max connections / max connections per ip (#3698)