
[v22.1.x] Improve idempotency latency by introducing rm_stm pipelining #5450

Merged: 12 commits into redpanda-data:v22.1.x on Jul 13, 2022

Conversation

@rystsov (Contributor) commented Jul 13, 2022

Backport of pull requests #5157 and #5204

fixes #5445

@BenPope (Member) left a comment

First commit references an invalid sha and is not the same as from #5204

(cherry picked from commit 9a693fc)
Return invalid_request instead of processing the wrong if branch

(cherry picked from commit 8ea6b14)
Kafka protocol is asynchronous: a client may send the next write
request without waiting for the previous request to finish. Since
Redpanda's Raft implementation is also asynchronous, it's possible
to process Kafka requests with minimal overhead and/or delays.

The challenge is to process causally related requests without
stalling the pipeline. Imagine that we have two requests, A and B.
To process them as fast as possible, we should start replicating
the second request without waiting for the replication of the
first request to finish.

But when the requests are RSM commands and are causally related,
e.g. request A is "set x=2" and B is "set y=3 if x=2", then to
guarantee that B's precondition holds, Redpanda can't start
replicating request B without knowing that

  - The request A is successfully executed
  - No other command sneaks in between A and B

In the case of idempotency, the condition we need to preserve is
the monotonicity of the seq numbers, without gaps.
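
As a minimal sketch of that invariant (the names producer_state
and check_seq are hypothetical, not rm_stm's actual types):

  #include <cstdint>
  #include <optional>

  // Sketch of the "monotonic seqs without gaps" rule; illustrative only.
  struct producer_state {
      std::optional<int32_t> last_seq; // last seq accepted for this producer
  };

  enum class seq_check { accept, duplicate, gap };

  seq_check check_seq(producer_state& p, int32_t seq) {
      if (p.last_seq && seq <= *p.last_seq) {
          return seq_check::duplicate; // already seen: reuse the stored result
      }
      if (p.last_seq && seq != *p.last_seq + 1) {
          return seq_check::gap;       // a hole in the sequence: reject
      }
      p.last_seq = seq;                // next in line: safe to process
      return seq_check::accept;
  }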

In order to preserve causality and to avoid pipeline stalls we use
optimistic replication. A leader knows about its ongoing operations
so it may optimistically predict the outcome of operation A and
start executing operation B assuming its prediction is true.

But what happens if Redpanda is wrong with its prediction, how does
it ensure safety? Let's break down all the failure scenarios.

A leader uses a mutex and consensus::replicate_in_stages to order
all the incoming requests. Before the first stage resolves, the
mutex controls the order; after the first stage resolves,
Redpanda's Raft implementation guarantees the order:

  auto u = co_await mutex.lock();
  auto stages = raft.replicate_in_stages(request);
  co_await stages.enqueued;
  u.return_all();
  // after this point the order between the requests is certain
  // the order enforced by mutex is preserved in the log

The mutex / replicate_in_stages combination may enforce partial
order, but it can't prevent out-of-nowhere writes. Imagine a node
loses leadership, the new leader inserts a command C, and then
transfers leadership back to the original node.

To fight this Redpanda uses conditional replication:

  auto term = persisted_stm::sync();
  auto u = co_await mutex.lock();
  auto stages = raft.replicate_in_stages(term, request);
  co_await stages.enqueued;
  u.return_all();

It uses the sync method to make sure that the RSM's state reflects
all the commands replicated in the previous term (during this
phase it may learn about the "A" command) and then uses the Raft
term to issue the replication command (conditional replicate). By
design, the replication call can't succeed if the term is wrong,
so instead of leading to an A-C-B state, the replication of B is
doomed to fail.

Redpanda's Raft implementation guarantees that if A was enqueued
before B within the same term, then the replication of B can't
succeed without the replication of A, so we don't need to worry
that A may be dropped.

But we still have uncertainty. Should Redpanda process B assuming
that A was successful or should it assume it failed? In order to
resolve the uncertainty the leader steps down.

Since sync() guarantees that the RSM's state reflects all the
commands replicated in the previous term, the next leader will
resolve the uncertainty.
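
A hedged sketch of that recovery path (the interface below is
illustrative, not Redpanda's actual Raft API):

  // Illustrative stand-in for the consensus layer.
  struct raft_api {
      virtual bool replicate(long term, int request) = 0; // conditional replicate
      virtual void step_down() = 0;                        // relinquish leadership
      virtual ~raft_api() = default;
  };

  void pipelined_write(raft_api& raft, long term, int request) {
      if (!raft.replicate(term, request)) {
          // The failure leaves it unknown whether the earlier in-flight
          // requests landed; rather than guess, give up leadership. The
          // next leader's sync() replays the log and settles what happened.
          raft.step_down();
      }
  }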

The combination of these methods preserves the causal
relationships between the requests (idempotency) without
introducing pipeline stalls.

-------------------------------------------------------------------

The idempotency logic is straightforward (see the sketch after this list):

 - Take a lock identified by producer id (concurrent producers
   don't affect each other)
    - If the seq number and its offset are already known, then
      return the offset
    - If the seq number is known and in flight, then "park" the
      current request (once the original request resolves, it
      pings all the parked requests with the written offset or
      an error)
    - If the seq number is seen for the first time, then:
        - Reject it if there is a gap between it and the last
          known seq
        - Start processing if there is no gap
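
A compact sketch of that dispatch using standard C++ primitives
(the producer type, its fields, and the replicate callback are
hypothetical stand-ins for rm_stm's internals):

  #include <cstdint>
  #include <functional>
  #include <future>
  #include <mutex>
  #include <stdexcept>
  #include <unordered_map>

  // Hypothetical per-producer state; not rm_stm's actual layout.
  struct producer {
      std::mutex lock;                              // per-producer-id lock
      int32_t last_seq = -1;                        // highest resolved seq
      std::unordered_map<int32_t, int64_t> offsets; // seq -> offset cache
      std::unordered_map<int32_t, std::shared_future<int64_t>> in_flight;
  };

  int64_t produce(producer& p, int32_t seq,
                  const std::function<int64_t()>& replicate) {
      std::unique_lock guard(p.lock);
      if (auto it = p.offsets.find(seq); it != p.offsets.end()) {
          return it->second;            // duplicate: return the known offset
      }
      if (auto it = p.in_flight.find(seq); it != p.in_flight.end()) {
          auto parked = it->second;     // "park" behind the original request
          guard.unlock();
          return parked.get();          // resolves to the offset or an error
      }
      if (seq != p.last_seq + 1) {
          throw std::runtime_error("gap in seq numbers"); // reject
      }
      std::promise<int64_t> done;
      p.in_flight.emplace(seq, done.get_future().share());
      guard.unlock();
      try {
          int64_t offset = replicate(); // start replication, no stall
          guard.lock();
          p.offsets[seq] = offset;
          p.last_seq = seq;
          p.in_flight.erase(seq);
          guard.unlock();
          done.set_value(offset);       // wake parked duplicates
          return offset;
      } catch (...) {
          guard.lock();
          p.in_flight.erase(seq);
          guard.unlock();
          done.set_exception(std::current_exception()); // fail parked requests
          throw;
      }
  }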

fixes redpanda-data#5054

(cherry picked from commit 88423d7)
Leadership transfer and rm_stm must be coordinated to avoid the
possibility of a live lock (see the sketch after this list):

  - leadership transfer sets _transferring_leadership
  - it causes a write to fail
  - the failing write initiates a leader step down
  - the step down causes the transfer to fail
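
A minimal sketch of that coordination (the flag name mirrors the
commit message; the surrounding type and methods are illustrative):

  #include <atomic>

  struct rm_stm_sketch {
      std::atomic<bool> _transferring_leadership{false};

      void on_replication_failure() {
          // During a transfer writes are expected to fail; stepping down
          // here would abort the transfer and re-trigger the cycle, so
          // the step down is suppressed until the transfer completes.
          if (!_transferring_leadership.load()) {
              step_down();
          }
      }

      void step_down() { /* ask raft to relinquish leadership */ }
  };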

(cherry picked from commit 68848c1)
(cherry picked from commit 1064521)
Making transfer_leadership_to more stable by checking that all
nodes have updated metadata. Previously we could choose a stale
server, get a 503, and retry the request.

(cherry picked from commit f15e3d8)
Temporarily adding a faster version of wait_until until the
ducktape repo is updated.

(cherry picked from commit b4a5fb2)
@rystsov (Contributor, Author) commented Jul 13, 2022

> First commit references an invalid sha and is not the same as from #5204

@BenPope don't know how it happened, I've re-cherry-picked it

@bharathv (Contributor) left a comment

Skimmed through the changes, they look OK to me; shout out if there are specific conflicts (if any) that need an additional pair of eyes.

@bharathv bharathv requested a review from BenPope July 13, 2022 05:01
@rystsov rystsov merged commit 606b7a1 into redpanda-data:v22.1.x Jul 13, 2022