Improve idempotency latency by introducing rm_stm pipelining #5157

Merged 11 commits on Jun 24, 2022

Commits on Jun 23, 2022

  1. Commit 93b8bd4
  2. consensus: add step_down

    rystsov committed Jun 23, 2022 (9a693fc)
  3. Commit 0791a5f
  4. Commit 9b5491b
  5. rm_stm: make request validation stronger

    Return invalid_request instead of processing the wrong if branch
    rystsov committed Jun 23, 2022 (8ea6b14)
  6. rm_stm: implement replicate_seq pipelining

    The Kafka protocol is asynchronous: a client may send the next write
    request without waiting for the previous request to finish. Since
    Redpanda's Raft implementation is also asynchronous, it's possible
    to process Kafka requests with minimal overhead and delay.

    The challenge is to process causally related requests without
    stalling the pipeline. Imagine that we have two requests, A and B.
    To process them as fast as possible, we should start replicating
    the latter request without waiting until the replication of the
    first request is over.
    
    But if the requests are RSM commands and are causally related,
    e.g. request A is "set x=2" and B is "set y=3 if x=2", then to
    guarantee that B's precondition holds, Redpanda can't start
    replicating request B without knowing that

      - request A was successfully executed
      - no other command sneaks in between A and B
    
    In the case of idempotency, the condition we need to preserve is
    the monotonicity of the seq numbers, without gaps.

    In order to preserve causality and to avoid pipeline stalls, we use
    optimistic replication. A leader knows about its ongoing operations,
    so it may optimistically predict the outcome of operation A and
    start executing operation B assuming its prediction is true.

    But what happens if Redpanda is wrong with its prediction? How does
    it ensure safety? Let's break down all the failure scenarios.

    A leader uses a mutex and consensus::replicate_in_stages to order
    all the incoming requests. Before the first stage resolves, the
    mutex controls the order; after the first stage is resolved,
    Redpanda's Raft implementation guarantees the order:
    
      auto u = co_await mutex.lock();
      auto stages = raft.replicate_in_stages(request);
      co_await stages.enqueued;
      u.return_all();
      // after this point the order between the requests is certain
      // the order enforced by mutex is preserved in the log
    
    The mutex / replicate_in_stages combination may enforce partial
    order but can't prevent out-of-nowhere writes. Imagine that a node
    loses the leadership, then the new leader inserts a command C and
    transfers the leadership back to the original node.

    To fight this, Redpanda uses conditional replication:
    
      auto term = persisted_stm::sync();
      auto u = co_await mutex.lock();
      auto stages = raft.replicate_in_stages(term, request);
      co_await stages.enqueued;
      u.return_all();
    
    It uses the sync method to make sure that the RSM's state reflects
    all the commands replicated in the previous term (during this phase
    it may learn about the "A" command) and then it uses Raft's term to
    issue the replication command (conditional replicate). By design,
    the replication call can't succeed if the term is wrong, so instead
    of leading to the A-C-B ordering, the replication of B is doomed to
    fail.
    
    Redpanda's Raft implementation guarantees that if A was enqueued
    before B within the same term, then the replication of B can't
    succeed without the replication of A, so we don't need to worry
    that A might be dropped.
    
    But we still have uncertainty: should Redpanda process B assuming
    that A was successful, or should it assume it failed? In order to
    resolve the uncertainty, the leader steps down.
    
    Since sync() guarantees that the RSM's state reflects all the
    commands replicated in the previous term, the next leader will
    resolve the uncertainty.
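
    A minimal sketch of that resolution step, in the same pseudocode
    style as the snippets above; the names used here (replicate_finished,
    raft.step_down, errc::not_leader) are illustrative assumptions, not
    the exact rm_stm API:

      // illustrative pseudocode only; names are assumptions
      auto r = co_await stages.replicate_finished;
      if (!r) {
          // the fate of this request (and of anything pipelined behind
          // it) is unknown; step down so the next leader's sync()
          // replays the log and resolves the outcome
          co_await raft.step_down();
          co_return errc::not_leader;
      }
      co_return r.value();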
    
    The combination of these methods preserves the causal
    relationships between the requests (idempotency) without
    introducing pipeline stalls.
    
    -------------------------------------------------------------------
    
    The idempotency logic is straightforward (a sketch follows the
    list):

     - Take a lock identified by producer id (concurrent producers
       don't affect each other)
        - If the seq number and its offset are already known, then
          return the offset
        - If the seq number is known and in flight, then "park" the
          current request (once the original request resolves, it pings
          all the parked requests with the written offsets or an error)
        - If the seq number is seen for the first time, then:
            - Reject it if there is a gap between it and the last
              known seq
            - Start processing if there is no gap
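
    A minimal sketch of that decision, in the same pseudocode style as
    the snippets above; every name here (producer_lock, known_offset,
    park_until_resolved, last_seq, do_replicate, errc::sequence_gap) is
    a hypothetical placeholder rather than the actual rm_stm code:

      // illustrative pseudocode only; all names are hypothetical
      auto u = co_await producer_lock(pid).lock();
      if (auto offset = known_offset(pid, seq)) {
          co_return *offset;                       // duplicate: reuse the offset
      }
      if (is_in_flight(pid, seq)) {
          co_return co_await park_until_resolved(pid, seq); // wait for original
      }
      if (seq != last_seq(pid) + 1) {
          co_return errc::sequence_gap;            // gap: reject the request
      }
      co_return co_await do_replicate(pid, seq, std::move(request));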
    
    fixes redpanda-data#5054
    rystsov committed Jun 23, 2022 (88423d7)
  7. rm_stm: add leadership transfer coordination

    Leadership transfer and rm_stm should be coordinated to avoid the
    possibility of a livelock (see the sketch after this list):

      - leadership transfer sets _transferring_leadership
      - that causes a write to fail
      - the failing write initiates a leader step down
      - the step down causes the transfer to fail
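
    A minimal sketch of the coordination, in the same pseudocode style
    as above; _transferring_leadership comes from the commit message,
    while the other names are hypothetical:

      // illustrative pseudocode only
      if (!replication_succeeded) {
          if (!_transferring_leadership) {
              // only step down when no transfer is in progress;
              // otherwise the step down would abort the transfer
              // and livelock the two mechanisms against each other
              co_await raft.step_down();
          }
          co_return errc::not_leader;
      }
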
    rystsov committed Jun 23, 2022 (68848c1)
  8. Commit 1064521
  9. admin: make transfer_leadership_to more stable

    Make transfer_leadership_to more stable by checking that all
    nodes have updated metadata. Previously we could choose a stale
    server, get a 503, and retry the request.
    rystsov committed Jun 23, 2022 (f15e3d8)
  10. Commit 9ccbfce
  11. ducky: make wait_until faster

    Temporarily add a faster version of wait_until until the
    ducktape repo is updated.
    rystsov committed Jun 23, 2022 (b4a5fb2)