storage: delay manual splits that would result in more snapshots #32594

Merged: 1 commit, Nov 30, 2018

Commits on Nov 30, 2018

  1. storage: delay manual splits that would result in more snapshots

    When a Range has followers that aren't replicating properly, splitting
    that range results in a right-hand side with followers in a similar
    state. Certain workloads (restore/import/presplit) can run large numbers
    of splits against a given range, and this can result in a large number
    of Raft snapshots that back up the Raft snapshot queue.
    
    Ideally we'd never have any ranges that require a snapshot, but over
    the last few weeks it has become clear that this is very difficult to
    achieve since the knowledge required to decide whether a snapshot
    can efficiently be prevented is distributed across multiple nodes
    that don't share the necessary information.
    
    This commit is a bit of a nuclear option to eliminate what is likely the
    last big source of large numbers of Raft snapshots in cockroachdb#31409.
    
    With this change, we should expect to see Raft snapshots regularly when
    a split/scatter phase of an import/restore is active, but never large
    volumes at once (except perhaps for an initial spike).
    
    Splits are delayed only for manual splits. In particular, the split
    queue is not affected and could in theory cause Raft snapshots. However,
    at the present juncture, adding delays in the split queue could cause
    problems as well, so we retain the previous behavior there, which isn't
    known to have caused problems.
    
    More follow-up work in the area of Raft snapshots will be necessary to
    add some more sanity to this area of the code.
    
    Release note (bug fix): resolve a cluster degradation scenario that
    could occur during IMPORT/RESTORE operations, manifested through a
    high number of pending Raft snapshots.
    tbg committed Nov 30, 2018
    Commit 2ab0f5b