From 27080ea4a0ae22c79ca442263ae47e452bcc1d24 Mon Sep 17 00:00:00 2001 From: Connor Date: Thu, 25 Oct 2018 16:39:51 +0800 Subject: [PATCH 01/12] Add batch split Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 170 +++++++++++++++++++++++++++++++++ 1 file changed, 170 insertions(+) create mode 100644 text/2018-10-25-batch-split.md diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md new file mode 100644 index 00000000..8caafa77 --- /dev/null +++ b/text/2018-10-25-batch-split.md @@ -0,0 +1,170 @@ +# Summary + +Support `BatchSplit` feature that split one Region into multiple Regions at a time if the size is large enough. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces not only one split key but multiple split keys and change inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to permit TiKV asking for `region_id` and `peer_id` in batch. + +# Motivation + +Current split only split one Region at a time. It may be very slow when sequential write is too fast, namely, split speed can not keep up with write speed. Slow split can lead to large region. In this case, if a snapshot is triggered, it will occupy a lot of IO and make everything slow. Also, large region is hard for hotspot schedule, so it makes performance even worse. + +# Detailed design + +## RPC interface + +### PD + +```protobuf +service PD { + // ... + + rpc AskSplit(AskSplitRequest) returns (AskSplitResponse) { + // Use AskBatchSplit instead. + option deprecated = true; + } + rpc ReportSplit(ReportSplitRequest) returns (ReportSplitResponse) { + // Use ReportBatchSplit instead.
+ option deprecated = true; + } + rpc AskBatchSplit(AskBatchSplitRequest) returns (AskBatchSplitResponse) {} + rpc ReportBatchSplit(ReportBatchSplitRequest) returns (ReportBatchSplitResponse) {} +} + +message AskBatchSplitRequest { + RequestHeader header = 1; + metapb.Region region = 2; + uint32 split_count = 3; +} + +message SplitID { + uint64 new_region_id = 1; + repeated uint64 new_peer_ids = 2; +} + +message AskBatchSplitResponse { + ResponseHeader header = 1; + repeated SplitID ids = 2; +} + +message ReportBatchSplitRequest { + RequestHeader header = 1; + repeated metapb.Region regions = 2; +} + +message ReportBatchSplitResponse { + ResponseHeader header = 1; +} +``` + +Add `AskBatchSplit` to replace `AskSplit` , it is called when TiKV produces some split keys for one Region and asks PD to allocate new `region_id` and `peer_id` for that Region. `split_count` in `AskBatchSplitRequest` indicates the number of Region to be generated, and `AskBatchSplitResponse` returns all new allocated ids to TiKV. + +Add `ReportBatchSplit` to replace `ReportBatchSplit`, it is called when TiKV finish splitting Region. `ReportBatchSplitRequest` takes all metas of new generated Region for PD to update its related information. + +For compatibility issue, the old interface is not deleted but set to deprecated. + +### TiKV + +```protobuf +message SplitRequest { + // ... + // Will be ignored in batch split, use `BatchSplitRequest::right_derive` instead. + bool right_derive = 4 [deprecated=true]; +} + +message BatchSplitRequest { + repeated SplitRequest requests = 1; + // If true, the last region derive the origin region_id, + // other regions use new ids. + bool right_derive = 2; +} + +message BatchSplitResponse { + repeated metapb.Region regions = 1; +} + +enum AdminCmdType { + // ... + Split = 2 [deprecated=true]; + // ... + BatchSplit = 10; +} + +message AdminRequest { + // ... + SplitRequest split = 3 [deprecated=true]; + // ... 
+ BatchSplitRequest splits = 10; +} + +message AdminResponse { + // ... + SplitResponse split = 3 [deprecated=true]; + // ... + BatchSplitResponse splits = 10; +} +``` + +Add a new admin command type `BatchSplit` with related request and response. `BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive` which invalidates the `right_derive` in each `SplitRequest`. + +When in rolling upgrade process, new TiKVs are mixed up with old TiKVs, so old command type `Split` still needs to be preserved. + +## Implementation in TiKV + +### How to produce multiple split keys + +This part mainly focus on `SplitChecker`. + +First of all, adjust trait to make it can return multiple split keys. + +```rust +pub trait SplitChecker { + // ... + + // before: fn split_key(&mut self) -> Option<Vec<u8>> + fn split_keys(&mut self) -> Vec<Vec<u8>>; + + // before: fn approximate_split_key(&self, _: &Region, _: &DB) -> Result<Option<Vec<u8>>> + fn approximate_split_keys(&self, _: &Region, _: &DB) -> Result<Vec<Vec<u8>>> { +} +``` + +Then add one config `batch_split_limit` to limit the number of produced split keys at a time. If it is unlimited, for once split check, it scans all over the Region's range, and in some extreme case it would cause performance issue. + +Now we have four split-checkers: half, key, size, table. SizeChecker and KeysChecker can be rewritten to produce multiple keys, and other checkers' logic stay unchanged. + +The general logic of SizeChecker and KeysChecker is similiar, the only difference of them is one splits Region based on size and the other splits Region based on the number of keys. So here we mainly describe the logic of SizeChecker: + +- before: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_max_size` or scans to the end of range.
If `total_size` is smaller than `region_max_size` at the end, checker wouldn't produce any split key; if not, it regards the very key at which `total_size` reachs to `region_split_size` as split key. +- after: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the end of range. During the scan process, it reocrds the key every `region_split_size` as split keys, but after finish scanning, it may discards the last split key if the size of rest Region doesn't over `region_max_size - region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behaves as before that split without batch. + +### Compatibility concern + +The general process in raftstore changes a little, it mainly replaces `Split` with `BatchSplit`. But one thing should be noted, when rolling update PD version control will refuse `AskBatchSplit` request, thus split can't be performed during this process until all TiKV bump to new version. To let TiKV know whether `AskBatchSplit` fail reason is compatibility or not, we introduce a new error type for `ResponseHeader` : + +```protobuf +enum ErrorType { + // ... + INCOMPATIBLE_VERSION = 5; +} +``` + +So once TiKV gets `AskBatchSplitResponse` with `ErrorType::INCOMPATIBLE_VERSION`, it uses original `AskSplit` instead of `AskBatchSplit`, and all following processes will degrade to original way. So original code path is not deleted. + +### Approximate split key + +What we said above can ease the problem, however scanning a large Region can also consumes a lot of time and CPU. Tests show that large region can still easily show up even with batch split implemented, although split is speeded up. + +When a Region becomes large enough, it's more practical to divide it into smaller chunks quickly. 
This can be achieved via size estimation, which can be calculated from SST properties. Although it may not be accurate enough, it's okay for a large Region. + +So if the size of Region is larger than `region_max_size * batch_split_limit * 2`, TiKV will use approximate way to produce split key. The approximate way is quite similar to the algorithm we describe above, but to estimate TiKV uses approximate size of the Region and the number of keys in the Region's range to calculate the average distance between two SST property keys, and produces a split key every `region_split_size / distance` keys. + +# Drawbacks + +- When use approximate way, Region may split into several disproportion Regions due to size estimation. + +# Alternatives + +None + + +# Unresolved questions + +A large Region is usually more emergent to be split, so we can change the split check queue from a naive FIFO queue to a priority queue so that large Region can be split early and quickly. From 49f184aae5638132fb4acfcdefec241a4fe70122 Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Thu, 25 Oct 2018 17:12:13 +0800 Subject: [PATCH 02/12] some change Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 8caafa77..7751861b 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -4,7 +4,7 @@ Support `BatchSplit` feature that split one Region into multiple Regions at a ti # Motivation -Current split only split one Region at a time. It may be very slow when sequential write is too fast, namely, split speed can not keep up with write speed. Slow split can lead to large region. In this case, if a snapshot is triggered, it will occupy a lot of IO and make everything slow. Also, large region is hard for hotspot schedule, so it makes performance even worse. +Current split only split one Region at a time. 
It may be very slow when sequential write is too fast, namely, split speed can not keep up with write speed. Slow split can lead to large region. In this case, if a snapshot is triggered, it will occupy a lot of IO and make everything slow. Also, large region is hard for scheduling hotspot, so it makes performance even worse. # Detailed design @@ -56,7 +56,7 @@ message ReportBatchSplitResponse { Add `AskBatchSplit` to replace `AskSplit` , it is called when TiKV produces some split keys for one Region and asks PD to allocate new `region_id` and `peer_id` for that Region. `split_count` in `AskBatchSplitRequest` indicates the number of Region to be generated, and `AskBatchSplitResponse` returns all new allocated ids to TiKV. -Add `ReportBatchSplit` to replace `ReportBatchSplit`, it is called when TiKV finish splitting Region. `ReportBatchSplitRequest` takes all metas of new generated Region for PD to update its related information. +Add `ReportBatchSplit` to replace `ReportBatchSplit`, it is called when TiKV finish splitting Region. `ReportBatchSplitRequest` takes all metas of new generated Region for PD to update PD's related information. For compatibility issue, the old interface is not deleted but set to deprecated. @@ -126,18 +126,18 @@ pub trait SplitChecker { } ``` -Then add one config `batch_split_limit` to limit the number of produced split keys at a time. If it is unlimited, for once split check, it scans all over the Region's range, and in some extreme case it would cause performance issue. +Then add one config `batch_split_limit` to limit the number of produced split keys in a batch. If it is unlimited, for once split check, it scans all over the Region's range, and in some extreme case this would cause performance issue. -Now we have four split-checkers: half, key, size, table. SizeChecker and KeysChecker can be rewritten to produce multiple keys, and other checkers' logic stay unchanged. +Now we have four split-checkers: half, keys, size and table. 
SizeChecker and KeysChecker can be rewritten to produce multiple keys, and other checkers' logic stay unchanged. -The general logic of SizeChecker and KeysChecker is similiar, the only difference of them is one splits Region based on size and the other splits Region based on the number of keys. So here we mainly describe the logic of SizeChecker: +The general logic of SizeChecker and KeysChecker are similiar, the only difference between them is one splits Region based on size and the other splits Region based on the number of keys. So here we mainly describe the logic of SizeChecker: - before: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_max_size` or scans to the end of range. If `total_size` is smaller than `region_max_size` at the end, checker wouldn't produce any split key; if not, it regards the very key at which `total_size` reachs to `region_split_size` as split key. -- after: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the end of range. During the scan process, it reocrds the key every `region_split_size` as split keys, but after finish scanning, it may discards the last split key if the size of rest Region doesn't over `region_max_size - region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behaves as before that split without batch. +- after: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the end of range. 
During the scan process, it reocrds the key as split key every `region_split_size`, but after finishing scanning, it may discards the last split key if the size of rest Region doesn't over `region_max_size - region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behave as before that split without batch. ### Compatibility concern -The general process in raftstore changes a little, it mainly replaces `Split` with `BatchSplit`. But one thing should be noted, when rolling update PD version control will refuse `AskBatchSplit` request, thus split can't be performed during this process until all TiKV bump to new version. To let TiKV know whether `AskBatchSplit` fail reason is compatibility or not, we introduce a new error type for `ResponseHeader` : +The general process in raftstore changes a little, it mainly replaces `Split` with `BatchSplit`. But one thing should be noted, when rolling upgrade, PD version control will refuse `AskBatchSplit` request, thus split can't be performed during this process until all TiKV bump to new version. To let TiKV know whether `AskBatchSplit` fail for compatibility or not, we introduce a new error type for `ResponseHeader` : ```protobuf enum ErrorType { @@ -146,15 +146,15 @@ enum ErrorType { } ``` -So once TiKV gets `AskBatchSplitResponse` with `ErrorType::INCOMPATIBLE_VERSION`, it uses original `AskSplit` instead of `AskBatchSplit`, and all following processes will degrade to original way. So original code path is not deleted. +So once TiKV gets `AskBatchSplitResponse` with `ErrorType::INCOMPATIBLE_VERSION`, it uses original `AskSplit` instead of `AskBatchSplit`, and all following processes will degrade to original way. So original code path is not deleted. ### Approximate split key -What we said above can ease the problem, however scanning a large Region can also consumes a lot of time and CPU. 
Tests show that large region can still easily show up even with batch split implemented, although split is speeded up. +What we said above can ease the problem, however scanning a large Region can also consume a lot of time and CPU. Test shows that large Region can still easily show up even with batch split implemented, although split is speeded up. When a Region becomes large enough, it's more practical to divide it into smaller chunks quickly. This can be achieved via size estimation, which can be calculated from SST properties. Although it may not be accurate enough, it's okay for a large Region. -So if the size of Region is larger than `region_max_size * batch_split_limit * 2`, TiKV will use approximate way to produce split key. The approximate way is quite similar to the algorithm we describe above, but to estimate TiKV uses approximate size of the Region and the number of keys in the Region's range to calculate the average distance between two SST property keys, and produces a split key every `region_split_size / distance` keys. +So if the size of Region is larger than `region_max_size * batch_split_limit * 2`, TiKV uses approximate way to produce split keys. The approximate way is quite similar to the algorithm we describe above, but to estimate TiKV uses approximate size of the Region and the number of keys in the Region's range to calculate the average distance between two SST property keys, and produces a split key every `region_split_size / distance` keys. 
# Drawbacks From ac7096df256dc66b417a4fbc271770b200977e25 Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Thu, 25 Oct 2018 18:09:17 +0800 Subject: [PATCH 03/12] address comment Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 7751861b..f749d303 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -1,6 +1,6 @@ # Summary -Support `BatchSplit` feature that split one Region into multiple Regions at a time if the size is large enough. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces not only one split key but multiple split keys and change inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to permit TiKV asking for `region_id` and `peer_id` in batch. +Support `BatchSplit` feature that split one Region into multiple Regions at a time if the size is large enough. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces multiple split keys instead of one and change inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to permit TiKV asking for `region_id` and `peer_id` in batch. # Motivation @@ -54,7 +54,7 @@ message ReportBatchSplitResponse { } ``` -Add `AskBatchSplit` to replace `AskSplit` , it is called when TiKV produces some split keys for one Region and asks PD to allocate new `region_id` and `peer_id` for that Region. `split_count` in `AskBatchSplitRequest` indicates the number of Region to be generated, and `AskBatchSplitResponse` returns all new allocated ids to TiKV. +Add `AskBatchSplit` to replace `AskSplit` , it is called when TiKV produces some split keys for one Region and asks PD to allocate new `region_id` and `peer_id` for that Region. 
`split_count` in `AskBatchSplitRequest` indicates the number of Region to be generated, and `AskBatchSplitResponse` returns all new allocated IDs to TiKV. Add `ReportBatchSplit` to replace `ReportBatchSplit`, it is called when TiKV finish splitting Region. `ReportBatchSplitRequest` takes all metas of new generated Region for PD to update PD's related information. From 8e6e69884f4549a7deb1ae9826331c4e597d440e Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Tue, 30 Oct 2018 10:55:26 +0800 Subject: [PATCH 04/12] address comment Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index f749d303..564000d9 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -1,6 +1,6 @@ # Summary -Support `BatchSplit` feature that split one Region into multiple Regions at a time if the size is large enough. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces multiple split keys instead of one and change inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to permit TiKV asking for `region_id` and `peer_id` in batch. +Support `BatchSplit` feature that split one Region into multiple Regions at a time if the size or the number of keys exceeds a threshold. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces multiple split keys instead of one and change inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to permit TiKV asking for `region_id` and `peer_id` in batch. # Motivation @@ -14,7 +14,7 @@ Current split only split one Region at a time. It may be very slow when sequenti ```protobuf service PD { - // ... + // ... rpc AskSplit(AskSplitRequest) returns (AskSplitResponse) { // Use AskBatchSplit instead. 
@@ -36,7 +36,7 @@ message AskBatchSplitRequest { message SplitID { uint64 new_region_id = 1; - repeated uint64 new_peer_ids = 2; + repeated uint64 new_peer_ids = 2; } message AskBatchSplitResponse { @@ -46,11 +46,11 @@ message AskBatchSplitResponse { message ReportBatchSplitRequest { RequestHeader header = 1; - repeated metapb.Region regions = 2; + repeated metapb.Region regions = 2; } message ReportBatchSplitResponse { - ResponseHeader header = 1; + ResponseHeader header = 1; } ``` @@ -158,7 +158,7 @@ So if the size of Region is larger than `region_max_size * batch_split_limit * 2 # Drawbacks -- When use approximate way, Region may split into several disproportion Regions due to size estimation. +- When use approximate way, Region may split into several disproportional Regions due to size estimation. # Alternatives From 300f8b39479afc27942a35b38cf8b8dfa09ed041 Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Fri, 2 Nov 2018 13:34:07 +0800 Subject: [PATCH 05/12] address comment Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 564000d9..66e5d0a6 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -1,10 +1,10 @@ # Summary -Support `BatchSplit` feature that split one Region into multiple Regions at a time if the size or the number of keys exceeds a threshold. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces multiple split keys instead of one and change inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to permit TiKV asking for `region_id` and `peer_id` in batch. +Support `BatchSplit` feature that splits one Region into multiple Regions at a time if the size or the number of keys exceeds a threshold. This includes modifications of both TiKV and PD. 
For TiKV, every round of split-check produces multiple split keys instead of one and changes inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to allow TiKV to ask for `region_id` and `peer_id` in batch. # Motivation -Current split only split one Region at a time. It may be very slow when sequential write is too fast, namely, split speed can not keep up with write speed. Slow split can lead to large region. In this case, if a snapshot is triggered, it will occupy a lot of IO and make everything slow. Also, large region is hard for scheduling hotspot, so it makes performance even worse. +Current split only splits one Region at a time. It may be very slow when sequential write is too fast, namely, split speed can not keep up with write speed. Slow split can lead to large region. In this case, if a snapshot is triggered, it will occupy a lot of IO and make everything slow. Also, it is hard to schedule hotspots for a large Region, so it makes performance even worse. # Detailed design @@ -54,9 +54,9 @@ message ReportBatchSplitResponse { } ``` -Add `AskBatchSplit` to replace `AskSplit` , it is called when TiKV produces some split keys for one Region and asks PD to allocate new `region_id` and `peer_id` for that Region. `split_count` in `AskBatchSplitRequest` indicates the number of Region to be generated, and `AskBatchSplitResponse` returns all new allocated IDs to TiKV. +Add `AskBatchSplit` to replace `AskSplit`. It is called when TiKV produces some split keys for one Region and asks PD to allocate new `region_id` and `peer_id` for that Region. `split_count` in `AskBatchSplitRequest` indicates the number of Regions to be generated, and `AskBatchSplitResponse` returns all new allocated IDs to TiKV. -Add `ReportBatchSplit` to replace `ReportBatchSplit`, it is called when TiKV finish splitting Region. `ReportBatchSplitRequest` takes all metas of new generated Region for PD to update PD's related information. 
+Add `ReportBatchSplit` to replace `ReportSplit`. It is called when TiKV finishes splitting a Region. `ReportBatchSplitRequest` takes the metas of all newly generated Regions for PD to update PD's related information. For compatibility issue, the old interface is not deleted but set to deprecated. @@ -65,7 +65,7 @@ For compatibility issue, the old interface is not deleted but set to deprecated. ```protobuf message SplitRequest { // ... - // Will be ignored in batch split, use `BatchSplitRequest::right_derive` instead. + // Will be ignored in batch split. Use `BatchSplitRequest::right_derive` instead. bool right_derive = 4 [deprecated=true]; } @@ -133,7 +133,7 @@ Now we have four split-checkers: half, keys, size and table. SizeChecker and Key The general logic of SizeChecker and KeysChecker are similiar, the only difference between them is one splits Region based on size and the other splits Region based on the number of keys. So here we mainly describe the logic of SizeChecker: - before: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_max_size` or scans to the end of range. If `total_size` is smaller than `region_max_size` at the end, checker wouldn't produce any split key; if not, it regards the very key at which `total_size` reachs to `region_split_size` as split key. -- after: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the end of range. During the scan process, it reocrds the key as split key every `region_split_size`, but after finishing scanning, it may discards the last split key if the size of rest Region doesn't over `region_max_size - region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behave as before that split without batch.
+- after: it scans key-value pairs in a Region's range sequentially to accumulate their size as `total_size` and stops once the size reaches `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the end of the range. During the scan process, it records the key as split key every `region_split_size`, but after finishing scanning, it may discard the last split key if the size of rest Region is not bigger than `region_max_size - region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behave the same way as the split without batch. ### Compatibility concern @@ -146,11 +146,11 @@ enum ErrorType { } ``` -So once TiKV gets `AskBatchSplitResponse` with `ErrorType::INCOMPATIBLE_VERSION`, it uses original `AskSplit` instead of `AskBatchSplit`, and all following processes will degrade to original way. So original code path is not deleted. +So once TiKV gets `AskBatchSplitResponse` with `ErrorType::INCOMPATIBLE_VERSION`, it uses the original `AskSplit` instead of `AskBatchSplit`, and all the following processes will degrade to the original way. So the original code path is not deleted. ### Approximate split key -What we said above can ease the problem, however scanning a large Region can also consume a lot of time and CPU. Test shows that large Region can still easily show up even with batch split implemented, although split is speeded up. +What we said above can ease the problem. However, scanning a large Region can also consume a lot of time and CPU. The test shows that large Regions can still easily show up even with batch split implemented, although split is speeded up. When a Region becomes large enough, it's more practical to divide it into smaller chunks quickly. This can be achieved via size estimation, which can be calculated from SST properties. Although it may not be accurate enough, it's okay for a large Region. 
@@ -158,7 +158,7 @@ So if the size of Region is larger than `region_max_size * batch_split_limit * 2 # Drawbacks -- When use approximate way, Region may split into several disproportional Regions due to size estimation. +- When the approximate way is used, Region may split into several disproportional Regions due to size estimation. # Alternatives @@ -167,4 +167,4 @@ None # Unresolved questions -A large Region is usually more emergent to be split, so we can change the split check queue from a naive FIFO queue to a priority queue so that large Region can be split early and quickly. +Generally, it is more urgent to split a large Region, so we can change the split check queue from a naive FIFO queue to a priority queue so that a large Region can be split early and quickly. From 918d432ea17617540511313240951bc6a5916396 Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Fri, 23 Nov 2018 16:40:11 +0800 Subject: [PATCH 06/12] address comment Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 66e5d0a6..1b8785cf 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -1,10 +1,10 @@ # Summary -Support `BatchSplit` feature that splits one Region into multiple Regions at a time if the size or the number of keys exceeds a threshold. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces multiple split keys instead of one and changes inner split related interface into batch style. For PD, add RPC `AskBatchSplit` and `ReportBatchSplit` to allow TiKV to ask for `region_id` and `peer_id` in batch. +Support a `BatchSplit` feature that splits one Region into multiple Regions at a time if the size or the number of keys exceeds a threshold. This includes modifications of both TiKV and PD. 
For TiKV, every round of split-check produces multiple split keys instead of one and changes inner split related interface into batch style. For PD, add RPCs `AskBatchSplit` and `ReportBatchSplit` to allow TiKV to ask for `region_id` and `peer_id` in batch. # Motivation -Current split only splits one Region at a time. It may be very slow when sequential write is too fast, namely, split speed can not keep up with write speed. Slow split can lead to large region. In this case, if a snapshot is triggered, it will occupy a lot of IO and make everything slow. Also, it is hard to schedule hotspots for a large Region, so it makes performance even worse. +Current split only splits one Region at a time. It may be very slow when sequential write is too fast, namely, the split speed cannot keep up with write speed. A slow split can lead to a large Region. In this case, if a snapshot is triggered, it will occupy a lot of I/O and make everything slow. Also, it is hard to schedule hotspots for a large Region, so it makes performance even worse. # Detailed design From 095d56117e14043cc617ea5f5317e7a0f97114dd Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Mon, 26 Nov 2018 14:13:25 +0800 Subject: [PATCH 07/12] address comment Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 1b8785cf..1a87fac0 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -71,8 +71,8 @@ message SplitRequest { message BatchSplitRequest { repeated SplitRequest requests = 1; - // If true, the last region derive the origin region_id, - // other regions use new ids. + // If true, the last Region obtains the original region_id, + // and other Regions use new IDs. bool right_derive = 2; } @@ -102,17 +102,17 @@ message AdminResponse { } ``` -Add a new admin command type `BatchSplit` with related request and response.
`BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive` which invalidates the `right_derive` in each `SplitRequest`. +Add a new admin command type `BatchSplit` with related request and response. `BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive` which invalidates the `right_derive` in each `SplitRequest`. -When in rolling upgrade process, new TiKVs are mixed up with old TiKVs, so old command type `Split` still needs to be preserved. +When in the rolling upgrade process, new TiKVs are mixed up with old TiKVs, so old command type `Split` still needs to be preserved. ## Implementation in TiKV ### How to produce multiple split keys -This part mainly focus on `SplitChecker`. +This part mainly focuses on `SplitChecker`. -First of all, adjust trait to make it can return multiple split keys. +First of all, adjust `trait` so that it can return multiple split keys. ```rust pub trait SplitChecker { @@ -128,16 +128,16 @@ pub trait SplitChecker { Then add one config `batch_split_limit` to limit the number of produced split keys in a batch. If it is unlimited, for once split check, it scans all over the Region's range, and in some extreme case this would cause performance issue. -Now we have four split-checkers: half, keys, size and table. SizeChecker and KeysChecker can be rewritten to produce multiple keys, and other checkers' logic stay unchanged. +Now we have four split-checkers: half, keys, size and table. SizeChecker and KeysChecker can be rewritten to produce multiple keys, and other checkers' logic stays unchanged. -The general logic of SizeChecker and KeysChecker are similiar, the only difference between them is one splits Region based on size and the other splits Region based on the number of keys. So here we mainly describe the logic of SizeChecker: +The general logic of SizeChecker and KeysChecker are similar. 
The only difference between them is one splits Region based on size and the other splits Region based on the number of keys. So here we mainly describe the logic of SizeChecker: -- before: it scans key-value pairs in a Region's range sequentially to accumlate their size as `total_size` and stops once the size reachs to `region_max_size` or scans to the end of range. If `total_size` is smaller than `region_max_size` at the end, checker wouldn't produce any split key; if not, it regards the very key at which `total_size` reachs to `region_split_size` as split key. +- before: it scans key-value pairs in a Region's range sequentially to accumulate their size as `total_size` and stops once the size reaches `region_max_size` or scans to the end of the range. If `total_size` is smaller than `region_max_size` at the end, checker wouldn't produce any split key; if not, it regards the very key at which `total_size` reaches `region_split_size` as split key. - after: it scans key-value pairs in a Region's range sequentially to accumulate their size as `total_size` and stops once the size reaches `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the end of the range. During the scan process, it records the key as split key every `region_split_size`, but after finishing scanning, it may discard the last split key if the size of rest Region is not bigger than `region_max_size - region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behave the same way as the split without batch. ### Compatibility concern -The general process in raftstore changes a little, it mainly replaces `Split` with `BatchSplit`. But one thing should be noted, when rolling upgrade, PD version control will refuse `AskBatchSplit` request, thus split can't be performed during this process until all TiKV bump to new version. 
To let TiKV know whether `AskBatchSplit` fail for compatibility or not, we introduce a new error type for `ResponseHeader` : +The general process in raftstore changes a little. It mainly replaces `Split` with `BatchSplit`. But one thing should be noted that during the rolling upgrade, PD version control will refuse the `AskBatchSplit` request, thus split can't be performed during this process until all TiKVs bump to a new version. To let TiKV know whether `AskBatchSplit` fails for compatibility or not, we introduce a new error type for `ResponseHeader`: ```protobuf enum ErrorType { From 84a7a166a80c7b21d0005a030b51699fb4f0eedb Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Mon, 26 Nov 2018 14:43:44 +0800 Subject: [PATCH 08/12] format Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 116 +++++++++++++++++++++++++-------- 1 file changed, 90 insertions(+), 26 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 1a87fac0..3ff128e9 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -1,10 +1,20 @@ # Summary -Support a `BatchSplit` feature that splits one Region into multiple Regions at a time if the size or the number of keys exceeds a threshold. This includes modifications of both TiKV and PD. For TiKV, every round of split-check produces multiple split keys instead of one and changes inner split related interface into batch style. For PD, add RPCs `AskBatchSplit` and `ReportBatchSplit` to allow TiKV to ask for `region_id` and `peer_id` in batch. +Support a `BatchSplit` feature that splits one Region into multiple Regions at +a time if the size or the number of keys exceeds a threshold. This includes +modifications of both TiKV and PD. For TiKV, every round of split-check +produces multiple split keys instead of one and changes inner split related +interface into batch style. 
For PD, add RPCs `AskBatchSplit` and +`ReportBatchSplit` to allow TiKV to ask for `region_id` and `peer_id` in batch. # Motivation -Current split only splits one Region at a time. It may be very slow when sequential write is too fast, namely, the split speed can not keep up with write speed. A slow split can lead to large region. In this case, if a snapshot is triggered, it will occupy a lot of I/O and make everything slow. Also, it is hard to schedule hotspots for a large Region, so it makes performance even worse. +Current split only splits one Region at a time. It may be very slow when +sequential write is too fast, namely, the split speed can not keep up with +write speed. A slow split can lead to large region. In this case, if a snapshot +is triggered, it will occupy a lot of I/O and make everything slow. Also, it is +hard to schedule hotspots for a large Region, so it makes performance even +worse. # Detailed design @@ -25,7 +35,8 @@ service PD { option deprecated = true; } rpc AskBatchSplit(AskBatchSplitRequest) returns (AskBatchSplitResponse) {} - rpc ReportBatchSplit(ReportBatchSplitRequest) returns (ReportBatchSplitResponse) {} + rpc ReportBatchSplit(ReportBatchSplitRequest) + returns (ReportBatchSplitResponse) {} } message AskBatchSplitRequest { @@ -54,18 +65,26 @@ message ReportBatchSplitResponse { } ``` -Add `AskBatchSplit` to replace `AskSplit`. It is called when TiKV produces some split keys for one Region and asks PD to allocate new `region_id` and `peer_id` for that Region. `split_count` in `AskBatchSplitRequest` indicates the number of Regions to be generated, and `AskBatchSplitResponse` returns all new allocated IDs to TiKV. +Add `AskBatchSplit` to replace `AskSplit`. It is called when TiKV produces some +split keys for one Region and asks PD to allocate new `region_id` and `peer_id` +for that Region. 
`split_count` in `AskBatchSplitRequest` indicates the number +of Regions to be generated, and `AskBatchSplitResponse` returns all new +allocated IDs to TiKV. -Add `ReportBatchSplit` to replace `ReportBatchSplit`. It is called when TiKV finishes splitting a Region. `ReportBatchSplitRequest` takes all metas of a new generated Region for PD to update PD's related information. +Add `ReportBatchSplit` to replace `ReportBatchSplit`. It is called when TiKV +finishes splitting a Region. `ReportBatchSplitRequest` takes all metas of a new +generated Region for PD to update PD's related information. -For compatibility issue, the old interface is not deleted but set to deprecated. +For compatibility issue, the old interface is not deleted but set to +deprecated. ### TiKV ```protobuf message SplitRequest { // ... - // Will be ignored in batch split. Use `BatchSplitRequest::right_derive` instead. + // Will be ignored in batch split. Use `BatchSplitRequest::right_derive` + // instead. bool right_derive = 4 [deprecated=true]; } @@ -102,9 +121,12 @@ message AdminResponse { } ``` -Add a new admin command type `BatchSplit` with related request and response. `BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive` which invalidates the `right_derive` in each `SplitRequest`. +Add a new admin command type `BatchSplit` with related request and response. +`BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive` +which invalidates the `right_derive` in each `SplitRequest`. -When in the rolling upgrade process, new TiKVs are mixed up with old TiKVs, so old command type `Split` still needs to be preserved. +When in the rolling upgrade process, new TiKVs are mixed up with old TiKVs, so +old command type `Split` still needs to be preserved. 
## Implementation in TiKV @@ -121,23 +143,49 @@ pub trait SplitChecker { // before: fn split_key(&mut self) -> Option> fn split_keys(&mut self) -> Vec>; - // before: fn approximate_split_key(&self, _: &Region, _: &DB) -> Result>> - fn approximate_split_keys(&self, _: &Region, _: &DB) -> Result>> { + // before: fn approximate_split_key(&self, _: &Region, _: &DB) + // -> Result>> + fn approximate_split_keys(&self, _: &Region, _: &DB) -> +Result>> { } ``` -Then add one config `batch_split_limit` to limit the number of produced split keys in a batch. If it is unlimited, for once split check, it scans all over the Region's range, and in some extreme case this would cause performance issue. - -Now we have four split-checkers: half, keys, size and table. SizeChecker and KeysChecker can be rewritten to produce multiple keys, and other checkers' logic stays unchanged. - -The general logic of SizeChecker and KeysChecker are similar. The only difference between them is one splits Region based on size and the other splits Region based on the number of keys. So here we mainly describe the logic of SizeChecker: - -- before: it scans key-value pairs in a Region's range sequentially to accumulate their size as `total_size` and stops once the size reaches `region_max_size` or scans to the end of the range. If `total_size` is smaller than `region_max_size` at the end, checker wouldn't produce any split key; if not, it regards the very key at which `total_size` reaches `region_split_size` as split key. -- after: it scans key-value pairs in a Region's range sequentially to accumulate their size as `total_size` and stops once the size reaches `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the end of the range. During the scan process, it records the key as split key every `region_split_size`, but after finishing scanning, it may discard the last split key if the size of rest Region is not bigger than `region_max_size - region_split_size`. 
With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behave the same way as the split without batch. +Then add one config `batch_split_limit` to limit the number of produced split +keys in a batch. If it is unlimited, for once split check, it scans all over +the Region's range, and in some extreme case this would cause performance issue. + +Now we have four split-checkers: half, keys, size and table. SizeChecker and +KeysChecker can be rewritten to produce multiple keys, and other checkers' +logic stays unchanged. + +The general logic of SizeChecker and KeysChecker are similar. The only +difference between them is one splits Region based on size and the other splits +Region based on the number of keys. So here we mainly describe the logic of +SizeChecker: + +- before: it scans key-value pairs in a Region's range sequentially to +accumulate their size as `total_size` and stops once the size reaches +`region_max_size` or scans to the end of the range. If `total_size` is smaller +than `region_max_size` at the end, checker wouldn't produce any split key; if +not, it regards the very key at which `total_size` reaches `region_split_size` +as split key. +- after: it scans key-value pairs in a Region's range sequentially to +accumulate their size as `total_size` and stops once the size reaches +`region_split_size * (batch_split_limit-1) + region_max_size` or scans to the +end of the range. During the scan process, it records the key as split key +every `region_split_size`, but after finishing scanning, it may discard the +last split key if the size of rest Region is not bigger than `region_max_size - +region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, +TiKV can perfectly behave the same way as the split without batch. ### Compatibility concern -The general process in raftstore changes a little. It mainly replaces `Split` with `BatchSplit`. 
But one thing should be noted that during the rolling upgrade, PD version control will refuse the `AskBatchSplit` request, thus split can't be performed during this process until all TiKVs bump to a new version. To let TiKV know whether `AskBatchSplit` fails for compatibility or not, we introduce a new error type for `ResponseHeader`: +The general process in raftstore changes a little. It mainly replaces `Split` +with `BatchSplit`. But one thing should be noted that during the rolling +upgrade, PD version control will refuse the `AskBatchSplit` request, thus split +can't be performed during this process until all TiKVs bump to a new version. +To let TiKV know whether `AskBatchSplit` fails for compatibility or not, we +introduce a new error type for `ResponseHeader`: ```protobuf enum ErrorType { @@ -146,19 +194,33 @@ enum ErrorType { } ``` -So once TiKV gets `AskBatchSplitResponse` with `ErrorType::INCOMPATIBLE_VERSION`, it uses the original `AskSplit` instead of `AskBatchSplit`, and all the following processes will degrade to the original way. So the original code path is not deleted. +So once TiKV gets `AskBatchSplitResponse` with +`ErrorType::INCOMPATIBLE_VERSION`, it uses the original `AskSplit` instead of +`AskBatchSplit`, and all the following processes will degrade to the original +way. So the original code path is not deleted. ### Approximate split key -What we said above can ease the problem. However, scanning a large Region can also consume a lot of time and CPU. The test shows that large Regions can still easily show up even with batch split implemented, although split is speeded up. +What we said above can ease the problem. However, scanning a large Region can +also consume a lot of time and CPU. The test shows that large Regions can still +easily show up even with batch split implemented, although split is speeded up. -When a Region becomes large enough, it's more practical to divide it into smaller chunks quickly. 
This can be achieved via size estimation, which can be calculated from SST properties. Although it may not be accurate enough, it's okay for a large Region. +When a Region becomes large enough, it's more practical to divide it into +smaller chunks quickly. This can be achieved via size estimation, which can be +calculated from SST properties. Although it may not be accurate enough, it's +okay for a large Region. -So if the size of Region is larger than `region_max_size * batch_split_limit * 2`, TiKV uses approximate way to produce split keys. The approximate way is quite similar to the algorithm we describe above, but to estimate TiKV uses approximate size of the Region and the number of keys in the Region's range to calculate the average distance between two SST property keys, and produces a split key every `region_split_size / distance` keys. +So if the size of Region is larger than `region_max_size * batch_split_limit * +2`, TiKV uses approximate way to produce split keys. The approximate way is +quite similar to the algorithm we describe above, but to estimate TiKV uses +approximate size of the Region and the number of keys in the Region's range to +calculate the average distance between two SST property keys, and produces a +split key every `region_split_size / distance` keys. # Drawbacks -- When the approximate way is used, Region may split into several disproportional Regions due to size estimation. +- When the approximate way is used, Region may split into several +disproportional Regions due to size estimation. # Alternatives @@ -167,4 +229,6 @@ None # Unresolved questions -Generally, it is more urgent to split a large Region, so we can change the split check queue from a naive FIFO queue to a priority queue so that a large Region can be split early and quickly. +Generally, it is more urgent to split a large Region, so we can change the +split check queue from a naive FIFO queue to a priority queue so that a large +Region can be split early and quickly. 
From d9fa78533a96b0a058421e2e45a6e3469d759927 Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Wed, 28 Nov 2018 11:18:04 +0800 Subject: [PATCH 09/12] adjust format Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 3ff128e9..b8cd7a8e 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -35,8 +35,7 @@ service PD { option deprecated = true; } rpc AskBatchSplit(AskBatchSplitRequest) returns (AskBatchSplitResponse) {} - rpc ReportBatchSplit(ReportBatchSplitRequest) - returns (ReportBatchSplitResponse) {} + rpc ReportBatchSplit(ReportBatchSplitRequest) returns (ReportBatchSplitResponse) {} } message AskBatchSplitRequest { @@ -83,8 +82,7 @@ deprecated. ```protobuf message SplitRequest { // ... - // Will be ignored in batch split. Use `BatchSplitRequest::right_derive` - // instead. + // Will be ignored in batch split. Use `BatchSplitRequest::right_derive` instead. 
bool right_derive = 4 [deprecated=true]; } @@ -143,10 +141,8 @@ pub trait SplitChecker { // before: fn split_key(&mut self) -> Option> fn split_keys(&mut self) -> Vec>; - // before: fn approximate_split_key(&self, _: &Region, _: &DB) - // -> Result>> - fn approximate_split_keys(&self, _: &Region, _: &DB) -> -Result>> { + // before: fn approximate_split_key(&self, _: &Region, _: &DB) -> Result>> + fn approximate_split_keys(&self, _: &Region, _: &DB) -> Result>>; } ``` From 4529dda48a856f1cf38bdf1031898f228cc0172f Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Wed, 28 Nov 2018 11:25:46 +0800 Subject: [PATCH 10/12] fix grammar Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index b8cd7a8e..e4a4a9cf 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -9,9 +9,9 @@ interface into batch style. For PD, add RPCs `AskBatchSplit` and # Motivation -Current split only splits one Region at a time. It may be very slow when -sequential write is too fast, namely, the split speed can not keep up with -write speed. A slow split can lead to large region. In this case, if a snapshot +Current split only splits one Region at a time. It may be very slow when a +sequential write is too fast, namely, the split speed cannot keep up with +write speed. A slow split can lead to large Regions. In this case, if a snapshot is triggered, it will occupy a lot of I/O and make everything slow. Also, it is hard to schedule hotspots for a large Region, so it makes performance even worse. @@ -147,10 +147,10 @@ pub trait SplitChecker { ``` Then add one config `batch_split_limit` to limit the number of produced split -keys in a batch. If it is unlimited, for once split check, it scans all over -the Region's range, and in some extreme case this would cause performance issue. +keys in a batch. 
If it is unlimited, for a once split check, it scans all over +the Region's range, and in some extreme case, this would cause performance issue. -Now we have four split-checkers: half, keys, size and table. SizeChecker and +Now we have four split-checkers: half, keys, size, and table. SizeChecker and KeysChecker can be rewritten to produce multiple keys, and other checkers' logic stays unchanged. @@ -168,9 +168,9 @@ as split key. - after: it scans key-value pairs in a Region's range sequentially to accumulate their size as `total_size` and stops once the size reaches `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the -end of the range. During the scan process, it records the key as split key +end of the range. During the scan process, it records the key as a split key every `region_split_size`, but after finishing scanning, it may discard the -last split key if the size of rest Region is not bigger than `region_max_size - +last split key if the size of the rest is not bigger than `region_max_size - region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, TiKV can perfectly behave the same way as the split without batch. @@ -207,7 +207,7 @@ calculated from SST properties. Although it may not be accurate enough, it's okay for a large Region. So if the size of Region is larger than `region_max_size * batch_split_limit * -2`, TiKV uses approximate way to produce split keys. The approximate way is +2`, TiKV uses an approximate way to produce split keys. 
The approximate way is quite similar to the algorithm we describe above, but to estimate TiKV uses approximate size of the Region and the number of keys in the Region's range to calculate the average distance between two SST property keys, and produces a From c193676320c1ceec22be31cd0d250daf4db621fc Mon Sep 17 00:00:00 2001 From: Connor1996 Date: Wed, 28 Nov 2018 14:34:10 +0800 Subject: [PATCH 11/12] simplify Signed-off-by: Connor1996 --- text/2018-10-25-batch-split.md | 36 +++++++++------------------------- 1 file changed, 9 insertions(+), 27 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index e4a4a9cf..74006e49 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -130,33 +130,15 @@ old command type `Split` still needs to be preserved. ### How to produce multiple split keys -This part mainly focuses on `SplitChecker`. - -First of all, adjust `trait` so that it can return multiple split keys. - -```rust -pub trait SplitChecker { - // ... - - // before: fn split_key(&mut self) -> Option> - fn split_keys(&mut self) -> Vec>; - - // before: fn approximate_split_key(&self, _: &Region, _: &DB) -> Result>> - fn approximate_split_keys(&self, _: &Region, _: &DB) -> Result>>; -} -``` - -Then add one config `batch_split_limit` to limit the number of produced split -keys in a batch. If it is unlimited, for a once split check, it scans all over -the Region's range, and in some extreme case, this would cause performance issue. - -Now we have four split-checkers: half, keys, size, and table. SizeChecker and -KeysChecker can be rewritten to produce multiple keys, and other checkers' -logic stays unchanged. - -The general logic of SizeChecker and KeysChecker are similar. The only -difference between them is one splits Region based on size and the other splits -Region based on the number of keys. 
So here we mainly describe the logic of +First introducing one config `batch_split_limit` to limit the number of produced +split keys in a batch. If it is unlimited, for a once split check, it scans all +over the Region's range, and in some extreme case, this would cause performance +issue. + +SizeChecker and KeysChecker can be rewritten to produce multiple keys, and the +general logic of SizeChecker and KeysChecker are similar. The only difference +between them is one splits Region based on size and the other splits Region +based on the number of keys. So here we mainly describe the logic of SizeChecker: - before: it scans key-value pairs in a Region's range sequentially to From dec91b9657d5477488501d100a117e25fce50839 Mon Sep 17 00:00:00 2001 From: Hoverbear Date: Thu, 20 Dec 2018 14:24:35 -0800 Subject: [PATCH 12/12] Fix lints Signed-off-by: Hoverbear --- text/2018-10-25-batch-split.md | 167 +++++++++++++++++---------------- 1 file changed, 84 insertions(+), 83 deletions(-) diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md index 74006e49..5c589d13 100644 --- a/text/2018-10-25-batch-split.md +++ b/text/2018-10-25-batch-split.md @@ -1,35 +1,37 @@ -# Summary +# Batch Split -Support a `BatchSplit` feature that splits one Region into multiple Regions at -a time if the size or the number of keys exceeds a threshold. This includes -modifications of both TiKV and PD. For TiKV, every round of split-check -produces multiple split keys instead of one and changes inner split related -interface into batch style. For PD, add RPCs `AskBatchSplit` and +## Summary + +Support a `BatchSplit` feature that splits one Region into multiple Regions at +a time if the size or the number of keys exceeds a threshold. This includes +modifications of both TiKV and PD. For TiKV, every round of split-check +produces multiple split keys instead of one and changes inner split related +interface into batch style. 
For PD, add RPCs `AskBatchSplit` and `ReportBatchSplit` to allow TiKV to ask for `region_id` and `peer_id` in batch. -# Motivation +## Motivation -Current split only splits one Region at a time. It may be very slow when a -sequential write is too fast, namely, the split speed cannot keep up with -write speed. A slow split can lead to large Regions. In this case, if a snapshot -is triggered, it will occupy a lot of I/O and make everything slow. Also, it is -hard to schedule hotspots for a large Region, so it makes performance even +Current split only splits one Region at a time. It may be very slow when a +sequential write is too fast, namely, the split speed cannot keep up with +write speed. A slow split can lead to large Regions. In this case, if a snapshot +is triggered, it will occupy a lot of I/O and make everything slow. Also, it is +hard to schedule hotspots for a large Region, so it makes performance even worse. -# Detailed design +## Detailed design -## RPC interface +### RPC interface -### PD +#### PD ```protobuf service PD { // ... - + rpc AskSplit(AskSplitRequest) returns (AskSplitResponse) { // Use AskBatchSplit instead. option deprecated = true; - } + } rpc ReportSplit(ReportSplitRequest) returns (ReportSplitResponse) { // Use ResportBatchSplit instead. option deprecated = true; @@ -64,20 +66,20 @@ message ReportBatchSplitResponse { } ``` -Add `AskBatchSplit` to replace `AskSplit`. It is called when TiKV produces some -split keys for one Region and asks PD to allocate new `region_id` and `peer_id` -for that Region. `split_count` in `AskBatchSplitRequest` indicates the number -of Regions to be generated, and `AskBatchSplitResponse` returns all new +Add `AskBatchSplit` to replace `AskSplit`. It is called when TiKV produces some +split keys for one Region and asks PD to allocate new `region_id` and `peer_id` +for that Region. 
`split_count` in `AskBatchSplitRequest` indicates the number +of Regions to be generated, and `AskBatchSplitResponse` returns all newly allocated IDs to TiKV. -Add `ReportBatchSplit` to replace `ReportBatchSplit`. It is called when TiKV -finishes splitting a Region. `ReportBatchSplitRequest` takes all metas of a new +Add `ReportBatchSplit` to replace `ReportSplit`. It is called when TiKV +finishes splitting a Region. `ReportBatchSplitRequest` takes the metas of every newly generated Region for PD to update PD's related information. -For compatibility issue, the old interface is not deleted but set to -deprecated. +For compatibility, the old interface is not deleted but marked as +deprecated. -### TiKV +#### TiKV ```protobuf message SplitRequest { @@ -88,7 +90,7 @@ message SplitRequest { message BatchSplitRequest { repeated SplitRequest requests = 1; - // If true, the last Region obtains the origin region_id, + // If true, the last Region obtains the original region_id, // and other Regions use new Ids. bool right_derive = 2; } @@ -119,50 +121,50 @@ message AdminResponse { } ``` -Add a new admin command type `BatchSplit` with related request and response. -`BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive` +Add a new admin command type `BatchSplit` with related request and response. +`BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive` which invalidates the `right_derive` in each `SplitRequest`. -When in the rolling upgrade process, new TiKVs are mixed up with old TiKVs, so +During a rolling upgrade, new TiKVs are mixed with old TiKVs, so the old command type `Split` still needs to be preserved. -## Implementation in TiKV +### Implementation in TiKV -### How to produce multiple split keys +#### How to produce multiple split keys -First introducing one config `batch_split_limit` to limit the number of produced -split keys in a batch.
If it is unlimited, for a once split check, it scans all -over the Region's range, and in some extreme case, this would cause performance +First, introduce one config `batch_split_limit` to limit the number of produced +split keys in a batch. If it is unlimited, a single split check scans all +over the Region's range, and in some extreme cases, this would become a performance issue. -SizeChecker and KeysChecker can be rewritten to produce multiple keys, and the -general logic of SizeChecker and KeysChecker are similar. The only difference -between them is one splits Region based on size and the other splits Region -based on the number of keys. So here we mainly describe the logic of +SizeChecker and KeysChecker can be rewritten to produce multiple keys, and the +general logic of SizeChecker and KeysChecker is similar. The only difference +between them is that one splits a Region based on size and the other splits a Region +based on the number of keys. So here we mainly describe the logic of SizeChecker: -- before: it scans key-value pairs in a Region's range sequentially to -accumulate their size as `total_size` and stops once the size reaches -`region_max_size` or scans to the end of the range. If `total_size` is smaller -than `region_max_size` at the end, checker wouldn't produce any split key; if -not, it regards the very key at which `total_size` reaches `region_split_size` -as split key. -- after: it scans key-value pairs in a Region's range sequentially to -accumulate their size as `total_size` and stops once the size reaches -`region_split_size * (batch_split_limit-1) + region_max_size` or scans to the -end of the range. During the scan process, it records the key as a split key -every `region_split_size`, but after finishing scanning, it may discard the -last split key if the size of the rest is not bigger than `region_max_size - -region_split_size`.
With this algorithm, if `batch_split_limit` is set to 1, -TiKV can perfectly behave the same way as the split without batch. - -### Compatibility concern - -The general process in raftstore changes a little. It mainly replaces `Split` -with `BatchSplit`. But one thing should be noted that during the rolling -upgrade, PD version control will refuse the `AskBatchSplit` request, thus split -can't be performed during this process until all TiKVs bump to a new version. -To let TiKV know whether `AskBatchSplit` fails for compatibility or not, we +- **Before:** it scans key-value pairs in a Region's range sequentially to + accumulate their size as `total_size` and stops once the size reaches + `region_max_size` or scans to the end of the range. If `total_size` is + smaller than `region_max_size` at the end, checker wouldn't produce any split + key; if not, it regards the very key at which `total_size` reaches + `region_split_size` as split key. +- **After:** it scans key-value pairs in a Region's range sequentially to + accumulate their size as `total_size` and stops once the size reaches + `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the + end of the range. During the scan process, it records the key as a split key + every `region_split_size`, but after finishing scanning, it may discard the + last split key if the size of the rest is not bigger than `region_max_size - + region_split_size`. With this algorithm, if `batch_split_limit` is set to 1, + TiKV can perfectly behave the same way as the split without batch. + +#### Compatibility concern + +The general process in raftstore changes a little. It mainly replaces `Split` +with `BatchSplit`. But one thing should be noted that during the rolling +upgrade, PD version control will refuse the `AskBatchSplit` request, thus split +can't be performed during this process until all TiKVs bump to a new version. 
+To let TiKV know whether `AskBatchSplit` fails for compatibility or not, we introduce a new error type for `ResponseHeader`: ```protobuf @@ -172,41 +174,40 @@ enum ErrorType { } ``` -So once TiKV gets `AskBatchSplitResponse` with -`ErrorType::INCOMPATIBLE_VERSION`, it uses the original `AskSplit` instead of -`AskBatchSplit`, and all the following processes will degrade to the original +So once TiKV gets `AskBatchSplitResponse` with +`ErrorType::INCOMPATIBLE_VERSION`, it uses the original `AskSplit` instead of +`AskBatchSplit`, and all the following processes will degrade to the original way. So the original code path is not deleted. -### Approximate split key +#### Approximate split key -What we said above can ease the problem. However, scanning a large Region can -also consume a lot of time and CPU. The test shows that large Regions can still +What we said above can ease the problem. However, scanning a large Region can +also consume a lot of time and CPU. The test shows that large Regions can still easily show up even with batch split implemented, although split is speeded up. -When a Region becomes large enough, it's more practical to divide it into -smaller chunks quickly. This can be achieved via size estimation, which can be -calculated from SST properties. Although it may not be accurate enough, it's +When a Region becomes large enough, it's more practical to divide it into +smaller chunks quickly. This can be achieved via size estimation, which can be +calculated from SST properties. Although it may not be accurate enough, it's okay for a large Region. -So if the size of Region is larger than `region_max_size * batch_split_limit * -2`, TiKV uses an approximate way to produce split keys. 
The approximate way is -quite similar to the algorithm we describe above, but to estimate TiKV uses -approximate size of the Region and the number of keys in the Region's range to -calculate the average distance between two SST property keys, and produces a +So if the size of Region is larger than `region_max_size * batch_split_limit * +2`, TiKV uses an approximate way to produce split keys. The approximate way is +quite similar to the algorithm we describe above, but to estimate TiKV uses +approximate size of the Region and the number of keys in the Region's range to +calculate the average distance between two SST property keys, and produces a split key every `region_split_size / distance` keys. -# Drawbacks +## Drawbacks -- When the approximate way is used, Region may split into several -disproportional Regions due to size estimation. +- When the approximate way is used, Region may split into several + disproportional Regions due to size estimation. -# Alternatives +## Alternatives None +## Unresolved questions -# Unresolved questions - -Generally, it is more urgent to split a large Region, so we can change the -split check queue from a naive FIFO queue to a priority queue so that a large +Generally, it is more urgent to split a large Region, so we can change the +split check queue from a naive FIFO queue to a priority queue so that a large Region can be split early and quickly.
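
Reading aid for the batch `SizeChecker` scan (the "After" bullet in the patch): the steps can be sketched in Rust roughly as follows. This is a minimal sketch under stated assumptions, not TiKV's actual implementation. `Entry` and `batch_split_keys` are hypothetical names introduced only for illustration, and entry sizes are plain integers. The text says the last split key may be discarded when the rest is "not bigger than" `region_max_size - region_split_size`; the sketch assumes a strict comparison at that boundary, so that a scan which stops exactly at the scan limit keeps all of its keys.

```rust
/// Hypothetical stand-in for one scanned key-value pair; not a TiKV type.
pub struct Entry {
    pub key: Vec<u8>,
    pub size: u64,
}

/// Sketch of the batch scan: returns the split keys for one split check.
/// `entries` is assumed to be the Region's key-value pairs in key order.
pub fn batch_split_keys(
    entries: &[Entry],
    split_size: u64,        // region_split_size
    max_size: u64,          // region_max_size
    batch_split_limit: u64, // maximum number of split keys per batch
) -> Vec<Vec<u8>> {
    assert!(batch_split_limit >= 1 && max_size >= split_size);

    // Stop at region_split_size * (batch_split_limit - 1) + region_max_size.
    let scan_limit = split_size * (batch_split_limit - 1) + max_size;
    let mut split_keys: Vec<Vec<u8>> = Vec::new();
    let mut total_size = 0u64; // size of everything scanned so far
    let mut rest = 0u64; // size accumulated since the last recorded split key

    for entry in entries {
        total_size += entry.size;
        rest += entry.size;

        // Record a split key every `split_size`, up to the batch limit.
        if rest >= split_size && (split_keys.len() as u64) < batch_split_limit {
            split_keys.push(entry.key.clone());
            rest = 0;
        }
        if total_size >= scan_limit {
            break;
        }
    }

    // Assumption: strict comparison at the boundary (see the note above).
    // A too-small remainder means the last split is not worth making.
    if rest < max_size - split_size {
        split_keys.pop();
    }
    split_keys
}
```

With `batch_split_limit` set to 1 the scan limit collapses to `region_max_size`, so under these assumptions the sketch yields no key for a Region smaller than `region_max_size` and otherwise the single key at which the accumulated size reaches `region_split_size`, which is the non-batch behaviour the patch describes.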