
[Segment Replication] Primary promotion on shard failing during node removal in RoutingNodes#failShard #4131

Open
dreamer-89 opened this issue Aug 4, 2022 · 8 comments

Comments

@dreamer-89
Member

dreamer-89 commented Aug 4, 2022

Coming from #3988, where RoutingNodes#failShard was identified as another workflow in which the master eagerly promotes a replica as part of node removal. The failShard method also handles cluster state updates (e.g. assigned shards).
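For context, the promotion step inside a failShard-style workflow can be sketched as a toy model. The class and record names below are illustrative stand-ins, not the actual OpenSearch classes; the selection rule (among active replicas, prefer the copy on the highest node version) is analogous to RoutingNodes#activeReplicaWithHighestVersion:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy model: when a primary's node leaves, the master immediately picks an
// active replica as the new primary, straight from cluster state metadata.
public class FailShardSketch {
    record ShardCopy(String nodeId, int nodeVersion, boolean active) {}

    // Analogous to RoutingNodes#activeReplicaWithHighestVersion: among
    // active replicas, prefer the copy on the newest node version. Note
    // that no ReplicationCheckpoint is consulted here, which is the gap
    // this issue is about.
    static Optional<ShardCopy> activeReplicaWithHighestVersion(List<ShardCopy> replicas) {
        return replicas.stream()
                .filter(ShardCopy::active)
                .max(Comparator.comparingInt(ShardCopy::nodeVersion));
    }

    public static void main(String[] args) {
        List<ShardCopy> replicas = List.of(
                new ShardCopy("node-a", 2, true),
                new ShardCopy("node-b", 3, true),
                new ShardCopy("node-c", 3, false)); // inactive, never chosen
        ShardCopy promoted = activeReplicaWithHighestVersion(replicas).orElseThrow();
        if (!promoted.nodeId().equals("node-b")) throw new AssertionError(promoted);
        System.out.println("promoted: " + promoted.nodeId());
    }
}
```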

@dreamer-89 dreamer-89 changed the title Primary promotion on shard failing (RoutingNodes#failShard) Primary promotion on shard failing during node removal (RoutingNodes#failShard) Aug 5, 2022
@dreamer-89 dreamer-89 changed the title Primary promotion on shard failing during node removal (RoutingNodes#failShard) Primary promotion on shard failing during node removal in RoutingNodes#failShard Aug 5, 2022
@dreamer-89
Member Author

dreamer-89 commented Aug 5, 2022

Broke this down into the following sub-tasks:

  • Identify the cause of the two different failover mechanisms, i.e. PrimaryShardAllocator & RoutingNodes#failShard
    • Code scan to identify the cause of the separate failover handling
    • Test that returning empty from activeReplicaWithHighestVersion doesn't cause shard failure (red cluster) and that shard promotion works (with minor delay)
    • Identify whether RoutingNodes#failShard can be removed altogether.

@dreamer-89
Member Author

dreamer-89 commented Aug 9, 2022

The RoutingNodes#failShard method cannot be removed because:

  1. It is used by the master to immediately promote one of the active shard copies (from cluster state) to avoid shard failure.
  2. The method is followed by cluster balancing actions, which need an active primary.

Removing the failShard method would require a lot of core-level changes in shard allocation, which is not intended as part of this issue.

@dreamer-89
Member Author

dreamer-89 commented Aug 9, 2022

There are three options for making RoutingNodes#failShard aware of ReplicationCheckpoint:

  1. Include ReplicationCheckpoint in ClusterState. During failover, the cluster-manager can simply choose the shard copy with the highest ReplicationCheckpoint. This is not ideal because ReplicationCheckpoint is updated on index refreshes, so it doesn't belong in ClusterState. It would also be a much bigger change, and an unnecessary one, since checkpoints are only used during failover to identify the furthest-ahead replica.
  2. Pull each active shard's ReplicationCheckpoint synchronously. This can be problematic for big clusters (with high replica & shard count), where some delay in shard promotion may be observed.
  3. [Proposed initially on this issue] Use the AsyncShardFetch workflow to fetch shard data asynchronously. This is problematic because the cluster-manager would not be able to promote an active shard copy immediately, yet there are assertions on primary nodes post-reroute as part of cluster shard balancing actions.

I am planning to move forward with option 2 above.
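Option 2 could be sketched roughly as below. This is a simplified model under stated assumptions, not the real OpenSearch transport API: the checkpoint fields are reduced to (primaryTerm, segmentInfosVersion), each "fetch" is a placeholder Callable, and a per-replica timeout keeps a slow or dead node from stalling promotion (matching the best-effort framing):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of option 2: synchronously ask each active replica for its current
// checkpoint, pick the furthest-ahead copy, and skip replicas that fail or
// time out (best effort, as discussed on this issue).
public class SyncCheckpointFetchSketch {
    record Checkpoint(long primaryTerm, long segmentInfosVersion) implements Comparable<Checkpoint> {
        public int compareTo(Checkpoint o) {
            int c = Long.compare(primaryTerm, o.primaryTerm);
            return c != 0 ? c : Long.compare(segmentInfosVersion, o.segmentInfosVersion);
        }
    }

    static String pickFurthestAhead(Map<String, Callable<Checkpoint>> fetchers, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(fetchers.size());
        try {
            Map<String, Future<Checkpoint>> futures = new HashMap<>();
            fetchers.forEach((node, call) -> futures.put(node, pool.submit(call)));
            String best = null;
            Checkpoint bestCp = null;
            for (var e : futures.entrySet()) {
                try {
                    Checkpoint cp = e.getValue().get(timeoutMs, TimeUnit.MILLISECONDS);
                    if (bestCp == null || cp.compareTo(bestCp) > 0) {
                        best = e.getKey();
                        bestCp = cp;
                    }
                } catch (ExecutionException | TimeoutException ignored) {
                    // Best effort: a replica that fails or times out is skipped,
                    // so the chosen copy may not truly be the furthest ahead.
                }
            }
            return best;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Callable<Checkpoint>> fetchers = new HashMap<>();
        fetchers.put("replica-1", () -> new Checkpoint(3, 17));
        fetchers.put("replica-2", () -> new Checkpoint(3, 21)); // furthest ahead
        fetchers.put("slow-replica", () -> {
            Thread.sleep(5_000); // simulates a stalled node; times out below
            return new Checkpoint(3, 99);
        });
        String chosen = pickFurthestAhead(fetchers, 200);
        if (!"replica-2".equals(chosen)) throw new AssertionError(chosen);
        System.out.println("promote: " + chosen);
    }
}
```

The timeout bound is what mch2 alludes to below: the promotion delay can be capped and measured rather than left unbounded.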

CC @Bukhtawar @mch2 @andrross

@mch2
Member

mch2 commented Aug 10, 2022

I think 2 is the best option given we want this as a best effort. I also wouldn't be worried about it delaying shard promotion right now, we can set timeouts to reduce that impact and measure the total time.

@dreamer-89
Member Author

During a standup discussion with @mch2 @kartg, we decided to take up this issue as part of the next minor release and focus on the existing open bugs in #3969, which have higher priority. This work is basically an optimization that reduces the number of segment file copies among replicas.

CC @CEHENKLE @anasalkouz @mch2

@dreamer-89 dreamer-89 added the v2.4.0 'Issues and PRs related to version v2.4.0' label Sep 9, 2022
@dreamer-89
Member Author

Discussed this during team standup, where we identified that we need to gather data on segment replication performance when the furthest-ahead replica is not chosen. This is also to evaluate the trade-off we would get from implementing this core change.

CC @mch2 @anasalkouz @Bukhtawar

@mch2 mch2 changed the title Primary promotion on shard failing during node removal in RoutingNodes#failShard [Segment Replication] Primary promotion on shard failing during node removal in RoutingNodes#failShard Sep 19, 2022
@anasalkouz anasalkouz added v2.5.0 'Issues and PRs related to version v2.5.0' distributed framework and removed v2.4.0 'Issues and PRs related to version v2.4.0' labels Oct 31, 2022
@saratvemulapalli
Member

@dreamer-89 @mch2 this is tagged for 2.5. Can we make it ?

@dreamer-89 dreamer-89 removed the v2.5.0 'Issues and PRs related to version v2.5.0' label Jan 6, 2023
@dreamer-89
Member Author

dreamer-89 commented Jan 6, 2023

@dreamer-89 @mch2 this is tagged for 2.5. Can we make it ?

Thank you @saratvemulapalli for bringing this up. This work will not make it into the 2.5.0 release, so I am removing the tag.

From the previous discussion, this is an optimization task that tries to select the replica with the highest checkpoint (to ensure minimum file copy ops from the newly selected primary & prevent segment conflicts). We also don't have data on how bad this I/O can get if we do not select the replica with the highest replication checkpoint. The segment conflicts are avoided today by bumping the SegGen on the selected primary.

Even with approach 2 above (sync call to replicas to fetch the highest replication checkpoint), the solution will be best effort and can't guarantee selection of the furthest-ahead replica, which leaves room for segment conflicts. Based on this, prioritizing the existing GA tasks over this.
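The SegGen bump mentioned above can be illustrated with a toy calculation. The method name and numbers are hypothetical, not Lucene's actual scheme; the point is only that the promoted primary starts writing above any generation another copy could have seen, so fresh segment file names never collide with differing content on a further-ahead replica:

```java
// Illustration of conflict avoidance via a generation bump: segment file
// names derive from a monotonically increasing counter, so a promoted
// (possibly behind) replica must not reuse names a further-ahead copy
// already holds. Hypothetical sketch, not Lucene's real naming logic.
public class SegGenBumpSketch {
    // Start writing strictly above the highest generation known anywhere.
    static long nextWriteGeneration(long promotedReplicaGen, long highestKnownGen) {
        return Math.max(promotedReplicaGen, highestKnownGen) + 1;
    }

    public static void main(String[] args) {
        // Promoted replica was at gen 5, but another replica had reached gen 8:
        // writing at gen 6-8 could clash, so the new primary jumps to gen 9.
        long gen = nextWriteGeneration(5, 8);
        if (gen != 9) throw new AssertionError(gen);
        System.out.println("new primary writes at generation " + gen);
    }
}
```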

CC @mch2 @anasalkouz

@anasalkouz anasalkouz added the enhancement Enhancement or improvement to existing feature or request label Jul 13, 2023
@Bukhtawar Bukhtawar added the Indexing:Replication Issues and PRs related to core replication framework eg segrep label Jul 27, 2023