
[Segment Replication] Primary promotion on shard failing during node removal in RoutingNodes#failShard #4131

Open
dreamer-89 opened this issue Aug 4, 2022 · 8 comments

Comments

@dreamer-89
Member

dreamer-89 commented Aug 4, 2022

Coming from #3988, where RoutingNodes#failShard was identified as another workflow in which the master eagerly promotes a replica as part of node removal. The failShard method also handles cluster state updates (e.g. assigned shards).
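For context, the promotion step inside a failShard-style workflow can be sketched as a toy model. The class and record names below are illustrative stand-ins, not the actual OpenSearch classes; the selection rule (among active replicas, prefer the copy on the highest node version) is analogous to RoutingNodes#activeReplicaWithHighestVersion:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy model: when a primary's node leaves, the master immediately picks an
// active replica as the new primary, straight from cluster state metadata.
public class FailShardSketch {
    record ShardCopy(String nodeId, int nodeVersion, boolean active) {}

    // Analogous to RoutingNodes#activeReplicaWithHighestVersion: among
    // active replicas, prefer the copy on the newest node version. Note
    // that no ReplicationCheckpoint is consulted here, which is the gap
    // this issue is about.
    static Optional<ShardCopy> activeReplicaWithHighestVersion(List<ShardCopy> replicas) {
        return replicas.stream()
                .filter(ShardCopy::active)
                .max(Comparator.comparingInt(ShardCopy::nodeVersion));
    }

    public static void main(String[] args) {
        List<ShardCopy> replicas = List.of(
                new ShardCopy("node-a", 2, true),
                new ShardCopy("node-b", 3, true),
                new ShardCopy("node-c", 3, false)); // inactive, never chosen
        ShardCopy promoted = activeReplicaWithHighestVersion(replicas).orElseThrow();
        if (!promoted.nodeId().equals("node-b")) throw new AssertionError(promoted);
        System.out.println("promoted: " + promoted.nodeId());
    }
}
```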

@dreamer-89 dreamer-89 changed the title Primary promotion on shard failing (RoutingNodes#failShard) Primary promotion on shard failing during node removal (RoutingNodes#failShard) Aug 5, 2022
@dreamer-89 dreamer-89 changed the title Primary promotion on shard failing during node removal (RoutingNodes#failShard) Primary promotion on shard failing during node removal in RoutingNodes#failShard Aug 5, 2022
@dreamer-89
Member Author

dreamer-89 commented Aug 5, 2022

Broke this down into the following sub-tasks:

  • Identify the cause of the two different failover mechanisms, i.e. PrimaryShardAllocator & RoutingNodes#failShard
    • Code scan to identify the cause of the separate failover handling
    • Test that returning empty from activeReplicaWithHighestVersion doesn't cause shard failure (red cluster) and that shard promotion works (with minor delay)
    • Identify whether RoutingNodes#failShard can be removed altogether.

@dreamer-89
Member Author

dreamer-89 commented Aug 9, 2022

The RoutingNodes#failShard method cannot be removed because:

  1. It is used by the master to immediately promote one of the active shard copies (from cluster state) to avoid shard failure.
  2. The method is followed by cluster balancing actions, which need an active primary.

Removing the failShard method would require a lot of core-level changes in shard allocation, which is not intended as part of this issue.

@dreamer-89
Member Author

dreamer-89 commented Aug 9, 2022

There are three options for making RoutingNodes#failShard aware of ReplicationCheckpoint:

  1. Include ReplicationCheckpoint in ClusterState. During failover, the cluster-manager can simply choose the shard copy with the highest ReplicationCheckpoint. This is not ideal because ReplicationCheckpoint is updated on index refreshes, so it doesn't belong in ClusterState. It would also be a much bigger change, and an unnecessary one, since checkpoints are only used during failover to identify the furthest-ahead replica.
  2. Pull each active shard's ReplicationCheckpoint synchronously. This can be problematic for big clusters (with high replica & shard count), where some delay in shard promotion may be observed.
  3. [Proposed initially on this issue] Use the AsyncShardFetch workflow to fetch shard data asynchronously. This is problematic because the cluster-manager would not be able to promote an active shard copy immediately, yet there are assertions on primary nodes post-reroute as part of cluster shard balancing actions.

I am planning to move forward with option 2 above.
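Option 2 could be sketched roughly as below. This is a simplified model under stated assumptions, not the real OpenSearch transport API: the checkpoint fields are reduced to (primaryTerm, segmentInfosVersion), each "fetch" is a placeholder Callable, and a per-replica timeout keeps a slow or dead node from stalling promotion (matching the best-effort framing):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of option 2: synchronously ask each active replica for its current
// checkpoint, pick the furthest-ahead copy, and skip replicas that fail or
// time out (best effort, as discussed on this issue).
public class SyncCheckpointFetchSketch {
    record Checkpoint(long primaryTerm, long segmentInfosVersion) implements Comparable<Checkpoint> {
        public int compareTo(Checkpoint o) {
            int c = Long.compare(primaryTerm, o.primaryTerm);
            return c != 0 ? c : Long.compare(segmentInfosVersion, o.segmentInfosVersion);
        }
    }

    static String pickFurthestAhead(Map<String, Callable<Checkpoint>> fetchers, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(fetchers.size());
        try {
            Map<String, Future<Checkpoint>> futures = new HashMap<>();
            fetchers.forEach((node, call) -> futures.put(node, pool.submit(call)));
            String best = null;
            Checkpoint bestCp = null;
            for (var e : futures.entrySet()) {
                try {
                    Checkpoint cp = e.getValue().get(timeoutMs, TimeUnit.MILLISECONDS);
                    if (bestCp == null || cp.compareTo(bestCp) > 0) {
                        best = e.getKey();
                        bestCp = cp;
                    }
                } catch (ExecutionException | TimeoutException ignored) {
                    // Best effort: a replica that fails or times out is skipped,
                    // so the chosen copy may not truly be the furthest ahead.
                }
            }
            return best;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Callable<Checkpoint>> fetchers = new HashMap<>();
        fetchers.put("replica-1", () -> new Checkpoint(3, 17));
        fetchers.put("replica-2", () -> new Checkpoint(3, 21)); // furthest ahead
        fetchers.put("slow-replica", () -> {
            Thread.sleep(5_000); // simulates a stalled node; times out below
            return new Checkpoint(3, 99);
        });
        String chosen = pickFurthestAhead(fetchers, 200);
        if (!"replica-2".equals(chosen)) throw new AssertionError(chosen);
        System.out.println("promote: " + chosen);
    }
}
```

The timeout bound is what mch2 alludes to below: the promotion delay can be capped and measured rather than left unbounded.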

CC @Bukhtawar @mch2 @andrross

@mch2
Member

mch2 commented Aug 10, 2022

I think 2 is the best option given we want this as a best effort. I also wouldn't be worried about it delaying shard promotion right now, we can set timeouts to reduce that impact and measure the total time.

@dreamer-89
Member Author

During a standup discussion with @mch2 @kartg, we decided to take up this issue as part of the next minor release and focus on the existing open bugs in #3969, which have higher priority. This work is basically an optimization that reduces the number of segment file copies among replicas.

CC @CEHENKLE @anasalkouz @mch2

@dreamer-89 dreamer-89 added the v2.4.0 'Issues and PRs related to version v2.4.0' label Sep 9, 2022
@dreamer-89
Member Author

Discussed this during team standup, where we identified that we need to gather data on segment replication performance when the furthest-ahead replica is not chosen. This is also to evaluate the trade-off we would get from implementing this core change.

CC @mch2 @anasalkouz @Bukhtawar

@mch2 mch2 changed the title Primary promotion on shard failing during node removal in RoutingNodes#failShard [Segment Replication] Primary promotion on shard failing during node removal in RoutingNodes#failShard Sep 19, 2022
@anasalkouz anasalkouz added v2.5.0 'Issues and PRs related to version v2.5.0' distributed framework and removed v2.4.0 'Issues and PRs related to version v2.4.0' labels Oct 31, 2022
@saratvemulapalli
Member

@dreamer-89 @mch2 this is tagged for 2.5. Can we make it ?

@dreamer-89 dreamer-89 removed the v2.5.0 'Issues and PRs related to version v2.5.0' label Jan 6, 2023
@dreamer-89
Member Author

dreamer-89 commented Jan 6, 2023

@dreamer-89 @mch2 this is tagged for 2.5. Can we make it ?

Thank you @saratvemulapalli for bringing this up. This work will not make it into the 2.5.0 release, so I am removing the tag.

From the previous discussion, this is an optimization task that tries to select the replica with the highest checkpoint (to ensure minimum file copy ops from the newly selected primary & prevent segment conflicts). We also don't have data on how bad this I/O can get if we do not select the replica with the highest replication checkpoint. The segment conflicts are avoided today by bumping the SegGen on the selected primary.

Even with approach 2 above (sync call to replicas to fetch the highest replication checkpoint), the solution will be best effort and can't guarantee selection of the furthest-ahead replica, which leaves room for segment conflicts. Based on this, prioritizing the existing GA tasks over this.
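The SegGen bump mentioned above can be illustrated with a toy calculation. The method name and numbers are hypothetical, not Lucene's actual scheme; the point is only that the promoted primary starts writing above any generation another copy could have seen, so fresh segment file names never collide with differing content on a further-ahead replica:

```java
// Illustration of conflict avoidance via a generation bump: segment file
// names derive from a monotonically increasing counter, so a promoted
// (possibly behind) replica must not reuse names a further-ahead copy
// already holds. Hypothetical sketch, not Lucene's real naming logic.
public class SegGenBumpSketch {
    // Start writing strictly above the highest generation known anywhere.
    static long nextWriteGeneration(long promotedReplicaGen, long highestKnownGen) {
        return Math.max(promotedReplicaGen, highestKnownGen) + 1;
    }

    public static void main(String[] args) {
        // Promoted replica was at gen 5, but another replica had reached gen 8:
        // writing at gen 6-8 could clash, so the new primary jumps to gen 9.
        long gen = nextWriteGeneration(5, 8);
        if (gen != 9) throw new AssertionError(gen);
        System.out.println("new primary writes at generation " + gen);
    }
}
```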

CC @mch2 @anasalkouz

@anasalkouz anasalkouz added the enhancement Enhancement or improvement to existing feature or request label Jul 13, 2023
@Bukhtawar Bukhtawar added the Indexing:Replication Issues and PRs related to core replication framework eg segrep label Jul 27, 2023