Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Override local disk state if we are able to restore from remote #10748

Merged

Conversation

linuxpi
Copy link
Collaborator

@linuxpi linuxpi commented Oct 19, 2023

Description

  • Immediately after remote state restore, we write restored state to disk with UNKNOWN_UUID cluster uuid.
  • This leads to issues if process restarts before committing newly generated state with valid uuid to disk
  • Now, If restart happens before performing a new write to disk with valid cluster uuid, the auto restore fails with conflicting states in remote and local disk.
  • Please find more details in the linked issue

Related Issues

Resolves #10776

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@linuxpi linuxpi force-pushed the fix-local-remote-state-conflict branch 3 times, most recently from 8d41733 to e10817a Compare October 19, 2023 15:46
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Oct 19, 2023

Compatibility status:

Checks if related components are compatible with change 00cce63

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git]

Signed-off-by: bansvaru <bansvaru@amazon.com>
@linuxpi linuxpi force-pushed the fix-local-remote-state-conflict branch from e10817a to 1bbd7df Compare October 19, 2023 16:41
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@linuxpi linuxpi force-pushed the fix-local-remote-state-conflict branch from ac1ce0a to 98b15da Compare October 20, 2023 07:06
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: bansvaru <bansvaru@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: bansvaru <bansvaru@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Collaborator

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @linuxpi for the changes, please add an integ test to confirm if the state before and after recovery matches. ALso please populate the PR template to add more context and issue link for the same

@github-actions github-actions bot added the bug Something isn't working label Oct 20, 2023
@linuxpi
Copy link
Collaborator Author

linuxpi commented Oct 20, 2023

Thanks @linuxpi for the changes, please add an integ test to confirm if the state before and after recovery matches. ALso please populate the PR template to add more context and issue link for the same

Updated the description with details. Integ tests we already have in place to ensure correctness and recovery is successful. For the particular case we faced this issue we need to introduce a new test. I'm working on writing a test for it.

@shwetathareja
Copy link
Member

Lets add an IT which covers following sequence of steps

  1. Create single node cluster
  2. Create an index with documents
  3. Stop this node
  4. when the new node comes up, change the logic so that it doesn't seed initial_master_nodes
  5. restart the new node and seed the initial_master_nodes
  6. cluster should come up with index metadata restore and marked red.

@linuxpi
Copy link
Collaborator Author

linuxpi commented Oct 21, 2023

Lets add an IT which covers following sequence of steps

1. Create single node cluster

2. Create an index with documents

3. Stop this node

4. when the new node comes up, change the logic so that it doesn't seed initial_master_nodes

5. restart the new node and seed the initial_master_nodes

6. cluster should come up with index metadata restore and marked red.

i am working on adding this test by controlling seeding of cluster manager nodes, but currently after all operations, cluster fails to form, i am working in fixing it but doesnt look trivial.
Let me know if you think we can create a followup issue for this particular test. We have basic integ already in place.

Added the integ test

Signed-off-by: bansvaru <bansvaru@amazon.com>
… test

Signed-off-by: bansvaru <bansvaru@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testRestoreSnapshotToIndexWithSameNameDifferentUUID

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@linuxpi
Copy link
Collaborator Author

linuxpi commented Oct 22, 2023

Gradle Check (Jenkins) Run Completed with:

* **RESULT:** FAILURE ❌

* **URL:** https://build.ci.opensearch.org/job/gradle-check/28723/

* **CommitID:** [00cce63](https://github.com/opensearch-project/OpenSearch/commit/00cce63565f354532742b5b83cd869d4f8744039)
  Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
  Is the failure [a flaky test](https://github.com/opensearch-project/OpenSearch/blob/main/DEVELOPER_GUIDE.md#flaky-tests) unrelated to your change?

Flaky Test

org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

#10006

@shwetathareja shwetathareja merged commit 9b7a9d0 into opensearch-project:main Oct 22, 2023
19 of 25 checks passed
@shwetathareja shwetathareja added the backport 2.x Backport to 2.x branch label Oct 22, 2023
@linuxpi linuxpi deleted the fix-local-remote-state-conflict branch October 22, 2023 10:46
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 22, 2023
* Override local disk state if we are able to restore from remote

Signed-off-by: bansvaru <bansvaru@amazon.com>
(cherry picked from commit 9b7a9d0)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
sachinpkale pushed a commit that referenced this pull request Oct 23, 2023
…) (#10833)

* Override local disk state if we are able to restore from remote


(cherry picked from commit 9b7a9d0)

Signed-off-by: bansvaru <bansvaru@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…search-project#10748)

* Override local disk state if we are able to restore from remote

Signed-off-by: bansvaru <bansvaru@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch bug Something isn't working skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] [Remote State] Remote state auto restore fails when process restarts before successful master election.
3 participants