Optimize remote store operations during snapshot Deletion #12319

Merged

@harishbhakuni harishbhakuni commented Feb 14, 2024

Description

  • Optimize the stale segments deletion method [StaleSegmentsDeletion(int lastNmdFilesToKeep)]:
    • Current approach:
      1. Read all the non-deletable md files (locked, or among the last N md files) to get the active segments.
      2. For each deletable md file (not locked and not among the last N md files):
        1. Read the md file to get its segments and filter out the active segments fetched in step 1.
        2. Delete the non-active segments one by one.
      3. If all the non-active segments for an md file were deleted, delete the md file.
    • Optimized approach:
      1. Since each remote store md file is a snapshot of the live segments at the time of its creation, we can assume that (1) once a segment file has dropped out of an md file, it will never be present in any later md file, and (2) if (md1, md2, md3) are in sorted order, a segment file cannot be in md1 and md3 but not in md2.
      2. For each deletable md file, we can therefore treat the segments present in the nearest non-deletable md file before it and the nearest non-deletable md file after it as its active segments. To filter out the non-active segment files for a deletable md file, we only need to read those two non-deletable md files.
      3. Batch delete the non-active segment files for each deletable md file.
      4. Batch delete the md files whose non-active segment files were deleted successfully.
  • Lazily initialize the remote directory instance inside the remote_purge threadpool task itself, so that the instance is created and initialized only once the task is picked up for execution.
  • Optimize the snapshot deletion logic to skip shard blob deletion only for the shards whose remote store cleanup failed.
  • Maintain a local cache of successful remote store operations to avoid repeated cleanup calls for the same resource.
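
The optimized approach above can be sketched roughly as follows. This is a minimal illustration in Python, not the Java code in this PR: the function names, the `segments_by_md` dict (standing in for reading an md file from the remote store) and the `delete_batch` callback are all assumptions made for the example. It relies on the two stated assumptions about md files, so only the nearest non-deletable md file on each side of a deletable one is consulted.

```python
from typing import Callable, Dict, List, Optional, Set


def plan_stale_deletions(
    sorted_md_files: List[str],
    segments_by_md: Dict[str, Set[str]],
    non_deletable: Set[str],
) -> Dict[str, Set[str]]:
    """Map each deletable md file to the segments that are absent from its
    nearest non-deletable neighbours (hypothetical helper, names assumed)."""
    # Forward pass: nearest non-deletable md file before each deletable one.
    before: Dict[str, Optional[str]] = {}
    nearest: Optional[str] = None
    for md in sorted_md_files:
        if md in non_deletable:
            nearest = md
        else:
            before[md] = nearest

    # Backward pass: nearest non-deletable md file after each deletable one.
    after: Dict[str, Optional[str]] = {}
    nearest = None
    for md in reversed(sorted_md_files):
        if md in non_deletable:
            nearest = md
        else:
            after[md] = nearest

    plan: Dict[str, Set[str]] = {}
    for md in sorted_md_files:
        if md in non_deletable:
            continue
        active: Set[str] = set()
        for neighbour in (before[md], after[md]):
            if neighbour is not None:
                active |= segments_by_md[neighbour]
        plan[md] = segments_by_md[md] - active
    return plan


def delete_stale(
    plan: Dict[str, Set[str]],
    delete_batch: Callable[[List[str]], None],
) -> List[str]:
    """Batch delete stale segments per md file, then batch delete the md
    files whose stale segments were all removed. Returns those md files."""
    already_deleted: Set[str] = set()
    md_files_done: List[str] = []
    for md, stale in plan.items():
        pending = sorted(stale - already_deleted)
        try:
            if pending:
                delete_batch(pending)
        except OSError:
            continue  # keep this md file so a later run can retry it
        already_deleted.update(pending)
        md_files_done.append(md)
    if md_files_done:
        delete_batch(md_files_done)
    return md_files_done
```

In this sketch a failed segment batch leaves its md file in place, which mirrors step 4: an md file is only removed once its non-active segments are gone.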

Related Issues

#12302,
#12253

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@harishbhakuni harishbhakuni self-assigned this Feb 14, 2024

❌ Gradle check result for 79ef792: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Feb 14, 2024

Compatibility status:

Checks if related components are compatible with change f69acff

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/performance-analyzer.git]


❌ Gradle check result for a50df48: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


❌ Gradle check result for ca5d929: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


❌ Gradle check result for 97c6ead: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@harishbhakuni harishbhakuni changed the title from "Initial changes for remote store cleanup during snapshot optimizations." to "Optimize remote store operations during snapshot Deletion" Feb 20, 2024
@harishbhakuni harishbhakuni marked this pull request as ready for review February 20, 2024 02:14

✅ Gradle check result for 5475951: SUCCESS


✅ Gradle check result for 488c81c: SUCCESS


❌ Gradle check result for 5ba410b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


@ashking94 ashking94 left a comment

Let's add more test cases for RemoteStoreShardCleanupTask that cover concurrent cleanup tasks.
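
A concurrent case of the kind requested here could be shaped like the sketch below. It is written in Python for brevity and does not reflect the actual RemoteStoreShardCleanupTask code: `CleanupTracker`, `run_once` and `run_concurrently` are hypothetical names, and the shard key is an assumed (index UUID, shard id) pair. The idea is that several tasks racing on the same shard should trigger the cleanup once, while a failed cleanup stays retryable.

```python
import threading
from typing import Callable, Hashable, List, Set


class CleanupTracker:
    """Hypothetical helper: lets one caller per key run the cleanup.

    A key stays claimed while its cleanup is running or after it succeeded;
    a failure releases the key so a later task can retry.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed: Set[Hashable] = set()

    def run_once(self, key: Hashable, cleanup: Callable[[], None]) -> bool:
        """Return True if this call ran the cleanup, False if it was skipped."""
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
        try:
            cleanup()
        except Exception:
            with self._lock:
                self._claimed.discard(key)
            raise
        return True


def run_concurrently(
    tracker: CleanupTracker,
    key: Hashable,
    cleanup: Callable[[], None],
    workers: int = 8,
) -> List[bool]:
    """Start `workers` threads that all try to clean up the same key."""
    barrier = threading.Barrier(workers)  # release all threads together
    results: List[bool] = []

    def worker() -> None:
        barrier.wait()
        results.append(tracker.run_once(key, cleanup))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A test built on this would assert that the cleanup callback ran exactly once across all workers, and that a key whose cleanup raised can be cleaned up by a later call.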

Signed-off-by: Harish Bhakuni <hbhakuni@amazon.com>

✅ Gradle check result for f69acff: SUCCESS

@gbbafna gbbafna merged commit b265215 into opensearch-project:main Mar 14, 2024
30 of 34 checks passed
@gbbafna gbbafna added the backport 2.x (Backport to 2.x branch) label Mar 14, 2024
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-12319-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b26521562b2c51991f30a75c7d266d4be8e2b3de
# Push it to GitHub
git push --set-upstream origin backport/backport-12319-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-12319-to-2.x.

@harishbhakuni harishbhakuni added the backport 2.x (Backport to 2.x branch) label and removed the backport 2.x (Backport to 2.x branch) and backport-failed labels Mar 14, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 14, 2024
Signed-off-by: Harish Bhakuni <hbhakuni@amazon.com>
(cherry picked from commit b265215)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@opensearch-trigger-bot

The backport to 2.11 failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.11 2.11
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.11
# Create a new branch
git switch --create backport/backport-12319-to-2.11
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b26521562b2c51991f30a75c7d266d4be8e2b3de
# Push it to GitHub
git push --set-upstream origin backport/backport-12319-to-2.11
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.11

Then, create a pull request where the base branch is 2.11 and the compare/head branch is backport/backport-12319-to-2.11.

andrross pushed a commit that referenced this pull request Mar 15, 2024
…12677)

(cherry picked from commit b265215)

Signed-off-by: Harish Bhakuni <hbhakuni@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…-project#12319)

Signed-off-by: Harish Bhakuni <hbhakuni@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>