Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry download of RemoteFSTranslog to fix transient race conditions #9565

Merged
merged 3 commits into from
Sep 6, 2023

Conversation

gbbafna
Copy link
Collaborator

@gbbafna gbbafna commented Aug 26, 2023

Description

Retrying translog download if it fails due to no file found for a finite time

In Primary to Primary relocation , there can be concurrent upload and download of translog.
While translog files are getting downloaded by new primary, it might hence be deleted by the primary
Hence we retry if tlog/ckp files are not found .

This doesn't happen in last download , where it is ensured that older primary has stopped modifying tlog data.

Related Issues

Resolves #9191

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Primary to Primary relocation , there can be concurrent upload and download of translog.

When you say concurrent upload and download, are we referring to concurrency due to old primary uploading and new primary downloading?

While translog files are getting downloaded by new primary, it might hence be deleted by the primary
Hence we retry if tlog/ckp files are not found .

I did not follow through this. The older primary can delete the translog when the new primary is trying to download the same? If so, is there a way we can disable the translog deletion somehow and disallow the erroneous state.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Compatibility status:

Checks if related components are compatible with change 3cae883

Incompatible components

Incompatible components: [https://github.com/opensearch-project/asynchronous-search.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov
Copy link

codecov bot commented Aug 28, 2023

Codecov Report

Merging #9565 (65e1f9e) into main (3d54d96) will increase coverage by 0.00%.
Report is 7 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 65e1f9e differs from pull request most recent head 7045e94. Consider uploading reports for the commit 7045e94 to get more accurate results

@@             Coverage Diff             @@
##               main    #9565     +/-   ##
===========================================
  Coverage     71.04%   71.05%             
+ Complexity    57821    57508    -313     
===========================================
  Files          4818     4781     -37     
  Lines        273093   271191   -1902     
  Branches      39811    39593    -218     
===========================================
- Hits         194028   192700   -1328     
+ Misses        62762    62257    -505     
+ Partials      16303    16234     -69     
Files Changed Coverage Δ
...rg/opensearch/index/translog/RemoteFsTranslog.java 72.25% <100.00%> (+3.21%) ⬆️

... and 610 files with indirect coverage changes

Copy link
Contributor

@BhumikaSaini-Amazon BhumikaSaini-Amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this 🙌

@ashking94
Copy link
Member

Lets close this once we have good confidence on #9603.

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Sep 5, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT:
  • URL:
  • CommitID: 93357ce
    Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
    Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

github-actions bot commented Sep 5, 2023

Compatibility status:

Checks if related components are compatible with change 93357ce

Incompatible components

Incompatible components: [https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@github-actions
Copy link
Contributor

github-actions bot commented Sep 5, 2023

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Sep 6, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Sep 6, 2023

Compatibility status:

Checks if related components are compatible with change 7045e94

Incompatible components

Incompatible components: [https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@gbbafna
Copy link
Collaborator Author

gbbafna commented Sep 6, 2023

Flaky test failing : #9688

@gbbafna gbbafna merged commit ff2b127 into opensearch-project:main Sep 6, 2023
10 of 11 checks passed
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Sep 6, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 6, 2023
…9565)

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
(cherry picked from commit ff2b127)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Sep 6, 2023
…9565) (#9798)

(cherry picked from commit ff2b127)

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…pensearch-project#9565)

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…pensearch-project#9565)

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…pensearch-project#9565)

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
4 participants