Perform unreferenced file cleanup for any operation failure and mute flaky test #12128

Closed
wants to merge 2 commits into from

Conversation

RS146BIJAY
Contributor

@RS146BIJAY RS146BIJAY commented Feb 1, 2024

Description

Currently, unreferenced files are cleaned up only when the last write was performed by a merge that filled the disk and caused the shard to fail. If some other operation performs the last write and fills the disk to 100% while a merge is in progress, the merge is simply aborted and no cleanup is performed, because when the shard is closed the disk-full failure is attributed to the other operation rather than to the segment merge.

To fix this, we need to clean up unreferenced files whenever any operation fails because the disk is full.
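The intended decision can be sketched as follows. This is a minimal illustration, not the code in this PR: the name `isOperationFailureDueToIOException` is taken from the diff quoted later in this conversation, but its body here (walking the cause chain for an `IOException`) is an assumption, and `CleanupDecisionSketch` and `shouldCleanup` are hypothetical names introduced only for this example.

```java
import java.io.IOException;

public class CleanupDecisionSketch {

    /**
     * Hypothetical stand-in for the check named in the diff: returns true if
     * the failure or any of its causes is an IOException.
     */
    public static boolean isOperationFailureDueToIOException(Throwable failure) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        int depth = 0;
        for (Throwable t = failure; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof IOException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Decision as described above: which operation performed the last write
     * no longer matters, only whether cleanup is enabled and what caused the failure.
     */
    public static boolean shouldCleanup(boolean cleanupEnabled, Throwable failure) {
        return cleanupEnabled && isOperationFailureDueToIOException(failure);
    }

    public static void main(String[] args) {
        // An indexing failure (not a merge) whose root cause is a full disk.
        Throwable diskFull = new RuntimeException("indexing failed", new IOException("No space left on device"));
        // A failure that has nothing to do with I/O.
        Throwable other = new IllegalStateException("mapping conflict");

        System.out.println(shouldCleanup(true, diskFull));  // true
        System.out.println(shouldCleanup(true, other));     // false
        System.out.println(shouldCleanup(false, diskFull)); // false
    }
}
```

Note that the description speaks of disk-full failures while the check in the diff is named for `IOException` in general, so a check like this one would also fire for I/O failures other than a full disk.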

Related Issues

#12054

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented Feb 1, 2024

Compatibility status:

Checks if related components are compatible with change 56424ec

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/alerting.git]

Contributor

github-actions bot commented Feb 1, 2024

❌ Gradle check result for 22ba8eb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…flaky tests

Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>
Contributor

github-actions bot commented Feb 1, 2024

✅ Gradle check result for eb68176: SUCCESS


codecov bot commented Feb 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.34%. Comparing base (4471a8d) to head (56424ec).
Report is 378 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12128      +/-   ##
============================================
+ Coverage     71.30%   71.34%   +0.04%     
- Complexity    59393    59507     +114     
============================================
  Files          4925     4925              
  Lines        279540   279539       -1     
  Branches      40646    40645       -1     
============================================
+ Hits         199333   199448     +115     
+ Misses        63580    63517      -63     
+ Partials      16627    16574      -53     


Signed-off-by: rishavz_sagar <rishavsagar4b1@gmail.com>
Contributor

github-actions bot commented Feb 1, 2024

❕ Gradle check result for 56424ec: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

// clean up all unreferenced files on best effort basis created during failed merge and reset the
// shard state back to last Lucene Commit
if (shouldCleanupUnreferencedFiles() && isOperationFailureDueToIOException(failure)) {
    logger.info("Cleaning up unreferenced files created during failed merge due to: {}", reason);
Contributor

Should this be logged at warning level?

Collaborator

With this change we may end up doing extra cleanups on other shards as well. Please evaluate whether we can avoid that and clean up only on the responsible shard instead.

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Mar 5, 2024
Comment on lines +1323 to +1324
if (shouldCleanupUnreferencedFiles() && isOperationFailureDueToIOException(failure)) {
    logger.info("Cleaning up unreferenced files created during failed merge due to: {}", reason);
Collaborator

Do we need to log if the exception isn't an IOException?
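One way the logging questions raised on this snippet could be explored is sketched below. It is only an illustration under stated assumptions: it uses `java.util.logging` instead of the project's own logger so that it is self-contained, it simplifies the cause check to a direct `instanceof IOException`, and `CleanupLoggingSketch` and `logCleanupDecision` are hypothetical names that do not appear in the PR.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CleanupLoggingSketch {

    private static final Logger logger = Logger.getLogger(CleanupLoggingSketch.class.getName());

    /**
     * Logs the cleanup decision and returns the level that was used, so the
     * choice is observable by callers and tests.
     */
    public static Level logCleanupDecision(boolean cleanupEnabled, Throwable failure, String reason) {
        if (cleanupEnabled && failure instanceof IOException) {
            // Cleanup after an I/O failure is noteworthy, so log it as a warning.
            logger.log(Level.WARNING, "Cleaning up unreferenced files after failed operation due to: {0}", reason);
            return Level.WARNING;
        }
        // No cleanup happens here; record the skip at a low level for debugging only.
        logger.log(Level.FINE, "Skipping unreferenced file cleanup for failure: {0}", reason);
        return Level.FINE;
    }

    public static void main(String[] args) {
        logCleanupDecision(true, new IOException("No space left on device"), "disk full");
        logCleanupDecision(true, new IllegalStateException("bad state"), "not an I/O failure");
    }
}
```

Whether the skip branch should log at all is exactly the open question in the review; the sketch only shows that a low-level message there is possible without adding noise at the default log level.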

@gbbafna
Collaborator

gbbafna commented Apr 4, 2024

@RS146BIJAY: are you still working on this change?

@gbbafna gbbafna added the stalled Issues that have stalled label Apr 4, 2024
@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Apr 4, 2024
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label May 7, 2024
@RS146BIJAY RS146BIJAY closed this May 8, 2024
@RS146BIJAY RS146BIJAY deleted the seg_merge branch July 8, 2024 14:45
Labels
stalled Issues that have stalled
4 participants