Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segment Replication stats throwing NPE when shards are unassigned or are in delayed allocation phase #14580

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

rampreeth
Copy link

@rampreeth rampreeth commented Jun 27, 2024

Segment Replication stats throwing NPE when shards are unassigned or are in delayed allocation phase

Description

[Describe what this change achieves]

Related Issues

Resolves #11945

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

❌ Gradle check result for ec6091b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…are in delayed allocation phase

Signed-off-by: Rampreeth Ethiraj <ethirajrampreeth@gmail.com>
Copy link
Contributor

❌ Gradle check result for aca2198: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for 0603be3: SUCCESS

@rampreeth
Copy link
Author

Looks like some other tests are failing, seems to be related to my change, will fix that

Are you able to repro this locally? isPrimaryRelocation is only invoked when fetching stats for segrep, i'd be surprised if this is related to those failures?

Yes, I'm able to repro it locally. The other tests are failing due to addition of an unassigned shard in the routing table.

Signed-off-by: Rampreeth Ethiraj <ethirajrampreeth@gmail.com>
Signed-off-by: Rampreeth Ethiraj <ethirajrampreeth@gmail.com>
Copy link
Contributor

❌ Gradle check result for 2677ac5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for b7577d7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rampreeth rampreeth closed this Jul 16, 2024
@rampreeth rampreeth reopened this Jul 16, 2024
Copy link
Contributor

❌ Gradle check result for b7577d7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rampreeth Ethiraj <ethirajrampreeth@gmail.com>
@rampreeth
Copy link
Author

Hi @mch2 , please review this again. Tests seem to be running fine now.

Copy link
Contributor

✅ Gradle check result for 070b695: SUCCESS

Signed-off-by: Rampreeth Ethiraj <ethirajrampreeth@gmail.com>
Copy link
Contributor

❌ Gradle check result for 476e64d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rampreeth Ethiraj <ethirajrampreeth@gmail.com>
Copy link
Contributor

❕ Gradle check result for 0db942d: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@rampreeth rampreeth requested a review from mch2 July 18, 2024 16:26
Signed-off-by: Rampreeth Ethiraj <ethirajrampreeth@gmail.com>
@rampreeth
Copy link
Author

Hi @mch2 , gentle reminder to review this

Copy link
Contributor

❌ Gradle check result for a6bbe1a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rampreeth rampreeth closed this Jul 22, 2024
@rampreeth rampreeth reopened this Jul 22, 2024
Copy link
Contributor

✅ Gradle check result for a6bbe1a: SUCCESS

@@ -104,13 +104,22 @@ static String nodeIdFromAllocationId(final AllocationId allocationId) {
}

static IndexShardRoutingTable routingTable(final Set<AllocationId> initializingIds, final AllocationId primaryId) {
return routingTable(initializingIds, Collections.singleton(primaryId), primaryId);
return routingTable(initializingIds, Collections.singleton(primaryId), primaryId, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to test this without adding this flag and updating these base methods. We can have an individual test create the routing table and add the routing with null aId. That should avoid the need to change unrelated tests like PeerRecoveryRetentionLease as well. Please see

final IndexShardRoutingTable.Builder builder = new IndexShardRoutingTable.Builder(shardId);
from testSegmentReplicationCheckpointForRelocatingPrimary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make that change

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Aug 22, 2024
@shourya035
Copy link
Member

@rampreeth Let us know if you are blocked on this. I see that @mch2 has left some comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch bug Something isn't working good first issue Good for newcomers low hanging fruit Storage Issues and PRs relating to data and metadata storage
Projects
Status: 🏗 In progress
Development

Successfully merging this pull request may close these issues.

[BUG] Segment Replication stats throwing NPE when shards are unassigned or are in delayed allocation phase
5 participants