
Gracefully cancel the replication when the replica shard is closed #8089

Closed · ankitkala opened this issue Jun 15, 2023 · 3 comments · Fixed by #10725

Labels: distributed framework, enhancement, Indexing:Replication

ankitkala commented Jun 15, 2023

ReplicationSource defines a cancel method intended to allow source implementations to cancel and clean up any ongoing requests. The remote store implementation currently does not provide one.

We need to validate that cancellation with the remote store completes successfully. We can reuse the ITs introduced with #8478 for this validation.

I think we will need to treat the remote store implementation as a synchronous call and wrap it in CancellableThreads to perform the interrupt, but maybe there is a cancel mechanism similar to RetryableTransportClient.cancel that could support this.
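
For illustration, something along these lines could work. This is a rough sketch only; CancellableThreads is the existing utility, but the class and the downloadFromRemoteStore helper below are hypothetical stand-ins for the remote store source:

```java
import org.opensearch.common.util.CancellableThreads;

// Sketch: wrap the blocking remote store fetch in CancellableThreads so that
// cancel() can interrupt it when the replica shard is closed.
public class CancellableRemoteFetchSketch {
    private final CancellableThreads cancellableThreads = new CancellableThreads();

    public void getSegmentFiles() {
        // execute() registers the calling thread; if cancel() runs while we are
        // inside, the thread is interrupted and the call fails with
        // CancellableThreads.ExecutionCancelledException instead of hanging.
        cancellableThreads.execute(this::downloadFromRemoteStore);
    }

    public void cancel(String reason) {
        cancellableThreads.cancel(reason);
    }

    // Hypothetical stand-in for the blocking segment download.
    private void downloadFromRemoteStore() throws InterruptedException {
        Thread.sleep(1_000);
    }
}
```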

@ankitkala ankitkala added the bug and untriaged labels Jun 15, 2023
@anasalkouz anasalkouz assigned anasalkouz and mch2 and unassigned anasalkouz Jun 15, 2023
@mch2 mch2 removed their assignment Jul 10, 2023

@dreamer-89 dreamer-89 added the enhancement and Indexing:Replication labels and removed the bug label Aug 4, 2023
@dreamer-89 dreamer-89 self-assigned this Aug 7, 2023
dreamer-89 commented Sep 1, 2023

Discussed separately with @mch2: since we are close to the 2.10.0 release (code freeze Sep 04, 2023), the feature is pretty stable (non-flaky tests for node-to-node replication), and we have no data points around the use cases this change (PR #9234) would solve, we decided not to merge it for 2.10.0. Removing the 2.10.0 label from the issue and moving it to the backlog.

PR: #9234

CC @anasalkouz @mch2

@dreamer-89 dreamer-89 removed the v2.10.0 label Sep 1, 2023
mch2 commented Oct 16, 2023

I'm not sure whether this is an issue with cancellation on the replica, but it looks like post-failover a corrupt file is breaking recovery. I don't know whether this is because the replica is still writing concurrently in a previous sync, or because it isn't handling the checksum read properly.

org.opensearch.index.shard.IndexShardRecoveryException: Exception while copying segment files from remote segment store
	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4825) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4765) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4702) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:701) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:3996) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:3966) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$36(IndexShard.java:3917) ~[classes/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [classes/:?]
	at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1589) [?:?]
Caused by: java.io.IOException: Error occurred when downloading segments from remote store
	at org.opensearch.index.shard.IndexShard.copySegmentFiles(IndexShard.java:4931) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4805) ~[classes/:?]
	... 13 more
Caused by: java.nio.file.FileAlreadyExistsException: File "_0.cfs" was already written to.
	at org.apache.lucene.tests.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:704) ~[lucene-test-framework-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:153) ~[classes/:?]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.index.store.Store$StoreDirectory.copyFileAndValidateChecksum(Store.java:1000) ~[classes/:?]
	at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:981) ~[classes/:?]
	at org.opensearch.index.store.RemoteStoreFileDownloader.lambda$copyOneFile$1(RemoteStoreFileDownloader.java:132) ~[classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[classes/:?]
	... 3 more

I logged the local files inside syncSegmentsFromRemoteSegmentStore:

[2023-10-17T01:43:37,936][INFO ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] File _0.cfe already exists
[2023-10-17T01:43:37,936][INFO ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] File _0.si already exists
[2023-10-17T01:43:37,943][WARN ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] Exception while reading checksum of file: _0.cfs, this can happen if file is corrupted
[2023-10-17T01:43:37,943][INFO ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] Local dir [_0.cfe, _0.cfs, _0.si, segments_6, write.lock]
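
The FileAlreadyExistsException above suggests a leftover partial copy of _0.cfs whose Lucene footer checksum can't be read. One possible guard (a rough sketch, not the actual fix; deleteIfCorrupt is a hypothetical helper) would be to verify the footer of any pre-existing local file and delete it before re-downloading:

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Sketch: drop a pre-existing local segment file whose footer checksum is
// unreadable (e.g. a partial write left behind by a cancelled sync), so a
// re-download does not fail with FileAlreadyExistsException.
public final class CorruptLocalFileGuard {
    static void deleteIfCorrupt(Directory localDir, String fileName) throws IOException {
        try (IndexInput in = localDir.openInput(fileName, IOContext.READONCE)) {
            // Throws CorruptIndexException/EOFException if the footer is bad.
            CodecUtil.retrieveChecksum(in);
        } catch (IOException e) {
            // The input is already closed here (try-with-resources), so the
            // corrupt copy can be removed before copying from the remote store.
            localDir.deleteFile(fileName);
        }
    }
}
```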
