Adjust dynamic timeout for get_segment_files operation to prevent request timeouts #4392

Closed
Tracked by #3969
dreamer-89 opened this issue Sep 2, 2022 · 0 comments

dreamer-89 commented Sep 2, 2022

The GetSegmentFiles transport request times out with the current timeout of 1 minute, which is taken from the recovery setting indices.recovery.internal_action_retry_timeout.

For a better timeout, we can set it dynamically based on the total size of the segment files (from FileStoreMetadata) and the cluster's network bandwidth.

Since we do not know the cluster's network bandwidth, we can experiment to find a timeout value that takes the segment files' size into account.
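A minimal sketch of what a size-based timeout could look like, assuming a fixed base timeout plus the estimated transfer time at a guessed throughput, capped at an upper bound. The class name SegmentFilesTimeout, the method compute, and all parameters are hypothetical and are not part of OpenSearch; the throughput value is a tunable assumption, not a measured bandwidth.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical helper for illustration only; not an OpenSearch class.
final class SegmentFilesTimeout {

    private SegmentFilesTimeout() {}

    /**
     * Estimates a timeout for copying segment files.
     *
     * @param fileSizesInBytes      lengths of the segment files to copy
     * @param assumedBytesPerSecond assumed throughput (a guess, since real bandwidth is unknown)
     * @param baseTimeout           fixed part of the timeout, also the result for an empty copy
     * @param maxTimeout            upper bound so a huge copy cannot wait forever
     */
    static Duration compute(List<Long> fileSizesInBytes,
                            long assumedBytesPerSecond,
                            Duration baseTimeout,
                            Duration maxTimeout) {
        if (assumedBytesPerSecond <= 0) {
            throw new IllegalArgumentException("assumedBytesPerSecond must be positive");
        }
        if (baseTimeout.compareTo(maxTimeout) > 0) {
            throw new IllegalArgumentException("baseTimeout must not exceed maxTimeout");
        }

        // Sum the file lengths, failing loudly on negative sizes or overflow.
        long totalBytes = 0;
        for (long size : fileSizesInBytes) {
            if (size < 0) {
                throw new IllegalArgumentException("file size must not be negative");
            }
            totalBytes = Math.addExact(totalBytes, size);
        }

        // Estimated transfer time in whole seconds, rounded up.
        long transferSeconds = totalBytes / assumedBytesPerSecond
            + (totalBytes % assumedBytesPerSecond == 0 ? 0 : 1);

        // Base timeout plus estimated transfer time, capped at maxTimeout.
        Duration estimate = baseTimeout.plusSeconds(transferSeconds);
        return estimate.compareTo(maxTimeout) > 0 ? maxTimeout : estimate;
    }
}
```

For example, with 1.5 GB of segment files, an assumed 50 MB/s, and a 1 minute base, this sketch adds 30 seconds of estimated transfer time to the base. The cap keeps a very large copy from producing an unbounded wait; both the throughput and the cap would need to be tuned through benchmarking.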

Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [seed][10.9.0.166:9300][internal:index/shard/replication/get_segment_files] request_id [552738] timed out after [599988ms]

Failure stack trace from benchmarking

[2022-09-02T09:34:08,220][ERROR][o.o.i.r.SegmentReplicationTargetService] [data-e20223d0] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:293) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.2.0.jar:2.2.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1270) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.2.0.jar:2.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [seed][10.9.0.166:9300][internal:index/shard/replication/get_segment_files] request_id [552738] timed out after [599988ms]
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1273) ~[opensearch-2.2.0.jar:2.2.0]
        ... 4 more