Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Flaky test failure IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex #6124

Closed
mch2 opened this issue Feb 1, 2023 · 9 comments · Fixed by #11337
Closed

[BUG] Flaky test failure IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex #6124

mch2 opened this issue Feb 1, 2023 · 9 comments · Fixed by #11337
Assignees
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run untriaged

Comments

@mch2
Copy link
Member

mch2 commented Feb 1, 2023

./gradlew ':server:test' --tests "org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex" -Dtests.seed=A35A0BBD1D4D543A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=da-DK -Dtests.timezone=Australia/Darwin -Druntime.java=19
#6122

@mch2 mch2 added bug Something isn't working untriaged flaky-test Random test failure that succeeds on second run labels Feb 1, 2023
@minalsha minalsha removed the untriaged label Feb 3, 2023
@reta reta mentioned this issue Apr 4, 2023
8 tasks
@peternied
Copy link
Member

./gradlew ':server:test' --tests "org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex" -Dtests.seed=322535FDADE7B0A4 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ms -Dtests.timezone=SystemV/MST7 -Druntime.java=19
https://build.ci.opensearch.org/job/gradle-check/14778/testReport/junit/org.opensearch.index/IndexServiceTests/testAsyncTranslogTrimTaskOnClosedIndex/

@dbwiddis
Copy link
Member

dbwiddis commented Aug 6, 2023

Unable to reproduce locally.

From Jenkins log, the failing assertion is:

assertBusy(
() -> assertThat(
readOnlyEngine.translogManager().getTranslogStats().getTranslogSizeInBytes(),
equalTo((long) Translog.DEFAULT_HEADER_SIZE_IN_BYTES)
)
);

Expected: <55L>
     but: was <110L>

This implies there is potentially a second copy of the translog somewhere being accumulated in the stats.

@ankitkala
Copy link
Member

I was able to reproduce this only once after running around 10K test iterations.

  • So the test creates an index, ingests few docs and then closes the index. After closing the index we assert that the translog is empty by asserting that translogStats.getTranslogSizeInBytes equals DEFAULT_HEADER_SIZE_IN_BYTES(55 in this case).
  • Translog stats for closed index are getting initialized by creating a new LocalTranslog and fetching its stats here
  • LocalTranslog computes the size by adding the size of all readers here. These reader are getting initialized with translogs from disk here
  • As hinted by @dbwiddis, the translog size being 110 instead of 55(default header size) implies there are 2 empty translog files (on the disk).
  • During index close we perform a flush where we rollover the translog and trim unreferenced readers(here).
  • Translog rollover should happen only when the current writer has some operations(here). So Ideally we shouldn't have 2 empty files on the disk.

@ankitkala
Copy link
Member

[2023-11-21T03:55:00,762][INFO ][o.o.i.IndexService       ] [node_s_0] [test] [0] closing... (reason: [removing index (CLOSED)])
[2023-11-21T03:55:00,762][INFO ][o.o.i.IndexService       ] [node_s_0] [test] [0] closed (reason: [removing index (CLOSED)])
[2023-11-21T03:55:00,775][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test/6Z35y_VQTpmWn0x78xUklQ]
[2023-11-21T03:55:00,787][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankikala: Found a reader for /private/var/folders/dz/5106kpws767d85yrxw9y1m040000gr/T/org.opensearch.index.IndexServiceTests_2FC8623D3F0B25BE-001/tempDir-002/data/nodes/0/indices/6Z35y_VQTpmWn0x78xUklQ/0/translog/translog-16.tlog
[2023-11-21T03:55:00,802][INFO ][o.o.c.r.a.AllocationService] [node_s_0] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[test][0]]]).
[2023-11-21T03:55:00,980][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankikala: Found a reader for /private/var/folders/dz/5106kpws767d85yrxw9y1m040000gr/T/org.opensearch.index.IndexServiceTests_2FC8623D3F0B25BE-001/tempDir-002/data/nodes/0/indices/6Z35y_VQTpmWn0x78xUklQ/0/translog/translog-17.tlog
[2023-11-21T03:55:00,981][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankikala: Found a reader for /private/var/folders/dz/5106kpws767d85yrxw9y1m040000gr/T/org.opensearch.index.IndexServiceTests_2FC8623D3F0B25BE-001/tempDir-002/data/nodes/0/indices/6Z35y_VQTpmWn0x78xUklQ/0/translog/translog-16.tlog
[2023-11-21T03:55:00,987][WARN ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] deleted previously created, but not yet committed, next generation [translog-18.tlog]. This can happen due to a tragic exception when creating a new generation
[2023-11-21T03:55:00,990][WARN ][o.o.i.e.Engine           ] [node_s_0] [test][0] failed engine [translog trimming failed]
org.opensearch.index.translog.TranslogException: Unable to get the earliest last modified time for the transaction log
	at org.opensearch.index.translog.Translog.earliestLastModifiedAge(Translog.java:423) ~[classes/:?]
	at org.opensearch.index.translog.LocalTranslog.stats(LocalTranslog.java:138) ~[classes/:?]
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:211) [classes/:?]
	at org.opensearch.index.shard.IndexShard.trimTranslog(IndexShard.java:1484) [classes/:?]
	at org.opensearch.index.shard.IndexShard$4.doRun(IndexShard.java:2848) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
[2023-11-21T03:55:00,998][INFO ][o.o.i.IndexService       ] [node_s_0] [test] [0] closing... (reason: [shard failure, reason [translog trimming failed]])
[2023-11-21T03:55:00,998][INFO ][o.o.i.IndexService       ] [node_s_0] [test] [0] closed (reason: [shard failure, reason [translog trimming failed]])
[2023-11-21T03:55:00,998][WARN ][o.o.i.s.IndexShard       ] [node_s_0] [test][0] failed to turn off translog retention
org.opensearch.index.engine.EngineException: failed to trim translog
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:229) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.trimTranslog(IndexShard.java:1484) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard$4.doRun(IndexShard.java:2848) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.opensearch.index.translog.TranslogException: Unable to get the earliest last modified time for the transaction log
	at org.opensearch.index.translog.Translog.earliestLastModifiedAge(Translog.java:423) ~[classes/:?]
	at org.opensearch.index.translog.LocalTranslog.stats(LocalTranslog.java:138) ~[classes/:?]
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:211) ~[classes/:?]
	... 7 more

Note: Found a reader for... is a log that i had added within this.readers.addAll(recoverFromFiles(checkpoint)) in LocalTranslog here

@ankitkala ankitkala assigned ankitkala and unassigned gauravruhela Nov 23, 2023
@ankitkala
Copy link
Member

Added stack trace everytime translog is getting created. I can see that NoOpEngine.trimUnreferencedTranslogFiles also creates a new translog object. Looks like some race condition is happening and we somehow have 2 translog files on the disk(both empty). Ideally, NoOpEngine should only create and reuse single copy of Translog. But the change might require some refactoring which i don't want to include in this task's scope.
Since test is not broken functionally, i think we can also accept twice the header size within the assertion.

2023-11-23T11:34:55,811][INFO ][o.o.i.IndexServiceTests  ] [testAsyncTranslogTrimTaskOnClosedIndex] ankikala: Closing the index
[2023-11-23T11:34:55,811][INFO ][o.o.c.m.MetadataIndexStateService] [node_s_0] closing indices [test/NvFcQ9b9Qgq8pQsipEnsOg]
[2023-11-23T11:34:55,819][INFO ][o.o.c.m.MetadataIndexStateService] [node_s_0] completed closing of indices [test]
[2023-11-23T11:34:55,833][INFO ][o.o.i.IndexService       ] [node_s_0] [test] [0] closing... (reason: [removing index (CLOSED)])
[2023-11-23T11:34:55,834][INFO ][o.o.i.IndexService       ] [node_s_0] [test] [0] closed (reason: [removing index (CLOSED)])
[2023-11-23T11:34:55,849][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test/NvFcQ9b9Qgq8pQsipEnsOg]
[2023-11-23T11:34:55,861][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankitkala: creating local translog
[2023-11-23T11:34:55,862][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankikala: Found a reader for /private/var/folders/dz/5106kpws767d85yrxw9y1m040000gr/T/org.opensearch.index.IndexServiceTests_72FC9098026A84A7-001/tempDir-002/data/nodes/0/indices/NvFcQ9b9Qgq8pQsipEnsOg/0/translog/translog-17.tlog
[2023-11-23T11:34:55,871][INFO ][o.o.c.r.a.AllocationService] [node_s_0] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[test][0]]]).
java.lang.Exception: 
	at org.opensearch.index.translog.LocalTranslog.<init>(LocalTranslog.java:57)
	at org.opensearch.index.translog.InternalTranslogFactory.newTranslog(InternalTranslogFactory.java:34)
	at org.opensearch.index.engine.ReadOnlyEngine.translogStats(ReadOnlyEngine.java:274)
	at org.opensearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:147)
	at org.opensearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:76)
	at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2460)
	at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2402)
	at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2371)
	at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:660)
	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:121)
	at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344)
	at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:119)
	at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2745)
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
[2023-11-23T11:34:55,885][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankitkala: creating local translog
java.lang.Exception: 
	at org.opensearch.index.translog.LocalTranslog.<init>(LocalTranslog.java:57)
	at org.opensearch.index.translog.InternalTranslogFactory.newTranslog(InternalTranslogFactory.java:34)
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:199)
	at org.opensearch.index.shard.IndexShard.trimTranslog(IndexShard.java:1484)
	at org.opensearch.index.shard.IndexShard$4.doRun(IndexShard.java:2848)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
[2023-11-23T11:34:58,075][WARN ][o.o.m.j.JvmGcMonitorService] [node_s_0] [gc][1] overhead, spent [2.2s] collecting in the last [2.9s]
[2023-11-23T11:34:58,075][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankikala: Found a reader for /private/var/folders/dz/5106kpws767d85yrxw9y1m040000gr/T/org.opensearch.index.IndexServiceTests_72FC9098026A84A7-001/tempDir-002/data/nodes/0/indices/NvFcQ9b9Qgq8pQsipEnsOg/0/translog/translog-18.tlog
[2023-11-23T11:34:58,076][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankikala: Found a reader for /private/var/folders/dz/5106kpws767d85yrxw9y1m040000gr/T/org.opensearch.index.IndexServiceTests_72FC9098026A84A7-001/tempDir-002/data/nodes/0/indices/NvFcQ9b9Qgq8pQsipEnsOg/0/translog/translog-17.tlog
[2023-11-23T11:34:58,078][INFO ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] ankitkala: creating local translog
[2023-11-23T11:34:58,080][WARN ][o.o.i.t.LocalTranslog    ] [node_s_0] [test][0] deleted previously created, but not yet committed, next generation [translog-19.tlog]. This can happen due to a tragic exception when creating a new generation
[2023-11-23T11:34:58,082][WARN ][o.o.i.e.Engine           ] [node_s_0] [test][0] failed engine [translog trimming failed]
org.opensearch.index.translog.TranslogException: Unable to get the earliest last modified time for the transaction log
	at org.opensearch.index.translog.Translog.earliestLastModifiedAge(Translog.java:423) ~[classes/:?]
	at org.opensearch.index.translog.LocalTranslog.stats(LocalTranslog.java:144) ~[classes/:?]
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:211) [classes/:?]
	at org.opensearch.index.shard.IndexShard.trimTranslog(IndexShard.java:1484) [classes/:?]
	at org.opensearch.index.shard.IndexShard$4.doRun(IndexShard.java:2848) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
java.lang.Exception: 
	at org.opensearch.index.translog.LocalTranslog.<init>(LocalTranslog.java:57)
	at org.opensearch.index.translog.InternalTranslogFactory.newTranslog(InternalTranslogFactory.java:34)
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:199)
	at org.opensearch.index.shard.IndexShard.trimTranslog(IndexShard.java:1484)
	at org.opensearch.index.IndexService.maybeTrimTranslog(IndexService.java:1044)
	at org.opensearch.index.IndexService$AsyncTrimTranslogTask.runInternal(IndexService.java:1202)
	at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
[2023-11-23T11:34:58,086][WARN ][o.o.i.s.IndexShard       ] [node_s_0] [test][0] failed to turn off translog retention
org.opensearch.index.engine.EngineException: failed to trim translog
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:229) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.trimTranslog(IndexShard.java:1484) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard$4.doRun(IndexShard.java:2848) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.opensearch.index.translog.TranslogException: Unable to get the earliest last modified time for the transaction log
	at org.opensearch.index.translog.Translog.earliestLastModifiedAge(Translog.java:423) ~[classes/:?]
	at org.opensearch.index.translog.LocalTranslog.stats(LocalTranslog.java:144) ~[classes/:?]
	at org.opensearch.index.engine.NoOpEngine$3.trimUnreferencedTranslogFiles(NoOpEngine.java:211) ~[classes/:?]
	... 7 more

@kotwanikunal
Copy link
Member

@ankitkala Can we create an issue for the unexpected behavior and add in the investigation there?
It will be simpler to keep track of it.

And just so that its documented as well - NoOpEngine is used only for tests right?

@ankitkala
Copy link
Member

ankitkala commented Dec 1, 2023

@ankitkala Can we create an issue for the unexpected behavior and add in the investigation there? It will be simpler to keep track of it.

And just so that its documented as well - NoOpEngine is used only for tests right?

Sure. Just to clarify, there isn't any functional regression though.

NoOpEngine is used when we close a shard.

@shourya035
Copy link
Member

Reopening since this again failed in https://build.ci.opensearch.org/job/gradle-check/39959/

@shourya035 shourya035 reopened this Jun 6, 2024
@r1walz
Copy link
Contributor

r1walz commented Jun 6, 2024

Closing as already tracked here: #11547

@r1walz r1walz closed this as completed Jun 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run untriaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants