Investigate flaky test/parallel/test-trace-events-fs-sync #22865

Trott · 2018-09-14T17:22:40Z

Looks like we're now getting SIGABRT rather than SIGSEGV? Is that a step in the right direction? (I think so...)

https://ci.nodejs.org/job/node-test-commit-freebsd/20477/nodes=freebsd11-x64/console

22:24:54 not ok 1905 parallel/test-trace-events-fs-sync
22:24:54   ---
22:24:54   duration_ms: 3.60
22:24:54   severity: fail
22:24:54   exitcode: 1
22:24:54   stack: |-
22:24:54     assert.js:84
22:24:54       throw new AssertionError(obj);
22:24:54       ^
22:24:54     
22:24:54     AssertionError [ERR_ASSERTION]: fs.sync.stat:
22:24:54     { status: null,
22:24:54       signal: 'SIGABRT',
22:24:54       output: [ null, '', '' ],
22:24:54       pid: 99508,
22:24:54       stdout: '',
22:24:54       stderr: '' }
22:24:54         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:139:10)
22:24:54         at Module._compile (internal/modules/cjs/loader.js:694:30)
22:24:54         at Object.Module._extensions..js (internal/modules/cjs/loader.js:705:10)
22:24:54         at Module.load (internal/modules/cjs/loader.js:604:32)
22:24:54         at tryModuleLoad (internal/modules/cjs/loader.js:543:12)
22:24:54         at Function.Module._load (internal/modules/cjs/loader.js:535:3)
22:24:54         at Function.Module.runMain (internal/modules/cjs/loader.js:747:12)
22:24:54         at startup (internal/bootstrap/node.js:270:19)
22:24:54         at bootstrapNodeJSCore (internal/bootstrap/node.js:801:3)
22:24:54   ...

Previous issue: #21038

@ofrobots @nodejs/trace-events @nodejs/platform-freebsd @nodejs/testing

Trott · 2018-09-14T17:24:13Z

Seems to me that the thing to do would be one of:

reduce parallel-ism in the test
move it to sequential
find a way to increase resources on our FreeBSD machine in CI (or decrease the value for -j on FreeBSD in CI--I think it's ridiculously high on FreeBSD for reasons I never fully understood)

Seem about right to others? Like, I assume we're getting SIGABRT because we're spawning too many processes or something like that?

Trott · 2018-09-14T17:24:38Z

Oh...although the test is using spawnSync and not spawn so...maybe not?

Trott · 2018-09-14T17:53:04Z

Stress test to confirm unreliability on current master: https://ci.nodejs.org/job/node-stress-single-test/2018/

ofrobots · 2018-09-14T17:59:19Z

I think this is due to the fact that the TraceEvent implementation we have presently is not fully robust to thread races, specially in shutdown code. I didn't have full confidence that we have fixed /all/ the issues in my previous set of fixes.

Can I haz access to the FreeBSD machine again? I can take a look.

Trott · 2018-09-14T19:21:34Z

Can I haz access to the FreeBSD machine again? I can take a look.

@ofrobots Your access has been restored. (Thanks, @refack!) Go for it!

Ref: nodejs/build#1489 (comment)

ofrobots · 2018-09-17T23:23:43Z

From a core dump on FreeBSD:

Thread 1 (process 100772):
#0  0x000000080349247a in thr_kill () from /lib/libc.so.7
#1  0x0000000803492444 in raise () from /lib/libc.so.7
#2  0x00000008034923b9 in abort () from /lib/libc.so.7
#3  0x00000000009fcd74 in uv_mutex_lock (mutex=<value optimized out>) at ../deps/uv/src/unix/thread.c:275
#4  0x000000000097bb8c in node::tracing::InternalTraceBuffer::AddTraceEvent ()
#5  0x00000000013d6622 in v8::platform::tracing::TracingController::AddTraceEvent ()
#6  0x00000000009417b7 in node::(anonymous namespace)::PlatformWorkerThread ()
#7  0x00000008031a8c06 in pthread_create () from /lib/libthr.so.3
#8  0x0000000000000000 in ?? ()

In the meanwhile, the main thread was here:

Thread 2 (process 101594):
#0  0x00000008031b6b2c in pthread_cleanup_pop () from /lib/libthr.so.3
#1  0x00000000009fccae in uv_thread_join (tid=<value optimized out>) at ../deps/uv/src/unix/thread.c:216
#2  0x0000000000941b29 in node::WorkerThreadsTaskRunner::Shutdown ()
#3  0x0000000000942dfa in node::NodePlatform::Shutdown ()
#4  0x00000000008c5dff in node::Start ()
#5  0x0000000000880095 in _start ()
#6  0x0000000802263000 in ?? ()
#7  0x0000000000000000 in ?? ()

For safer shutdown, we should destroy the platform – and background threads - before the tracing infrastructure is destroyed. This change fixes the relative order of NodePlatform disposition and the tracing agent shutting down. This matches the nesting order for startup. Make the tracing agent own the tracing controller instead of platform to match the above. Fixes: nodejs#22865

Trott · 2018-09-23T06:59:03Z

No surprise here, as #22938 hasn't landed yet, but still a thing:

https://ci.nodejs.org/job/node-test-commit-freebsd/20670/nodes=freebsd11-x64/console

00:10:49 not ok 1911 parallel/test-trace-events-fs-sync # TODO : Fix flaky test
00:10:49   ---
00:10:49   duration_ms: 1.387
00:10:49   severity: flaky
00:10:49   exitcode: 1
00:10:49   stack: |-
00:10:49     /usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:143
00:10:49         throw new Error(`${tr}:\n${util.inspect(proc)}`);
00:10:49         ^
00:10:49     
00:10:49     Error: fs.sync.fstat:
00:10:49     { status: null,
00:10:49       signal: 'SIGABRT',
00:10:49       output: [ null, '', '' ],
00:10:49       pid: 13999,
00:10:49       stdout: '',
00:10:49       stderr: '' }
00:10:49         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:143:11)
00:10:50         at Module._compile (internal/modules/cjs/loader.js:703:30)
00:10:50         at Object.Module._extensions..js (internal/modules/cjs/loader.js:714:10)
00:10:50         at Module.load (internal/modules/cjs/loader.js:613:32)
00:10:50         at tryModuleLoad (internal/modules/cjs/loader.js:552:12)
00:10:50         at Function.Module._load (internal/modules/cjs/loader.js:544:3)
00:10:50         at Function.Module.runMain (internal/modules/cjs/loader.js:756:12)
00:10:50         at startup (internal/bootstrap/node.js:302:19)
00:10:50         at bootstrapNodeJSCore (internal/bootstrap/node.js:855:3)
00:10:50   ...

For safer shutdown, we should destroy the platform – and background threads - before the tracing infrastructure is destroyed. This change fixes the relative order of NodePlatform disposition and the tracing agent shutting down. This matches the nesting order for startup. Make the tracing agent own the tracing controller instead of platform to match the above. Fixes: nodejs#22865 PR-URL: nodejs#22938 Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>

Trott · 2018-10-06T13:06:20Z

Fixed in 68b3e46

For safer shutdown, we should destroy the platform – and background threads - before the tracing infrastructure is destroyed. This change fixes the relative order of NodePlatform disposition and the tracing agent shutting down. This matches the nesting order for startup. Make the tracing agent own the tracing controller instead of platform to match the above. Fixes: nodejs#22865 PR-URL: nodejs#22938 Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>

For safer shutdown, we should destroy the platform – and background threads - before the tracing infrastructure is destroyed. This change fixes the relative order of NodePlatform disposition and the tracing agent shutting down. This matches the nesting order for startup. Make the tracing agent own the tracing controller instead of platform to match the above. Fixes: #22865 PR-URL: #22938 Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>

Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Sep 14, 2018

ofrobots mentioned this issue Sep 18, 2018

trace_event: destroy platform before tracing #22938

Closed

2 tasks

ofrobots mentioned this issue Sep 28, 2018

src: trace_event: secondary storage for metadata #20900

Closed

3 tasks

refack mentioned this issue Oct 1, 2018

test: mark some flakes #23208

Merged

3 tasks

Trott closed this as completed Oct 6, 2018

ofrobots mentioned this issue Oct 15, 2018

Assertion in parallel/test-trace-events-dynamic-enable from libuv #23675

Closed

MylesBorins pushed a commit that referenced this issue Nov 29, 2018

trace_events: destroy platform before tracing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate flaky test/parallel/test-trace-events-fs-sync #22865

Investigate flaky test/parallel/test-trace-events-fs-sync #22865

Trott commented Sep 14, 2018

Trott commented Sep 14, 2018

Trott commented Sep 14, 2018

Trott commented Sep 14, 2018

ofrobots commented Sep 14, 2018

Trott commented Sep 14, 2018 •

edited

Loading

ofrobots commented Sep 17, 2018 •

edited

Loading

Trott commented Sep 23, 2018

Trott commented Oct 6, 2018

Investigate flaky test/parallel/test-trace-events-fs-sync #22865

Investigate flaky test/parallel/test-trace-events-fs-sync #22865

Comments

Trott commented Sep 14, 2018

Trott commented Sep 14, 2018

Trott commented Sep 14, 2018

Trott commented Sep 14, 2018

ofrobots commented Sep 14, 2018

Trott commented Sep 14, 2018 • edited Loading

ofrobots commented Sep 17, 2018 • edited Loading

Trott commented Sep 23, 2018

Trott commented Oct 6, 2018

Trott commented Sep 14, 2018 •

edited

Loading

ofrobots commented Sep 17, 2018 •

edited

Loading