Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test/parallel/test-trace-events-fs-sync #22865

Closed
Trott opened this issue Sep 14, 2018 · 9 comments
Closed

Investigate flaky test/parallel/test-trace-events-fs-sync #22865

Trott opened this issue Sep 14, 2018 · 9 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

@Trott
Copy link
Member

Trott commented Sep 14, 2018

Looks like we're now getting SIGABRT rather than SIGSEGV? Is that a step in the right direction? (I think so...)

https://ci.nodejs.org/job/node-test-commit-freebsd/20477/nodes=freebsd11-x64/console

22:24:54 not ok 1905 parallel/test-trace-events-fs-sync
22:24:54   ---
22:24:54   duration_ms: 3.60
22:24:54   severity: fail
22:24:54   exitcode: 1
22:24:54   stack: |-
22:24:54     assert.js:84
22:24:54       throw new AssertionError(obj);
22:24:54       ^
22:24:54     
22:24:54     AssertionError [ERR_ASSERTION]: fs.sync.stat:
22:24:54     { status: null,
22:24:54       signal: 'SIGABRT',
22:24:54       output: [ null, '', '' ],
22:24:54       pid: 99508,
22:24:54       stdout: '',
22:24:54       stderr: '' }
22:24:54         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:139:10)
22:24:54         at Module._compile (internal/modules/cjs/loader.js:694:30)
22:24:54         at Object.Module._extensions..js (internal/modules/cjs/loader.js:705:10)
22:24:54         at Module.load (internal/modules/cjs/loader.js:604:32)
22:24:54         at tryModuleLoad (internal/modules/cjs/loader.js:543:12)
22:24:54         at Function.Module._load (internal/modules/cjs/loader.js:535:3)
22:24:54         at Function.Module.runMain (internal/modules/cjs/loader.js:747:12)
22:24:54         at startup (internal/bootstrap/node.js:270:19)
22:24:54         at bootstrapNodeJSCore (internal/bootstrap/node.js:801:3)
22:24:54   ...

Previous issue: #21038

@ofrobots @nodejs/trace-events @nodejs/platform-freebsd @nodejs/testing

@Trott
Copy link
Member Author

Trott commented Sep 14, 2018

Seems to me that the thing to do would be one of:

  • reduce parallel-ism in the test
  • move it to sequential
  • find a way to increase resources on our FreeBSD machine in CI (or decrease the value for -j on FreeBSD in CI--I think it's ridiculously high on FreeBSD for reasons I never fully understood)

Seem about right to others? Like, I assume we're getting SIGABRT because we're spawning too many processes or something like that?

@Trott
Copy link
Member Author

Trott commented Sep 14, 2018

Oh...although the test is using spawnSync and not spawn so...maybe not?

@Trott
Copy link
Member Author

Trott commented Sep 14, 2018

Stress test to confirm unreliability on current master: https://ci.nodejs.org/job/node-stress-single-test/2018/

@Trott Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Sep 14, 2018
@ofrobots
Copy link
Contributor

I think this is due to the fact that the TraceEvent implementation we have presently is not fully robust to thread races, specially in shutdown code. I didn't have full confidence that we have fixed /all/ the issues in my previous set of fixes.

Can I haz access to the FreeBSD machine again? I can take a look.

@Trott
Copy link
Member Author

Trott commented Sep 14, 2018

Can I haz access to the FreeBSD machine again? I can take a look.

@ofrobots Your access has been restored. (Thanks, @refack!) Go for it!

Ref: nodejs/build#1489 (comment)

@ofrobots
Copy link
Contributor

ofrobots commented Sep 17, 2018

From a core dump on FreeBSD:

Thread 1 (process 100772):
#0  0x000000080349247a in thr_kill () from /lib/libc.so.7
#1  0x0000000803492444 in raise () from /lib/libc.so.7
#2  0x00000008034923b9 in abort () from /lib/libc.so.7
#3  0x00000000009fcd74 in uv_mutex_lock (mutex=<value optimized out>) at ../deps/uv/src/unix/thread.c:275
#4  0x000000000097bb8c in node::tracing::InternalTraceBuffer::AddTraceEvent ()
#5  0x00000000013d6622 in v8::platform::tracing::TracingController::AddTraceEvent ()
#6  0x00000000009417b7 in node::(anonymous namespace)::PlatformWorkerThread ()
#7  0x00000008031a8c06 in pthread_create () from /lib/libthr.so.3
#8  0x0000000000000000 in ?? ()

In the meanwhile, the main thread was here:

Thread 2 (process 101594):
#0  0x00000008031b6b2c in pthread_cleanup_pop () from /lib/libthr.so.3
#1  0x00000000009fccae in uv_thread_join (tid=<value optimized out>) at ../deps/uv/src/unix/thread.c:216
#2  0x0000000000941b29 in node::WorkerThreadsTaskRunner::Shutdown ()
#3  0x0000000000942dfa in node::NodePlatform::Shutdown ()
#4  0x00000000008c5dff in node::Start ()
#5  0x0000000000880095 in _start ()
#6  0x0000000802263000 in ?? ()
#7  0x0000000000000000 in ?? ()

ofrobots added a commit to ofrobots/node that referenced this issue Sep 20, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: nodejs#22865
@Trott
Copy link
Member Author

Trott commented Sep 23, 2018

No surprise here, as #22938 hasn't landed yet, but still a thing:

https://ci.nodejs.org/job/node-test-commit-freebsd/20670/nodes=freebsd11-x64/console

00:10:49 not ok 1911 parallel/test-trace-events-fs-sync # TODO : Fix flaky test
00:10:49   ---
00:10:49   duration_ms: 1.387
00:10:49   severity: flaky
00:10:49   exitcode: 1
00:10:49   stack: |-
00:10:49     /usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:143
00:10:49         throw new Error(`${tr}:\n${util.inspect(proc)}`);
00:10:49         ^
00:10:49     
00:10:49     Error: fs.sync.fstat:
00:10:49     { status: null,
00:10:49       signal: 'SIGABRT',
00:10:49       output: [ null, '', '' ],
00:10:49       pid: 13999,
00:10:49       stdout: '',
00:10:49       stderr: '' }
00:10:49         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:143:11)
00:10:50         at Module._compile (internal/modules/cjs/loader.js:703:30)
00:10:50         at Object.Module._extensions..js (internal/modules/cjs/loader.js:714:10)
00:10:50         at Module.load (internal/modules/cjs/loader.js:613:32)
00:10:50         at tryModuleLoad (internal/modules/cjs/loader.js:552:12)
00:10:50         at Function.Module._load (internal/modules/cjs/loader.js:544:3)
00:10:50         at Function.Module.runMain (internal/modules/cjs/loader.js:756:12)
00:10:50         at startup (internal/bootstrap/node.js:302:19)
00:10:50         at bootstrapNodeJSCore (internal/bootstrap/node.js:855:3)
00:10:50   ...

@refack refack mentioned this issue Oct 1, 2018
3 tasks
Trott pushed a commit to Trott/io.js that referenced this issue Oct 6, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: nodejs#22865

PR-URL: nodejs#22938
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@Trott
Copy link
Member Author

Trott commented Oct 6, 2018

Fixed in 68b3e46

@Trott Trott closed this as completed Oct 6, 2018
ofrobots added a commit to ofrobots/node that referenced this issue Oct 13, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: nodejs#22865

PR-URL: nodejs#22938
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
ofrobots added a commit that referenced this issue Oct 15, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: #22865

PR-URL: #22938
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
jasnell pushed a commit that referenced this issue Oct 17, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: #22865

PR-URL: #22938
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
MylesBorins pushed a commit that referenced this issue Oct 30, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: #22865

PR-URL: #22938
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
rvagg pushed a commit that referenced this issue Nov 28, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: #22865

PR-URL: #22938
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>