Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate test-child-process-fork-closed-channel-segfault #20836

Closed
BridgeAR opened this issue May 19, 2018 · 18 comments
Closed

Investigate test-child-process-fork-closed-channel-segfault #20836

BridgeAR opened this issue May 19, 2018 · 18 comments
Labels
child_process Issues and PRs related to the child_process subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.

Comments

@BridgeAR
Copy link
Member

https://ci.nodejs.org/job/node-test-binary-windows/17414/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=3/console

not ok 51 parallel/test-child-process-fork-closed-channel-segfault
  ---
  duration_ms: 0.259
  severity: fail
  exitcode: 1
  stack: |-
    c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:70
                throw err;
                ^
    
    Error: write EMFILE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at Worker.send (internal/cluster/worker.js:40:28)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:36:16)
        at Object.onceWrapper (events.js:273:13)
        at Socket.emit (events.js:182:13)
        at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1147:10)

Likely related:

not ok 52 parallel/test-child-process-fork-net
  ---
  duration_ms: 0.354
  severity: fail
  exitcode: 1
  stack: |-
    PARENT: server listening
    CHILD: server listening
    CLIENT: connected
    PARENT: got connection
    CLIENT: connected
    CHILD: got connection
    CLIENT: closed
    CHILD: got connection
    CHILD: got connection
    CLIENT: closed
    CLIENT: connected
    CLIENT: connected
    CLIENT: closed
    CLIENT: closed
    PARENT: server closed
    CHILD: got socket
    CLIENT: got data
    CLIENT: closed
    testSocket, listening
    events.js:167
          throw er; // Unhandled 'error' event
          ^
    
    Error: write EPIPE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at SocketListSend._request (internal/socket_list.js:20:16)
        at SocketListSend.close (internal/socket_list.js:40:10)
        at Server.close (net.js:1624:24)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-net.js:179:16)
        at Socket.emit (events.js:182:13)
        at TCP._handle.close [as _onclose] (net.js:596:12)
    Emitted 'error' event at:
        at process.nextTick (internal/child_process.js:745:39)
        at process._tickCallback (internal/process/next_tick.js:61:11)
@BridgeAR BridgeAR added child_process Issues and PRs related to the child_process subsystem. windows Issues and PRs related to the Windows platform. test Issues and PRs related to the tests. flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels May 19, 2018
@gireeshpunathil
Copy link
Member

For the first one, EMFILE is predicted to occur according to this comment but not sure why it wasn't captured in the test.

@Trott
Copy link
Member

Trott commented May 20, 2018

@gireeshpunathil The test checks that err.message is 'Channel closed' and it's not for EMFILE. Not sure if that's a recent change to the message or what. The code that checks for that string was added in #8954 by @santigimeno (when the test had a different name--it was subsequently renamed).

@nodejs/testing @nodejs/platform-windows @nodejs/child_process

@Trott
Copy link
Member

Trott commented May 20, 2018

(Interestingly, the string was 'channel closed' in the original PR but is now 'Channel closed' with different capitalization.)

@BridgeAR
Copy link
Member Author

And another one (again both tests failed - I wonder if the first test has influence on the second one).

https://ci.nodejs.org/job/node-test-binary-windows/17446/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=3/console

@Trott
Copy link
Member

Trott commented May 21, 2018

One issue with test-child-process-fork-net is that is uses both console.log() and console.error(). The test runner does not preserve the interleaving of messages from stdout and stderr, so the messages as shown in CI are not the order they appear in when you run the test from the command line. PR to fix that aspect of things: #20873 A 👍 to fast-track that would be great.

@Trott
Copy link
Member

Trott commented May 22, 2018

Better output for test-child-process-fork-net (stdout and stderr properly interleaved):

not ok 52 parallel/test-child-process-fork-net
  ---
  duration_ms: 0.324
  severity: fail
  exitcode: 1
  stack: |-
    PARENT: server listening
    CHILD: server listening
    CLIENT: connected
    PARENT: got connection
    CLIENT: connected
    CHILD: got connection
    CLIENT: closed
    CHILD: got connection
    CHILD: got connection
    CLIENT: closed
    CLIENT: connected
    CLIENT: connected
    CLIENT: closed
    CLIENT: closed
    PARENT: server closed
    testSocket, listening
    CHILD: got socket
    CLIENT: got data
    CLIENT: closed
    events.js:167
          throw er; // Unhandled 'error' event
          ^
    
    Error: write EPIPE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at SocketListSend._request (internal/socket_list.js:20:16)
        at SocketListSend.close (internal/socket_list.js:40:10)
        at Server.close (net.js:1624:24)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-net.js:178:16)
        at Socket.emit (events.js:182:13)
        at TCP._handle.close [as _onclose] (net.js:596:12)
    Emitted 'error' event at:
        at process.nextTick (internal/child_process.js:745:39)
        at process._tickCallback (internal/process/next_tick.js:61:11)
  ...

@Trott
Copy link
Member

Trott commented May 23, 2018

I tried to run Windows stress tests for this to see how often it fails but after compiling for a long time, the builds hit a three-minute timeout and don't run the tests. See https://ci.nodejs.org/job/node-stress-single-test/1858/nodes=win2016-1p-vs2017/console and https://ci.nodejs.org/job/node-stress-single-test/1857/nodes=win2016-1p-vs2017/console.

19:31:05      Creating library c:\workspace\node-stress-single-test\nodes\win2016-1p-vs2017\Release\mksnapshot.lib and object c:\workspace\node-stress-single-test\nodes\win2016-1p-vs2017\Release\mksnapshot.exp
19:31:05   Generating code
19:34:05 Build timed out (after 3 minutes). Marking the build as aborted.
19:34:05 Build was aborted
19:34:05 Notifying upstream projects of job completion
19:34:05 Finished: ABORTED

@nodejs/build

@bzoz
Copy link
Contributor

bzoz commented May 23, 2018

On my Win 10 box test-child-process-fork-closed-channel-segfault fails about 6% times about parallel/test-child-process-fork-net fails about once every 30k runs.

@apapirovski
Copy link
Member

@bzoz any chance you could have a look at test-http2-pipe and see if it still fails under load? (unrelated to this issue, of course) If so, could you build with --debug-http2 --debug-nghttp2 and post the debugging info from a failed run to this issue: #20750

It would really be appreciated!

@bzoz
Copy link
Contributor

bzoz commented May 23, 2018

@apapirovski sure, I'll take a look

@apapirovski
Copy link
Member

apapirovski commented May 23, 2018

Just my 2c but this test has been flaky since its introduction, only very infrequently replicates the condition on the originally bugged version (Node 4.0.0) and is unlikely to occur exactly like this again. As such, I think we should scrap the test and move the one important check (assert.strictEqual(worker.process.channel, null);) into another test that uses child_process in a similar manner.

If we try to fix it yet again, it'll be a 6th PR to make this test less flaky. There are currently 5 separate error conditions that we ignore in this. We might be adding a sixth. IMO it's more trouble than it's worth.

@MylesBorins
Copy link
Contributor

@MylesBorins
Copy link
Contributor

@Trott
Copy link
Member

Trott commented May 25, 2018

Failure on node-daily-master task:

https://ci.nodejs.org/job/node-test-binary-windows/17579/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=1/console

Host: test-azure_msft-win2016-x64-5

not ok 51 parallel/test-child-process-fork-closed-channel-segfault
  ---
  duration_ms: 0.299
  severity: fail
  exitcode: 1
  stack: |-
    c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:70
                throw err;
                ^
    
    Error: write EMFILE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at Worker.send (internal/cluster/worker.js:40:28)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:36:16)
        at Object.onceWrapper (events.js:273:13)
        at Socket.emit (events.js:182:13)
        at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1147:10)
  ...

@gireeshpunathil
Copy link
Member

ok - I guess know the reason for failure (though cannot test now, about to leave for the day.)

#3635 (comment) (refefenced in the test case) was applicable to this line - connection establishment to a server whose state is arbitrary, is prone to 3 types of errors, and hence we checked and skipped them in the client socket's error handler.

This was almost 3 years back!

Now it turns out to be that this line too is prone to the same problem, under similar circumstances - and that makes sense, as the server can close any time between client connecting and worker sending messages.

In summary, we need an error callback for worker object too, that has the same code as in the client socket callback.

@mmarchini
Copy link
Contributor

Is parallel/test-child-process-fork-net actually related or should we open another issue to fix it?

@gireeshpunathil
Copy link
Member

@mmarchini - looking the failing stack, I believe it is the same issue. So you can apply the same patch as in #20973 over there too, if you are going for a PR. However, I have found issues with it, and please wait while we fix this.

@mmarchini
Copy link
Contributor

@gireeshpunathil do you mind if I open a PR to fix parallel/test-child-process-fork-net as well?

addaleax pushed a commit that referenced this issue May 31, 2018
In test-child-process-fork-closed-channel-segfault.js, race condition
is observed between the server getting closed and the worker sending
a message. Accommodate the potential errors.

Earlier, the same race was observed between the client and server
and was addressed through ignoring the relevant errors through error
handler. The same mechanism is re-used for worker too.

The only difference is that the filter is applied at the callback
instead of at the worker's error listener.

Refs: #3635 (comment)
Fixes: #20836
PR-URL: #20973

Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Matheus Marchini <matheus@sthima.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Bartosz Sosnowski <bartosz@janeasystems.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
child_process Issues and PRs related to the child_process subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants