Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: improve test-child-process-fork-dgram #8697

Closed
wants to merge 1 commit into from

Conversation

mhdawson
Copy link
Member

@mhdawson mhdawson commented Sep 21, 2016

Checklist
  • make -j4 test (UNIX), or vcbuild test nosign (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

test

Description of change

Previous implementation was subject to timing out as there
was no guarantee that the parent or child would ever
receive a message. Modified test so that once parent or
child receives a message it stops receiving them so that
we are guaranteed the next message will be received by the
other end.

The test also left lingering processes when it timed out
so pulled the timeout into the test itself so that it
could properly cleanup in case of failure.

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Sep 21, 2016
@mscdex mscdex added dgram Issues and PRs related to the dgram subsystem / UDP. child_process Issues and PRs related to the child_process subsystem. labels Sep 21, 2016
Copy link
Member

@imyller imyller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gibfahn gibfahn mentioned this pull request Sep 21, 2016
3 tasks
@gibfahn
Copy link
Member

gibfahn commented Sep 21, 2016

Commit message nit:

Modified test so that once parent or child receives a message it stops receiving them so that we are guaranteed the next message will be received by the other end.

Maybe: Modified test so that once parent or child receives a message it closes its connection to the socket, so we can guarantee the next message will be received by the other node.

* a message. In this case the test runner will timeout after 60 secs
* and the test will fail.
* message. When either the parent or child receives a message we close
* the server on that side so the next message is guarranteed to be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/guarranteed/guaranteed/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok will update commit message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhdawson The typo is in the test comment on line 12.

client.send(
msg,
0,
msg.length,
server.address().port,
serverPort,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to add the serverPort variable? It only seems to be used here, and makes things a bit more complicated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its needed because the server is closed before we have stopped sending. Therefore, you can't call server.address().port. I found this the hard way while building up the changeset.

@gibfahn
Copy link
Member

gibfahn commented Sep 21, 2016

So if we're now closing the child/parent server once they receive a message, the test should finish after two messages unless some are disappearing (this is UDP so I guess messages can fail). I haven't seen any dropped messages at all in any of the test runs, so having a 45s internal timeout (i.e. 45,000 messages) may be overkill?

Changes LGTM if CI passes (probably also worth running a stress test on AIX as well to make sure).

@gibfahn
Copy link
Member

gibfahn commented Sep 21, 2016

Also just to document, this fixes #8271

@mhdawson
Copy link
Member Author

Already ran the stress run of 500 on AIX and all passed versus about at 1/10 failure before.

CI run here: https://ci.nodejs.org/job/node-test-pull-request/4223/

@mhdawson
Copy link
Member Author

In terms of the timeout as long as it is shorter than the 60s from test harness I don't think it really matters as we will normally finish fast anyway.

@mhdawson
Copy link
Member Author

Commit message updated.

@mhdawson
Copy link
Member Author

mhdawson commented Sep 22, 2016

Lots of failures in the CI run, but none related to the test being changed. Several test tick processor related timeouts as well as checkouts failing across a number of platforms.
I'll try again later when things seem quieter.

@mhdawson
Copy link
Member Author

@mhdawson
Copy link
Member Author

Ok second CI run only had failures (windows and smartos) related to test-tick-processor which were pre-existing and not related to the changes to this test.

@gibfahn
Copy link
Member

gibfahn commented Sep 23, 2016

@mhdawson You missed a nit on L20, otherwise LGTM.

@misterdjules LGTY? (also @AndreasMadsen , but feel free to ignore if you don't think you could usefully review).

@AndreasMadsen
Copy link
Member

@gibfahn Could you explain why this comment isn't a concern here:

Not too sure. If the intention of the test was that messages reach both the parent and the child while both are listening, then probably the fix is not correct. If not, probably it is.
by @santigimeno #8549 (comment)

@gibfahn
Copy link
Member

gibfahn commented Sep 23, 2016

@AndreasMadsen I think the point is that due to the unreliable nature of the protocol, we have no grounds to expect that messages will reach both while both are listening. The system may route all messages to either child or parent.

Previous implementation was subject to timing out as there
was no guarantee that the parent or child would ever
receive a message.  Modified test so that once parent or
child receives a message it closes its connection to the
socket so we can guarantee the next messsage will be recieved
by the other node.

The test also left lingering processes when it timed out
so pulled the timeout into the test itself so that it
could properly cleanup in case of failure.
@mhdawson
Copy link
Member Author

@gibfahn thats for the catch, think I got the nit this time :)

@gibfahn
Copy link
Member

gibfahn commented Sep 23, 2016

LGTM

@imyller imyller self-assigned this Sep 23, 2016
@imyller
Copy link
Member

imyller commented Sep 23, 2016

@mhdawson
Copy link
Member Author

There are 2 hung processes on test-osuosl-aix61-ppc64_be-1, one from Oct 3 and one from Oct 7 so I think the problem still exists unless it was fixed by something after Oct 7.

@mhdawson
Copy link
Member Author

@gibfahn I think the two get separate copies of the handle so one closing may not affect the other. It should however be possible to write a test that validates it one way or the other by making sure it either passes/fails after one side has done the close.

@mhdawson
Copy link
Member Author

Lastest process on test-osuosl-aix61-ppc64_be-1 is from Oct 8th.

@Trott
Copy link
Member

Trott commented Oct 13, 2016

Annnnnddddd....now it's failing consistently on freebsd in CI.

So, this is probably a thing again.

I guess someone should write some code to confirm the speculation in #8697 (comment) that the two copies get separate handles and then, if so, this can get a rebase and land? Or am I mistaken? /cc @mhdawson @gibfahn

@gibfahn
Copy link
Member

gibfahn commented Oct 13, 2016

@Trott Well we need to work out why it's failing on macOS 10.12 first I think

@Trott
Copy link
Member

Trott commented Oct 13, 2016

Ooof. OK, for fun, or at least "fun", CI stress test across all available platforms: https://ci.nodejs.org/job/node-stress-single-test/999/

@mhdawson
Copy link
Member Author

@gibfahn and I did discuss writing the test to confirm close behavior across processes. I think he'll be looking at that and the failures on macOS 10.12 soon.

@santigimeno
Copy link
Member

I think the OS X could be similar to the one described in #7512: if for some reason, the server on the parent side is closed before the handle has been received by the child process, the server remains closed in the child, thus never receiving any message. This patch, fixes the issue for me:

diff --git a/test/parallel/test-child-process-fork-dgram.js b/test/parallel/test-child-process-fork-dgram.js
index 57d8fb0..bd72631 100644
--- a/test/parallel/test-child-process-fork-dgram.js
+++ b/test/parallel/test-child-process-fork-dgram.js
@@ -39,6 +39,7 @@ if (process.argv[2] === 'child') {
         server.close();
       });

+      process.send('handleReceived');
     } else if (msg === 'stop') {
       process.removeListener('message', removeMe);
       if (!serverClosed) {
@@ -69,13 +70,13 @@ if (process.argv[2] === 'child') {
     serverPort = server.address().port;
     child.send('server', server);

-    child.once('message', function(msg) {
+    child.on('message', function(msg) {
       if (msg === 'gotMessage') {
         childGotMessage = true;
+      } else if (msg === 'handleReceived') {
+        sendMessages();
       }
     });
-
-    sendMessages();
   });

   var iterations = 0;

@gibfahn
Copy link
Member

gibfahn commented Oct 14, 2016

@santigimeno This might well be the reason that the test was failing on AIX as well. If it fixes the test for all platforms, then I think it is preferable over this PR.

@Trott
Copy link
Member

Trott commented Oct 14, 2016

@santigimeno @gibfahn That unfortunately does not fix it on FreeBSD on our CI, but it is a good thing to add to the test anyway, IMO. And if it clears it up on AIX and/or macos, awesome.

@Trott
Copy link
Member

Trott commented Oct 14, 2016

Perhaps this is a libuv issue ultimately? Like, ignorant guess here, but maybe this bit of code explains why the test is reliable on Linux, but not so on other Unices? Or something else along those lines?

EDIT: Eh, or maybe not...

Trott added a commit to Trott/io.js that referenced this pull request Oct 14, 2016
`test-child-process-fork-dgram` is unreliable on some platforms,
especially FreeBSD and AIX within the project's continuous integration
testing. It has also been observed to be flaky on macos.

* Confirm child has received the server before sending packets
* Close the server instance on the parent or child after receiving a

Refs: nodejs#8697
Fixes: nodejs#8949
Fixes: nodejs#8271
@Trott
Copy link
Member

Trott commented Oct 14, 2016

Alternative fix: #9098

Haven't tested it yet on AIX. About to run CI. But if someone is doing these tests locally and whatnot, maybe give it a shot?

That alternative definitely works on macos and works on FreeBSD.

Trott added a commit to Trott/io.js that referenced this pull request Oct 18, 2016
`test-child-process-fork-dgram` is unreliable on some platforms,
especially FreeBSD and AIX within the project's continuous integration
testing. It has also been observed to be flaky on macos.

* Confirm child has received the server before sending packets
* Close the server instance on the parent or child after receiving a

Refs: nodejs#8697
Fixes: nodejs#8949
Fixes: nodejs#8271
PR-URL: nodejs#9098
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
@rvagg rvagg force-pushed the master branch 2 times, most recently from c133999 to 83c7a88 Compare October 18, 2016 17:02
@mhdawson
Copy link
Member Author

Closing since alternative has landed

@mhdawson mhdawson closed this Oct 18, 2016
jasnell pushed a commit that referenced this pull request Oct 18, 2016
`test-child-process-fork-dgram` is unreliable on some platforms,
especially FreeBSD and AIX within the project's continuous integration
testing. It has also been observed to be flaky on macos.

* Confirm child has received the server before sending packets
* Close the server instance on the parent or child after receiving a

Refs: #8697
Fixes: #8949
Fixes: #8271
PR-URL: #9098
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Trott added a commit to Trott/io.js that referenced this pull request Nov 12, 2016
`test-child-process-fork-dgram` is unreliable on some platforms,
especially FreeBSD and AIX within the project's continuous integration
testing. It has also been observed to be flaky on macos.

* Confirm child has received the server before sending packets
* Close the server instance on the parent or child after receiving a

Refs: nodejs#8697
Fixes: nodejs#8949
Fixes: nodejs#8271
PR-URL: nodejs#9098
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Trott added a commit to Trott/io.js that referenced this pull request Nov 12, 2016
`test-child-process-fork-dgram` is unreliable on some platforms,
especially FreeBSD and AIX within the project's continuous integration
testing. It has also been observed to be flaky on macos.

* Confirm child has received the server before sending packets
* Close the server instance on the parent or child after receiving a

Refs: nodejs#8697
Fixes: nodejs#8949
Fixes: nodejs#8271
PR-URL: nodejs#9098
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
MylesBorins pushed a commit that referenced this pull request Nov 14, 2016
`test-child-process-fork-dgram` is unreliable on some platforms,
especially FreeBSD and AIX within the project's continuous integration
testing. It has also been observed to be flaky on macos.

* Confirm child has received the server before sending packets
* Close the server instance on the parent or child after receiving a

Refs: #8697
Fixes: #8949
Fixes: #8271
PR-URL: #9098
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
MylesBorins pushed a commit that referenced this pull request Nov 15, 2016
`test-child-process-fork-dgram` is unreliable on some platforms,
especially FreeBSD and AIX within the project's continuous integration
testing. It has also been observed to be flaky on macos.

* Confirm child has received the server before sending packets
* Close the server instance on the parent or child after receiving a

Refs: #8697
Fixes: #8949
Fixes: #8271
PR-URL: #9098
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
@mhdawson mhdawson deleted the fork branch March 15, 2017 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
child_process Issues and PRs related to the child_process subsystem. dgram Issues and PRs related to the dgram subsystem / UDP. test Issues and PRs related to the tests. wip Issues and PRs that are still a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants