test: fix flaky test-cluster-shared-leak.js #4173

Closed
wants to merge 1 commit into from


Member

@Trott Trott commented Dec 6, 2015

test-cluster-shared-leak.js was flaky because a worker can emit EPIPE.
This error event is expected.

Fixes: #3956

@Trott Trott added the cluster, windows, and test labels Dec 6, 2015
Member Author

Trott commented Dec 6, 2015

One thing that needs to be checked is that the test with this change still fails on Windows with Node 4.2.1. (The test checks for a bug in Node 4.2.1.)

Member Author

Trott commented Dec 6, 2015

@@ -15,6 +15,11 @@ if (cluster.isMaster) {
worker1 = cluster.fork();
worker1.on('message', common.mustCall(function() {
worker2 = cluster.fork();
worker2.on('error', function(e) {
// EPIPE is OK on Windows
if ((! common.isWindows) || (e.code !== 'EPIPE'))
Contributor

Aside from some style issues, would an opposite check be more clear? i.e. if (common.isWindows && e.code === 'EPIPE') return;?

Member Author

I was going to write it that way but I went with this to more closely mirror the (simpler) condition involving ECONNRESET later in the file. You're right, though, and I'll switch it.
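As a quick sanity check on the suggestion above, the two conditions are logical complements of each other (De Morgan), so switching forms does not change which errors are rethrown. The sketch below is only an illustration: `shouldThrowOriginal` and `shouldThrowSuggested` are hypothetical names, and `isWindows` stands in for `common.isWindows`.

```javascript
'use strict';

// Original form: rethrow unless we are on Windows AND the code is EPIPE.
function shouldThrowOriginal(isWindows, code) {
  return !isWindows || code !== 'EPIPE';
}

// Suggested form: return early only for EPIPE on Windows, rethrow otherwise.
function shouldThrowSuggested(isWindows, code) {
  if (isWindows && code === 'EPIPE') return false;
  return true;
}

// Compare both forms over every combination of the sample inputs.
const results = [];
for (const isWindows of [true, false]) {
  for (const code of ['EPIPE', 'ECONNRESET']) {
    results.push(
      shouldThrowOriginal(isWindows, code) ===
        shouldThrowSuggested(isWindows, code)
    );
  }
}

console.log(results.every(Boolean));
```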

Contributor

r-52 commented Dec 6, 2015

@Trott sorry - I saw this PR too late and landed your PR #4162 on master. Does it make sense to remove the flaky mark for this test?

Member Author

Trott commented Dec 6, 2015

@romankl No problem. The test should be marked flaky until this (or some other fix) lands, so that's totally fine. I'll rebase this and add the removal from the .status file. Thanks!

Member Author

Trott commented Dec 6, 2015

Fixed up per nits from @romankl and @Fishrock123

@@ -15,14 +15,21 @@ if (cluster.isMaster) {
worker1 = cluster.fork();
worker1.on('message', common.mustCall(function() {
worker2 = cluster.fork();
worker2.on('error', function(e) {
// EPIPE is OK on Windows
if (common.isWindows && e.code === 'EPIPE')
Member

I'm not too surprised it only happens on Windows, but I don't understand why you only need the check for the second worker.

Member Author

I wish I had a logical explanation for that too. I've been poking at it a bit, but I'm slowed by not having direct access to a Windows machine. I've been adding logging statements and running on CI which is obviously a limiting and slow process...

Member Author

I think I figured out why it's only afflicting worker2 and how to rewrite the code without the special handling for Windows.

worker2 gets EPIPE if we try to send to it before it's actually listening. This never happens with worker1 because by the time we get to worker1.send(), we're in the worker1 message handler, so it's obviously already listening. But there's no such protection for worker2. So the fix is to wait for the listening event to fire on cluster for worker2 before doing all the send() stuff. Running a stress test right now to confirm that's really the fix. Looks good so far. Will update this PR if that doesn't change. (Already confirmed that it still fails for Node 4.2.1, which is what we want.)
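For readers following along, the wait-until-listening idea described above could look roughly like the sketch below. It is not the code from this PR: `sendWhenListening` is a hypothetical helper, and `FakeWorker` is a stand-in built on `EventEmitter` so the example runs without forking real processes. A real test would attach to an actual cluster worker instead.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical helper: defer send() until the worker reports 'listening'.
function sendWhenListening(worker, message) {
  worker.once('listening', () => worker.send(message));
}

// Stand-in for a cluster worker that records what was sent to it.
class FakeWorker extends EventEmitter {
  constructor() {
    super();
    this.sent = [];
  }

  send(message) {
    this.sent.push(message);
  }
}

const worker2 = new FakeWorker();
sendWhenListening(worker2, 'hello');

// Nothing has been sent yet because 'listening' has not fired.
const sentBefore = worker2.sent.length;

worker2.emit('listening');

// The queued message is delivered once the worker is listening.
const sentAfter = worker2.sent.length;
```

Using `once` rather than `on` keeps the message from being resent if the event were ever emitted again.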

Trott added a commit to Trott/io.js that referenced this pull request Dec 8, 2015
test-cluster-shared-leak.js was flaky because a worker can emit EPIPE.
Wait for workers to be listening so that EPIPE does not happen.

Fixes: nodejs#3956
PR-URL: nodejs#4173
Member Author

Trott commented Dec 8, 2015

Member Author

Trott commented Dec 8, 2015

*sigh* CI is good but the stress test is not. Back to the drawing board...

Member Author

Trott commented Dec 8, 2015

Reverted to the previous workaround.

@bnoordhuis
Member

Too bad the wait-until-listening approach didn't work out. Doesn't the current approach of calling .disconnect() in both workers leave behind stray processes?

Member Author

Trott commented Dec 8, 2015

Yes, all routes so far are terrible, but I'll figure something out.

Member Author

Trott commented Dec 8, 2015

Right now, I'm focusing on an assertion that can be fired in internal/child_process.js. I'm hoping that fixing that bug will make this all workable. I'm unrealistically optimistic like that.

Member Author

Trott commented Dec 8, 2015

Here's the bug that I think/hope might be related to the issues here: #4205

Member Author

Trott commented Dec 27, 2015

Fix for #4205 has landed, so time to resume work on this issue...

Stress test with current master to confirm that this bug still exists:
https://ci.nodejs.org/job/node-stress-single-test/211/nodes=win2012r2/console

Trott added a commit to Trott/io.js that referenced this pull request Dec 27, 2015
Refactor test-cluster-shared-leak.js to remove flakiness on Windows.

Fixes: nodejs#3956
PR-URL: nodejs#4173
@Trott Trott changed the title from "test: accommodate EPIPE on Windows" to "test: fix flaky test-cluster-shared-leak.js" Dec 28, 2015
Member Author

Trott commented Dec 28, 2015

OK, current minimal fix works.

CI: https://ci.nodejs.org/job/node-test-commit/1555/

Stress test: https://ci.nodejs.org/job/node-stress-single-test/233/nodes=win2012r2/consoleFull

Here's the explanation:

There is no guarantee that the pipe will be there when worker2.send() fires. It will probably be there but sometimes not. This is an error in the code, but I have not been able to remove that error without removing the invariant firing in internal/child_process when the code is run in Node 4.2.1, which is the point of the test. However, the error in the code is kind of irrelevant anyway. The invariant should never fire under any circumstances. So I've added an error event listener on worker2 which simply swallows the error. The test still fires the invariant in Node 4.2.1 (which is the bug that this test is supposed to detect, so that's good) and works fine in subsequent versions (which are fixed, so that's good too).
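A minimal sketch of the error-listener approach described above follows. It is an assumption-laden illustration, not the code that landed: `tolerateEpipe` is a hypothetical helper, a plain `EventEmitter` stands in for `worker2`, and this variant rethrows anything other than `EPIPE`, which is stricter than simply swallowing every error.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical helper: ignore EPIPE on a worker-like emitter and
// rethrow any other error so unrelated failures stay visible.
function tolerateEpipe(worker) {
  worker.on('error', (err) => {
    if (err.code === 'EPIPE') return;
    throw err;
  });
}

// Plain emitter used as a stand-in for worker2.
const worker2 = new EventEmitter();
tolerateEpipe(worker2);

// Simulate the error a failed send() would surface on the worker.
const epipe = Object.assign(new Error('write EPIPE'), { code: 'EPIPE' });

// emit() returns true when at least one listener handled the event.
const handled = worker2.emit('error', epipe);
```

Without any `'error'` listener, an `EventEmitter` throws when `'error'` is emitted, which is why attaching a listener is enough to keep a stray `EPIPE` from failing the test.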

PTAL @bnoordhuis and anyone else interested. I'd really like to get rid of this flaky test!

(And pre-emptive strike: Yes, I or someone else should definitely squash this down to one commit before landing. It's been a journey. ¯\_(ツ)_/¯ )

EDIT: Red failures on Windows are a known flaky test (that I already have a PR open for to mark as flaky which I'll land momentarily) and a newly flaky but unrelated test. Hooray. :-| I'll look into it, but it shouldn't stop this from landing...

Member Author

Trott commented Dec 28, 2015

One more (hopefully last) time with feeling:

CI: https://ci.nodejs.org/job/node-test-commit/1558/
Stress: https://ci.nodejs.org/job/node-stress-single-test/246/nodes=win2012r2/console

(And the test still fails in Node 4.2.1, which is the last release that had the bug, so that's good.)

Member Author

Trott commented Dec 29, 2015

CI etc. looks good. Still needs an LGTM, though. /cc @nodejs/testing

One last cut-and-paste of the explanation:

There is no guarantee that the pipe will be there when worker2.send() fires. It will probably be there but sometimes not. This is an error in the code, but I have not been able to remove that error without removing the invariant firing in internal/child_process when the code is run in Node 4.2.1, which is the point of the test. However, the error in the code is kind of irrelevant anyway. The invariant should never fire under any circumstances. So I've added an error event listener on worker2 which simply swallows the error. The test still fires the invariant in Node 4.2.1 (which is the bug that this test is supposed to detect, so that's good) and works fine in subsequent versions (which are fixed, so that's good too).

Swallow EPIPE as it is expected to come up from time to time.
This does not invalidate the test.

Fixes: nodejs#3956
PR-URL: nodejs#4173
Member Author

Trott commented Jan 1, 2016

Closing in favor of #4510

@Trott Trott closed this Jan 1, 2016
@Trott Trott deleted the fix-3956 branch January 13, 2022 22:30
Labels
cluster, test, windows
5 participants