
ARM failures in CI #951

Closed
gibfahn opened this issue Oct 27, 2017 · 16 comments

@gibfahn
Member

gibfahn commented Oct 27, 2017

I thought it might be a good idea to have a tracker issue for ARM failures. I see a lot of them, but I'm not sure whether it's the same failures again and again or something new.

1 comment per new type of failure is probably a good starting point.

cc/ @rvagg

@gibfahn
Member Author

gibfahn commented Oct 27, 2017

Failure:

https://ci.nodejs.org/job/node-test-binary-arm/11211/RUN_SUBSET=3,label=pi1-raspbian-wheezy/console

Building remotely on test-requireio_chrislea-debian7-arm_pi1p-1 (pi1-raspbian-wheezy) in workspace /home/iojs/build/workspace/node-test-binary-arm

+ git clean -fdx
warning: failed to remove out/Release/.nfs00000000000f537d000002f9

@refack
Contributor

refack commented Oct 27, 2017

I had an idea, as a step toward cataloging these for a HOWTO: assign a "code" to each kind of failure, so we can count them and attach troubleshooting steps:

  • failed-to-remove-.nfs has the following signature:
    + git clean -fdx
    warning: failed to remove out/Release/.nfs00000000000f537d000002f9
    
    possible step for resolution:
    • (build/test): check that there are no [node] <defunct> zombies.
      If there are, they need to be sudo-killed.
ps -ef | grep '\[node\] <defunct>' # To check (brackets escaped so grep matches the literal ps output)
ps -ef | grep '\[node\] <defunct>' | awk '{print $2}' | xargs -r sudo kill # To clean up (-r skips kill when nothing matched)

Seen several times recently:

2017-10-25

  • test-requireio_chrislea-debian7-arm_pi1p-1
  • probably the same incident as above
  • Trott took machine offline @ 16:30EDT
  • resolved by refack @ 18:00EDT (zombie killed)

2017-10-17

  • test-requireio_securogroup-debian7-arm_pi1p-1 & test-requireio_bengl-debian7-arm_pi1p-2
  • similar signatures
  • resolved by rvagg @ 2017-10-18-08:00EDT (zombie killed)

2017-10-29

  • test-requireio_davglass-debian7-arm_pi1p-1
  • signature a little different:
    + git clean -fdx
    warning: failed to remove out/
    Removing out/
    Build step 'Execute shell' marked build as failure
    
  • [node] <defunct> confirmed and killed by refack @ 2017-10-29-11:00EDT

@gibfahn
Member Author

gibfahn commented Oct 27, 2017

@refack great idea, but I'd add something else to the Resolution: section: who has the necessary access (node/build, node/infra, just Rod, etc.).

@rvagg
Member

rvagg commented Nov 4, 2017

@refack are you offering to start a HOWTO somewhere?

@gibfahn
Member Author

gibfahn commented Nov 4, 2017

I'm thinking this might be a good use for the wiki: stuff that changes often, that we don't really need to worry about source control for, basically a scratchpad for anyone to edit.

Looks like Johan had a similar idea a while ago.

@rvagg
Member

rvagg commented Nov 6, 2017

I'm starting to keep notes now on my maintenance of the cluster; I can drop them in here as well if that's helpful. If the same machines keep erroring even after their SD cards are replaced, we might be able to pinpoint ones that need to be retired.

Today I r/w tested the SD cards on test-requireio_bengl-debian7-arm_pi1p-1, test-requireio_mhdawson-debian7-arm_pi1p-1 and test-requireio_securogroup-debian7-arm_pi1p-1. Notice they are listed above in @refack's comment; they show up regularly, and I'm pretty sure I've reprovisioned them before. test-requireio_ceejbot-debian7-arm_pi1p-1 is another one that shows up more often than I'd like. I don't have good enough records to make a solid assessment, though, so I'm just going with SD card testing for now.

I've thrown one of the cards out, inserted a new one, and set these three up from scratch; they are back in the cluster.

Also, I'm hoping that pulling back on the overclocking on these will give us more stability. We'll see.
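
The exact test used isn't recorded here; a common non-destructive read/write check on Linux looks something like the following sketch (the device path is a placeholder, and the card must be unmounted first):

# Non-destructive read-write surface scan of an SD card (device path is hypothetical).
# -n rewrites each block with its original contents, -s shows progress, -v reports errors.
sudo badblocks -nsv /dev/sdX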

@maclover7 maclover7 added the infra label Nov 7, 2017
@refack
Contributor

refack commented Nov 14, 2017

I've added the "kill defunct" line to the job config (after manual testing and one mistake):
https://ci.nodejs.org/job/node-test-binary-arm/jobConfigHistory/showDiffFiles?timestamp1=2017-11-05_03-13-42&timestamp2=2017-11-14_06-49-15
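
For reference, the added cleanup corresponds to the pipeline visible in the console output quoted further down in this thread (reconstructed from that log, so the exact job-config wording may differ slightly):

# Reap zombie ("defunct") node processes left over from earlier runs.
# The brackets are escaped so grep matches the literal "[node] <defunct>" in the ps output
# without matching its own command line; xargs -rl runs one kill per matched PID and does
# nothing when there are no matches.
ps -ef | grep '\[node\] <defunct>' | awk '{print $2}' | xargs -rl kill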

@BridgeAR
Member

There are multiple builds where pretty much all ARM runs failed. See e.g.:
https://ci.nodejs.org/job/node-test-binary-arm/12674/
https://ci.nodejs.org/job/node-test-binary-arm/12685/

@Trott
Member

Trott commented Dec 19, 2017

I'm removing stale .git/index.lock files on the Raspberry Pi devices as I find them, but I don't know the cause.
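
A minimal sketch of that kind of manual cleanup, assuming the workspace path seen in the console logs above and an arbitrary one-hour staleness threshold (the actual commands used aren't recorded here):

# List and remove .git/index.lock files older than an hour under the Jenkins workspace.
# A fresh lock may belong to a live git process and should be left alone.
find /home/iojs/build/workspace -path '*/.git/index.lock' -mmin +60 -print -delete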

@Trott
Member

Trott commented Dec 19, 2017

Also seeing multiple failures that look like this:

11:56:00 Started by upstream project "node-test-binary-arm" build number 12701
11:56:00 originally caused by:
11:56:00  Started by upstream project "node-test-commit-arm-fanned" build number 13496
11:56:00  originally caused by:
11:56:00   Started by upstream project "node-test-commit" build number 14936
11:56:00   originally caused by:
11:56:00    Started by upstream project "node-daily-master" build number 977
11:56:00    originally caused by:
11:56:00     Started by timer
11:56:00 [EnvInject] - Loading node environment variables.
11:56:01 Building remotely on test-requireio_rvagg-debian7-arm_pi2-1 (pi2-raspbian-wheezy) in workspace /home/iojs/build/workspace/node-test-binary-arm
11:56:03 [node-test-binary-arm] $ /bin/sh -xe /tmp/jenkins1159433904551122903.sh
11:56:03 + set +x
11:56:03 Tue Dec 19 16:56:03 UTC 2017
11:56:04 + pgrep node
11:56:04 7241
11:56:04 7247
11:56:04 7252
11:56:04 7253
11:56:04 7258
11:56:04 7260
11:56:04 7269
11:56:04 7274
11:56:04 7276
11:56:05 [node-test-binary-arm] $ /bin/bash -ex /tmp/jenkins199448051140569265.sh
11:56:05 + rm -rf RUN_SUBSET
11:56:05 + case $label in
11:56:05 + REF=cc-armv7
11:56:05 + REFERENCE_REFS=+refs/heads/master:refs/remotes/reference/master
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging'
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging'
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/reference/v7.x-staging'
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/reference/v7.x-staging +refs/heads/v8.x-staging:refs/remotes/reference/v8.x-staging'
11:56:05 + ORIGIN_REFS=+refs/heads/master:refs/remotes/origin/master
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging'
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/origin/v6.x-staging'
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/origin/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/origin/v7.x-staging'
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/origin/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/origin/v7.x-staging +refs/heads/v8.x-staging:refs/remotes/origin/v8.x-staging'
11:56:05 + git --version
11:56:06 git version 2.15.0
11:56:06 + git init
11:56:06 Reinitialized existing Git repository in /home/iojs/build/workspace/node-test-binary-arm/.git/
11:56:06 + git fetch --no-tags file:///home/iojs/.ccache/node.shared.reference +refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/reference/v7.x-staging +refs/heads/v8.x-staging:refs/remotes/reference/v8.x-staging
11:56:09 fatal: Couldn't find remote ref refs/heads/v7.x-staging
11:56:09 
11:56:09 real	0m3.441s
11:56:09 user	0m0.050s
11:56:09 sys	0m0.040s
11:56:09 + echo 'Problem fetching the shared reference repo.'
11:56:09 Problem fetching the shared reference repo.
11:56:09 + git fetch --no-tags file:///home/iojs/.ccache/node.shared.reference +refs/heads/jenkins-node-test-commit-arm-fanned-13496-binary-pi1p/cc-armv7:refs/remotes/jenkins_tmp
11:56:09 fatal: The remote end hung up unexpectedly
11:56:11 
11:56:11 real	0m1.893s
11:56:11 user	0m0.180s
11:56:11 sys	0m0.400s
11:56:11 + ps -ef
11:56:11 + grep '\[node\] <defunct>'
11:56:11 + awk '{print $2}'
11:56:11 + xargs -rl kill
11:56:11 + rm -f ****
11:56:11 + git checkout -f refs/remotes/jenkins_tmp
11:56:22 HEAD is now at 6c29aa6896... added binaries
11:56:22 
11:56:22 real	0m11.204s
11:56:22 user	0m1.480s
11:56:22 sys	0m9.850s
11:56:22 + git reset --hard
11:56:26 HEAD is now at 6c29aa6896 added binaries
11:56:26 
11:56:26 real	0m3.466s
11:56:26 user	0m1.460s
11:56:26 sys	0m0.860s
11:56:26 + git clean -fdx
11:56:30 warning: failed to remove out/Release: Directory not empty
11:56:30 Removing config.gypi
11:56:30 Removing icu_config.gypi
11:56:30 Removing node
11:56:30 Removing out/Release/node
11:56:30 Removing out/Release/openssl-cli
11:56:30 Removing test.tap
11:56:30 Removing test/.tmp.0/
11:56:30 Removing test/abort/testcfg.pyc
11:56:30 Removing test/addons-napi/testcfg.pyc
11:56:30 Removing test/addons/testcfg.pyc
11:56:30 Removing test/async-hooks/testcfg.pyc
11:56:30 Removing test/doctool/testcfg.pyc
11:56:30 Removing test/es-module/testcfg.pyc
11:56:30 Removing test/gc/testcfg.pyc
11:56:30 Removing test/internet/testcfg.pyc
11:56:30 Removing test/known_issues/testcfg.pyc
11:56:30 Removing test/message/testcfg.pyc
11:56:30 Removing test/parallel/testcfg.pyc
11:56:30 Removing test/pseudo-tty/testcfg.pyc
11:56:30 Removing test/pummel/testcfg.pyc
11:56:30 Removing test/sequential/testcfg.pyc
11:56:30 Removing test/testpy/__init__.pyc
11:56:30 Removing test/tick-processor/testcfg.pyc
11:56:30 Removing test/timers/testcfg.pyc
11:56:30 Removing tools/test.pyc
11:56:30 Removing tools/utils.pyc
11:56:31 Build step 'Execute shell' marked build as failure
11:56:31 TAP Reports Processing: START
11:56:31 Looking for TAP results report in workspace using pattern: *.tap
11:56:32 Did not find any matching files. Setting build result to FAILURE.
11:56:32 Checking ^not ok
11:56:32 Jenkins Text Finder: File set '*.tap' is empty
11:56:32 Notifying upstream projects of job completion
11:56:32 Finished: FAILURE

@Trott
Member

Trott commented Dec 19, 2017

Note that the above is the aforementioned warning: failed to remove out/Release: Directory not empty, but there are other errors above it that may or may not be significant, like:

11:56:09 + git fetch --no-tags file:///home/iojs/.ccache/node.shared.reference +refs/heads/jenkins-node-test-commit-arm-fanned-13496-binary-pi1p/cc-armv7:refs/remotes/jenkins_tmp
11:56:09 fatal: The remote end hung up unexpectedly
11:56:11 

@Trott
Member

Trott commented Dec 19, 2017

Certainly some NFS issues showing up now:

12:11:32 + git clean -fdx
12:11:40 warning: failed to remove out/Release/.nfs00000000000b132d00000005: Device or resource busy
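
Those .nfsXXXXXXXX files are NFS "silly rename" placeholders: the client renames a file that gets deleted while a local process still has it open, and the placeholder can't be removed until that process exits. One way to see what is holding it, assuming fuser or lsof is available on the worker:

# Show which local processes still have the silly-renamed file open (path taken from the log above).
fuser -v out/Release/.nfs00000000000b132d00000005
# or, if lsof is installed:
lsof out/Release/.nfs00000000000b132d00000005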

@Trott
Member

Trott commented Dec 19, 2017

But it seems to be self-healing. https://ci.nodejs.org/job/node-test-binary-arm/12703/ is now 2/3 green and looking promising. All I've done is remove stale .git/index.lock files and re-run the one CI job a few times.

@gibfahn
Member Author

gibfahn commented Dec 20, 2017

FWIW our (IBM) Jenkins farm used to have all the workspaces running on a shared NFS mount, but we moved to having everything local because it kept causing problems like this.

@Trott
Member

Trott commented Dec 20, 2017

FWIW our (IBM) Jenkins farm used to have all the workspaces running on a shared NFS mount, but we moved to having everything local because it kept causing problems like this.

I imagine that's probably not an option with Raspberry Pi devices. :-( Still, good to know.

@rvagg
Member

rvagg commented Dec 21, 2017

This is caused by the NFS server having some internal problems, twice now in a few days. I'm a bit embarrassed to admit that it's likely to do with the hot weather we've been having down here (no, I don't have a cooled datacenter in my garage, unfortunately). I've done some restarting and cleaning up, and I have a couple of jobs running at the moment that seem to indicate it's all good now.

@rvagg rvagg closed this as completed Dec 21, 2017