
Memory issue on fedora in latest V8 (8.8) requirement #2527

Closed
gengjiawen opened this issue Jan 25, 2021 · 25 comments

@gengjiawen
Member

According to /usr/bin/time -v on my machine, compilation of array-sort-tq-csa.o takes about 810 MB of memory.

Do you know if that's increased from before? It could very well be that this version of V8 has tipped the memory requirements for compilation such that 2GiB is no longer enough, and we need to add either more dedicated memory or swap to bring the Fedora hosts on par with the others (4GiB seems to be what other similar hosts are on, nodejs/node#36139 (comment)). Maybe open an issue over in nodejs/build?

Originally posted by @richardlau in nodejs/node#36139 (comment)
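For reference, the 810 MB figure presumably corresponds to GNU time's "Maximum resident set size" line. A minimal way to take such a measurement (the compile command here is illustrative only, not the exact invocation from the build logs):

/usr/bin/time -v g++ -c torque-generated/third_party/v8/builtins/array-sort-tq-csa.cc -o array-sort-tq-csa.o
# Peak memory appears as "Maximum resident set size (kbytes): ..." in the verbose output.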

gengjiawen changed the title from "Memory issue on latest V8 requirement" to "Memory issue on fedora in latest V8 (8.8) requirement" Jan 25, 2021
@richardlau
Member

From nodejs/node#36139 (comment):

Still failing: https://ci.nodejs.org/job/node-test-commit-linux/39436/nodes=fedora-latest-x64/console

It looks like the host is running out of memory.

09:56:58 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/test/torque/test-torque-tq-csa.o] Terminated
09:58:43 FATAL: command execution failed
09:58:43 java.nio.channels.ClosedChannelException

and from the system log on test-rackspace-fedora32-x64-1:

Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: A process of this unit has been killed by the OOM killer.
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Main process exited, code=exited, status=143/n/a
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Failed with result 'oom-kill'.
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 16h 38min 49.487s CPU time.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Scheduled restart job, restart counter is at 5.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: Stopped Jenkins Slave.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 16h 38min 49.487s CPU time.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: Started Jenkins Slave.

The same for the earlier https://ci.nodejs.org/job/node-test-commit-linux/nodes=fedora-latest-x64/39376/console

08:13:30 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/third_party/v8/builtins/array-sort-tq-csa.o] Terminated
08:14:54 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/test/torque/test-torque-tq-csa.o] Terminated
08:14:54 FATAL: command execution failed
08:14:54 java.nio.channels.ClosedChannelException
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: A process of this unit has been killed by the OOM killer.
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Main process exited, code=exited, status=143/n/a
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Failed with result 'oom-kill'.
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 1h 12min 3.598s CPU time.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Scheduled restart job, restart counter is at 3.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: Stopped Jenkins Slave.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 1h 12min 3.598s CPU time.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: Started Jenkins Slave.
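Entries like these can be pulled straight from the systemd journal on the affected host; a minimal check, assuming journald still retains logs from the period in question:

journalctl -u jenkins.service | grep -i oom      # unit-level OOM-kill events
journalctl -k | grep -i "out of memory"          # kernel OOM-killer messages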

From nodejs/node#36139 (comment):

How much memory does it have compared to other similar hosts?

Appears to be 2GiB

[root@test-rackspace-fedora32-x64-1 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       290Mi       402Mi        10Mi       1.2Gi       1.5Gi
Swap:            0B          0B          0B
[root@test-rackspace-fedora32-x64-1 ~]#

For comparison, the other fedora-latest-x64 host:

$ ssh test-digitalocean-fedora32-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       317Mi       202Mi       0.0Ki       1.4Gi       1.4Gi
Swap:            0B          0B          0B
$

The two fedora-last-latest-x64 hosts:

$ ssh test-digitalocean-fedora30-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       292Mi       1.3Gi       0.0Ki       368Mi       1.5Gi
Swap:            0B          0B          0B
$ ssh test-digitalocean-fedora30-x64-2 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       284Mi       2.0Gi       0.0Ki       1.6Gi       3.3Gi
Swap:            0B          0B          0B
$

centos7-64-gcc8:

$ ssh test-rackspace-centos7-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:           1.8G        256M        975M        3.5M        600M        1.4G
Swap:          2.0G        295M        1.7G
$ ssh test-softlayer-centos7-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:           1.8G        142M        1.3G        6.0M        376M        1.5G
Swap:          2.0G        260M        1.7G
$

@richardlau
Member

richardlau commented Jan 25, 2021

I've added the build agenda label to this in case nobody gets around to looking at it. I know that the WG members from Red Hat are busy this week.

@rvagg had some suggestions in nodejs/node#36139 (comment) to see if clearing things up on the existing hosts helps. Otherwise we might look at either adding 2GiB of swap to the Fedora hosts (if we have the disk space) or bumping the allocated memory.
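Before allocating a 2GiB swapfile it's worth confirming there is disk headroom; a quick check, assuming the swapfile would live on the root filesystem:

df -h /     # free disk space on the root filesystem
free -h     # current memory and swap usage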

@gengjiawen
Member Author

I'm only just seeing this so don't have anything too intelligent to add (such as why it's failing) other than:

  • Failures on standard configurations are intended to be a signal that something is not right. Switching to clang might "fix" this problem, but then you're just shipping software that's likely to fail on the particular configuration that's failing in CI. I see a suggestion of memory problems; is there a known OOM here? I'm not seeing that in the log for that last CI run.
  • Fedora 33 is out, so fedora-latest needs to be upgraded to that when someone (probably me) has time to do it. But Fedora 32, which is failing here, will still be in the mix as fedora-last-latest. It'd be quite interesting to see whether this is still failing on 33.
  • Someone with @nodejs/build test permissions could log in to the two machines (test-rackspace-fedora32-x64-1 and test-digitalocean-fedora32-x64-1), run dnf upgrade, update slave.jar, clear out ~iojs/build/workspace and reboot. I reckon it's been ages since anyone was in these machines and there might be something local that would be fixed up with a clean-out (maybe there's a memory-hogging program in the background?).

Originally posted by @rvagg in nodejs/node#36139 (comment)

@mhdawson
Member

@richardlau is the failure only on Fedora because the machines were configured with less memory, or is it something specific to Fedora?

@richardlau
Member

@mhdawson I haven't found anything yet to suggest a Fedora-specific issue vs. a simple memory issue.

@mhdawson
Member

@richardlau thanks, in terms of:

Someone with @nodejs/build test permissions could log in to the two machines (test-rackspace-fedora32-x64-1 and test-digitalocean-fedora32-x64-1), run dnf upgrade, update slave.jar, clear out ~iojs/build/workspace and reboot. I reckon it's been ages since anyone was in these machines and there might be something local that would be fixed up with a clean-out (maybe there's a memory-hogging program in the background?).

Is that something you will have time to do on one of the machines?

@richardlau
Member

@mhdawson I'm not sure. I don't have much work time available in the remainder of this week outside of the scheduled Red Hat meetings. I could make time next week.

@rvagg
Member

rvagg commented Jan 27, 2021

I updated those two machines, cleared workspaces and rebooted. Here's a green run for you for that problematic PR: https://ci.nodejs.org/job/node-test-commit-linux/39601/

We've historically targeted ~2GB, ~2-core machines in CI; they should be our most common configuration. If it were a universal memory problem then I'd expect to see it in more places than just one type of machine. My guess is that it's a bug in the toolchain that's been resolved; there were a number of toolchain updates in the big batch of updates installed, including gcc and glibc. The biggest memory hog on the machines is the Java process running Jenkins, sitting at ~200MB, and they're back near that level after being restarted, so it doesn't look like they were bloating, and there wasn't anything else taking up very much.
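A quick way to double-check for a background memory hog, assuming the standard procps tools are installed:

ps aux --sort=-rss | head -n 11    # the ten processes with the largest resident set size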

🤷 We'll keep an eye on these machines, but for now it seems to be addressed.

@gengjiawen
Member Author

Nice work ❤️ @rvagg

@targos
Member

targos commented Aug 2, 2021

I'm reopening because every time there's a V8 update that requires recompiling everything, I have to run CI many times hoping it passes.

It also happens with centos7-arm64-gcc8.

@targos
Member

targos commented Aug 2, 2021

@rvagg
Member

rvagg commented Aug 4, 2021

Well, centos7-arm64-gcc8 is interesting because it's got plenty of memory. I think we're dealing with too much parallelism on that machine. For all of the arm64 machines we have server_jobs: 50, which is a bit much; I think we need to pull that right back, maybe to something more reasonable like 12.

We are also migrating our arm64 machines to new hardware and it'll be a good opportunity to fix all of this. I was hoping to do some nice containerised arm64 infra like we have for our *linux-containered* builds but for different arm64 distros, but I think @sxa might have other ideas and has put up his hand to jump in on that. Something we'll need to pay attention to.

As for Fedora, I'm still at a bit of a loss. We do need to upgrade: we're stuck on 30 and 32 but should be on 34 (probably keeping 32 as our "last"). I don't know why these hosts stand out; they're running on the same spec hardware as many of our other VMs, and they have JOBS set to 2 so they shouldn't be overdoing it.
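For context, JOBS is the environment variable the CI hosts use to cap build parallelism (server_jobs appears to be the corresponding inventory setting). Assuming the build is ultimately driven by make, the effect is roughly:

export JOBS=2
make -j"$JOBS"    # at most $JOBS compile processes run concurrently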

@rvagg
Member

rvagg commented Aug 4, 2021

arm64 machines updated to JOBS=12, systems updated and rebooted. I think they should be good to go now.

Maybe someone from Red Hat might volunteer to update our Fedora systems? I usually take the older ones, reimage them with the latest Fedora, then change labels so that fedora-latest is the new ones and fedora-last-latest is the ones that were previously newest (32 in this case). If not, I might get to it sometime; there's also Alpine to do, I think.

@sxa
Member

sxa commented Aug 4, 2021

I was hoping to do some nice containerised arm64 infra like we have for our linux-containered builds but for different arm64 distros, but I think @sxa might have other ideas and has put up his hand to jump in on that.

@rvagg Yep, need to get on with that, but other critical stuff has come up - next week hopefully (I'm on vacation until Tuesday now). Superficially, sounds like we're pretty much on the same page in terms of what it makes sense to do, though :-)

@richardlau
Member

Maybe someone from Red Hat might volunteer to update our Fedora systems? I usually take the older ones, reimage them with the latest Fedora, then change labels so that fedora-latest is the new ones and fedora-last-latest is the ones that were previously newest (32 in this case). If not, I might get to it sometime; there's also Alpine to do, I think.

I've started this now (starting with test-digitalocean-fedora30-x64-1). Reimaging was fairly painless, but I ran into https://www.digitalocean.com/community/questions/fedora-33-how-to-persist-dns-settings-via-etc-resolv-conf (with Fedora 34), meaning our playbooks failed until I went onto the machine and fixed the DNS settings (as per https://www.digitalocean.com/community/questions/fedora-33-how-to-persist-dns-settings-via-etc-resolv-conf?answer=66950).

@sxa
Member

sxa commented Oct 7, 2021

Did we ever try increasing the swap space on these machines? (Since it looks from the output earlier in this issue that they had none.)

@rvagg
Member

rvagg commented Oct 7, 2021

I don't think so

@richardlau
Member

We have not. Is it easily done via Ansible? I'm up for trying.

@richardlau
Member

I've added swap to the two Fedora 32 hosts:

dd if=/dev/zero of=/swapfile bs=1024 count=2097152
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
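That dd invocation creates a 2GiB file (2097152 blocks of 1024 bytes). A swapfile enabled this way is not re-enabled after a reboot by itself; if it is meant to persist, an fstab entry along these lines would also be needed (a sketch, assuming the file stays at /swapfile):

echo '/swapfile none swap defaults 0 0' >> /etc/fstab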

@richardlau
Member

Might also try adding swap to the Debian 10 hosts, e.g. https://ci.nodejs.org/job/node-test-commit-linux/nodes=debian10-x64/43295/console

01:15:34 cc1plus: out of memory allocating 2097152 bytes after a total of 22560768 bytes
01:15:34 make[2]: *** [tools/v8_gypfiles/v8_compiler.target.mk:266: /home/iojs/build/workspace/node-test-commit-linux/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/pipeline.o] Error 1

@richardlau
Member

Have added swap to test-rackspace-debian10-x64-1.

@mhdawson
Member

We ended up having an informal meeting and not streaming. In retrospect we probably should have streamed, but at the start we were not sure how much we were going to discuss.

richardlau reopened this Dec 15, 2021
@mhdawson
Member

I see I got the wrong issue for the last comment.

@mhdawson
Member

mhdawson commented Jan 25, 2022

Seems like adding swap has resolved the issue, but leaving this open until we have added the swap setup to our Ansible scripts.

EDIT: we agreed to add this to the manual instructions for now and then close this issue.

@github-actions

github-actions bot commented May 9, 2023

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

github-actions bot added the stale label May 9, 2023
github-actions bot closed this as not planned Jun 8, 2023