Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option -g: kill all processes within a process group #247

Closed
wants to merge 6 commits into from

Conversation

TaWeiTu
Copy link
Contributor

@TaWeiTu TaWeiTu commented May 5, 2021

Add an option that allows EarlyOOM to kill all processes that are in the same process group as the one with excessive memory usage.

@hakavlad
Copy link
Contributor

hakavlad commented May 5, 2021

cooool

@jayenashar
Copy link

will this work to kill slack cleanly?

Copy link
Owner

@rfjakob rfjakob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Two comments in the review, otherwise good.

MANPAGE.md Outdated
@@ -104,6 +104,9 @@ When earlyoom is run through its default systemd service, the `-p` switch doesn'
#### -n
Enable notifications via d-bus.

#### -g
Kill all processes that are in the same process group as the one with excessive memory usage.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please give a short description when a user should enable this flag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Added a more detailed description.

kill.c Outdated
@@ -73,6 +73,9 @@ int kill_wait(const poll_loop_args_t* args, pid_t pid, int sig)
}
meminfo_t m = { 0 };
const unsigned poll_ms = 100;
if (args->kill_process_group) {
pid = -getpgid(pid);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error handling, please! getpgid() can fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'm not sure how the error code in fatal() is chosen so for now I just randomly picked 7.

- Error handling of getpgid()
- Describe when this flag should be enabled
kill.c Outdated
@@ -73,6 +73,12 @@ int kill_wait(const poll_loop_args_t* args, pid_t pid, int sig)
}
meminfo_t m = { 0 };
const unsigned poll_ms = 100;
if (args->kill_process_group) {
if ((pid = getpgid(pid)) < 0) {
fatal(7, "%s: could not get PGID: %s", __func__, strerror(errno));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fatal error is too much here ;)

getpgid can fail when the process has already exited, this is normal. Please just "return res" as below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds more reasonable indeed. Thanks!

@rfjakob
Copy link
Owner

rfjakob commented May 6, 2021

Code is good now, thanks!

But I still need to understand when this should be used. Does firefox create a process group for itself? I guess you use it - for what?

@rfjakob
Copy link
Owner

rfjakob commented May 6, 2021

I figured out that you can use pstree -g to view the PGIDs. I suggest pstree -gT to make it easier to read. On my Fedora 34, output looks like this: https://gist.github.com/rfjakob/c5f463d0fe256fa3571f384c27074a85

Problem: gnome-shell and chrome share the same PGID

@rfjakob
Copy link
Owner

rfjakob commented May 6, 2021

I now also started Atom and Firefox: https://gist.github.com/rfjakob/c5f463d0fe256fa3571f384c27074a85
Unfortunately, everything shares the PGID with gnome-shell.

@TaWeiTu
Copy link
Contributor Author

TaWeiTu commented May 6, 2021

Thanks for the review!

We maintain several GPU servers (with NVIDIA GPUs to be precise) with EarlyOOM installed, which primarily run deep-learning jobs like PyTorch or Tensorflow applications.
Occasionally, some zombie-like processes are left on the GPUs, consuming VRAM but are not visible in the OS.
We suspect that this is because EarlyOOM kills part of the jobs and the result is not handled well by the GPUs and the driver, hence this patch.

CC @oToToT and @WillyPillow who are also in charge of this issue.

@TaWeiTu
Copy link
Contributor Author

TaWeiTu commented May 6, 2021

I now also started Atom and Firefox: https://gist.github.com/rfjakob/c5f463d0fe256fa3571f384c27074a85
Unfortunately, everything shares the PGID with gnome-shell.

This seems to be the same on my Ubuntu 20.04 VM, unfortunately.

@jayenashar
Copy link

I now also started Atom and Firefox: https://gist.github.com/rfjakob/c5f463d0fe256fa3571f384c27074a85
Unfortunately, everything shares the PGID with gnome-shell.

i'm using kde and not having this. things looks separated as i expect.

but i'm curious as i launch multiple firefox profiles from the same shell and they end up in their own separate process groups. why is this not happening for you?

@WillyPillow
Copy link

We maintain several GPU servers (with NVIDIA GPUs to be precise) with EarlyOOM installed, which primarily run deep-learning jobs like PyTorch or Tensorflow applications.
Occasionally, some zombie-like processes are left on the GPUs, consuming VRAM but are not visible in the OS.
We suspect that this is because EarlyOOM kills part of the jobs and the result is not handled well by the GPUs and the driver, hence this patch.

CC @oToToT and @WillyPillow who are also in charge of this issue.

To elaborate on this, we observed earlyoom killing parent processes in programs utilizing GPUs, leaving the (now orphaned) child processes holding on to the allocated VRAM.

While killing the whole process group is rather blunt, as can be seen from your examples, it should be fine in our use case. This is because i) we're essentially using it to take care of misbehaving users/processes, so even killing all processes by that user is acceptable, and ii) killing the process group seems to usually kill the respective login session, which is fairly reasonable from a shell account server PoV.

@WillyPillow
Copy link

I now also started Atom and Firefox: https://gist.github.com/rfjakob/c5f463d0fe256fa3571f384c27074a85
Unfortunately, everything shares the PGID with gnome-shell.

Another data point: on awesomeWM, the PGIDs seem properly separated. This is probably pretty dependent on how the DE/launcher spawns applications.

@TaWeiTu
Copy link
Contributor Author

TaWeiTu commented May 6, 2021

I added some warnings in MANPAGE.md regarding this issue.
Not sure whether it should be in README and --help as well.

rfjakob pushed a commit that referenced this pull request May 7, 2021
@rfjakob
Copy link
Owner

rfjakob commented May 7, 2021

Merged as 8f4b654 , thanks!

@rfjakob rfjakob closed this May 7, 2021
@TaWeiTu
Copy link
Contributor Author

TaWeiTu commented May 7, 2021

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants