Do not ever want to use an "e" machine on GCP #7474

Open
GregoryDougherty opened this issue Jul 26, 2024 · 10 comments

@GregoryDougherty

We are running a WDL pipeline on GCP and requested 16 CPUs. We have
runtime {
cpuPlatform: "Intel Cascade Lake"
}
set in the .conf file

Nevertheless, the task was given an e2-standard-16 machine. It took 9 hours to turn a BAM into a CRAM, and another hour to upload the 51 GB CRAM file.

I ran the code again; the only change was that I used an n2-standard-16.
The entire task, including the upload, finished in less than 30 minutes.

How do we make it so we never, ever, get an "e" machine? I downloaded the Cromwell code and looked through it, trying to figure out where the machine type is picked on GCP, but never found anything that looked like it would pick an e machine. What am I missing?

The cromwell that assigned the e2-standard-16 was cromwell-87.jar, if that matters

Thank you

@dspeck1
Collaborator

dspeck1 commented Jul 26, 2024

This is with the GCP Batch backend, correct? Machine type is a parameter in Batch, but not a parameter in Cromwell. If a machine type is not defined, Batch selects the machine type based on the CPU and memory request. Setting cpuPlatform is the way to not get an e series machine. Did you get an e series machine with "Intel Cascade Lake" in the configuration?

@GregoryDougherty
Author

Yes, assuming that putting runtime {cpuPlatform: "Intel Cascade Lake"} in the Cromwell conf file counts as "in the configuration".

If we have to hand-edit every single one of our WDL task files to put that in a runtime block, then no, I haven't tried that.

@GregoryDougherty
Author

$ gsutil cat gs://bucket/path/to/run.stdout | grep standard | grep machine | sort | uniq -c
1601 machine_type: "e2-standard-2"

No happiness there
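
For anyone who wants a per-family breakdown rather than a single grep count, here is a minimal Python sketch. It assumes the log contains lines of the form `machine_type: "e2-standard-2"`, as in the output above. The function name `count_machine_families` is made up for illustration and is not part of Cromwell or gsutil.

```python
import re
from collections import Counter

# Matches log lines such as: machine_type: "e2-standard-2"
# Assumption: the log uses exactly this key and quoting, as in the output above.
MACHINE_TYPE_RE = re.compile(r'machine_type:\s*"([^"]+)"')


def count_machine_families(log_text):
    """Return a Counter mapping machine family (e.g. 'e2', 'n2') to its count."""
    families = Counter()
    for match in MACHINE_TYPE_RE.finditer(log_text):
        machine_type = match.group(1)          # e.g. "e2-standard-2"
        family = machine_type.split("-")[0]    # e.g. "e2"
        families[family] += 1
    return families


sample_log = (
    'machine_type: "e2-standard-2"\n'
    "some unrelated log line\n"
    'machine_type: "n2-standard-16"\n'
    'machine_type: "e2-highcpu-16"\n'
)
print(dict(count_machine_families(sample_log)))
```

Feeding it the contents of run.stdout (for example, after downloading the file locally) would show at a glance whether any tasks landed outside the e2 family.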

@dspeck1
Collaborator

dspeck1 commented Jul 29, 2024

Thanks. Setting cpuPlatform in the runtime attributes is the only way to avoid scheduling to an e series machine through Cromwell. GCP Batch is limited to setting cpuPlatform or instance type. There is no preferred machine family type setting in GCP Batch. With that said, the e series machine should not be that slow. Performance is supposed to be comparable to N1. If it is repeatable, open a support case with GCP.

@GregoryDougherty
Author

"The E2 machine series also contains shared-core machine types that use context-switching to time-share a physical core between vCPUs for multitasking"

Essentially, they are garbage machines that give you 1/2 the CPUs you ask for, and have horrid I/O. IMO, no one should ever be given one, unless they've explicitly asked for it. Especially in a bioinformatics environment, where you're going to be reading and writing large files on a regular basis.

Where in the code is the E2 default set? That's the part I was unable to figure out. If I could have that, I can fix it, put in a PR, and make our own version that doesn't require us to rewrite all our task files.

Thank you

@aednichols
Collaborator

aednichols commented Jul 29, 2024

From the GCP docs, it seemed like Cascade Lake wasn't among the CPU platforms used for E2s. So I would think the cpuPlatform should accomplish the goal if it works as intended.

Intel Skylake, Broadwell, and Haswell, AMD EPYC Rome and EPYC Milan

https://cloud.google.com/compute/docs/machine-resource#machine_type_comparison
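
As a rough illustration of the reasoning above, the check "does this platform appear in the E2 list?" can be sketched in Python. The set below copies the platforms quoted from the docs in this comment. The exact strings GCP accepts for cpuPlatform may differ, and the list may change over time, so treat both as assumptions. The helper name `platform_excludes_e2` is hypothetical.

```python
# CPU platforms listed for E2 in the docs excerpt quoted above.
# Assumption: names are written as quoted; GCP's accepted strings may differ.
E2_CPU_PLATFORMS = {
    "intel skylake",
    "intel broadwell",
    "intel haswell",
    "amd epyc rome",
    "amd epyc milan",
}


def platform_excludes_e2(cpu_platform):
    """Return True if the requested platform is not one E2 is listed as using."""
    return cpu_platform.strip().lower() not in E2_CPU_PLATFORMS


for name in ("Intel Cascade Lake", "Intel Skylake"):
    print(name, "excludes E2:", platform_excludes_e2(name))
```

This only captures what the docs table implies. It says nothing about whether the backend actually passes the setting through to GCP Batch.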

@aednichols
Collaborator

@GregoryDougherty can you try a different CPU platform that similarly excludes E2 based on the matrix linked above?

@GregoryDougherty
Author

I did, no change.

I put "Intel Cascade Lake" in cpuPlatform in the .wdl code's runtime block, and it gave me an e2-highcpu-16 with an AMD EPYC chip.
That was moderately faster, but still not nearly as good as an n2-standard-16.

So, my question remains: where in the code is the machine type actually chosen for GCP Batch?

There has to be something that is saying "yes, use e2". Where is that? What is the code that makes e2 a possible option, rather than just starting at n2 and going up from there?

Because our other choice is to always demand at least 40 CPUs, since that gets us out of e territory. But that's a really sub-optimal solution.

@mcovarr
Contributor

mcovarr commented Aug 30, 2024

Fixes for this will be available in the next Cromwell release, no ETA yet. If you need the fixes immediately and are comfortable building from the develop branch, that is also an option.

@GregoryDougherty
Author

Thank you


4 participants