
Configure scram to disable one of the GPU backends in a development area? #45859

fwyzard opened this issue Sep 2, 2024 · 26 comments

fwyzard commented Sep 2, 2024

The ROCm (and to some extent CUDA) alpaka backends add a noticeable amount to the time it takes to build some packages.

For users that do not care about running on (AMD) GPUs, we could speed up the compilation by disabling the ROCm (or CUDA) alpaka backend(s).

Also note that it could be much worse if we manage to add the SYCL/oneAPI backend...

This could be implemented in scram, with a syntax like

scram b disable-backend {cuda,rocm}
scram b enable-backend {cuda,rocm}

?

Another way to speed up the compilation would be to target only one actual GPU type, like an NVIDIA T4 or an AMD Mi250.

This could be implemented with a syntax like

scram b enable-backend cuda=sm_89
scram b enable-backend rocm=gfx90a

We could also get the hardware type from cudaComputeCapabilities or rocmComputeCapabilities with a syntax like

scram b enable-backend cuda=native
scram b enable-backend rocm=native

@smuzaffar do you think this could be implemented in scram ?

If you think so, we can discuss the implementation detail here or in person.


fwyzard commented Sep 2, 2024

assign core,heterogeneous


cmsbuild commented Sep 2, 2024

New categories assigned: core,heterogeneous

@Dr15Jones, @fwyzard, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks


cmsbuild commented Sep 2, 2024

cms-bot internal usage


cmsbuild commented Sep 2, 2024

A new Issue was created by @fwyzard.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@smuzaffar

@fwyzard , yes, we should be able to implement this via scram b ..... How about, for local development (e.g. where the user only wants to test things on the local host), we just use scram build enable-alpaka-native, which on a host with

  • an NVIDIA GPU: disables rocm, and uses cudaComputeCapabilities to get the actual GPU and build only for that GPU type
  • an AMD GPU: disables cuda, and uses rocmComputeCapabilities to get the actual GPU and build only for that GPU type
  • no GPU: disables both the rocm and cuda backends

We can also add scram b {enable|disable}-alpaka-{rocm|cuda} to explicitly enable/disable the rocm/cuda backend build.
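A minimal sketch of that selection logic (a hypothetical standalone script, not the actual scram implementation; it assumes the cudaComputeCapabilities / rocmComputeCapabilities utilities exit non-zero when no matching GPU is found):

```shell
# Pick which alpaka backend to keep, based on the GPUs visible on this host.
# cudaComputeCapabilities / rocmComputeCapabilities are the CMSSW helpers
# discussed in this thread; everything else here is illustrative.
if command -v cudaComputeCapabilities >/dev/null 2>&1 \
   && cudaComputeCapabilities >/dev/null 2>&1; then
    backend=cuda   # NVIDIA GPU present: disable rocm, build native CUDA archs
elif command -v rocmComputeCapabilities >/dev/null 2>&1 \
     && rocmComputeCapabilities >/dev/null 2>&1; then
    backend=rocm   # AMD GPU present: disable cuda, build native ROCm archs
else
    backend=none   # no GPU: disable both backends
fi
echo "selected backend: $backend"
```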

If needed, we can discuss this in core sw meeting tomorrow


fwyzard commented Sep 2, 2024

If needed, we can discuss this in core sw meeting tomorrow

Sounds good.


fwyzard commented Sep 3, 2024

About updating the flags in the cuda.xml and rocm.xml tools.

cuda.xml

The syntax for enabling sm_## is -gencode arch=compute_##,code=[sm_##,compute_##].
So, calling e.g.

scram b enable-backend cuda=sm_89

should remove all the CUDA_FLAGS of the form -gencode arch=compute_[0-9]+,code=[sm_[0-9]+,compute_[0-9]+], and add -gencode arch=compute_89,code=[sm_89,compute_89].
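A sketch of that rewrite on a cuda.xml-like line (the sample content, variable names, and sed pipeline are made up for the demonstration; the real implementation would edit the tool file in place):

```shell
# Strip every existing -gencode clause, then insert one for sm_89.
sample='<flags CUDA_FLAGS="-gencode arch=compute_75,code=[sm_75,compute_75] -gencode arch=compute_89,code=[sm_89,compute_89]"/>'
new=$(printf '%s\n' "$sample" \
  | sed -E 's/-gencode arch=compute_[0-9]+,code=\[sm_[0-9]+,compute_[0-9]+\] ?//g' \
  | sed 's/CUDA_FLAGS="/CUDA_FLAGS="-gencode arch=compute_89,code=[sm_89,compute_89]/')
echo "$new"
```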

The "native" CUDA architectures used by the NVIDIA GPUs in the local machine can be extracted from cudaComputeCapabilities:

$ cudaComputeCapabilities 
   0     8.9    NVIDIA L4
   1     7.5    Tesla T4

A machine with these two GPUs should use the architectures sm_89 and sm_75.
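A sketch of how the native architectures could be derived from that output (assuming the compute capability is always the second column; the sample is copied from above, since the build host may lack a GPU):

```shell
# Convert cudaComputeCapabilities output ("8.9" etc.) into nvcc arch names.
caps='   0     8.9    NVIDIA L4
   1     7.5    Tesla T4'
archs=$(printf '%s\n' "$caps" | awk '{gsub(/\./, "", $2); print "sm_" $2}' | sort -u)
echo "$archs"
```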

Currently there is a script, cmsCudaSetup.sh, that does part of what scram b enable-backend cuda=native should do.

rocm.xml

The syntax for enabling gfx#### is --offload-arch=gfx####, so

scram b enable-backend rocm=gfx1100

should remove all the ROCM_FLAGS of the form --offload-arch=gfx[0-9a-f]+, and add --offload-arch=gfx1100.

Note that the value after gfx can have 3 or 4 hexadecimal digits.

The "native" ROCm architectures used by the AMD GPUs in the local machine can be extracted from rocmComputeCapabilities:

$ rocmComputeCapabilities 
   0     gfx1100    AMD Radeon Pro W7800 (unsupported)
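The same extraction sketch for the ROCm side (the architecture is assumed to be the second column; whether "(unsupported)" devices should be kept is a policy question this sketch ignores):

```shell
# Pull the gfx architecture names out of rocmComputeCapabilities output and
# turn them into --offload-arch flags. Sample copied from above.
caps='   0     gfx1100    AMD Radeon Pro W7800 (unsupported)'
archs=$(printf '%s\n' "$caps" | awk '{print $2}' | sort -u)
# $archs is deliberately unquoted so each architecture becomes one flag.
flags=$(printf -- '--offload-arch=%s\n' $archs)
echo "$flags"
```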


smuzaffar commented Sep 12, 2024

@fwyzard , thanks for the hints in #45859 (comment).
As scram build ... passes everything to gmake as build targets, it is not easy to implement scram build enable-backend cuda (here cuda becomes a build target, which gmake will try to run) or scram build enable-backend cuda=sm_89 (here cuda becomes a variable, overriding the value set by the cuda tool). Instead, how about:

  • scram build {en,dis}able-backend-{cuda,rocm}: to enable/disable the cuda/rocm alpaka backends
  • scram build enable-backend-{cuda,rocm}-[comma-separated-compute-capabilities], e.g.
    • scram build enable-backend-cuda-sm_75 or scram build enable-backend-cuda-sm_75,sm_89
    • scram build enable-backend-rocm-gfx1100 or scram build enable-backend-rocm-gfx1100,gfx90a
    • scram build enable-backend-cuda-native: to find the native compute capabilities and use those
    • scram build enable-backend-cuda-reset: to reset the compute capabilities to their original values (from the release area)
  • scram build enable-backend-native: to disable the backend that is not available and call enable-backend-cuda-native for the backend that is available
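The gmake behaviour described above can be seen with a minimal example (demo.mk is a throwaway file made up for this sketch):

```shell
# gmake treats bare words as goals and NAME=value arguments as variable
# overrides, which is why "enable-backend cuda=sm_89" never reaches the
# target as a plain argument.
printf 'enable-backend:\n\t@echo "cuda variable is: $(cuda)"\n' > demo.mk
out=$(make -f demo.mk enable-backend cuda=sm_89)
echo "$out"
```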


fwyzard commented Sep 12, 2024

I see.

Maybe we could shorten the commands, like

  • scram build {en,dis}able-{cuda,rocm}
  • scram build enable-cuda-sm_75
  • scram build enable-rocm-gfx1100,gfx90a

etc?

And it might be clearer if we separated the backend and the individual targets with a :

  • scram build enable-cuda:sm_75
  • scram build enable-rocm:gfx1100,gfx90a

(I would suggest using = but Make would interpret it as setting a variable)

What do you think ?

@smuzaffar

sounds good, so I will drop -backend from the target and use : for the compute capabilities

@smuzaffar

@fwyzard , for now I have enable-alpaka:native to automatically enable/disable the cuda/rocm backends and set the native compute capabilities. Is this a good target name, or should I change it to enable-alpaka-native? (enable-native sounds very generic)


fwyzard commented Sep 12, 2024

Maybe enable-gpus:native ?


fwyzard commented Sep 12, 2024

But it affects only Alpaka modules, not other modules that may use the process.options.accelerators, right ?

Then enable-alpaka:native may be more correct.

@smuzaffar

Yes, it only affects the alpaka modules. OK, so I will go with enable-alpaka:native then.

@smuzaffar

@fwyzard , {en,dis}able-{cuda,rocm} also affect only alpaka; should we change these to {en,dis}able-alpaka:{cuda,rocm} ?


fwyzard commented Sep 12, 2024

I'm undecided, because then calls like scram b enable-alpaka:cuda:sm_75 start to become complicated.


fwyzard commented Sep 12, 2024

So I'm leaning more towards scram b enable-gpus:native.

Could you implement that, and later today we ask @makortel his opinion ?

@smuzaffar

As enable-{cuda,rocm}:capabilities only affects cuda/rocm directly, those calls can remain enable-{cuda,rocm}:capability.


fwyzard commented Sep 12, 2024

What about disable-cuda ?

@smuzaffar

Currently disable-cuda only disables the alpaka-cuda backend. It does not disable the cuda build rules, so scram will still compile .cu files for non-alpaka packages.


smuzaffar commented Sep 12, 2024

But if we want disable-cuda to disable both the alpaka-cuda backend and also stop building .cu files, then I can do it, but I think for now that would break builds (there are packages which have GPU code dependencies).


fwyzard commented Sep 12, 2024

OK, let me try to summarise:

  • scram b disable-cuda
    • ❌ does not build the alpaka CUDA backend
    • ✔️ builds regular .cu files
  • scram b disable-rocm:
    • ❌ does not build the alpaka ROCm backend
    • ✔️ builds regular .hip.cc files
  • scram b enable-cuda:
    • ✔️ builds the alpaka CUDA backend
    • ✔️ builds regular .cu files
  • scram b enable-cuda:sm_90:
    • changes the cuda.xml tool file to support (only) the sm_90 architecture
    • ✔️ builds the alpaka CUDA backend
    • ✔️ builds regular .cu files
  • scram b enable-cuda:native:
    • ❔ uses cudaComputeCapabilities to determine the architectures of the NVIDIA GPUs in the system
    • ✔️ changes the cuda.xml tool file to support (only) these architectures
    • ✔️ builds the alpaka CUDA backend
    • ✔️ builds regular .cu files
  • scram b enable-rocm, enable-rocm:gfx1100, enable-rocm:native:
    • the same for the AMD GPUs, .hip.cc files, and the ROCm alpaka backend
  • scram b enable-alpaka:native
    • ❔ checks for both NVIDIA and AMD GPUs
    • ✔️ updates the corresponding tool files to support (only) the GPUs present on the system
    • ❔ enables only the alpaka backends for the GPUs present on the system
    • ✔️ builds all regular .cu and .hip.cc files

Is it correct ?

Basically, it would never affect whether the regular .cu and .hip.cc files are built (other than which architecture is built), only whether the alpaka backends are built or not.

So I think I would prefer scram b enable-gpus:native :-)


fwyzard commented Sep 12, 2024

And, once #45844 is complete, we could revisit this

Currently disable-cuda only disables the alpaka-cuda backend. It does not disable the cuda build rules, so scram will still compile .cu files for non-alpaka packages.

and try to disable the CUDA or ROCm backends completely.

@smuzaffar

Is it correct ?

yes this is correct.

So I think I would prefer scram b enable-gpus:native

OK

@smuzaffar

cms-sw/cmssw-config#110 should implement these new rules. scram build help in a dev area should show these new build rules.

@makortel

I'd find it clearest if {enable,disable}-{cuda,rocm} and enable-gpus:native applied equally to the compilation of .cu and .hip.cc files as well. But to be practical, I'm OK with leaving that until #45844 is complete.
