Investigate the use of Alpaka and CUPLA #325

Open · fwyzard opened this issue Apr 18, 2019 · 8 comments

fwyzard commented Apr 18, 2019

We should look into Alpaka, and possibly CUPLA, as a way of semi-automatically porting the CUDA kernels back to run on the CPU (and potentially on other accelerators in the future).

Some links:


fwyzard commented Apr 18, 2019

@makortel has set up a standalone version of the pixel unpacker to simplify testing.

@fwyzard fwyzard self-assigned this Apr 18, 2019
@fwyzard fwyzard added the task label Apr 18, 2019

fwyzard commented Apr 18, 2019

Here are some simple instructions for using CUPLA without CMake (CMake would likely be incompatible with CMSSW's scram).
The CMake-based approach is to re-build the CUPLA wrappers for each target and each backend, and to link them statically into the target code.
These instructions take a slightly different approach: build the CUPLA wrappers once per backend, and link them as shared libraries. Note however that these shared libraries cannot be linked into a single application, as they all export the same symbols.

Set up the environment

BASE=$PWD
export CUDA_ROOT=/usr/local/cuda-10.0
export ALPAKA_ROOT=$BASE/alpaka
export CUPLA_ROOT=$BASE/cupla

CXX="/usr/bin/g++-7"
CXX_FLAGS="-m64 -std=c++11 -g -O2 -DALPAKA_DEBUG=0 -DCUPLA_STREAM_ASYNC_ENABLED=1 -I$CUDA_ROOT/include -I$ALPAKA_ROOT/include -I$CUPLA_ROOT/include"
HOST_FLAGS="-fPIC -ftemplate-depth-512 -Wall -Wextra -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-local-typedefs -Wno-attributes -Wno-reorder -Wno-sign-compare"

NVCC="$CUDA_ROOT/bin/nvcc"
NVCC_FLAGS="-ccbin $CXX -lineinfo --expt-extended-lambda --expt-relaxed-constexpr --generate-code arch=compute_50,code=sm_50 --use_fast_math --ftz=false --cudart shared"

Download alpaka and cupla

git clone git@github.com:ComputationalRadiationPhysics/alpaka.git -b 0.3.5 $ALPAKA_ROOT
git clone git@github.com:ComputationalRadiationPhysics/cupla.git  -b 0.1.1 $CUPLA_ROOT

Build cupla ...

... for the CUDA backend

FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"

mkdir -p $CUPLA_ROOT/build/cuda $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/cuda

for FILE in $FILES; do
  $NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -Xcompiler "$HOST_FLAGS" -x cu -c $FILE -o $(basename $FILE).o
done
$NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -Xcompiler "$HOST_FLAGS" -shared *.o -o $CUPLA_ROOT/lib/libcupla-cuda.so

... for the serial backend

FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"

mkdir -p $CUPLA_ROOT/build/serial $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/serial

for FILE in $FILES; do
  $CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -c $FILE -o $(basename $FILE).o
done
$CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -shared *.o -o $CUPLA_ROOT/lib/libcupla-serial.so

... for the TBB backend

FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"

mkdir -p $CUPLA_ROOT/build/tbb $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/tbb

for FILE in $FILES; do
  $CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -c $FILE -o $(basename $FILE).o
done
$CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -shared *.o -ltbbmalloc -ltbb -lpthread -lrt -o $CUPLA_ROOT/lib/libcupla-tbb.so

Build an example with cupla

Using CUDA on the GPU

cd $BASE
$NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -x cu $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o cuda-vectorAdd -L$CUPLA_ROOT/lib -lcupla-cuda

Using the serial backend on the CPU

cd $BASE
$CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o serial-vectorAdd -L$CUPLA_ROOT/lib -lcupla-serial -lpthread

Using the TBB backend on the CPU

cd $BASE
$CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o tbb-vectorAdd -L$CUPLA_ROOT/lib -lcupla-tbb -lpthread


fwyzard commented Apr 18, 2019

Here is a first reimplementation of the CUDA kernel with CUPLA, following the instructions in the porting guide.
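
For context, the gist of the porting guide is that the host code keeps the familiar CUDA API names (via cuda_to_cupla.hpp), while each __global__ kernel becomes a functor whose templated operator() takes the accelerator as its first argument. A minimal sketch of the pattern, with an illustrative kernel body rather than the actual unpacker code:

#include <cuda_to_cupla.hpp>

// CUDA: __global__ void add(const float *in, float *out, int n)
struct add {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const & acc, const float *in, float *out, int n) const {
    // blockIdx, blockDim and threadIdx are mapped to the accelerator by cupla
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] += in[i];
  }
};

// CUDA: add<<<blocks, threads>>>(d_in, d_out, n)
// CUPLA:
//   CUPLA_KERNEL(add)(blocks, threads, 0, 0)(d_in, d_out, n);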


fwyzard commented Apr 18, 2019

It seems that alpaka#23 and cupla#12 would be blockers for the use of Alpaka and Cupla in CMSSW.


fwyzard commented Apr 18, 2019

Here is an optimised version of the cupla implementation, following the discussion in the tuning guide and adding support for the TBB backend.
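
For reference, the main change suggested by the tuning guide is to let each thread process a run of consecutive elements (one per thread on the GPU, many per thread on the CPU backends), looping over cupla's elemDim inside the kernel. A rough sketch of the pattern, again with an illustrative body rather than the actual unpacker code:

struct process {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const & acc, float *data, int n) const {
    // each thread owns elemDim.x consecutive elements (1 on the GPU backend)
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * elemDim.x;
    for (int i = first; i < first + (int) elemDim.x && i < n; ++i)
      data[i] *= 2.f;
  }
};

// CUPLA_KERNEL_OPTI(process)(blocks, threads, 0, 0)(d_data, n);
// if I read the guide correctly, the _OPTI launch maps the requested threads
// to elements on the CPU backends, so one CPU thread runs the whole loop.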

With these changes, I get the following performance on my laptop (Intel Core i7-6700HQ 2.60GHz, NVIDIA GeForce GTX 960M):

$ ./main-naive
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 2242.15 us
$ ./main-cupla-serial
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 2418.56 us
$ taskset -c 0 ./main-cupla-tbb
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 2349.14 us
$ taskset -c 0,1,2,3 ./main-cupla-tbb
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 799.58 us
$ cudarun ./main-cuda
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 3229.69 us
$ cudarun ./main-cupla-cuda
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 3233.61 us

On the CPU, the serial and (single-core) TBB versions are only 5-8% slower than the naïve implementation; the TBB version running on four cores is roughly 3 times faster.

On the GPU side, using cupla's CUDA backend does not seem to introduce any significant overhead over the native CUDA version.

makortel commented Apr 18, 2019

It seems that ComputationalRadiationPhysics/cupla#12 could be a blocker for the use of CUPLA in CMSSW.

On the other hand, we are primarily looking at making the kernel code portable (right?), so the EDProducer classes could still call cudaSetDevice(), cudaMalloc(), etc. directly.

(although it would be an interesting exercise to think of the consequences of making the EDProducers and hence(?) EDProducts "portable" as well)


fwyzard commented Apr 18, 2019

Good point... I will ask what, exactly, is not thread safe.


fwyzard commented Dec 28, 2019

Some more comments about Alpaka and Cupla.

Alpaka

  • the syntax to define the "work division" (the equivalent of the CUDA grid) is atrocious; the equivalent of CUDA:

    dim3 blockSize {1, 1, 1};
    dim3 gridSize {4, 8, 16};

    in helloWorld.cpp is:

    using Vec = alpaka::vec::Vec<Dim, Idx>;
    Vec const elementsPerThread(Vec::all(static_cast<Idx>(1)));
    Vec const threadsPerBlock(Vec::all(static_cast<Idx>(1)));
    Vec const blocksPerGrid(
        static_cast<Idx>(4),
        static_cast<Idx>(8),
        static_cast<Idx>(16));
    
    using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>;
    WorkDiv const workDiv(
        blocksPerGrid,
        threadsPerBlock,
        elementsPerThread);
  • the syntax to use the work division (i.e. get the number of blocks and threads) is just as convoluted, as the sketch below shows
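
    For illustration, reading the same information back inside a kernel looks like this in the 0.3.x API (quoting the helloWorld example from memory, so take the exact namespaces as approximate):

    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const & acc) const {
        // CUDA: blockIdx/threadIdx and gridDim/blockDim
        auto const globalThreadIdx =
            alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc);
        auto const globalThreadExtent =
            alpaka::workdiv::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc);
        // flatten the 3-dimensional index into a linear one
        auto const linearIdx =
            alpaka::idx::mapIdx<1u>(globalThreadIdx, globalThreadExtent)[0u];
    }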

Cupla

  • the lack of thread-safety in host code is a blocker (see e.g. cupla#12 and cupla#121)
