Investigate the use of Alpaka and CUPLA #325

Open · fwyzard opened this issue Apr 18, 2019 · 8 comments

fwyzard commented Apr 18, 2019

We should look into Alpaka, and possibly CUPLA, as a way of semi-automatically porting the CUDA kernels back to run on the CPU (and potentially on other accelerators in the future).

Some links:


fwyzard commented Apr 18, 2019

@makortel has set up a standalone version of the pixel unpacker to simplify testing.

@fwyzard fwyzard self-assigned this Apr 18, 2019
@fwyzard fwyzard added the task label Apr 18, 2019

fwyzard commented Apr 18, 2019

Here are some simple instructions for using CUPLA without CMake (CMake would likely be incompatible with CMSSW's scram).
The CMake-based approach is to re-build the CUPLA wrappers for each target and each backend, and to link them statically into the target code.
These instructions take a slightly different approach: build the CUPLA wrappers once per backend, and link them as shared libraries. Note however that these shared libraries cannot be linked into a single application, as they all export the same symbols.

Set up the environment

BASE=$PWD
export CUDA_ROOT=/usr/local/cuda-10.0
export ALPAKA_ROOT=$BASE/alpaka
export CUPLA_ROOT=$BASE/cupla

CXX="/usr/bin/g++-7"
CXX_FLAGS="-m64 -std=c++11 -g -O2 -DALPAKA_DEBUG=0 -DCUPLA_STREAM_ASYNC_ENABLED=1 -I$CUDA_ROOT/include -I$ALPAKA_ROOT/include -I$CUPLA_ROOT/include"
HOST_FLAGS="-fPIC -ftemplate-depth-512 -Wall -Wextra -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-local-typedefs -Wno-attributes -Wno-reorder -Wno-sign-compare"

NVCC="$CUDA_ROOT/bin/nvcc"
NVCC_FLAGS="-ccbin $CXX -lineinfo --expt-extended-lambda --expt-relaxed-constexpr --generate-code arch=compute_50,code=sm_50 --use_fast_math --ftz=false --cudart shared"

Download alpaka and cupla

git clone git@github.com:ComputationalRadiationPhysics/alpaka.git -b 0.3.5 $ALPAKA_ROOT
git clone git@github.com:ComputationalRadiationPhysics/cupla.git  -b 0.1.1 $CUPLA_ROOT

Build cupla ...

... for the CUDA backend

FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"

mkdir -p $CUPLA_ROOT/build/cuda $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/cuda

for FILE in $FILES; do
  $NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -Xcompiler "$HOST_FLAGS" -x cu -c $FILE -o $(basename $FILE).o
done
$NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -Xcompiler "$HOST_FLAGS" -shared *.o -o $CUPLA_ROOT/lib/libcupla-cuda.so

... for the serial backend

FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"

mkdir -p $CUPLA_ROOT/build/serial $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/serial

for FILE in $FILES; do
  $CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -c $FILE -o $(basename $FILE).o
done
$CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -shared *.o -o $CUPLA_ROOT/lib/libcupla-serial.so

... for the TBB backend

FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"

mkdir -p $CUPLA_ROOT/build/tbb $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/tbb

for FILE in $FILES; do
  $CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -c $FILE -o $(basename $FILE).o
done
$CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -shared *.o -ltbbmalloc -ltbb -lpthread -lrt -o $CUPLA_ROOT/lib/libcupla-tbb.so

Build an example with cupla

Using CUDA on the GPU

cd $BASE
$NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -x cu $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o cuda-vectorAdd -L$CUPLA_ROOT/lib -lcupla-cuda

Using the serial backend on the CPU

cd $BASE
$CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o serial-vectorAdd -L$CUPLA_ROOT/lib -lcupla-serial -lpthread

Using the TBB backend on the CPU

cd $BASE
$CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o tbb-vectorAdd -L$CUPLA_ROOT/lib -lcupla-tbb -lpthread


fwyzard commented Apr 18, 2019

Here is a first reimplementation of the CUDA kernel with CUPLA, following the instructions in the porting guide.
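
For context, the gist of the porting guide is that the host code keeps the familiar CUDA API names (via cuda_to_cupla.hpp), while each __global__ kernel becomes a functor whose templated operator() takes the accelerator as its first argument. A minimal sketch of the pattern, with an illustrative kernel body rather than the actual unpacker code:

#include <cuda_to_cupla.hpp>

// CUDA: __global__ void add(const float *in, float *out, int n)
struct add {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const & acc, const float *in, float *out, int n) const {
    // blockIdx, blockDim and threadIdx are mapped to the accelerator by cupla
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] += in[i];
  }
};

// CUDA: add<<<blocks, threads>>>(d_in, d_out, n)
// CUPLA:
//   CUPLA_KERNEL(add)(blocks, threads, 0, 0)(d_in, d_out, n);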


fwyzard commented Apr 18, 2019

It seems that alpaka#23 and cupla#12 would be blockers for the use of Alpaka and Cupla in CMSSW.


fwyzard commented Apr 18, 2019

Here is an optimised version of the cupla implementation, following the discussion in the tuning guide and adding support for the TBB backend.
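
For reference, the main change suggested by the tuning guide is to let each thread process a run of consecutive elements (one per thread on the GPU, many per thread on the CPU backends), looping over cupla's elemDim inside the kernel. A rough sketch of the pattern, again with an illustrative body rather than the actual unpacker code:

struct process {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const & acc, float *data, int n) const {
    // each thread owns elemDim.x consecutive elements (1 on the GPU backend)
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * elemDim.x;
    for (int i = first; i < first + (int) elemDim.x && i < n; ++i)
      data[i] *= 2.f;
  }
};

// CUPLA_KERNEL_OPTI(process)(blocks, threads, 0, 0)(d_data, n);
// if I read the guide correctly, the _OPTI launch maps the requested threads
// to elements on the CPU backends, so one CPU thread runs the whole loop.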

With these changes, I get the following performance on my laptop (Intel Core i7-6700HQ 2.60GHz, NVIDIA GeForce GTX 960M):

$ ./main-naive
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 2242.15 us
$ ./main-cupla-serial
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 2418.56 us
$ taskset -c 0 ./main-cupla-tbb
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 2349.14 us
$ taskset -c 0,1,2,3 ./main-cupla-tbb
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 799.58 us
$ cudarun ./main-cuda
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 3229.69 us
$ cudarun ./main-cupla-cuda
Got 53376 for cabling, wordCounter 36328
Output: 1699 modules in 3233.61 us

On the CPU, the serial and (single-core) TBB versions are only 5-8% slower than the naïve implementation; the TBB version running on four cores is roughly 3 times faster.

On the GPU side, using cupla's CUDA backend does not seem to introduce any significant overhead over the native CUDA version.

makortel commented Apr 18, 2019

It seems that ComputationalRadiationPhysics/cupla#12 could be a blocker for the use of CUPLA in CMSSW.

On the other hand, we are primarily looking at making the kernel code portable (right?), so the EDProducer classes could still call cudaSetDevice(), cudaMalloc(), etc. directly.

(although it would be an interesting exercise to think of the consequences of making the EDProducers and hence(?) EDProducts "portable" as well)


fwyzard commented Apr 18, 2019

Good point... I will ask what, exactly, is not thread safe.


fwyzard commented Dec 28, 2019

Some more comments about Alpaka and Cupla.

Alpaka

  • the syntax to define the "work division" (the equivalent of the CUDA grid) is atrocious; the equivalent of CUDA:

    dim3 blockSize {1, 1, 1};
    dim3 gridSize {4, 8, 16};

    in helloWorld.cpp is:

    using Vec = alpaka::vec::Vec<Dim, Idx>;
    Vec const elementsPerThread(Vec::all(static_cast<Idx>(1)));
    Vec const threadsPerBlock(Vec::all(static_cast<Idx>(1)));
    Vec const blocksPerGrid(
        static_cast<Idx>(4),
        static_cast<Idx>(8),
        static_cast<Idx>(16));
    
    using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>;
    WorkDiv const workDiv(
        blocksPerGrid,
        threadsPerBlock,
        elementsPerThread);
  • the syntax to use the work division (i.e. get the number of blocks and threads) is just as convoluted, as the sketch below shows
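
    For illustration, reading the same information back inside a kernel looks like this in the 0.3.x API (quoting the helloWorld example from memory, so take the exact namespaces as approximate):

    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const & acc) const {
        // CUDA: blockIdx/threadIdx and gridDim/blockDim
        auto const globalThreadIdx =
            alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc);
        auto const globalThreadExtent =
            alpaka::workdiv::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc);
        // flatten the 3-dimensional index into a linear one
        auto const linearIdx =
            alpaka::idx::mapIdx<1u>(globalThreadIdx, globalThreadExtent)[0u];
    }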

Cupla

  • the lack of thread-safety in host code is a blocker (see e.g. cupla#12 and cupla#121)
