[GraphBolt] Add optimized unique_and_compact_batched. #7239

Merged on Apr 7, 2024 (42 commits)
Changes from 37 commits

Commits
a2d8d39
[GraphBolt][CUDA] Add batched unique_and_compact API.
mfbalin Mar 24, 2024
8a7610b
add python binding
mfbalin Mar 24, 2024
1e9c1cb
take back debug dispatch removal.
mfbalin Mar 24, 2024
84f65fe
use the batched API from python.
mfbalin Mar 24, 2024
eee35d6
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Mar 26, 2024
1c19c06
add actual map based batched implementation.
mfbalin Mar 28, 2024
82d52eb
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Mar 28, 2024
cbea9e6
implement feature as it was added in torch 2.1 release.
mfbalin Mar 28, 2024
1ae18e1
avoid using `torch::Tensor::to` as it synchronizes.
mfbalin Mar 28, 2024
3fd7efd
Make the CI pass by dropping support for old CUDA architectures from …
mfbalin Mar 31, 2024
100238f
Properly filter cuda architectures.
mfbalin Mar 31, 2024
96582da
Make it so that users can compile graphbolt for older CUDA architectu…
mfbalin Mar 31, 2024
aa7053d
use CCCL macros instead of libcudacxx
mfbalin Mar 31, 2024
e858af3
use most reliable way to check msvc, `_MSC_VER`.
mfbalin Mar 31, 2024
6ad3044
better way of handling filtering.
mfbalin Mar 31, 2024
b00ede2
seperate map implementation to a different file for better code organ…
mfbalin Apr 1, 2024
6f8d7e6
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Apr 1, 2024
926d41e
add missing `;`.
mfbalin Apr 1, 2024
b5d9696
refactor the common stuff into a separate header.
mfbalin Apr 1, 2024
36e570d
address reviews.
mfbalin Apr 1, 2024
c864ef9
fix the linker bug.
mfbalin Apr 1, 2024
520e547
Solve the old architecture problem by creating a cuda extensions libr…
mfbalin Apr 2, 2024
21dc5c5
make the newly added library static.
mfbalin Apr 2, 2024
e0bda01
suppress warning on cmake.
mfbalin Apr 2, 2024
ab2f2c5
add diagnostic compiler messages.
mfbalin Apr 2, 2024
6d9fddc
print single valued defines.
mfbalin Apr 2, 2024
7de401b
fix the issue in the CI about `TORCH_CUDA_ARCH_LIST`
mfbalin Apr 2, 2024
a4946ab
Update CCCL to 2.4.0 as 2.3.0 has a bug.
mfbalin Apr 2, 2024
0d40df9
Update CCCL to 2.3.2 instead.
mfbalin Apr 2, 2024
be951a8
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Apr 2, 2024
b6d4c73
add comments in CMake and build script.
mfbalin Apr 2, 2024
5f51a15
add comment about node_id_bits.
mfbalin Apr 2, 2024
4589fea
add more comments.
mfbalin Apr 2, 2024
49a55ec
clarify comment.
mfbalin Apr 2, 2024
1bdf3ac
fix minor typo.
mfbalin Apr 2, 2024
da15251
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Apr 2, 2024
b930b0f
minor code style change in how map constructed.
mfbalin Apr 2, 2024
7886f4b
Add explanation on the difference between map based and sort based al…
mfbalin Apr 3, 2024
899d30e
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Apr 3, 2024
85a1cf2
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Apr 7, 2024
2b22578
Merge branch 'master' into gb_batched_unique_and_compact
mfbalin Apr 7, 2024
8a547ca
address reviews.
mfbalin Apr 7, 2024
3 changes: 3 additions & 0 deletions .gitmodules
@@ -28,3 +28,6 @@
[submodule "third_party/liburing"]
path = third_party/liburing
url = https://github.com/axboe/liburing.git
[submodule "third_party/cuco"]
path = third_party/cuco
url = https://github.com/NVIDIA/cuCollections.git
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -590,5 +590,5 @@ if(BUILD_GRAPHBOLT)
endif(USE_CUDA)
if(CMAKE_SYSTEM_NAME MATCHES "Linux")
add_dependencies(graphbolt liburing)
endif(USE_CUDA)
endif()
endif(BUILD_GRAPHBOLT)
37 changes: 32 additions & 5 deletions graphbolt/CMakeLists.txt
@@ -58,10 +58,19 @@ if(USE_CUDA)
if(DEFINED ENV{CUDAARCHS})
set(CMAKE_CUDA_ARCHITECTURES $ENV{CUDAARCHS})
endif()
set(CMAKE_CUDA_ARCHITECTURES_FILTERED ${CMAKE_CUDA_ARCHITECTURES})
# CUDA extension supports only sm_70 and up (Volta+).
list(FILTER CMAKE_CUDA_ARCHITECTURES_FILTERED EXCLUDE REGEX "[2-6][0-9]")
list(LENGTH CMAKE_CUDA_ARCHITECTURES_FILTERED CMAKE_CUDA_ARCHITECTURES_FILTERED_LEN)
if(CMAKE_CUDA_ARCHITECTURES_FILTERED_LEN EQUAL 0)
# Build the CUDA extension at least for Volta.
set(CMAKE_CUDA_ARCHITECTURES_FILTERED "70")
endif()
set(LIB_GRAPHBOLT_CUDA_NAME "${LIB_GRAPHBOLT_NAME}_cuda")
endif()

add_library(${LIB_GRAPHBOLT_NAME} SHARED ${BOLT_SRC} ${BOLT_HEADERS})
target_include_directories(${LIB_GRAPHBOLT_NAME} PRIVATE ${BOLT_DIR}
include_directories(BEFORE ${BOLT_DIR}
${BOLT_HEADERS}
"../third_party/dmlc-core/include"
"../third_party/pcg/include")
@@ -73,12 +82,25 @@ if(CMAKE_SYSTEM_NAME MATCHES "Linux")
endif()

if(USE_CUDA)
file(GLOB BOLT_CUDA_EXTENSION_SRC
${BOLT_DIR}/cuda/extension/*.cu
${BOLT_DIR}/cuda/extension/*.cc
)
# Until https://github.com/NVIDIA/cccl/issues/1083 is resolved, we need to
# compile the cuda/extension folder with Volta+ CUDA architectures.
add_library(${LIB_GRAPHBOLT_CUDA_NAME} STATIC ${BOLT_CUDA_EXTENSION_SRC} ${BOLT_HEADERS})
target_link_libraries(${LIB_GRAPHBOLT_CUDA_NAME} "${TORCH_LIBRARIES}")

set_target_properties(${LIB_GRAPHBOLT_NAME} PROPERTIES CUDA_STANDARD 17)
set_target_properties(${LIB_GRAPHBOLT_CUDA_NAME} PROPERTIES CUDA_STANDARD 17)
set_target_properties(${LIB_GRAPHBOLT_CUDA_NAME} PROPERTIES CUDA_ARCHITECTURES "${CMAKE_CUDA_ARCHITECTURES_FILTERED}")
set_target_properties(${LIB_GRAPHBOLT_CUDA_NAME} PROPERTIES POSITION_INDEPENDENT_CODE TRUE)
message(STATUS "Use external CCCL library for a consistent API and performance for graphbolt.")
target_include_directories(${LIB_GRAPHBOLT_NAME} PRIVATE
"../third_party/cccl/thrust"
"../third_party/cccl/cub"
"../third_party/cccl/libcudacxx/include")
include_directories(BEFORE
"../third_party/cccl/thrust"
"../third_party/cccl/cub"
"../third_party/cccl/libcudacxx/include"
"../third_party/cuco/include")

message(STATUS "Use HugeCTR gpu_cache for graphbolt with INCLUDE_DIRS $ENV{GPU_CACHE_INCLUDE_DIRS}.")
target_include_directories(${LIB_GRAPHBOLT_NAME} PRIVATE $ENV{GPU_CACHE_INCLUDE_DIRS})
@@ -87,6 +109,11 @@ if(USE_CUDA)

get_property(archs TARGET ${LIB_GRAPHBOLT_NAME} PROPERTY CUDA_ARCHITECTURES)
message(STATUS "CUDA_ARCHITECTURES for graphbolt: ${archs}")

get_property(archs TARGET ${LIB_GRAPHBOLT_CUDA_NAME} PROPERTY CUDA_ARCHITECTURES)
message(STATUS "CUDA_ARCHITECTURES for graphbolt extension: ${archs}")

target_link_libraries(${LIB_GRAPHBOLT_NAME} ${LIB_GRAPHBOLT_CUDA_NAME})
endif()

# The Torch CMake configuration only sets up the path for the MKL library when
4 changes: 2 additions & 2 deletions graphbolt/build.bat
@@ -11,7 +11,7 @@ IF x%1x == xx GOTO single

FOR %%X IN (%*) DO (
DEL /S /Q *
"%CMAKE_COMMAND%" -DGPU_CACHE_BUILD_DIR=%BINDIR% -DCMAKE_CONFIGURATION_TYPES=Release -DPYTHON_INTERP=%%X .. -G "Visual Studio 16 2019" || EXIT /B 1
"%CMAKE_COMMAND%" -DGPU_CACHE_BUILD_DIR=%BINDIR% -DCMAKE_CONFIGURATION_TYPES=Release -DPYTHON_INTERP=%%X -DTORCH_CUDA_ARCH_LIST=Volta .. -G "Visual Studio 16 2019" || EXIT /B 1
msbuild graphbolt.sln /m /nr:false || EXIT /B 1
COPY /Y Release\*.dll "%BINDIR%\graphbolt" || EXIT /B 1
)
@@ -21,7 +21,7 @@ GOTO end
:single

DEL /S /Q *
"%CMAKE_COMMAND%" -DGPU_CACHE_BUILD_DIR=%BINDIR% -DCMAKE_CONFIGURATION_TYPES=Release .. -G "Visual Studio 16 2019" || EXIT /B 1
"%CMAKE_COMMAND%" -DGPU_CACHE_BUILD_DIR=%BINDIR% -DCMAKE_CONFIGURATION_TYPES=Release -DTORCH_CUDA_ARCH_LIST=Volta .. -G "Visual Studio 16 2019" || EXIT /B 1
msbuild graphbolt.sln /m /nr:false || EXIT /B 1
COPY /Y Release\*.dll "%BINDIR%\graphbolt" || EXIT /B 1

6 changes: 5 additions & 1 deletion graphbolt/build.sh
@@ -12,7 +12,11 @@ else
CPSOURCE=*.so
fi

CMAKE_FLAGS="-DCUDA_TOOLKIT_ROOT_DIR=$CUDA_TOOLKIT_ROOT_DIR -DUSE_CUDA=$USE_CUDA -DGPU_CACHE_BUILD_DIR=$BINDIR"
# We build for the same architectures as DGL, thus we hardcode
# TORCH_CUDA_ARCH_LIST and we need to at least compile for Volta. Until
# https://github.com/NVIDIA/cccl/issues/1083 is resolved, we need to compile the
# cuda/extension folder with Volta+ CUDA architectures.
CMAKE_FLAGS="-DCUDA_TOOLKIT_ROOT_DIR=$CUDA_TOOLKIT_ROOT_DIR -DUSE_CUDA=$USE_CUDA -DGPU_CACHE_BUILD_DIR=$BINDIR -DTORCH_CUDA_ARCH_LIST=Volta"
echo $CMAKE_FLAGS

if [ $# -eq 0 ]; then
11 changes: 11 additions & 0 deletions graphbolt/include/graphbolt/cuda_ops.h
@@ -221,6 +221,17 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> UniqueAndCompact(
const torch::Tensor src_ids, const torch::Tensor dst_ids,
const torch::Tensor unique_dst_ids, int num_bits = 0);

/**
* @brief Batched version of UniqueAndCompact. The ith element of the return
* value is equal to the result of passing the ith elements of the input
* arguments to UniqueAndCompact.
*/
std::vector<std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>>
UniqueAndCompactBatched(
const std::vector<torch::Tensor>& src_ids,
const std::vector<torch::Tensor>& dst_ids,
const std::vector<torch::Tensor>& unique_dst_ids, int num_bits = 0);

} // namespace ops
} // namespace graphbolt

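For intuition, the following host-side sketch spells out the contract stated in the doc comment above: the batched entry point should return, element by element, what the per-pair UniqueAndCompact returns. The helper name ReferenceBatched and the loop are illustrative only; the optimized implementation added by this PR processes all batches together on the GPU.

// Illustrative reference only, not the implementation from this PR.
#include <graphbolt/cuda_ops.h>

#include <cstddef>
#include <tuple>
#include <vector>

std::vector<std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>>
ReferenceBatched(
    const std::vector<torch::Tensor>& src_ids,
    const std::vector<torch::Tensor>& dst_ids,
    const std::vector<torch::Tensor>& unique_dst_ids, int num_bits = 0) {
  std::vector<std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>> results;
  results.reserve(src_ids.size());
  for (std::size_t i = 0; i < src_ids.size(); ++i) {
    // The ith element of the batched result matches the single-pair call.
    results.push_back(graphbolt::ops::UniqueAndCompact(
        src_ids[i], dst_ids[i], unique_dst_ids[i], num_bits));
  }
  return results;
}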
6 changes: 6 additions & 0 deletions graphbolt/include/graphbolt/unique_and_compact.h
@@ -50,6 +50,12 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> UniqueAndCompact(
const torch::Tensor& src_ids, const torch::Tensor& dst_ids,
const torch::Tensor unique_dst_ids);

std::vector<std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>>
UniqueAndCompactBatched(
const std::vector<torch::Tensor>& src_ids,
const std::vector<torch::Tensor>& dst_ids,
const std::vector<torch::Tensor> unique_dst_ids);

} // namespace sampling
} // namespace graphbolt

29 changes: 26 additions & 3 deletions graphbolt/src/cuda/common.h
@@ -11,6 +11,7 @@
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAException.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <torch/script.h>

@@ -38,12 +39,17 @@ namespace cuda {
*
* int_array.get() gives the raw pointer.
*/
template <typename value_t = char>
struct CUDAWorkspaceAllocator {
static_assert(sizeof(char) == 1, "sizeof(char) == 1 should hold.");
// Required by thrust to satisfy allocator requirements.
using value_type = char;
using value_type = value_t;

explicit CUDAWorkspaceAllocator() { at::globalContext().lazyInitCUDA(); }

template <class U>
CUDAWorkspaceAllocator(CUDAWorkspaceAllocator<U> const&) noexcept {}

CUDAWorkspaceAllocator& operator=(const CUDAWorkspaceAllocator&) = default;

void operator()(void* ptr) const {
@@ -53,7 +59,7 @@
// Required by thrust to satisfy allocator requirements.
value_type* allocate(std::ptrdiff_t size) const {
return reinterpret_cast<value_type*>(
c10::cuda::CUDACachingAllocator::raw_alloc(size));
c10::cuda::CUDACachingAllocator::raw_alloc(size * sizeof(value_type)));
}

// Required by thrust to satisfy allocator requirements.
@@ -63,7 +69,9 @@
std::unique_ptr<T, CUDAWorkspaceAllocator> AllocateStorage(
std::size_t size) const {
return std::unique_ptr<T, CUDAWorkspaceAllocator>(
reinterpret_cast<T*>(allocate(sizeof(T) * size)), *this);
reinterpret_cast<T*>(
c10::cuda::CUDACachingAllocator::raw_alloc(sizeof(T) * size)),
*this);
}
};
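A sketch of what the newly templated allocator enables: it can now act as a Thrust execution-policy allocator, which is what the value_t parameter and the rebinding copy constructor are for. The function name, include path, and the int64_t key type below are assumptions for illustration, not code from this PR.

// Hypothetical usage sketch: temporary storage requested by Thrust is served
// from PyTorch's caching allocator through CUDAWorkspaceAllocator instead of
// ad-hoc cudaMalloc calls.
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

#include "common.h"

void SortKeysOnCurrentStream(torch::Tensor keys) {
  graphbolt::cuda::CUDAWorkspaceAllocator<> allocator;
  auto stream = c10::cuda::getCurrentCUDAStream();
  // Thrust rebinds the allocator as needed via the templated copy constructor.
  auto policy = thrust::cuda::par(allocator).on(stream);
  thrust::sort(
      policy, keys.data_ptr<int64_t>(),
      keys.data_ptr<int64_t>() + keys.numel());
}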

@@ -81,6 +89,21 @@
return size.x == 0 || size.y == 0 || size.z == 0;
}

#define CUDA_DRIVER_CHECK(EXPR) \
do { \
CUresult __err = EXPR; \
if (__err != CUDA_SUCCESS) { \
const char* err_str; \
CUresult get_error_str_err C10_UNUSED = \
cuGetErrorString(__err, &err_str); \
if (get_error_str_err != CUDA_SUCCESS) { \
AT_ERROR("CUDA driver error: unknown error"); \
} else { \
AT_ERROR("CUDA driver error: ", err_str); \
} \
} \
} while (0)

#define CUDA_CALL(func) C10_CUDA_CHECK((func))

#define CUDA_KERNEL_CALL(kernel, nblks, nthrs, shmem, ...) \
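A small usage sketch for the new CUDA_DRIVER_CHECK macro. The helper function is hypothetical and assumes common.h is included; cuMemGetInfo is a standard CUresult-returning driver-API call.

// Any CUresult-returning driver call can be wrapped; on failure the macro
// raises a c10 error carrying the driver's error string.
#include <cstddef>
#include <cuda.h>

size_t QueryFreeDeviceMemory() {
  size_t free_bytes = 0, total_bytes = 0;
  CUDA_DRIVER_CHECK(cuMemGetInfo(&free_bytes, &total_bytes));
  return free_bytes;
}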
26 changes: 26 additions & 0 deletions graphbolt/src/cuda/extension/unique_and_compact.h
@@ -0,0 +1,26 @@
/**
* Copyright (c) 2023, GT-TDAlab (Muhammed Fatih Balin & Umit V. Catalyurek)
* @file cuda/unique_and_compact.h
* @brief Unique and compact operator utilities on CUDA using hash table.
*/

#ifndef GRAPHBOLT_CUDA_UNIQUE_AND_COMPACT_H_
#define GRAPHBOLT_CUDA_UNIQUE_AND_COMPACT_H_

#include <torch/script.h>

#include <vector>

namespace graphbolt {
namespace ops {

std::vector<std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> >
UniqueAndCompactBatchedMap(
const std::vector<torch::Tensor>& src_ids,
const std::vector<torch::Tensor>& dst_ids,
const std::vector<torch::Tensor>& unique_dst_ids);

} // namespace ops
} // namespace graphbolt

#endif // GRAPHBOLT_CUDA_UNIQUE_AND_COMPACT_H_
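UniqueAndCompactBatchedMap, declared above, is the hash-map based path this PR adds (backed by a GPU hash table from cuco). As a rough illustration of the idea only, here is a host-side, single-batch sketch using std::unordered_map. It assumes the usual relabeling semantics, where unique_dst_ids keep their positions at the front of the unique id set and dst_ids already appear among unique_dst_ids; it is not the CUDA implementation.

#include <cstdint>
#include <tuple>
#include <unordered_map>
#include <vector>

// Host-side sketch of map-based unique-and-compact for one batch. The real
// kernel builds a GPU hash table and handles all batches concurrently.
std::tuple<std::vector<int64_t>, std::vector<int64_t>, std::vector<int64_t>>
UniqueAndCompactWithMapSketch(
    const std::vector<int64_t>& src_ids, const std::vector<int64_t>& dst_ids,
    const std::vector<int64_t>& unique_dst_ids) {
  std::unordered_map<int64_t, int64_t> to_new_id;
  std::vector<int64_t> unique_ids;
  auto assign = [&](int64_t id) {
    auto [it, inserted] = to_new_id.emplace(id, unique_ids.size());
    if (inserted) unique_ids.push_back(id);
    return it->second;
  };
  // Destination ids are assigned first so they keep their positions.
  for (auto id : unique_dst_ids) assign(id);
  std::vector<int64_t> compacted_src, compacted_dst;
  for (auto id : src_ids) compacted_src.push_back(assign(id));
  // dst_ids are assumed to already be present among unique_dst_ids.
  for (auto id : dst_ids) compacted_dst.push_back(to_new_id.at(id));
  return {unique_ids, compacted_src, compacted_dst};
}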