[UR][L0] Unify use of large allocation in L0 adapter
Intel(R) GPUs have two modes of operation in terms of allocations:
Stateful and stateless mode.

Stateful mode optimizes memory accesses through pointer arithmetic.
This can be done as long as the allocations used by the kernel
are smaller than 4GB.

Stateless disables such pointer-arithmetic optimization to
allow the kernel to use allocations larger than 4GB.

Currently, the L0 adapter dynamically and automatically requests
large allocations from the L0 driver if it detects that an allocation
is larger than 4GB. This creates a problem if a kernel has been
previously compiled for stateful access. Ultimately, the adapter
ends up mixing stateful and stateless behavior, which is not
a user-friendly experience.

This patch corrects this behavior by defining a default mode.
On Intel(R) GPUs prior to Intel(R) Data Center GPU Max, the
default behavior is now stateful, meaning only small allocations
are allowed and any allocation larger than 4GB fails. Users
can opt in to stateless mode by setting a new environment variable,
UR_L0_ALLOW_LARGE_ALLOCATIONS.

Intel(R) Data Center GPU Max uses stateless mode by default.
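
The selection rule described above can be sketched as a small pure function. This is a minimal illustration, not the adapter's code: resolveLargeAllocations and its parameters are hypothetical names, and the std::atoi-based parsing is taken from the useLargeAllocations() change in this commit's device.cpp diff.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper mirroring the rule in the commit message.
// EnvValue is the raw value of UR_L0_ALLOW_LARGE_ALLOCATIONS, or nullptr
// when the variable is unset.
inline bool resolveLargeAllocations(const char *EnvValue,
                                    bool IsDataCenterGpuMax) {
  if (EnvValue == nullptr)
    return IsDataCenterGpuMax; // default: stateless only on Data Center GPU Max
  // std::atoi returns 0 for "0" and for non-numeric text, so only a
  // non-zero integer value enables large allocations.
  return std::atoi(EnvValue) != 0;
}

// Small usage example covering the default and the explicit override.
inline void exampleResolve() {
  assert(resolveLargeAllocations(nullptr, true));   // unset, Data Center GPU Max
  assert(!resolveLargeAllocations(nullptr, false)); // unset, earlier GPU
  assert(resolveLargeAllocations("1", false));      // explicit opt-in
  assert(!resolveLargeAllocations("0", true));      // explicit opt-out
}
```

Under this rule, a value such as "yes" parses to 0 and therefore selects stateful mode, so a numeric value like 1 is the safer way to opt in.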

Addresses:
https://stackoverflow.com/questions/75621264/sycl-dot-product-code-gives-wrong-results

Signed-off-by: Jaime Arteaga <jaime.a.arteaga.molina@intel.com>
Jaime Arteaga committed Nov 22, 2023
1 parent 109ed46 commit 3482cdf
Showing 4 changed files with 78 additions and 20 deletions.
53 changes: 37 additions & 16 deletions source/adapters/level_zero/device.cpp
@@ -88,6 +88,24 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGet(
return UR_RESULT_SUCCESS;
}

inline uint64_t getGlobalMemSize(ur_device_handle_t Device) {
uint64_t GlobalMemSize = 0;
// Support to read physicalSize depends on kernel,
// so fallback into reading totalSize if physicalSize
// is not available.
for (const auto &ZeDeviceMemoryExtProperty :
Device->ZeDeviceMemoryProperties->second) {
GlobalMemSize += ZeDeviceMemoryExtProperty.physicalSize;
}
if (GlobalMemSize == 0) {
for (const auto &ZeDeviceMemoryProperty :
Device->ZeDeviceMemoryProperties->first) {
GlobalMemSize += ZeDeviceMemoryProperty.totalSize;
}
}
return GlobalMemSize;
}

UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(
ur_device_handle_t Device, ///< [in] handle of the device instance
ur_device_info_t ParamName, ///< [in] type of the info to retrieve
@@ -249,23 +267,15 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(
return ReturnValue(uint32_t{64});
}
case UR_DEVICE_INFO_MAX_MEM_ALLOC_SIZE:
return ReturnValue(uint64_t{Device->ZeDeviceProperties->maxMemAllocSize});
case UR_DEVICE_INFO_GLOBAL_MEM_SIZE: {
uint64_t GlobalMemSize = 0;
// Support to read physicalSize depends on kernel,
// so fallback into reading totalSize if physicalSize
// is not available.
for (const auto &ZeDeviceMemoryExtProperty :
Device->ZeDeviceMemoryProperties->second) {
GlobalMemSize += ZeDeviceMemoryExtProperty.physicalSize;
}
if (GlobalMemSize == 0) {
for (const auto &ZeDeviceMemoryProperty :
Device->ZeDeviceMemoryProperties->first) {
GlobalMemSize += ZeDeviceMemoryProperty.totalSize;
}
// if using large allocations, then return total size in the device.
// if not, then return L0's maxMemAllocSize.
if (Device->useLargeAllocations()) {
return ReturnValue(uint64_t{getGlobalMemSize(Device)});
} else {
return ReturnValue(uint64_t{Device->ZeDeviceProperties->maxMemAllocSize});
}
return ReturnValue(uint64_t{GlobalMemSize});
case UR_DEVICE_INFO_GLOBAL_MEM_SIZE: {
return ReturnValue(uint64_t{getGlobalMemSize(Device)});
}
case UR_DEVICE_INFO_LOCAL_MEM_SIZE:
return ReturnValue(
@@ -900,6 +910,17 @@ ur_device_handle_t_::useImmediateCommandLists() {
}
}

bool ur_device_handle_t_::useLargeAllocations() {
static const bool UseLargeAllocations = [this] {
const char *UrRet = std::getenv("UR_L0_ALLOW_LARGE_ALLOCATIONS");
if (!UrRet)
return (this->isPVC() ? true : false);
return std::atoi(UrRet) != 0;
}();

return UseLargeAllocations;
}

ur_result_t ur_device_handle_t_::initialize(int SubSubDeviceOrdinal,
int SubSubDeviceIndex) {
// Maintain various device properties cache.
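
A note on the caching in useLargeAllocations() above: the result is held in a function-local static, and in C++ such a variable is initialized once per process on the first call, not once per device object. The sketch below illustrates that with a hypothetical FakeDevice type (not part of the adapter). On a system that mixes device generations, every device would observe whichever answer the first caller computed; whether that matters in practice depends on the configurations the adapter targets.

```cpp
#include <cassert>

// Hypothetical stand-in for ur_device_handle_t_; only the caching pattern
// is of interest here.
struct FakeDevice {
  bool IsPVC;

  // Same shape as the patch: a function-local static initialized by a
  // lambda that captures `this`. The initializer runs exactly once.
  bool useLargeAllocationsShared() {
    static const bool Cached = [this] { return IsPVC; }();
    return Cached;
  }

  // Per-device alternative: no shared state, each object answers for itself.
  bool useLargeAllocationsPerDevice() const { return IsPVC; }
};

// Returns what a non-PVC device reports after a PVC device asked first.
inline bool secondDeviceSeesFirstAnswer() {
  FakeDevice Pvc{true};
  FakeDevice Other{false};
  const bool First = Pvc.useLargeAllocationsShared(); // initializes the static
  assert(First);
  assert(!Other.useLargeAllocationsPerDevice());
  return Other.useLargeAllocationsShared(); // reuses the first caller's value
}
```

A per-device member set during initialization would be one way to avoid the sharing, at the cost of an extra field on the device handle.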
11 changes: 11 additions & 0 deletions source/adapters/level_zero/device.hpp
@@ -141,6 +141,17 @@ struct ur_device_handle_t_ : _ur_object {
// Returns whether immediate command lists are used on this device.
ImmCmdlistMode ImmCommandListUsed{};

// Returns whether large allocations are being used or not.
// On some Intel GPUs, this influences how kernels are compiled.
// If large allocations (>4GB) are requested, then kernels are
// compiled with stateless access.
// If small allocations (<4GB) are requested, then kernels are
// compiled with stateful access, with potential performance
// improvements.
// Some GPUs support only one mode, such as Intel(R) Data Center GPU Max,
// which supports only stateless.
bool useLargeAllocations();

bool isSubDevice() { return RootDevice != nullptr; }

// Is this a Data Center GPU Max series (aka PVC)?
28 changes: 26 additions & 2 deletions source/adapters/level_zero/program.cpp
@@ -148,9 +148,24 @@ UR_APIEXPORT ur_result_t UR_APICALL urProgramBuildExp(
ZeModuleDesc.format = (hProgram->State == ur_program_handle_t_::IL)
? ZE_MODULE_FORMAT_IL_SPIRV
: ZE_MODULE_FORMAT_NATIVE;

ZeModuleDesc.inputSize = hProgram->CodeLength;
ZeModuleDesc.pInputModule = hProgram->Code.get();
ZeModuleDesc.pBuildFlags = pOptions;

// if large allocations are selected, then pass
// ze-opt-greater-than-4GB-buffer-required to disable
// stateful optimizations and be able to use larger than
// 4GB allocations on these kernels.
std::string ZeBuildOptions{};
if (pOptions) {
ZeBuildOptions += pOptions;
}

if (phDevices[0]->useLargeAllocations()) {
ZeBuildOptions += " -ze-opt-greater-than-4GB-buffer-required";
}

ZeModuleDesc.pBuildFlags = ZeBuildOptions.c_str();
ZeModuleDesc.pConstants = Shim.ze();

ze_device_handle_t ZeDevice = phDevices[0]->ZeDevice;
@@ -234,8 +249,17 @@ UR_APIEXPORT ur_result_t UR_APICALL urProgramCompile(
// This produces better code because the driver can do cross-module
// optimizations. Therefore, we just remember the compilation flags, so we
// can use them later.
if (Options)
if (Options) {
Program->BuildFlags = Options;

// if large allocations are selected, then pass
// ze-opt-greater-than-4GB-buffer-required to disable
// stateful optimizations and be able to use larger than
// 4GB allocations on these kernels.
if (Context->Devices[0]->useLargeAllocations()) {
Program->BuildFlags += " -ze-opt-greater-than-4GB-buffer-required";
}
}
Program->State = ur_program_handle_t_::Object;

return UR_RESULT_SUCCESS;
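
The flag handling in the two program.cpp hunks above can be summarized with a small helper. This is only a sketch: composeBuildFlags is a hypothetical name, and it follows the urProgramBuildExp variant, which appends the flag even when no user options are given (the urProgramCompile hunk appends it only inside the `if (Options)` branch).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: concatenates the user's build options (may be null)
// with the stateless-access flag when large allocations are enabled.
inline std::string composeBuildFlags(const char *UserOptions,
                                     bool UseLargeAllocations) {
  std::string Flags;
  if (UserOptions)
    Flags += UserOptions;
  if (UseLargeAllocations)
    Flags += " -ze-opt-greater-than-4GB-buffer-required";
  return Flags;
}

// Small usage example.
inline void exampleComposeBuildFlags() {
  assert(composeBuildFlags(nullptr, false).empty());
  assert(composeBuildFlags("-g", true) ==
         "-g -ze-opt-greater-than-4GB-buffer-required");
}
```

Because the descriptor stores a raw pointer obtained from c_str(), the std::string that owns the flags has to stay alive until the module has been created, which is why the patch keeps ZeBuildOptions as a local in the same scope.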
6 changes: 4 additions & 2 deletions source/adapters/level_zero/usm.cpp
@@ -179,8 +179,10 @@ static ur_result_t USMDeviceAllocImpl(void **ResultPtr,
ZeDesc.ordinal = 0;

ZeStruct<ze_relaxed_allocation_limits_exp_desc_t> RelaxedDesc;
if (Size > Device->ZeDeviceProperties->maxMemAllocSize) {
// Tell Level-Zero to accept Size > maxMemAllocSize
if (Device->useLargeAllocations() &&
(Size > Device->ZeDeviceProperties->maxMemAllocSize)) {
// Tell Level-Zero to accept Size > maxMemAllocSize if
// large allocations are used.
RelaxedDesc.flags = ZE_RELAXED_ALLOCATION_LIMITS_EXP_FLAG_MAX_SIZE;
ZeDesc.pNext = &RelaxedDesc;
}
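
The new condition in USMDeviceAllocImpl can be read as a two-input predicate. The sketch below uses a hypothetical helper name, needsRelaxedLimits, and made-up sizes to show when the relaxed-limits descriptor would be chained; in stateful mode an oversized request is passed to the driver unchanged, where, per the commit message, it is expected to fail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate matching the patched condition: the relaxed-limits
// descriptor is attached only when large allocations are enabled AND the
// request exceeds the driver-reported maxMemAllocSize.
inline bool needsRelaxedLimits(std::uint64_t Size,
                               std::uint64_t MaxMemAllocSize,
                               bool UseLargeAllocations) {
  return UseLargeAllocations && Size > MaxMemAllocSize;
}

// Small usage example with illustrative sizes (not real device values).
inline void exampleNeedsRelaxedLimits() {
  const std::uint64_t FourGiB = 4ull << 30;
  const std::uint64_t FiveGiB = 5ull << 30;
  assert(needsRelaxedLimits(FiveGiB, FourGiB, true));   // stateless, oversized
  assert(!needsRelaxedLimits(FiveGiB, FourGiB, false)); // stateful, oversized
  assert(!needsRelaxedLimits(FourGiB, FourGiB, true));  // within the limit
}
```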