
[Ansor] Improve OpenCL support #10108

Merged 4 commits on Feb 1, 2022. Showing changes from 2 commits.
apps/topi_recipe/gemm/cuda_gemm_square.py (0 additions, 21 deletions)

@@ -27,27 +27,6 @@
 USE_MANUAL_CODE = False


-@tvm.register_func("tvm_callback_cuda_compile", override=True)
-def tvm_callback_cuda_compile(code):
-    ptx = nvcc.compile_cuda(code, target_format="ptx")
-    return ptx
-
-
-def write_code(code, fname):
-    with open(fname, "w") as f:
-        f.write(code)
-
-
-@tvm.register_func
-def tvm_callback_cuda_postproc(code):
-    if not os.path.exists("perf"):
-        os.mkdir("perf")
-    write_code(code, "perf/%s_generated.cu" % TASK)
-    if USE_MANUAL_CODE:
-        code = open("perf/%s_manual.cu" % TASK).read()
-    return code
-
-

Member Author:

This code is stale and blocks this script from running. We don't need it anymore, so it's better to just remove it.

 def test_gemm():
     # graph
     nn = 2048
src/auto_scheduler/search_task.cc (25 additions, 2 deletions)

@@ -104,8 +104,31 @@ HardwareParams HardwareParamsNode::GetDefaultHardwareParams(const Target& target
                           max_threads_per_block, max_vthread_extent, warp_size);
     } else {
       // add other opencl target
-      auto target_device = target->GetAttr<String>("device", "");
-      LOG(FATAL) << "No default hardware parameters for opencl target device: " << target_device;
+      auto dev = Device{static_cast<DLDeviceType>(device_type), 0};
+      auto device_name = "device_api.opencl";
+      auto func = tvm::runtime::Registry::Get(device_name);
+      ICHECK(func != nullptr) << "Cannot find OpenCL device_api in registry";
+      auto device_api = static_cast<tvm::runtime::DeviceAPI*>(((*func)()).operator void*());
+
+      tvm::runtime::TVMRetValue ret;
+      device_api->GetAttr(dev, tvm::runtime::DeviceAttrKind::kMaxSharedMemoryPerBlock, &ret);
+      int max_shared_memory_per_block = ret;
+
+      int max_local_memory_per_block = INT32_MAX;
+
+      device_api->GetAttr(dev, tvm::runtime::DeviceAttrKind::kMaxThreadsPerBlock, &ret);
+      int max_threads_per_block = ret;
+
+      device_api->GetAttr(dev, tvm::runtime::DeviceAttrKind::kWarpSize, &ret);
+      int warp_size = ret;
+
+      if (warp_size == 1) {
+        LOG(WARNING) << "The warp size is 1; tuning might crash or get stuck.";
+      }
+
+      int max_vthread_extent = warp_size / 4;
FrozenGene (Member) commented Feb 9, 2022:

Sorry, I just came back from vacation. I want to check max_vthread_extent here. As I wrote in the tutorial https://github.com/apache/tvm/blob/main/gallery/how_to/tune_with_autoscheduler/tune_network_mali.py#L188-L194: max_vthread_extent = int(dev.warp_size / 4) if int(dev.warp_size / 4) > 1 else dev.warp_size. If warp_size is 1, the current code makes max_vthread_extent 0. Previous experiments show tuning will get stuck or crash. @masahi
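The formula FrozenGene quotes and the hunk's plain warp_size / 4 diverge exactly when warp_size is 1. A standalone sketch of the two expressions (plain Python for illustration, not TVM code):

```python
def vthread_extent_naive(warp_size: int) -> int:
    # The expression in this hunk: warp_size / 4 (integer division in C++).
    # Evaluates to 0 when warp_size == 1.
    return warp_size // 4


def vthread_extent_tutorial(warp_size: int) -> int:
    # The formula from tune_network_mali.py: fall back to warp_size itself
    # when warp_size // 4 is not greater than 1.
    q = warp_size // 4
    return q if q > 1 else warp_size


print(vthread_extent_naive(1), vthread_extent_tutorial(1))    # 0 1
print(vthread_extent_naive(32), vthread_extent_tutorial(32))  # 8 8
```

With warp_size == 1 the naive expression yields a zero vthread extent, which is the condition the reviewers link to tuning getting stuck or crashing.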

Member:

I think we should do it like Vulkan: int max_vthread_extent = std::max(1, warp_size / 4); (see https://github.com/apache/tvm/blob/main/src/auto_scheduler/search_task.cc#L153). @masahi
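A Python analogue of the suggested clamp, showing the extent never reaches 0 (a sketch of the idea, not the actual C++ change):

```python
def vthread_extent_clamped(warp_size: int) -> int:
    # Python analogue of the suggested C++: std::max(1, warp_size / 4).
    return max(1, warp_size // 4)


for ws in (1, 2, 4, 32, 64):
    assert vthread_extent_clamped(ws) >= 1
print(vthread_extent_clamped(1), vthread_extent_clamped(32))  # 1 8
```

Note that for warp_size == 2 this clamp gives 1 while the tutorial formula falls back to 2; both variants avoid the zero extent that stalls tuning.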

+      return HardwareParams(-1, 16, 64, max_shared_memory_per_block, max_local_memory_per_block,
+                            max_threads_per_block, max_vthread_extent, warp_size);
     }
   } else if (device_type == kDLVulkan) {
     auto dev = Device{static_cast<DLDeviceType>(device_type), 0};
src/runtime/opencl/opencl_device_api.cc (3 additions, 1 deletion)

@@ -20,6 +20,7 @@
 /*!
  * \file opencl_device_api.cc
  */
+#include <dmlc/parameter.h>
 #include <dmlc/thread_local.h>
 #include <tvm/runtime/registry.h>

@@ -122,7 +123,8 @@ void OpenCLWorkspace::GetAttr(Device dev, DeviceAttrKind kind, TVMRetValue* rv)
         corresponding to the number of SIMD entries the hardware configures.
         We need to figure out a way to query this information from the hardware.
       */
-      *rv = 1;
+      const int warp_size = dmlc::GetEnv("TVM_OPENCL_WARP_SIZE", 1);
Contributor:

Although I don't really like environment variables, since they create side effects, I don't have a better solution, just as the TODO above mentions. Maybe that's it for now.
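The dmlc::GetEnv call in the hunk reads an integer override from the environment and falls back to a default. A rough Python analogue (the helper name is hypothetical, for illustration only):

```python
import os


def get_env_int(name: str, default: int) -> int:
    # Mimics dmlc::GetEnv<int>: use the environment value if it is set,
    # otherwise fall back to the given default.
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default


print(get_env_int("TVM_OPENCL_WARP_SIZE", 1))  # 1, unless the variable is already set
os.environ["TVM_OPENCL_WARP_SIZE"] = "32"
print(get_env_int("TVM_OPENCL_WARP_SIZE", 1))  # 32
```

The default of 1 preserves the old behavior for users who never set the variable, which is why the change is backward compatible.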

Member Author:

Our Vulkan backend has a better solution: it uses a Vulkan API function to query the warp size on a given HW. However, I couldn't find such an API in OpenCL, for some reason. clGetKernelSubGroupInfoKHR, described in https://github.com/KhronosGroup/OpenCL-Docs/blob/master/ext/cl_khr_subgroups.asciidoc, looks closest, but it requires a compiled kernel as an argument, which I find strange since we want the warp size information in order to write or generate a kernel.

Contributor:

Understood. It's reasonable that the desired API is not always available. A better solution, I think, would be to expose this option as a hardware parameter in the tuning options instead of an environment variable. For example, the default warp size of OpenCL devices would always be 1 unless the user provides a warp size in the hardware parameters.

Member Author:

That reminds me that it is already possible to set HW params from a Python script: https://github.com/apache/tvm/blob/main/gallery/how_to/tune_with_autoscheduler/tune_network_mali.py#L188-L194

So in practice, this patch might not be necessary. But since the possibility of manually specifying HW params is not well known and is cumbersome anyway, I still want to land this PR. What do you think?

Contributor:

Hmm, I agree with this change, but for a different reason. If we just look at the device API without Ansor, this change could be the only general workaround for OpenCL devices. Specifically, any place in TVM may query the device API to get the warp size, so setting the default warp size to 1 only in Ansor might not be a general solution.

+      *rv = warp_size;
       break;
     }
     case kMaxSharedMemoryPerBlock: {