
Source Code Explanation 2

Oluwatosin Odubanjo edited this page Oct 29, 2022 · 23 revisions

8.0. src/hello_world_usm2.cpp

NOTE: While reading this page, it is advisable to have the source code open in another window. You can find the source code here.

Unlike the first program, which illustrates homogeneous programming, this second program illustrates heterogeneous programming and uses the unified shared memory (USM) style for data and memory management. It illustrates the three modes of USM allocation: malloc_shared, malloc_device and malloc_host.

The program features three dynamically allocated arrays, a, b and c: the sender array, the receiver array and the copy array, respectively.

On a selected device, the data of the sender array, a, is copied into the receiver array, b, using the assignment operator, =. The data of the receiver array, b, is then copied into the copy array, c, using a memory copy operation, and the data of array c is printed on the host.

8.1. Preprocessor Directives

  • THE SYCL HEADER FILE
#include <CL/sycl.hpp>

Line 29 : Includes the SYCL header file, which provides every SYCL construct used in the program. One might wonder why the standard C++ input/output header (iostream) is not explicitly included, since the contents of the copy array are printed. This is because the CL/sycl.hpp header includes other C++ headers (remember, DPC++ is based on C++); so, by including it, we get all of the additional C++ headers it pulls in.

8.2. The main Function

Lines 32 - 144 : The collection of statements that makes up the main function.

8.2.1. Error Handling

Lines 33 - 141 : The try{}catch{} block that catches both synchronous and asynchronous errors. The synchronous error handler, given in lines 33 - 141, catches errors in the host program, while the asynchronous error handler, a user-defined object given in lines 39 - 48, catches device-related errors.

  • Synchronous error handler
try{
     ...

}catch(sycl::exception const &e) {
        std::cout << "Unexpected exception caught during synchronous operation:\n" << e.what() << std::endl;
        std::terminate();
}
  • Asynchronous error handler
 auto async_error_handler = [&] (sycl::exception_list exceptions) {
         for (auto const& e : exceptions) {
                 try{
                         std::rethrow_exception(e);
                 } catch(sycl::exception const& e) {
                         std::cout << "Unexpected exception caught during asynchronous operation:\n" << e.what() << std::endl;
                         std::terminate();
                 }
         }
 };

8.2.2. Device Selection and Information

Lines 51 - 52: Two queues are created to select the host and GPU devices, respectively: line 51 selects the host device, whereas line 52 selects the GPU device via the queue mechanism. Notice also that the asynchronous error handler object is passed to each queue constructor.

        // select device
	sycl::queue queue_host{sycl::host_selector(), async_error_handler};
        sycl::queue queue_gpu{sycl::gpu_selector(), async_error_handler};

Lines 55 - 60: Output the host and GPU device information, respectively.

          std::cout << "HOST DEVICE = "
                    << queue_host.get_device().get_info<sycl::info::device::name>()
                    << '\n' << std::endl;
          std::cout << "GPU DEVICE = "
                    << queue_gpu.get_device().get_info<sycl::info::device::name>()
                    << '\n' << std::endl;

8.2.3. Setting Up Host Storage

Recall: The host code is the driver code for the device code.

Line 36: This is the size of the dynamically allocated arrays a, b and c.

Lines 63 - 65: Arrays a and b (the sender and receiver arrays) are one-dimensional arrays allocated dynamically against the GPU queue using malloc_shared and malloc_device, respectively.

By now, we already know the implication of using malloc_shared as opposed to malloc_device and malloc_host: the memory is shared between and accessible by both the host and the device, so there is no need for an explicit memory copy from the device to the host when computation on the device finishes.

However, memory allocated with malloc_device resides on the device and is accessible only by the device. To access this data on the host, an explicit memory copy into host-accessible memory must therefore be performed; hence the copy array, c, which is allocated using malloc_host.

        // dynamically allocate arrays
        char *a = sycl::malloc_shared<char>(N , queue_gpu);
        char *b = sycl::malloc_device<char>(N , queue_gpu);
        char *c = sycl::malloc_host<char>(N , queue_host);

Lines 70 - 92: This part of the program performs a null pointer check on the arrays allocated on the host and GPU devices, respectively. If an allocation succeeded, the allocated memory is set to zeroes on the GPU device / host device, since memory returned by malloc-style allocators is uninitialized (filled with garbage values). The .wait() method is a coarse-grained synchronization point, ensuring that the memset completes before the program proceeds.

                if(a != NULL){
                        queue_gpu.memset(a, 0, N).wait();
                }else{
                        std::cout << "Could not allocate memory!\n" << std::endl;
                        std::cout << "Array a is NULL! Exiting...\n" << std::endl;
                        exit(EXIT_FAILURE);
                }

                if(b != NULL){
                        queue_gpu.memset(b, 0, N).wait();
                }else{
                        std::cout << "Could not allocate memory!\n" << std::endl;
                        std::cout << "Array b is NULL! Exiting...\n" << std::endl;
                        exit(EXIT_FAILURE);
                }

                if(c != NULL){
                        queue_host.memset(c, 0, N).wait();
                }else{
                        std::cout << "Could not allocate memory!\n" << std::endl;
                        std::cout << "Array c is NULL! Exiting...\n" << std::endl;
                        exit(EXIT_FAILURE);
                }

Lines 95 - 109: A simple for loop that fills the sender array, a, with the characters of the string Hello World!. Note that the loop body does not use the loop index i, so the twelve assignments are simply repeated N times; a single pass would suffice.

                for(int i = 0; i < N; i++)
                {
                        a[0] = 'H';
                        a[1] = 'e';
                        a[2] = 'l';
                        a[3] = 'l';
                        a[4] = 'o';
                        a[5] = ' ';
                        a[6] = 'W';
                        a[7] = 'o';
                        a[8] = 'r';
                        a[9] = 'l';
                        a[10] = 'd';
                        a[11] = '!';
                }

8.2.4. Device Code Computation

Lines 112 - 120: This part of the program is where the code execution control is transferred to the device. Let's break things down -

               // define kernel to do array copy on selected device
                sycl::range<1> size{N};
                {
                        queue_gpu.submit([&] (sycl::handler &h) {
                                h.parallel_for(size, [=](sycl::id<1> idx) {
                                        int i = idx[0];
                                        b[i] = a[i];
                                        });
                                }).wait();
                }
  • DATA SIZE - The sycl::range<> in line 112 defines the size of the work to be done in the kernel code. The style in which it encapsulates the device code is not compulsory; the device code can alternatively be written as -
 queue_gpu.submit([&] (sycl::handler &h) {
         h.parallel_for(N, [=](sycl::id<1> idx) {
              int i = idx[0];
              b[i] = a[i];
              });
 }).wait();

as well as:

 queue_gpu.submit([&] (sycl::handler &h) {
         h.parallel_for(sycl::range(N), [=](sycl::id<1> idx) {
              int i = idx[0];
              b[i] = a[i];
              });
 }).wait();

and even :

 queue_gpu.submit([&] (sycl::handler &h) {
         h.parallel_for(sycl::range<1> {N}, [=](sycl::id<1> idx) {
              int i = idx[0];
              b[i] = a[i];
              });
 }).wait();

Depending on the programming style preferred.

  • CREATING A COMMAND GROUP HANDLER - Line 114: A command group encapsulates all actions (kernel launches, data dependencies, explicit memory operations: update_host, fill, copy) to be performed on the device. A command group also helps prevent data race conditions, as it lets the device know the correct order of code execution.

The command group in this code has exactly one action in it - a kernel launch, given in line 115.

     queue_device.submit([&] (sycl::handler &h) 
  • DPC++ DEVICE CODE - The device code starts just after the command group handler creation in line 114 and ends at line 118. The coarse-grained synchronization method .wait() in line 119 ensures that the actions in the device code are completed. The device code in this program has only one action - the kernel launch in line 115, which contains the kernel code in lines 116 - 117.

  • DPC++ KERNEL - Line 115 is the start of a DPC++ kernel (the kernel launch); the kernel is expressed as a basic data-parallel kernel and written as a lambda expression. Work (data) is distributed using the parallel_for method.

  h.parallel_for(size, [=](sycl::id<1> idx) {
  • KERNEL CODE - Lines 116 - 118 represent the kernel code (device code). This code does a simple computation - it copies the content of the sender array, a, into the receiver array, b.
 int i = idx[0];
 b[i] = a[i];

8.2.5. Accessing Device Data On Host Device

Lines 124 - 131: The receiver array, b, was allocated with malloc_device; this type of memory is accessible only by the device. Therefore, the data stored in b must be copied into host-accessible memory using the explicit memory operation memcpy, followed by appropriate synchronization. Only after the memcpy and the synchronization is the data available on the host.

    queue_gpu.memcpy(c, b, N * sizeof(char)).wait();

    for(int i = 0; i < N; i++)
    {
         std::cout << c[i];
    }

8.2.6. Freeing Allocated Memory

Lines 134 - 136 free the memory allocated to the sender array, a, the receiver array, b, and the copy array, c.

    // free allocated memory
    sycl::free(a, queue_gpu);
    sycl::free(b, queue_gpu);
    sycl::free(c, queue_host);

8.3. Expected Result

HOST DEVICE = Intel(R) Xeon(R) E-2146G CPU @ 3.50GHz

GPU DEVICE = Intel(R) UHD Graphics P630 [0x3e96]

Hello World!