Source Code Explanation 2


6.0. src/hello_world_usm2.cpp

NOTE: While reading this page, it is advisable to keep the source code open in another window.

Unlike the first program, which illustrates homogeneous programming, this second program illustrates heterogeneous programming and uses the unified shared memory (USM) style of data and memory management. It illustrates the three modes of USM memory allocation: malloc_shared, malloc_device and malloc_host.
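
As a quick reference, the three allocation functions look as follows. This is a minimal sketch, assuming a queue named q and an element count N; the variable names are illustrative and not taken from the source file.

    #include <CL/sycl.hpp>

    int main() {
            constexpr size_t N = 12;
            sycl::queue q;  // any device

            // shared: accessible from both host and device
            char *s = sycl::malloc_shared<char>(N, q);
            // device: accessible only from the device; an explicit copy is needed to reach the host
            char *d = sycl::malloc_device<char>(N, q);
            // host: host memory that the device can also read and write
            char *h = sycl::malloc_host<char>(N, q);

            sycl::free(s, q);
            sycl::free(d, q);
            sycl::free(h, q);
            return 0;
    }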

The program features three dynamically allocated arrays, a, b and c: the sender, receiver and copy arrays respectively.

On a selected device, the contents of the sender array, a, are copied into the receiver array, b. Finally, the contents of the receiver array, b, are copied into the copy array, c, and the contents of c are printed on the host.

6.1. Preprocessor directives

  • THE SYCL HEADER FILE
#include <CL/sycl.hpp>

Line 20: The inclusion of the SYCL header file, which provides every SYCL construct used in the program. One might wonder why the standard C++ input/output header is not included explicitly (since the contents of the receiver array will be printed). This is because the CL/sycl.hpp header includes other C++ header files (remember, DPC++ is based on C++); so, by including it, we also get all of the additional C++ headers it pulls in.
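
If you prefer not to rely on these transitive includes, nothing prevents you from listing the standard header explicitly; this is purely a matter of style.

    #include <CL/sycl.hpp>
    #include <iostream>   // optional: already pulled in by CL/sycl.hpp, but explicit is also fine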

6.2. The main function

Lines 32 - 144: The collection of statements that make up the main function.

6.2.1. Error Handling

Lines 33 - 141: This is the try{}catch{} block for catching both synchronous and asynchronous errors. The synchronous error handler spans lines 33 - 141 and catches host-program-related errors, while the asynchronous error handler is a user-defined object, given in lines 39 - 48, which catches device-related errors.

  • Synchronous error handler
try{
     ...

}catch(sycl::exception const &e) {
        std::cout << "Unexpected exception caught during synchronous operation:\n" << e.what() << std::endl;
        std::terminate();
}
  • Asynchronous error handler
 auto async_error_handler = [&] (sycl::exception_list exceptions) {
         for (auto const& e : exceptions) {
                 try {
                         std::rethrow_exception(e);
                 } catch(sycl::exception const& e) {
                         std::cout << "Unexpected exception caught during asynchronous operation:\n" << e.what() << std::endl;
                         std::terminate();
                 }
         }
 };
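
The handler does nothing by itself until it is attached to a queue; exceptions raised asynchronously by the runtime are then delivered to it when the host synchronizes. A minimal sketch of that hand-off (the queue name q is illustrative):

    sycl::queue q{sycl::gpu_selector(), async_error_handler};

    q.submit([&] (sycl::handler &h) {
            h.single_task([=]() { /* device work that may raise an asynchronous error */ });
    });

    // Any asynchronous errors collected so far are passed to
    // async_error_handler at this synchronization point.
    q.wait_and_throw();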

6.2.2. Device Selection and Information

Lines 51 - 52: Two queues are created to select the host and GPU devices respectively. Line 51 selects the host device through queue_host, whereas line 52 selects the GPU device through queue_device, the queue the rest of the program submits its work to. Notice also that the asynchronous error handler object is passed to each queue.

        // select device
        sycl::queue queue_host{sycl::host_selector(), async_error_handler};
        sycl::queue queue_device{sycl::gpu_selector(), async_error_handler};

Lines 55 - 60: Output information about the host and GPU devices respectively.

          std::cout << "HOST DEVICE = "
                    << queue_device.get_device().get_info<sycl::info::device::name>()
                    << '\n' << std::endl;
          std::cout << "GPU DEVICE = "
                    << queue_device.get_device().get_info<sycl::info::device::name>()
                    << '\n' << std::endl;
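
The name descriptor is only one of many queryable properties; other descriptors follow the same pattern. A short sketch (the two descriptors below are examples chosen for illustration, not queries made by the source file):

          std::cout << "GPU VENDOR = "
                    << queue_device.get_device().get_info<sycl::info::device::vendor>()
                    << std::endl;
          std::cout << "GPU MAX COMPUTE UNITS = "
                    << queue_device.get_device().get_info<sycl::info::device::max_compute_units>()
                    << std::endl;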

6.2.3. Setting Up Host Storage

Recall: The host code is the driver code for the device code.

Line 36: Defines N, the size of the dynamically allocated arrays a, b and c.

Lines 63 - 65: The sender and receiver arrays, a and b, are one-dimensional arrays allocated dynamically against the GPU queue; in the snippet below both are allocated with malloc_shared. By now, we already know the implication of using malloc_shared as opposed to malloc_device and malloc_host: the allocated memory is shared and accessible by both the device and the host, so there is no need for a memory copy from the device to the host when computation on the device is finished. However, for memory allocated on the device using malloc_device, the host cannot dereference the pointer directly; the data must first be copied into host-accessible memory before the host can read it.

        // dynamically allocate arrays
        char *a = sycl::malloc_shared<char>(N, queue_device);
        char *b = sycl::malloc_shared<char>(N, queue_device);

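To make the contrast with malloc_device concrete, the following self-contained sketch (not a continuation of the snippet above) shows how data held in device-only memory would reach the host: it is copied into a host-accessible array, which is exactly the role the introduction assigns to the copy array c. The allocation choices here are assumptions for illustration, not a quote from the source file.

        // receiver on the device only, copy array in host-accessible memory
        char *recv = sycl::malloc_device<char>(N, queue_device);
        char *copy = sycl::malloc_host<char>(N, queue_device);

        // ... a kernel fills recv on the device ...

        // the host cannot dereference recv directly; copy it out first
        queue_device.memcpy(copy, recv, N * sizeof(char)).wait();
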
Lines 66 - 80: This part of the program checks, on the host, whether each allocation succeeded. If the returned pointer is not NULL, the allocated memory is set to zeroes on the device, since memory returned by a malloc-style allocation is filled with garbage values. The .wait() method is a coarse-grained synchronization mechanism, ensuring that .memset() has completed before execution continues.

                if(a != NULL){
                        queue_device.memset(a, 0, N).wait();
                }else{
                        std::cout << "Could not allocate memory!\n" << std::endl;
                        std::cout << "Array a is NULL! Exiting...\n" << std::endl;
                        exit(EXIT_FAILURE);
                }

                if(b != NULL){
                        queue_device.memset(b, 0, N).wait();
                }else{
                        std::cout << "Could not allocate memory!\n" << std::endl;
                        std::cout << "Array b is NULL! Exiting...\n" << std::endl;
                        exit(EXIT_FAILURE);
                }
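
As an aside, memset works byte by byte; SYCL queues also provide a fill() member that writes a typed pattern. For char data the two are equivalent, so the check above could just as well be written like this (a small sketch):

                if(a != NULL){
                        queue_device.fill(a, '\0', N).wait();   // same effect as memset(a, 0, N) for char data
                }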

Lines 83 - 97: This part of the program is a simple for loop that fills the sender array, a, with the characters of the string Hello World!.

                for(int i = 0; i < N; i++)
                {
                        a[0] = 'H';
                        a[1] = 'e';
                        a[2] = 'l';
                        a[3] = 'l';
                        a[4] = 'o';
                        a[5] = ' ';
                        a[6] = 'W';
                        a[7] = 'o';
                        a[8] = 'r';
                        a[9] = 'l';
                        a[10] = 'd';
                        a[11] = '!';
                }
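
Note that the loop body never uses the index i, so the twelve assignments are simply repeated N times. A single pass that copies from a string literal does the same job; a small alternative sketch (assuming N is 12, as the assignments above imply):

                const char msg[] = "Hello World!";
                for(int i = 0; i < N; i++)
                {
                        a[i] = msg[i];
                }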

6.2.4. Device Code Computation

Lines 100 - 108: This part of the program is the device code. Let's break it down:

               // define kernel to do array copy on selected device
                sycl::range<1> size{N};
                {
                        queue_device.submit([&] (sycl::handler &h) {
                                h.parallel_for(size, [=](sycl::id<1> idx) {
                                        int i = idx[0];
                                        b[i] = a[i];
                                        });
                                }).wait();
                }
  • Data size - The sycl::range<> in line 100 defines the amount of work to be done in the kernel code. The style of declaring it separately and using it to encapsulate the device code is not compulsory; the device code can just as well be written as -
 queue_device.submit([&] (sycl::handler &h) {
         h.parallel_for(N, [=](sycl::id<1> idx) {
              int i = idx[0];
              b[i] = a[i];
              });
 }).wait();

as well as:

 queue_device.submit([&] (sycl::handler &h) {
         h.parallel_for(sycl::range(N), [=](sycl::id<1> idx) {
              int i = idx[0];
              b[i] = a[i];
              });
 }).wait();

and even :

 queue_device.submit([&] (sycl::handler &h) {
         h.parallel_for(sycl::range<1> {N}, [=](sycl::id<1> idx) {
              int i = idx[0];
              b[i] = a[i];
              });
 }).wait();

The choice between these forms is purely a matter of programming style.

  • Creating a command group handler - Encapsulates all actions (function calls, data dependencies) to be performed by the device. A command group is useful for preventing data races, as it helps the runtime determine the correct order of code execution.
     queue_device.submit([&] (sycl::handler &h) 
  • DPC++ Kernel - Line 103 is the start of a DPC++ kernel; the kernel is expressed as a basic data-parallel kernel and written as a lambda expression. Work (data) is distributed using the parallel_for method.
  h.parallel_for(size, [=](sycl::id<1> idx) {
  • Kernel Code - Lines 104 - 105 represent the kernel code (device code). This code does a simple computation: copy the contents of the sender array, a, to the receiving array, b.
 int i = idx[0];
 b[i] = a[i];
  • End of DPC++ Device Code - Line 105 is the last statement of the kernel code; line 106 closes the kernel and, with it, the device code. The coarse-grained synchronization method .wait() ensures that the computation in the DPC++ kernel has completed. (A kernel-free alternative is sketched just after this list.)
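
Since the kernel above does nothing but copy one array into another, the same result can be obtained without writing a kernel at all, using the queue's built-in copy operation. A minimal sketch (the explicit kernel is, of course, the point of the exercise here):

                queue_device.memcpy(b, a, N * sizeof(char)).wait();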

6.2.5. Accessing Device Code On Host Device

The receiving array was allocated with malloc_shared; this type of memory is shared and accessible by both the device and the host. Therefore, the data stored in the receiving array can be accessed directly from the host, as shown in lines 112 - 115:

    for(int i = 0; i < N; i++)
    {
         std::cout << b[i];
    }

6.2.6. Freeing Allocated Memory

Lines 118 - 120 free the memory allocated to the sender array, a, and the receiving array, b.

    // free allocated memory
    sycl::free(a, queue_device);
    sycl::free(b, queue_device);
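
sycl::free also has an overload that takes a context instead of a queue; either form works, as long as it matches the context the memory was allocated against. A small sketch of the context form:

    sycl::free(a, queue_device.get_context());
    sycl::free(b, queue_device.get_context());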

6.3. Expected Result