-
Notifications
You must be signed in to change notification settings - Fork 6
/
README
56 lines (42 loc) · 2.64 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
This directory contain a CUDA Fortran port of the serial Cloverleaf benchmark for a single
GPU.
Managed Memory
--------------
This code utilizes the Managed Memory feature introduced in the CUDA 6.0 Toolkit
and in the PGI 14.7 compilers. Managed Memory is a pool of memory accessible
to both CPU and GPU using a single variable, which eliminates the need for separate host
and device versions of data along with coding explicit data movement between host and
device. In CUDA Fortran, allocation into this pool of memory is achieved using the
"managed" variable attribute in declarations (see definitions.cuf). Managed Memory
requires use of the 6.0 CUDA libraries and is available on devices with compute capability
of 3.0 or higher.
One caveat in using Managed Memory is that on multi-GPU systems where any two GPUs are NOT
peer-to-peer capable, the system falls back to using zero-copy memory which will typically
result in a large performance degradation. Since this code utilizes a single GPU, the
zero-copy fallback can be avoided by setting the environment variable CUDA_VISIBLE_DEVICES
to the appropriate GPU, or by setting the environment variable CUDA_MANAGED_FORCE_DEVICE_ALLOC
to a non-zero value. (To determine the whether all pairs of GPUs on your system are
peer-to-peer capable, compile and run the example code p2pAccess found in:
/opt/pgi/*/2015/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter4/P2P/p2pAccess/)
Texture Cache
-------------
On devices of compute capability 3.5 and higher, kernel arguments declared as
"intent(in)" will be routed through the read-only texture cache via the LDG
instruction. As a result of this feature, there are neither use of explicit
textures nor shared memory in any kernels. On devices of compute capability 3.0,
the code will run slower as the texture cache is not used.
Test Cases
----------
Four different input files are included with the code are those recommended to test for
correctness. Cloverleaf expects the input file to be named clover.in, so copy one of
these files to clover.in before running a test.
The kinetic energy, as displayed in the output file clover.out every 10 timesteps and
at the end of the run, is usually the most sensitive state variable and can be used as
a correctness test. The reference values for the kinetic energy for the last time step
of these runs, along with approximate wall-clock times for a system with a K20 are listed
below:
Input Kinetic Energy Wall-clock time (s)
clover_bm_short.in 0.1193E+01 1.13
clover_bm.in 0.2590E+01 39
clover_bm16_short.in 0.3075E+00 17
clover_bm16.in 0.4854E+01 580