Features/HPC-tutorial via python script #1527

Open · wants to merge 8 commits into base: main
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -21,7 +21,7 @@ repos:
rev: 6.3.0 # pick a git hash / tag to point to
hooks:
- id: pydocstyle
exclude: "tests|benchmarks|examples|scripts|setup.py" #|heat/utils/data/mnist.py|heat/utils/data/_utils.py ?
exclude: "tutorials|tests|benchmarks|examples|scripts|setup.py" #|heat/utils/data/mnist.py|heat/utils/data/_utils.py ?
- repo: "https://github.com/citation-file-format/cffconvert"
rev: "054bda51dbe278b3e86f27c890e3f3ac877d616c"
hooks:
36 changes: 36 additions & 0 deletions tutorials/hpc/01_basics/01_basics_dndarrays.py
@@ -0,0 +1,36 @@
import heat as ht

# ### DNDarrays
#
#
# Similar to a NumPy `ndarray`, a Heat `dndarray` (we'll get to the `d` later) is a grid of values, all of a single type. The number of dimensions is the number of axes of the array, while the shape of an array is a tuple of integers giving the number of elements of the array along each dimension.
#
# Heat emulates NumPy's API as closely as possible, allowing for the use of well-known **array creation functions**.


a = ht.array([1, 2, 3])
print("array creation with values [1,2,3] with the heat array method:")
print(a)

a = ht.ones(
(
4,
5,
)
)
print("array creation of shape = (4, 5) example with the heat ones method:")
print(a)

a = ht.arange(10)
print("array creation with [0,1,...,9] example with the heat arange method:")
print(a)

a = ht.full(
(
3,
2,
),
fill_value=9,
)
print("array creation with ones and shape = (3, 2) with the heat full method:")
print(a)
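
# Like NumPy, a DNDarray exposes its metadata through attributes. A quick check of the
# properties described above (shape, ndim, and dtype behave as in NumPy):
print(f"shape of a: {a.shape}")
print(f"number of dimensions of a: {a.ndim}")
print(f"data type of a: {a.dtype}")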
27 changes: 27 additions & 0 deletions tutorials/hpc/01_basics/02_basics_datatypes.py
@@ -0,0 +1,27 @@
import heat as ht
import numpy as np
import torch

# ### Data Types
#
# Heat supports various data types and operations to retrieve and manipulate the type of a Heat array. However, in contrast to NumPy, Heat is limited to logical (bool) and numerical types (uint8, int16/32/64, float32/64, and complex64/128).
#
# **NOTE:** by default, Heat allocates floating-point values in single precision, because of the much higher processing performance of single precision on GPUs. This is one of the main differences between Heat and NumPy.

a = ht.zeros(
(
3,
4,
)
)
print(f"floating-point values in single precision is default: {a.dtype}")

b = torch.zeros(3, 4)
print(f"like in PyTorch: {b.dtype}")


b = np.zeros((3, 4))
print(f"whereas floating-point values in double precision is default in numpy: {b.dtype}")

b = a.astype(ht.int64)
print(f"casting to int64: {b}")
41 changes: 41 additions & 0 deletions tutorials/hpc/01_basics/03_basics_operations.py
@@ -0,0 +1,41 @@
import heat as ht

# ### Operations
#
# Heat supports many mathematical operations, ranging from simple element-wise functions and binary arithmetic operations to linear algebra and more powerful reductions. By default, operations are performed on the entire array; where available, they can also be performed along one or more of its dimensions. Most relevant for data-intensive applications is that **all Heat functionalities support memory-distributed computation and GPU acceleration**. This holds for all operations, including reductions, statistics, linear algebra, and high-level algorithms.
#
# You can try out the few simple examples below if you want, but we will skip to the [Parallel Processing](#Parallel-Processing) section to see memory-distributed operations in action.

a = ht.full(
(
3,
4,
),
8,
)
b = ht.ones(
(
3,
4,
)
)
c = a + b
print("matrix addition a + b:")
print(c)


c = ht.sub(a, b)
print("matrix substraction a - b:")
print(c)

c = ht.arange(5).sin()
print("application of sin() elementwise:")
print(c)

c = a.T
print("transpose operation:")
print(c)

c = b.sum(axis=1)
print("summation of array elements:")
print(c)
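
# Without an axis argument, reductions apply to the entire array, as noted above
# (a brief sketch of the default behaviour):
c = b.sum()
print("summation over all array elements (no axis given):")
print(c)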
13 changes: 13 additions & 0 deletions tutorials/hpc/01_basics/04_basics_indexing.py
@@ -0,0 +1,13 @@
import heat as ht

# ## Indexing

# Heat allows the indexing of arrays and thereby the extraction of a partial view of the elements in an array. It is possible to obtain single values as well as entire chunks, i.e. slices.

a = ht.arange(10)

print(a[3])
print(a[1:7])
print(a[::2])
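
# Slicing works on multi-dimensional arrays as well (a short sketch; the semantics
# match the 1-D examples above):
b = ht.arange(12).reshape((3, 4))
print(b[1, :])  # second row
print(b[:, 2])  # third column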

# **NOTE:** Indexing in Heat is undergoing a [major overhaul](https://github.com/helmholtz-analytics/heat/pull/938), to increase interoperability with NumPy/PyTorch indexing, and to provide a fully distributed item setting functionality. Stay tuned for this feature in the next release.
19 changes: 19 additions & 0 deletions tutorials/hpc/01_basics/05_basics_broadcast.py
@@ -0,0 +1,19 @@
import heat as ht

# ---
# Heat implements the same broadcasting rules (implicit repetition of an operation when the ranks/shapes of the operands do not match) as NumPy, e.g.:

a = ht.arange(10) + 3
print(f"broadcast example of adding single value 3 to [0, 1, ..., 9]: {a}")


a = ht.ones(
(
3,
4,
)
)
b = ht.arange(4)
print(
f"broadcasting across the first dimension of {a} with shape = (3, 4) and {b} with shape = (4): {a + b}"
)
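
# Broadcasting also applies across several dimensions, e.g. a column vector plus a
# row vector yields a matrix (a small sketch of the rules described above):
col = ht.arange(3).reshape((3, 1))
row = ht.arange(4)
print(f"shape (3, 1) + shape (4,) broadcasts to shape (3, 4): {col + row}")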
79 changes: 79 additions & 0 deletions tutorials/hpc/01_basics/06_basics_gpu.py
@@ -0,0 +1,79 @@
import heat as ht
import torch

# ## Parallel Processing
# ---
#
# Heat's true power lies in its ability to exploit the processing performance of modern accelerator hardware (GPUs) as well as distributed (high-performance) cluster systems. All operations executed on CPUs are, to a large extent, vectorized (AVX) and thread-parallelized (OpenMP). Heat builds on PyTorch, so it supports GPU acceleration on NVIDIA and AMD GPUs.
#
# For distributed computations, your system or laptop needs to have the Message Passing Interface (MPI) installed. For GPU computations, your system needs to have one or more suitable GPUs and an (MPI-aware) CUDA/ROCm ecosystem.
#
# **NOTE:** The GPU examples below will only properly execute on a computer with a GPU. Make sure to run this script on an appropriate machine.

# ### GPUs
#
# Heat's array creation functions all support an additional `device` parameter that places the data on a specific device. By default, the CPU is selected, but it is also possible to directly allocate the data on a GPU.


if torch.cuda.is_available():
ht.zeros(
(
3,
4,
),
device="gpu",
)
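
# A quick check of where the data lives (a small sketch; the `device` attribute
# reports the array's device, as used later in this tutorial):
if torch.cuda.is_available():
    g = ht.zeros((3, 4), device="gpu")
    print(f"allocated on: {g.device}")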

# Arrays on the same device can be seamlessly used in any Heat operation.

if torch.cuda.is_available():
a = ht.zeros(
(
3,
4,
),
device="gpu",
)
b = ht.ones(
(
3,
4,
),
device="gpu",
)
print(a + b)


# However, performing operations on arrays with mismatching devices will purposefully result in an error (due to potentially large copy overhead). We catch the error below so that the script can continue.

if torch.cuda.is_available():
    a = ht.full(
        (
            3,
            4,
        ),
        4,
        device="cpu",
    )
    b = ht.ones(
        (
            3,
            4,
        ),
        device="gpu",
    )
    try:
        print(a + b)
    except Exception as e:
        print(f"adding arrays on mismatched devices raises an error: {e}")

# It is possible to explicitly move an array from one device to the other in order to avoid this error.

if torch.cuda.is_available():
    a = ht.full(
        (
            3,
            4,
        ),
        4,
        device="gpu",
    )
    a = a.cpu()  # move a from the GPU back to the CPU
    b = b.cpu()  # move b (allocated on the GPU above) to the CPU as well
    print(a + b)
70 changes: 70 additions & 0 deletions tutorials/hpc/01_basics/07_basics_distributed.py
@@ -0,0 +1,70 @@
import heat as ht

# ### Distributed Computing
#
# Heat is also able to make use of distributed processing capabilities such as those in high-performance cluster systems. For this, Heat exploits the fact that the operations performed on a multi-dimensional array are usually identical for all data items. Hence, a data-parallel processing strategy can be chosen, where the total number of data items is equally divided among all processing nodes. An operation is then performed individually on the local data chunks and, if necessary, partial results are communicated behind the scenes. A Heat array assumes the role of a virtual overlay over the local chunks and realizes and coordinates the computations - see the figure below for a visual representation of this concept.
#
# <img src="https://github.com/helmholtz-analytics/heat/blob/main/doc/images/split_array.png?raw=true" width="100%"></img>
#
# The chunks are always split along a single dimension (i.e. 1-D domain decomposition) of the array. You can specify this in Heat by using the `split` parameter. This parameter is present in all relevant functions, such as array creation (`zeros(), ones(), ...`) or I/O (`load()`) functions.
#
#
#
#
# Examples are provided below. The result of an operation on a Heat tensor will in most cases preserve the split of the respective operands. However, in some cases the split axis might change. For example, a transpose of a Heat array will equally transpose the split axis. Furthermore, a reduction operation, e.g. `sum()`, that is performed across the split axis might remove the data partitions entirely. The respective function behaviors can be found in Heat's documentation.
#
# You may also modify the data partitioning of a Heat array by using the `resplit()` function. This allows you to repartition the data as you choose. Please note that this should be used sparingly and for small amounts of data only, as it entails significant data copying across the network. Finally, a Heat array without any split, i.e. `split=None` (default), will result in redundant copies of the data on each computation node.
#
# On a technical level, Heat follows the so-called [Bulk Synchronous Parallel (BSP)](https://en.wikipedia.org/wiki/Bulk_synchronous_parallel) processing model. For the network communication, Heat utilizes the [Message Passing Interface (MPI)](https://computing.llnl.gov/tutorials/mpi/), a *de facto* standard on modern high-performance computing systems. It is also possible to use MPI on your laptop or desktop computer. Respective software packages are available for all major operating systems. In order to run a Heat script, you need to start it slightly differently than you are probably used to. This
#
# ```bash
# python ./my_script.py
# ```
#
# becomes this instead:
#
# ```bash
# mpirun -n <number_of_processors> python ./my_script.py
# ```
# On an HPC cluster you will, of course, submit the job through the batch system instead, e.g. via SLURM's `sbatch`.
#
#
# Let's see some examples of working with distributed Heat:

# In the following examples, we'll recreate the array shown in the figure, a 3-dimensional DNDarray of integers ranging from 0 to 59 (5 matrices of size (4,3)).


dndarray = ht.arange(60).reshape(5, 4, 3)
if dndarray.comm.rank == 0:
print("3-dimensional DNDarray of integers ranging from 0 to 59:")
print(dndarray)


# Notice the additional metadata printed with the DNDarray. Compared to a NumPy `ndarray`, the DNDarray carries additional information on the device (in this case, the CPU) and the `split` axis. In the example above, the split axis is `None`, meaning that the DNDarray is not distributed and each MPI process has a full copy of the data.
#
# Let's experiment with a distributed DNDarray: we'll create the same DNDarray as above, but distribute it along the major axis.


dndarray = ht.arange(60, split=0).reshape(5, 4, 3)
if dndarray.comm.rank == 0:
print("3-dimensional DNDarray of integers ranging from 0 to 59:")
print(dndarray)


# The `split` axis is now 0, meaning that the DNDarray is distributed along the first axis. Each MPI process has a slice of the data along the first axis. In order to see the data on each process, we can print the "local array" via the `larray` attribute.

print(f"data on process no {dndarray.comm.rank}: {dndarray.larray}")


# Note that the `larray` is a `torch.Tensor` object. This is the underlying tensor that holds the data. The `dndarray` object is an MPI-aware wrapper around these process-local tensors, providing memory-distributed functionality and information.

# The DNDarray can be distributed along any axis. Modify the `split` argument when creating the DNDarray in the example above to distribute it along a different axis, and see how the `larray`s change. You'll notice that the distributed arrays are always load-balanced, meaning that the data are distributed as evenly as possible across the MPI processes.

# The `DNDarray` object has a number of methods and attributes that are useful for distributed computing. In particular, it keeps track of its global and local (i.e. per-process) shape through distributed operations and array manipulations. The DNDarray is also associated with a `comm` object, the MPI communicator.
#
# (In MPI, a *communicator* is a group of processes that can communicate with each other. Here, `comm` wraps the default `MPI.COMM_WORLD` communicator, which includes all processes. It is used to perform collective operations, such as reductions, scatter, gather, and broadcast, as well as point-to-point communication between processes.)


print(f"Global shape on rank {dndarray.comm.rank}: {dndarray.shape}")
print(f"Local shape on rank: {dndarray.comm.rank}: {dndarray.lshape}")
print(f"Local device on rank: {dndarray.comm.rank}: {dndarray.device}")
24 changes: 24 additions & 0 deletions tutorials/hpc/01_basics/08_basics_distributed_operations.py
@@ -0,0 +1,24 @@
import heat as ht

dndarray = ht.arange(60, split=0).reshape(5, 4, 3)

# You can perform a vast number of operations on DNDarrays distributed over multi-node and/or multi-GPU resources. Check out our [Numpy coverage tables](https://github.com/helmholtz-analytics/heat/blob/main/coverage_tables.md) to see what operations are already supported.
#
# The result of an operation on DNDarrays will in most cases preserve the `split` or distribution axis of the respective operands. However, in some cases the split axis might change. For example, a transpose of a Heat array will equally transpose the split axis. Furthermore, a reduction operation, e.g. `sum()`, that is performed across the split axis might remove the data partitions entirely. The respective function behaviors can be found in Heat's documentation.


# transpose
print(dndarray.T)


# reduction operation along the distribution axis
print(dndarray.sum(axis=0))

# min / max etc.
print(ht.sin(dndarray).min(axis=0))


other_dndarray = ht.arange(60, 120, split=0).reshape(5, 4, 3) # distributed reshape

# element-wise multiplication
print(dndarray * other_dndarray)
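
# A quick look at how the split axis behaves under these operations (a small sketch;
# the `split` attribute reports the distribution axis, `None` means no distribution):
print(f"split of dndarray: {dndarray.split}")
print(f"split after transpose: {dndarray.T.split}")
print(f"split after sum over the split axis: {dndarray.sum(axis=0).split}")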
55 changes: 55 additions & 0 deletions tutorials/hpc/01_basics/09_basics_distributed_matmul.py
@@ -0,0 +1,55 @@
# As we saw earlier, because the underlying data objects are PyTorch tensors, we can easily create DNDarrays on GPUs or move DNDarrays to GPUs. This allows us to perform distributed array operations on multi-GPU systems.
#
# So far we have demonstrated small, easy-to-parallelize arithmetic operations. Let's move on to linear algebra. Heat's `linalg` module supports a wide range of linear algebra operations, including matrix multiplication. Matrix multiplication is a very common operation in data analysis; it is computationally intensive and not trivial to parallelize.
#
# With Heat, you can perform matrix multiplication on distributed DNDarrays, and the operation will be parallelized across the MPI processes. Here, for example, on 4 GPUs (the code below falls back to the CPU if no GPU is available):

import heat as ht
import torch

if torch.cuda.is_available():
device = "gpu"
else:
device = "cpu"

n, m = 400, 400
x = ht.random.randn(n, m, split=0, device=device) # distributed RNG
y = ht.random.randn(m, n, split=None, device=device)
z = x @ y
print(z)
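
# The result z is itself a DNDarray; a quick look at its metadata (a small sketch;
# shape, split, and device are the attributes introduced earlier):
print(f"rank {z.comm.rank}: z.shape={z.shape}, z.split={z.split}, z.device={z.device}")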

# `ht.linalg.matmul` or `@` breaks the matrix multiplication down into a series of smaller `torch` matrix multiplications, which are then distributed across the MPI processes. This operation can be very communication-intensive for huge matrices that both require distribution, so users should choose the `split` axis carefully to minimize communication overhead.

# You can experiment with sizes and the `split` parameter (distribution axis) for both matrices and time the result. Note that:
# - If you set **`split=None` for both matrices**, each process (in this case, each GPU) will attempt to multiply the entire matrices. Depending on the matrix sizes, the GPU memory might be insufficient. (And if you can multiply the matrices on a single GPU, it's much more efficient to stick to PyTorch's `torch.linalg.matmul` function.)
# - If **`split` is not None for both matrices**, each process will only hold a slice of the data, and will need to communicate data with other processes in order to perform the multiplication. This **introduces huge communication overhead**, but allows you to perform the multiplication on larger matrices than would fit in the memory of a single GPU.
# - If **`split` is None for one matrix and not None for the other**, the multiplication does not require communication, and the result will be distributed. If your data size allows it, you should always favor this option.
#
# Time the multiplication for different split parameters and see how the performance changes.
#
#


import time

start = time.time()
z = x @ y
end = time.time()
print("runtime: ", end - start)


# Heat supports many linear algebra operations:
# ```python
# >>> ht.linalg.
# ht.linalg.basics ht.linalg.hsvd_rtol( ht.linalg.projection( ht.linalg.triu(
# ht.linalg.cg( ht.linalg.inv( ht.linalg.qr( ht.linalg.vdot(
# ht.linalg.cross( ht.linalg.lanczos( ht.linalg.solver ht.linalg.vecdot(
# ht.linalg.det( ht.linalg.matmul( ht.linalg.svdtools ht.linalg.vector_norm(
# ht.linalg.dot( ht.linalg.matrix_norm( ht.linalg.trace(
# ht.linalg.hsvd( ht.linalg.norm( ht.linalg.transpose(
# ht.linalg.hsvd_rank( ht.linalg.outer( ht.linalg.tril(
# ```
#
# and a lot more is in the works, including distributed eigendecompositions, SVD, and more. If the operation you need is not yet supported, leave us a note [here](tinyurl.com/demoissues) and we'll get back to you.

# You can of course perform all operations on CPUs; simply leave out the `device` argument entirely.