---
layout: page
title: GPU Computing
permalink: /gpu-computing/
nav_order: 10
---
# GPU Computing
## 1. Logging onto the cluster
To use the GPU machines on the cluster, you first need to log in to the cluster. For a good introduction to logging in, see `https://www.alessandravalcarcel.com/blog/2019-04-23-ssh/`
`lpcgpu01`, the host for the GPU machines, is accessible via the `takim` server. Log in to `takim` by typing:
```sh
ssh -X <pennkey>@takim.pmacs.upenn.edu
```
***Don't read this if you are not familiar with GPU computing.***
For advanced users looking for additional GPU power, we have six additional GPUs available exclusively to us. `takim2` is a submit host (not an executable host) that can be used for GPU computing. You can access it by typing:
```sh
ssh -X <pennkey>@takim2.pmacs.upenn.edu
```
or
```sh
ssh -X <pennkey>@takim2
```
Note that `takim2` is a submit host, not an executable host. You can use it directly as an interactive session, but you cannot submit a normal job to it.
## 2. Interactive Session Basics
If you intend to use an interactive session, consider using `screen` so that you don't lose your work. Open a screen session with
```sh
screen -S <Screen-name>
```
and start your work. For details on how to use screen, see `https://www.alessandravalcarcel.com/blog/2019-06-12-interactivesession1/`
Once you are on the executable host, open an interactive session by typing:
```sh
bsub -Is -q lpcgpu -gpu "num=1" -n 1 "bash"
```
Make sure you request a GPU. `-gpu "num=1"` sets the number of GPUs requested, whereas `-n 1` sets the number of CPU cores.
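As a rough illustration of how the two resource flags fit together, the following Python sketch assembles the same `bsub` command from a GPU count and a CPU count. The helper name `interactive_bsub_command` and its parameters are hypothetical and not part of LSF; the queue name and flags are taken from the command above.

```python
import shlex


def interactive_bsub_command(num_gpus=1, num_cpus=1, queue="lpcgpu"):
    """Build the interactive-session command string (hypothetical helper)."""
    args = [
        "bsub",
        "-Is",                      # interactive session with a pseudo-terminal
        "-q", queue,                # queue name used in this guide
        "-gpu", f"num={num_gpus}",  # number of GPUs requested
        "-n", str(num_cpus),        # number of CPU cores requested
        "bash",                     # command to run inside the session
    ]
    # shlex.join quotes an argument only when the shell would need it
    return shlex.join(args)


if __name__ == "__main__":
    print(interactive_bsub_command())
```

With the defaults, this prints the command shown above without the quotation marks, which the shell does not need for these particular arguments.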
Next, load the `torch` and `tensorflow` modules to activate CUDA:
```sh
module load torch
module load tensorflow/2.3-GPU
```
To check whether CUDA is working, start Python:
```sh
python
```
In Python, run this:
```py
import torch
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)
```
The responses should look like this (the object address and the GPU name may differ on your session):
```py
torch.cuda.is_available() : True
torch.cuda.device_count() : 1
torch.cuda.current_device() : 0
torch.cuda.device(0) : <torch.cuda.device object at 0x2ae68db7cb20>
torch.cuda.get_device_name(0) : 'NVIDIA GeForce RTX 2080 Ti'
```
Now, you are ready to use the GPU!
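The same checks can also be collected into a small script, which may be handy at the start of a longer job. This is only a sketch: the function name `cuda_report` is made up for illustration, and it relies only on the `torch.cuda` calls shown above. It reports that `torch` is missing instead of failing if the module has not been loaded.

```python
def cuda_report():
    """Return a dict describing what torch can see (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # torch is not importable, e.g. the module was not loaded
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        # Only query the device when CUDA is actually usable
        index = torch.cuda.current_device()
        report["current_device"] = index
        report["device_name"] = torch.cuda.get_device_name(index)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key} : {value}")
```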
## 3. Normal Job Sessions
Once you are on the executable host, you can submit a normal job, usually with a bash file:
```sh
bsub -q lpcgpu -gpu "num=1" -n 1 -J "orig[1-3]" -o <where to save your log file> <location of your bash file>
```
For example, I save my log file at `/home/ecbae/nnUNet.txt` and my bash file at `/home/ecbae/orig.sh`. Note that my job index is `[1-3]`, which creates a job array of three jobs and therefore requests 3 GPUs and 3 CPU cores in total.
My bash file looks like this:
```sh
# Load the modules that provide CUDA support
module load torch
module load tensorflow/2.3-GPU
# Map the job array index (1, 2, 3) to an nnUNet task ID (150, 151, 152)
TASK_ID=$(( LSB_JOBINDEX + 149 ))
# Plan and preprocess the task, then train folds 0 through 4 in turn
nnUNet_plan_and_preprocess -t "$TASK_ID"
for FOLD in 0 1 2 3 4; do
    nnUNet_train 2d nnUNetTrainerV2 "$TASK_ID" "$FOLD" --npz
done
```
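To make the index arithmetic in this bash file concrete, here is a small Python sketch of how each job in the array ends up with its own task ID. The function names `task_id` and `array_task_ids` are illustrative only; the offset of 149 and the `LSB_JOBINDEX` variable come from the script above.

```python
import os


def task_id(job_index, offset=149):
    """Map an LSF job array index to the task ID used in the bash file."""
    return int(job_index) + offset


def array_task_ids(first, last, offset=149):
    """Task IDs covered by a job array such as "orig[1-3]"."""
    return [task_id(i, offset) for i in range(first, last + 1)]


if __name__ == "__main__":
    # The array "orig[1-3]" covers these three task IDs
    print(array_task_ids(1, 3))
    # Inside a running array job LSF sets LSB_JOBINDEX; default to 1 elsewhere
    print(task_id(os.environ.get("LSB_JOBINDEX", "1")))
```

So the job with index 1 works on task 150, index 2 on task 151, and index 3 on task 152, each on its own GPU.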
Note that the `lpcgpu01` host has 10 GPUs. If you need more, you may want to use the `takim2` host discussed in the first section.
Also, I am running `nnUNet`, a pre-installed Python package. To have a package installed, submit a ticket to PMACS or send an email to Martin Das.
## 4. Concluding remarks
We need more GPUs!