BrokenPipeError: [Errno 32] Broken pipe #184

Open

R-Ceph opened this issue Apr 13, 2021 · 7 comments

Comments

R-Ceph commented Apr 13, 2021

When training reached epoch 189, the training was interrupted on a server.
The same config seems to run fine on my own computer.

Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 175, in _sender
conn.send(next(self.send_generator))
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 398, in _send_bytes
self._send(buf)
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 190, in _receiver
data, cnt = conn.recv()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 190, in _receiver
data, cnt = conn.recv()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 620, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

```yaml
train_args:
    turn_based_training: False
    observation: True
    gamma: 0.8
    forward_steps: 32
    compress_steps: 4
    entropy_regularization: 2.0e-3
    entropy_regularization_decay: 0.3
    update_episodes: 300
    batch_size: 400
    minimum_episodes: 10000
    maximum_episodes: 250000
    num_batchers: 7
    eval_rate: 0.1
    worker:
        num_parallel: 6
    lambda: 0.7
    policy_target: 'UPGO' # 'UPGO' 'VTRACE' 'TD' 'MC'
    value_target: 'TD' # 'VTRACE' 'TD' 'MC'
    seed: 0
    restart_epoch: 0

worker_args:
    server_address: ''
    num_parallel: 6
```
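
For reference, the BrokenPipeError / EOFError / ConnectionRefusedError pattern in the tracebacks above is what Python's multiprocessing connections raise once the process on the other side of the pipe has died (for example, after being killed by the OOM killer). A minimal sketch, not HandyRL code, that reproduces the sender-side failure on Linux:

```python
# Minimal reproduction sketch (not HandyRL code): once the peer process of a
# multiprocessing connection has exited, further sends fail with
# BrokenPipeError: [Errno 32] Broken pipe on Linux.
import multiprocessing as mp
import time


def receiver(conn):
    conn.recv()        # receive one message, then exit without a clean close


if __name__ == '__main__':
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=receiver, args=(child_conn,))
    p.start()
    child_conn.close()             # parent keeps only its own end
    parent_conn.send('model 1')    # delivered while the child is alive
    p.join()                       # child has exited; its pipe end is gone
    time.sleep(0.1)
    parent_conn.send('model 2')    # raises BrokenPipeError: [Errno 32] Broken pipe
```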

R-Ceph (Author) commented Apr 15, 2021

It seems that when running "python main.py --worker" and "python main.py --train-server" on the same computer, the worker process takes up a lot of memory, and that is when this kind of error occurs.

In my last run, the problem appeared after 129400 episodes; the worker process was using about 50GB of memory and the train-server about 10GB.
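
(A quick way to confirm this kind of per-process memory growth is a sketch like the one below; it assumes psutil is installed and simply matches any process whose command line contains main.py.)

```python
# Rough memory monitor (illustrative, not part of HandyRL): print the resident
# memory of every process whose command line mentions main.py, i.e. the worker
# and the train-server started from the same script.
import psutil

for p in psutil.process_iter(['pid', 'cmdline', 'memory_info']):
    cmd = ' '.join(p.info['cmdline'] or [])
    if 'main.py' in cmd:
        rss_gb = p.info['memory_info'].rss / 1e9
        print(f"{p.info['pid']:>7}  {rss_gb:6.2f} GB  {cmd}")
```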

128900 129000 129100
epoch 396
win rate = 0.999 (124.8 / 125)
generation stats = 0.000 +- 0.742
loss = p:-0.000 v:0.012 ent:0.324 total:0.012
updated model(54158)
129200 129300 129400
epoch 397
win rate = 0.991 (125.8 / 127)
generation stats = -0.000 +- 0.742
loss = p:-0.000 v:0.012 ent:0.329 total:0.012
updated model(54282)
Exception in thread Thread-7:
Traceback (most recent call last):
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 190, in _receiver
data, cnt = conn.recv()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

YuriCat (Contributor) commented Apr 15, 2021

@han-x

Thank you very much for your report.
You are training a big neural network, aren't you?
In the current implementation, each Gather process stores all models after the workers start.
Rewriting worker.py to store only the latest model is currently the best way to reduce memory usage.

We are considering adding an option to select the training scheme and to avoid storing unused models.
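
(A minimal sketch of that suggestion, with illustrative names rather than the actual worker.py API: instead of keeping every received model keyed by its id, discard older entries and keep only the newest one.)

```python
# Illustrative sketch only (names do not match HandyRL's worker.py): keep a
# single, most recent model instead of accumulating one entry per model id.
class LatestModelStore:
    def __init__(self):
        self.model_id = None
        self.model = None

    def add(self, model_id, model):
        # Accumulating `models[model_id] = model` for every id makes memory
        # grow with the number of epochs; keeping only the newest model
        # bounds memory at the size of one model.
        if self.model_id is None or model_id > self.model_id:
            self.model_id = model_id
            self.model = model

    def get(self):
        return self.model
```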

R-Ceph closed this as completed Apr 15, 2021
R-Ceph reopened this Apr 15, 2021
R-Ceph (Author) commented Apr 15, 2021

Thanks for your reply! But I just used the original GeeseNet...
I will try the method you suggested, thanks again LoL

YuriCat (Contributor) commented Apr 15, 2021

@han-x

Thanks. That's a strange case...
Since the original GeeseNet is only about 500KB, it should occupy only around 100MB even after 200 epochs.

There is nothing wrong with 129k episodes occupying 10GB in the trainer process.
You can increase compress_steps if you want to save memory, but it will slow down batch creation and can therefore slow down training as a result.
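
(To illustrate the trade-off, here is a rough sketch of the general idea behind step compression, not the exact HandyRL implementation: episode steps are pickled and compressed in chunks of compress_steps, so larger chunks use less memory but cost more CPU every time a chunk has to be decompressed while building batches.)

```python
# Rough sketch of the idea behind compress_steps (illustrative, not the exact
# HandyRL code): larger chunks compress better and use less RAM, but every
# sampled chunk must be decompressed again when batches are made.
import bz2
import pickle

def compress_episode(steps, compress_steps=4):
    # Group per-step records into chunks and compress each chunk.
    return [
        bz2.compress(pickle.dumps(steps[i:i + compress_steps]))
        for i in range(0, len(steps), compress_steps)
    ]

def decompress_chunk(blob):
    # Paid on every batch that samples from this chunk.
    return pickle.loads(bz2.decompress(blob))
```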

ikki407 (Member) commented Apr 15, 2021

Hi @han-x !

Could you give me some information like the below so we can look at this from various perspectives?

  • server machine info (Ubuntu? OSX? Windows?)
  • machine specs (#CPUs, #GPUs, RAM)
  • pytorch version

Thanks!

R-Ceph (Author) commented Apr 15, 2021

  • server machine info: Ubuntu 16.04.7 LTS
  • CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
  • RAM: 64GB
  • GPU: 2080 * 2
  • pytorch version: 1.7.0

ikki407 (Member) commented Apr 21, 2021

I noticed a possible cause in your stack traces. Are you using the code from the current master branch? There seem to be some differences between your script and the script in the master branch.

A similar error happened before, and we solved it in #145.

Could you check that, and update your code if you are running an old version? Thanks.
