
The CPU memory keeps increasing, and when it is full, an error is raised! #189

Open
nizhihao opened this issue Apr 21, 2021 · 8 comments

@nizhihao

Hi, when I try to train this code with python main.py --train on my local machine, the CPU memory keeps increasing, and when it is full, an error is reported. I think this problem is caused by running out of memory.

Traceback (most recent call last):
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/nzh_projects/kaggle-environments-master/kaggle_agent/HandyRL/handyrl/connection.py", line 190, in _receiver
data, cnt = conn.recv()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/nzh_projects/kaggle-environments-master/kaggle_agent/HandyRL/handyrl/connection.py", line 175, in _sender
conn.send(next(self.send_generator))
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Because my machine only has 32 GB of memory, I tried setting max_episodes to 200000. When the number of samples reached 200k, memory usage was about 95%, but the code still occupied more and more CPU memory as sampling continued. In my opinion, when the number of samples exceeds max_episodes, the replay buffer should delete the old samples, but I don't know why the memory still increases.
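
For reference, this is the behavior I expect from the buffer, as a minimal sketch assuming it is a plain deque with maxlen=max_episodes (the names are illustrative, not the actual HandyRL code):

```python
from collections import deque

max_episodes = 200000

# What I expect: the buffer keeps at most max_episodes entries,
# so appending when full should evict the oldest episode automatically.
buf = deque(maxlen=max_episodes)
for episode in range(max_episodes + 1000):
    buf.append(episode)          # placeholder for an episode record
assert len(buf) == max_episodes  # old entries were dropped, yet memory still grows?
```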

Could you give me some ideas, and how can I solve this problem (without restarting)? Thanks very much!

@ikki407
Member

ikki407 commented Apr 21, 2021

Thank you for your report!
After checking your stack trace, I think your error is perhaps the same as the one discussed in #184 (comment).

Could you check it out? Thanks.

@YuriCat
Contributor

YuriCat commented Apr 25, 2021

Thank you for your detailed report!
Since the episodes between agents at the beginning of training are very short in some environments, it's not strange that memory usage keeps increasing even after the replay buffer has been filled with episodes.

If you increase compress_steps, you can save memory, though making batches will slow down.
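
To give a rough idea of the trade-off, here is a minimal sketch (not HandyRL's actual code) assuming episodes are pickled and bz2-compressed in chunks of compress_steps steps:

```python
import bz2
import pickle

def compress_episode(steps, compress_steps):
    # Pickle and bz2-compress a trajectory in chunks of compress_steps steps.
    # Larger chunks compress better (less memory), but every sampled step
    # requires decompressing its whole chunk, so batch making slows down.
    return [
        bz2.compress(pickle.dumps(steps[i:i + compress_steps]))
        for i in range(0, len(steps), compress_steps)
    ]

def read_step(chunks, compress_steps, t):
    # Recover step t by decompressing only the chunk that contains it.
    chunk = pickle.loads(bz2.decompress(chunks[t // compress_steps]))
    return chunk[t % compress_steps]
```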

@nizhihao
Author

OK! Thanks for your reply!
First, let me try restarting training when memory is full; because the pre-trained model has already learned some skills, every episode will then have about the same length, I think.
Second, I will try to decrease max_episodes and train again.

And if I increase compress_steps, will I lose the trajectory's precision, or will it only slow down training?

@digitalspecialists

digitalspecialists commented Apr 26, 2021

> Thank you for your detailed report!
> Since the episodes between agents at the beginning of training are very short in some environments, it's not strange that memory usage keeps increasing even after the replay buffer has been filled with episodes.
>
> If you increase compress_steps, you can save memory, though making batches will slow down.

If I set initial/maximum_episodes to 10k/14k, and even mem_used_ratio to 0.04 (6 GB) as a safeguard, it still blows past 128 GB with the 14k buffer once we've processed 180k matches.
EDIT: Solved. It seems this was due to CUDA 11.2 ... going back to CUDA 11.1 and all is good.

@YuriCat
Contributor

YuriCat commented Apr 29, 2021

@nizhihao
There is no precision loss with compress_steps.
It only slows down batch creation. If batch creation is not fast enough to keep the GPUs busy without a break, training will slow down. So as long as you prepare enough batchers to feed the GPUs continuously, there is no problem, although the batchers also consume CPU time.
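
As a simplified illustration of why more batchers hide the extra decompression cost (a threaded stand-in, not the actual batcher implementation; make_batch and the parameters here are hypothetical):

```python
import queue
import threading

def run_batchers(make_batch, num_batchers, maxsize=8):
    # Start num_batchers workers that decompress episodes and prepare
    # training batches ahead of time into a bounded queue.
    q = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            q.put(make_batch())  # blocks once enough batches are buffered

    for _ in range(num_batchers):
        threading.Thread(target=worker, daemon=True).start()
    return q

# Training-loop side (sketch): as long as the queue never runs empty,
# the GPU never waits on decompression.
# batch_queue = run_batchers(make_batch, num_batchers=4)
# batch = batch_queue.get()
```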

@YuriCat
Contributor

YuriCat commented Apr 29, 2021

@digitalspecialists
Thank you for your helpful report!
So, is there a problem around CUDA 11.2? We'll look into it when we get a chance.

@digitalspecialists

digitalspecialists commented May 27, 2021

An issue seems to persist.

Clearly later episodes take more space than earlier episodes due to more successful agents.

maximum_episodes is capped when we first hit 95% of mem.

But maximum_episodes is never further reduced as the memory usage of the deque continues to grow with longer episodes.

That is, the replay buffer mem usage continues to grow inside the capped maximum_episodes queue size.

I think what is needed is to call self.trainer.episodes.popleft() while mem_used_ratio >= 0.95, something like the sketch below.
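
A minimal sketch of that idea (assuming psutil for the memory check; the episodes name comes from the discussion above, the rest is hypothetical):

```python
import psutil

def trim_replay_buffer(episodes, mem_limit_ratio=0.95, min_episodes=1):
    # Drop the oldest episodes while system memory usage stays above the limit.
    # `episodes` is assumed to be the deque behind the replay buffer
    # (e.g. self.trainer.episodes).
    while (psutil.virtual_memory().percent / 100.0 >= mem_limit_ratio
           and len(episodes) > min_episodes):
        episodes.popleft()  # oldest (shortest, earliest) episodes go first
```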

@YuriCat
Contributor

YuriCat commented May 28, 2021

@digitalspecialists
I understand what you pointed out. Yes, that's necessary. Thanks!
