
The CPU memory keeps increasing, and when it is full, an error is raised! #189

Open
nizhihao opened this issue Apr 21, 2021 · 8 comments

@nizhihao

Hi, when I try to train this code with python main.py --train on my local machine, the CPU memory keeps increasing, and when it is full, an error is reported. I think this problem is caused by running out of memory.

Traceback (most recent call last):
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/nzh_projects/kaggle-environments-master/kaggle_agent/HandyRL/handyrl/connection.py", line 190, in _receiver
data, cnt = conn.recv()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/nzh_projects/kaggle-environments-master/kaggle_agent/HandyRL/handyrl/connection.py", line 175, in _sender
conn.send(next(self.send_generator))
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Because my machine only has 32 GB of memory, I tried setting max_episodes to 200000. When the number of samples reached 200k, memory usage was about 95%, but the code still occupied more and more CPU memory as sampling continued. In my opinion, when the number of samples exceeds max_episodes, the replay buffer should delete the old samples, but I don't know why the memory still increases.
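
For reference, this is the behavior I expect from the buffer, as a minimal sketch assuming it is a plain deque with maxlen=max_episodes (the names are illustrative, not the actual HandyRL code):

```python
from collections import deque

max_episodes = 200000

# What I expect: the buffer keeps at most max_episodes entries,
# so appending when full should evict the oldest episode automatically.
buf = deque(maxlen=max_episodes)
for episode in range(max_episodes + 1000):
    buf.append(episode)          # placeholder for an episode record
assert len(buf) == max_episodes  # old entries were dropped, yet memory still grows?
```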

Could you give me some ideas, and how can I solve this problem (without restarting)? Thanks very much!

@ikki407
Member

ikki407 commented Apr 21, 2021

Thank you for your report!
After checking your stack trace, I think your error is perhaps the same as the one discussed in #184 (comment).

Could you check it out? Thanks.

@YuriCat
Contributor

YuriCat commented Apr 25, 2021

Thank you for your detailed report!
Since the episodes between agents at the beginning of training are very short in some environments, it's not strange that memory usage keeps increasing even after the replay buffer has been filled with episodes.

If you increase compress_steps, you can save memory, though making batches will slow down.
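
To give a rough idea of the trade-off, here is a minimal sketch (not HandyRL's actual code) assuming episodes are pickled and bz2-compressed in chunks of compress_steps steps:

```python
import bz2
import pickle

def compress_episode(steps, compress_steps):
    # Pickle and bz2-compress a trajectory in chunks of compress_steps steps.
    # Larger chunks compress better (less memory), but every sampled step
    # requires decompressing its whole chunk, so batch making slows down.
    return [
        bz2.compress(pickle.dumps(steps[i:i + compress_steps]))
        for i in range(0, len(steps), compress_steps)
    ]

def read_step(chunks, compress_steps, t):
    # Recover step t by decompressing only the chunk that contains it.
    chunk = pickle.loads(bz2.decompress(chunks[t // compress_steps]))
    return chunk[t % compress_steps]
```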

@nizhihao
Author

OK! Thanks for your reply!
First, let me try restarting training when memory is full; because the pre-trained model has already learned some skills, every episode will then have about the same length, I think.
Second, I will try to decrease max_episodes and train again.

And if I increase compress_steps, will I lose the trajectory's precision, or will it only slow down training?

@digitalspecialists

digitalspecialists commented Apr 26, 2021

> Thank you for your detailed report!
> Since the episodes between agents at the beginning of training are very short in some environments, it's not strange that memory usage keeps increasing even after the replay buffer has been filled with episodes.
>
> If you increase compress_steps, you can save memory, though making batches will slow down.

If I set initial/maximum_episodes to 10k/14k, and even mem_used_ratio to 0.04 (6 GB) as a safeguard, it still blows past 128 GB with the 14k buffer once we've processed 180k matches.
EDIT: Solved. It seems this was due to CUDA 11.2 ... going back to CUDA 11.1 and all is good.

@YuriCat
Contributor

YuriCat commented Apr 29, 2021

@nizhihao
There is no precision loss with compress_steps.
It only slows down batch creation. If batch creation is not fast enough to keep the GPUs busy without a break, training will slow down. So as long as you prepare enough batchers to feed the GPUs continuously, there is no problem, although the batchers also consume CPU time.
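
As a simplified illustration of why more batchers hide the extra decompression cost (a threaded stand-in, not the actual batcher implementation; make_batch and the parameters here are hypothetical):

```python
import queue
import threading

def run_batchers(make_batch, num_batchers, maxsize=8):
    # Start num_batchers workers that decompress episodes and prepare
    # training batches ahead of time into a bounded queue.
    q = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            q.put(make_batch())  # blocks once enough batches are buffered

    for _ in range(num_batchers):
        threading.Thread(target=worker, daemon=True).start()
    return q

# Training-loop side (sketch): as long as the queue never runs empty,
# the GPU never waits on decompression.
# batch_queue = run_batchers(make_batch, num_batchers=4)
# batch = batch_queue.get()
```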

@YuriCat
Contributor

YuriCat commented Apr 29, 2021

@digitalspecialists
Thank you for your helpful report!
So, is there a problem around CUDA 11.2? We'll look into it when we get a chance.

@digitalspecialists

digitalspecialists commented May 27, 2021

An issue seems to persist.

Clearly later episodes take more space than earlier episodes due to more successful agents.

maximum_episodes is capped when we first hit 95% of mem.

But maximum_episodes is never further reduced as the memory usage of the deque continues to grow with longer episodes.

That is, the replay buffer mem usage continues to grow inside the capped maximum_episodes queue size.

I think what is needed is to call self.trainer.episodes.popleft() while mem_used_ratio >= 0.95, something like the sketch below.
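
A minimal sketch of that idea (assuming psutil for the memory check; the episodes name comes from the discussion above, the rest is hypothetical):

```python
import psutil

def trim_replay_buffer(episodes, mem_limit_ratio=0.95, min_episodes=1):
    # Drop the oldest episodes while system memory usage stays above the limit.
    # `episodes` is assumed to be the deque behind the replay buffer
    # (e.g. self.trainer.episodes).
    while (psutil.virtual_memory().percent / 100.0 >= mem_limit_ratio
           and len(episodes) > min_episodes):
        episodes.popleft()  # oldest (shortest, earliest) episodes go first
```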

@YuriCat
Contributor

YuriCat commented May 28, 2021

@digitalspecialists
I understand what you pointed out. Yes, that's necessary. Thanks!
