[Cherry-pick] Fix mem release error. #32655

Merged (1 commit) on Apr 29, 2021

@jiweibo jiweibo commented Apr 28, 2021

PR types

Others

PR changes

Others

Describe

cherry-pick #32654

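Currently, when a Predictor is released, it briefly occupies GPU memory on every GPU in the machine.

The cause is that the scope_ inside Predictor, when it is destroyed, iterates over all devices and calls memory::Release(place) for each one in turn. That interface has to call low-level CUDA functions, so it creates CUDA handles and similar resources on each device, which take up GPU memory.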

    scope_.reset(new paddle::framework::Scope(), [](framework::Scope *scope) {
      delete scope;
#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
      for (int dev_id = 0; dev_id < paddle::platform::GetCUDADeviceCount();
           ++dev_id) {
        memory::Release(platform::CUDAPlace(dev_id));
      }
#endif
#ifdef PADDLE_WITH_XPU
      for (int dev_id = 0; dev_id < paddle::platform::GetXPUDeviceCount();
           ++dev_id) {
        memory::Release(platform::XPUPlace(dev_id));
      }
#endif
      memory::Release(platform::CPUPlace());
    });

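After the Predictor Clone() interface is called, the lifetime of scope_ may be longer than that of the Predictor, so the user-specified GPU (device_id) cannot be obtained at that point; the change made in the following PR leads to undefined behavior: https://github.com/PaddlePaddle/Paddle/pull/28409/files#diff-f6feda974e038d722114830a39bea985fd814e28bd86bcd33248aece3c3181a4R178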
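The lifetime issue can be illustrated with a minimal, self-contained sketch. `Scope` and `Predictor` below are hypothetical stand-ins, not the Paddle classes, and the "clone" is simply a second object holding the same `std::shared_ptr`. It only shows that a custom deleter on a shared scope runs when the last owner goes away, which may be after the original predictor has been destroyed.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for illustration; not the Paddle classes.
struct Scope {};

struct Predictor {
  int device_id;
  std::shared_ptr<Scope> scope;
};

// Returns true if the scope deleter ran only after the original
// predictor had already been destroyed.
bool DeleterRunsAfterOwnerIsGone() {
  bool owner_alive = true;
  bool deleter_ran = false;
  bool owner_alive_when_deleted = true;

  auto *original = new Predictor{0, nullptr};
  original->scope.reset(new Scope(), [&](Scope *s) {
    delete s;
    deleter_ran = true;
    owner_alive_when_deleted = owner_alive;
  });

  // "Clone": a second predictor sharing the same scope.
  auto *clone = new Predictor{original->device_id, original->scope};

  delete original;       // the original predictor goes away first
  owner_alive = false;
  assert(!deleter_ran);  // the scope is still held by the clone

  delete clone;          // last owner released; the deleter runs now
  return deleter_ran && !owner_alive_when_deleted;
}
```

Anything the deleter reads from the original predictor at that point (for example its `device_id` member through a captured pointer) would be dangling, which is the kind of undefined behavior described above.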

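So here we remove the logic that iterates over all GPUs and releases memory on each one, and restore the original code. As a result, the GPU memory held by the weights is eventually returned to the memory pool, but the size of the memory pool itself is not reduced.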
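The difference between returning memory to the pool and shrinking the pool can be sketched with a toy model. `ToyPool` is hypothetical and is not Paddle's allocator; it assumes that a Release-style call hands idle cached memory back to the device, while a plain free only makes the bytes reusable inside the pool.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a caching allocator; hypothetical, not Paddle's allocator.
class ToyPool {
 public:
  // Hand out `bytes`, growing the reservation only when the cache is short.
  void Alloc(std::size_t bytes) {
    in_use_ += bytes;
    if (in_use_ > reserved_) reserved_ = in_use_;
  }

  // Return `bytes` to the pool: they become reusable but stay reserved.
  void Free(std::size_t bytes) {
    assert(bytes <= in_use_);
    in_use_ -= bytes;
  }

  // Give idle memory back to the device (assumed Release-style behavior).
  void Release() { reserved_ = in_use_; }

  std::size_t in_use() const { return in_use_; }
  std::size_t reserved() const { return reserved_; }

 private:
  std::size_t in_use_ = 0;
  std::size_t reserved_ = 0;
};
```

In this model, freeing the weights without calling `Release()` leaves `reserved()` unchanged, which mirrors the behavior described above: the process keeps its GPU memory footprint on the device it used, but no other device is touched.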

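Symptoms of the old problem:
Test code: https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/c%2B%2B/test/shrink_memory
1. After the Predictor is initialized (config sets initGpu to 500M), GPU memory usage is 780M (handles + weights, etc.)
2. After one run with batch_size 100, GPU memory usage is 4418M
3. After calling the ShrinkMemory interface, GPU memory usage is 1292M (handles + weights + other?)
4. After one run with batch_size 2, GPU memory usage is 1728M
5. After the Predictor is destructed, GPU 0 uses 792M and the other GPUs use 280M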

After the code change:

5. After the Predictor is destructed, GPU memory usage is 1292M, but the memory usage of the other GPUs is not affected.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

@Superjomn Superjomn merged commit a5627df into PaddlePaddle:release/2.1 Apr 29, 2021
@jiweibo jiweibo deleted the release/2.1 branch April 29, 2021 05:56