Add new APIs for GPU memory monitoring (max_memory_allocated, max_memory_reserved, memory_allocated, memory_reserved) #38657

From00 · 2022-01-01T13:12:32Z

PR types

New features

PR changes

APIs

Describe

Add 4 new APIs: paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved
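
As a rough illustration of how the four names relate, here is a minimal pure-Python sketch. It assumes that "allocated" counts bytes handed out to tensors, that "reserved" counts bytes the allocator has requested from the device, and that the `max_` variants are high-water marks of those two values. The `MemoryStat` class and the byte sizes are hypothetical, and the sketch does not call Paddle.

```python
class MemoryStat:
    """Tracks a current value and its high-water mark (hypothetical helper)."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def update(self, increment):
        # A negative increment models a free.
        self.current += increment
        self.peak = max(self.peak, self.current)


allocated = MemoryStat()  # models memory_allocated / max_memory_allocated
reserved = MemoryStat()   # models memory_reserved / max_memory_reserved

reserved.update(1024)   # the allocator requests a 1024-byte chunk from the device
allocated.update(256)   # a tensor takes 256 bytes out of that chunk
allocated.update(512)   # a second tensor takes 512 bytes
allocated.update(-256)  # the first tensor is freed; the chunk stays cached

print(allocated.current, allocated.peak)  # 512 768
print(reserved.current, reserved.peak)    # 1024 1024
```

In this model the reserved value never drops below the allocated value, because freed tensor memory goes back to the allocator's cache rather than to the device.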

CN docs PR: PaddlePaddle/docs#4193

paddle-bot-old · 2022-01-01T13:12:40Z

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

paddle-bot-old · 2022-01-10T02:38:12Z

Sorry to inform you that the CI runs for 4d506ea passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

… add-new-api-memory_reserved

Shixiaowei02 · 2022-01-18T02:59:37Z

paddle/fluid/memory/allocation/allocator_facade.cc

+  if (platform::is_gpu_place(place)) {
+    int dev_id = place.GetDeviceId();
+    int64_t alloc_size =
+        STAT_INT_ADD("STAT_gpu" + std::to_string(dev_id) + "_alloc_size",


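Can allocating memory through AllocatorFacade here be made equivalent to obtaining the concrete Allocator and returning Allocator->Allocate(size)? Going forward, Tensor is planned to bypass AllocatorFacade and be passed a concrete Allocator directly.

This was discussed separately. The logic that collects GPU memory statistics here cannot be implemented inside the concrete Allocator, so it is not equivalent to pten's plan of obtaining an Allocator object directly and allocating from it. pten's Alloc interface will therefore also need similar statistics-collection logic after it obtains an Allocator and allocates memory. Some of the changes here do not fit well with the original design of Allocator and pten. For the short term, we will align on this and then merge, so that development of related features is not blocked. Once the people responsible for the pten project have time, similar issues will be discussed together and reworked. @phlrain @chenwhql @zhiqiu @jim19930609

Per the consensus reached earlier, the unified exit point for Allocator allocation logic is https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/memory/allocation/allocator.h#L142 . For schedule reasons I am approving this merge for now; @From00 will handle the related issues later.

Shixiaowei02

Per the consensus reached earlier, the unified exit point for Allocator allocation logic is https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/memory/allocation/allocator.h#L142 . This PR's implementation currently differs from that. For schedule reasons I am approving this merge for now; @From00 will handle the related issues later.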

paddle-bot-old · 2022-01-30T02:37:39Z

Sorry to inform you that the CI runs for fb04a61 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

… add-new-api-memory_reserved

ZHUI · 2022-03-22T08:53:40Z

Is this PR planned to ship with the paddle2.3 release? @From00

liutiexing · 2022-03-25T10:24:22Z

paddle/fluid/memory/stats.h

+  void Update(int64_t increment) override {
+    ThreadLocalStatType thread_local_stat =
+        ThreadDataRegistry<ThreadLocalStatType>::GetInstance()
+            .GetCurrentThreadData();


GetMutableCurrentThreadData is designed for read-and-write scenarios.
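
The diff above assigns the result of GetCurrentThreadData() to a local ThreadLocalStatType value. Assuming that assignment makes a copy, an update applied to it would not reach the per-thread data held by the registry, which is presumably why the mutable accessor is suggested. The Python sketch below only illustrates that difference; the class and method names are hypothetical stand-ins that mirror the C++ names and are not Paddle code.

```python
import copy
import threading
from dataclasses import dataclass


@dataclass
class ThreadLocalStat:
    current: int = 0
    peak: int = 0


class ThreadDataRegistry:
    """Hypothetical stand-in: keeps one ThreadLocalStat per thread id."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get_mutable_current_thread_data(self):
        # Read-write accessor: returns the registered object itself.
        with self._lock:
            return self._data.setdefault(threading.get_ident(), ThreadLocalStat())

    def get_current_thread_data(self):
        # Read-only style accessor: the caller ends up with a copy.
        return copy.copy(self.get_mutable_current_thread_data())


def update(stat, increment):
    stat.current += increment
    stat.peak = max(stat.peak, stat.current)


registry = ThreadDataRegistry()

update(registry.get_current_thread_data(), 100)          # mutates a copy; the change is lost
after_copy = registry.get_current_thread_data().current

update(registry.get_mutable_current_thread_data(), 100)  # mutates the registered stat
after_mutable = registry.get_current_thread_data().current

print(after_copy, after_mutable)  # 0 100
```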

From00 · 2022-03-26T02:24:40Z

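Per the consensus reached earlier, the unified exit point for Allocator allocation logic is https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/memory/allocation/allocator.h#L142 . This PR's implementation currently differs from that. For schedule reasons I am approving this merge for now; @From00 will handle the related issues later.

The new approach implements GPU memory statistics by wrapping allocators in a StatAllocator layer. It no longer affects GetAllocator, the exit function for Allocator allocation logic. It is consistent with the design principles of PHI (formerly PTEN) and is compatible with obtaining an Allocator object directly and allocating memory from it.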
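
The wrapping idea can be sketched as a decorator-style allocator. This is a minimal pure-Python illustration under stated assumptions, not the PR's C++ implementation: `Allocator`, `StatAllocator`, `get_allocator`, and the `bytearray` buffers are all hypothetical stand-ins. The point is that a caller who obtains the allocator directly, without going through a facade, is still counted.

```python
class Allocator:
    """Hypothetical base allocator that just hands out byte buffers."""

    def allocate(self, size):
        return bytearray(size)

    def free(self, buf):
        pass


class StatAllocator:
    """Wraps another allocator and records sizes on allocate and free."""

    def __init__(self, underlying):
        self._underlying = underlying
        self.current = 0
        self.peak = 0

    def allocate(self, size):
        buf = self._underlying.allocate(size)
        self.current += size
        self.peak = max(self.peak, self.current)
        return buf

    def free(self, buf):
        self.current -= len(buf)
        self._underlying.free(buf)


_ALLOCATOR = StatAllocator(Allocator())


def get_allocator():
    # Single exit point: callers always receive the wrapped allocator.
    return _ALLOCATOR


alloc = get_allocator()  # the caller holds the allocator directly
a = alloc.allocate(300)
b = alloc.allocate(200)
alloc.free(a)

print(alloc.current, alloc.peak)  # 200 500
```

Because the statistics live in the wrapper rather than in a facade method, the exit function only has to return the wrapped object and needs no counting logic of its own.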

From00 · 2022-03-26T02:26:00Z

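Is this PR planned to ship with the paddle2.3 release? @From00

Yes. This PR has been moving slowly because of limited manpower, but we will try to ship it with Paddle 2.3.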

tianshuo78520a

approve for coverage-ci build size

zhiqiu

LGTM

TCChenlong

LGTM

XiaoguangHu01

LG API

luotao1 · 2022-05-25T07:53:21Z

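Explanation of the build directory size increase and exemption request

PR content: implements a new set of basic components for model GPU memory monitoring under the fluid/memory directory of the main paddle framework. They collect and aggregate the underlying Allocator's GPU memory allocations in real time and expose Python APIs for users.
Impact: on Coverage CI the build directory grows from 136G to 140G; in a local build it grows from 212G to 216G. The 4G increase exceeds the CI limit of 3G.
Cause: Allocator code is so low-level that almost every module in paddle depends on it, so changes and additions at this level ripple through the whole codebase. In local tests, after adding the feature the memory directory itself grows by about 0.4G (of which 110M is new unit test files), the phi, framework and operator directories each grow by about 1G, the framework directory grows by about 0.4G, and the imperative directory grows by about 0.1G, for a total build directory increase of nearly 4G. The added size comes mainly from the header paddle/fluid/memory/stats.h. That file implements the new GPU memory statistics components, which allocator and gpu_info use to track in real time the GPU memory that allocators hand out to Tensors and the GPU memory requested from the device. allocator and gpu_info are in turn depended on by most files in the framework.
Other notes:
1. Is merging this PR necessary? GPU memory problems are widespread and affect many users. Without GPU memory monitoring, memory-related problems in model training are hard to diagnose, memory-focused performance analysis and optimization are inconvenient, and user experience suffers badly. Users have requested this feature many times.
2. Can the existing performance-monitoring code in the framework be reused instead of introducing a new implementation? We first tried using the existing monitor component for GPU memory monitoring, but measurements showed that it performs poorly. GPU memory operations are frequent during model training and training is very sensitive to the performance of the memory module, so using the existing code for memory monitoring reduced training performance by more than 10% at worst on many models. A new high-performance statistics scheme was therefore needed.
3. Can the implementation be optimized for build size? We considered keeping additions out of widely included headers, but this code is hard to implement outside headers. First, to support multiple statistics, the data structures are class templates, which must be defined in headers. Second, for runtime performance, some mapping logic is written as function-like macros that are resolved at compile time, and macros can only be used by other modules if they are defined in headers.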

Add new API memory_reserved

4d506ea

Add memory_allocated, max_memory_reserved and max_memory_allocater

2f8e782

From00 changed the title ~~Add new API memory_reserved and max_memory_reserved~~ Add new APIs for GPU memory monitoring (memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated) Jan 17, 2022

From00 added 2 commits January 17, 2022 15:09

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

3704d55

… add-new-api-memory_reserved

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

2525fce

… add-new-api-memory_reserved

From00 changed the title ~~Add new APIs for GPU memory monitoring (memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated)~~ Add new APIs for GPU memory monitoring (max_memory_allocated, max_memory_reserved, memory_allocated, memory_reserved) Jan 17, 2022

From00 added 2 commits January 17, 2022 19:45

Fix CI error

b3d048f

Fix CI error

22ae0af

Shixiaowei02 reviewed Jan 18, 2022

From00 requested a review from zhiqiu January 20, 2022 06:10

Shixiaowei02 previously approved these changes Jan 20, 2022

Enhance UT

fb04a61

From00 dismissed Shixiaowei02’s stale review via fb04a61 January 21, 2022 11:18

From00 added 7 commits February 7, 2022 10:48

Add FLAGS_memory_stats_opt

1e539ef

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

ac21435

… add-new-api-memory_reserved

Add STATS macro functions

b0ae93a

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

bc75cfe

… add-new-api-memory_reserved

Add StatAllocator

a633df6

Fix CI errors

827da88

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

5986842

… add-new-api-memory_reserved

liutiexing requested changes Mar 25, 2022

From00 added 2 commits March 26, 2022 02:28

Add UT

678acca

Fix CI errors

2363830

From00 requested a review from phlrain March 27, 2022 09:24

From00 requested review from TCChenlong, XiaoguangHu01 and liutiexing March 27, 2022 09:24

From00 mentioned this pull request Mar 27, 2022

Add Chinese docs for 4 new GPU memory monitoring APIs (max_memory_allocated, max_memory_reserved, memory_allocated, memory_reserved) PaddlePaddle/docs#4193

Merged

tianshuo78520a approved these changes Mar 27, 2022

zhiqiu approved these changes Mar 28, 2022

TCChenlong approved these changes Mar 29, 2022

phlrain approved these changes Mar 29, 2022

XiaoguangHu01 approved these changes Mar 30, 2022

chenwhql approved these changes Mar 30, 2022

From00 merged commit afe02e9 into PaddlePaddle:develop Mar 30, 2022

From00 mentioned this pull request Mar 30, 2022

How to analyze a model's GPU memory usage in paddle? #38193

Closed

From00 deleted the add-new-api-memory_reserved branch April 4, 2022 12:29

From00 mentioned this pull request Apr 21, 2022

Print memory peak message for UT #42092

Merged

From00 mentioned this pull request May 27, 2022

Support memory stats for CPU #42945

Merged
