
Memory Leaks when training the transformer model #10492

Closed

gongweibao opened this issue May 8, 2018 · 2 comments

gongweibao (Contributor) commented May 8, 2018

Background:

I found memory leaks while running the transformer model: memory grows at roughly 100 KB per batch, and both the trainer and the pserver hit this problem.

Generally, memory grows for two reasons:

  • Malloc'ed (new'ed) memory that is never freed.
  • Memory fragmentation.

Using the pprof tool to run all of the C++ unit tests, I found two locations where memory was not freed:

But the memory still increases over time even after I fixed the above.
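For context, the pprof runs above rely on gperftools. Below is a minimal sketch of how a single step can be bracketed with the gperftools heap-profiler API so that pprof can diff what stays live after each batch; LeakyBatch is a hypothetical stand-in for a real trainer/pserver step and leaks on purpose so the dump has something to show:

// Minimal sketch of the gperftools heap-profiler API that pprof reads.
// LeakyBatch is a hypothetical stand-in for one trainer/pserver step; it
// deliberately keeps memory alive so the dumps have something to show.
// Build with: g++ heap_demo.cc -ltcmalloc
#include <gperftools/heap-profiler.h>
#include <vector>

static std::vector<int*> g_kept;            // simulates allocations never freed

void LeakyBatch() { g_kept.push_back(new int[1024]); }

int main() {
  HeapProfilerStart("/tmp/transformer");    // prefix of the .heap dump files
  for (int i = 0; i < 10; ++i) {
    LeakyBatch();
    HeapProfilerDump("after-batch");        // one dump per batch, for diffing
  }
  HeapProfilerStop();
  // Inspect a dump with: pprof --text ./a.out /tmp/transformer.0001.heap
  return 0;
}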

Analysis

First, I used pprof and Valgrind on the Python interface directly, but the report contains a lot of warnings:

  • I wrote a standalone C++ executor that is friendlier to the memory-check tools, but I found nothing except the initialization memory leak.
  • I compiled a debug build of Python for the memory-check tools, and the result is similar to the above.

Second, I suspected memory fragmentation in the glibc memory pool:

  • Called malloc_trim to release unused memory (see the sketch after this list): it did not help.
  • Preloaded tcmalloc.so via LD_PRELOAD and set TCMALLOC_RELEASE_RATE=10.0 (the maximum value): it did not help.
  • Linked tcmalloc into Paddle directly: because of our complicated dependencies and their link order, I hit a "free(): invalid pointer" error, so the link failed.
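For reference, the malloc_trim experiment mentioned in the list above amounts to something like the following glibc-specific sketch; the mallopt thresholds here are illustrative values, not settings we actually tuned:

// glibc-specific sketch: ask the allocator to return free arena memory to the
// OS after each batch, and lower the mmap/trim thresholds so that large
// temporary buffers are less likely to fragment the main arena.
#include <malloc.h>   // malloc_trim, mallopt (glibc only)
#include <cstdio>

void ReleaseFreeHeapMemory() {
  // 0 = keep no extra padding at the top of the heap.
  int released = malloc_trim(0);
  std::printf("malloc_trim released memory: %s\n", released ? "yes" : "no");
}

int main() {
  // Allocations above 128 KB go straight to mmap, so freeing them really
  // returns pages to the OS instead of growing the sbrk arena.
  mallopt(M_MMAP_THRESHOLD, 128 * 1024);
  mallopt(M_TRIM_THRESHOLD, 128 * 1024);

  // ... run batches here ...
  ReleaseFreeHeapMemory();   // e.g. at the end of every batch
  return 0;
}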

Third, I considered a Python-level memory leak.

Fourth, I used mallinfo to trace memory consumption. Some memory is indeed consumed on every batch, but I cannot pin it on a particular operator: every operator allocates memory with new or malloc, either in our code or in std:: STL code.
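The mallinfo tracing is essentially a per-batch diff of the allocated-bytes counters. A minimal sketch of that kind of instrumentation (glibc's mallinfo; its counters are plain ints, which is fine for per-batch diffs even though they can wrap on very large heaps):

// Sketch of per-batch heap accounting with glibc's mallinfo: record the bytes
// in use before and after a step and print the diff.
#include <malloc.h>   // mallinfo (glibc only)
#include <cstdio>

long HeapBytesInUse() {
  struct mallinfo mi = mallinfo();
  // uordblks = bytes in allocated heap chunks, hblkhd = bytes in mmap'd blocks.
  return static_cast<long>(mi.uordblks) + static_cast<long>(mi.hblkhd);
}

int main() {
  for (int batch = 0; batch < 5; ++batch) {
    long before = HeapBytesInUse();
    // ... run one batch (or one operator) here ...
    long after = HeapBytesInUse();
    std::printf("batch %d diff: %ld bytes\n", batch, after - before);
  }
  return 0;
}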

Conclusion: Need help

Memory fragmentation is probably the reason for the apparent leak. We could use a malloc hook to manage our memory ourselves.
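One way to get a malloc-hook-style view of the C++ allocations without glibc's deprecated __malloc_hook is to override the global operator new/delete. A minimal counting sketch, as an illustration only and not the hook Paddle would actually ship:

// Illustrative malloc-hook-style accounting: override the global operator
// new/delete and track live bytes with glibc's malloc_usable_size, covering
// the C++ allocations made by operators and the STL.
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <malloc.h>   // malloc_usable_size (glibc only)
#include <new>
#include <vector>

static std::atomic<long long> g_live_bytes{0};

void* operator new(std::size_t size) {
  void* p = std::malloc(size);
  if (!p) throw std::bad_alloc();
  g_live_bytes += static_cast<long long>(malloc_usable_size(p));
  return p;
}

void operator delete(void* p) noexcept {
  if (!p) return;
  g_live_bytes -= static_cast<long long>(malloc_usable_size(p));
  std::free(p);
}

int main() {
  std::printf("before:       %lld live bytes\n", g_live_bytes.load());
  auto* v = new std::vector<int>(1024);   // allocations go through the overrides
  std::printf("after new:    %lld live bytes\n", g_live_bytes.load());
  delete v;
  std::printf("after delete: %lld live bytes\n", g_live_bytes.load());
  return 0;
}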

Reference:

Valgrind with the debug build of Python:

==22452== 616,960 (63,240 direct, 553,720 indirect) bytes in 527 blocks are definitely lost in loss record 13,934 of 13,972
==22452==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22452==    by 0x3D549AA9: google::protobuf::internal::GenericTypeHandler<paddle::framework::proto::OpDesc>::NewFromPrototype(paddle::framework::proto::OpDesc const*, google::protobuf::Arena*) [clone .isra.186] (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3D5509C0: paddle::framework::proto::BlockDesc::UnsafeMergeFrom(paddle::framework::proto::BlockDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3D550D86: paddle::framework::proto::ProgramDesc::UnsafeMergeFrom(paddle::framework::proto::ProgramDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3C7A288D: paddle::framework::ProgramDesc::ProgramDesc(paddle::framework::proto::ProgramDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3C6F3F61: void pybind11::cpp_function::initialize<paddle::pybind::pybind11_init()::{lambda(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&)#39}, paddle::framework::ProgramDesc*, paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&, pybind11::name, pybind11::scope, pybind11::sibling>(paddle::pybind::pybind11_init()::{lambda(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&)#39}&&, paddle::framework::ProgramDesc* (*)(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3C713823: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x4BC3F9: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==22452==    by 0x4B9AB5: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==22452==    by 0x4C1E6E: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==22452==    by 0x4B9AB5: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==22452==    by 0x4C16E6: PyEval_EvalFrameEx (in /usr/bin/python2.7)
.....
==22452==    definitely lost: 264,370 bytes in 1,939 blocks
==22452==    indirectly lost: 799,257 bytes in 16,449 blocks
==22452==      possibly lost: 6,991,543 bytes in 49,725 blocks
==22452==    still reachable: 247,616,073 bytes in 535,440 blocks
==22452==                       of which reachable via heuristic:
==22452==                         stdstring          : 7,891 bytes in 122 blocks
==22452==         suppressed: 0 bytes in 0 blocks
==22452== Reachable blocks (those to which a pointer was found) are not shown.
==22452== To see them, rerun with: --leak-check=full --show-leak-kinds=all
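The trace above points at ProgramDesc copies made through the pybind11 binding. A minimal, hypothetical C++ loop that exercises just that constructor could confirm under Valgrind whether the copy itself leaks; the header path and the Proto() accessor are assumptions about the fluid code layout, not verified:

// Hypothetical repro for the call site in the trace above: repeatedly build a
// framework::ProgramDesc from its proto, as the pybind11 binding does.
// Header path and Proto() accessor are assumed, not verified.
#include "paddle/fluid/framework/program_desc.h"

int main() {
  paddle::framework::ProgramDesc base;                     // empty program
  paddle::framework::proto::ProgramDesc proto = *base.Proto();
  for (int i = 0; i < 100000; ++i) {
    paddle::framework::ProgramDesc copy(proto);            // the ctor in the trace
  }
  return 0;
}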

glibc malloc trace:

Diff is the memory consumed by each operator.
[image]
Diff is the memory consumed by the executor on every batch.
[image]

gongweibao (Contributor, Author) commented May 8, 2018

And I can't figure out why #10358 resolved the inference memory leak.

gongweibao changed the title from "Memory Leak" to "Memory Leaks when training the transformer model" May 8, 2018
typhoonzero reopened this May 10, 2018
panyx0718 (Contributor) commented
@gongweibao Can you provide something that can be reproduced? Then other people can help you.
