test_analyzer_* random crashes at compare_mkldnn
#15032
Is this the same problem as #14174?
@jczaja Could you help take a look at this? Thanks!
From the log it seems that mkldnn hangs. Is it an OpenMP problem?
@luotao1 We will look into that and get back to you.
@luotao1 I was not able to reproduce the problem.
We have
Yes, we run the ctest tests in parallel mode; see Paddle/paddle/scripts/paddle_build.sh, lines 415 to 418 in bf518ec.
The parallel level is
@jczaja http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=45100&_focus=23067#_state=23743
@luotao1 From the log you indicated, I found the following:
@luotao1 Please ignore my comment above. I just found there are two builds in this log; the second one should be the valid one, with all features including MKL turned on.
@luotao1 @jczaja One interesting point is that "dam" and "small dam" share the same test app, so they have exactly the same test cases, i.e. "dam" also has the compare_mkldnn test case. But according to the log, there is no issue running all the test cases of "dam" (including "compare_mkldnn", of course).
Another hint is that "small dam" is tested immediately after "dam". This means the two models are tested concurrently due to ctest's parallel mode, which may make over-subscription of the OMP cores even worse.
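The over-subscription arithmetic behind this suspicion can be sketched as follows. This is a hypothetical illustration, not Paddle's code: the function names and the numbers are assumptions, and the thread cap only takes effect if set before the OpenMP runtime initializes.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical sketch: when ctest runs two test binaries concurrently and
// each defaults to one OpenMP thread per core, a 14-core socket ends up
// with 28 runnable threads. Capping OMP_NUM_THREADS per process keeps the
// total at the core count.
int ThreadsPerJob(int cores, int concurrent_jobs) {
  return cores / concurrent_jobs;
}

void CapOmpThreads(int cores, int concurrent_jobs) {
  // POSIX setenv; must run before the OpenMP runtime starts to take effect.
  setenv("OMP_NUM_THREADS",
         std::to_string(ThreadsPerJob(cores, concurrent_jobs)).c_str(),
         /*overwrite=*/1);
}
```

Serializing or reordering the tests, as discussed in this thread, is the other way to avoid the same over-subscription.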
@jianhang-liu If
@luotao1 "I could try running test_analyzer_small_dam, test_analyzer_dam, test_analyzer_resnet50 and test_analyzer_mobilenet_depthwise_conv in sequential order." That somewhat confirms our suspicion: the hang in "small dam" is possibly due to over-subscription of the OMP cores, since it runs concurrently with "dam". I think changing the order of those tests could be an acceptable workaround for now. @jczaja What do you think?
@luotao1 Yes, if the problem is thread over-subscription, then sequential execution of ctest should help.
Why do I say that sequential execution should help? Because I suspect that the virtual machine on the 5117 is causing thread over-subscription. I'm not an expert on VMs, but
@jczaja Thanks very much for such a detailed explanation!
#15196 makes the related ctests run in sequential mode.
It's difficult to do this experiment since nightly stress testing runs on the develop branch.
I'm not an expert on VMs either. We will observe the nightly stress testing for several days after #15196 is merged.
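For reference, the standard CTest mechanism for keeping specific tests out of the parallel pool is the RUN_SERIAL test property. The fragment below is only a sketch of that approach, not the actual change made in #15196:

```cmake
# Hypothetical fragment: prevent the DAM analyzer tests from running
# concurrently with any other ctest job.
set_tests_properties(test_analyzer_dam test_analyzer_small_dam
                     PROPERTIES RUN_SERIAL ON)
```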
@jczaja @jianhang-liu @yihuaxu
@jczaja The detail machine configuration: http://ci.paddlepaddle.org/viewLog.html?buildId=48242&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&state=65&_focus=75#_state=65,34
@luotao1 Looking at the specification of the Xeon Gold 5117, there are 14 cores in a single socket, yet the specification you sent (from the logs) shows 16 sockets with 1 core in each socket. So please tell me:
@jczaja @jianhang-liu
We use OpenStack.
One or two virtual machines run on the 5117.
Only one Docker container runs at a time.
@luotao1
I checked that all the failure logs are from the 5117 only. We have
@luotao1 The log you mentioned shows that test_analyzer_small_dam also fails on the other SKX machines; it is just not a timeout but a computational problem (a diff in the results). Perhaps it is a different outcome of the same problem.
Yes, it is the MKL diff problem, the same as #15116 (comment).
@luotao1 Since a couple of people are looking at this issue, I'm sharing my current status; hopefully it will be helpful. I reproduced the problem by running test_analyzer_dam_small in a loop (up to a hundred times): we get either a crash (segfault) or a hang (timeout when running under ctest). CI on the 5117 builds Paddle WITHOUT ON_INFER=ON, and in that situation test_analyzer_small_dam can hang. If ON_INFER=ON is specified, then test_analyzer_small_dam randomly ends in a segmentation fault.
Hang: Paddle/paddle/fluid/framework/operator.cc, line 1064 in 7e651a3
Crash / workaround: Paddle/paddle/fluid/framework/operator.cc, line 1057 in 7e651a3
So currently we can see that there is a problem with TransferScope, which randomly manifests as a hang or a crash depending on whether ON_INFER is set.
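The failure mode can be sketched with a toy version of the caching pattern. The types and names below are hypothetical, not Paddle's actual TryCreateTransferScope: the point is only that a thread_local cache hands back a previously created scope on a key hit, so if the hierarchy that entry belonged to has since been destroyed, the caller receives a stale pointer, which can surface randomly as a hang or a segfault depending on what the freed memory now contains.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Toy sketch of a thread_local transfer-scope cache (hypothetical types).
// A key hit returns the scope created by an earlier run even if that run's
// scope hierarchy is long gone; the caller then works on a stale pointer.
struct Scope {
  int created_in_run;
};

thread_local std::unordered_map<std::size_t, Scope*> transfer_scope_cache;

Scope* TryCreateTransferScope(std::size_t key, int run_id) {
  auto it = transfer_scope_cache.find(key);
  if (it != transfer_scope_cache.end()) {
    return it->second;  // cache hit: may belong to a destroyed hierarchy
  }
  Scope* scope = new Scope{run_id};
  transfer_scope_cache[key] = scope;
  return scope;
}
```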
@jczaja Great! Actually, just this afternoon we found that building withOUT ON_INFER=ON is the key to reproducing it. We can now easily reproduce both types of CI error (timeout, and compare failure due to a diff) on our local 6151 server. For the timeout (hang) issue, we located the error at almost the same place as you did (though not in as much detail). We wonder whether it's caused by the SCOPE_XXX_LOCK definition.
@luotao1 @jczaja We confirmed the same root cause as Jacek: the current code in TryCreateTransferScope (which uses thread_local as a cache to avoid creating transfer scopes) can fail randomly, which causes this CI failure (a hang as timeout, a segfault as crash). By simply commenting it out (i.e. not using the cache and always creating "new_scope"), we don't run into any errors anymore. We tried enabling and disabling this cache several times and confirmed that no issue occurs when the cache is disabled.
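The workaround described above (skip the cache, always create "new_scope") can be sketched as follows. The types are hypothetical and the real Paddle Scope API differs; the trade-off shown is a small allocation per call in exchange for never handing out a stale cached pointer.

```cpp
#include <cassert>
#include <memory>

// Sketch of the cache-bypass workaround (hypothetical types): every call
// builds a fresh child scope instead of consulting a thread_local cache.
struct Scope {
  std::unique_ptr<Scope> NewScope() const {
    return std::unique_ptr<Scope>(new Scope());
  }
};

std::unique_ptr<Scope> CreateTransferScope(const Scope& parent) {
  return parent.NewScope();  // no caching: always a new scope
}
```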
@jianhang-liu Could you create a simple PR to show which lines are commented out? @Superjomn Could you help look at the cache in
Paddle/paddle/fluid/framework/transfer_scope_cache.cc, lines 30 to 46 in 6597ccb
@Superjomn Is line 45 unused?
The issue in #15032 (comment) is hot-fixed in #15450.
http://ci.paddlepaddle.org/viewLog.html?buildId=126022&buildTypeId=Paddle_PrCiNight&tab=buildLog&_focus=20439
hang on
Do we have any conclusion on why it hangs in TryCreateTransferScope()? Are there any race conditions, or a multi-instance case?
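On the race-condition question: if several predictor instances (or threads) were to share one scope cache, concurrent find/insert on the map is a data race unless it is guarded. A mutex-guarded cache is the textbook remedy, sketched below with hypothetical names; it is not necessarily the fix Paddle adopted.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch: a scope cache shared across predictor instances
// must be locked; unguarded concurrent find/insert on an unordered_map is
// undefined behavior and a plausible source of random hangs or crashes.
std::mutex cache_mu;
std::unordered_map<std::size_t, int> shared_cache;

int GetOrCreate(std::size_t key) {
  std::lock_guard<std::mutex> lock(cache_mu);
  auto it = shared_cache.find(key);
  if (it != shared_cache.end()) return it->second;
  int id = static_cast<int>(shared_cache.size());
  shared_cache[key] = id;  // insert while holding the lock
  return id;
}
```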
English description: discussed with @LeoZhao-Intel and @jianhang-liu, we have some common views:
TODO:
Chinese description (translated): briefly, the cause is:
compare_mkldnn failed in three nightly CI runs on the same machine (a 5117):
http://ci.paddlepaddle.org/viewLog.html?buildId=40622&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=21635
http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=40460&_focus=22113
http://ci.paddlepaddle.org/viewLog.html?buildId=40324&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=21764
compare_mkldnn failed in nightly CI:
http://ci.paddlepaddle.org/viewLog.html?buildId=44600&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=22935
compare_mkldnn failed in nightly CI:
http://ci.paddlepaddle.org/viewLog.html?buildId=44596&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=23004