CMSSW Fatal System Signal During Exit with Alpaka Caching Allocator #312

Closed
GNiendorf opened this issue Jul 28, 2023 · 12 comments · Fixed by #348
Labels: bug (Something isn't working)

Comments

@GNiendorf
Member

GNiendorf commented Jul 28, 2023

The error is given as:

Fatal system signal has occurred during exit
./alpaka_setup.sh: line 59: 2562239 Aborted (core dumped) cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

Edit: Here is the full backtrace.

Dan's backtrace, with more information: Here

This issue seems related.

Steps to reproduce (on cgpu1, taken from @VourMa's README instructions). If you save the commands below in a file, for example alpaka_setup.sh, and then run chmod +x alpaka_setup.sh followed by ./alpaka_setup.sh, the script should run automatically and produce the error at the very end. Make sure your GitHub username is set, though, or it will fail. This setup uses the 100-event step2 input file on cgpu1 that was made by Manos:

# Clone the TrackLooper repo
git clone git@github.com:SegmentLinking/TrackLooper.git
cd TrackLooper/

# Source the setup script to configure the environment
source setup.sh

# Make the TrackLooper using the "-mc" option to turn the caching allocator on
sdl_make_tracklooper -mc

cd ..

# Create the working folder and move into it
mkdir workingFolder
cd workingFolder

# Set up CMSSW
cmsrel CMSSW_13_0_0_pre4
cd CMSSW_13_0_0_pre4/src
cmsenv

# Initialize git and add remote
git cms-init
git remote add SegLink git@github.com:SegmentLinking/cmssw.git

# Fetch and checkout specific branch
git fetch SegLink CMSSW_13_0_0_pre4_LST_X
git cms-addpkg RecoTracker Configuration
git checkout CMSSW_13_0_0_pre4_LST_X

# Create lst.xml
cat <<EOF >lst.xml
<tool name="lst" version="1.0">
  <client>
    <environment name="LSTBASE" default="$PWD/../../../TrackLooper"/>
    <environment name="LIBDIR" default="\$LSTBASE/SDL"/>
    <environment name="INCLUDE" default="\$LSTBASE"/>
  </client>
  <runtime name="LST_BASE" value="\$LSTBASE"/>
  <lib name="sdl"/>
</tool>
EOF

# Setup scram and env
scram setup lst.xml
cmsenv

# Modify the LSTProducer.cc file
sed -i 's/lst_.run(ctx.queue().getNativeHandle(),/lst_.run(ctx.queue(),/' ./RecoTracker/LST/plugins/alpaka/LSTProducer.cc

# Check dependencies
git cms-checkdeps -a -A

# Build
scram b -j 12

# Generate the step3 file
cmsDriver.py step3  -s RAW2DIGI,RECO:reconstruction_trackingOnly,VALIDATION:@trackingOnlyValidation,DQM:@trackingOnlyDQM --conditions auto:phase2_realistic_T21 --datatier GEN-SIM-RECO,DQMIO -n 10 --eventcontent RECOSIM,DQM --geometry Extended2026D88 --era Phase2C17I13M9 --pileup AVE_200_BX_25ns --pileup_input file:file.root --procModifiers gpu,trackingLST,trackingIters01 --no_exec

# Edit the configuration file
sed -i "28i process.load('Configuration.StandardSequences.Accelerators_cff')\nprocess.AlpakaServiceCudaAsync = cms.Service('AlpakaServiceCudaAsync')\nprocess.AlpakaServiceSerialSync = cms.Service('AlpakaServiceSerialSync')" step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

sed -i "/process.mix.input.fileNames =/c \
process.mix.input.fileNames = cms.untracked.vstring(['file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/066fc95d-1cef-4469-9e08-3913973cd4ce.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/07928a25-231b-450d-9d17-e20e751323a1.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/26bd8fb0-575e-4201-b657-94cdcb633045.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/4206a9c5-44c2-45a5-aab2-1a8a6043a08a.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/55a372bf-a234-4111-8ce0-ead6157a1810.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/59ad346c-f405-4288-96d7-795f81c43fe8.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/7280f5ec-b71d-4579-a730-7ce2de0ff906.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/b93adc85-715f-477a-afc9-65f3241933ee.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/c7a0aa46-f55c-4b01-977f-34a397b71fba.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/e77fa467-97cb-4943-884f-6965b4eb0390.root'])" step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

sed -i "s|fileNames = cms.untracked.vstring('file:step3_DIGI2RAW.root')|fileNames = cms.untracked.vstring('file:/ceph/cms/store/user/evourlio/LST/step2_21034.1_100Events.root')|" step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

# Run the modified step3
cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py
@GNiendorf
Member Author

Paging @dan131riley. If anything comes to mind, please chime in!

@GNiendorf added the bug label on Jul 28, 2023
@GNiendorf
Member Author

Tagging @fwyzard: we have two backtraces here (one from me and one from Dan, both linked above).

@fwyzard

fwyzard commented Aug 8, 2023

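I can try to reproduce and have a look, but first a couple of questions:

- the recipe above mentions CMSSW_13_0_0_pre4; is the issue still present after merging the workaround in cms-sw/cmssw#42427 ?
- do you have a recipe for a more recent release of CMSSW - like 13.0.10 or 13.2.0 ?
- can I use the recipe on e.g. lxplus-gpu, or one of the online machines ?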

@VourMa
Contributor

VourMa commented Aug 8, 2023

Thanks, Andrea! Some replies to your questions:

> the recipe above mentions CMSSW_13_0_0_pre4; is the issue still present after merging the workaround in cms-sw/cmssw#42427 ?

The workaround was propagated to our own copy of the caching allocator: 6ea9524.
It is included in PR #314, which may not be merged yet, but Gavin tested it locally and the error seems to persist. Note, however, that the test was done in CMSSW_13_0_0_pre4. @GNiendorf can comment if I got anything wrong.

> do you have a recipe for a more recent release of CMSSW - like 13.0.10 or 13.2.0 ?

I have been working on getting the setup to work in CMSSW_13_2_0_pre2. The version I prepared should be functional in any CMSSW version with the "new accelerator framework". I can tidy it up tomorrow and send you a few details.

> can I use the recipe on e.g. lxplus-gpu, or one of the online machines ?

I think it should work anywhere as long as cvmfs is available.

@dan131riley

@fwyzard so far as I know, the current LST Alpaka integration is not using the Alpaka caching allocator service. If you look at my stack traces, both calls to the caching allocator destructor are in the exit handlers, which is going to be after the CUDA service was unloaded. It may not be worth your time looking at this until the LST CMSSW integration is using the allocator service.
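
The ordering concern in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRuntime`, `ExitTimeAllocator` and `shutdownOutcome` are hypothetical names invented for illustration, and none of them is part of CMSSW, Alpaka or CUDA. The sketch just records whether an allocator destructor runs before or after the runtime it depends on has been marked as unloaded.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an accelerator runtime owned by a framework service.
struct ToyRuntime {
  bool loaded = true;
};

// Hypothetical allocator whose destructor hands cached memory back to the runtime.
class ExitTimeAllocator {
public:
  ExitTimeAllocator(std::shared_ptr<ToyRuntime> runtime, std::string& outcome)
      : runtime_(std::move(runtime)), outcome_(outcome) {}

  ~ExitTimeAllocator() {
    // A real allocator would call into the device runtime here. The toy only
    // records whether the runtime was still loaded when the destructor ran.
    outcome_ = runtime_->loaded ? "released cleanly" : "released after runtime unload";
  }

private:
  std::shared_ptr<ToyRuntime> runtime_;
  std::string& outcome_;
};

// Simulates the two possible shutdown orders and reports what the
// allocator destructor observed.
std::string shutdownOutcome(bool destroyAllocatorFirst) {
  std::string outcome;
  auto runtime = std::make_shared<ToyRuntime>();
  auto allocator = std::make_unique<ExitTimeAllocator>(runtime, outcome);
  if (destroyAllocatorFirst) {
    allocator.reset();        // allocator cleaned up while the runtime is loaded
    runtime->loaded = false;  // service unloaded afterwards
  } else {
    runtime->loaded = false;  // service unloaded first
    allocator.reset();        // destructor runs late, as in an exit handler
  }
  return outcome;
}
```

In this toy, `shutdownOutcome(false)` corresponds to the situation described in the stack traces, where the destructor runs from an exit handler after the service is gone. In a real job that late call would reach an unloaded runtime instead of setting a string.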

@VourMa
Contributor

VourMa commented Aug 9, 2023

> I have been working on getting the setup to work in CMSSW_13_2_0_pre2. The version I prepared should be functional in any CMSSW version with the "new accelerator framework". I can tidy it up tomorrow and send you a few details.

I went ahead and updated to more recent releases.
If one chooses to work in CMSSW_13_2_0_pre2, the README can be followed to the letter after applying the substitution CMSSW_13_0_0_pre4(_LST_X) -> CMSSW_13_2_0_pre2(_LST_X).
If one chooses to work in any other release that includes cms-sw/cmssw#41341, then cherry-picking commits SegmentLinking/cmssw@05c3d73 and SegmentLinking/cmssw@a0aae36 from SegmentLinking/cmssw/CMSSW_13_2_0_pre2_LST_X, instead of pulling the specific _LST_X branch, should work as well.

@GNiendorf
Member Author

GNiendorf commented Aug 9, 2023

> @fwyzard so far as I know, the current LST Alpaka integration is not using the Alpaka caching allocator service. If you look at my stack traces, both calls to the caching allocator destructor are in the exit handlers, which is going to be after the CUDA service was unloaded. It may not be worth your time looking at this until the LST CMSSW integration is using the allocator service.

Right now we are using a copied version of the caching allocator, which can also be used with our standalone code. @fwyzard, your fix is applied on the alpaka_upgrade branch, which is still waiting to be merged in #314. The error only occurs when the caching allocator is enabled and the code runs within CMSSW, and it persists on this branch even with your fix for the related issue applied.

@fwyzard

fwyzard commented Aug 9, 2023

@GNiendorf I'm confused: does the crash happen when the application is run stand-alone (with the copy of the caching allocator with the fix) or does it happen within CMSSW ?

@GNiendorf
Member Author

GNiendorf commented Aug 9, 2023

> @GNiendorf I'm confused: does the crash happen when the application is run stand-alone (with the copy of the caching allocator with the fix) or does it happen within CMSSW ?

It happens only within CMSSW, but when running within CMSSW we are still using our copied version of the CMSSW caching allocator, as Dan mentioned above. See here for our copied version of the alpaka interface: https://github.com/SegmentLinking/TrackLooper/tree/alpaka_upgrade/code/alpaka_interface

@fwyzard

fwyzard commented Aug 9, 2023

So there are two identical but independent instances of the caching allocator?
That could very well be the cause of the problem.
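
One possible reading of why two independent instances could matter is sketched below. It assumes, purely for illustration, that each instance keeps its own private cache, so cleaning up one instance does nothing for the other. `ToyCache` and `leftoverInCopy` are hypothetical names and do not correspond to the CMSSW or TrackLooper classes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cache that only counts the blocks it is holding on to.
class ToyCache {
public:
  // Blocks handed back by the user are kept in this instance's cache.
  void release(std::size_t blocks) { cached_ += blocks; }

  // Drops everything this instance holds. Other instances are unaffected.
  void freeAllCached() { cached_ = 0; }

  std::size_t cached() const { return cached_; }

private:
  std::size_t cached_ = 0;
};

// Models a framework-owned instance and a separately compiled copy.
std::size_t leftoverInCopy() {
  ToyCache frameworkInstance;
  ToyCache copiedInstance;
  frameworkInstance.release(3);
  copiedInstance.release(2);
  frameworkInstance.freeAllCached();  // cleanup performed on one instance only
  return copiedInstance.cached();     // the copy still holds its blocks
}
```

Under that assumption, any cleanup applied to the framework's instance leaves the copied instance holding memory until its own destructor runs.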

@GNiendorf
Member Author

@fwyzard Sorry for the late reply, we spent some time resolving a few CPU/GPU backend differences before coming back to this issue.

Is there any documentation on how to use the CMSSW Alpaka caching allocator service correctly? Is it as simple as changing the include statement to point to the relevant CMSSW path, as I did here? Or is using the service more complicated?

@fwyzard

fwyzard commented Sep 23, 2023

hi Gavin,
the issue is that you must ensure that all memory allocated by the caching allocator has been freed before the alpaka objects are destroyed at the end of the job.

If you have instances of a caching allocator in your code, you could try calling freeAllCached() after the event processing is complete, and before the destruction of the alpaka devices (which should happen sometime during the destruction of the services, if I remember correctly).
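
The suggestion above can be sketched with a toy allocator. This is a minimal illustration and not the CMSSW implementation: `ToyCachingAllocator`, `cachedBlocks` and `cachedBlocksAfterJob` are hypothetical, host memory from `std::malloc` stands in for device memory, and only the method name `freeAllCached()` is taken from the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical caching allocator: freed blocks are kept for reuse
// instead of being returned to the system immediately.
class ToyCachingAllocator {
public:
  ~ToyCachingAllocator() { freeAllCached(); }

  void* allocate(std::size_t bytes) {
    auto it = cache_.find(bytes);
    if (it != cache_.end() && !it->second.empty()) {
      void* block = it->second.back();  // reuse a cached block of the same size
      it->second.pop_back();
      return block;
    }
    return std::malloc(bytes);
  }

  // Returning a block puts it in the cache rather than freeing it.
  void deallocate(void* block, std::size_t bytes) { cache_[bytes].push_back(block); }

  // Hands every cached block back to the system.
  void freeAllCached() {
    for (auto& entry : cache_) {
      for (void* block : entry.second) std::free(block);
      entry.second.clear();
    }
  }

  std::size_t cachedBlocks() const {
    std::size_t n = 0;
    for (const auto& entry : cache_) n += entry.second.size();
    return n;
  }

private:
  std::map<std::size_t, std::vector<void*>> cache_;
};

// Runs a tiny "job" and reports how many blocks are still cached at the
// point where the devices would be torn down.
std::size_t cachedBlocksAfterJob(bool callFreeAllCached) {
  ToyCachingAllocator allocator;
  void* a = allocator.allocate(64);
  void* b = allocator.allocate(128);
  allocator.deallocate(a, 64);
  allocator.deallocate(b, 128);
  if (callFreeAllCached) allocator.freeAllCached();  // explicit cleanup after event processing
  return allocator.cachedBlocks();
}
```

Without the explicit call, the two blocks stay cached until the destructor runs, which in the real job may be too late. With the call, nothing is left for the destructor to release.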


4 participants