Alpaka vs CUDA DQM compare modules for pixel tracks objects #43964

borzari · 2024-02-14T16:39:26Z

PR description:

This implements DQM modules for comparing pixel tracks objects reconstructed using Alpaka and CUDA. These modules can be used for any possible comparisons using pixel tracks: CUDA CPU vs CUDA GPU, Alpaka Serial vs Alpaka Device and Alpaka vs CUDA (both Serial/CPU and Device/GPU). A new process modifier chain, alpakaCUDAValidation, was added to turn on both reconstructions and the Alpaka vs CUDA DQM modules when needed.

A new workflow offset, .5041, was added to run DQM with all the possible comparisons using pixel tracks objects. A new customization to run Alpaka vs CUDA comparisons in HLT was also created.

These changes will be back ported to CMSSW_14_0_X for the special validation of the Alpaka vs CUDA reconstruction.

PR validation:

The modifications were tested using wfs 12434.403 (Alpaka validation), 12434.5041 (Alpaka vs CUDA validation), and 12434.503 (CUDA validation). All comparisons follow the expectations.

cmsbuild · 2024-02-14T16:40:05Z

cms-bot internal usage

cmsbuild · 2024-02-14T16:47:37Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43964/38859

This PR adds an extra 1168KB to repository
There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): Add run3 tracking low pu era #33532, Introduce matrix workflows for Alpaka-based Particle Flow #43701, add Era_Run3_2024, Era_Run3_2025, Tier0 reco scenario for 2024, and change default era in PyReleaseValidation #43761
- File DQM/SiPixelHeterogeneous/plugins/SiPixelCompareVertexSoAAlpaka.cc modified in PR(s): Z vertex multi layout #43952
- File RecoLocalTracker/SiPixelRecHits/python/SiPixelRecHits_cfi.py modified in PR(s): New Heterogeneous Memory Pool #37952
- File RecoTracker/PixelVertexFinding/plugins/alpaka/vertexFinder.dev.cc modified in PR(s): Z vertex multi layout #43952

cmsbuild · 2024-02-14T16:48:06Z

A new Pull Request was created by @borzari for master.

It involves the following packages:

Configuration/ProcessModifiers (operations)
Configuration/PyReleaseValidation (upgrade, pdmv)
DQM/SiPixelHeterogeneous (dqm)
DataFormats/TrackSoA (reconstruction, heterogeneous)
DataFormats/TrackingRecHitSoA (reconstruction, heterogeneous)
HLTrigger/Configuration (hlt)
RecoLocalTracker/SiPixelClusterizer (reconstruction)
RecoLocalTracker/SiPixelRecHits (reconstruction)
RecoTracker/Configuration (reconstruction)
RecoTracker/PixelSeeding (reconstruction)
RecoTracker/PixelTrackFitting (reconstruction)
RecoTracker/PixelVertexFinding (reconstruction)
RecoVertex/BeamSpotProducer (alca, reconstruction)

@mmusich, @syuvivida, @AdrianoDee, @saumyaphor4252, @miquork, @nothingface0, @perrotta, @tjavaid, @antoniovilela, @fabiocos, @cmsbuild, @sunilUIET, @mandrenguyen, @davidlange6, @fwyzard, @Martin-Grunewald, @makortel, @rappoccio, @consuegs, @rvenditti, @jfernan2, @antoniovagnerini, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@idebruyn, @mmusich, @GiacomoSguazzoni, @mroguljic, @yuanchao, @JanFSchulte, @francescobrivio, @jandrea, @rsreds, @missirol, @silviodonato, @VinInn, @dgulhan, @dkotlins, @fabiocos, @tvami, @fioriNTU, @felicepantaleo, @VourMa, @Martin-Grunewald, @makortel, @gpetruc, @mtosi, @slomeo, @tocheng, @threus, @ferencek, @rovere this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

mmusich · 2024-02-14T16:57:22Z

-hlt

mmusich · 2024-02-14T16:59:45Z

HLTrigger/Configuration/python/HLT_2023v12_cff.py

-fragment.hltSiPixelRecHitsSoACompareGPUvsCPU = cms.EDProducer( "SiPixelPhase1CompareRecHitsSoA",
-    pixelHitsSrcCPU = cms.InputTag( "hltSiPixelRecHitsSoA@cpu" ),
-    pixelHitsSrcGPU = cms.InputTag( "hltSiPixelRecHitsSoA@cuda" ),
+fragment.hltSiPixelRecHitsSoACompareGPUvsCPU = cms.EDProducer( "SiPixelPhase1CompareRecHits",


you are not supposed to touch these files directly.
Please provide a suitable customization function in https://github.com/cms-sw/cmssw/blob/master/HLTrigger/Configuration/python/customizeHLTforCMSSW.py that modifies these configurations.
STORM will take care of migrating these files centrally.

Thanks Marco: we're on that!
(there was a miss-communication between me and Breno).

Hi @mmusich. Yes, this was my mistake. I was just talking to @AdrianoDee about it and he explained that I should create a customization instead of changing fragments directly. I'll remove the last commit and make the customization

mmusich

@borzari

Martin-Grunewald · 2024-02-14T17:00:45Z

-1
No modification of the HLT menu dumps in HLTrigger/Configuration. AFAIK all ALPKA modifications are to be coded in the ALPAKA customisation routine in HLTrigger/Configuration/python/customizeHLTforAlpka.py .

Martin-Grunewald · 2024-02-14T17:06:25Z

Also, it is unclear why DQM comparisons affect the HLT menu; I think it needs to be discussed how many modules we add and execute for this comparison to the HLT menu. Will this also need to run at P5, or only for offline workflows? If the former, it needs to be assessed for timing. If the latter, it definitely does NOT go in the HLT menus in HLTrigger/Configuration, and has to remain a DQM specific customisation routine called in a cmsDriver DQM step and put in some DQM subsystem.

AdrianoDee · 2024-02-14T17:10:31Z

Also, it is unclear why DQM comparisons affect the HLT menu; I think it needs to be discussed how many modules we add and execute for this comparison to the HLT menu. Will thus also need to run at P5 or only for offline workflows? If the former, it needs to be assessed for timing. If the latter, it definitely does NOT go in the menus and has to remain a DQM specific customisation routine called in a cmsDriver DQM step.

This is because there's a path for CPUvsGPU checks DQM_PixelReconstruction_v that uses the comparison modules. In this case we are not adding anything, just changed the module to be able in general to do CUDAvsCUDA, CUDAvsAlpaka and AlpakaVSAlpaka comparisons.

Martin-Grunewald · 2024-02-14T17:12:56Z

Only one path in the HLT menu is modified?
In any case, if you modify a path run in the HLT menu at P5, you need to file a CMSHLT JIRA request with the details.

mmusich · 2024-04-17T12:17:10Z

@borzari this branch has now conflicts to resolve. Notice that the current HLT menu in release as integrated in #44733 runs alpaka for pixel local reconstruction + tracking + vertexing by default.

borzari · 2024-04-17T13:04:32Z

@borzari this branch has now conflicts to resolve. Notice that the current HLT menu in release as integrated in #44733 runs alpaka for pixel local reconstruction + tracking + vertexing by default.

Hi @mmusich. I was working on fixing the conflicts and matching the AlpakavsCUDA customizations with the Alpaka ones, and just finished some local tests using wf 12434.5041. You can see the modifications here. Before force pushing the changes to the PR branch, I have a question: should I modify HLTriggerOffline/Common/python/customizeHLTforAlpakavsCUDA.py to be as similar as possible to HLTrigger/Configuration/python/customizeHLTforAlpaka.py? I am asking because I see some modules that might not be used for the Alpaka vs CUDA comparisons, but I am not sure if I should keep (and make CUDA counterparts) of those.

mmusich · 2024-04-17T13:44:07Z

I am asking because I see some modules that might not be used for the Alpaka vs CUDA comparisons, but I am not sure if I should keep (and make CUDA counterparts) of those.

I think it would still be beneficial if this could be discussed in a wider forum (e.g. a Joint TRK DPG / POG meeting).

slava77 · 2024-05-08T14:08:54Z

@mmusich
was the issue related to SoA persistency #44700 ?

Do I understand correctly that the issue is related to

'keep *RecHit*_hltSiPixelRecHitsSoASerialSync_*_*',
'keep *RecHit*_hltSiPixelRecHitsSoA_*_*',

Does it mean that the track SoAs can be persisted OK?
Would this reduce the problem to monitor only to siPixelPhase1CompareRecHitsSoAAlpaka?
Is the old style equivalent for hit monitoring in the HLT process also the slowest or is it the others?
Perhaps we could start with a mix-and-match solution

mmusich · 2024-05-08T14:11:41Z

@slava77

was the issue related to SoA persistency #44700 ?

yes.

Does it mean that the track SoAs can be persisted OK?

no, please read carefully #44700 (comment).

This seems to come from these keep statements:

'keep *RecHit*_hltSiPixelRecHitsSoASerialSync_*_*',
'keep *RecHit*_hltSiPixelRecHitsSoA_*_*',

reverting that, one encounters a different error related to the hltPixelTracksSoA*:

----- Begin Fatal Exception 10-Apr-2024 14:01:59 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TStreamerInfo::Build
reco::TrackSoA<pixelTopology::Phase1>::Layout<128,false>, unknown type: cms::alpakatools::OneToManyAssocSequential<unsigned int,32769,163840>* hitIndices_
----- End Fatal Exception -------------------------------------------------

the only product we can safely persists are pixel vertices.

mmusich · 2024-05-08T14:13:31Z

Is the old style equivalent for hit monitoring in the HLT process also the slowest or is it the others?

I don't have measurements, but judging from the multiplicities, I think it's a reasonable assumption.

Perhaps we could start with a mix-and-match solution

do you care to expand?

slava77 · 2024-05-08T14:52:56Z

Perhaps we could start with a mix-and-match solution

do you care to expand?

I meant to drop the DQM monitors/modules from the HLT table for parts that can be persisted and run those in the client.

mmusich · 2024-05-08T14:58:30Z

I meant to drop the DQM monitors/modules from the HLT table for parts that can be persisted and run those in the client.

as mentioned we can do it (only) for the vertices for the time being. If @cms-sw/tracking-pog-l2 prefers to do that, I think we are set to do it, but please do chime in in the ticket https://its.cern.ch/jira/browse/CMSHLT-3147 clearly explaining your preference and motivations.

slava77 · 2024-05-08T15:06:14Z

as mentioned we can do it (only) for the vertices for the time being.

I'm guessing that the resource gain will be small using just vertices (unless there's a record that this code is significantly slow).
I was more motivated to propose a mix when most of what's needed can be moved to the DQM client.

mmusich · 2024-06-11T03:28:18Z

@borzari can you clarify on the status / plans for this PR?

borzari · 2024-06-11T14:23:21Z

@borzari can you clarify on the status / plans for this PR?

@mmusich I am closing this PR, as most of the changes are not needed anymore. I am going to open a new PR just with changes to the Alpaka DQM modules, and some others that were to be included in this PR (I'll list them in the new PR to make it clear), to be included in CMSSW_14_1_X only. I was making some tests with the changes and had a few issues, that I solved now, but it took a while. I am opening the new PR later today and will tag you to check it

cmsbuild added this to the CMSSW_14_1_X milestone Feb 14, 2024

cmsbuild added reconstruction-pending dqm-pending hlt-pending alca-pending operations-pending pending-signatures tests-pending orp-pending pdmv-pending upgrade-pending code-checks-pending heterogeneous-pending tracking trk labels Feb 14, 2024

cmsbuild added code-checks-approved and removed code-checks-pending labels Feb 14, 2024

cmsbuild added hlt-rejected and removed hlt-pending labels Feb 14, 2024

mmusich reviewed Feb 14, 2024

View reviewed changes

mmusich suggested changes Feb 14, 2024

View reviewed changes

borzari force-pushed the AlpakavsCUDA branch from 3ea8509 to 4a6bd2e Compare February 14, 2024 17:06

cmsbuild mentioned this pull request May 6, 2024

Fixes for Alpaka Phase2 Pixel Reco #44874

Merged

cmsbuild mentioned this pull request May 13, 2024

Introducing DIGInoHLT + HLTOnly Steps #44964

Merged

This was referenced May 28, 2024

Revert "Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT" #45080

Merged

CMSSW Integration of LST #45117

Open

This was referenced Jun 5, 2024

L2TauTagNNProducerAlpaka: do not call TF inference with empty grid #45147

Merged

Include HIon type traits for Alpaka pixel tracking #45151

Merged

borzari closed this Jun 11, 2024

borzari mentioned this pull request Jun 12, 2024

Including more general DQM compare modules for pixel tracks objects #45206

Merged

borzari mentioned this pull request Jun 25, 2024

[14.0.X] Adding general DQM modules for pixel tracks #45302

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alpaka vs CUDA DQM compare modules for pixel tracks objects #43964

Alpaka vs CUDA DQM compare modules for pixel tracks objects #43964

borzari commented Feb 14, 2024 •

edited

Loading

cmsbuild commented Feb 14, 2024 •

edited

Loading

cmsbuild commented Feb 14, 2024

cmsbuild commented Feb 14, 2024

mmusich commented Feb 14, 2024

mmusich Feb 14, 2024

AdrianoDee Feb 14, 2024

borzari Feb 14, 2024

mmusich left a comment

Martin-Grunewald commented Feb 14, 2024

Martin-Grunewald commented Feb 14, 2024 •

edited

Loading

AdrianoDee commented Feb 14, 2024

Martin-Grunewald commented Feb 14, 2024

mmusich commented Apr 17, 2024

borzari commented Apr 17, 2024

mmusich commented Apr 17, 2024

slava77 commented May 8, 2024

mmusich commented May 8, 2024

mmusich commented May 8, 2024

slava77 commented May 8, 2024

mmusich commented May 8, 2024

slava77 commented May 8, 2024

mmusich commented Jun 11, 2024

borzari commented Jun 11, 2024

Alpaka vs CUDA DQM compare modules for pixel tracks objects #43964

Alpaka vs CUDA DQM compare modules for pixel tracks objects #43964

Conversation

borzari commented Feb 14, 2024 • edited Loading

PR description:

PR validation:

cmsbuild commented Feb 14, 2024 • edited Loading

cmsbuild commented Feb 14, 2024

cmsbuild commented Feb 14, 2024

mmusich commented Feb 14, 2024

mmusich Feb 14, 2024

Choose a reason for hiding this comment

AdrianoDee Feb 14, 2024

Choose a reason for hiding this comment

borzari Feb 14, 2024

Choose a reason for hiding this comment

mmusich left a comment

Choose a reason for hiding this comment

Martin-Grunewald commented Feb 14, 2024

Martin-Grunewald commented Feb 14, 2024 • edited Loading

AdrianoDee commented Feb 14, 2024

Martin-Grunewald commented Feb 14, 2024

mmusich commented Apr 17, 2024

borzari commented Apr 17, 2024

mmusich commented Apr 17, 2024

slava77 commented May 8, 2024

mmusich commented May 8, 2024

mmusich commented May 8, 2024

slava77 commented May 8, 2024

mmusich commented May 8, 2024

slava77 commented May 8, 2024

mmusich commented Jun 11, 2024

borzari commented Jun 11, 2024

borzari commented Feb 14, 2024 •

edited

Loading

cmsbuild commented Feb 14, 2024 •

edited

Loading

Martin-Grunewald commented Feb 14, 2024 •

edited

Loading