
ProductNotFound issue in T0 HLTMonitor processing jobs starting Run 378985 #44643

saumyaphor4252 opened this issue Apr 6, 2024 · 62 comments


@saumyaphor4252

Reporting the T0 processing error in jobs for the HLTMonitor stream for Run 378985, detailed in
https://cms-talk.web.cern.ch/t/express-paused-jobs-run2024b-productnotfound-error/38544

CRITICAL:root:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc12', 'scramv1', 'CMSSW', 'CMSSW_14_0_4', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '
']}
CMSSW Return code: 8006

CRITICAL:root:Error message: An exception of category 'ProductNotFound' occurred while
[0] Processing Event run: 378985 lumi: 224 event: 220047314 stream: 6
[1] Running path 'dqmoffline_step'
[2] Calling method for module TrackRefitter/'hltrefittedForPixelDQM'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:3256'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "TryToContinue = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

The tarball for the PSet is available at

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024B/ProductNotFound/vocms014.cern.ch-27238-3-log.tar.gz

Somewhat similar symptoms are also seen for Run 378993, with the error

type: "Fatal Exception"
details: "An exception of category 'ProductNotFound' occurred while 
   [0] Processing Event run: 378993 lumi: 188 event: 223221502 stream: 1  
   [1] Running path 'dqmoffline_step' 
   [2] Calling method for module TrackingMonitor/'gsfTracksMonitoringHLT' 
Exception Message: 
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:3118' can not be satisfied because the product cannot be found. 
Probably the branch containing the product is not stored in the input file. 
  Additional Info: 
        [a] If you wish to continue processing events after a ProductNotFound exception, add "TryToContinue = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration."
exitCode: 8006
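
For reference, the workaround suggested in the exception's "Additional Info" corresponds to a configuration fragment like the sketch below (skipping events is generally not what T0 wants here, since it hides the underlying problem):

import FWCore.ParameterSet.Config as cms

# ... existing process definition ...

# Continue processing subsequent events after a ProductNotFound exception,
# as the exception message above suggests
process.options.TryToContinue = cms.untracked.vstring('ProductNotFound')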

FYI @cms-sw/dqm-l2 May also be relevant to the online hlt DQM client crashes starting Run 378981 reported at the DRM today:
https://cmsweb.cern.ch/dqm/dqm-square/api?what=get_logs&id=dqm-source-state-run378981-hostdqmfu-c2b03-45-01-pid3057526&db=production

@cmsbuild

cmsbuild commented Apr 6, 2024

cms-bot internal usage

@cmsbuild

cmsbuild commented Apr 6, 2024

A new Issue was created by @saumyaphor4252.

@makortel, @smuzaffar, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@perrotta

perrotta commented Apr 6, 2024

assign dqm

@cmsbuild

cmsbuild commented Apr 6, 2024

New categories assigned: dqm

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich

mmusich commented Apr 6, 2024

just for the record, the PoolOutputModule of the HLTMonitor stream is configured such that it sends the right event content to be consumed at Tier-0:

      'keep *_hltEgammaGsfTracks_*_*',
      'keep *_hltMergedTracksForBTag_*_*',
      'keep *_hltMergedTracks_*_*',
      'keep *_hltPixelTracks_*_*',

thus the recoTrackExtras_*__HLT branches should be present in the input streamer files.
I suspect something along the lines of #39064.

@Pranjal033

We copied a few of the affected LSs from run 378981 (LS 455-465) into the playback region and reproduced this crash at playback; here is its log file.

The information about how to reproduce the crash on lxplus can be found here, and the streamers have been copied to this path:
/eos/cms/store/group/comm_dqm/Collisions24_tempStreamers/

@syuvivida

assign core

@cmsbuild

cmsbuild commented Apr 6, 2024

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@saumyaphor4252

Dear all, the issue is now becoming a showstopper for T0 operations, with hundreds of paused jobs in the HLTMonitor stream, and Prompt Reco processing is now stopped starting from Run 378981: https://cms-talk.web.cern.ch/t/promptreco-is-paused-for-run2024b/38673.

Can the experts please look into it with high priority?

Thanks and regards,
Saumya (incoming ORM)

@makortel

makortel commented Apr 8, 2024

Can the experts please look into it with high priority?

Starting now.

@sextonkennedy

Thanks Matti. Note that at the Joint Ops meeting DQM reported that the TrackExtra collection is really not in the file output by the HLT. So the process that really needs to be debugged is the HLT executable, to determine why, in a small fraction of the events, the HLT does not manage to write out the TrackExtras.

@makortel

makortel commented Apr 8, 2024

The tarball for the PSet is available at

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024B/ProductNotFound/vocms014.cern.ch-27238-3-log.tar.gz

Looking at this reproducer, the hltrefittedForPixelDQM module that throws the exception consumes hltMergedTracks. According to EventContentAnalyzer the std::vector<TrackExtra> from hltMergedTracks is there, but with ProductID 1:3264, whereas the exception complains about the std::vector<TrackExtra> with ProductID 1:3256 being missing.
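
For reference, a minimal configuration to produce such a dump from one of the streamer files could look like the sketch below (the input filename is a placeholder):

import FWCore.ParameterSet.Config as cms

process = cms.Process("DUMP")

# Streamer (.dat) input; the filename is hypothetical
process.source = cms.Source("NewEventStreamFileReader",
    fileNames = cms.untracked.vstring('file:run378985_ls0224_streamHLTMonitor.dat')
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(1))

# Lists the products stored in each event
process.dump = cms.EDAnalyzer("EventContentAnalyzer")
process.p = cms.Path(process.dump)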

@makortel

makortel commented Apr 8, 2024

Instrumenting TrackProducerAlgorithm<T>::runWithTrack() for the case that throws the exception, I see that in the event 378985:224:220047314 the module's input (hltMergedTracks) has a total of 326 tracks. The first 316 tracks have algo hltIter0 and the last 10 have undefAlgorithm. For all tracks the TrackExtraRef has a ProductID of 1:3256.

In the previous event in the file (378985:224:219871457) all the TrackExtraRefs have the (correct) ProductID 1:3264.
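
(For illustration, the same kind of check can also be done outside the framework with FWLite, assuming the events have first been copied to an EDM ROOT file; the filename below is hypothetical, and the instrumentation above was done directly in the C++ code instead.)

# FWLite sketch: inspect the ProductID stored in each track's TrackExtraRef
from DataFormats.FWLite import Events, Handle

events = Events("converted_events.root")  # hypothetical EDM copy of the data
tracks = Handle("std::vector<reco::Track>")

for ev in events:
    ev.getByLabel("hltMergedTracks", tracks)
    for trk in tracks.product():
        pid = trk.extra().id()  # edm::ProductID referenced by the TrackExtraRef
        print(trk.algoName(), "%d:%d" % (pid.processIndex(), pid.productIndex()))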

@makortel

makortel commented Apr 8, 2024

@cms-sw/hlt-l2 Is it possible to find out after the fact whether the events 378985:224:219871457 and 378985:224:220047314 were processed by the same cmsRun process or by two different processes (either on the same or on different nodes)?

@slava77

slava77 commented Apr 8, 2024

type tracking

@dan131riley

Maybe the EventAuxiliary processGUID?
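
(The processGUID is a per-process unique identifier stored in each event's EventAuxiliary. As a sketch, it could be read back with FWLite from an EDM copy of the data; the filename below is hypothetical.)

# FWLite sketch: print each event's processGUID to see whether two events
# were written by the same cmsRun process
from DataFormats.FWLite import Events

events = Events("converted_events.root")  # hypothetical EDM copy of the data
for ev in events:
    aux = ev.eventAuxiliary()
    print(aux.run(), aux.luminosityBlock(), aux.event(), aux.processGUID())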

@makortel

makortel commented Apr 8, 2024

A recipe to rerun the HLT menu used to process these (kind of) events would also help in the investigation, even if it would not reproduce this problem exactly.

@makortel

makortel commented Apr 8, 2024

Did anything change e.g. in HLT menu or how HLT is being run in DAQ in Run 378985?

@mmusich

mmusich commented Apr 8, 2024

A recipe to rerun the HLT menu used to process these (kind of) events would also help in the investigation, even if it would not reproduce this problem exactly.

Here's a possible recipe (not guaranteed to reproduce):

#!/bin/bash -ex

# CMSSW_14_0_4

hltGetConfiguration run:378985  \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --output full \
  --max-events -1 \
  --input /store/data/Run2024B/EphemeralZeroBias0/RAW/v1/000/378/985/00000/f5f542ca-b93e-46e9-a136-7e9f1740218a.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.hltOutputFull.outputCommands = [
    'keep *',
    'drop *_hltSiPixelDigisLegacy_*_*',
    'drop *_hltSiPixelClustersLegacy_*_*',
    'drop *_hltSiPixelRecHitsFromLegacy_*_*',
    'drop *_hltEcalDigisLegacy_*_*',
    'drop *_hltEcalUncalibRecHitLegacy_*_*',
    'drop *_hltHbherecoLegacy_*_*',
    ]
@EOF

cmsRun hlt.py &> hlt.log

(the customization of outputCommands is needed because of #37207).

@mmusich

mmusich commented Apr 8, 2024

Did anything change e.g. in HLT menu or how HLT is being run in DAQ in Run 378985?

adding to the ticket @mzarucki @trtomei @trocino @smorovic

@mzarucki

mzarucki commented Apr 8, 2024

Did anything change e.g. in HLT menu or how HLT is being run in DAQ in Run 378985?

Run 378985 was the second run with stable beams (SB) at 13.6 TeV; however, the menu did not change with respect to the first stable collisions run, 378981.

The menu deployed for the first 13.6 TeV collisions (for both runs) is the full p-p physics menu (V1.0) with L1T seeds and HLT paths enabled: /cdaq/physics/Run2024/2e34/v1.0.3/HLT/V2

@smorovic

smorovic commented Apr 8, 2024

Hello,
As far as I know, there was nothing different, globally, from the DAQ side.
One problem in this range, starting from 378981 until Sunday evening, is that we had a single FU with no GPUs detected by CMSSW, and thus running with CPU only, in case that could have had any effect.

This was fixed from 379067 onwards with a full reset of one GPU on the host (also, shorter periods on Saturday, when we attempted another fix, possibly didn't have this problem, but the problem returned quickly).

@mmusich

mmusich commented Apr 8, 2024

One problem in this range, starting from 378981 until Sunday evening, is that we had a single FU with no GPUs detected by CMSSW, and thus running with CPU only, in case that could have had any effect.

There seems to be a correlation.
Judging from https://dmytro.web.cern.ch/dmytro/cmsprodmon/tier0.php, the last run with this kind of failure was 379058, which is consistent with the change occurring at 379067.

@smorovic

Thinking about it more, with the CPU fallback disabled I don't see a need for a workaround (options 1-2...).
For HLT we can assume that we will run the whole cluster either with GPUs or without GPUs (at least in Run 3).

Sorry, who decided that this is how we will be running for the rest of Run 3?

I understood the "fix" as a quick-and-dirty hack to prevent the problem with Tier-0 jobs while a better solution is being investigated.

Not as a permanent solution for the rest of Run 3.

OK, noted, I take it back. It was my naive assumption that we would keep a homogeneous setup in Run 3, but if that is not the case, we need a long-term solution.

@smorovic

I made an estimate of the additional bandwidth for option 1 on the current HLT farm.

After removing the PSetMap writing in IOPool/Streamer/src/StreamSerializer.cc, the size of the uncompressed INI files among the streams with the HLT menu of run 378981 is much lower, between 54 kB and 66 kB (the largest one being HLTMonitor). Most of this seems to come from the branch ID lists (sd.setBranchIDLists(branchIDLists);), while the thinnedAssociationsHelper doesn't have any effect on the size.
Compressing the HLTMonitor file with tar cjf, I'm getting around 34 kB.

Taking into account that we run about 200 FUs with 8 processes each in the current configuration (this will increase by ~20% with new FUs), if we had one process block per stream file of a single process (we write one such file every lumisection), there would be 1600 such files written per stream per lumisection, or, at the ~23.3 s lumisection length, almost 69 Hz per stream.
Let's assume 70 streams in each lumisection, which is fewer than we have now in ramp-up. We are probably running with fewer streams overall. Note: if a stream has no events in a lumisection, the process block could also be skipped.

69 Hz * 70 * 0.034 MB => 164 MB/s. Probably around 200 MB/s with new FUs.
I think this looks acceptable compared to the 10 to 12 GB/s expected this year for pp runs.
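
(As a quick cross-check of the arithmetic, assuming the standard ~23.3 s lumisection:)

# Back-of-the-envelope check of the option-1 bandwidth estimate above
n_fus, procs_per_fu = 200, 8
ls_seconds = 23.31                   # one lumisection = 2^18 LHC orbits
files_per_ls = n_fus * procs_per_fu  # 1600 process blocks per stream per lumisection
rate_hz = files_per_ls / ls_seconds  # ~68.6 Hz per stream
streams, ini_mb = 70, 0.034          # streams per lumisection, compressed INI size
print("%.0f Hz -> %.0f MB/s" % (rate_hz, rate_hz * streams * ini_mb))  # ~69 Hz -> ~163 MB/s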

@makortel

Thanks for the feedback. With @Dr15Jones and @wddgit we have started to look into the details of "option 2" (and we will continue on this route unless it is decided otherwise, or we hit a blocker).

We came to the conclusion that the Adler32 checksum stored in the InitMsgView

uint32 adler32_chksum() const { return adler32_chksum_; }

would do the necessary job of disambiguating the different Init messages at the granularity needed for the framework metadata. (For reference, that checksum corresponds to the serialized data of SendJobHeader, which includes
sd.setBranchIDLists(branchIDLists);
sd.setThinnedAssociationsHelper(thinnedAssociationsHelper);
sd.setParameterSetMap(psetMap);

)

We have two questions for DAQ at this stage:

  • (more out of curiosity) How would the merging identify which INI files have different content (checksum)? Would it e.g. rely on file naming (in which case the aforementioned checksum would have to be included in the final INI file name), read the checksum from the INI file, or calculate the checksum of the full INI file on the fly?
  • Do you have a preference for how the final unique INI files and the event data files are concatenated together? Would it be reasonable to require all the InitMessages (INI files) first, followed by all the event data?

@smorovic

We have two questions for DAQ at this stage:

  • (more out of curiosity) How would the merging identify which INI files have different content (checksum)? Would it e.g. rely on file naming (in which case the aforementioned checksum would have to be included in the final INI file name), read the checksum from the INI file, or calculate the checksum of the full INI file on the fly?

This will have to be discussed more widely in the DAQ group since one part of the merging chain (all after the FU) is not maintained by me and we should agree on the approach (and feasibility).
But, in my opinion, the most straightforward way would be to have a full INI checksum (not the adler32 of the payload) calculated on the FUs and then appended to the file name, to disambiguate and preserve each variant through the merging chain.
We now use files of the format runX_ls0000_streamY_pidZ.ini. Later, as the file is propagated, it becomes, for example, runX_ls0000_streamY_fu-c2b02-41-01.ini and similar. Another substring, like _checksum98798787, could be added to the file name.
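
(Purely as an illustration of that naming scheme; the checksum algorithm and the exact format would still have to be decided:)

# Hypothetical sketch: a checksum of the full INI file appended to the name,
# so that distinct variants survive the merging chain
import zlib

def ini_file_name(run, ls, stream, host, ini_bytes):
    checksum = zlib.crc32(ini_bytes) & 0xFFFFFFFF  # placeholder checksum choice
    return "run%d_ls%04d_stream%s_%s_checksum%d.ini" % (run, ls, stream, host, checksum)

# e.g. ini_file_name(378985, 0, "HLTMonitor", "fu-c2b02-41-01", payload)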

  • Do you have a preference for how the final unique INI files and the event data files are concatenated together? Would it be reasonable to require all the InitMessages (INI files) first, followed by all the event data?

Yes, the INI chunks would definitely come before any events, as they do now.

@makortel

I believe I can cook up a quick (and ugly) workaround addressing point 2, for the specific case of the present HLT menu and farm. This workaround is not a good solution for the longer term, but it would help us move forward while we work on a good long-term solution.

This workaround is in

@makortel

This will have to be discussed more widely in the DAQ group since one part of the merging chain (all after the FU) is not maintained by me and we should agree on the approach (and feasibility).

Ok. Can you say anything about the timescale for a decision? I guess that if this approach is deemed unfeasible in the end, we'd go with option 1 ("process-level metadata section in the event data files").

But, in my opinion, the most straightforward way would be to have a full INI checksum (not the adler32 of the payload) calculated on the FUs and then appended to the file name, to disambiguate and preserve each variant through the merging chain.

Ok. I think this should also be fine from the framework perspective.

@smorovic

This will have to be discussed more widely in the DAQ group since one part of the merging chain (all after the FU) is not maintained by me and we should agree on the approach (and feasibility).

Ok. Can you say anything about the timescale for a decision? I guess that if this approach is deemed unfeasible in the end, we'd go with option 1 ("process-level metadata section in the event data files").

Typically we discuss such topics at our weekly meetings (Monday morning).
Let me tag @emeschi and @GuillelmoGomezCeballos
They should be aware of this discussion and could give their opinion on the discussed ideas/options, particularly on option "2", which assumes changes in the merging of INI files.

@makortel

assign daq

@cmsbuild

New categories assigned: daq

@emeschi,@smorovic you have been requested to review this Pull request/Issue and eventually sign? Thanks

@smorovic

I discussed this with @GuillelmoGomezCeballos and another complication came up.
We are doing distributed merging into single files, in a global filesystem, before they get transferred. This happens at the second step of merging, after the FU. When the first RUBU (out of 60 machines) sees a stream for a lumi fully processed locally, it takes a file lock in Lustre, creates a large sparse file, writes the INI header to it, and reserves space to write its event data (the reservation is written into the lock file before it is unlocked).
Subsequent RUBUs also take the lock to reserve space for their own event data, then unlock and write what they have.
Once the written event count checks out globally against the input from DAQ event building, the file is truncated and the transfer to T0 is initiated.
In this way we get a single large file on disk without reading it back and writing it out again; otherwise we would need double the bandwidth from/to the global file system (which is already at its limits as it is).
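
(A rough illustration of that reservation scheme, not actual DAQ code:)

# Sketch: each writer locks a shared bookkeeping file, claims an offset range
# in the large sparse output file, then writes its payload at that offset.
# Both files are assumed to have been created by the first writer.
import fcntl, struct

def reserve_and_write(lock_path, data_path, payload):
    with open(lock_path, "r+b") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize the reservations
        raw = lock.read(8)
        offset = struct.unpack("<Q", raw)[0] if len(raw) == 8 else 0
        lock.seek(0)
        lock.write(struct.pack("<Q", offset + len(payload)))  # new end of reserved space
        fcntl.flock(lock, fcntl.LOCK_UN)
    with open(data_path, "r+b") as out:  # sparse output file
        out.seek(offset)                 # unwritten gaps stay unallocated on disk
        out.write(payload)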

Writing on a particular RUBU happens as soon as the lumi is closed locally, before all other RUBUs have closed that lumisection; in fact it can even happen before all FUs have provided their INI files, because the system is asynchronous with respect to the participating nodes.

So there is a chance to miss an INI file and still end up with events from the corresponding process written into the streamer file, which would lead to the problem we had over the last few days.

While we could still consider (contrary to what I wrote before) writing INI files in the middle of streamer files, this would complicate synchronization between the writing processes, and we would like to avoid that.

Therefore we prefer to have option 1.

@fwyzard

fwyzard commented Apr 11, 2024

Can't the merger prepend the necessary INI files to the sparse file once all the data has been accounted for?

@Dr15Jones

What if you added the INI file info at the end of the streamer file? If the last 4 bytes of the file give the size the INI section takes up at the end of the file, we could read those last bytes, use them to jump to the INI section, read that, and then start reading the Event data from the front of the file. That way you would not have to wait for all the INIs before starting to write the Event data.
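
(A sketch of reading such a trailer, assuming a little-endian 4-byte size field:)

# Sketch of the proposed layout: event data from the start of the file, the
# INI section at the end, and the final 4 bytes holding the INI section size
import struct

def read_trailing_ini(path):
    with open(path, "rb") as f:
        f.seek(-4, 2)                    # last 4 bytes of the file
        (ini_size,) = struct.unpack("<I", f.read(4))
        f.seek(-(4 + ini_size), 2)       # jump back to the INI section
        return f.read(ini_size)          # event data starts at offset 0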

@smorovic

Can't the merger prepend the necessary INI files to the sparse file once all the data has been accounted for?

I'm not an expert on this, but I think that, while a sparse file can have gaps, the start of (any kind of) file can't be moved, and a file can be truncated only at the end.

What if you added the INI file info at the end of the streamer file? If the last 4 bytes of the file give the size the INI section takes up at the end of the file, we could read those last bytes, use them to jump to the INI section, read that, and then start reading the Event data from the front of the file. That way you would not have to wait for all the INIs before starting to write the Event data.

That seems interesting, e.g. for the macro-merging stage (which currently deals only with bookkeeping).
I'll let @GuillelmoGomezCeballos comment on the idea.

@GuillelmoGomezCeballos

From what I see in the long thread, there is already an option which involves not making changes downstream. Given all the trouble we are having elsewhere, and the fact that other changes are likely to happen, I don't see why we should make things more complicated. Therefore, let's use the so-called option 1 and avoid adding further files and changes involving the file system.

@fwyzard

fwyzard commented Apr 12, 2024

From what I see in the long thread, there is already an option which involves not making changes downstream. Given all the trouble we are having elsewhere, and the fact that other changes are likely to happen, I don't see why we should make things more complicated. Therefore, let's use the so-called option 1 and avoid adding further files and changes involving the file system.

This is not a technical evaluation of the different solutions, it's just an "I'm busy, let somebody else do the work" argument.

@makortel

@smorovic About the additional-bandwidth estimate for "option 1" (#44643 (comment)), how would the picture change for heavy ion data taking? Would there be e.g. fewer streams (output files) per job per lumi compared to pp?

@smorovic

I checked one of the HI menus from 2023, /cdaq/physics/Run2023HI/v1.1.6/HLT/V8. It has 77 event data streams (mostly to optimally split the file sizes). The INI sizes are ~2.3 MB.
After removing the writeout of the pset, most INI sizes are 17-18 kB, and three streams have 21-23 kB.
Taking the largest one (23 kB), it compresses to 11 kB, while the 18 kB ones compress to 9.3 kB.

So, even with more streams, this is expected to add between 2 and 3 times less overhead than pp: roughly 69 Hz * 77 streams * 0.011 MB => ~58 MB/s, i.e. less than 60 MB/s.
The number of streams is expected to stay similar to 2023.

@makortel

+heterogeneous

There is nothing left related to heterogeneous.

For core, we had some ideas to improve the reading performance of streamer files that were concatenated from processes with differing ProductRegistry (cms-sw/framework-team#931), but presently we are not planning to address them in the near term, and I'm not sure keeping this issue open would be a useful reminder for them. So maybe this issue could be closed now?
