
opportunity to cleanup LogErrors and LogWarnings in 2023 data promptReco processing #41456

Open
slava77 opened this issue Apr 28, 2023 · 50 comments


@slava77

slava77 commented Apr 28, 2023

This may be mainly for taxonomy, but it could also serve as a starting point for investigating the corresponding modules.

I was looking at the MessageLogger details in 2023 data for run 366498,
in the Muon0 and ZeroBias LogErrorMonitor skim files
/store/data/Run2023B/Muon0/USER/LogErrorMonitor-PromptReco-v1/000/366/498/00000/0e204272-49fb-42de-a373-65747ff240b4.root and /store/data/Run2023B/ZeroBias/USER/LogErrorMonitor-PromptReco-v1/000/366/498/00000/3f9f69f5-7933-47bb-a5a4-d1f9ffbf7331.root,
which have about 25K events each (the run is fairly short).

Warnings

235 EcalRecHitWorkerRecover
358 TrackNaN
616 BeamSpotFromDB
1126 Configuration
1136 L1TStage2uGTTiming
1224 BadMuon
3681 BasicTrajectoryState
24298 TopMonitor
26727 L1TTauOffline
407208 L1T

Errors

1 InvalidGhostTrackState
2 BadRange
4 InvalidState
5 GsfMultiStateUpdator
7 PFTrackTransformer
16 TrackerValidationVariables
225 FailedPropagation
542 TrajectoryNotPosDef
3251 CTPPSPixelDataFormatter
10392 HOCalib

Plain-text printout in FWLite with the following (later post-processed to get the totals):

import ROOT

f = open("zb.ews.pyscan.txt", "a")
en = ROOT.TChain("Events")
n = en.Add("root://cms-xrd-global.cern.ch/LFN")
for ev in range(en.GetEntries()):
  n = en.GetEntry(ev)
  # vector<edm::ErrorSummaryEntry> produced by logErrorHarvester
  mv = en.edmErrorSummaryEntrys_logErrorHarvester__RECO.product()
  e = en.EventAuxiliary
  for mm in mv:
    print(e.run(), e.luminosityBlock(), e.event(), mm.severity.getName().data(), mm.count, mm.module, mm.category, file=f)
f.close()
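The "post-processed to get totals" step can be sketched as follows. This is a minimal sketch, assuming the zb.ews.pyscan.txt layout produced above (columns: run lumi event sevLev count module category); tally_categories is a hypothetical helper name, not part of the original recipe.

```python
from collections import Counter

def tally_categories(path):
    # Count how many messages fall into each category in the FWLite scan
    # output; each line has columns: run lumi event sevLev count module category.
    totals = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 7:
                totals[fields[-1]] += 1
    # return (occurrences, category) pairs in increasing order of frequency
    return sorted((n, c) for c, n in totals.items())
```

Printing each pair then reproduces lists like the ones above.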

Full outputs, with columns run lumi event sevLev count module category, are available for Muon0 and ZeroBias. These exclude MemoryCheck.

an example log from one job in ZeroBias can be found here

an example of Muon0 job log is here

@cmsbuild

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign reconstruction,dqm,l1,alca

@cmsbuild

New categories assigned: dqm,alca,reconstruction,l1

@epalencia,@micsucmed,@rvenditti,@mandrenguyen,@emanueleusai,@syuvivida,@clacaputo,@aloeliger,@francescobrivio,@saumyaphor4252,@tvami,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77

slava77 commented Apr 28, 2023

an example log from one job in ZeroBias can be found here

@mmusich

mmusich commented Apr 28, 2023

concerning

44397 BeamSpotFromDB

the message is of the type:

%MSG-w BeamSpotFromDB:  BeamSpotOnlineProducer:scalerBeamSpot  26-Apr-2023 01:16:02 CEST Run: 366498 Event: 1987746
Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot

That happens because the prompt GT doesn't have the BeamSpotOnline records:

$ conddb list 130X_dataRun3_Prompt_v2 | grep BeamSpot
[2023-04-28 16:40:50,422] INFO: Connecting to pro [frontier://PromptProd/cms_conditions]
BeamSpotObjectsRcd                                      -                                                              BeamSpotObjects_PCL_byLumi_v0_prompt        

For reference, express has:

$ conddb list 130X_dataRun3_Express_v2 | grep BeamSpot
[2023-04-28 16:47:54,877] INFO: Connecting to pro [frontier://PromptProd/cms_conditions]
BeamSpotObjectsRcd                                      -                                                              BeamSpotObjects_2009_v1_express                                            
BeamSpotOnlineHLTObjectsRcd                             -                                                              BeamSpotOnlineHLT                                                          
BeamSpotOnlineLegacyObjectsRcd                          -                                                              BeamSpotOnlineLegacy              

so it always enters this clause:

if (not legacyRec and not hltRec) {
  edm::LogInfo("OnlineBeamSpotESProducer") << "None of the Beam Spots in ES are available! \n returning a fake one.";
  return std::shared_ptr<const BeamSpotObjects>(&fakeBS_, edm::do_nothing_deleter());
}

which in turn triggers here:

if (spotDB.beamType() != 2) {
  if (shoutMODE && beamTransientRcdESWatcher_.check(iSetup)) {
    edm::LogWarning("BeamSpotFromDB")
        << "Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot ";
  }

cc: @francescobrivio @dzuolo @gennai

@slava77

slava77 commented Apr 29, 2023

I forgot that ErrorSummaryEntry also has a count (and some go up to O(100)). If I add up the counts:

LogWarnings:

1461 EcalRecHitWorkerRecover
3624 TopMonitor
3832 BadMuon
8793 BasicTrajectoryState
19743 CTPPSPixelDataFormatter
20282 L1TTauOffline
23643 Configuration
46111 BeamSpotFromDB
101726 HOCalib
382982 L1T 

So, L1T actually "wins": 51223 from L1TMuonEndCapTrackProducer:simEmtfDigis and 331759 from L1TMuonEndCapTrackProducer:valEmtfStage2Digis

LogErrors:

8 L1T
9 BadMuon
11 GsfMultiStateUpdator
24 EcalRecHitWorkerRecover
45 BeamSpotFromDB
132 BasicTrajectoryState
228 FailedPropagation
278 TrajectoryNotPosDef
3384 CTPPSPixelDataFormatter
10177 HOCalib
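The "adding up the counts" step can be sketched as follows: sum the count field (the fifth column of the scan output) instead of counting lines. A minimal sketch assuming the same file layout as above; sum_counts is a hypothetical helper name.

```python
from collections import Counter

def sum_counts(path):
    # Sum the ErrorSummaryEntry 'count' field per category; columns are
    # run lumi event sevLev count module category, so count is field 5.
    totals = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 7:
                totals[fields[-1]] += int(fields[4])
    return sorted((n, c) for c, n in totals.items())
```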

@mmusich

mmusich commented May 2, 2023

101726 HOCalib
...
10177 HOCalib

Since this is particularly noisy, I was having a look in the log file you posted above, but I can't seem to find occurrences of it there...
though I think the error can only come from here:

edm::LogError("HOCalib") << "lumiScale collection is empty";

(which I think is expected given that in Run-3, the luminosity from scaler is not available).

@tvami

tvami commented May 2, 2023

Hi @bsunanda, can you please have a look at AlCaHOCalibProducer.cc and deal with the fact that the lumiScale collection will always be empty from now on (as pointed out by Marco)?

@slava77

slava77 commented May 2, 2023

101726 HOCalib
...
10177 HOCalib

since this is particularly noisy, I was having a look in the log file you posted above, but I don't seem to be able to find occurrences of that in it ... though I think the error can only come from here:

edm::LogError("HOCalib") << "lumiScale collection is empty";

(which I think is expected given that in Run-3, the luminosity from scaler is not available).

These show up in Muon0, e.g. in the log here.

@mmusich

mmusich commented May 2, 2023

e.g. in the log here

thanks, and indeed:

%MSG-e HOCalib:  AlCaHOCalibProducer:hoCalibProducer 26-Apr-2023 03:44:30 CEST  Run: 366498 Event: 740058
lumiScale collection is empty
%MSG

@mmusich

mmusich commented May 2, 2023

@cms-sw/ctpps-dpg-l2

please take note of the prevalence of

%MSG-e CTPPSPixelDataFormatter:  CTPPSPixelRawToDigi:ctppsPixelDigis  26-Apr-2023 01:43:51 CEST Run: 366498 Event: 2389803
 unphysical col and/or row fedId=1463 nllink=11 convroc=1 adc=0 col=40 row=80 detId=2040004608
%MSG

(and similar) warnings in the recent prompt.
This sounds like either data corruption or a wrong cabling map.

@mmusich

mmusich commented May 2, 2023

@cms-sw/l1-l2

So, L1T actually "wins": 51223 from L1TMuonEndCapTrackProducer:simEmtfDigis and 331759 from L1TMuonEndCapTrackProducer:valEmtfStage2Digis

logs are littered with these. Is it possible to do something about it?

@slava77

slava77 commented May 2, 2023

@cms-sw/l1-l2

So, L1T actually "wins": 51223 from L1TMuonEndCapTrackProducer:simEmtfDigis and 331759 from L1TMuonEndCapTrackProducer:valEmtfStage2Digis

logs are littered with these. Is it possible to do something about it?

Curiously, the first one, simEmtfDigis, seems misplaced: why are we running a sim module on data?

The warnings are quite similar though.
In the ZeroBias log:

%MSG-w L1T:  L1TMuonEndCapTrackProducer:valEmtfStage2Digis  26-Apr-2023 01:18:16 CEST Run: 366498 Event: 2233307
Found error in LCT quality: 0 (allowed range: 1-15).
%MSG
%MSG-w L1T:  L1TMuonEndCapTrackProducer:valEmtfStage2Digis  26-Apr-2023 01:18:16 CEST Run: 366498 Event: 2233307
From endcap 1, sector 4, station 2, ring 2, cscid 8. (Note that this LCT may be reported multiple times. See source code for explanations.)
%MSG

In the Muon0 log:

%MSG-w L1T:  L1TMuonEndCapTrackProducer:simEmtfDigis 26-Apr-2023 03:44:27 CEST  Run: 366498 Event: 722132
Found error in LCT quality: 0 (allowed range: 1-15).
%MSG
%MSG-w L1T:  L1TMuonEndCapTrackProducer:simEmtfDigis 26-Apr-2023 03:44:27 CEST  Run: 366498 Event: 722132
From endcap 2, sector 4, station 2, ring 2, cscid 5. (Note that this LCT may be reported multiple times. See source code for explanations.)
%MSG

@fabferro

fabferro commented May 2, 2023

@cms-sw/ctpps-dpg-l2

please take note of the prevalence of

%MSG-e CTPPSPixelDataFormatter:  CTPPSPixelRawToDigi:ctppsPixelDigis  26-Apr-2023 01:43:51 CEST Run: 366498 Event: 2389803
 unphysical col and/or row fedId=1463 nllink=11 convroc=1 adc=0 col=40 row=80 detId=2040004608
%MSG

(and similar) warnings in the recent prompt. This sounds like either data corruption, or wrong cabling map.

I looked at some of the linked logs but I can't see any warning related to the CTPPSPixelDataFormatter module, even though @slava77 quotes a few thousand (can you please point me to one of them?).
As for the errors, they are indeed there and, in my view, can't be silenced. The mapping looks correct; they are probably due to some data corruption we are investigating.

@mmusich

mmusich commented May 2, 2023

@fabferro

I looked at some of the linked logs but I can't see any Warning related to CTPPSPixelDataFormatter module despite @slava77 quotes a few thousands (can you please point me to one of them?).

There's plenty of them here: https://slava77sk.web.cern.ch/slava77sk/reco/log/PromptReco_Run366498_ZeroBias/WMTaskSpace/logCollect1/11735be0-80ac-4b03-9427-9428d624bfe3-1-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

@fabferro

fabferro commented May 2, 2023

@cms-sw/ctpps-dpg-l2

please take note of the prevalence of

%MSG-e CTPPSPixelDataFormatter:  CTPPSPixelRawToDigi:ctppsPixelDigis  26-Apr-2023 01:43:51 CEST Run: 366498 Event: 2389803
 unphysical col and/or row fedId=1463 nllink=11 convroc=1 adc=0 col=40 row=80 detId=2040004608
%MSG

(and similar) warnings in the recent prompt. This sounds like either data corruption, or wrong cabling map.

I looked at some of the linked logs but I can't see any Warning related to CTPPSPixelDataFormatter module despite @slava77 quotes a few thousands (can you please point me to one of them?).
As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

@fabferro

I looked at some of the linked logs but I can't see any Warning related to CTPPSPixelDataFormatter module despite @slava77 quotes a few thousands (can you please point me to one of them?).

There's plenty of them here: https://slava77sk.web.cern.ch/slava77sk/reco/log/PromptReco_Run366498_ZeroBias/WMTaskSpace/logCollect1/11735be0-80ac-4b03-9427-9428d624bfe3-1-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

Still, I see only errors related to CTPPSPixelDataFormatter, no warnings. As I said, it's correct that they are reported. If there is any warning (MSG-w, not MSG-e) issued by CTPPSPixelDataFormatter, please cut and paste the line, because I can't actually see them.

@slava77

slava77 commented May 2, 2023

If there is any warning (MSG-w not MSG-e) issued by CTPPSPixelDataFormat, please cut and paste the line cause I can't actually see them.

Ehm, apparently TTree::Scan is not working correctly; I guess it still reports the first error with the correct category for a given event, but then gets the rest wrong.

@slava77

slava77 commented May 2, 2023

I now used Python in FWLite to analyze this; the updated top list is:

Warnings

235 EcalRecHitWorkerRecover
358 TrackNaN
616 BeamSpotFromDB
1126 Configuration
1136 L1TStage2uGTTiming
1224 BadMuon
3681 BasicTrajectoryState
24298 TopMonitor
26727 L1TTauOffline
407208 L1T

Errors

1 InvalidGhostTrackState
2 BadRange
4 InvalidState
5 GsfMultiStateUpdator
7 PFTrackTransformer
16 TrackerValidationVariables
225 FailedPropagation
542 TrajectoryNotPosDef
3251 CTPPSPixelDataFormatter
10392 HOCalib

import ROOT

f = open("zb.ews.pyscan.txt", "a")
en = ROOT.TChain("Events")
n = en.Add("root://cms-xrd-global.cern.ch/LFN")
for ev in range(en.GetEntries()):
  n = en.GetEntry(ev)
  # vector<edm::ErrorSummaryEntry> produced by logErrorHarvester
  mv = en.edmErrorSummaryEntrys_logErrorHarvester__RECO.product()
  e = en.EventAuxiliary
  for mm in mv:
    print(e.run(), e.luminosityBlock(), e.event(), mm.severity.getName().data(), mm.count, mm.module, mm.category, file=f)
f.close()

@slava77

slava77 commented May 2, 2023

Full outputs, with columns run lumi event sevLev count module category, are available for Muon0 and ZeroBias. These exclude MemoryCheck.

@slava77

slava77 commented Jul 12, 2023

13_0_9 now

@slava77 would you mind trying the recipe from #41456 (comment) again in a recent run, to see if there are improvements?

I used 2023D run 369956 (the run is a bit larger though, so absolute counts are higher). In ZeroBias I get:

ZeroBias:

  • error counts by category
    • 6 BadRange
    • 31 PFTrackTransformer
    • 36 InvalidState
    • 41 GsfMultiStateUpdator
    • 70 TrackerValidationVariables
    • 296 LCTQualityControl
    • 1528 FailedPropagation
    • 3600 TrajectoryNotPosDef
    • 14651 EcalLaserDbService
    • 14935 CTPPSPixelDataFormatter (seen/discussed before)

The TrackerValidationVariables are like

%MSG-e TrackerValidationVariables:  MonitorTrackResiduals:MonitorTrackResiduals  TrackerValidationVariables::fillHitQuantities() 04-Jul-2023 17:06:32 CEST  Run: 369956 Event: 18239283
One of the squared error methods gives negative result
        errHit.xx()     errHit.yy()     errTrk.xx()     errTrk.yy()
        2.69487e-06     3.40282e+38     -1.41249        2066.1

The LCTQualityControl are like

%MSG-e LCTQualityControl:  CSCTriggerPrimitivesProducer:valCscStage2Digis  04-Jul-2023 17:20:02 CEST Run: 369956 Event: 114602933
CSCCLCTDigi with invalid key half-strip: 65535; allowed [0, 159]

The EcalLaserDbService are apparently coming in bursts and look like

%MSG-e EcalLaserDbService:  EcalRecHitProducer:ecalRecHit@cpu  04-Jul-2023 17:21:18 CEST Run: 369956 Event: 70547281
apdpnref (1) or t_i-t_f (0 is zero!

  • Warning counts by category (top 10)
    • 458 RPCAMCRawToDigi
    • 2028 L1T
    • 2880 SiStripRawToDigi
    • 3729 TrackNaN (somewhat known for tracking with Track has NaN or the cov is not pos-definite message)
    • 5700 Configuration
    • 6774 BadMuon
    • 8640 L1TStage2uGTTiming
    • 22196 BasicTrajectoryState (somewhat known for tracking with curv error not pos-def message)
    • 25215 TopMonitor
    • 280574 NoModule

The SiStripRawToDigi are like

%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigis  04-Jul-2023 16:01:42 CEST Run: 369956 Event: 8224256
NULL pointer to FEDRawData for FED: id 251
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)

There's one per FED per stream; in around 10% of logs it accumulates to (460) (530) (535) (537) (543) (549) (551) (552) (553) (556) (557) (560) (562) (565) (568) (575) (589) (603) (606) (609) (610) (614) (616) (620) (624) (627) (628) (632) (636) (639) (640) (642) (646) (647) (648) (649) (650) (651) (652) (653) (654) (655) (656) (657) (658) (659) (660) (661) (662) (663) (664) (665) (666) (667) (668) (669) (670) (671) (672) (673) (674) (675) (676) (677) (678) (679) (680) (681) (682) (683) (684) (685) (686) (687) (688) (689) (690) (691) (692) (693) (694) (695) (696) (697) (698) (699) (700) (701) (702) (703) (704) (705) (706) (707) (708) (709) (710) (711) (712) (713) (714) (715) (716) (717) (718) (719) (720) (721) (722) (723) (724) (725) (726) (727) (728) (730) (731) (732) (733) (738) (744) (755) (772)

The Configuration are like

%MSG-w Configuration:  HLTHighLevel:ALCARECOTkAlMinBiasNOTHLT  04-Jul-2023 17:03:19 CEST Run: 369956 Event: 46049
none of the requested paths and pattern match any HLT path - no events will be selected

The BadMuon are like

%MSG-w BadMuon:  PileupJetIdProducer:pileupJetId 04-Jul-2023 17:04:37 CEST  Run: 369956 Event: 1702169
Found a PFCandidate muon without a trackRef: falling back to Muon::bestTrack

The L1TStage2uGTTiming are like

%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 04-Jul-2023 17:02:20 CEST  Run: 369956 Stream: 6
Algo "L1_SingleJet60er2p5" not found in the trigger menu L1Menu_Collisions2023_v1_2_0. Could not retrieve algo bit number.

The TopMonitor is perhaps similar to what it was before, e.g.

%MSG-w TopMonitor:  TopMonitor:BTagMu_AK4DiJet20_Mu5 26-Apr-2023 03:44:28 CEST  Run: 366498 Event: 214752
B-Jet handle not valid, will skip event

The NoModule are like:

%MSG-w NoModule:   ByClusterSummaryMultiplicityPairEventFilter:toomanystripclus53X  04-Jul-2023 17:03:11 CEST Run: 369956 Event: 56994
No information for requested module 5 (PIXEL). Please check in the Provenance Infomation for proper modules.

%MSG-w NoModule:   ByClusterSummaryMultiplicityPairEventFilter:toomanystripclus53X  04-Jul-2023 17:03:11 CEST Run: 369956 Event: 44229
No information for requested module 0 (STRIP). Please check in the Provenance Infomation for proper modules.

I will post notes from Muon[01] separately later

@mmusich

mmusich commented Jul 12, 2023

The EcalLaserDbService are apparently coming in bursts and look like

%MSG-e EcalLaserDbService: EcalRecHitProducer:ecalRecHit@cpu 04-Jul-2023 17:21:18 CEST Run: 369956 Event: 70547281
apdpnref (1) or t_i-t_f (0 is zero!

@cms-sw/alca-l2 this looks potentially worrisome. You might want to contact ECAL DPG about it.

@mmusich

mmusich commented Jul 12, 2023

The TopMonitor is perhaps similar to what it was before, e.g.

%MSG-w TopMonitor: TopMonitor:BTagMu_AK4DiJet20_Mu5 26-Apr-2023 03:44:28 CEST Run: 366498 Event: 214752
B-Jet handle not valid, will skip event

@cms-sw/dqm-l2 is it possible to reach out to top dqm developers to silence this warning?

@aloeliger

The L1TStage2uGTTiming are like

%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 04-Jul-2023 17:02:20 CEST  Run: 369956 Stream: 6
Algo "L1_SingleJet60er2p5" not found in the trigger menu L1Menu_Collisions2023_v1_2_0. Could not retrieve algo bit number.

These appear to be from changes that were made in #41972 but don't appear to have been backported. Ultimately I think some consideration was requested about how to get around the changing menu, but I can make the backport if you want.

@mmusich

mmusich commented Jul 12, 2023

@aloeliger

I have a question on #41972.
IIUC the change in the algorithms didn't happen right at the beginning of 2023 data-taking, so assuming there will be reprocessing with >= 13_2_X (current master), does it mean we won't be monitoring the right algos in the first part of the data? (probably a moot question at this point since the data is already taken, I am not sure how L1T offline certification works).

@mmusich

mmusich commented Jul 12, 2023

does it mean we won't be monitoring the right algos in the first part of the data?

actually #41972 (comment) says it all...

I am not sure #41972 was the right approach.

@mmusich

mmusich commented Jul 12, 2023

@fabferro

As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

there are still thousands of LogError emissions per run from the CTPPS unpacker in recent runs (see #41456 (comment)); please comment on the status of the investigations into the data corruption.

@aloeliger

aloeliger commented Jul 12, 2023

@aloeliger

I have a question on #41972. IIUC the change in the algorithms didn't happen right at the beginning of 2023 data-taking, so assuming there will be reprocessing with >= 13_2_X (current master), does it mean we won't be monitoring the right algos in the first part of the data? (probably a moot question at this point since the data is already taken, I am not sure how L1T offline certification works).

@mmusich

The bits hardcoded into the uGT timing config before the change were never correct for 2023. They did not exist in any 2023 L1 menu. If we reprocess 2023 data now, the currently extant bits should be present in all 2023 menus I think.

Ultimately, I think we were not following (and I guess currently are not following online without a backport) the right bits in the first part of the data that we took this year (They were likely just ignored in the DQM in this case?), but I do not believe this particular test to be a make or break decider for L1T certification. It is likely just a uGT debugging tool, and hardware health check by the sound of it.

I opened #41972 to clean up the warnings, but if we want changes beyond that I think the requested solution is not under my purview? If it is, I can take a look at what it requires to change anything.

@mmusich

mmusich commented Jul 12, 2023

I opened #41972 to clean up the warnings

based on #41972 (comment) did #41972 create new warnings in other years?

@aloeliger

I opened #41972 to clean up the warnings

based on #41972 (comment) did #41972 create new warnings in other years?

Very probably. I don't know that it has been reported to me though.

@mmusich

mmusich commented Jul 12, 2023

but if we want changes beyond that I think the requested solution is not under my purview?

Is there any L1T DQM developers team you might reach to in order to suggest / implement a better fix?

@fabferro

fabferro commented Jul 12, 2023

@fabferro

As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

there are still thousands of LogError emissions per run from the CTPPS unpacker in recent runs (see #41456 (comment)), please comment on the status on the investigations on data corruption.

The investigation into the hardware instabilities that generate (among other things) these errors has been ongoing since the start of data taking and is not yet over. Some problems have been fixed, others not (yet). We are struggling to keep as many planes as possible online; this of course has to be balanced against the disruption that those not-perfectly-working planes cause to the whole CMS DAQ and offline processing. Maybe @AndreaBellora can add some more details, but my understanding so far is that we will very likely have to live with these error messages for the rest of this year's data taking.

@mmusich

mmusich commented Jul 12, 2023

but my understanding so far is that we will very likely have to live with these error message for the rest of this year data taking.

Given their prevalence, and the fact that at this point they are very well known in the CT-PPS community, would it be possible to limit the emission to a few times per run?

@AndreaBellora

@fabferro

As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

there are still thousands of LogError emissions per run from the CTPPS unpacker in recent runs (see #41456 (comment)), please comment on the status on the investigations on data corruption.

The investigation on the hardware instabilities that are generating (also) these errors have been ongoing since the start of data taking and are not yet over. Some problems have been fixed, some others not (yet). We are struggling to keep as many planes as possible online, this of course is being balanced with the disruption that those not-perfectly-working planes cause to the whole CMS DAQ and Offline processing. Maybe @AndreaBellora may add some more details, but my understanding so far is that we will very likely have to live with these error messages for the rest of this year data taking.

Indeed, I can only add that unfortunately there is currently no way to fully remove this data corruption by acting on the hardware (nor is it strictly detrimental to the detector performance, as far as we know). I'd suggest looking for a way to suppress these messages after they reach a certain number of occurrences per run, for example.
We want to know whether they are there (or not), but we are not very interested in getting all of them (and we could always privately re-process some data without this new 'suppression feature' if we wanted to).

@fabferro

but my understanding so far is that we will very likely have to live with these error message for the rest of this year data taking.

given their prevalence - and the fact that at this point they are very well known in the CT-PPS community, would it be possible to limit the emission to few times per run?

After (also) @AndreaBellora's comment, I have nothing against limiting the error messages to a "certain number" per run. Whether the data is corrupted or not is spotted in the EventFilter that issues the LogError message. Practically speaking, I have no idea how to set a maximum number of issued messages. Any help or suggestions?

@mmusich

mmusich commented Jul 12, 2023

Practically speaking I have no idea of how to set a number of maximum issued messages. Any help, suggestions?

@fabferro and @AndreaBellora I would suggest you study the way the SiStrip unpacker deals with this issue:

void printWarningSummary() const { warnings_.printSummary(); }

@AndreaBellora

but my understanding so far is that we will very likely have to live with these error message for the rest of this year data taking.

given their prevalence - and the fact that at this point they are very well known in the CT-PPS community, would it be possible to limit the emission to few times per run?

After (also) @AndreaBellora 's comment, I've nothing against limiting the error messages to a "certain number" per run. Whether the data is corrupted or not is spotted in the EventFilter that issues the LogError message. Practically speaking I have no idea of how to set a number of maximum issued messages. Any help, suggestions?

Practically speaking I have no idea of how to set a number of maximum issued messages. Any help, suggestions?

@fabferro and @AndreaBellora I would suggest you to study the way that the SiStrip unpacker deals with this issue:

void printWarningSummary() const { warnings_.printSummary(); }

Otherwise, wouldn't it be possible to define a limit parameter in the MessageLogger for this module? I'm no expert on that, but I managed to do something similar in the past with some plugins...

@mmusich

mmusich commented Jul 12, 2023

I'm no expert on that, but I could do something similar in the past with some plugins...

I am not sure you have control over that at the level of the Tier-0 configuration.

@slava77

slava77 commented Jul 12, 2023

I'm no expert on that, but I could do something similar in the past with some plugins...

I am not sure you have control on that at the level of Tier-0 configuration.

RecoTLR.customisePrompt and customiseExpress can be used

def customisePrompt(process):

But wouldn't these be needed in reprocessing as well? In that case I'd configure limited emission of errors by default and enable the full output elsewhere for tests.

Similar to the tracker, it may be more practical to produce error objects and rely on analysis of these data products instead of text log files.
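For illustration, a per-category cap can in principle be set via the MessageLogger configuration inside such a customisation function. This is only a sketch assuming the standard MessageLogger per-category limit mechanism; the function name and the limit value are illustrative, not a tested Tier-0 recipe.

```python
import FWCore.ParameterSet.Config as cms

def customiseLimitCTPPSErrors(process):
    # Cap the number of CTPPSPixelDataFormatter messages emitted per job;
    # occurrences beyond the limit are suppressed by the MessageLogger.
    process.MessageLogger.cerr.CTPPSPixelDataFormatter = cms.untracked.PSet(
        limit = cms.untracked.int32(5)
    )
    return process
```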

@slava77

slava77 commented Jul 13, 2023

Here are some details from Muon[01] for the same run 369956, apparently based on 1.9M events (90 /pb run).

Errors (including average frequency per 1000 events):

  • 1 (<0.01) TrajectoryNaN ("new", rare)
  • 4 (<0.01) InvalidGhostTrackState ("new", rare)
  • 29 (0.02) BadRange
  • 116 (0.06) PFTrackTransformer
  • 384 (0.20) InvalidState
  • 393 (0.21) GsfMultiStateUpdator
  • 1669 (0.88) LCTQualityControl
  • 5600 (2.96) FailedPropagation
  • 10405 (5.5) TrajectoryNotPosDef
  • 42668 (22) CTPPSPixelDataFormatter
  • 112521 (59) EcalLaserDbService

FailedPropagation is like the following; it seems like a candidate for a warning:

%MSG-e FailedPropagation:  TrackProducer:lowPtQuadStepTracks  05-Jul-2023 01:47:23 CEST Run: 369956 Event: 12755986
 too many iterations trying to reach plane

Warnings (including average frequency per 1000 events, listing only those with a rate above 0.1 per 1K events):

  • 255 (0.13) EnergyInDeadEB_FE (not in ZeroBias)
  • 366 (0.19) L1T|EMTF
  • 668 (0.35) FreeTrajectoryState
  • 708 (0.37) TSCPBuilderNoMaterial
  • 742 (0.39) PFTrackTransformer
  • 1187 (0.63) TwoTrackMinimumDistance
  • 1684 (0.89) Muon|RecoMuon|MuonUpdatorAtVertex
  • 2091 (1.10) MakeMuonWithTEV
  • 3840 (2.03) SiStripRawToDigi
  • 4537 (2.40) ProductNotValid (not in ZeroBias)
  • 6972 (3.68) L1T
  • 12562 (6.64) TrackNaN
  • 12910 (6.82) L1TTauOffline (not in ZeroBias)
  • 15508 (8.19) BadMuon
  • 70452 (37.22) NoModule (only in bad LS, outside of good by DCS 37 to 1076)
  • 118020 (62.35) BasicTrajectoryState
  • 3328419 (1758.35) TopMonitor

EnergyInDeadEB_FE are like

%MSG-w EnergyInDeadEB_FE:  EcalRecHitProducer:ecalRecHit@cpu  05-Jul-2023 00:38:39 CEST Run: 369956 Event: 9750863
TP energy in the dead TT = 54.5 at (EcalTT subDet Barrel iz -  ieta 10 iphi 53)

ProductNotValid are like

%MSG-w ProductNotValid:   Tau3MuMonitor:HLT_Tau3Mu_Mu7_Mu1_TkMu1_IsoTau15_Charge1_Monitoring  05-Jul-2023 00:34:15 CEST Run: 369956 Event: 9075312
Tau3Mu trigger product not valid

L1TTauOffline are like

%MSG-w L1TTauOffline:  L1TTauOffline:l1tTauOfflineDQMEmu  05-Jul-2023 00:23:57 CEST Run: 369956 Event: 8522774
This offline tau has no comb3T discriminator, skipping

@srimanob

TopMonitor is solved in #42808
