
opportunity to cleanup LogErrors and LogWarnings in 2023 data promptReco processing #41456

Open
slava77 opened this issue Apr 28, 2023 · 50 comments


@slava77

slava77 commented Apr 28, 2023

This may be mainly for taxonomy, but it could also serve as a starting point for investigating the corresponding modules.

I was looking at the MessageLogger details in 2023 data for run 366498,
in the Muon0 and ZeroBias LogErrorMonitor skim files
/store/data/Run2023B/Muon0/USER/LogErrorMonitor-PromptReco-v1/000/366/498/00000/0e204272-49fb-42de-a373-65747ff240b4.root and /store/data/Run2023B/ZeroBias/USER/LogErrorMonitor-PromptReco-v1/000/366/498/00000/3f9f69f5-7933-47bb-a5a4-d1f9ffbf7331.root,
which have about 25K events each (the run is fairly short).

Warnings

235 EcalRecHitWorkerRecover
358 TrackNaN
616 BeamSpotFromDB
1126 Configuration
1136 L1TStage2uGTTiming
1224 BadMuon
3681 BasicTrajectoryState
24298 TopMonitor
26727 L1TTauOffline
407208 L1T

Errors

1 InvalidGhostTrackState
2 BadRange
4 InvalidState
5 GsfMultiStateUpdator
7 PFTrackTransformer
16 TrackerValidationVariables
225 FailedPropagation
542 TrajectoryNotPosDef
3251 CTPPSPixelDataFormatter
10392 HOCalib

Plain-text printout in FWLite with the following (later post-processed to get the totals):

import ROOT

f = open("zb.ews.pyscan.txt", "a")
en = ROOT.TChain("Events")
n = en.Add("root://cms-xrd-global.cern.ch/LFN")
for ev in range(en.GetEntries()):
  n = en.GetEntry(ev)
  # vector<edm::ErrorSummaryEntry> produced by logErrorHarvester
  mv = en.edmErrorSummaryEntrys_logErrorHarvester__RECO.product()
  e = en.EventAuxiliary
  for mm in mv:
    print(e.run(), e.luminosityBlock(), e.event(), mm.severity.getName().data(), mm.count, mm.module, mm.category, file=f)
f.close()
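The "post-processed to get totals" step can be sketched as follows. This is a minimal sketch, assuming the zb.ews.pyscan.txt layout produced above (columns: run lumi event sevLev count module category); tally_categories is a hypothetical helper name, not part of the original recipe.

```python
from collections import Counter

def tally_categories(path):
    # Count how many messages fall into each category in the FWLite scan
    # output; each line has columns: run lumi event sevLev count module category.
    totals = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 7:
                totals[fields[-1]] += 1
    # return (occurrences, category) pairs in increasing order of frequency
    return sorted((n, c) for c, n in totals.items())
```

Printing each pair then reproduces lists like the ones above.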

Full outputs, with columns run lumi event sevLev count module category, are available for Muon0 and ZeroBias. These exclude MemoryCheck.

an example log from one job in ZeroBias can be found here

an example of Muon0 job log is here

@cmsbuild

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign reconstruction,dqm,l1,alca

@cmsbuild

New categories assigned: dqm,alca,reconstruction,l1

@epalencia,@micsucmed,@rvenditti,@mandrenguyen,@emanueleusai,@syuvivida,@clacaputo,@aloeliger,@francescobrivio,@saumyaphor4252,@tvami,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77

slava77 commented Apr 28, 2023

an example log from one job in ZeroBias can be found here

@mmusich

mmusich commented Apr 28, 2023

concerning

44397 BeamSpotFromDB

the message is of the type:

%MSG-w BeamSpotFromDB:  BeamSpotOnlineProducer:scalerBeamSpot  26-Apr-2023 01:16:02 CEST Run: 366498 Event: 1987746
Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot

That happens because the prompt GT doesn't have the BeamSpotOnline records:

$ conddb list 130X_dataRun3_Prompt_v2 | grep BeamSpot
[2023-04-28 16:40:50,422] INFO: Connecting to pro [frontier://PromptProd/cms_conditions]
BeamSpotObjectsRcd                                      -                                                              BeamSpotObjects_PCL_byLumi_v0_prompt        

For reference, express has:

$ conddb list 130X_dataRun3_Express_v2 | grep BeamSpot
[2023-04-28 16:47:54,877] INFO: Connecting to pro [frontier://PromptProd/cms_conditions]
BeamSpotObjectsRcd                                      -                                                              BeamSpotObjects_2009_v1_express                                            
BeamSpotOnlineHLTObjectsRcd                             -                                                              BeamSpotOnlineHLT                                                          
BeamSpotOnlineLegacyObjectsRcd                          -                                                              BeamSpotOnlineLegacy              

so it always enters this clause:

if (not legacyRec and not hltRec) {
  edm::LogInfo("OnlineBeamSpotESProducer") << "None of the Beam Spots in ES are available! \n returning a fake one.";
  return std::shared_ptr<const BeamSpotObjects>(&fakeBS_, edm::do_nothing_deleter());
}

which in turn triggers here:

if (spotDB.beamType() != 2) {
  if (shoutMODE && beamTransientRcdESWatcher_.check(iSetup)) {
    edm::LogWarning("BeamSpotFromDB")
        << "Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot ";
  }

cc: @francescobrivio @dzuolo @gennai

@slava77

slava77 commented Apr 29, 2023

I forgot that ErrorSummaryEntry also has a count (and some go up to O(100)). If I add up the counts:

LogWarnings:

1461 EcalRecHitWorkerRecover
3624 TopMonitor
3832 BadMuon
8793 BasicTrajectoryState
19743 CTPPSPixelDataFormatter
20282 L1TTauOffline
23643 Configuration
46111 BeamSpotFromDB
101726 HOCalib
382982 L1T 

So, L1T actually "wins": 51223 from L1TMuonEndCapTrackProducer:simEmtfDigis and 331759 from L1TMuonEndCapTrackProducer:valEmtfStage2Digis

LogErrors:

8 L1T
9 BadMuon
11 GsfMultiStateUpdator
24 EcalRecHitWorkerRecover
45 BeamSpotFromDB
132 BasicTrajectoryState
228 FailedPropagation
278 TrajectoryNotPosDef
3384 CTPPSPixelDataFormatter
10177 HOCalib
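The "adding up the counts" step can be sketched as follows: sum the count field (the fifth column of the scan output) instead of counting lines. A minimal sketch assuming the same file layout as above; sum_counts is a hypothetical helper name.

```python
from collections import Counter

def sum_counts(path):
    # Sum the ErrorSummaryEntry 'count' field per category; columns are
    # run lumi event sevLev count module category, so count is field 5.
    totals = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 7:
                totals[fields[-1]] += int(fields[4])
    return sorted((n, c) for c, n in totals.items())
```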

@mmusich

mmusich commented May 2, 2023

101726 HOCalib
...
10177 HOCalib

Since this is particularly noisy, I was having a look in the log file you posted above, but I can't seem to find occurrences of it there...
though I think the error can only come from here:

edm::LogError("HOCalib") << "lumiScale collection is empty";

(which I think is expected given that in Run-3, the luminosity from scaler is not available).

@tvami

tvami commented May 2, 2023

Hi @bsunanda, can you please have a look at AlCaHOCalibProducer.cc and deal with the fact that the lumiScale collection will always be empty from now on (as pointed out by Marco)?

@slava77

slava77 commented May 2, 2023

101726 HOCalib
...
10177 HOCalib

since this is particularly noisy, I was having a look in the log file you posted above, but I don't seem to be able to find occurrences of that in it ... though I think the error can only come from here:

edm::LogError("HOCalib") << "lumiScale collection is empty";

(which I think is expected given that in Run-3, the luminosity from scaler is not available).

These show up in Muon0, e.g. in the log here.

@mmusich

mmusich commented May 2, 2023

e.g. in the log here

thanks, and indeed:

%MSG-e HOCalib:  AlCaHOCalibProducer:hoCalibProducer 26-Apr-2023 03:44:30 CEST  Run: 366498 Event: 740058
lumiScale collection is empty
%MSG

@mmusich

mmusich commented May 2, 2023

@cms-sw/ctpps-dpg-l2

please take note of the prevalence of

%MSG-e CTPPSPixelDataFormatter:  CTPPSPixelRawToDigi:ctppsPixelDigis  26-Apr-2023 01:43:51 CEST Run: 366498 Event: 2389803
 unphysical col and/or row fedId=1463 nllink=11 convroc=1 adc=0 col=40 row=80 detId=2040004608
%MSG

(and similar) warnings in the recent prompt.
This sounds like either data corruption or a wrong cabling map.

@mmusich

mmusich commented May 2, 2023

@cms-sw/l1-l2

So, L1T actually "wins": 51223 from L1TMuonEndCapTrackProducer:simEmtfDigis and 331759 from L1TMuonEndCapTrackProducer:valEmtfStage2Digis

logs are littered with these. Is it possible to do something about it?

@slava77

slava77 commented May 2, 2023

@cms-sw/l1-l2

So, L1T actually "wins": 51223 from L1TMuonEndCapTrackProducer:simEmtfDigis and 331759 from L1TMuonEndCapTrackProducer:valEmtfStage2Digis

logs are littered with these. Is it possible to do something about it?

Curiously, the first one, simEmtfDigis, seems misplaced: why are we running a sim module on data?

The warnings are quite similar though.
In the ZeroBias log:

%MSG-w L1T:  L1TMuonEndCapTrackProducer:valEmtfStage2Digis  26-Apr-2023 01:18:16 CEST Run: 366498 Event: 2233307
Found error in LCT quality: 0 (allowed range: 1-15).
%MSG
%MSG-w L1T:  L1TMuonEndCapTrackProducer:valEmtfStage2Digis  26-Apr-2023 01:18:16 CEST Run: 366498 Event: 2233307
From endcap 1, sector 4, station 2, ring 2, cscid 8. (Note that this LCT may be reported multiple times. See source code for explanations.)
%MSG

In the Muon0 log:

%MSG-w L1T:  L1TMuonEndCapTrackProducer:simEmtfDigis 26-Apr-2023 03:44:27 CEST  Run: 366498 Event: 722132
Found error in LCT quality: 0 (allowed range: 1-15).
%MSG
%MSG-w L1T:  L1TMuonEndCapTrackProducer:simEmtfDigis 26-Apr-2023 03:44:27 CEST  Run: 366498 Event: 722132
From endcap 2, sector 4, station 2, ring 2, cscid 5. (Note that this LCT may be reported multiple times. See source code for explanations.)
%MSG

@fabferro

fabferro commented May 2, 2023

@cms-sw/ctpps-dpg-l2

please take note of the prevalence of

%MSG-e CTPPSPixelDataFormatter:  CTPPSPixelRawToDigi:ctppsPixelDigis  26-Apr-2023 01:43:51 CEST Run: 366498 Event: 2389803
 unphysical col and/or row fedId=1463 nllink=11 convroc=1 adc=0 col=40 row=80 detId=2040004608
%MSG

(and similar) warnings in the recent prompt. This sounds like either data corruption, or wrong cabling map.

I looked at some of the linked logs but I can't see any warning related to the CTPPSPixelDataFormatter module, even though @slava77 quotes a few thousand (can you please point me to one of them?).
As for the errors, they are indeed there and, in my view, can't be silenced. The mapping looks correct; they are probably due to some data corruption we are investigating.

@mmusich

mmusich commented May 2, 2023

@fabferro

I looked at some of the linked logs but I can't see any Warning related to CTPPSPixelDataFormatter module despite @slava77 quotes a few thousands (can you please point me to one of them?).

There's plenty of them here: https://slava77sk.web.cern.ch/slava77sk/reco/log/PromptReco_Run366498_ZeroBias/WMTaskSpace/logCollect1/11735be0-80ac-4b03-9427-9428d624bfe3-1-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

@fabferro

fabferro commented May 2, 2023

@cms-sw/ctpps-dpg-l2

please take note of the prevalence of

%MSG-e CTPPSPixelDataFormatter:  CTPPSPixelRawToDigi:ctppsPixelDigis  26-Apr-2023 01:43:51 CEST Run: 366498 Event: 2389803
 unphysical col and/or row fedId=1463 nllink=11 convroc=1 adc=0 col=40 row=80 detId=2040004608
%MSG

(and similar) warnings in the recent prompt. This sounds like either data corruption, or wrong cabling map.

I looked at some of the linked logs but I can't see any Warning related to CTPPSPixelDataFormatter module despite @slava77 quotes a few thousands (can you please point me to one of them?).
As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

@fabferro

I looked at some of the linked logs but I can't see any Warning related to CTPPSPixelDataFormatter module despite @slava77 quotes a few thousands (can you please point me to one of them?).

There's plenty of them here: https://slava77sk.web.cern.ch/slava77sk/reco/log/PromptReco_Run366498_ZeroBias/WMTaskSpace/logCollect1/11735be0-80ac-4b03-9427-9428d624bfe3-1-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

Still, I see only errors related to CTPPSPixelDataFormatter, no warnings. As I said, it's correct that they are reported. If there is any warning (MSG-w, not MSG-e) issued by CTPPSPixelDataFormatter, please cut and paste the line, because I can't actually see them.

@slava77

slava77 commented May 2, 2023

If there is any warning (MSG-w not MSG-e) issued by CTPPSPixelDataFormat, please cut and paste the line cause I can't actually see them.

Ehm, apparently TTree::Scan is not working correctly; I guess it still reports the first error with the correct category for a given event, but then gets the rest wrong.

@slava77

slava77 commented May 2, 2023

I now used Python in FWLite to analyze this; the updated top list is:

Warnings

235 EcalRecHitWorkerRecover
358 TrackNaN
616 BeamSpotFromDB
1126 Configuration
1136 L1TStage2uGTTiming
1224 BadMuon
3681 BasicTrajectoryState
24298 TopMonitor
26727 L1TTauOffline
407208 L1T

Errors

1 InvalidGhostTrackState
2 BadRange
4 InvalidState
5 GsfMultiStateUpdator
7 PFTrackTransformer
16 TrackerValidationVariables
225 FailedPropagation
542 TrajectoryNotPosDef
3251 CTPPSPixelDataFormatter
10392 HOCalib

import ROOT

f = open("zb.ews.pyscan.txt", "a")
en = ROOT.TChain("Events")
n = en.Add("root://cms-xrd-global.cern.ch/LFN")
for ev in range(en.GetEntries()):
  n = en.GetEntry(ev)
  # vector<edm::ErrorSummaryEntry> produced by logErrorHarvester
  mv = en.edmErrorSummaryEntrys_logErrorHarvester__RECO.product()
  e = en.EventAuxiliary
  for mm in mv:
    print(e.run(), e.luminosityBlock(), e.event(), mm.severity.getName().data(), mm.count, mm.module, mm.category, file=f)
f.close()

@slava77

slava77 commented May 2, 2023

Full outputs, with columns run lumi event sevLev count module category, are available for Muon0 and ZeroBias. These exclude MemoryCheck.

@slava77

slava77 commented Jul 12, 2023

13_0_9 now

@slava77 would you mind trying the recipe from #41456 (comment) again in a recent run, to see if there are improvements?

I used 2023D run 369956 (the run is a bit larger though, so absolute counts are higher). In ZeroBias I get:

ZeroBias:

  • error counts by category
    • 6 BadRange
    • 31 PFTrackTransformer
    • 36 InvalidState
    • 41 GsfMultiStateUpdator
    • 70 TrackerValidationVariables
    • 296 LCTQualityControl
    • 1528 FailedPropagation
    • 3600 TrajectoryNotPosDef
    • 14651 EcalLaserDbService
    • 14935 CTPPSPixelDataFormatter (seen/discussed before)

The TrackerValidationVariables are like

%MSG-e TrackerValidationVariables:  MonitorTrackResiduals:MonitorTrackResiduals  TrackerValidationVariables::fillHitQuantities() 04-Jul-2023 17:06:32 CEST  Run: 369956 Event: 18239283
One of the squared error methods gives negative result
        errHit.xx()     errHit.yy()     errTrk.xx()     errTrk.yy()
        2.69487e-06     3.40282e+38     -1.41249        2066.1

The LCTQualityControl are like

%MSG-e LCTQualityControl:  CSCTriggerPrimitivesProducer:valCscStage2Digis  04-Jul-2023 17:20:02 CEST Run: 369956 Event: 114602933
CSCCLCTDigi with invalid key half-strip: 65535; allowed [0, 159]

The EcalLaserDbService are apparently coming in bursts and look like

%MSG-e EcalLaserDbService:  EcalRecHitProducer:ecalRecHit@cpu  04-Jul-2023 17:21:18 CEST Run: 369956 Event: 70547281
apdpnref (1) or t_i-t_f (0 is zero!

  • Warning counts by category (top 10)
    • 458 RPCAMCRawToDigi
    • 2028 L1T
    • 2880 SiStripRawToDigi
    • 3729 TrackNaN (somewhat known for tracking with Track has NaN or the cov is not pos-definite message)
    • 5700 Configuration
    • 6774 BadMuon
    • 8640 L1TStage2uGTTiming
    • 22196 BasicTrajectoryState (somewhat known for tracking with curv error not pos-def message)
    • 25215 TopMonitor
    • 280574 NoModule

The SiStripRawToDigi are like

%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigis  04-Jul-2023 16:01:42 CEST Run: 369956 Event: 8224256
NULL pointer to FEDRawData for FED: id 251
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)

There's one per FED per stream; in around 10% of logs it accumulates to (460) (530) (535) (537) (543) (549) (551) (552) (553) (556) (557) (560) (562) (565) (568) (575) (589) (603) (606) (609) (610) (614) (616) (620) (624) (627) (628) (632) (636) (639) (640) (642) (646) (647) (648) (649) (650) (651) (652) (653) (654) (655) (656) (657) (658) (659) (660) (661) (662) (663) (664) (665) (666) (667) (668) (669) (670) (671) (672) (673) (674) (675) (676) (677) (678) (679) (680) (681) (682) (683) (684) (685) (686) (687) (688) (689) (690) (691) (692) (693) (694) (695) (696) (697) (698) (699) (700) (701) (702) (703) (704) (705) (706) (707) (708) (709) (710) (711) (712) (713) (714) (715) (716) (717) (718) (719) (720) (721) (722) (723) (724) (725) (726) (727) (728) (730) (731) (732) (733) (738) (744) (755) (772)

The Configuration are like

%MSG-w Configuration:  HLTHighLevel:ALCARECOTkAlMinBiasNOTHLT  04-Jul-2023 17:03:19 CEST Run: 369956 Event: 46049
none of the requested paths and pattern match any HLT path - no events will be selected

The BadMuon are like

%MSG-w BadMuon:  PileupJetIdProducer:pileupJetId 04-Jul-2023 17:04:37 CEST  Run: 369956 Event: 1702169
Found a PFCandidate muon without a trackRef: falling back to Muon::bestTrack

The L1TStage2uGTTiming are like

%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 04-Jul-2023 17:02:20 CEST  Run: 369956 Stream: 6
Algo "L1_SingleJet60er2p5" not found in the trigger menu L1Menu_Collisions2023_v1_2_0. Could not retrieve algo bit number.

The TopMonitor is perhaps similar to what it was before, e.g.

%MSG-w TopMonitor:  TopMonitor:BTagMu_AK4DiJet20_Mu5 26-Apr-2023 03:44:28 CEST  Run: 366498 Event: 214752
B-Jet handle not valid, will skip event

The NoModule are like:

%MSG-w NoModule:   ByClusterSummaryMultiplicityPairEventFilter:toomanystripclus53X  04-Jul-2023 17:03:11 CEST Run: 369956 Event: 56994
No information for requested module 5 (PIXEL). Please check in the Provenance Infomation for proper modules.

%MSG-w NoModule:   ByClusterSummaryMultiplicityPairEventFilter:toomanystripclus53X  04-Jul-2023 17:03:11 CEST Run: 369956 Event: 44229
No information for requested module 0 (STRIP). Please check in the Provenance Infomation for proper modules.

I will post notes from Muon[01] separately later

@mmusich

mmusich commented Jul 12, 2023

The EcalLaserDbService are apparently coming in bursts and look like

%MSG-e EcalLaserDbService: EcalRecHitProducer:ecalRecHit@cpu 04-Jul-2023 17:21:18 CEST Run: 369956 Event: 70547281
apdpnref (1) or t_i-t_f (0 is zero!

@cms-sw/alca-l2 this looks potentially worrisome. You might want to contact ECAL DPG about it.

@mmusich

mmusich commented Jul 12, 2023

The TopMonitor is perhaps similar to what it was before, e.g.

%MSG-w TopMonitor: TopMonitor:BTagMu_AK4DiJet20_Mu5 26-Apr-2023 03:44:28 CEST Run: 366498 Event: 214752
B-Jet handle not valid, will skip event

@cms-sw/dqm-l2 is it possible to reach out to top dqm developers to silence this warning?

@aloeliger

The L1TStage2uGTTiming are like

%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 04-Jul-2023 17:02:20 CEST  Run: 369956 Stream: 6
Algo "L1_SingleJet60er2p5" not found in the trigger menu L1Menu_Collisions2023_v1_2_0. Could not retrieve algo bit number.

These appear to be from changes that were made in #41972 but don't appear to have been backported. Ultimately I think some consideration was requested about how to get around the changing menu, but I can make the backport if you want.

@mmusich

mmusich commented Jul 12, 2023

@aloeliger

I have a question on #41972.
IIUC the change in the algorithms didn't happen right at the beginning of 2023 data-taking, so assuming there will be reprocessing with >= 13_2_X (current master), does it mean we won't be monitoring the right algos in the first part of the data? (probably a moot question at this point since the data is already taken, I am not sure how L1T offline certification works).

@mmusich

mmusich commented Jul 12, 2023

does it mean we won't be monitoring the right algos in the first part of the data?

actually #41972 (comment) says it all...

I am not sure #41972 was the right approach.

@mmusich

mmusich commented Jul 12, 2023

@fabferro

As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

there are still thousands of LogError emissions per run from the CTPPS unpacker in recent runs (see #41456 (comment)); please comment on the status of the investigations into the data corruption.

@aloeliger

aloeliger commented Jul 12, 2023

@aloeliger

I have a question on #41972. IIUC the change in the algorithms didn't happen right at the beginning of 2023 data-taking, so assuming there will be reprocessing with >= 13_2_X (current master), does it mean we won't be monitoring the right algos in the first part of the data? (probably a moot question at this point since the data is already taken, I am not sure how L1T offline certification works).

@mmusich

The bits hardcoded into the uGT timing config before the change were never correct for 2023. They did not exist in any 2023 L1 menu. If we reprocess 2023 data now, the currently extant bits should be present in all 2023 menus I think.

Ultimately, I think we were not following (and I guess currently are not following online without a backport) the right bits in the first part of the data that we took this year (They were likely just ignored in the DQM in this case?), but I do not believe this particular test to be a make or break decider for L1T certification. It is likely just a uGT debugging tool, and hardware health check by the sound of it.

I opened #41972 to clean up the warnings, but if we want changes beyond that I think the requested solution is not under my purview? If it is, I can take a look at what it requires to change anything.

@mmusich

mmusich commented Jul 12, 2023

I opened #41972 to clean up the warnings

based on #41972 (comment) did #41972 create new warnings in other years?

@aloeliger

I opened #41972 to clean up the warnings

based on #41972 (comment) did #41972 create new warnings in other years?

Very probably. I don't know that it has been reported to me though.

@mmusich

mmusich commented Jul 12, 2023

but if we want changes beyond that I think the requested solution is not under my purview?

Is there any L1T DQM developers team you might reach to in order to suggest / implement a better fix?

@fabferro

fabferro commented Jul 12, 2023

@fabferro

As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

there are still thousands of LogError emissions per run from the CTPPS unpacker in recent runs (see #41456 (comment)), please comment on the status on the investigations on data corruption.

The investigation into the hardware instabilities that generate (among other things) these errors has been ongoing since the start of data taking and is not yet over. Some problems have been fixed, others not (yet). We are struggling to keep as many planes as possible online; this of course has to be balanced against the disruption that those not-perfectly-working planes cause to the whole CMS DAQ and offline processing. Maybe @AndreaBellora can add some more details, but my understanding so far is that we will very likely have to live with these error messages for the rest of this year's data taking.

@mmusich

mmusich commented Jul 12, 2023

but my understanding so far is that we will very likely have to live with these error message for the rest of this year data taking.

Given their prevalence, and the fact that at this point they are very well known in the CT-PPS community, would it be possible to limit the emission to a few times per run?

@AndreaBellora

@fabferro

As to the errors, they are actually there and can't be silenced, in my point of view. Mapping looks correct, they are probably due to some data corruption we are investigating.

there are still thousands of LogError emissions per run from the CTPPS unpacker in recent runs (see #41456 (comment)), please comment on the status on the investigations on data corruption.

The investigation on the hardware instabilities that are generating (also) these errors have been ongoing since the start of data taking and are not yet over. Some problems have been fixed, some others not (yet). We are struggling to keep as many planes as possible online, this of course is being balanced with the disruption that those not-perfectly-working planes cause to the whole CMS DAQ and Offline processing. Maybe @AndreaBellora may add some more details, but my understanding so far is that we will very likely have to live with these error messages for the rest of this year data taking.

Indeed, I can only add that unfortunately there is currently no way to fully remove this data corruption by acting on the hardware (nor is it strictly detrimental to the detector performance, as far as we know). I'd suggest looking for a way to suppress these messages after they reach a certain number of occurrences per run, for example.
We want to know whether they are there (or not), but we are not very interested in getting all of them (and we could always privately re-process some data without this new 'suppression feature' if we wanted to).

@fabferro

but my understanding so far is that we will very likely have to live with these error message for the rest of this year data taking.

given their prevalence - and the fact that at this point they are very well known in the CT-PPS community, would it be possible to limit the emission to few times per run?

After (also) @AndreaBellora's comment, I have nothing against limiting the error messages to a "certain number" per run. Whether the data is corrupted or not is spotted in the EventFilter that issues the LogError message. Practically speaking, I have no idea how to set a maximum number of issued messages. Any help or suggestions?

@mmusich

mmusich commented Jul 12, 2023

Practically speaking I have no idea of how to set a number of maximum issued messages. Any help, suggestions?

@fabferro and @AndreaBellora I would suggest you study the way the SiStrip unpacker deals with this issue:

void printWarningSummary() const { warnings_.printSummary(); }

@AndreaBellora

but my understanding so far is that we will very likely have to live with these error message for the rest of this year data taking.

given their prevalence - and the fact that at this point they are very well known in the CT-PPS community, would it be possible to limit the emission to few times per run?

After (also) @AndreaBellora 's comment, I've nothing against limiting the error messages to a "certain number" per run. Whether the data is corrupted or not is spotted in the EventFilter that issues the LogError message. Practically speaking I have no idea of how to set a number of maximum issued messages. Any help, suggestions?

Practically speaking I have no idea of how to set a number of maximum issued messages. Any help, suggestions?

@fabferro and @AndreaBellora I would suggest you to study the way that the SiStrip unpacker deals with this issue:

void printWarningSummary() const { warnings_.printSummary(); }

Otherwise, wouldn't it be possible to define a limit parameter in the MessageLogger for this module? I'm no expert on that, but I managed to do something similar in the past with some plugins...

@mmusich

mmusich commented Jul 12, 2023

I'm no expert on that, but I could do something similar in the past with some plugins...

I am not sure you have control over that at the level of the Tier-0 configuration.

@slava77

slava77 commented Jul 12, 2023

I'm no expert on that, but I could do something similar in the past with some plugins...

I am not sure you have control on that at the level of Tier-0 configuration.

RecoTLR.customisePrompt and customiseExpress can be used

def customisePrompt(process):

But wouldn't these be needed in reprocessing as well? In that case I'd configure limited emission of errors by default and enable the full output elsewhere for tests.

Similar to the tracker, it may be more practical to produce error objects and rely on analysis of these data products instead of text log files.
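For illustration, a per-category cap can in principle be set via the MessageLogger configuration inside such a customisation function. This is only a sketch assuming the standard MessageLogger per-category limit mechanism; the function name and the limit value are illustrative, not a tested Tier-0 recipe.

```python
import FWCore.ParameterSet.Config as cms

def customiseLimitCTPPSErrors(process):
    # Cap the number of CTPPSPixelDataFormatter messages emitted per job;
    # occurrences beyond the limit are suppressed by the MessageLogger.
    process.MessageLogger.cerr.CTPPSPixelDataFormatter = cms.untracked.PSet(
        limit = cms.untracked.int32(5)
    )
    return process
```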

@slava77

slava77 commented Jul 13, 2023

Here are some details from Muon[01] for the same run 369956, apparently based on 1.9M events (90 /pb run).

Errors (including average frequency per 1000 events):

  • 1 (<0.01) TrajectoryNaN ("new", rare)
  • 4 (<0.01) InvalidGhostTrackState ("new", rare)
  • 29 (0.02) BadRange
  • 116 (0.06) PFTrackTransformer
  • 384 (0.20) InvalidState
  • 393 (0.21) GsfMultiStateUpdator
  • 1669 (0.88) LCTQualityControl
  • 5600 (2.96) FailedPropagation
  • 10405 (5.5) TrajectoryNotPosDef
  • 42668 (22) CTPPSPixelDataFormatter
  • 112521 (59) EcalLaserDbService

FailedPropagation is like the following; it seems like a candidate for a warning:

%MSG-e FailedPropagation:  TrackProducer:lowPtQuadStepTracks  05-Jul-2023 01:47:23 CEST Run: 369956 Event: 12755986
 too many iterations trying to reach plane

Warnings (including average frequency per 1000 events, listing only those with a rate above 0.1 per 1K events):

  • 255 (0.13) EnergyInDeadEB_FE (not in ZeroBias)
  • 366 (0.19) L1T|EMTF
  • 668 (0.35) FreeTrajectoryState
  • 708 (0.37) TSCPBuilderNoMaterial
  • 742 (0.39) PFTrackTransformer
  • 1187 (0.63) TwoTrackMinimumDistance
  • 1684 (0.89) Muon|RecoMuon|MuonUpdatorAtVertex
  • 2091 (1.10) MakeMuonWithTEV
  • 3840 (2.03) SiStripRawToDigi
  • 4537 (2.40) ProductNotValid (not in ZeroBias)
  • 6972 (3.68) L1T
  • 12562 (6.64) TrackNaN
  • 12910 (6.82) L1TTauOffline (not in ZeroBias)
  • 15508 (8.19) BadMuon
  • 70452 (37.22) NoModule (only in bad LS, outside of good by DCS 37 to 1076)
  • 118020 (62.35) BasicTrajectoryState
  • 3328419 (1758.35) TopMonitor

EnergyInDeadEB_FE are like

%MSG-w EnergyInDeadEB_FE:  EcalRecHitProducer:ecalRecHit@cpu  05-Jul-2023 00:38:39 CEST Run: 369956 Event: 9750863
TP energy in the dead TT = 54.5 at (EcalTT subDet Barrel iz -  ieta 10 iphi 53)

ProductNotValid are like

%MSG-w ProductNotValid:   Tau3MuMonitor:HLT_Tau3Mu_Mu7_Mu1_TkMu1_IsoTau15_Charge1_Monitoring  05-Jul-2023 00:34:15 CEST Run: 369956 Event: 9075312
Tau3Mu trigger product not valid

L1TTauOffline are like

%MSG-w L1TTauOffline:  L1TTauOffline:l1tTauOfflineDQMEmu  05-Jul-2023 00:23:57 CEST Run: 369956 Event: 8522774
This offline tau has no comb3T discriminator, skipping

@srimanob

TopMonitor is solved in #42808
