Online-DQM occasionally not picking up the correct IOVs ? #45714

Open
missirol opened this issue Aug 16, 2024 · 8 comments

@missirol
Contributor

missirol commented Aug 16, 2024

There have been cases in recent weeks/months where strange discrepancies were observed in the online-DQM outputs at P5.

  • L1T prescales. On May-27 (2024), strange data-emulator mismatches in the L1T decisions were seen in online-DQM during run-381286 (link). The checks made at the time suggested that, while 381286 was a collisions run, the emulator plots in the online-DQM were using the L1T prescales of the trigger menu used in the previous run (which was a run with "circulating" beams, not collisions). The relevant tag is L1TGlobalPrescalesVetosFract_Stage2v1_hlt. No issues were seen in the O2O logs at the time, nor warnings or crashes anywhere. Below is the L1T report from the day after.

    During collisions run 381286, DQM shifter reported mismatches in EGamma and uGT data vs emulator. While EGamma issue has been reported before (expert checking) the uGT mismatch is new.

    • Emulator picked prescale conditions from circulating key for some reason that is not yet understood
    • Can be a problem related to DQM software

    This slide from L1T Technical Coordination suggests that a similar issue also occurred on Apr-25 (2024), see CMSLITDPG-1257.

  • Data-emulator mismatches related to ECAL Barrel trigger primitives. On Aug-15 (2024), ECAL uploaded new conditions via O2O with IOV starting from run-384485, but during that run data-emulator mismatches showed up in the ECAL online-DQM outputs. This too may be consistent with the online-DQM jobs consuming conditions from an older IOV.

In both examples, the discrepancies disappeared after a new run was started.

At face value, both examples seem compatible with the cmsRun jobs in the online-DQM nodes not picking up the latest (and correct) IOVs, using instead older ones and thus leading to mismatches between real and emulated data in DQM outputs.
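To make the hypothesis above concrete, here is a minimal sketch in plain Python (not CMSSW or CondDB code) of how an IOV lookup resolves a run number to a payload, and how a job holding a stale copy of the IOV list would silently resolve to the older payload. The function name, the payload labels, and the first IOV boundary are hypothetical; only run 384485 is taken from the ECAL example.

```python
from bisect import bisect_right


def resolve_payload(iovs, run):
    """Return the payload label whose IOV covers `run`.

    `iovs` is a list of (since_run, payload_label) pairs sorted by since_run.
    Each IOV is valid from its `since` run up to the next entry's `since`.
    """
    sinces = [since for since, _ in iovs]
    idx = bisect_right(sinces, run) - 1
    if idx < 0:
        raise LookupError(f"no IOV covers run {run}")
    return iovs[idx][1]


# Hypothetical tag contents: labels and the first boundary are invented.
iovs_before_o2o = [(1, "payload_old")]
iovs_after_o2o = iovs_before_o2o + [(384485, "payload_new")]

run = 384485
expected = resolve_payload(iovs_after_o2o, run)  # IOV list including the new O2O upload
stale = resolve_payload(iovs_before_o2o, run)    # IOV list read before the upload

print(expected, stale)  # payload_new payload_old
```

In this toy model, neither lookup fails or warns: the stale one simply returns a valid but outdated payload, which would be consistent with the absence of errors in the logs. Whether this is what actually happens in the online-DQM jobs is exactly what needs to be checked.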

I think it would be helpful if DQM and AlCa-DB could investigate what happened in these cases (O2O logs, etc), with help from framework experts if needed.

If the issue is not specific to online-DQM, but generally related to the access to the conditions database, it could potentially affect the HLT jobs running online as well.

Maybe unrelated, a recent HLT crash possibly caused by a failure in accessing correct conditions (in that case, for the beamspot) is being discussed in #45555.

@cmsbuild
Contributor

cmsbuild commented Aug 16, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @missirol.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign dqm, alca, db

@cmsbuild
Contributor

New categories assigned: dqm,alca,db

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini,@francescobrivio,@saumyaphor4252,@saumyaphor4252,@perrotta,@perrotta,@consuegs,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks

@missirol
Contributor Author

@cms-sw/dqm-l2 @cms-sw/db-l2 @cms-sw/alca-l2

Will you try to address this issue ?

@missirol
Contributor Author

@cms-sw/dqm-l2 @cms-sw/db-l2 @cms-sw/alca-l2

Still wondering if there will be some follow-up. Or the issue is not worth investigating ? Or should more info be provided ?

@perrotta
Contributor

    @cms-sw/dqm-l2 @cms-sw/db-l2 @cms-sw/alca-l2

    Still wondering if there will be some follow-up. Or the issue is not worth investigating ? Or should more info be provided ?

@missirol if the online-DQM jobs consume conditions from an older IOV, I think this is an issue rooted in the online-DQM jobs. If needed (and if I can) I can help with debugging, but @cms-sw/dqm-l2 should first pinpoint which jobs are affected, where the issue could come from, etc.

@missirol
Contributor Author

missirol commented Sep 3, 2024

    if the online-DQM jobs consume conditions from an older IOV, I think this is an issue rooted in the online-DQM jobs.

How can we be sure that this only affects the online-DQM [*] ? Could it be that the online-DQM is just the first (and only ?) place where such an issue would be spotted ?

In the cases given in the description, was anything strange noticed on the DB side and/or in the O2O logs ? (I understood in #45555 (comment) that O2O logs get eventually deleted, so maybe now it's too late to check). @cms-sw/db-l2

[*] From the description

    If the issue is not specific to online-DQM, but generally related to the access to the conditions database, it could potentially affect the HLT jobs running online as well.
