DeepTauId failures in RelVals (Incompatible shapes) #44333

Closed
AdrianoDee opened this issue Mar 7, 2024 · 47 comments

Comments

@AdrianoDee
Contributor

AdrianoDee commented Mar 7, 2024

Running RelVals, we are observing some failures due to a TensorFlow exception coming from the DeepTauId module. Some examples are listed here.

1) 2023 Data reHLT + reRECO

In HLTDR3_2023 step in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7 in 14_0_0_pre3 RelVals

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 11 event: 22076365 stream: 0
[1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducerForVBFIsoTau'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

with the config here, that is what we get from wf 141.035 running L1REPACK:Full,HLT:@relval2024 (HLT pointing at GRun here). The error here. The wf on Stats2.

Also in the same step in 13_3_0_pre5 RunDisplacedJet2023C in a different path (HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6 ) run in HLT:@relval2023. The error here. The wf on Stats2.

2) 2022 Data reHLT + reRECO

Much rarer: in the AODNANORUN3_reHLT_2022 step, in deepTau2017v2p1ForMini, in RunJetMET2022D with 14_0_0. The error here. The wf on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 357735 lumi: 20 event: 32782226 stream: 0
[1] Running path 'NANOEDMAODoutput_step'
[2] Prefetching for module PoolOutputModule/'NANOEDMAODoutput'
[3] Prefetching for module SimpleCandidateFlatTableProducer/'boostedTauTable'
[4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
[5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
[6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
[7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
[8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
[9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
[10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
[11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
[12] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[13] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

3) MC 2023

In DigiPU_2023PU step in hltHpsPFTauDeepTauProducer in RelValTenTau_15_500 with 13_3_0_pre1 (at the moment the first occurrence I found). The error here. The wf on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 18 event: 1707 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

CPU

At the moment it appears that in all cases the jobs were running on Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (or on a Gold one), Cascade Lake (see #44333 (comment)).
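
The shape pairs in the exception messages above can be checked against NumPy-style broadcasting rules, which TensorFlow's element-wise operations follow. What follows is a minimal C++ sketch, not code from CMSSW or TensorFlow; `Shape`, `broadcastCompatible`, and `numElements` are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical alias: a tensor shape as a list of dimension sizes.
using Shape = std::vector<std::int64_t>;

// Returns true if two shapes can be broadcast together under NumPy-style rules:
// dimensions are compared starting from the last one, and each pair must either
// be equal or contain a 1. Missing leading dimensions are treated as 1.
bool broadcastCompatible(const Shape& a, const Shape& b) {
  const std::size_t n = std::max(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) {
    const std::int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    const std::int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) {
      return false;
    }
  }
  return true;
}

// Number of elements in a tensor of the given shape (0 if any dimension is 0).
std::int64_t numElements(const Shape& s) {
  std::int64_t n = 1;
  for (std::int64_t d : s) {
    assert(d >= 0);  // dimension sizes are never negative in this sketch
    n *= d;
  }
  return n;
}
```

For the shapes `{0, 1, 1, 64}` and `{154}`, `broadcastCompatible` returns `false` because the trailing dimensions 64 and 154 differ and neither is 1. `numElements` returns 0 for the first shape, i.e. the left-hand tensor has a leading dimension of 0 and is empty.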

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2024

A new Issue was created by @AdrianoDee.

@Dr15Jones, @antoniovilela, @smuzaffar, @makortel, @sextonkennedy, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@AdrianoDee
Contributor Author

assign hlt

@AdrianoDee
Contributor Author

assign pdmv

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2024

New categories assigned: hlt,pdmv

@Martin-Grunewald,@mmusich,@AdrianoDee,@sunilUIET,@miquork you have been requested to review this Pull request/Issue and eventually sign? Thanks

@AdrianoDee AdrianoDee changed the title from "DeepTau failures in HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7" to "DeepTauId failures in HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7 in RelVals" Mar 7, 2024

@mmusich
Contributor

mmusich commented Mar 7, 2024

@cms-sw/tau-pog-l2 FYI

@mmusich
Contributor

mmusich commented Mar 7, 2024

type tau

@cmsbuild cmsbuild added the tau label Mar 7, 2024
@mmusich
Contributor

mmusich commented Mar 7, 2024

just as an observation this path is not new (first included in the GRun menu in 2022, https://its.cern.ch/jira/browse/CMSHLT-2289)

EDIT but was touched recently in https://its.cern.ch/jira/browse/CMSHLT-3052

@mmusich
Contributor

mmusich commented Mar 7, 2024

@cms-sw/pdmv-l2

In data reHLT+reRECO RelVals we are observing some failures at HLTDR_2023 step in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7

Please help filling in some information:

  • In which release is this happening?
  • Is it reproducible?
  • Does it affect all jobs of the relvals?
  • Is there a pattern w.r.t. the CPU microarchitecture of the node on which the job lands?

@Martin-Grunewald
Contributor

Martin-Grunewald commented Mar 7, 2024

I can't find it in the Dashboard. Since it is labelled HLTDR_2023, and the path in question is not in the Fake* menus, it must be in some 13_X release running the actual 2023 HLT with the 2023 version of that path.

@AdrianoDee
Contributor Author

Quick answers:

  • this happened in both 14_0_0_pre3 and 14_0_0, but I'm tracking it back to older releases (I'll come back as soon as I find the first occurrence);
  • it happens only on a fraction of the jobs, and the fraction itself is quite random (it fluctuates around a few percent of events failing).

For the reproducibility and the CPU pattern I'll need a moment to check those.

@Martin-Grunewald
Contributor

Martin-Grunewald commented Mar 7, 2024

Hmm well, in 14_X, HLTDR_2023 should (now) run the Fake* menus, while the real HLT menus should be within HLTDR_2024.

@mmusich
Contributor

mmusich commented Mar 7, 2024

in 14_X, HLTDR_2023 should (now) run the Fake* menus, while the real HLT menus should be within HLTDR_2024

Indeed, the configuration linked above has L1REPACK:Full,HLT:@relval2024, but in the absence of real 2024 data we're running the 2024 menu on 2023 data.

@AdrianoDee
Contributor Author

I see the same (similar) error

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 122 event: 206577729 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

in 13_3_0_pre5 RunDisplacedJet2023C running L1REPACK:Full,HLT:@relval2023.

@mmusich
Contributor

mmusich commented Mar 7, 2024

HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6

This is a different path, so it points to a general problem with DeepTauId (not specific to one path).

@AdrianoDee AdrianoDee changed the title from "DeepTauId failures in HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7 in RelVals" to "DeepTauId failures in RelVals" Mar 7, 2024
@AdrianoDee AdrianoDee changed the title from "DeepTauId failures in RelVals" to "DeepTauId failures in RelVals (Incompatible shapes)" Mar 7, 2024

@Dr15Jones
Contributor

For context, it appears the exception comes from here:

Status status = session->Run(runOptions, inputs, outputNames, {}, outputs, nullptr, threadPoolOptions);
if (!status.ok()) {
  throw cms::Exception("InvalidRun") << "error while running session: " << status.ToString();
}

@makortel
Contributor

makortel commented Mar 7, 2024

assign ml

@makortel
Contributor

makortel commented Mar 7, 2024

assign reconstruction

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2024

New categories assigned: ml,reconstruction

@jfernan2,@mandrenguyen,@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

This failure was now seen in Tier0 PromptReco https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5 .

@mmusich
Contributor

mmusich commented Mar 16, 2024

urgent

This failure was now seen in Tier0 PromptReco https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5

I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate more deeply this TF behaviour.

@valsdav, we have established that this issue can affect Prompt Reconstruction and (potentially, when the new nodes for the HLT farm arrive) also online trigger operations. Please prepare PRs with guards to avoid the execution of the model with empty inputs.
Thank you.

Marco (as ORM)
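
The idea of guarding model execution against empty inputs can be sketched as below. This is only an illustration under stated assumptions, not the code of the actual fix: `TensorStub`, `NamedTensor`, `anyInputEmpty`, and `runIfNonEmpty` are hypothetical names, and `TensorStub` stands in for a real framework tensor so that the example compiles without TensorFlow.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a framework tensor: only the shape matters here.
struct TensorStub {
  std::vector<std::int64_t> shape;

  // Product of all dimensions; 0 as soon as one dimension is 0.
  std::int64_t numElements() const {
    std::int64_t n = 1;
    for (std::int64_t d : shape) {
      assert(d >= 0);  // dimension sizes are never negative in this sketch
      n *= d;
    }
    return n;
  }
};

using NamedTensor = std::pair<std::string, TensorStub>;

// True if any input has zero elements (e.g. a leading dimension of 0).
bool anyInputEmpty(const std::vector<NamedTensor>& inputs) {
  for (const auto& in : inputs) {
    if (in.second.numElements() == 0) {
      return true;
    }
  }
  return false;
}

// Calls runModel only when every input is non-empty; returns whether it ran.
// What to produce when inference is skipped is left to the caller.
bool runIfNonEmpty(const std::vector<NamedTensor>& inputs,
                   const std::function<void(const std::vector<NamedTensor>&)>& runModel) {
  if (anyInputEmpty(inputs)) {
    return false;
  }
  runModel(inputs);
  return true;
}
```

With an input of shape `{0, 1, 1, 64}`, `runIfNonEmpty` returns `false` without invoking the callback, so the session is never asked to evaluate an empty tensor. Which default output a producer should emit in that case is a separate design decision that this sketch does not address.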

@mmusich
Contributor

mmusich commented Mar 19, 2024

for record, the proposed fixes are:

@jfernan2
Contributor

+1
solved by #44455

@valsdav
Contributor

valsdav commented Mar 20, 2024

+ml

Basic guards to solve the empty input problem in DeepTauId are in place, but the reason for the empty grid needs to be investigated with Tau experts.

A more general guard for empty inputs will be added (see #44481).

@AdrianoDee
Contributor Author

AdrianoDee commented Mar 20, 2024

+pdmv
(really only the reporter)

@mmusich
Contributor

mmusich commented Mar 20, 2024

... hlt will sign once the 14.0.X PR is merged and tested in IBs.

@mmusich
Contributor

mmusich commented Mar 20, 2024

but the reason of the empty grid needs to be investigated with Tau experts.

@cms-sw/reconstruction-l2 this looks like it needs a separate issue. Can you open one?

@mmusich
Contributor

mmusich commented Mar 25, 2024

+hlt

  • no issues observed after the 14.0.X PR got merged and tested in IBs.

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close


10 participants