No execution time comparison available for PRs #43166

Open
mandrenguyen opened this issue Nov 2, 2023 · 20 comments

Comments

@mandrenguyen
Contributor

For the past few months we have not been able to see the CPU impact of a given pull request, which used to be possible with the enable profiling option in the Jenkins tests.
This is a bit problematic for integrating new features, as we won't easily be able to keep track of changes in performance until a pre-release is built.
The issue seems to come from igprof, which apparently can no longer really be supported.
One suggestion from @gartung is to try to move to VTune.

@mandrenguyen
Contributor Author

assign core, reconstruction

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2023

New categories assigned: core,reconstruction

@Dr15Jones,@jfernan2,@makortel,@mandrenguyen,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2023

A new Issue was created by @mandrenguyen Matthew Nguyen.

@rappoccio, @antoniovilela, @sextonkennedy, @makortel, @smuzaffar, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@gartung
Member

gartung commented Nov 2, 2023

The problem is that the cmsRun process itself gets a segfault while being profiled by IgProf. The same segfault might happen when it is profiled with VTune.

@makortel
Contributor

makortel commented Nov 2, 2023

In case IgProf+cmsRun combination crashes, is any information on the job timings saved that can be used for comparison?

@gartung
Member

gartung commented Nov 2, 2023

Usually the FastTimerService job completes, and the per-module averages are available in the raw JSON file in case the resources pie chart is not readable.

@gartung
Member

gartung commented Nov 2, 2023

The IgprofService dumps the profile after the first, middle and next to last event. The first one might not have enough data to be meaningful.

@gartung
Member

gartung commented Nov 2, 2023

@mandrenguyen Can you point me to a PR so I can look at the logs?

@mandrenguyen
Contributor Author

@gartung The last one we tried with profiling enabled was:
#43107

@makortel
Contributor

The crashes under profilers are quite likely caused by the memory corruption inside TensorFlow (when run through IgProf or VTune) that has been investigated in #42444.

@VinInn
Contributor

VinInn commented Dec 15, 2023

The FastTimer Service should suffice. Still, it does not seem to be active in RelVals.

@mmusich
Contributor

mmusich commented Aug 26, 2024

> The issue seems to come from igprof, which apparently can no longer really be supported.
> [...] is to try to move to VTune.

For my education, is this replacement documented somewhere?
I still see igprof listed in the RecoIntegration CMS Twiki.

@jfernan2
Contributor

jfernan2 commented Sep 3, 2024

@mmusich it is expected that VTune gives the same problem as igprof, so the replacement has not been done.
Indeed this is a real showstopper for RECO developments since we cannot monitor the time profiling in PRs

@mmusich
Contributor

mmusich commented Sep 3, 2024

> so the replacement has not been done.
> Indeed this is a real showstopper for RECO developments since we cannot monitor the time profiling in PRs

I see, that's bad news. I gather the same holds true for user checks when developing (regardless of the time profiling in PRs)

@makortel
Contributor

makortel commented Sep 5, 2024

Is the most burning problem that there is no timing information (entire job, per module) or that the real IgProf/VTune profile (with function-level information) is missing (because of crash)?

@mmusich
Contributor

mmusich commented Sep 6, 2024

> Is the most burning problem that there is no timing information (entire job, per module) or that the real IgProf/VTune profile (with function-level information) is missing (because of crash)?

For me (personally) at least, having the function-level information would be really helpful.

@jfernan2
Contributor

jfernan2 commented Sep 6, 2024

IMHO the crash of IgProf/VTune is a problem, even though timing info is available from the FastTimer module. The real issue, however, is not having a comparison of baseline time performance vs. baseline+PR. This forces us to detect total increases in the profiles a posteriori, when a pre-release is built, and then figure out which PR(s) were responsible.

Perhaps a comparison script based on FastTimer output could be useful even if not optimal. Do you think this is possible, @gartung?
Thanks

@gartung
Member

gartung commented Sep 10, 2024

Yes, it would be possible. In fact, there is already a script that merges two FastTimer output files:
https://github.com/fwyzard/circles/blob/master/scripts/merge.py
Changing that to diff two files should be possible.
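
For illustration only, a minimal sketch of what such a diff could look like. It is not derived from merge.py. It assumes each FastTimer JSON file has a top-level "modules" list whose entries carry "type", "label", "events" and a "time_real" field; those field names, and the helper names module_times, diff_resources and load_resources, are assumptions made for this example and should be checked against a real output file.

```python
import json


def module_times(resources, metric="time_real", per_event=False):
    """Map (type, label) -> metric value for every module entry.

    The "modules", "type", "label" and "events" keys are assumed, not
    taken from a verified FastTimerService schema.
    """
    times = {}
    for module in resources.get("modules", []):
        value = float(module.get(metric, 0.0))
        if per_event:
            events = module.get("events", 0)
            if not events:
                continue  # cannot normalise a module that saw no events
            value /= events
        times[(module.get("type", ""), module.get("label", ""))] = value
    return times


def diff_resources(baseline, candidate, metric="time_real", per_event=False):
    """Return rows (type, label, baseline, candidate, delta).

    Rows are sorted by decreasing absolute delta. A module missing from
    one of the two inputs is treated as 0.0 on that side.
    """
    base = module_times(baseline, metric, per_event)
    cand = module_times(candidate, metric, per_event)
    rows = []
    for key in set(base) | set(cand):
        b = base.get(key, 0.0)
        c = cand.get(key, 0.0)
        rows.append((key[0], key[1], b, c, c - b))
    rows.sort(key=lambda row: abs(row[4]), reverse=True)
    return rows


def load_resources(path):
    """Read one FastTimer JSON file from disk."""
    with open(path) as handle:
        return json.load(handle)
```

One would then call something like diff_resources(load_resources("baseline.json"), load_resources("pr.json")), where both file names are placeholders. The per_event option is there because the two jobs may not have processed the same number of events.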

@gartung
Member

gartung commented Sep 10, 2024

@gartung
Member

gartung commented Sep 10, 2024

If you add enable profiling as a comment on the pull request, the FastTimer service is run as part of the profiling.
