-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CLANG_X] Segmentation violation in ThePEG::EventGenerator::doinit #45872
Comments
assign generators |
cms-bot internal usage |
A new Issue was created by @iarspider. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@Dominic-Stafford @theofil can you please take a look? It seems related to Herwig. Thanks! |
I'll try to have a look next week (unless @theofil has time before then). @iarspider has the CLANG/C++ version or anything else like this changed which might have triggered the issue? |
I don't think so. |
Hi, I'll will also try to look at it next week starting on Tuesday. a) Could we get some instructions how to reproduce the problem in an lxplus session ? best, |
The recipe to reproduce would be along cmssw-el8
cmsrel CMSSW_14_2_CLANG_X_2024-09-05-2300
cd CMSSW_14_2_CLANG_X_2024-09-05-2300/src
cmsenv
runTheMatrix.py -l 535.0 -t 4 |
In case there is a connection to #45510 (as wondered in #45510 (comment)), we updated the C++ standard from 17 to 20 around Jul 12 (cms-sw/cmsdist#9288), and #45510 was opened on Jul 19. But I guess we have lost the information on whether the problem reported in #45510 started on 07-18-2300 IB or earlier (or on the first occurrence of the problem reported in this issue). |
@makortel thanks for the code. indeed the test is passed for CMSSW_14_2_X_2024-09-06-1100 but fails for CMSSW_14_2_CLANG_X_2024-09-05-2300 but both IBs have clang version 18.1.6, at least this is what I get from ideas ? |
The only difference between the default IB and the CLANG IB is that in the default IB the CMSSW code is compiled with Some thoughts
|
I've had a bit of a look at this, and found very weirdly that I can reproduce it locally when I do
That seems to be that Herwig has crashed due to some issue reading the LHE file (though I'm not exactly sure why as the lhe file is the same as the one that ran without issues in a non-CLANG build), and then we keep going despite not having produced the run file, causing a seg fault). For this new issue I'd definitely lay the blame at the fact we skip over any errors from Herwig here -it may or may not be the root cause of the full issue, but it certainly makes it harder to debug. We've kept this block for a long time as it's supposed to get around an issue with Herwig being called before the externalLHEProducer, but I think we should really get rid of it, as running past errors where Herwig would just exit could be the root of what we're seeing here, then deal with the issue with the sequence of calls if it's still occuring. I probably won't have time to try this in the next few days, so if you have time @theofil that would be good, otherwise I'll try to by the end of the week. |
I have checked our opensearch and found that workflow |
I haven't yet found the origin of the problem, but I can reply to this question:
yes @smuzaffar thanks a lot for the info. I see that the could relevant to the crash we see. I will try to have a look if this is really where the problem starts. Would replacing the Apart from the software changes we see in the are there other differences between the 2 IBs for what concerns their builds ? |
As @makortel mentioned we also have updated c++ standard (to c++20) for July 12th IB. By the way, build herwig7 and sherpa in debug mode, I get this stacktrace for workflow 535/step1
|
Note that all these failing workflows ( |
failed for both |
I had very little progress so far unfortunately. I compiled two versions of Herwig under
and run standalone Herwig MC generation, checking if we can generate simple processes without reading external LHE files. We cannot generate any event in While compiling the code in the two releases, I see many warnings regarding to the ThePEG regarding arithmetic operations that I was not used to see before, but that all seem innocent in the CMSSW_14_2_X_2024-09-06-1100 warnings case. However in the CMSSW_14_2_CLANG_X_2024-09-05-2300.txt warnings we see for fist time warning regarding the creation of the which later appears in the crash messages. This to me confirms that the problem is not related with the CMSSW HerwigInterface and there is not much we can do there, but rather with the external package ThePEG, which is needed by Herwig generator. Is there a reason why we build CMSSW with clang while the external packages, like ThePEG are still built with gcc ? Is it sound to use the same binary of the ThePEG in the two cases ? Is it possible to try to have the ThePEG built also with clang instead of gcc when CMSSW is built with clang ? |
Hi, I've had a bit more of a look at this. The first thing to note is it's happening also for standalone Herwig (no external LHE), and there it's happening consistently, so the block in the interface I mentioned earlier definitely isn't to blame. Running with valgrind, it gives the same segfault, but before that it gives a warning about |
The |
This is a compilation warning about EventInfoBase & operator=(const EventInfoBase &) = delete; so the warning is effectively asking to add an accompanying EventInfoBase (const EventInfoBase &) = default; I'd suggest to define the move constructor and assignment operator as well. I don't see how these warnings could affect the generated binaries though. On the other hand, some other questions came to my mind. @smuzaffar Did we rebuild all/relevant the externals with C++20? (or did we inherit those from the CPP20 IB flavor "for free"?) On a quick look it also seems to me we are not passing the C++ standard to ThePEG. Should we? |
RelVals 535.0, 537.0, 538.0 failed with SIGSEGV in
ThePEG::EventGenerator::doinit
:The text was updated successfully, but these errors were encountered: