Condor submission failed at cmslpc #2101

Open
qliphy opened this issue Feb 19, 2019 · 6 comments

Comments

@qliphy
Collaborator

qliphy commented Feb 19, 2019

@AndreasAlbert @kdlong

It seems the gridpack condor submission script does not work on most of the cmslpc nodes.
This might be due to the "Moving to a central scheduling model" change at cmslpc, as mentioned here

The error message is below:
raise ClusterManagmentError, 'could not import htcondor python API: \n%s' % error
madgraph.various.cluster.ClusterManagmentError: could not import htcondor python API:

It seems we should update "source_condor.sh" so that the HTCondor Python bindings are available in the submission step?
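A quick way to confirm whether the bindings are visible from a given node is to attempt the import directly. The following is only a minimal sketch: the helper name `check_python_module` is hypothetical and is not part of MadGraph or the gridpack scripts.

```python
import importlib


def check_python_module(name="htcondor"):
    """Return (ok, message) saying whether module `name` can be imported."""
    try:
        importlib.import_module(name)
    except ImportError as error:
        # Same failure class that MadGraph reports as ClusterManagmentError
        return False, "could not import %s: %s" % (name, error)
    return True, "%s is importable" % name


if __name__ == "__main__":
    ok, message = check_python_module()
    print(message)
```

Running this in the same shell environment that the submission step uses (for example after sourcing "source_condor.sh") should show whether the import failure comes from the environment rather than from the gridpack script itself.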

Ref: https://hypernews.cern.ch/HyperNews/CMS/get/generators/4243/1/1/1/1/1/1.html

@qliphy
Collaborator Author

qliphy commented Feb 19, 2019

@Saptaparna is also checking this.

@Saptaparna
Contributor

Yes, I have submitted a ticket, after verifying that condor submission still works for simple submission scripts. I will provide an update based on feedback from the LPC experts.

@Saptaparna
Contributor

Saptaparna commented Mar 12, 2019

The problem stems from the fact that condor_submit has changed under the hood. At the LPC, condor_submit is no longer the native HTCondor command; the full redefinition can be inspected with: more /usr/local/bin/condor_submit. In addition, the gridpack generation script assumes a local condor schedd, but the condor refactor at the LPC was specifically geared toward moving away from that setup.
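To check on any given node whether the condor_submit found on the PATH is a wrapper script rather than a compiled binary, one can look at the first bytes of the file. This is a rough sketch under the assumption that the wrapper starts with a "#!" line; the function names are hypothetical.

```python
import shutil


def classify_header(head):
    """Classify the leading bytes of a file: '#!' means an interpreted script."""
    return "script" if head[:2] == b"#!" else "binary"


def describe_executable(path):
    """Read the first two bytes of `path` and classify the file."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(2))


if __name__ == "__main__":
    found = shutil.which("condor_submit")
    if found is None:
        print("condor_submit not found on PATH")
    else:
        print("%s looks like a %s" % (found, describe_executable(found)))
```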

@qliphy
Collaborator Author

qliphy commented Mar 12, 2019

@Saptaparna Thanks for the information! Do you have a workaround? And, if possible, could you update our script to make it work at cmslpc?

@Saptaparna
Contributor

@qliphy The obvious workaround is to replace condor_submit with its cmslpc version. There is an added complication: the schedd name must not be hard-coded, and a list of schedulers has to be provided (but this list may change over time). I am working on this now, following the suggestions of some of the experts here at the LPC.
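One possible shape for avoiding a hard-coded schedd name is sketched below. It is only an illustration, not the actual fix: the environment variable `GRIDPACK_SCHEDDS` and all function names are assumptions. The optional discovery step uses `htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)` from the HTCondor Python bindings; whether that query is permitted from the cmslpc interactive nodes is not confirmed in this thread.

```python
import os
import random


def parse_schedd_list(raw):
    """Split a comma-separated string of schedd names, dropping empty entries."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def schedd_names_from_collector():
    """Ask the collector for schedd names; return [] if the bindings are missing."""
    try:
        import htcondor
    except ImportError:
        return []
    ads = htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)
    return [ad["Name"] for ad in ads if "Name" in ad]


def choose_schedd(names, rng=random):
    """Pick one schedd from the provided list instead of hard-coding a name."""
    if not names:
        raise RuntimeError("no schedd names available")
    return rng.choice(sorted(names))


if __name__ == "__main__":
    # GRIDPACK_SCHEDDS is a hypothetical variable used only for this sketch
    names = parse_schedd_list(os.environ.get("GRIDPACK_SCHEDDS", ""))
    if not names:
        names = schedd_names_from_collector()
    if names:
        print("would submit via schedd: %s" % choose_schedd(names))
    else:
        print("no schedd names available")
```

Because the list of schedulers may change over time, reading it at submission time (from configuration or from the collector) avoids baking a name into the script.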

@qliphy
Collaborator Author

qliphy commented Mar 28, 2019

Suggestions from an FNAL computing expert are below, for your reference:


You CAN run on CMS Connect to T3_US_FNALLPC, so that's a really good point of Dave's: if everyone running gridpack just submits through CMS Connect, their jobs can run at T3_US_FNALLPC without rewriting the code.

One has to be sure that your CMS grid certificate is mapped to your FNAL username to be allowed in from CMS Connect and CRAB jobs, which I believe is true for all the gridpack users in this email.

-Marguerite

Dr. Marguerite Tonjes
LPC Computing Support https://lpc.fnal.gov/computing
tonjes@fnal.gov
Skype: phMarguerite CMS Experiment Mattermost: @belt
office: (630) 840-2859 FNAL WH11E

On Mar 25, 2019, at 10:48 AM, David A Mason dmason@fnal.gov wrote:

Though I would have a better question -- if it works through CMS Connect why is there a rewrite?

On Mar 25, 2019, at 10:03 AM, Marguerite Tonjes tonjes@fnal.gov wrote:

Yes, you are affected by the same gridpack issue Sapta has found here:
#2101

Basically the condor refactor has "condor_submit" as a python wrapper script which talks to the negotiators and then the schedulers. The condor schedulers are no longer located on the interactive nodes.

I was told that gridpack can be run in CMS Connect, so you won't be out of CPUs while the rewrite is happening.

We do encourage your group to reach out if you need help in re-writing the scripts.

-Marguerite

