Update CUDA to support Pascal, Volta, Turing, Ampere and Lovelace GPUs #8767


fwyzard
Contributor

@fwyzard fwyzard commented Oct 18, 2023

No description provided.

@fwyzard
Contributor Author

fwyzard commented Oct 18, 2023

enable gpu

@fwyzard
Contributor Author

fwyzard commented Oct 18, 2023

please test

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_13_3_X/master.

@iarspider, @smuzaffar, @aandvalenzuela can you please review it and eventually sign? Thanks.
@antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

-1

Failed Tests: RelVals-GPU GpuUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35252/summary.html
COMMIT: b93cfa0
CMSSW: CMSSW_13_3_X_2023-10-17-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8767/35252/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35252/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35252/git-merge-result

RelVals-GPU

  • 12434.587 12434.587_TTbar_14TeV+2023_Patatrack_AllTripletsGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_AllTripletsGPU_Validation.log

GPU Unit Tests

I found 2 errors in the following unit tests:

---> test testTFVisibleDevicesCUDA had ERRORS
---> test testEigenGPUNoFit_t had ERRORS

Comparison Summary

Summary:

  • You potentially added 9 lines to the logs
  • Reco comparison results: 62637 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3357400
  • DQMHistoTests: Total failures: 155151
  • DQMHistoTests: Total nulls: 269
  • DQMHistoTests: Total successes: 3201958
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.106 KiB (49 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): -0.308 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 12634.0 ): 0.288 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 250202.181 ): -0.288 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): 0.978 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 7.3 ): -0.564 KiB SiStrip/MechanicalView
  • Checked 214 log files, 167 edm output root files, 50 DQM output files
  • TriggerResults: found differences in 18 / 48 workflows

@smuzaffar
Contributor

please test

@smuzaffar
Contributor

smuzaffar commented Oct 19, 2023

test parameters:

  • full_cmssw = true
  • workflows_gpu = 160.03502,12434.504,12434.503

@smuzaffar
Contributor

please test

@cmsbuild
Contributor

-1

Failed Tests: GpuUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35285/summary.html
COMMIT: b93cfa0
CMSSW: CMSSW_13_3_X_2023-10-19-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8767/35285/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Unit Tests

I found 2 errors in the following unit tests:

---> test testTFVisibleDevicesCUDA had ERRORS
---> test testEigenGPUNoFit_t had ERRORS

Comparison Summary

Summary:

  • You potentially removed 137 lines from the logs
  • Reco comparison results: 11 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3357400
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3357366
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (49 files compared)
  • Checked 214 log files, 167 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • You potentially added 160 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 52 differences found in the comparisons
  • DQMHistoTests: Total files compared: 2
  • DQMHistoTests: Total histograms compared: 19678
  • DQMHistoTests: Total failures: 2007
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 17671
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (1 file compared)
  • Checked 15 log files, 14 edm output root files, 2 DQM output files
  • TriggerResults: found differences in 1 / 1 workflows

@smuzaffar
Contributor

please test for el8_aarch64_gcc12

@smuzaffar
Contributor

please test for el8_ppc64le_gcc12

@smuzaffar
Contributor

@fwyzard, the GPU tests for x86_64 worked this time. The remaining unit test failures are:

  • testEigenGPUNoFit_t: this was already failing because of the Eigen update.
  • testTFVisibleDevicesCUDA: the TensorFlow CUDA part was disabled for gcc12, which is why this test is failing.

Are you happy with the current list of CUDA archs? Once the aarch64/ppc64le tests are done, I can merge this.
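
The triage described in this comment (separating already-known failures from new ones) can be sketched as follows. This is only an illustration, not cms-bot's actual logic: `KNOWN_FAILURES` and `triage` are hypothetical names, and the report format is assumed to match the `---> test ... had ERRORS` lines shown in the bot output above.

```python
import re

# Hypothetical table of failures already understood in this thread.
KNOWN_FAILURES = {
    "testEigenGPUNoFit_t": "already failing due to the Eigen update",
    "testTFVisibleDevicesCUDA": "TensorFlow CUDA part disabled for gcc12",
}


def triage(report: str) -> tuple[dict[str, str], list[str]]:
    """Split failing tests into known failures (with reason) and new ones."""
    known: dict[str, str] = {}
    new: list[str] = []
    pattern = r"^---> test (\S+) had ERRORS$"
    for name in re.findall(pattern, report, flags=re.MULTILINE):
        if name in KNOWN_FAILURES:
            known[name] = KNOWN_FAILURES[name]
        else:
            new.append(name)
    return known, new


report = """---> test testTFVisibleDevicesCUDA had ERRORS
---> test testEigenGPUNoFit_t had ERRORS"""

known, new = triage(report)
print("known:", sorted(known))
print("new:", new)
```

With the two failures reported for x86_64, both fall into the known set and the list of new failures is empty.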

@fwyzard
Contributor Author

fwyzard commented Oct 19, 2023

@smuzaffar

> are you happy with current list of cuda archs? once aarch64/ppc64le tests are done then I can merge this

The current list should cover all GPUs we currently have access to.
To be more complete, we could add the architectures for the A40/A10 (sm_86) and H100 (sm_90) GPUs:

-%define cuda_arch 60 70 75 80 89
+%define cuda_arch 60 70 75 80 86 89 90

Do you have some idea of how much that would increase the build time and binary size?
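
For context on why a longer list could affect build time and binary size: each entry in `cuda_arch` typically becomes one more nvcc code-generation target. A minimal sketch, assuming the list is expanded into standard `-gencode` flags. How cmsdist actually consumes `cuda_arch` is not shown in this thread, and `gencode_flags` is a hypothetical helper.

```python
def gencode_flags(cuda_arch: str) -> list[str]:
    """Expand a space-separated list of compute capabilities into nvcc flags."""
    flags = []
    for arch in cuda_arch.split():
        # One -gencode entry per architecture: native code for that sm_XX target.
        flags.append(f"-gencode arch=compute_{arch},code=sm_{arch}")
    return flags


# The list currently proposed in this PR.
current = "60 70 75 80 89"
for flag in gencode_flags(current):
    print(flag)
```

Every additional architecture adds one more compiled copy of each device kernel, so the cost grows roughly with the length of the list.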

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35296/summary.html
COMMIT: b93cfa0
CMSSW: CMSSW_13_3_X_2023-10-18-2300/el8_aarch64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8767/35296/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35296/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35296/git-merge-result

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35297/summary.html
COMMIT: b93cfa0
CMSSW: CMSSW_13_3_X_2023-10-18-2300/el8_ppc64le_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8767/35297/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35297/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1f586/35297/git-merge-result

Unit Tests

I found 1 errors in the following unit tests:

---> test testCSCTriggerMapping had ERRORS

RelVals

  • 24900.0 24900.0_CloseByPGun_CE_H_Coarse_Scint+2026D98/step2_CloseByPGun_CE_H_Coarse_Scint+2026D98.log
  • 24896.0 24896.0_CloseByPGun_CE_E_Front_120um+2026D98/step2_CloseByPGun_CE_E_Front_120um+2026D98.log
  • 24834.0 24834.0_TTbar_14TeV+2026D98/step2_TTbar_14TeV+2026D98.log
Expand to see more relval errors ...

@fwyzard
Contributor Author

fwyzard commented Oct 20, 2023

@smuzaffar let's merge this, we can add more architectures later.

@smuzaffar
Contributor

> @smuzaffar let's merge this, we can add more architectures later.

Sure, once 13.3.0.pre4 is uploaded I will merge this.

> Do you have some idea of how much that would increase the build time and binary size?

The CMSSW/lib binary size increased by ~3% (39 MB: from 1189 MB to 1228 MB). I do not have access to the individual build time of each package, but the overall full CMSSW build time is nearly the same, between 150 and 155 min.
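
The ~3% figure follows directly from the two library sizes quoted in this comment. A quick check of the arithmetic, using only those numbers:

```python
# Sizes of CMSSW/lib as quoted in the comment, in MB.
before_mb = 1189
after_mb = 1228

delta_mb = after_mb - before_mb
increase_pct = 100 * delta_mb / before_mb

print(f"+{delta_mb} MB ({increase_pct:.1f}%)")  # prints "+39 MB (3.3%)"
```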

@smuzaffar
Contributor

+externals

@smuzaffar smuzaffar merged commit fe6c909 into cms-sw:IB/CMSSW_13_3_X/master Oct 20, 2023
23 of 26 checks passed
@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_13_3_X/master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard fwyzard deleted the IB/CMSSW_13_3_X/master_update_cuda_arch branch December 20, 2023 10:24