
[TF] Update TF v2.16.1 (without libfft) #9388

Closed
iarspider wants to merge 22 commits into IB/CMSSW_14_2_X/tf from tf2.16.1-nolibfft

Conversation

iarspider (Contributor)

Alternative version of #9241

iarspider changed the base branch from IB/CMSSW_14_2_X/master to IB/CMSSW_14_2_X/tf on September 3, 2024 11:39
cmsbuild (Contributor) commented Sep 3, 2024

A new Pull Request was created by @iarspider for branch IB/CMSSW_14_2_X/tf.

@aandvalenzuela, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

cmsbuild (Contributor) commented Sep 3, 2024

cms-bot internal usage

iarspider (Contributor, Author) commented Sep 3, 2024

@cmsbuild please test for CMSSW_14_2_TF_X/el8_aarch64_gcc12

iarspider (Contributor, Author) commented Sep 3, 2024

@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12

iarspider (Contributor, Author)

@cmsbuild please test for CMSSW_14_2_TF_X/el8_aarch64_gcc12

smuzaffar (Contributor)

@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12

cmsbuild (Contributor) commented Sep 5, 2024

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41304/summary.html
COMMIT: 1ba91fc
CMSSW: CMSSW_14_2_TF_X_2024-09-02-2300/el8_aarch64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9388/41304/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41304/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41304/git-merge-result

cmsbuild (Contributor) commented Sep 5, 2024

-1

Failed Tests: GpuUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41308/summary.html
COMMIT: 1ba91fc
CMSSW: CMSSW_14_2_TF_X_2024-09-02-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9388/41308/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41308/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41308/git-merge-result

GPU Unit Tests

I found 9 errors in the following unit tests:

---> test testTFMetaGraphLoadingCUDA had ERRORS
---> test testTFGraphLoadingCUDA had ERRORS
---> test testTFConstSessionCUDA had ERRORS
and more ...

Comparison Summary

Summary:

  • You potentially added 77 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 2370 differences found in the comparisons
  • DQMHistoTests: Total files compared: 44
  • DQMHistoTests: Total histograms compared: 3328276
  • DQMHistoTests: Total failures: 30395
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3297861
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 36.092000000000006 KiB( 43 files compared)
  • DQMHistoSizes: changed ( 11634.0,... ): 2.879 KiB Physics/NanoAODDQM
  • DQMHistoSizes: changed ( 13234.0,... ): 1.771 KiB Physics/NanoAODDQM
  • Checked 193 log files, 163 edm output root files, 44 DQM output files
  • TriggerResults: found differences in 2 / 42 workflows

iarspider (Contributor, Author) commented Sep 5, 2024

@smuzaffar, since the tests depend on TensorFlow (not Keras), the environment was not set up. Should I add an explicit dependency on keras to PhysicsTools/TensorFlow? I don't think we can handle a circular dependency keras ↔ tensorflow.

smuzaffar (Contributor)

@iarspider, can you try running the unit tests locally after setting the KERAS_BACKEND=tensorflow environment variable?
Note that ## INITENV SET KERAS_BACKEND tensorflow only adds/sets this variable via the init.*sh files. For scram, one needs to update the xml file to set it. So if setting KERAS_BACKEND=tensorflow fixes the GPU unit tests, please add KERAS_BACKEND=tensorflow to one of the tf xml files.

smuzaffar mentioned this pull request Sep 9, 2024
cmsbuild (Contributor) commented Sep 9, 2024

Pull request #9388 was updated.

iarspider (Contributor, Author)

@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12

@@ -3,6 +3,7 @@
 <environment name="TENSORFLOW_BASE" default="@TOOL_ROOT@"/>
 <environment name="LIBDIR" default="$TENSORFLOW_BASE/lib"/>
 <environment name="INCLUDE" default="$TENSORFLOW_BASE/include"/>
+<environment name="KERAS_BACKEND" default="tensorflow"/>
Contributor

It should be a <runtime .../>-type variable (see ROOTSYS for an example).
Did you run the tests locally to see if the GPU unit tests passed after setting this?

Contributor Author

I tested it by setting the environment variable manually (not via the toolfile).

Contributor

For me, all the unit tests still fail with this error:

##Failure Location unknown## : Error
Test name: testHelloWorldCUDA::test
uncaught exception of type std::exception (or derived).
- An exception of category 'UnavailableAccelerator' occurred while
   [0] Calling tensorflow::setBackend()
Exception Message:
Cuda backend requested, NVIDIA GPU visible to cmssw, but not visible to TensorFlow in the job

Contributor Author

Some tests that were failing previously worked after setting KERAS_BACKEND. Yes, I saw these failures as well; I thought I had missed some setup step needed to make them work (in a container started with the --nv flag).

Contributor Author

For example, testTFConstSession was failing with ValueError: Unable to import backend : theano, but after setting KERAS_BACKEND it passed.
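
As an illustration (a minimal sketch, not part of the PR): Keras picks its backend from KERAS_BACKEND at import time, falling back to its configured default (e.g. whatever ~/.keras/keras.json says) when the variable is unset, which is how a stray backend such as theano can get selected. So the variable has to be set before keras is first imported:

import os

# Sketch only: choose the backend before keras is imported.
os.environ.setdefault("KERAS_BACKEND", "tensorflow")

import keras  # the variable must already be set at this point

print(keras.backend.backend())  # expected to print "tensorflow"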

Contributor Author

Could the failure be due to CUDA 12.4 not being an officially tested version for TF 2.16.1 (or even 2.17)? The link lists 12.3 as the officially tested version.

Contributor Author

Running python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" prints this message:

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

and returns an empty list []. I googled this message, and there are basically three solutions:

  1. Run TensorFlow using the official Docker image
  2. Install TensorFlow using conda and prebuilt wheels
  3. Force-connect the NUMA node (as suggested in the document that TensorFlow points to), namely run sudo echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:06\:10.0/numa_node after each reboot. But that requires sudo rights (and, I would imagine, on the host rather than in the container); see the sketch after this list.
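
For reference, a small check (sketch only; the PCI address is just the one from the workaround above and has to be adjusted to the actual GPU) of what the kernel reports before applying option 3:

from pathlib import Path

# Hypothetical PCI address copied from the workaround above; adjust to the real device.
numa_file = Path("/sys/bus/pci/devices/0000:06:10.0/numa_node")

# -1 is what triggers the "negative value (-1)" warning quoted above;
# after the tee workaround this file should read 0.
print(f"numa_node = {numa_file.read_text().strip()}")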

Contributor

Are you sure you started cmssw-el8 with the --nv option? For me, the following command

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

runs fine (both for this PR and the TF_X IBs) and returns

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Contributor Author

Now it works for me as well, weird.

iarspider (Contributor, Author)

please abort

cmsbuild (Contributor) commented Sep 9, 2024

Pull request #9388 was updated.

iarspider (Contributor, Author)

@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12

cmsbuild (Contributor) commented Sep 9, 2024

-1

Failed Tests: UnitTests GpuUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41405/summary.html
COMMIT: 34311da
CMSSW: CMSSW_14_2_TF_X_2024-09-06-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9388/41405/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41405/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-eef346/41405/git-merge-result

Unit Tests

I found 1 errors in the following unit tests:

---> test testSiStripPayloadInspector had ERRORS

GPU Unit Tests

I found 9 errors in the following unit tests:

---> test testBrokenLineFitGPU_t had ERRORS
---> test testFitsGPU_t had ERRORS
---> test testTFGraphLoadingCUDA had ERRORS
and more ...

Comparison Summary

Summary:

  • You potentially added 1448 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 2416 differences found in the comparisons
  • DQMHistoTests: Total files compared: 44
  • DQMHistoTests: Total histograms compared: 3328859
  • DQMHistoTests: Total failures: 35091
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3293748
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 1831.5049999999999 KiB( 43 files compared)
  • DQMHistoSizes: changed ( 11634.0,... ): 96.395 KiB GEM/Segment_TnP
  • Checked 193 log files, 163 edm output root files, 44 DQM output files
  • TriggerResults: no differences found

iarspider (Contributor, Author)

ChatGPT suggests using Session::ListDevices to check whether a GPU is available, instead of mutable_device_count():

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/common_runtime/device_factory.h"

#include <iostream>

int main() {
    // Initialize a session
    tensorflow::Session* session;
    tensorflow::SessionOptions options;
    
    // Try to create a new session
    tensorflow::Status status = tensorflow::NewSession(options, &session);
    if (!status.ok()) {
        std::cerr << "Error creating session: " << status.ToString() << std::endl;
        return -1;
    }

    // Retrieve the list of available devices
    std::vector<tensorflow::DeviceAttributes> devices;
    status = session->ListDevices(&devices);
    if (!status.ok()) {
        std::cerr << "Error listing devices: " << status.ToString() << std::endl;
        return -1;
    }

    // Check if any GPU devices are available
    bool gpu_available = false;
    for (const auto& device : devices) {
        std::cout << "Device name: " << device.name() << ", type: " << device.device_type() << std::endl;
        if (device.device_type() == "GPU") {
            gpu_available = true;
        }
    }

    if (gpu_available) {
        std::cout << "GPU is available and can be used." << std::endl;
    } else {
        std::cout << "No GPU devices are available." << std::endl;
    }

    // Clean up
    session->Close();
    delete session;

    return 0;
}

iarspider closed this Sep 11, 2024
smuzaffar deleted the tf2.16.1-nolibfft branch September 18, 2024 21:18