
add fbgemm/cci.20210316 #4927

Closed
SpaceIm wants to merge 5 commits

Conversation

@SpaceIm (Contributor) commented Mar 17, 2021

Specify library name and version: fbgemm/cci.20210316

  • I've read the guidelines for contributing.
  • I've followed the PEP8 style guides for Python code in the recipes.
  • I've used the latest Conan client version.
  • I've tried at least one configuration locally with the conan-center hook activated.

closes #4428

There are still several issues with:

  • gcc 5 in Debug mode
  • clang < 6
  • clang >= 6 in Debug mode
  • MinGW

@SpaceIm marked this pull request as draft March 17, 2021 09:46

@SpaceIm (Contributor, Author) commented Mar 23, 2021

I think I understand: fbgemm relies on a new C++17 behavior for static constexpr, so it's more or less broken with the C++14 standard.
pytorch/FBGEMM#573
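
For context, a minimal sketch of the difference (illustrative code, not taken from FBGEMM):

#include <algorithm>
#include <cstdio>

struct Params {
    // In C++14 a static constexpr data member is not implicitly inline: if it is
    // ODR-used it needs an out-of-class definition such as
    //   constexpr int Params::kTileSize;
    // or the link fails with an undefined symbol. In C++17 the member is
    // implicitly inline and no separate definition is required.
    static constexpr int kTileSize = 16;
};

int main() {
    int n = 20;
    // std::min takes its arguments by const reference, so kTileSize is ODR-used here.
    std::printf("%d\n", std::min(n, Params::kTileSize));
    return 0;
}

That would be consistent with the Debug-only failures listed above: in Debug builds the reference to the member is kept rather than optimized away, and the missing definition surfaces at link time.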

@SpaceIm SpaceIm marked this pull request as ready for review March 23, 2021 21:18

@SpaceIm (Contributor, Author) commented Mar 23, 2021

Assertion failed: (0 && "unknown architecure"), function PackBMatrix, file /Users/jenkins/w/BuildSingleReference@2/.conan/data/fbgemm/cci.20210316/_/_/build/404fde2faa3a37dc9d67d7e86ee41ee9d90a8f5b/source_subfolder/src/PackBMatrix.cc, line 228.
ERROR: fbgemm/cci.20210316 (test package): Error in test() method, line 17
	self.run(bin_path, run_environment=True)
	ConanException: Error -6 while executing DYLD_LIBRARY_PATH="/Users/jenkins/w/BuildSingleReference@2/.conan/data/fbgemm/cci.20210316/_/_/package/404fde2faa3a37dc9d67d7e86ee41ee9d90a8f5b/lib:/Users/jenkins/w/BuildSingleReference@2/.conan/data/asmjit/cci.20210306/_/_/package/ba203d82ae0020eccba7236c3748eb8f79fceaf6/lib:/Users/jenkins/w/BuildSingleReference@2/.conan/data/cpuinfo/cci.20201217/_/_/package/9487036bb519a194183bc395da97637cb9f00de3/lib" DYLD_FRAMEWORK_PATH="" bin/test_package

@jgsogo It fails on the macOS agents because fbgemm requires at least AVX2 support.

@jgsogo (Contributor) commented Mar 26, 2021

According to this post (http://csharpmulticore.blogspot.com/2014/12/how-to-check-intel-avx2-support-on-mac-os-x-haswell.html), we can check for AVX2 support using

sysctl -a | grep machdep.cpu.leaf7_features

I'll run this line on our MacOS machines...
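
A rough sketch of the same check done from code (the hw.optional.avx2_0 sysctl key is an assumption, based on what Intel Macs typically expose):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <cstdio>

int main() {
    // Query the (assumed) hw.optional.avx2_0 key; a missing key or a value of 0
    // means the CPU does not report AVX2 support.
    int has_avx2 = 0;
    size_t size = sizeof(has_avx2);
    if (sysctlbyname("hw.optional.avx2_0", &has_avx2, &size, nullptr, 0) != 0) {
        has_avx2 = 0;
    }
    std::printf("AVX2 %s\n", has_avx2 ? "supported" : "not supported");
    return 0;
}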

@jgsogo added the infrastructure label (Waiting on tools or services belonging to the infra) Mar 26, 2021
@jgsogo (Contributor) commented Mar 26, 2021

Checked our infra. Indeed, the macOS machines don't support the AVX2 instruction set. I'm opening a ticket with the infra team to check whether this is something we can request, or whether we can migrate to other machines.

@jgsogo (Contributor) commented Apr 8, 2021

The macOS issues have been fixed: we are now in control of the Mac workers and they all use the same hardware 🎉. All of them report AVX2 in the CPU features (and they are able to build the library).

Now I'm facing issues with the Linux workers: it looks like it takes more than 5 GB to build these libraries and the Docker containers run out of memory (not all of them; I'm retrying the PR again and again to see if it fails consistently for some configurations). Also, the failure is not reported because the k8s pod dies and the CI keeps waiting for a response that will never arrive.


Fixes:

  • Assign more memory to all the pods. But how much memory? The problem will still be there, just 2 GB further away.
  • Do not parallelize the builds (CONAN_CPU_COUNT=1)?
  • ...?

@SpaceIm (Contributor, Author) commented Apr 10, 2021

fbgemm is not the only recipe suffering from memory issues. There are also PCL #1891 (comment), and maybe OpenCASCADE #4094.
I guess that libtorch #5100 will also require a lot of memory. For that one, CONAN_CPU_COUNT=1 would be a last resort, since it takes hours with one CPU, even with the most limited set of options and CUDA disabled.
For example, on GitHub Actions with 3 configurations tested sequentially in one container for Visual Studio Release (dual-core, I think).

Would it be possible to allocate more memory to containers running build of specific recipes?

@jgsogo (Contributor) commented Apr 12, 2021

Would it be possible to allocate more memory to containers running build of specific recipes?

There are three proposals right now on the table:

  • [easy] Increase container memory: this would waste resources, as all containers would be allocated more GB even if they are only going to build a header-only recipe. More GB means more cost.
  • [easy] Increase swap memory: this affects all containers, but we won't waste resources. PRs hitting the memory limit will start to use swap. In practice that probably means those PRs will take too long to build (maybe they will never finish).
  • [hard] Create two node pools in the cluster and, based on something written in config.yml, send the PR to one node pool or the other. The pool with fewer resources would be used by default, the one with more resources only on request. It requires developing a feature in the C3i library, but it is clearly the best solution (and it can probably reduce costs).

@SpaceIm closed this Apr 20, 2021
@SpaceIm reopened this Apr 20, 2021

@jgsogo (Contributor) commented Jan 31, 2022

Is this just about memory? Do we know how much it needs?

We have modified a bunch of things and now it is easy for us to redirect this build to a cluster with more memory.

@conan-center-bot (Collaborator)

Failure in build 16 (5e629904b0f0f406dbcec0f2c5876ce9bb4a062d):

An unexpected error happened and has been reported. Help is on its way! 🏇

@jgsogo (Contributor) commented Feb 2, 2022

This always fails the same way: processes are killed.

@SSE4 , can we confirm if it is about memory and maybe use larger cluster for this one as well? Thanks!

@CLAassistant commented May 18, 2023

CLA assistant check
All committers have signed the CLA.

@jcar87 (Contributor) commented Feb 9, 2024

Closing this PR as stale due to inactivity. PCL was eventually merged after reaching out to the PCL maintainers: it turned out there was a specific bottleneck in the compilation that required some additional build-system help to avoid using a massive amount of memory. That was really an issue outside of the Conan CI realm and had been reported by other users. If this is a similar issue, my advice would be to reach out and report it to the authors of fbgemm.

@jcar87 closed this Feb 9, 2024
@valgur mentioned this pull request Jul 28, 2024
Labels: infrastructure (Waiting on tools or services belonging to the infra)
Projects: none yet
Linked issues: [request] fbgemm/cci.20210130
5 participants