Core 0x23 stalled on current (21.11.0) image #31

Open
sanori opened this issue Jan 23, 2024 · 5 comments

Comments

@sanori
Contributor

sanori commented Jan 23, 2024

Log.txt

00:29:18:WU02:FS00:Download complete
00:29:19:WU02:FS00:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:12261 run:0 clone:236 gen:81 core:0x23 unit:0x000000ec0000005100002fe500000000
00:29:19:WU02:FS00:Starting
00:29:19:WU02:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/openmm-core-23/centos-7.9.2009-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23 -dir 02 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
00:29:19:WU02:FS00:Started FahCore on PID 50
00:29:19:WU02:FS00:Core PID:54
00:29:19:WU02:FS00:FahCore 0x23 started
00:29:19:WARNING:WU02:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
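For anyone scanning a longer log.txt for the same symptom, here is a minimal sketch that pulls the status name and exit code out of a "FahCore returned" line. The regular expression is an assumption based only on the single line shown above, not on any documented log format, and `parse_fahcore_return` is a hypothetical helper name.

```python
import re

# Assumed shape of the client's "FahCore returned" line, inferred from the log above.
RETURN_RE = re.compile(
    r"FahCore returned: (?P<name>\w+) \((?P<dec>\d+) = (?P<hex>0x[0-9a-fA-F]+)\)"
)


def parse_fahcore_return(line):
    """Return (status_name, exit_code), or None if the line does not match."""
    match = RETURN_RE.search(line)
    if match is None:
        return None
    code = int(match.group("dec"))
    # The log prints the code twice (decimal and hex); make sure both agree.
    if code != int(match.group("hex"), 16):
        raise ValueError("decimal and hex exit codes disagree: " + line)
    return match.group("name"), code


line = "00:29:19:WARNING:WU02:FS00:FahCore returned: WU_STALLED (127 = 0x7f)"
print(parse_fahcore_return(line))  # ('WU_STALLED', 127)
```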

Inspection

Core 0x23 seems to require OpenCL 3.0. However, OpenCL 3.0 does not appear to work properly with the image's CUDA 11.2.2 base.

$ docker exec -it fah0 clinfo
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 3.0 CUDA 12.2.148
  Platform Profile                                FULL_PROFILE
(snip)
ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
	NOTE:	your OpenCL library only supports OpenCL 2.1,
		but some installed platforms support OpenCL 3.0.
		Programs using 3.0 features may crash
		or behave unexepectedly
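The mismatch that clinfo warns about can also be checked mechanically. The sketch below assumes the two-column text layout shown in the output above and compares the platform's OpenCL version with the ICD loader's; `opencl_versions` and `loader_is_older` are hypothetical helper names, and the sample text is copied from the lines above.

```python
import re


def opencl_versions(clinfo_text):
    """Return (platform_version, loader_version) as (major, minor) tuples, or None."""

    def find(label):
        pattern = rf"^\s*{re.escape(label)}\s+OpenCL (\d+)\.(\d+)"
        match = re.search(pattern, clinfo_text, re.MULTILINE)
        return (int(match.group(1)), int(match.group(2))) if match else None

    return find("Platform Version"), find("ICD loader Profile")


def loader_is_older(clinfo_text):
    """True when the ICD loader supports a lower OpenCL version than the platform."""
    platform, loader = opencl_versions(clinfo_text)
    if platform is None or loader is None:
        return False  # not enough information to decide
    return loader < platform


SAMPLE = """\
  Platform Version                                OpenCL 3.0 CUDA 12.2.148
  Platform Profile                                FULL_PROFILE
  ICD loader Profile                              OpenCL 2.1
"""

print(opencl_versions(SAMPLE))  # ((3, 0), (2, 1))
print(loader_is_older(SAMPLE))  # True
```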

Inference

According to the NVIDIA Technical Blog, NVIDIA has supported OpenCL 3.0 since Linux driver version 465.19.1. According to the CUDA release notes, the matching CUDA version would be 11.3.1.

Therefore, I guess the CUDA version of the base image should be updated to at least 11.3.1.
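As a rough illustration of that comparison, the sketch below checks a CUDA version string (or an image tag that starts with one) against a minimum. The 11.3.1 threshold is only the guess stated above, not a confirmed requirement, and `parse_version` and `meets_minimum` are hypothetical helper names.

```python
def parse_version(text):
    """Turn '11.2.2' or '11.7.1-base-ubuntu22.04' into a tuple of ints."""
    numeric = text.split("-", 1)[0]  # drop any '-base-ubuntu...' suffix
    return tuple(int(part) for part in numeric.split("."))


def meets_minimum(current, minimum):
    """True when `current` is at least `minimum`, compared component-wise."""
    return parse_version(current) >= parse_version(minimum)


print(meets_minimum("11.2.2", "11.3.1"))                   # False
print(meets_minimum("11.7.1-base-ubuntu22.04", "11.3.1"))  # True
```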

@sanori changed the title from "Core 0x23 stalled on current (CUDA 11.2,2) image" to "Core 0x23 stalled on current (CUDA 11.2.2) image" on Jan 23, 2024
@sanori
Contributor Author

sanori commented Jan 23, 2024

I found that the clinfo package on Ubuntu 18.04 and 20.04 does not support OpenCL 3.0.
Currently, the lowest CUDA version among the official Docker images built on Ubuntu 22.04 is 11.7.1-base-ubuntu22.04.
Therefore, I guess the minimum CUDA version would have to be at least 11.7.1.

@sanori
Contributor Author

sanori commented Jan 23, 2024

I also found that the current FahCore_23 binary requires libexpat.so.1, which exists neither in the Docker image nor in the core 23 package.
(I don't know why core 23 requires libexpat1, which the 0xa8 and 0x22 cores do not need.)
I succeeded in running FahCore_23 in the Docker container by changing some parameters in the Dockerfile.
I'll send a pull request.
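A missing shared library like this can be spotted by filtering `ldd` output for "not found" entries. The sketch below assumes a Linux container with `ldd` available; `missing_libraries` and `check_binary` are hypothetical helper names, and the sample text is illustrative ldd-style output, not captured from the real FahCore_23 binary.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>", 1)[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a binary and list its unresolved libraries (Linux only)."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)


# Illustrative ldd-style text only; the addresses and paths are made up.
sample = (
    "\tlibm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0000000000)\n"
    "\tlibexpat.so.1 => not found\n"
)
print(missing_libraries(sample))  # ['libexpat.so.1']
```

Inside the container, `check_binary` would be called with the FahCore_23 path from the log; an empty list means every dependency resolved.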

@sanori changed the title from "Core 0x23 stalled on current (CUDA 11.2.2) image" to "Core 0x23 stalled on current (21.11.0) image" on Jan 23, 2024
@beberg
Contributor

beberg commented Jan 23, 2024

We hold back versions intentionally for maximum compatibility.

The only new requirement for core 23 is that the CPU must support SSE4.1, so this may be what's triggering the cascade of issues. It only needs OpenCL 1.2, but you should not get assigned that core if you don't have SSE4.1.
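To rule the SSE4.1 requirement in or out on a Linux host, one option is to look for the `sse4_1` flag in /proc/cpuinfo. This is a minimal sketch, assuming an x86 Linux machine; `cpu_flags`, `has_sse41`, and `host_has_sse41` are hypothetical helper names and not part of the FAH client.

```python
def cpu_flags(cpuinfo_text):
    """Collect the flag names from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sse41(cpuinfo_text):
    """True when the CPU advertises SSE4.1 (listed as 'sse4_1' on Linux)."""
    return "sse4_1" in cpu_flags(cpuinfo_text)


def host_has_sse41(path="/proc/cpuinfo"):
    """Read the host's cpuinfo file and check it for SSE4.1."""
    with open(path) as handle:
        return has_sse41(handle.read())


print(has_sse41("flags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"))  # True
print(has_sse41("flags\t\t: fpu sse sse2\n"))                      # False
```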

@sanori
Contributor Author

sanori commented Jan 24, 2024

I understand. Thank you for the explanation. I had mistakenly assumed that OpenMM 8 uses OpenCL 3.

So it appears that the main cause of FahCore 23 stalling is the absence of libexpat1.
Which would be the right fix for WU_STALLED: adding libexpat1 to the Docker image, or waiting for a dependency change in the FahCore 23 package?

@cgint

cgint commented Mar 23, 2024

I faced the same issue as described in the title.

@sanori Thx for your PR #32.

That fixed it for me and now everything works inside docker the same way as on my host machine.
