relion_tomo_reconstruct_tomogram_mpi requiring more memory than before and then crashing #1151

Open
rdrighetto opened this issue Jun 21, 2024 · 1 comment


rdrighetto commented Jun 21, 2024

Describe your problem

Reconstructing tomograms, which worked fine in a previous version (RELION-5.0-beta-3-commit-63311fe), now seems to be broken. The first issue I see is that it now requires more RAM: a job that I could run before with 128 GB easily (perhaps more than enough) now needs 400 GB, otherwise it runs into OOM errors.
Once enough memory is available, it fails with this error:

TomoBackprojectProgram::getCtfCorrectedSNR  BUG: invalid access of newSNR array...

Please see full error message below. I'm trying to re-run something that worked fine in a previous version. I saw there were changes to reconstruct_tomogram.cpp (c3edb97) and wanted to compare the results before and after.

Environment:

  • OS: Ubuntu 22.04
  • MPI runtime: OpenMPI 4.1.5
  • RELION version: RELION-5.0-beta-3-commit-7d79f3
  • Memory: 400 GB

Dataset:

  • Box size (unbinned): 4096 x 5760 x 1024
  • Pixel size (unbinned): 2.685 Å/px
  • Box size (8x binned): 512 x 720 x 128
  • Pixel size (8x binned): 21.48 Å/px
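
As a quick consistency check on the numbers above, the binned box follows from the ratio of the two pixel sizes. The sketch below only redoes that arithmetic in Python; it does not call RELION. The size estimate assumes one 4-byte float per voxel, which is an illustrative assumption and not a statement about how RELION allocates memory internally.

```python
# Values copied from the Dataset list above; nothing here calls RELION.
unbinned_box = (4096, 5760, 1024)   # passed as --w, --h, --d
unbinned_angpix = 2.685             # Å/px
binned_angpix = 21.48               # passed as --binned_angpix

# Binning factor implied by the two pixel sizes.
binning = round(binned_angpix / unbinned_angpix)

# Binned box: each unbinned dimension divided by the binning factor.
binned_box = tuple(n // binning for n in unbinned_box)


def float32_bytes(box):
    """Bytes for one volume of this size.

    ASSUMPTION: a single 4-byte float per voxel; working buffers,
    Fourier-space copies and per-thread data are not counted.
    """
    x, y, z = box
    return x * y * z * 4


print("binning factor:", binning)
print("binned box:", binned_box)
print("one binned volume (MiB):", float32_bytes(binned_box) / 2**20)
print("one unbinned volume (GiB):", float32_bytes(unbinned_box) / 2**30)
```

Under that assumption a single binned output volume is small compared with the 128 GB that used to be sufficient, so the output tomogram alone would not account for the increase.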

Job options:

  • Type of job: Reconstruct tomograms
  • Number of MPI processes: 5
  • Number of threads: 12
  • Full command:
    `which relion_tomo_reconstruct_tomogram_mpi` --t AlignTiltSeries/job027/aligned_tilt_series.star --o Tomograms/job037/ --w 4096 --h 5760 --d 1024 --binned_angpix 21.48 --fourier  --ctf_intact_first_peak  --only_do_unfinished  --j 12 --SNR 100 --pipeline_control Tomograms/job037/
    

Error message:
UPDATE: I edited the error message below to reflect the actual job corresponding to the binned tomogram dimensions above. The error is the same though.

in: /scicore/projects/scicore-p-structsoft/ubuntu/software/RELION/ver5.0/src/jaz/tomography/programs/reconstruct_tomogram.cpp, line 310
ERROR: 
TomoBackprojectProgram::getCtfCorrectedSNR  BUG: invalid access of newSNR array...
terminate called after throwing an instance of 'RelionError'
[scb05:972747] *** Process received signal ***
[scb05:972747] Signal: Aborted (6)
[scb05:972747] Signal code:  (-6)
[scb05:972747] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x14cdd210a520]
[scb05:972747] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x14cdd215e9fc]
[scb05:972747] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x14cdd210a476]
[scb05:972747] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x14cdd20f07f3]
[scb05:972747] [ 4] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xa9a49)[0x14cdd2481a49]
[scb05:972747] [ 5] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xb4e6a)[0x14cdd248ce6a]
[scb05:972747] [ 6] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xb3ed9)[0x14cdd248bed9]
[scb05:972747] [ 7] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(__gxx_personality_v0+0x86)[0x14cdd248c5f6]
[scb05:972747] [ 8] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libgcc_s.so.1(+0x17864)[0x14cde3945864]
[scb05:972747] [ 9] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x14cde39462bd]
[scb05:972747] [10] /scicore/projects/scicore-p-structsoft/ubuntu/software/RELION/ver5.0/build_amdfftw/bin/relion_tomo_reconstruct_tomogram_mpi[0x44b974]
[scb05:972747] [11] /scicore/projects/scicore-p-structsoft/ubuntu/software/RELION/ver5.0/build_amdfftw/bin/relion_tomo_reconstruct_tomogram_mpi[0x4feef6]
[scb05:972747] [12] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libgomp.so.1(+0x1e45e)[0x14cdd8a3145e]
[scb05:972747] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x14cdd215cac3]
[scb05:972747] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x14cdd21ee850]
[scb05:972747] *** End of error message ***
Command terminated by signal 6
133.84user 42.36system 0:45.40elapsed 388%CPU (0avgtext+0avgdata 41107392maxresident)k
srun: error: scb05: task 4: Exited with exit code 134
16848inputs+0outputs (2716major+40074934minor)pagefaults 0swaps
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 442534 ON scb04 CANCELLED AT 2024-06-21T19:05:10 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 442534.0 ON scb04 CANCELLED AT 2024-06-21T19:05:10 DUE TO TIME LIMIT ***
srun: got SIGCONT
srun: forcing job termination
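
For reference, the `maxresident` field in the timing line of the log can be converted to GiB. The timing line looks like the default output format of GNU `time`, which reports peak resident memory in kilobytes; the helper below is a hypothetical convenience for reading that field and is not part of RELION or Slurm. The figure covers only the process wrapped by `time`, up to the moment it aborted.

```python
import re


def max_resident_gib(time_line: str) -> float:
    """Extract the peak resident set size from a GNU time summary line.

    ASSUMPTION: the line uses GNU time's default format, where the
    number before 'maxresident)k' is given in kilobytes.
    """
    match = re.search(r"(\d+)maxresident\)k", time_line)
    if match is None:
        raise ValueError("no maxresident field found")
    return int(match.group(1)) / 1024**2  # KiB -> GiB


# Timing line copied verbatim from the error log above.
line = ("133.84user 42.36system 0:45.40elapsed 388%CPU "
        "(0avgtext+0avgdata 41107392maxresident)k")

peak = max_resident_gib(line)
print(f"peak resident memory: {peak:.1f} GiB")
```

This is the peak of one process after about 45 seconds of wall time, so it is a lower bound for that rank and not the total footprint of the 5-process job.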

@rdrighetto
Author

Just confirmed that relion_tomo_reconstruct_tomogram_mpi works again when reverting to 6331fe6 with exactly the same settings.
