
Number of Threads Limit? #512

Closed · sklages opened this issue Oct 30, 2018 · 19 comments
Labels: issue: code (Likely to be an issue with STAR code)

@sklages commented Oct 30, 2018

System/Data

Linux 4.14.33
GNU libc 2.27
machine: 128 cores, 1 TB RAM

This is a standard human dataset; the reference was created from GENCODE:

GRCh38.primary_assembly.genome.fa
gencode.v29.annotation.gtf

Passed Mapping (20 threads)

STAR --genomeDir /project/genomes/STAR/GRCh38.p12 \
--runMode alignReads \
--runThreadN 20 \
--genomeLoad NoSharedMemory \
--readFilesCommand zcat \
--outFileNamePrefix XXX \
--outTmpDir temp.XXX \
--outStd Log \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes All \
--outSAMmapqUnique 255 \
--quantMode GeneCounts \
--readFilesIn TRIMMED_FASTQ_R1.trm.fq.gz TRIMMED_FASTQ_R2.trm.fq.gz

Oct 30 12:20:43 ..... started STAR run
Oct 30 12:20:43 ..... loading genome
Oct 30 12:20:59 ..... started mapping
Oct 30 12:24:56 ..... started sorting BAM
Oct 30 12:26:46 ..... finished successfully

Failed Mapping (21 threads)

STAR --genomeDir /project/genomes/STAR/GRCh38.p12 \
--runMode alignReads \
--runThreadN 21 \
--genomeLoad NoSharedMemory \
--readFilesCommand zcat \
--outFileNamePrefix XXX \
--outTmpDir temp.XXX \
--outStd Log \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes All \
--outSAMmapqUnique 255 \
--quantMode GeneCounts \
--readFilesIn TRIMMED_FASTQ_R1.trm.fq.gz TRIMMED_FASTQ_R2.trm.fq.gz

Oct 30 12:16:05 ..... started STAR run
Oct 30 12:16:05 ..... loading genome
Oct 30 12:16:18 ..... started mapping
Oct 30 12:20:09 ..... started sorting BAM

EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: 478405037   0   43

Oct 30 12:20:11 ...... FATAL ERROR, exiting

Before reducing the thread count, I was experimenting (at threads=40) with gzipped/unzipped input and different versions of STAR, including dynamically and statically linked binaries. All jobs failed with the same error.

Why is (my version of) STAR crashing when using more than 20 threads?

It is not strictly needed, since STAR is quite fast, but I don't like seeing segfaults :-)
The log says:

*** glibc detected *** /package/sequencer/bin/STAR: free(): invalid pointer: 0x000000000089dde0 ***
[..]
RNAseq_Workflow_01.sh: line 281: 43967 Aborted  (core dumped) $star [..]
@sklages (Author) commented Oct 30, 2018

When using --runThreadN 40 --outBAMsortingThreadN 40 with the current static binary I get:

*** glibc detected *** STAR: double free or corruption (!prev): 0x00000000008c5160 ***

same machine, same dataset.

@sklages (Author) commented Oct 30, 2018

Just an example log file:
Log.out.txt

@sklages (Author) commented Oct 30, 2018

One more piece of information: building the reference with STAR --runThreadN 40 --runMode genomeGenerate works just fine ...

@alexdobin (Owner)

Hi @sklages

here are a few things I would recommend trying:

To resolve the seg-faults: try the dynamically linked binary, and also compile from scratch with make (a build sketch follows below).
To resolve the sorting problem: check the disk space. Is there enough for STAR to write its temporary files? If that is not the problem, please send me the output of
$ ls -lR _STARtmp
from the directory of the failed run.
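For reference, a from-source build looks roughly like this (a sketch following the STAR README; paths may differ on your system):

# clone the STAR repository and build from source
git clone https://github.com/alexdobin/STAR.git
cd STAR/source
make STAR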

Cheers
Alex

alexdobin added the issue: code (Likely to be an issue with STAR code) label on Oct 30, 2018
@sklages (Author) commented Oct 30, 2018

I always build STAR from source, and I tried both statically and dynamically linked binaries. Both failed.
I have ~10 TB of free disk space (working dir and temp dir), so this shouldn't be an issue.

The output of ls -lR temp.STAR looks like this:
output.ls-lR.txt

@alexdobin (Owner)

Hi @sklages

Thanks a lot for helping me to debug this problem.
Could you please send me the Log.out file for the specific run above, the one for which you sent me output.ls-lR.txt?

Cheers
Alex

@sklages (Author) commented Oct 30, 2018

I am afraid I can't: since I run quite a lot of jobs, I have already removed those directories.

But I have run a new one with 40 threads, which also failed (using the statically linked STAR, built for GNU/Linux 4.4.34):

Log.out.txt
MY_SAMPLE.temp-ls-lR.txt

hope this helps ...

@paulmenzel

I tried to build the STAR binary with ThreadSanitizer and the patch below.

diff --git a/source/Makefile b/source/Makefile
index cf53bc3..db7c8fe 100644
--- a/source/Makefile
+++ b/source/Makefile
@@ -5,14 +5,14 @@
 # CFLAGS
 
 # or these user-set flags that will be added to standard flags
-LDFLAGSextra ?=
-CXXFLAGSextra ?=
+LDFLAGSextra ?= -pie
+CXXFLAGSextra ?= -fsanitize=thread -g -fPIE -I/package/sequencer/samtools/1.8/include
 
 # user may define the compiler
 CXX ?= g++
 
 # pre-defined flags
-LDFLAGS_shared := -pthread -Lhtslib -Bstatic -lhts -Bdynamic -lz
+LDFLAGS_shared := -pthread -Lhtslib -Bstatic -lhts -Bdynamic -lz -L/package/sequencer/samtools/1.8/lib
 LDFLAGS_static := -static -static-libgcc -pthread -Lhtslib -lhts -lz
 LDFLAGS_Mac :=-pthread -lz htslib/libhts.a
 LDFLAGS_Mac_static :=-pthread -lz -static-libgcc htslib/libhts.a
@@ -24,7 +24,7 @@ CXXFLAGS_common := -pipe -std=c++11 -Wall -Wextra -fopenmp $(COMPTIMEPLACE)
 CXXFLAGS_main := -O3 $(CXXFLAGS_common)
 CXXFLAGS_gdb :=  -O0 -g $(CXXFLAGS_common)
 
-CFLAGS := -O3 -pipe -Wall -Wextra $(CFLAGS)
+CFLAGS := -O3 -pipe -Wall -Wextra $(CFLAGS) -fsanitize=thread -g -fPIE -I/package/sequencer/samtools/1.8/include
 
 
 ##########################################################################################################

I also had to link against “our” libhts, since I could not build the bundled htslib sources in a way that kept the linker from complaining.

/usr/lib/gcc/x86_64-pc-linux-gnu/7.3.0/../../../../x86_64-pc-linux-gnu/bin/ld: htslib/libhts.a(bgzf.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
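One possible way around that relocation error (an untested sketch, assuming the bundled htslib under STAR's source tree has its own Makefile) would be rebuilding it as position-independent code:

# rebuild the bundled htslib with -fPIC so it can link into a PIE binary
cd htslib
make clean
make CFLAGS="-O2 -fPIC"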

Unfortunately, the stack traces are not easily usable.

$ LD_LIBRARY_PATH=/package/sequencer/samtools/1.8/lib/:$LD_LIBRARY_PATH ./call.sh
Oct 30 20:02:38 ..... started STAR run
Oct 30 20:02:38 ..... loading genome
Oct 30 20:04:11 ..... started mapping
==================
WARNING: ThreadSanitizer: data race (pid=129796)
  Write of size 8 at 0x7b9000000000 by thread T11 (mutexes: write M103):
    #0 <null> <null> (libtsan.so.0+0x000000031aad)
    #1 <null> <null> (libstdc++.so.6+0x00000012502a)
    #2 <null> <null> (STAR+0x000000111327)
    #3 <null> <null> (STAR+0x0000000952ff)
    #4 <null> <null> (STAR+0x0000000a2818)
    #5 <null> <null> (STAR+0x000000087a3d)
    #6 <null> <null> (STAR+0x000000083d79)
    #7 <null> <null> (STAR+0x000000106ea5)
    #8 <null> <null> (libtsan.so.0+0x00000002843b)

  Previous read of size 8 at 0x7b9000000000 by main thread:
    [failed to restore the stack]

  Location is heap block of size 8192 at 0x7b9000000000 allocated by main thread:
    #0 <null> <null> (libtsan.so.0+0x00000006d9d6)
    #1 <null> <null> (libstdc++.so.6+0x0000000f3b77)
    #2 <null> <null> (STAR+0x00000000833d)
    #3 <null> <null> (libc.so.6+0x000000021b5d)

  Mutex M103 (0x55a0829e5598) created at:
    #0 <null> <null> (libtsan.so.0+0x00000002befe)
    #1 <null> <null> (STAR+0x0000000087eb)
    #2 <null> <null> (libc.so.6+0x000000021b5d)

  Thread T11 (tid=129858, running) created by main thread at:
    #0 <null> <null> (libtsan.so.0+0x00000002b6f0)
    #1 <null> <null> (STAR+0x000000105be8)
    #2 <null> <null> (STAR+0x000000008a6e)
    #3 <null> <null> (libc.so.6+0x000000021b5d)

SUMMARY: ThreadSanitizer: data race (/usr/lib/libtsan.so.0+0x31aad) 
==================

Maybe you will have more luck getting something out of that.
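One way to get symbolized frames out of ThreadSanitizer (a sketch; it assumes llvm-symbolizer is installed and the binary was built with -g, and reuses the call.sh wrapper from the run above):

# point the sanitizer runtime at llvm-symbolizer so frames resolve to source lines
TSAN_OPTIONS="symbolize=1 external_symbolizer_path=$(which llvm-symbolizer)" ./call.sh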

@paulmenzel

Strangely, taking the message below from the original post as an example, there is no file named 43 (iBin) in the thread directories.

EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: 478405037   0   43

So how does binS get calculated as 478405037?
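A quick way to see what is actually on disk for that bin across the per-thread directories (a sketch; temp.XXX stands for whatever --outTmpDir was set to in the command above):

# list every file named 43 under the BAMsort temp tree, with sizes
find temp.XXX/BAMsort -name 43 -exec ls -l {} +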

@paulmenzel commented Nov 2, 2018

Any idea how files like 47 get deleted?

Before calling BAMbinSortByCoordinate():

$ ls -l temp.STAR.21/BAMsort/0
total 766992
-rw-rw---- 1 pmenzel pmenzel 16101267 Nov  2 00:30 0
-rw-rw---- 1 pmenzel pmenzel 16010604 Nov  2 00:30 1
-rw-rw---- 1 pmenzel pmenzel 16074692 Nov  2 00:30 10
-rw-rw---- 1 pmenzel pmenzel 16394632 Nov  2 00:30 11
-rw-rw---- 1 pmenzel pmenzel 15985698 Nov  2 00:30 12
-rw-rw---- 1 pmenzel pmenzel 15501963 Nov  2 00:30 13
-rw-rw---- 1 pmenzel pmenzel 15635042 Nov  2 00:30 14
-rw-rw---- 1 pmenzel pmenzel 15786787 Nov  2 00:30 15
-rw-rw---- 1 pmenzel pmenzel 16487123 Nov  2 00:30 16
-rw-rw---- 1 pmenzel pmenzel 16017383 Nov  2 00:30 17
-rw-rw---- 1 pmenzel pmenzel 15838374 Nov  2 00:30 18
-rw-rw---- 1 pmenzel pmenzel 16047548 Nov  2 00:30 19
-rw-rw---- 1 pmenzel pmenzel 15844008 Nov  2 00:30 2
-rw-rw---- 1 pmenzel pmenzel 16418779 Nov  2 00:30 20
-rw-rw---- 1 pmenzel pmenzel 15728451 Nov  2 00:30 21
-rw-rw---- 1 pmenzel pmenzel 15637958 Nov  2 00:30 22
-rw-rw---- 1 pmenzel pmenzel 16373698 Nov  2 00:30 23
-rw-rw---- 1 pmenzel pmenzel 15997917 Nov  2 00:30 24
-rw-rw---- 1 pmenzel pmenzel 15601027 Nov  2 00:30 25
-rw-rw---- 1 pmenzel pmenzel 15784773 Nov  2 00:30 26
-rw-rw---- 1 pmenzel pmenzel 15724044 Nov  2 00:30 27
-rw-rw---- 1 pmenzel pmenzel 16261866 Nov  2 00:30 28
-rw-rw---- 1 pmenzel pmenzel 16031446 Nov  2 00:30 29
-rw-rw---- 1 pmenzel pmenzel 15761149 Nov  2 00:30 3
-rw-rw---- 1 pmenzel pmenzel 15995103 Nov  2 00:30 30
-rw-rw---- 1 pmenzel pmenzel 16369743 Nov  2 00:30 31
-rw-rw---- 1 pmenzel pmenzel 15313179 Nov  2 00:30 32
-rw-rw---- 1 pmenzel pmenzel 15687287 Nov  2 00:30 33
-rw-rw---- 1 pmenzel pmenzel 15802125 Nov  2 00:30 34
-rw-rw---- 1 pmenzel pmenzel 15698200 Nov  2 00:30 35
-rw-rw---- 1 pmenzel pmenzel 15671227 Nov  2 00:30 36
-rw-rw---- 1 pmenzel pmenzel 15481866 Nov  2 00:30 37
-rw-rw---- 1 pmenzel pmenzel 16282634 Nov  2 00:30 38
-rw-rw---- 1 pmenzel pmenzel 15163882 Nov  2 00:30 39
-rw-rw---- 1 pmenzel pmenzel 15499124 Nov  2 00:30 4
-rw-rw---- 1 pmenzel pmenzel 15715222 Nov  2 00:30 40
-rw-rw---- 1 pmenzel pmenzel 15795028 Nov  2 00:30 41
-rw-rw---- 1 pmenzel pmenzel 16254751 Nov  2 00:30 42
-rw-rw---- 1 pmenzel pmenzel 17187360 Nov  2 00:30 43
-rw-rw---- 1 pmenzel pmenzel 17568423 Nov  2 00:30 44
-rw-rw---- 1 pmenzel pmenzel 17674917 Nov  2 00:30 45
-rw-rw---- 1 pmenzel pmenzel 17095225 Nov  2 00:30 46
-rw-rw---- 1 pmenzel pmenzel 17030236 Nov  2 00:30 47
-rw-rw---- 1 pmenzel pmenzel 15646516 Nov  2 00:30 48
-rw-rw---- 1 pmenzel pmenzel        0 Nov  2 00:27 49
-rw-rw---- 1 pmenzel pmenzel 15888232 Nov  2 00:30 5
-rw-rw---- 1 pmenzel pmenzel 16455256 Nov  2 00:30 6
-rw-rw---- 1 pmenzel pmenzel 15292509 Nov  2 00:30 7
-rw-rw---- 1 pmenzel pmenzel 15770231 Nov  2 00:30 8
-rw-rw---- 1 pmenzel pmenzel 15929471 Nov  2 00:30 9

Some time into BAMbinSortByCoordinate():

$ ls -l temp.STAR.21/BAMsort/0
total 667172
-rw-rw---- 1 pmenzel pmenzel 16101267 Nov  2 00:30 0
-rw-rw---- 1 pmenzel pmenzel 16010604 Nov  2 00:30 1
-rw-rw---- 1 pmenzel pmenzel 16074692 Nov  2 00:30 10
-rw-rw---- 1 pmenzel pmenzel 16394632 Nov  2 00:30 11
-rw-rw---- 1 pmenzel pmenzel 15985698 Nov  2 00:30 12
-rw-rw---- 1 pmenzel pmenzel 15501963 Nov  2 00:30 13
-rw-rw---- 1 pmenzel pmenzel 15635042 Nov  2 00:30 14
-rw-rw---- 1 pmenzel pmenzel 15786787 Nov  2 00:30 15
-rw-rw---- 1 pmenzel pmenzel 16487123 Nov  2 00:30 16
-rw-rw---- 1 pmenzel pmenzel 16017383 Nov  2 00:30 17
-rw-rw---- 1 pmenzel pmenzel 15838374 Nov  2 00:30 18
-rw-rw---- 1 pmenzel pmenzel 16047548 Nov  2 00:30 19
-rw-rw---- 1 pmenzel pmenzel 15844008 Nov  2 00:30 2
-rw-rw---- 1 pmenzel pmenzel 16418779 Nov  2 00:30 20
-rw-rw---- 1 pmenzel pmenzel 15728451 Nov  2 00:30 21
-rw-rw---- 1 pmenzel pmenzel 15637958 Nov  2 00:30 22
-rw-rw---- 1 pmenzel pmenzel 16373698 Nov  2 00:30 23
-rw-rw---- 1 pmenzel pmenzel 15997917 Nov  2 00:30 24
-rw-rw---- 1 pmenzel pmenzel 15601027 Nov  2 00:30 25
-rw-rw---- 1 pmenzel pmenzel 15784773 Nov  2 00:30 26
-rw-rw---- 1 pmenzel pmenzel 15724044 Nov  2 00:30 27
-rw-rw---- 1 pmenzel pmenzel 16261866 Nov  2 00:30 28
-rw-rw---- 1 pmenzel pmenzel 16031446 Nov  2 00:30 29
-rw-rw---- 1 pmenzel pmenzel 15761149 Nov  2 00:30 3
-rw-rw---- 1 pmenzel pmenzel 15995103 Nov  2 00:30 30
-rw-rw---- 1 pmenzel pmenzel 16369743 Nov  2 00:30 31
-rw-rw---- 1 pmenzel pmenzel 15313179 Nov  2 00:30 32
-rw-rw---- 1 pmenzel pmenzel 15687287 Nov  2 00:30 33
-rw-rw---- 1 pmenzel pmenzel 15802125 Nov  2 00:30 34
-rw-rw---- 1 pmenzel pmenzel 15698200 Nov  2 00:30 35
-rw-rw---- 1 pmenzel pmenzel 15671227 Nov  2 00:30 36
-rw-rw---- 1 pmenzel pmenzel 15481866 Nov  2 00:30 37
-rw-rw---- 1 pmenzel pmenzel 16282634 Nov  2 00:30 38
-rw-rw---- 1 pmenzel pmenzel 15163882 Nov  2 00:30 39
-rw-rw---- 1 pmenzel pmenzel 15499124 Nov  2 00:30 4
-rw-rw---- 1 pmenzel pmenzel 15715222 Nov  2 00:30 40
-rw-rw---- 1 pmenzel pmenzel 15795028 Nov  2 00:30 41
-rw-rw---- 1 pmenzel pmenzel 16254751 Nov  2 00:30 42
-rw-rw---- 1 pmenzel pmenzel        0 Nov  2 00:27 49
-rw-rw---- 1 pmenzel pmenzel 15888232 Nov  2 00:30 5
-rw-rw---- 1 pmenzel pmenzel 16455256 Nov  2 00:30 6
-rw-rw---- 1 pmenzel pmenzel 15292509 Nov  2 00:30 7
-rw-rw---- 1 pmenzel pmenzel 15770231 Nov  2 00:30 8
-rw-rw---- 1 pmenzel pmenzel 15929471 Nov  2 00:30 9

@paulmenzel

Ignore my last post; I had missed the remove(bamInFile.c_str()); call.

@paulmenzel

Running git bisect shows that commit 91d34b7 (“Implemented --outBAMsortingBinsN option to control the number of sorting bins. Icnreasing this number reduces the amount of RAM required for sorting.”) introduced the regression.

$ git bisect log
git bisect start
# good: [172fa15e2aaf649675395381b610e7732d979b24] 2.5.4a Mac executables
git bisect good 172fa15e2aaf649675395381b610e7732d979b24
# bad: [a4fadc519161e9ac777ca1d344f235e83ebc2d19] Fixed the bug causing inconsistent output for mate1/2 in the Unmapped files.
git bisect bad a4fadc519161e9ac777ca1d344f235e83ebc2d19
# good: [cdf9df6c9babf61ef76abdf8af71c0299aca063d] Merged master changes (RG for chimeric junction output) into var.
git bisect good cdf9df6c9babf61ef76abdf8af71c0299aca063d
# bad: [ca294ddcc7222b2b70470e4589ce57f044447e08] Merged master into var. Preparing for 2.6.0
git bisect bad ca294ddcc7222b2b70470e4589ce57f044447e08
# bad: [7783243534915d01e5ab7232e2c6e7f0bc381f8f] Fixed some issues with the overlapping mates algorithm.
git bisect bad 7783243534915d01e5ab7232e2c6e7f0bc381f8f
# good: [a57c5f81da9fbb63c33966665fab8a8f7a259fc0] For BAM to signal conversion, process alignments without NH tags as unique (i.e. NH=1).
git bisect good a57c5f81da9fbb63c33966665fab8a8f7a259fc0
# bad: [7701cf6b1400bfa9822df9ec8aafa0852e2db14c] Fixed a problem with non-default --sjdbOverhang genome generation.
git bisect bad 7701cf6b1400bfa9822df9ec8aafa0852e2db14c
# bad: [f013fe13e8c13ff00b7ff25465b896dc2ac3ffa3] Recompiled executables.
git bisect bad f013fe13e8c13ff00b7ff25465b896dc2ac3ffa3
# bad: [91d34b705038e8b8ad25355ef3994bb30394946f] Implemented --outBAMsortingBinsN option to control the number of sorting bins. Icnreasing this number reduces the amount of RAM required for sorting.
git bisect bad 91d34b705038e8b8ad25355ef3994bb30394946f
# first bad commit: [91d34b705038e8b8ad25355ef3994bb30394946f] Implemented --outBAMsortingBinsN option to control the number of sorting bins. Icnreasing this number reduces the amount of RAM required for sorting.
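As a side note, a bisect like this can also run unattended with git bisect run and a test script that exits non-zero when the mapping fails (test.sh here is hypothetical):

# automate the bisect; test.sh must exit 0 on good commits and non-zero on bad ones
git bisect start a4fadc519161e9ac777ca1d344f235e83ebc2d19 172fa15e2aaf649675395381b610e7732d979b24
git bisect run ./test.sh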

@alexdobin (Owner)

Hi Paul,

thanks for your help!
Were you able to reproduce the error in your run?
You do not seem to have 0-sized bin files.
For some reason, in @sklages' run at least one bin (the 45th in the latest example) is empty.
I think the error happens when the files are written. I am actually not checking whether the write operation completes successfully; I will add these checks and release a new patch shortly.

Cheers
Alex

@sklages (Author) commented Nov 3, 2018

What may cause such write errors?

It does not appear to happen randomly. I routinely use STAR with 20 cores in our RNA-seq workflow. It only seems to crash when using more cores, no matter whether I run on a heavily loaded file server or on my own workstation with local SSDs...

@paulmenzel commented Nov 4, 2018 via email

@alexdobin (Owner)

Hi Paul, @sklages

thanks a lot for your help, I think I figured out what the problem was.
It is caused by the value of ulimit -n (the per-process open-file limit) being too small; it defaults to 1024 on most systems.
On my standard systems this value is raised to 10000, which is why I could not reproduce the problem. When I tried it on another system, I got the error with >20 threads.
The commit that Paul found increased the number of sorting bins to 50, which pushed the total number of temp files (21*50 = 1050) above the 1024 limit. When I opened the temp files, I did not check for errors, so STAR could not write into them and later complained that the file size on disk did not match the recorded values.
I fixed that: if you try the latest patch from the GitHub master, you will now get an error before the mapping starts saying that the temp file cannot be opened.

TL;DR:
To solve the problem, either increase ulimit -n, or (if that is not possible) reduce --runThreadN and/or --outBAMsortingBinsN so that ulimit -n > runThreadN * outBAMsortingBinsN.
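For example, a quick pre-flight check of that inequality could look like this (a sketch; 50 is the bin count the offending commit introduced):

# make sure the open-file limit covers all sorting temp files
THREADS=21; BINS=50
if [ "$(ulimit -n)" -gt "$((THREADS * BINS))" ]; then
    echo "ulimit -n is sufficient"
else
    echo "raise ulimit -n above $((THREADS * BINS)), or lower the thread/bin counts"
fi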

Cheers
Alex

@paulmenzel commented Nov 5, 2018 via email

@sklages (Author) commented Nov 5, 2018

Hi @alexdobin, thanks for solving the issue, and thanks @paulmenzel for your efforts hunting down the problem :-).

@aiqc commented Jul 8, 2023

Check the soft and hard limits:

$ ulimit -Sn
1024

$ ulimit -Hn
1048576

Set a higher soft limit:

ulimit -S -n 4096
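Since ulimit changes only apply to the current shell and its children, one approach (a sketch with illustrative values) is to raise the soft limit inside the job script itself:

#!/bin/bash
# Raise the soft open-file limit for this script and its children only.
# Keep it above runThreadN * outBAMsortingBinsN (here 40 * 50 = 2000).
ulimit -S -n 4096
STAR --runThreadN 40 --outBAMsortingBinsN 50 "$@"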
