
FASTQ output files containing unmapped reads are out-of-order #222

Closed
Gig77 opened this issue Dec 15, 2016 · 22 comments
Labels
issue: code (Likely to be an issue with STAR code) · resolved (problem or issue that has been resolved)

Comments

@Gig77

Gig77 commented Dec 15, 2016

When I configure STAR to output unmapped paired-end reads into two FASTQ files using the "--outReadsUnmapped Fastx" option, the resulting files are out of order, i.e. mates of the same pair are not always found at the same line number in the two files.

If I take these output FASTQ files as-is and align them with STAR again, I get a very high percentage of "reads unmapped: too short" (>80%). If the FASTQ files are sorted before alignment, this percentage goes back to normal (~5%).

I'm using STAR version 2.5.1b.
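
A quick way to check whether the two unmapped files are in step is to compare their read IDs line for line. A minimal sketch, assuming STAR's default output names (Unmapped.out.mate1/2) in the current directory:

$ paste <(awk 'NR % 4 == 1 {print $1}' Unmapped.out.mate1) \
        <(awk 'NR % 4 == 1 {print $1}' Unmapped.out.mate2) \
  | awk '$1 != $2' | head

If the files are properly paired this prints nothing; any output lines mark positions where the mate1 and mate2 read names differ.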

@alexdobin
Owner

Hi Christian,

this does not happen in my tests, so it has to be parameter- or system-specific.
Please send me the Log.out file first.

Cheers
Alex

@Gig77
Author

Gig77 commented Dec 17, 2016 via email

@alexdobin
Owner

Hi Christian,

thanks for the files. Nothing suspicious in them, unfortunately.
Could you please check a few more things:

  1. Does this happen every time you map?
  2. Is the number of lines the same in the two Unmapped files?
  3. Does it look like the order is screwed up in blocks?

Could you send me an example of the Unmapped reads with wrong ordering? If the files are too big, please try to reproduce this problem on a small subset of reads (~300k to 1M).

Cheers
Alex
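
One simple way to make such a matched subset is to take the same number of leading records from both files; a sketch with placeholder file names, taking the first 500,000 read pairs (4 lines per FASTQ record):

$ zcat sample_R1.fastq.gz | head -n 2000000 | gzip > subset_R1.fastq.gz
$ zcat sample_R2.fastq.gz | head -n 2000000 | gzip > subset_R2.fastq.gz

As long as the original files are properly paired, the two subsets stay paired as well.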

@Gig77
Author

Gig77 commented Dec 22, 2016 via email

@Gig77
Author

Gig77 commented Dec 22, 2016 via email

@yuxinghai

I have the same issue when mapping with STAR_2.5.3a

@alexdobin
Owner

Hi @yuxinghai

I could not reproduce this problem on my system. If you could send me your FASTQ files, the Log.out file, and links to the genome, I can try running it on my system - maybe we'll get lucky and catch the error this time.

Cheers
Alex

@sschmeier

We are seeing the same behavior in some samples, and it is unclear why. I attach one example Log.out below. We are mapping human data, so the fq files are reasonably big; however, if you want them, I can get you the files for the example below.

The heads of the fq input files look like this:

$ zcat /media/seb/Data/crc/201712/samples_star/CBMFTANXX-2706-219-12-1_S123_L001_R1_001.fastq.trimmed.paired.gz | head -8
@7001326F:122:CBMFTANXX:1:2310:9932:95316 1:N:0:CGCTCATT+CCTATCCT
CTGAAGTCCTTTAGGAGCTTGGACATTTAACTATATCTGCTAGTGTGCAAATCCCCTGACATCCTGGATATTAGTGATGGTTTTGTTGCTCTTCAAATTCAAGGATAAGGATGCACAAGTTACCA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@7001326F:122:CBMFTANXX:1:2310:9970:95337 1:N:0:CGCTCATT+CCTATCCT
GCTCTCCCACCCTGGTCCCTCTTCCTTCAA
+
FFFFFBFFFFFFFFFFBFFFBBBFFFFFFF


$ zcat samples_star/CBMFTANXX-2706-219-12-1_S123_L001_R2_001.fastq.trimmed.paired.gz | head -8
@7001326F:122:CBMFTANXX:1:2310:9932:95316 2:N:0:CGCTCATT+CCTATCCT
ATTCAGCATTATTTCATTGTGATCCAGTTTTTATATGCTTCAGTTAAGCCAGTGAGTTTTTAAATGCGACCAGCATCTGGCAAAATTGTTTCCAGGAAAAATGTTTCCATTGTTGGAAGGATGGT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFBFFBFFFFB
@7001326F:122:CBMFTANXX:1:2310:9970:95337 2:N:0:CGCTCATT+CCTATCCT
GAAGAATAGAGGTCCTCATGGGTCCCTTGAAGGAAGAGGGACCAGG
+
<7FFFFFFFFFFFBBFFFFFFBFBBBFFFFFFFFFFFFBF<FFFFF

The heads of the unmapped reads look like this:

$ cat CBMFTANXX-2706-219-12-1_S123_L001_Unmapped.out.mate1 | head -8
@7001326F:122:CBMFTANXX:1:2310:10067:95410	01
GCAACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGGACTCAAGCGATCCTCCAGCCTC
+
BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@7001326F:122:CBMFTANXX:1:2310:12168:95410	01
TGATAGCTTTGCACAGGAAGATTGTGAGTTATTTGCACAGGAGGGCTATGTGTCCTGGACCATAAAGAAAGGCAGACTTACAGCTTATCCACTTTCT
+
B<FFBFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFBFFFFFF

$ cat CBMFTANXX-2706-219-12-1_S123_L001_Unmapped.out.mate2 | head -8
@7001326F:122:CBMFTANXX:1:2210:17208:15941	01
TGAGGTCAGGAGCTTGAGACCAGCCTGGCCAACATGGTGAAACCT
+
FFFF<FBFFFFFFFBFFFFFFFFFFFFFFFFFFBFFF7F<FFF<F
@7001326F:122:CBMFTANXX:1:2210:20971:15817	01
CCAGAAATGGTTCTGTGCCAGCTCACTCACTCCCGCTTTCTGGAAAAATGATTGCTTGGCCCGAGGGCTCTGCTCCCTCCCCCAACCCCTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Any ideas welcome. Please shout if you want to follow this up and need/want the original files.

Great work by the way!

Cheers,
Seb

CBMFTANXX-2706-219-12-1_S123_L001_Log.out.zip

@mpschr

mpschr commented Mar 26, 2018

I just ran into this problem with STAR_2.5.2b. It only affects 2 of the 40 aligned samples (multi-threaded).

@alexdobin
Owner

Hi Seb, Michael,

I could never reproduce this problem in my tests.
Is it reproducible, i.e. if you map the same files again, is the problem still present?
If so, could you share these files? If the files are too big, could you try to find a small subset that still shows the problem?

Cheers
Alex

@sschmeier

I will compile something, but I need some time - hopefully by the end of the week.

@zhoujj2013

Hi all,

I came across the same problem.
Has anyone solved it?

Best,
zhoujj

@alexdobin
Owner

Hi @zhoujj2013

I have not been able to reproduce the problem on my system.
Is it reproducible, i.e. does it happen on the same sample every time?
Please try the latest release - both the dynamic and static pre-compiled executables, as well as one compiled with make.
Also try using just the default parameters.
If it still happens, I would need a test example.

Cheers
Alex

@zhoujj2013

Hi @alexdobin

Thanks for your reply.

I use the static pre-compiled executable with 6 threads (--outReadsUnmapped Fastx, otherwise default parameters) for my analysis.
I have also tried to debug this problem.
It looks like the order is screwed up in blocks.
When I run STAR (v2.5.1b) keeping the temporary directory (_tmpDir) and cat the separate per-thread *.mate1/*.mate2 files myself, I get normal FASTQ files.
So I guess the error happens in the concatenation step, but I don't know how STAR works in that step.
Could you please check whether STAR sorts the *.mate1/2 file names from the different threads?

Thanks again.

Best,
zhoujj
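
For reference, the manual workaround described above might look roughly like this. The per-thread file names inside the kept temporary directory are an assumption (check what your _STARtmp actually contains), not STAR's documented layout:

$ ls _STARtmp/
$ cat $(ls -v _STARtmp/Unmapped.out.mate1.thread*) > Unmapped.out.mate1
$ cat $(ls -v _STARtmp/Unmapped.out.mate2.thread*) > Unmapped.out.mate2

The key point is to concatenate the per-thread pieces in the same order for mate1 and mate2 (ls -v gives a stable natural sort of the thread numbers).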

@alexdobin
Owner

Hi @zhoujj2013

I think I may have found the bug causing the problem.
Could you please try the latest patch from GitHub master and let me know if it solved the problem.

Cheers
Alex

@zhoujj2013

Hi @alexdobin

Thanks a lot.
I have tested the updated version and it works well.

Cheers
zhoujj

@alexdobin
Owner

Hi Zhoujj

thanks a lot - I will release a new tagged version shortly.

Cheers
Alex

jackkamm added a commit to chazlangelier/ct-transcriptomics that referenced this issue Jan 2, 2019
@amayer21

Hi @alexdobin,

Sorry to bother you about an issue you already fixed, but we would like to know a bit more about this bug to troubleshoot a problem on our compute nodes.

We were using version 2.5.2b and had the same problem as described above. When troubleshooting (before reading this issue), we used a single set of FASTQ files and ran the same script several times. We have 4 compute nodes that should be identical, but we noticed that the problem always occurred when using 3 of the nodes and never on the 4th one. We have tested each node about 10 times, so it seems unlikely to be just random.

After reading this post, we updated to version 2.6.1e and that solved the problem. However, our sysadmin is really worried about the fact that our 4 nodes didn't behave identically, and he thought that if you could give us some indication of the nature of the bug, he would have a better idea of what to look for.

Thank you very much!
Best wishes,
Alice

@alexdobin
Owner

Hi Alice,

this problem was fixed - please try one of the latest releases 2.6.1d or 2.7.2a.

Cheers
Alex

@alexdobin added the issue: code and resolved labels Aug 29, 2019
@amayer21

amayer21 commented Sep 3, 2019

Hi Alex,
I know it has been fixed in 2.6.1d. I'm asking from a system-administrator point of view: the behavior was different on nodes that are supposed to be identical, and we would like to understand what may differ between these nodes and fix it. I thought that having some information about the nature of the bug you fixed could help.
Thank you very much
Best,
Alice

@paulmenzel

> I think I may have found the bug causing the problem.
> Could you please try the latest patch from GitHub master and let me know if it solved the problem.

Was the fix commit a4fadc5 (Fixed the bug causing inconsistent output for mate1/2 in the Unmapped files.)?
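
For anyone who wants to verify that a given source checkout already contains that commit, a check along these lines should work (the checkout target is a placeholder):

$ git clone https://github.com/alexdobin/STAR.git && cd STAR
$ git checkout <the-tag-or-commit-you-build-from>
$ git merge-base --is-ancestor a4fadc5 HEAD && echo "fix present" || echo "fix missing"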

@amayer21

amayer21 commented Sep 3, 2019

Thank you very much Paul,

If I understand properly, the fix was to move the pthread_mutex_lock from inside the for loop (in the chunkFstreamCat function) to outside of it. That makes sense, but it doesn't explain why we had the problem only on specific nodes (if the pthread library wasn't working properly on some nodes, the new version wouldn't have fixed the problem on those nodes).

I guess we have to look for some delay in writing to /local, which would mean that two threads are more likely to want to write at the same time on these nodes(?). Alternatively, it may have been a purely stochastic phenomenon, and it is only by chance that we never saw the problem on one of the nodes and always saw it on the other three (I've tested each node 5 to 10 times, which may not be enough to reach statistical significance)...

Thank you very much for your help anyway!

All the best,
Alice
