Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make portcullis work with long reads: ideas #41

Open
lucventurini opened this issue Mar 14, 2019 · 12 comments
Open

Make portcullis work with long reads: ideas #41

lucventurini opened this issue Mar 14, 2019 · 12 comments
Assignees

Comments

@lucventurini
Copy link
Collaborator

Hi Dan,
I have been playing with portcullis for long reads (Nanopore) and I see some potential - however, it does require changing a bit the conditions for the self training.
I can experiment on that side - it's pure Python code - but I wonder, would it be possible to have the filter command line select between different pre-sets? For example, at the moment portcullis ships with two JSON presets ("balanced" and "precise") but I cannot see any way of selecting either through the command line.
So we could add a parameter to the command line like:

-p, --preset         Preset for the initial selection of positive and negative junctions. Either choose one of the available presets or provide your own folders (see documentation for details.
                           Available presets:
                           - balanced
                           - illumina
                           - nanopore

PS: just to reiterate: Portcullis functions out of the box with long read BAMs (I think it does not correctly call the strandedness, but it's not fundamental). It only needs some adjustment on the training rules, and the ability to choose at run time.

@maplesond
Copy link
Collaborator

maplesond commented Mar 14, 2019

Hey @lucventurini, good to know portcullis is working on long reads. At the moment the --balanced option is disabled (it always uses balanced by default so strictly speaking it's the precise option that is disabled), so to activate it you either need to recompile, changing the last argument of line 756 https://github.com/maplesond/portcullis/blob/master/src/junction_filter.cc to true. Or if you don't want to recompile go into the data directory backup the current balanced directory and replace with the new rules.

What you suggest is easily doable though providing we have suitable rules to work with. If you have a set that works well let me know and I'll try and find some time to add the extra command line options in and necessary logic.

@maplesond
Copy link
Collaborator

By the way, I suspect the reason the strand calling doesn't work well for long reads is that we currently need 95% of alignments to agree to call the read strand column, which feeds into the consensus strand column. Possibly the depth of the longs reads is meaning these can easily get set to unknown?

@lucventurini
Copy link
Collaborator Author

Or if you don't want to recompile go into the data directory backup the current balanced directory and replace with the new rules.

For the time being that is what I am doing, for the purposes of testing. Having a command line switch (and maybe even the possibility of providing a custom folder yourself) will be more elegant in the long run, but for it will be enough.

What you suggest is easily doable though providing we have suitable rules to work with. If you have a set that works well let me know and I'll try and find some time to add the extra command line options in and necessary logic.

Not yet, working on it. Not surprisingly, the "mean_mismatches" parameter is quite harmful though - practically all long junctions have at least 30 or so mismatches per read on average. Maxmmes is also quite useless for the opposite reason: practically all junctions have on both sides an anchor as long as an exon, making this a very weak filter.

By the way, I suspect the reason the strand calling doesn't work well for long reads is that we currently need 95% of alignments to agree to call the read strand column, which feeds into the consensus strand column. Possibly the depth of the longs reads is meaning these can easily get set to unknown?

I am not so sure about that. The dataset I am analysing (here: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md) is extremely well covered (each BAM has ~19 million reads). Although I am analysing only chr22 for the time being, I daresay that coverage should not be an issue for the strand detection algorithm.

Could it be that Heng Li himself broke the BAM compatibility in MiniMap2, and instead of using the XS tag for the strand, he uses the ts tag instead?

@maplesond
Copy link
Collaborator

maplesond commented Mar 16, 2019

Hi @lucventurini,

That sounds logical, makes sense to reduce or eliminate the use of MaxMMES and mean mismatches for nanopore. Interesting that you have that many reads, I kind of assumed you'd have a lot less for nanopore, what kind of average depth do you have on your genome?

Regarding XS tag, that makes life easier for portcullis if present but isn't absolutely required, at least for Illumina paired end. The relevant logic is here: https://github.com/maplesond/portcullis/blob/master/lib/src/bam_alignment.cc#L93.

Dan

@lucventurini
Copy link
Collaborator Author

Hi @maplesond ,

Interesting that you have that many reads, I kind of assumed you'd have a lot less for nanopore, what kind of average depth do you have on your genome?

This is not my experiment, so I do not know whether the yield is representative. However, for this particular example, on chromosome 22 I see that junctions have an average coverage of ~50X. I am not sure I would use the nb_* metrics for this instance, though, until I verify that this high coverage is not something obtained through stacking multiple runs together.

Regarding XS tag, that makes life easier for portcullis if present but isn't absolutely required, at least for Illumina paired end. The relevant logic is here: https://github.com/maplesond/portcullis/blob/master/lib/src/bam_alignment.cc#L93.

I see. It's strange then that the read strand is not inferred correctly - maybe it's a more generic bug with single-end reads?


More in general, I have been playing around with the data, and I am seeing that the JAD metrics seem to be very predictive of the correctness of a particular junction, for long reads. E.g.:

$ j_index.correct.value_counts()
False    15471
True      3915

$ j_index[(j_index.JAD18 > 0)].correct.value_counts()
True     1985
False     275

$ j_index[(j_index.JAD18 > 0) & (j_index.suspicious == 0) & (j_index.entropy > 1)].correct.value_counts()
True     1920
False     139

$ j_index[(j_index.JAD18 > 0) & (j_index.suspicious == 0) & (j_index.entropy > 1) & (j_index.primary_junc == 1)].correct.value_counts()
True     1791
False      43

# Without the JAD metric, the precision is much worse
$ j_index[(j_index.suspicious == 0) & (j_index.entropy > 1) & (j_index.primary_junc == 1) ].correct.value_counts()

True     2676
False     537

I have been reading the documentation and the definition in the original FineSplice article, but I am still not completely clear on what these metrics represent and - more crucially - whether it would be foolhardy to use them for the positive layer. The predictive power associated with JAD values over ~10 is, very, very high though, so this could really be an easy win!

lucventurini added a commit that referenced this issue Mar 18, 2019
@lucventurini
Copy link
Collaborator Author

Hi @maplesond,
So reading the code in junction.cc, I think I understand that to calculate the JAD score, portcullis checks the minimum distance between one of the junction borders and the first mismatch on the read. So a read with a mismatch 10 bps from the left junction dude and another 5 bps from the right junction side will increase the JAD scores from 1 to 5, but not any else.
So the JAD scores are a measure of goodness of the alignment between the reads and the genomic string surrounding the junction.

Is my understanding correct?

lucventurini added a commit that referenced this issue Mar 19, 2019
…tial training rules (see #41). Corrected deprecation warning in rule_filter.py
@maplesond
Copy link
Collaborator

Hi @lucventurini, yes that's basically correct. The logic essentially trims the alignment to the closest mismatch to the junctions site (either upstream or downstream). It also takes into account deviation of expected coverage around the junction. So with short reads it would be expected to drop off quick the further you are away from the junction site. With longer reads you'd expect a slower drop off.

@lucventurini
Copy link
Collaborator Author

OK, thanks for clarifying! Would you think that basing an initial filter on, say, JAD15>=3, would be sensible?

@maplesond
Copy link
Collaborator

I suspect it will depend on how deep the coverage is but with long reads I guess that setting sounds reasonable.

Also I may have been wrong in my previous statement regarding trimming. It definitely takes mismatches into account but not 100% sure if it effectively trims after thinking about it. You'd need to check the code.

@lucventurini
Copy link
Collaborator Author

Hi @maplesond , apologies for not having replied earlier! I have been actively working on ideas to make the selection function with nanopore reads (see the "longreads" branch). The wall I am hitting at the moment is that the quality of data seems really bad .. on RNA reads the junction precision, off the aligner, is ~20%. This is still recoverable as I can get to ~80% precision (depending on how much I want to be biased towards junction recall).
cDNA reads, on the other hand, seem like an almost desperate case. The junction precision out of the gate is about ~3% (yes, three). I can get to 30%, so a 10X increase, but I would not call it a victory yet.

In any case, now Portcullis will accept custom rules on the fly and outside of the installation folder. I am also working on getting some new metrics up to help with Nanopore reads. Please see the branch itself for details!

@maplesond
Copy link
Collaborator

Hi @lucventurini, how's this going? Any changes to migrate over? I've been working on a bunch of improvements to the build system in portcullis, and fixing a few bugs at the same time. I was planning to get a new release out at some point, so would be nice to get this change in too.

@lucventurini
Copy link
Collaborator Author

Hi @maplesond, I had to put this on the backburner for a little bit while I had to work on other projects. I should be receiving some data soon, which should allow me to continue development on this.

I will let you know ASAP.

maplesond pushed a commit that referenced this issue May 21, 2019
…tial training rules (see #41). Corrected deprecation warning in rule_filter.py

Also Resetting doc prefix

Corrected a bug that caused saltuarily segmentation faults when the size of the positive or the negative set were too small (smote.cc)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants