Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make coverage maps consistent with contigs coverage plot #479

Closed
3 tasks done
donkirkby opened this issue Aug 12, 2019 · 3 comments
Closed
3 tasks done

Make coverage maps consistent with contigs coverage plot #479

donkirkby opened this issue Aug 12, 2019 · 3 comments

Comments

@donkirkby
Copy link
Member

donkirkby commented Aug 12, 2019

We added the contigs coverage plot as part of #442, but contigs that align to a gene region don't always get included in the coverage maps. The reason is that we align the contig to a seed's full-genome nucleotide sequence for the contigs coverage plot, and we align the translated consensus sequence to a gene region's amino acid sequence for the coverage plots.

Consider using the landmark coordinates and reading frames to choose where to split up the read counts when generating coverage plots.

Consider showing how good the alignment is in the contigs coverage plot, especially when some parts of the contig align well and others don't.

Examples where the two plots are inconsistent:

  • 73051ANS5A1-HCV-NS5a_S89 from 15-Jul-2016.M01841 - The 3-HCV-1a contig didn't get included in the coverage map for NS5a, probably because the region that aligns well is split between NS4b and NS5a.
  • 1693-NOGEL100D22-HIV_S49 in 02-Aug-2019.M04401 - The HIV1B-gag didn't get a coverage map at all, because it has a giant deletion in the middle, while the contigs coverage plot covers roughly two thirds of gag.
  • 1693-NOGEL100E6-HIV_S53 in 02-Aug-2019.M04401 - The 1-HIV1-B-US-KT284371-seed didn't get included in the HIV1B-nef coverage map, even though it is in the contigs coverage plot.

Update

I added a class, ConsensusAligner to track the relationship between the consensus sequence and the reference sequence. This is mostly finished, with some small tasks remaining:

  • Increase insertion counters for amino regions.
  • Increase insertion counters for nucleotide-only regions.
  • Does the 2.2 match for 1693-NOGEL100E6-HIV_S53 really span the deletion, or are there two disconnected alignments?
@donkirkby
Copy link
Member Author

Now that we've switched the genome coverage plot to use minimap2 in #480, we should be able to use the minimap2 alignment to clip out the gene regions, instead of aligning again with Gotoh.

@donkirkby
Copy link
Member Author

This is mostly handled by the fix for #589, but it seems to be failing when a single minimap2 alignment contains a large deletion.

donkirkby added a commit that referenced this issue Jul 13, 2021
@donkirkby
Copy link
Member Author

Reopening, because the current strategy breaks alignments up into too many small pieces.
New strategy:

  • Walk through the alignment looking for all possible breakpoints: insertions or deletions longer than 30 bases or that shift the reading frame.
  • When you find a breakpoint, ignore any breakpoints in the next 30 bases. We don't want to deal with smaller sections than that.
  • Take each section between breakpoints, and translate it into all three reading frames with some padding on either end. Try aligning each translation to the corresponding section in the amino acid reference to see which reading frame is the best match.
  • Walk through all the aligned sections, and combine neighbours that are in the same reading frame, if they don't have a big insertion or deletion between them.
  • Translate any newly combined sections, and redo the alignment.
  • Finally, transfer the counts from the seed aminos to the report aminos, based on the aligned sections.
  • Set deletion counts, even on large deletions between sections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant