Additional files: How should my short-read and/or FASTA files be formatted?

marabouboy edited this page Dec 13, 2021 · 5 revisions

INPUT

Input Additional Files: BAM and/or FASTA

This section explains the optional additional input files: the short-read sequencing file and the FASTA reference file.

Here I will show how to format the additional files that are most pertinent for our case: the short-read sequencing file and the FASTA reference.

1. Short-read sequencing:

  • Optionally, one can add short-read bulk RNA sequencing data and use it to confirm the existence of the splice site signals.
  • If this is desired, the input file first needs preprocessing: align the raw sequencing reads, then remove multi-mapped reads, supplementary alignments, and secondary alignments. This preprocessing is similar to the preprocessing of the long-read sequencing input described under the Fastq and SAM subsections.
  • Ideally, the input file is a compressed BAM file ([Short-read-Input.bam]) formatted as follows:
    [Figure: How to format the short-read sequencing data.]
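
The filtering step described above can be sketched in Python. This is only an illustration of the idea, not the preprocessing used by the tool itself: it works on SAM text lines (for example, a BAM file converted to SAM), it relies on the FLAG bits defined in the SAM specification, and it assumes that a low MAPQ value marks a multi-mapped read, which depends on the aligner. The names `keep_alignment`, `filter_sam`, and `min_mapq` are hypothetical.

```python
# FLAG bits as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_alignment(sam_line, min_mapq=1):
    """Return True for a mapped, primary alignment with MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # Assumption: a MAPQ below the cutoff indicates a multi-mapped read.
    # The right cutoff depends on the aligner that produced the file.
    return mapq >= min_mapq


def filter_sam(lines, min_mapq=1):
    """Yield header lines unchanged and only the alignments that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line
```

With samtools, a comparable filter would be `samtools view -b -F 2308 -q 1`, where 2308 is the sum of the three FLAG bits above. Check the MAPQ cutoff against your aligner's documentation before relying on it.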
2. FASTA reference file:

  • Another option is to add the reference genome of your target organism for the detection of canonical GU/AG splice site dinucleotides.
  • My recommendation is to download the reference file from NCBI and keep it in standard formatting. Standard formatting means that each chromosome/organism header starts with the > ("greater than") symbol, while each line of nucleotide sequence is roughly 108 nucleotides long (105-110 nt):
    [Figure: How to format the reference fasta file.]
  • If you want to use a multichromosomal organism, or an amalgamated reference FASTA file containing the reference sequences of multiple organisms, then the headers need to be formatted as follows:
    [Figure: How to have multiple chromosomal/organism fasta reference file.]
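
To check a reference file against the description above, a small Python sketch can list each header together with the total sequence length and the longest sequence line of its record. This is a minimal illustration under my own assumptions, not part of the tool: the function name `summarize_fasta` and the example headers are hypothetical, and the sketch does not enforce the specific header format required for multi-chromosome or multi-organism files.

```python
def summarize_fasta(lines):
    """Return a list of (header, sequence_length, longest_line) per FASTA record."""
    records = []
    header, total, longest = None, 0, 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, total, longest))
            header, total, longest = line[1:], 0, 0
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            total += len(line)
            longest = max(longest, len(line))
    if header is not None:
        records.append((header, total, longest))
    return records


# Hypothetical two-record example; real headers come from your reference file.
example = [">chrA example\n", "ACGT\n", "AC\n", ">chrB\n", "GGG\n"]
for name, length, width in summarize_fasta(example):
    print(name, length, width)
```

For a real file, pass an open file handle, for example `summarize_fasta(open("reference.fasta"))`, where the file name is a placeholder. Each record should show up with the header you expect and a line width in the range described above.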