Input: How is my input supposed to be formatted or formulated?

marabouboy edited this page Dec 13, 2021 · 5 revisions

Input

Input File: Bed12

This section explains the input file, which must be in BED12 format, and how to produce it.

I will break down the input into four categories, depending on how far you have proceeded in the bioinformatic analysis after sequencing with a MinION, a GridION, or a PromethION.

1. Fast5/.f5 format:

  • If you have just finished sequencing and your starting point is the Fast5/.f5 format, my recommendation is to basecall the files and convert them into Fastq format.
  • This is done with the guppy software, or any other open-source or proprietary basecalling software, according to the developer's specifications and instructions.
  • Because Fast5/.f5 files are compressed by default, I would only be able to show compressed data, so the figure is omitted.
2. Fastq/.fq format:

  • If you've basecalled and/or are starting with the raw Fastq/.fq sequencing format, my recommendation is to align the reads to a reference and convert the result into .bam format.
  • This is most easily done with an established aligner, such as Minimap2, preferably with these three parameters, followed by the complete command:
    1. Splice preset: -x splice
    2. SAM output format: -a
    3. Removal of secondary alignments: --secondary=no
    4. Complete command (minimap2 takes the reference before the reads): minimap2 -ax splice --secondary=no [Reference.fasta] [Input.fastq] > [Output.sam]
  • Ideally, your Fastq/.fq input file should have a formatting similar to this: [Figure: Example of Nanopore long-read sequencing reads]
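
Since the example figure is not available here, the expected Fastq layout can be checked programmatically instead. The following is a minimal sketch, assuming unwrapped 4-line records (header starting with `@`, sequence, `+` separator, quality string of the same length as the sequence). The function name `parse_fastq` and the example read are hypothetical and only serve as illustration.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(stripped, None)
        plus = next(stripped, None)
        qual = next(stripped, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated record starting at {header!r}")
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"separator must start with '+': {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # The read ID is the first whitespace-delimited token after '@'.
        read_id = (header[1:].split() or [""])[0]
        yield read_id, seq, qual


# Made-up record, for illustration only.
example = ["@read1 runid=abc\n", "ACGTACGT\n", "+\n", "!''*((((\n"]
records = list(parse_fastq(example))
```

To check a real file, pass an open file handle instead of the list, for example `for read_id, seq, qual in parse_fastq(open("Input.fastq")): ...`.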
2.5. SAM/.sam format:

  • If you've aligned and/or are starting with the SAM/.sam aligned sequencing format, my recommendation is to sort and compress the sam-file into a bam-file. You can also optionally remove secondary and supplementary alignments, as these may skew the results by enriching certain splice variants.
    1. Filtering of Secondary (SAM flags 256 and 272) and Supplementary (SAM flags 2048 and 2064) reads:
      awk '$2 != 256 && $2 != 272 && $2 != 2048 && $2 != 2064' [Input-file.sam] > [Input-Output1.sam]
    2. Sorting and Compression of the reads:
      samtools sort [Input-Output1.sam] > [Input-Output2.bam]
    3. Indexing of the file:
      samtools index [Input-Output2.bam]
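
The filtering step above matches exact flag values. As a rough sketch of the same idea expressed with flag bits, the following tests the secondary (0x100) and supplementary (0x800) bits defined in the SAM specification, so any flag combination containing either bit is dropped. The function names and the toy lines are hypothetical; header lines (starting with `@`) are always kept.

```python
from typing import Iterable, List

SECONDARY = 0x100      # 256: secondary alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary alignment


def keep_sam_line(line: str) -> bool:
    """Return True for header lines and for primary alignments."""
    if line.startswith("@"):
        return True  # SAM header line
    flag = int(line.split("\t")[1])  # column 2 is the bitwise FLAG
    return (flag & (SECONDARY | SUPPLEMENTARY)) == 0


def filter_sam(lines: Iterable[str]) -> List[str]:
    """Keep headers and primary alignments, preserving input order."""
    return [line for line in lines if keep_sam_line(line)]


# Toy lines with only the first columns filled in, for illustration.
toy = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100",
    "read1\t2048\tchr2\t500",
    "read2\t16\tchr1\t900",
    "read2\t272\tchr3\t50",
]
kept = filter_sam(toy)
```

For unpaired long reads the result is the same as filtering on the four exact values 256, 272, 2048, and 2064; the bit test simply does not depend on which other bits are set.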
3. BAM/.bam format:

  • If you've aligned and/or are starting with the processed BAM/.bam aligned sequencing format, my recommendation is to convert the bam-file into a BED12/.bed12 format using bedtools, as this BED12-file is the main input for FLAME.
  • According to the developers of bedtools, the transformation into BED12 format requires the use of the bamtobed function and the bed12 flag.
    1. bamtobed function: bedtools bamtobed
    2. BED12 flag: -bed12
    3. Complete command: bedtools bamtobed -bed12 -i [Input.bam] > [Output.bed]
  • Ideally, your BAM/.bam input file should have a formatting similar to this: [Figure: Example of Nanopore long-read aligned sequencing reads]
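
The steps from Fastq to BED12 described so far can be collected into one place. The sketch below only builds and prints the command lines; it does not run anything. It assumes that minimap2 is given a reference sequence before the reads (minimap2 requires a target to align against), and it leaves out the optional filtering of supplementary reads. The function names, the file names, and the `prefix` naming scheme are hypothetical.

```python
import shlex
from typing import List, Optional, Tuple

# (argv, file that receives stdout, or None if the tool writes its own output)
Step = Tuple[List[str], Optional[str]]


def build_steps(reference: str, fastq: str, prefix: str) -> List[Step]:
    """Return the align, sort, index, and BED12 conversion commands."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    bed = f"{prefix}.bed"
    return [
        (["minimap2", "-ax", "splice", "--secondary=no", reference, fastq], sam),
        (["samtools", "sort", "-o", bam, sam], None),
        (["samtools", "index", bam], None),
        (["bedtools", "bamtobed", "-bed12", "-i", bam], bed),
    ]


def render(step: Step) -> str:
    """Format one step as a shell command line, with redirection if needed."""
    argv, stdout_path = step
    command = shlex.join(argv)
    if stdout_path is not None:
        command += f" > {shlex.quote(stdout_path)}"
    return command


steps = build_steps("ref.fa", "reads.fastq", "sample")
for step in steps:
    print(render(step))
```

Printing the commands first makes it easy to inspect them before running them by hand or through `subprocess.run`.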
4. BED12/.bed format:

  • If you've already converted the input file into the prerequisite BED12 format, the input file is ready for FLAME.
  • My recommendation is to first test the pipeline by running the program without any additional input files. Refer to the wiki-section regarding reference creation and annotation.
    1. FLAME function: ./FLAME -I [Input.bed] -GTF [Reference.gtf]
  • Ideally, your BED12/.bed input file should have a formatting similar to this: [Figure: Example of Nanopore long-read aligned sequencing reads, in bed12 format]
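
In place of the missing figure, here is a minimal sketch of what a BED12 line contains and how it can be sanity-checked before it is passed on. It assumes the standard 12 tab-separated BED columns, where columns 10 to 12 hold the block (exon) count, the comma-separated block sizes, and the block starts relative to the read start. The function name `parse_bed12` and the example line are hypothetical, and this is not FLAME's own parser.

```python
from typing import Dict, List, Tuple


def parse_bed12(line: str) -> Dict[str, object]:
    """Parse one BED12 line and return its name, strand, and exon blocks."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    chrom, name, strand = fields[0], fields[3], fields[5]
    start, end = int(fields[1]), int(fields[2])
    block_count = int(fields[9])
    # Block lists may carry a trailing comma, so strip it before splitting.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if not (block_count == len(sizes) == len(starts)):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    if starts[0] != 0 or starts[-1] + sizes[-1] != end - start:
        raise ValueError("blocks do not span chromStart to chromEnd")
    # Convert relative block starts into absolute (start, end) coordinates.
    exons: List[Tuple[int, int]] = [
        (start + s, start + s + n) for s, n in zip(starts, sizes)
    ]
    return {"chrom": chrom, "name": name, "strand": strand, "exons": exons}


# Made-up spliced read with two blocks, for illustration only.
example_line = "chr1\t1000\t5000\tread1\t60\t+\t1000\t5000\t255,0,0\t2\t500,1000,\t0,3000,\n"
parsed = parse_bed12(example_line)
```

In this example the read covers two blocks, 1000 to 1500 and 4000 to 5000, which is how a spliced alignment with one intron appears in BED12.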