Output: How is the output supposed to be interpreted and used?

Output

Output Files: Bed12, TSV, and TXT.

This section is dedicated to explaining the outputs.

I will break down each output that FLAME can generate.
Remember that not every output is always produced: which files appear depends on the flags used, most notably -B [Shortread.bam], -R [Reference.fasta], and --verbose.
Also, since the user can choose any prefix for the output files, the term [Prefix] denotes the user's choice of prefix.

1. [Prefix].Reference.txt:

This file is the standard output reference file, produced when the --verbose flag is specified.
Because it is a simple extraction of the input reference, it should closely resemble the input. The columns of [Prefix].Reference.txt are:
- The first column denotes the exon name in FLAME's internal naming system.
- The second column denotes the chromosome/reference genome of the exon, which matters if FLAME is used with a multi-chromosomal and/or multi-organism reference GTF file.
- The third column denotes the exon starting position.
- The fourth column denotes the exon length.
- The fifth column denotes the exon ending position.
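The column layout above can be loaded into a lookup table for use with the other outputs. This is only a minimal sketch: it assumes the file is tab-delimited, and the function name `read_reference` and the exon names in the example are hypothetical, not part of FLAME.

```python
import csv
import io
from typing import NamedTuple


class Exon(NamedTuple):
    name: str    # column 1: FLAME's internal exon name
    chrom: str   # column 2: chromosome / reference genome
    start: int   # column 3: exon starting position
    length: int  # column 4: exon length
    end: int     # column 5: exon ending position


def read_reference(handle):
    """Return a dict mapping exon name to its Exon record.

    Assumes tab-delimited columns; rows that do not parse
    (for example a header line) are skipped.
    """
    exons = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue
        try:
            exon = Exon(row[0], row[1], int(row[2]), int(row[3]), int(row[4]))
        except ValueError:
            continue
        exons[exon.name] = exon
    return exons


# Invented two-line example; real files come from FLAME with --verbose.
example = "Exon1\tchr1\t100\t50\t150\nExon2\tchr1\t300\t20\t320\n"
exons = read_reference(io.StringIO(example))
print(exons["Exon2"].start, exons["Exon2"].end)
```

With a real file, replace `io.StringIO(example)` with `open("[Prefix].Reference.txt")`.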

2. [Prefix].Annotated.bed and [Prefix].Incongruent.bed:

Both files are simply sorted and filtered versions of the input bed12 file.
They are mainly there in case further independent analysis is required and the raw bed12 reads are needed as input for whichever software, program and/or script the user wants to use.
Because these files are simple extractions, they should be very similar to the input file, with examples given for:
- [Prefix].Annotated.bed
- [Prefix].Incongruent.bed

3. [Prefix].AnnotatedTrans.txt and [Prefix].IncongruentTrans.txt:

The files look slightly different depending on whether one uses [Prefix].AnnotatedTrans.txt or [Prefix].IncongruentTrans.txt.
- In [Prefix].AnnotatedTrans.txt, each element (line) represents a combination of exons, the specifications of which can be found in [Prefix].Reference.txt.
- In [Prefix].IncongruentTrans.txt, each element (line) also represents a combination of exons, with the exception of number ranges, e.g. 134465 - 134472.
  A number range represents a possible novel exon that does not conform to the established exons in [Prefix].Reference.txt.
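To pull the possible novel exons out of a line of [Prefix].IncongruentTrans.txt, one can search for number ranges. The sketch below makes no claim about how FLAME separates the exons on a line; it only assumes that a range is written as two integers joined by a dash, as in the example above. The function name `novel_ranges`, the comma separator, and the exon names are hypothetical.

```python
import re

# Two integers joined by a hyphen or en dash, with optional spaces.
RANGE_PATTERN = re.compile(r"\b(\d+)\s*[-–]\s*(\d+)\b")


def novel_ranges(line):
    """Return every (start, end) number range found in one line."""
    return [(int(a), int(b)) for a, b in RANGE_PATTERN.findall(line)]


# Invented line; the separator between exons is an assumption.
line = "Exon1,134465 - 134472,Exon3"
print(novel_ranges(line))
```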

4. [Prefix].QuantAnnotated.tsv and [Prefix].QuantIncongruent.tsv:

The files look slightly different depending on whether one uses [Prefix].QuantAnnotated.tsv or [Prefix].QuantIncongruent.tsv.
- In [Prefix].QuantAnnotated.tsv, each element (line) represents a splice variant permutation:
  - The first column denotes the quantification count, in absolute numbers, of the splice variant permutation.
  - The second column denotes the length of the splice variant permutation, based on [Prefix].Reference.txt.
  - The third column denotes the number of exons that the splice variant permutation contains.
  - The fourth column denotes the isoform, i.e. the combination of exons that constitutes the splice variant permutation.
    This can then be translated with [Prefix].Reference.txt.
- [Prefix].QuantIncongruent.tsv is slightly more streamlined than its annotated counterpart.
  Each element (line) represents a splice variant permutation:
  - The first column denotes the quantification count, in absolute numbers, of the splice variant permutation.
  - The second column denotes the isoform, i.e. the combination of exons that constitutes the splice variant permutation.
    Pay special attention to number ranges, e.g. 156464 - 156741.
    A number range represents a possible novel exon that does not conform to the established exons in [Prefix].Reference.txt.
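Since the counts are absolute, it is often useful to rank the splice variants and express each as a fraction of the total. The following is a minimal sketch for the four-column [Prefix].QuantAnnotated.tsv layout described above. It assumes tab-delimited columns; the function names and the example rows are invented for illustration.

```python
import csv
import io


def read_quant_annotated(handle):
    """Parse count, length, exon number and isoform from each line."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue
        try:
            count, length, n_exons = int(row[0]), int(row[1]), int(row[2])
        except ValueError:
            continue  # skip a header or malformed line
        records.append(
            {"count": count, "length": length, "n_exons": n_exons, "isoform": row[3]}
        )
    return records


def relative_abundance(records):
    """Return (isoform, count, fraction of all counts), highest count first."""
    total = sum(r["count"] for r in records)
    ranked = sorted(records, key=lambda r: r["count"], reverse=True)
    return [(r["isoform"], r["count"], r["count"] / total) for r in ranked]


# Invented rows; the isoform notation is an assumption.
example = "30\t600\t2\tExon1,Exon3\n120\t900\t3\tExon1,Exon2,Exon3\n"
ranked = relative_abundance(read_quant_annotated(io.StringIO(example)))
for isoform, count, fraction in ranked:
    print(f"{isoform}\t{count}\t{fraction:.1%}")
```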

5. [Prefix].PotentialSplice.tsv:

[Prefix].PotentialSplice.tsv is one of the main output files and is central to the detection of possible novel exons and splice sites:
- The first column denotes the genomic/chromosomal position within your organism.
  Do note that most positions come in sets of four, each one nucleotide further along than the previous.
  This is by design, due to the variance function.
  The user needs to decide which of the four positions is the most correct, using this file in conjunction with the [Prefix].AdjacencyIncongruent.tsv file that will be explained later.
- The second column denotes the number of individual nanopore long-read sequencing reads that support the existence of the specific, potentially novel, splice site signal.
- The third column denotes the percentage of supporting nanopore long-read sequencing reads relative to the total number of long-read sequencing reads that have been classified as incongruent ([Prefix].Incongruent.bed).
  Notice that in the example, no splice site signal is below 1%, as this is the standard percentage threshold.
  Do remember, though, that this threshold is customizable to the user's liking.
- *Optional*: The fourth column denotes the presence of adjacent canonical splice site dinucleotides (GU/AG).
  There are four outcomes: "None", "GU", "AG", and "Both".
  - "None" denotes the absence of both canonical dinucleotides.
  - "GU"/"AG" denotes the presence of the canonical "GU" splice donor dinucleotide or the "AG" splice acceptor dinucleotide, respectively.
  - "Both" denotes the presence of both a canonical "GU" dinucleotide signal upstream of the designated genomic position and an "AG" dinucleotide signal downstream of it.
  Do note that this column will only be present if one also inputs the fasta reference sequence of one's target organism using the -R [Reference.fasta] flag.
- *Optional*: The fifth column denotes the number of short-read sequences supporting the specific splice site signal, in absolute numbers.
  N/A denotes that no short-read support is available for the splice site, essentially meaning 0 supporting short-read sequences.
  Do note that this column will only be present if one also inputs the short-read sequencing files using the -B [Shortread.bam] flag.
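Because the positions come in sets of adjacent nucleotides, it can help to group them before deciding which position in each set is the most plausible. The sketch below groups runs of consecutive positions and reports the best-supported position in each run. It assumes tab-delimited columns and that the percentage is a plain number (a trailing "%" is tolerated); the optional columns are kept as-is. Function names and example values are invented.

```python
import csv
import io


def read_potential_splice(handle):
    """Parse position, long-read count and percentage; keep extras untouched."""
    sites = []
    for row in csv.reader(handle, delimiter="\t"):
        try:
            position = int(row[0])
            reads = int(row[1])
            percent = float(row[2].rstrip("%"))
        except (ValueError, IndexError):
            continue  # skip a header or malformed line
        sites.append(
            {"position": position, "reads": reads, "percent": percent, "extra": row[3:]}
        )
    return sites


def group_adjacent(sites):
    """Group sites whose positions differ by exactly one nucleotide."""
    groups = []
    for site in sorted(sites, key=lambda s: s["position"]):
        if groups and site["position"] == groups[-1][-1]["position"] + 1:
            groups[-1].append(site)
        else:
            groups.append([site])
    return groups


# Invented example: two sets of four adjacent positions.
example = (
    "138390\t40\t1.2\n138391\t55\t1.6\n138392\t90\t2.7\n138393\t45\t1.3\n"
    "138476\t38\t1.1\n138477\t60\t1.8\n138478\t95\t2.8\n138479\t50\t1.5\n"
)
groups = group_adjacent(read_potential_splice(io.StringIO(example)))
for group in groups:
    best = max(group, key=lambda s: s["reads"])
    print(group[0]["position"], "-", group[-1]["position"], "best:", best["position"])
```

The position with the most long-read support is only a starting point; the dinucleotide and short-read columns still need to be weighed as described for [Prefix].AdjacencyIncongruent.tsv.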

6. [Prefix].AdjacencyAnnotated.tsv:

[Prefix].AdjacencyAnnotated.tsv is central to analyzing the exon connectivity of the sequencing reads that follow the established reference:
- The file generated is a large table (in .tsv format) where the columns represent the exon donor sites and the rows represent the exon acceptor sites.
The figure shows the same table as generated above.
However, this version has [Prefix].Reference.txt added and certain sections highlighted, which I will go through in order:
1. Highlight 1 denotes the actual absolute numerical value of exon connectivity.
  The highlighted value (112184) also happens to be the highest of any exon connection in this example.
2. Highlight 2 denotes the exon acceptor (where the exon connects to).
3. Highlight 3 denotes the exon donor (where the exon connects from).
4. Highlight 4 denotes the reference file that one should use to get the exact genomic position.
The optimal method for interpreting the adjacency matrix, specifically the Annotated Adjacency Matrix, is to start with your column (exon donor, Highlight 3) and look at the breakdown and quantification for each row (exon acceptor, Highlight 2). Read off the value (Highlight 1) and translate the exon names into genomic positions using the reference file (Highlight 4).
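The same column-first reading can be done programmatically. This is a sketch only: it assumes the .tsv has a header row of donor labels, a first column of acceptor labels, and integer counts in the cells. The function names and the small example matrix are invented.

```python
import csv
import io


def read_adjacency(handle):
    """Return {donor: {acceptor: count}} from a donor-column, acceptor-row table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    donors = header[1:]  # first header cell is the empty corner
    matrix = {donor: {} for donor in donors}
    for row in reader:
        if not row:
            continue
        acceptor, values = row[0], row[1:]
        for donor, value in zip(donors, values):
            matrix[donor][acceptor] = int(value)
    return matrix


def acceptors_for_donor(matrix, donor):
    """List (acceptor, count) for one donor, highest count first, zeros dropped."""
    pairs = [(acc, n) for acc, n in matrix[donor].items() if n > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Invented matrix: columns are donors, rows are acceptors.
example = "\tExon1\tExon2\nExon2\t5\t0\nExon3\t2\t7\n"
matrix = read_adjacency(io.StringIO(example))
print(acceptors_for_donor(matrix, "Exon1"))
```

The exon names returned here would then be translated into genomic positions with [Prefix].Reference.txt.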

7. [Prefix].AdjacencyIncongruent.tsv:

This section explains how to detect novel splicing using [Prefix].AdjacencyIncongruent.tsv.
It will be quite large, as the manual determination of novel exons requires two files, in much the same fashion as [Prefix].AdjacencyAnnotated.tsv.
However, where the annotation of exon connectivity is straightforward because the exons are already defined, the determination of novel exons requires a bit of decision making, dependent on multiple factors.
The original file is a .tsv file, which one can open with MS Excel, Google Docs, or however the user chooses to view matrices.
Note that this matrix can be quite large, with a maximum size of 401x401 cells, as that is the theoretical limit on how many splice sites can be suggested using the 1% threshold.
What I have selected is a cell with the value 2939, meaning that 2939 long reads have these approximate genomic positions.
To highlight the selected section and the important elements:
- Highlight 1 denotes the selected cell with the aforementioned value of 2939.
- Highlight 2 denotes the splice donor position of the suggested novel exon (e.g. 138392).
- Highlight 3 denotes the splice acceptor position of the suggested novel exon (e.g. 138478).
- Highlight 4 denotes the section that will be cropped in order to more easily focus on the relevant section.
Let's take a closer look at the relevant section (Highlight 4).
Let's add the [Prefix].PotentialSplice.tsv file for easier determination of the suggested novel exon, and highlight the relevant sections:
- Highlight 1 denotes all of the possible novel splice acceptor sites when using the selected (e.g. 138392) splice donor site.
  Note that there are multiple possible splice acceptor sites and this is what allows for the detection of more complex splicing patterns.
- Highlight 2 denotes all of the possible novel splice donor sites when using the selected (e.g. 138478) splice acceptor site.
  Note that there are multiple possible splice donor sites and once again, this is what allows for the detection of more complex splicing patterns.
- Highlight 3 denotes the section of the [Prefix].PotentialSplice.tsv file that is relevant in determining whether the suggested splice donor site and splice acceptor site are canonical and represent a possible novel exon.
  Note that this highlight also includes the adjacent dinucleotide signal as well as the number of supporting short-read sequences.
Following the logic laid out so far, the suggested novel exon is an exon that spans 138392 – 138478.
However, this is where the additional input files come into play:
1. The suggested splice donor site (138392) does not have any adjacent canonical dinucleotide signal associated with splice donation (GU/GT), but does have an adjacent canonical dinucleotide splice acceptor signal (AG).
2. However, there do seem to be 18 short-read sequences that support this position.
3. The suggested splice acceptor site (138478) does not have any adjacent canonical dinucleotide signal associated with splice acceptance (AG), but does have an adjacent canonical dinucleotide splice donor signal (GU/GT).
4. In addition, there do not seem to be any short-read sequences that support this position.
5. Despite the lack of short-read support at the exact chosen splice acceptor position (138478), there does seem to be heavy short-read sequencing support just two nucleotides downstream (138480), which also has a canonical dinucleotide splice acceptor signal (AG).
Based on these five factors, it is more likely that the true genomic position of the exon is 138392 – 138480.
However, do note that it is still slightly odd that the suggested splice donor site does not have any canonical dinucleotide associated with splice donation (GU/GT).
Using this sequence of logic, one can deduce a true novel exon.
While this could be automated for the detection and suggestion of novel exons, there are factors that require human interpretation and knowledge of one's data that cannot be standardized.
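The reasoning above can be partly mimicked in code, as an aid to (not a replacement for) the manual decision. The sketch below looks at the positions within a small window around a suggested site and prefers those with the expected canonical dinucleotide, then those with more short-read support, then those closest to the original suggestion. The function name `refine_site`, the window size, and the short-read count of 250 at position 138480 are all assumptions made for illustration; this is not FLAME's own behaviour.

```python
def refine_site(position, role, sites, window=3):
    """Suggest the most plausible position near a candidate splice site.

    position: the suggested genomic position.
    role:     "donor" (expects "GU") or "acceptor" (expects "AG").
    sites:    {position: (dinucleotide_signal, short_read_count)}, e.g. taken
              from the optional columns of [Prefix].PotentialSplice.tsv.
    """
    expected = {"donor": "GU", "acceptor": "AG"}[role]
    candidates = []
    for pos, (signal, short_reads) in sites.items():
        distance = abs(pos - position)
        if distance <= window:
            canonical = signal in (expected, "Both")
            # Tuples compare left to right: canonical signal first,
            # then short-read support, then proximity to the suggestion.
            candidates.append((canonical, short_reads, -distance, pos))
    if not candidates:
        return position
    return max(candidates)[3]


# Signals and the count of 18 follow the walkthrough; 250 is invented
# to stand in for "heavy" short-read support.
sites = {
    138392: ("AG", 18),
    138478: ("GU", 0),
    138480: ("AG", 250),
}
print(refine_site(138392, "donor", sites))     # stays at 138392
print(refine_site(138478, "acceptor", sites))  # moves to 138480
```

Note that the donor site stays at 138392 only because no better candidate lies nearby; the absence of a GU signal there is still something the user has to judge.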

8. [Prefix].RawRanges.tsv:

This output is optional, as it requires the verbose (--verbose) flag.
The file is simply a list of the genomic ranges in which a possible novel exon might exist.

9. [Prefix].ShortreadSplice.tsv:

This output is optional, as it requires both short-read sequencing data as input (-B [Shortread.bam]) and the verbose (--verbose) flag:
- The first column denotes the genomic position, in other words, where in the genome the splice was found.
- The second column denotes the number of short-read sequencing reads containing a splice signal at that specific genomic location.
  The numbers shown are absolute numbers.

Index

1. Input - Input sequences: How is my input supposed to be formatted/formulated?

2. Input - Input reference: How is my reference supposed to be formatted/formulated?

3. Input - Optional additions: How to format the optional short-read bulk RNA sequencing data and/or the reference genome in fasta format?

4. Output - Output files: How is the output supposed to be interpreted and used?
