Skip to content

Tutorial: Identification of circular DNA using Circle Map Realign

iprada edited this page Sep 6, 2019 · 3 revisions

This is a tutorial explaining step by step how to go from the raw data (the fastq files) to a interpretable, tab separated BED file indicating the chromosomal coordinates of DNA circles. To make the tutorial, we have simulated Illumina reads coming from a circular DNA from some unknown region of the human genome. The aim of this tutorial is to use Circle-Map in order to recapitulate the circular DNA origin.

Requirements to run this tutorial

Step 1: preparing and downloading the data

Downloading the raw data

We can download the left and right Illumina reads using the following commands:

wget https://raw.githubusercontent.com/iprada/Circle-Map/master/tutorial/unknown_circle_reads_1.fastq
wget https://raw.githubusercontent.com/iprada/Circle-Map/master/tutorial/unknown_circle_reads_2.fastq

Downloading and preparing the reference genome

Once we have the raw data, we need to download the human genome reference, to map the reads back to the genome. In this tutorial we will use the hg38 version, which can be downloaded from UCSC using:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

Now, decompress it using gzip:

gunzip -d hg38.fa.gz

Finally we need create the index of the reference genome, which are a set of files required by the aligner to map the reads to genome. They can be created using the following command:

bwa index hg38.fa

And with the following command for Circle-Map index:

samtools faidx hg38.fa

Step 2: Alignment of the reads to the reference genome

Now, we can begin our workflow to detect the circular DNA. We need to align the Illumina reads back to the reference genome. For this case, we will use the algorithm MEM under BWA and we will invoke it as follows:

bwa mem -q hg38.fa unknown_circle_reads_1.fastq unknown_circle_reads_2.fastq > unknown_circle.sam

This generates a SAM file, which contains the information about where and how the reads have aligned to the genome.

Notes:

  • We use the -q option to assign independent mapping quality scores to split read alignments in BWA. This improves the estimation of the breakpoint graph weights in Circle-Map probabilistic model.
  • In case you have a computer with many cores, you can win speed by increasing the number of threads using BWA -t option.

Step 3: Preparing the files for Circle-Map

In this step, we will process the the SAM files in order to fit the formats required by Circle-Map.

We will sort the SAM file by read name and convert it to BAM ( SAM binary format) using:

samtools sort -n -o qname_unknown_circle.bam unknown_circle.sam

This file is used by Circle-Map to extract the circular DNA read candidates and to estimate the paremeters of the prior distribution.

We also need to sort the BAM file by the leftmost mapping coordinate:

samtools sort -o sorted_unknown_circle.bam unknown_circle.sam

This file is used by Circle-Map to calculate the genomic coverage metrics.

Notes:

  • In case you have a computer with many cores, you can speed up this commands by increasing the number of threads using SAMtools sort -@ option

Now, we will extract the reads that indicate circular DNA rearrangements ( partially mapped reads and discordant read pairs) using Circle-Map:

Circle-Map ReadExtractor -i qname_unknown_circle.bam -o circular_read_candidates.bam

And we will sort the read candidates by coordinate with SAMtools:

samtools sort -o sort_circular_read_candidates.bam circular_read_candidates.bam

Notes:

  • In case you have a computer with many cores, you can speed up this command by increasing the number of threads using SAMtools sort -@ option

Indexing the BAM files

To finish this step, we need to index the BAM file:

samtools index sort_circular_read_candidates.bam
samtools index sorted_unknown_circle.bam

This command will generate two BAI files, so that Circle-Map can retrieve efficiently the aligned reads from the sorted candidate circular and raw BAM files

Step 4: Detecting the circular DNA

Finally, we are ready to use Circle-Map Realign to detect the circular DNA. Circle-Map will take as input the following files we have generated above:

  • sort_circular_read_candidates.bam: The circular DNA read candidates. Used for detecting the circular DNA
  • qname_unknown_circle.bam: The read name sorted BAM file. Used for calculating the realignment prior
  • sorted_unknown_circle.bam: The coordinate sorted BAM. Used for calculating circular DNA coverage metrics
  • hg38.fa: Reference sequence. Used for realignment of the partially mapped reads.

And with following command:

Circle-Map Realign -i sort_circular_read_candidates.bam -qbam qname_unknown_circle.bam -sbam sorted_unknown_circle.bam -fasta hg38.fa -o my_unknown_circle.bed

Notes:

  • In case you have a computer with many cores, you can speed up this command by increasing the number of threads using Circle-Map Realign -t option

You can look at the output using the command:

less -S my_unknown_circle.bed

Where we can see the generated BED file with the following circular DNA coordinates in BED format:

chr7    143911105       143917553       12      7       251.0   29.97115384615385       5.408852146902424       1.0  0.9988734509951184       0.0

Which indeed, has the coordinate from where we have simulate the circular DNA:

chr7    143911105       143917553