Skip to content

Tutorial: Identification of repetitive circular DNA using Circle Map Repeats

iprada edited this page Jul 17, 2020 · 14 revisions

This is a step-by-step tutorial explaining how to go from the raw sequencing data (fastq files) to a interpretable file containing the chromosomal coordinates of the repetitive DNA circles present in your sample. In order to create friendly tutorial we have simulated reads from a tandem duplication present in the Saccharomyces cerevisiae genome. The aim of this tutorial is to use Circle-Map Repeats to detect the origin of theDNA circle formed from the repetitive region of the yeast genome.

Requirements to run this tutorial

Step 1: preparing and downloading the data

Downloading the raw data

We can download the left and right Illumina reads using the following commands

wget https://raw.githubusercontent.com/iprada/Circle-Map/master/tutorial/repetitive_region1.fastq
wget https://raw.githubusercontent.com/iprada/Circle-Map/master/tutorial/repetitive_region2.fastq

Downloading and preparing the reference genome

Now that we have downloaded the raw data, we need to download and prepare the yeast reference genome in order to align the circular DNA reads back to it. In this tutorial, we will use the SacCer3 version of the yeast genome, which is the most recent one. We can download the yeast chromosomes from UCSC:

wget https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz

Decompress it with the tar command:

tar xvf chromFa.tar.gz

And join the resulting chromosomes into a single file using cat:

cat chr*.fa > SacCer3.fa

Finally, in order to align the reads to the reference genome with bwa we need to create the index of the reference genome, which are a set of files required by the aligner to efficiently align the reads. We can do this using:

bwa index SacCer3.fa

Step 2: alignment of the reads to the reference genome

Now that we have prepared all the required files, we are ready to execute our workflow for detecting repetitive circular DNA. In this case, we will use the MEM algorithm under BWA:

bwa mem SacCer3.fa repetitive_region1.fastq repetitive_region2.fastq > aligned_repetitive_circle.sam

This command generates a SAM file containing the information about where and how the reads align to the yeast reference genome.

Notes:

  • If your input files are large, you can speed-up BWA increasing the number of cores using the -t option.

Step 3: post-processing of the alignment file

In this case, we need to sort the reads by alignment position and index the resulting BAM file so that we can execute Circle-Map in the next step. We can do so using SAMtools:

samtools sort -o sorted_aligned_repetitive_circle.bam aligned_repetitive_circle.sam

Notes:

  • If your input files are large, you can speed-up SAMtools increasing the number of cores using the -@ option.

Finally, we need to index the sorted file:

samtools index sorted_aligned_repetitive_circle.bam

Step 4: detecting the circular DNA using Circle-Map Repeats.

Finally, we can detect the circles using Circle-Map. Behind the scenes, Circle-Map Repeats is looking at reads with two high scoring alignments as these ones are indicative of circles formed from regions with homology.

Circle-Map Repeats takes as input two arguments. A BAM file sorted by alignment position (indicated with the -i argument) and the output filename (indicated with the -o argument). We can execute the algorithm as follows:

Circle-Map Repeats -i sorted_aligned_repetitive_circle.bam -o my_unknown_circle.bed

Now, we can look at the output of Circle-Map using:

less -S my_unknown_circle.bed

Where we can see the BED file generated by Circle-Map Repeats:

chrXII  451417  468889  3018    .       19.510931776556774      5.131087870100604       1.0     0.9081364829396326      0.002461080586080586

In this case, we have detected a DNA circle formed from the ribosomal DNA array, which is by far the most studied circle in the yeast genome.