
This project plots the effect of selecting different k-mer length parameters in two de novo genome assembly tools, ABySS and Velvet.


groverkaushal/Varying_kmer_effect_in_denovo_assembly


Varying_kmer_effect_in_denovo_assembly

Overview

This project examines how the choice of k-mer length affects the output of two de novo genome assembly tools, ABySS and Velvet. The workflow consists of the following steps:

  1. FASTQ Read File Download using SRA Toolkit
  2. Quality Assessment with FastQC
  3. High-Quality Read Filtering using Trimmomatic
  4. De novo Whole-Genome Assembly using ABySS and Velvet with different k-mer lengths
  5. Comparative Analysis of Assembly Quality using a Python Script

Dataset

  • Dataset ID: SRX23809475
  • SRA Run ID: SRR28196086
  • Instrument: Illumina MiSeq
  • Layout: Paired
  • Organism: E. coli
  • Total Bases: 355.6 Mb
  • No. of reads: 752,285 read pairs
  • Estimated Genome Size: 5 Mb
  • Estimated Read Coverage: 355.6 Mb / 5 Mb ≈ 71X
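
The coverage estimate above is total sequenced bases divided by the estimated genome size. A minimal sketch of that arithmetic follows; the function name `estimated_coverage` is introduced here for illustration and is not part of the repository.

```python
def estimated_coverage(total_bases: float, genome_size: float) -> float:
    """Return expected sequencing depth: total sequenced bases / genome size.

    Both arguments must use the same unit (here, megabases).
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Values taken from the dataset list above, in megabases.
coverage = estimated_coverage(355.6, 5.0)
print(f"Estimated coverage: {coverage:.0f}X")
```

The 5 Mb genome size is itself an estimate, so the resulting depth should be read as approximate.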

System Requirements

  • Python 3
  • Perl 5.32.1
  • Conda

Installing Dependencies

  1. Create a Conda virtual environment and install the Python dependencies:

    conda create -n grover python=3.9
    conda activate grover
    conda install matplotlib
  2. Install Trimmomatic:

    sudo apt-get install -y trimmomatic
  3. Install FastQC:

    sudo apt -y install fastqc
  4. Install ABySS:

    conda install -c bioconda -c conda-forge abyss
    
  5. Install NGS QC Toolkit & Velvet:

    mkdir Pre-requisite_tools
    cd Pre-requisite_tools
    wget https://github.com/mjain-lab/NGSQCToolkit/archive/refs/tags/v2.3.tar.gz
    tar -xf v2.3.tar.gz
    
    wget https://github.com/dzerbino/velvet/archive/refs/tags/v1.2.10.tar.gz
    tar -xf v1.2.10.tar.gz
    cd velvet-1.2.10
    make 'MAXKMERLENGTH=155'
    cd ../../
    
  6. Install BUSCO:

    conda install -c conda-forge mamba
    mamba install -c conda-forge -c bioconda busco=5.7.1
    



Workflow

Quality Assessment

The first step evaluates the quality of the raw reads using FastQC. The FastQC summary statistics cover several quality metrics and help identify potential issues in the sequencing data. Trimmomatic is then used to trim low-quality read ends and discard low-quality reads.

chmod +x preprocessing.sh
./preprocessing.sh

De-novo Whole Genome Assembly

Following the quality assessment, we perform de novo whole-genome assembly with two tools, ABySS and Velvet, running each at 8 distinct k-mer sizes. This makes it possible to compare the assembly outputs and observe how effectively each k-mer size reconstructs the genome.

chmod +x abyss_velvet_denovo_assembly.sh
./abyss_velvet_denovo_assembly.sh
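
The k-mer sizes used by the script are not listed in this README. As an illustration only, the sketch below picks a set of evenly spaced odd k values within a range. The function name `candidate_kmers` and the 21 to 155 range are assumptions made for this example; the 155 ceiling mirrors the `MAXKMERLENGTH=155` value used when building Velvet above, and odd values are chosen because Velvet expects an odd hash length.

```python
def candidate_kmers(k_min, k_max, count, max_kmer=155):
    """Return `count` evenly spaced odd k-mer sizes between k_min and k_max.

    max_kmer caps the range; 155 matches the MAXKMERLENGTH used in the
    Velvet build step. The range itself is a hypothetical choice.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    upper = min(k_max, max_kmer)
    odd_values = [k for k in range(k_min, upper + 1) if k % 2 == 1]
    if len(odd_values) < count:
        raise ValueError("range does not contain enough odd k values")
    if count == 1:
        return [odd_values[0]]
    last = len(odd_values) - 1
    # Integer arithmetic keeps the chosen indices strictly increasing.
    return [odd_values[i * last // (count - 1)] for i in range(count)]


# Hypothetical range; the repository's script may use different values.
print(candidate_kmers(21, 155, 8))
```

Each k must also be shorter than the reads being assembled, which this sketch does not check.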

Analysis and Comparison

In the analysis, we compare several output features of the assembled scaffolds at each k-mer length: total assembled sequences, total assembled bases, N50, N90, L50, average contig length, minimum contig length, maximum contig length, and number of N nucleotides.

python data_analysis.py
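
For readers unfamiliar with the statistics listed above, the following is a minimal sketch of how N50, L50, and N90 can be computed from a list of contig lengths. It is not the code in `data_analysis.py`; the function names `nx_lx` and `assembly_summary` are introduced here for illustration, and the lengths are assumed to have been extracted from the assembly FASTA already.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction of the total assembly length.

    Nx is the length of the contig at which the cumulative length of the
    longest contigs first reaches `fraction` of the total; Lx is how many
    contigs were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must contain at least one contig")


def assembly_summary(lengths):
    """Collect the basic per-assembly statistics into a dictionary."""
    n50, l50 = nx_lx(lengths, 0.5)
    n90, _ = nx_lx(lengths, 0.9)
    return {
        "total_sequences": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50,
        "l50": l50,
        "n90": n90,
        "mean_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }
```

A higher N50 together with a lower L50 indicates that fewer, longer contigs cover half of the assembly.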

Furthermore, BUSCO analysis plots are generated to assess the completeness of the assembled genome. The plots are saved in the plots directory.

chmod +x busco_run.sh
./busco_run.sh

Results

The results from this study, including quality assessment graphs and assembly outputs, are generated in the plots directory. For inference, we consider the N50 and L50 statistics first. We then check the total assembled bases and the BUSCO scores to assess the completeness of the assembly. Other statistics, such as N90, maximum length, and average length, are used to confirm the assembly quality. We infer from the plots that a k-mer length of 95 gave the most complete assembly for the given dataset.
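
The inference order described above can be expressed as a sort key. The sketch below is one possible reading of it, not the procedure used to reach the conclusion above: the function name `rank_assemblies`, the dictionary keys, and the choice to score total bases by closeness to the 5 Mb genome size estimate are all assumptions made for illustration.

```python
def rank_assemblies(stats, genome_size=5_000_000):
    """Order k-mer sizes from best to worst assembly.

    `stats` maps each k-mer size to a dict with the hypothetical keys
    "n50", "l50", "total_bases", and "busco_complete".
    Ranking: higher N50, then lower L50, then total bases closest to the
    estimated genome size, then higher BUSCO completeness.
    """
    def sort_key(item):
        _, s = item
        return (
            -s["n50"],
            s["l50"],
            abs(s["total_bases"] - genome_size),
            -s["busco_complete"],
        )

    return [k for k, _ in sorted(stats.items(), key=sort_key)]
```

A strict ordering like this only breaks ties with the later criteria, so inspecting the plots remains necessary when the statistics disagree.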


Contact

For any questions or further information, please contact Kaushal Grover at kausha87_sit@jnu.ac.in.


License

This project is licensed under the MIT License - see the LICENSE file for details.
