Varying_kmer_effect_in_denovo_assembly

Overview

This project examines how the choice of k-mer length affects the output of two de novo genome assembly tools, ABySS and Velvet. The workflow consists of the following steps:

FASTQ read file download using SRA Toolkit
Quality assessment with FastQC
High-quality read filtering using Trimmomatic
De novo whole-genome assembly using ABySS and Velvet with different k-mer lengths
Comparative analysis of assembly quality using a Python script

Dataset

Dataset ID: SRX23809475
SRA Run ID: SRR28196086
Instrument: Illumina MiSeq
Layout: Paired
Organism: E. coli
Total Bases: 355.6 Mb
No. of reads: 752,285 read pairs
Estimated Genome Size: 5 Mb
Estimated Read Coverage: 355.6 Mb / 5 Mb ≈ 71X
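
The coverage estimate is the total number of sequenced bases divided by the estimated genome size. A minimal sketch of that arithmetic, using the figures listed in this section (the function name `estimate_coverage` is illustrative and is not part of the repository):

```python
def estimate_coverage(total_bases_mb: float, genome_size_mb: float) -> float:
    """Return the expected read depth: sequenced bases / genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_mb / genome_size_mb


# Values from the Dataset section: 355.6 Mb sequenced, ~5 Mb genome.
coverage = estimate_coverage(355.6, 5.0)
print(f"Estimated coverage: {coverage:.0f}X")
```

The genome size is itself an estimate, so the resulting depth should be read as approximate.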

System Requirements

Python 3
Perl 5.32.1
Conda

Installing Dependencies

Create a virtual environment and install the Python dependencies:

conda create -n grover python=3.9
conda activate grover
conda install matplotlib

Install Trimmomatic:
```
sudo apt-get install -y trimmomatic
```
Install FastQC:
```
sudo apt -y install fastqc
```

Install ABySS:

conda install -c bioconda -c conda-forge abyss

Install NGS QC Toolkit & Velvet:

mkdir Pre-requisite_tools
cd Pre-requisite_tools
wget https://github.com/mjain-lab/NGSQCToolkit/archive/refs/tags/v2.3.tar.gz
tar -xf v2.3.tar.gz

wget https://github.com/dzerbino/velvet/archive/refs/tags/v1.2.10.tar.gz
tar -xf v1.2.10.tar.gz
cd velvet-1.2.10
make 'MAXKMERLENGTH=155'
cd ../../

Install BUSCO:

conda install -c conda-forge mamba
mamba install -c conda-forge -c bioconda busco=5.7.1

Workflow

Quality Assessment

The first step evaluates the quality of the raw reads with FastQC. Its summary statistics cover several quality metrics and help identify potential issues in the sequencing data. Trimmomatic is then used to trim low-quality read ends and remove low-quality reads.

chmod +x preprocessing.sh
./preprocessing.sh

De-novo Whole Genome Assembly

Following the quality assessment, we perform de novo whole-genome assembly with two tools, ABySS and Velvet, each run with 8 distinct k-mer sizes. Comparing the resulting assemblies shows how effectively each k-mer size reconstructs the genome.

chmod +x abyss_velvet_denovo_assembly.sh
./abyss_velvet_denovo_assembly.sh
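
One reason the k-mer size matters: a read of length L contains L - k + 1 k-mers, so a larger k lowers the effective k-mer coverage, C_k = C * (L - k + 1) / L (the relation given in the Velvet manual). The sketch below tabulates this for a few k values. It is only an illustration: the function name is made up, the mean read length of about 236 bp is derived from the dataset totals before trimming (355.6 Mb / 1,504,570 reads), and the k values are hypothetical, since the ones actually used are set in abyss_velvet_denovo_assembly.sh.

```python
def kmer_coverage(read_coverage: float, read_length: int, k: int) -> float:
    """Effective k-mer coverage: C_k = C * (L - k + 1) / L."""
    if not 0 < k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return read_coverage * (read_length - k + 1) / read_length


# Assumed inputs: ~71X read coverage (Dataset section) and a mean read
# length of ~236 bp (estimated from the dataset totals, before trimming).
# The k values below are hypothetical, not the ones used by the scripts.
for k in (31, 63, 95, 127):
    depth = kmer_coverage(71.0, 236, k)
    print(f"k={k:>3}  k-mer coverage ~ {depth:.1f}X")
```

Longer k-mers resolve more repeats but leave fewer k-mers per read, which is why the best k depends on read length and depth.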

Analysis and Comparison

In the analysis, we compare output features of the assembled scaffolds for each k-mer length. The features compared are total assembled sequences, total assembled bases, N50, N90, L50, average contig length, minimum contig length, maximum contig length, and number of N nucleotides.

python data_analysis.py
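
For readers unfamiliar with the contiguity statistics named above, here is a minimal sketch of how N50, L50, and N90 can be computed from a list of contig lengths. It is not the code in data_analysis.py; the function name `nx_lx` and the toy lengths are assumptions made for illustration.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given contig lengths.

    Nx is the length of the contig at which the running total of
    lengths, sorted longest first, reaches `fraction` of the assembly.
    Lx is the number of contigs needed to get there.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    # Only reached if fraction > 1: fall back to the shortest contig.
    return ordered[-1], len(ordered)


contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths, total 300
n50, l50 = nx_lx(contigs, 0.5)
n90, l90 = nx_lx(contigs, 0.9)
print(f"N50={n50} L50={l50} N90={n90}")
```

A higher N50 together with a lower L50 indicates a more contiguous assembly, which is why these two statistics are a natural first comparison between k-mer sizes.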

Furthermore, BUSCO analysis plots are generated to assess the completeness of the assembled genome. The plots are saved in the plots directory.

chmod +x busco_run.sh
./busco_run.sh

Results

The results of this study, including quality assessment graphs and assembly outputs, are generated in the plots directory. For inference, we consider the N50 and L50 statistics first. We then check the total assembled bases and the BUSCO scores to assess the completeness of the assembly, and use other statistics such as N90, maximum length, and average length to confirm the assembly quality. From the plots, we infer that a k-mer length of 95 gave the most complete assembly for this dataset.

Contact

For any questions or further information, please contact Kaushal Grover at kausha87_sit@jnu.ac.in.

License

This project is licensed under the MIT License - see the LICENSE file for details.
