RNAseq Workflow: Mapping, Assembly, and Differential Gene Expression Analysis

Overview

This project uses an RNAseq workflow to generate count data and identify differentially expressed genes from sequencing reads. The reads are mapped to a reference genome. The workflow consists of the following steps:

Fastq files downloaded using SRA-Toolkit
Quality Assessment with FASTQC
High Quality Read Filtering using FastP
Reference Genome Mapping using HiSAT2
Assembly using StringTie
Counts data generated using Cufflinks
Differentially Expressed Genes calculated using CuffDiff

Fig: Flowchart of the workflow followed in my project

Datasets

This project uses transcriptomic analysis to examine the salinity stress response of a salinity-tolerant chickpea genotype by comparing control and saline environments. The dataset includes RNAseq reads from two control samples and two saline samples.

BioProject: PRJNA842022
SRA Study: SRP376874
Run ID: SRR19383303 (Control Sample 1)
Run ID: SRR19383302 (Control Sample 2)
Run ID: SRR19383301 (Saline Sample 1)
Run ID: SRR19383300 (Saline Sample 2)
Year of Experiment: 2022
Instrument: Illumina NextSeq 500
Layout: Single
Organism: Cicer arietinum (Chickpea)
Total Bases per Sample: ~510 Mb (Million Bases)
No. of reads per Sample: ~10.4 Million reads
Estimated Genome Size: 500 Mb
Estimated Transcriptome Size: 77 Mb
Estimated Transcriptome Coverage per Sample: 510/77 = 6.6X
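
The coverage estimate above is total sequenced bases divided by the estimated transcriptome size. A small sketch that reproduces the arithmetic is shown here; the function name `coverage` is illustrative, and both inputs are the approximate per-sample values listed above, in Mb.

```shell
# Estimated coverage = total sequenced bases / estimated transcriptome size.
# Both arguments are in Mb; the result is printed with one decimal place.
coverage() {
    awk -v bases="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bases / size }'
}

coverage 510 77   # approximate per-sample values from the list above
```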

System Requirements

Python 3
Conda

Installing Dependencies

Create a conda environment and activate it:

```
conda create -n grover
conda activate grover
```

Install FastQC:
```
sudo apt -y install fastqc
```
Install Fastp:
```
conda install -c bioconda fastp
```

Install ncbi_datasets:

```
conda install -c conda-forge ncbi-datasets-cli
```

Install SRA-Toolkit:

```
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.1.1/sratoolkit.3.1.1-ubuntu64.tar.gz
tar -xf sratoolkit.3.1.1-ubuntu64.tar.gz
rm sratoolkit.3.1.1-ubuntu64.tar.gz
cd sratoolkit.3.1.1-ubuntu64/bin/
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bashrc
source ~/.bashrc
cd ../..   # two levels up, back to the starting directory
```

Install HiSAT2:

```
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
make
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bashrc
source ~/.bashrc
cd ..
```

Install SamTools:

```
wget https://github.com/samtools/samtools/releases/download/1.20/samtools-1.20.tar.bz2
tar -xf samtools-1.20.tar.bz2
rm samtools-1.20.tar.bz2
sudo apt-get install zlib1g-dev libncurses5-dev libncursesw5-dev liblzma-dev libbz2-dev libcurl4-openssl-dev
cd samtools-1.20/
make
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bashrc
source ~/.bashrc
cd ..
```

Install StringTie:

```
git clone https://github.com/gpertea/stringtie
cd stringtie
make release
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bashrc
source ~/.bashrc
cd ..
```

Install CuffLinks:

```
wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
tar -xf cufflinks-2.2.1.Linux_x86_64.tar.gz
rm cufflinks-2.2.1.Linux_x86_64.tar.gz
cd cufflinks-2.2.1.Linux_x86_64
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bashrc
source ~/.bashrc
cd ..
```
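
Once everything is installed, it can help to confirm that each tool is actually reachable from the shell. The sketch below is one possible check, not part of the repository scripts; the function name `check_tools` is illustrative, and the command names assume the default executables shipped by each package.

```shell
# Report whether each command is on PATH; return non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools fastqc fastp datasets prefetch fasterq-dump hisat2 samtools stringtie cuffdiff \
    || echo "Install the missing tools (or fix PATH) before continuing."
```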

Workflow
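
The first step listed in the Overview, downloading the FASTQ files with SRA-Toolkit, might look like the sketch below. It uses the four run IDs from the Datasets section; the `fastq` output directory and the `run` dry-run helper are assumptions made for illustration. By default the commands are only printed, so they can be inspected before anything is downloaded.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; set DRY_RUN=0 to execute them
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Run accessions from the Datasets section (two control, two saline).
RUNS="SRR19383303 SRR19383302 SRR19383301 SRR19383300"
FASTQ_DIR="fastq"         # assumed output directory

# Fetch one run and convert it to FASTQ (single-end layout, so one file per run).
download_run() {
    run prefetch "$1"
    run fasterq-dump "$1" --outdir "$FASTQ_DIR"
}

for srr in $RUNS; do
    download_run "$srr"
done
```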

Quality Assessment

The first step evaluates the quality of each sample with FastQC. Its summary statistics cover several quality metrics and help identify potential issues in the sequencing data. Fastp was then used to remove duplicated reads, trim low-quality ends, discard low-quality reads, trim adapter sequences, remove low-complexity sequences, and trim poly-G tails. After filtering for high-quality (HQ) reads, the FastQC reports were generated again.

```
chmod +x Fastp_and_FastQC.sh
./Fastp_and_FastQC.sh
```
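
The contents of Fastp_and_FastQC.sh are not reproduced here. As a rough sketch of what this step involves for one sample, the commands below run FastQC, filter with fastp, and run FastQC again. The directory names, the `_HQ` suffix, and the `run` dry-run helper are assumptions; the fastp options are chosen to match the filters described above and may differ from the ones the script actually uses.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; set DRY_RUN=0 to execute them
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# QC one single-end FASTQ, filter it with fastp, then QC the filtered reads.
# Arguments: input FASTQ, output directory (both hypothetical names).
filter_and_qc() {
    in_fq="$1"
    out_dir="$2"
    sample=$(basename "$in_fq" .fastq)
    run mkdir -p "$out_dir"
    run fastqc "$in_fq" -o "$out_dir"
    # Adapter trimming and low-quality read filtering are fastp defaults;
    # the flags below add deduplication, end trimming, the low-complexity
    # filter, and poly-G tail trimming.
    run fastp -i "$in_fq" -o "$out_dir/${sample}_HQ.fastq" \
        --dedup --cut_front --cut_tail --low_complexity_filter --trim_poly_g \
        -j "$out_dir/${sample}.json" -h "$out_dir/${sample}.html"
    run fastqc "$out_dir/${sample}_HQ.fastq" -o "$out_dir"
}

for srr in SRR19383303 SRR19383302 SRR19383301 SRR19383300; do
    filter_and_qc "fastq/$srr.fastq" qc
done
```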

Mapping

Following the quality assessment, the reads are mapped to a reference genome using HiSAT2. First, the reference genome FASTA, GTF, and GFF files were downloaded. HiSAT2 was then used to generate 4 SAM files, one from each of the 4 SRR FASTQ files.

```
chmod +x mapping.sh
./mapping.sh
```
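
mapping.sh itself is not shown here. A minimal sketch of the indexing and alignment commands, assuming the reference FASTA has already been downloaded, could look like this. All paths, the index prefix, and the `run` dry-run helper are hypothetical, and the `--dta` option (which tailors alignments for downstream transcript assemblers such as StringTie) is a suggestion rather than a confirmed setting of the script.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; set DRY_RUN=0 to execute them
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

REF_FASTA="reference/genome.fna"    # hypothetical path to the reference FASTA
INDEX="reference/chickpea_index"    # hypothetical HISAT2 index prefix

# Build the HISAT2 index once from the reference FASTA.
build_index() {
    run hisat2-build "$REF_FASTA" "$INDEX"
}

# Align one single-end FASTQ ($1) and write a SAM file ($2).
map_run() {
    run hisat2 --dta -x "$INDEX" -U "$1" -S "$2"
}

build_index
run mkdir -p mapping
for srr in SRR19383303 SRR19383302 SRR19383301 SRR19383300; do
    map_run "qc/${srr}_HQ.fastq" "mapping/$srr.sam"
done
```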

Assembly

First, the 4 SAM files were sorted and compressed into 4 BAM (binary) files using SamTools. Next, each of the 4 BAM files was assembled individually using StringTie, with the reference GFF file given as input, producing 4 assembled GTF files. These 4 GTF files were then merged into a single "merged.gtf" file.

```
chmod +x assembly.sh
./assembly.sh
```
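
assembly.sh is not reproduced here. The three operations described above (sort to BAM, assemble per sample, merge) can be sketched as follows. The directory layout, the reference GFF path, and the `run` dry-run helper are assumptions made for illustration.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; set DRY_RUN=0 to execute them
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

REF_GFF="reference/genomic.gff"   # hypothetical path to the reference annotation

# Sort one SAM file into a BAM file, then assemble transcripts guided by the GFF.
assemble_run() {
    run samtools sort -o "mapping/$1.bam" "mapping/$1.sam"
    run stringtie "mapping/$1.bam" -G "$REF_GFF" -o "assembly/$1.gtf"
}

run mkdir -p assembly
set --                    # collect the per-sample GTF paths as positional arguments
for srr in SRR19383303 SRR19383302 SRR19383301 SRR19383300; do
    assemble_run "$srr"
    set -- "$@" "assembly/$srr.gtf"
done

# Merge the four per-sample assemblies into one annotation.
run stringtie --merge -G "$REF_GFF" -o assembly/merged.gtf "$@"
```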

Differentially Expressed Genes

Finally, differentially expressed genes (DEGs) were calculated between the 2 conditions, control and saline stress, each with 2 samples. This step uses the 4 sorted BAM files produced from the HiSAT2 alignments and the merged.gtf file generated by StringTie. The CuffDiff tool from the Cufflinks package calculates the DEGs, and the results are stored in the "gene_exp.diff" file.

```
chmod +x deg.sh
./deg.sh
```
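
deg.sh is not shown here. As a sketch under assumed paths, the CuffDiff call takes the merged annotation followed by one comma-separated list of replicate BAM files per condition. The second part is an optional helper (the name `significant_genes` is illustrative) that keeps only the rows CuffDiff flags as significant in the last column of gene_exp.diff.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; set DRY_RUN=0 to execute them
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Control replicates first, saline replicates second (run IDs from Datasets).
# Output directory "deg" and input paths are hypothetical.
run cuffdiff -o deg -L control,saline assembly/merged.gtf \
    mapping/SRR19383303.bam,mapping/SRR19383302.bam \
    mapping/SRR19383301.bam,mapping/SRR19383300.bam

# Print the header plus the rows whose final "significant" column is "yes"
# from a tab-separated gene_exp.diff file.
significant_genes() {
    awk -F '\t' 'NR == 1 || $NF == "yes"' "$1"
}

if [ -f deg/gene_exp.diff ]; then
    significant_genes deg/gene_exp.diff > deg/significant_genes.tsv
fi
```

With two replicates per condition, statistical power is limited, so the filtered list is best treated as a starting point for further inspection.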

Contact

For any questions or further information, please contact Kaushal Grover at kausha87_sit@jnu.ac.in.

License

This project is licensed under the MIT License - see the LICENSE file for details.
