IARCbioinfo/gatk4-DataPreProcessing-nf

gatk4-DataPreProcessing-nf

Nextflow pipeline for pre-processing BAM file(s) with hg38 and GATK4, following GATK Best Practices.

Description

This pipeline is tailored to re-analyzing existing BAM files under the new GATK4 Best Practices, using hg38 versions of all reference databases.

Dependencies

  1. This pipeline is based on Nextflow. Because we maintain several Nextflow pipelines, the common information is centralized in the IARC-nf repository. Please read it carefully: it contains essential information on the installation, basic usage and configuration of Nextflow and our pipelines.
  2. GATK4 executables
  3. Picard tools
  4. BWA, especially BWAKIT, because a post-alignment treatment is required (more info).
  5. Sambamba.
  6. Qualimap binary in your PATH (for a nice QC per BAM).
  7. References (genome in fasta, dbSNP vcf, 1000 Genomes vcf, Mills and 1000 Genomes Gold Standard vcf), available in GATK Bundle.
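
As a quick sanity check before launching, the command-line dependencies listed above can be probed with a small shell helper. This is only a sketch: the function name `missing_tools` is hypothetical, and the executable names you pass to it (for example `nextflow`, `bwa`, `sambamba`, `qualimap`) are assumptions that depend on how each tool is installed on your system.

```shell
# Print every executable from the argument list that is not found in PATH.
# Returns 0 when all are present, 1 when at least one is missing.
# (Hypothetical helper, not part of the pipeline.)
missing_tools() {
    _missing_rc=0
    for _tool in "$@"; do
        if ! command -v "$_tool" >/dev/null 2>&1; then
            printf '%s\n' "$_tool"
            _missing_rc=1
        fi
    done
    return "$_missing_rc"
}
```

For instance, `missing_tools nextflow bwa sambamba qualimap || echo "install the tools listed above"` would list whichever of those names cannot be resolved. GATK4 is passed to the pipeline by full path (--gatk_exec), so it does not need to be in PATH.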

IMPORTANT note about post-alignment: according to this post, BWA has an implicit alt-aware mode. For the postalt.js step to behave as expected, the <name_of_ref>.fasta.alt file must also be present in the FASTA reference folder.
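
The requirement above can be verified before a run with a one-line test. The sketch below assumes the naming described in the note (the reference path with `.alt` appended); the function name `has_alt_file` is hypothetical.

```shell
# Succeed only if "<reference>.alt" exists next to the reference FASTA.
# (Hypothetical helper; $1 is the path given to --ref_fasta.)
has_alt_file() {
    _ref_fasta=$1
    [ -f "${_ref_fasta}.alt" ]
}
```

For example, `has_alt_file /ref/Homo_sapiens_assembly38.fasta || echo "missing .alt file"` would warn when the file is absent.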

Input

  • --input : your input BAM file(s) (do not forget the quotes for multiple BAM files, e.g. --input "test_*.bam")
  • --output_dir : the folder that will contain your aligned, recalibrated, analysis-ready BAM file(s).
  • --ref_fasta : your reference in FASTA.
  • --dbsnp : dbSNP VCF file.
  • --onekg : 1000 Genomes High Confidence SNV VCF file.
  • --mills : Mills and 1000 Genomes Gold Standard indels VCF file.
  • --gatk_exec : the full path to your GATK4 binary file.
  • --interval_list : a file for the intervals to call on. More information on interval_list format.
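
If your targets are only available as a BED file, the sketch below illustrates how the two coordinate systems differ. It assumes the Picard-style interval_list layout (a SAM-style header followed by tab-separated sequence, start, end, strand and name columns, with 1-based inclusive coordinates) and converts only the body lines; the header must still come from the reference sequence dictionary. The function name `bed_to_interval_body` is hypothetical, and in practice Picard's BedToIntervalList tool is the more robust route.

```shell
# Convert BED records (0-based, half-open) to interval_list body lines
# (1-based, inclusive). The header is NOT produced here.
# (Hypothetical helper; reads the files given as arguments, or stdin.)
bed_to_interval_body() {
    awk 'BEGIN { FS = OFS = "\t" }
         !/^#/ && NF >= 3 {
             # Use the BED name and strand columns when present.
             name = (NF >= 4 && $4 != "") ? $4 : ("interval_" NR)
             strand = (NF >= 6 && ($6 == "+" || $6 == "-")) ? $6 : "+"
             # Only the start coordinate shifts by one.
             print $1, $2 + 1, $3, strand, name
         }' "$@"
}
```

A BED record `chr1 99 200 exon1` therefore becomes `chr1 100 200 + exon1`.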

A nextflow.config is also included; please modify it to suit your environment if you run the pipeline outside our pre-configured clusters (see Nextflow configuration).

Usage for Cobalt cluster

nextflow run iarcbioinfo/gatk4-DataPreProcessing-nf -profile cobalt \
  --input "/data/test_*.bam" \
  --output_dir /data/myRecalBAMs \
  --ref_fasta /ref/Homo_sapiens_assembly38.fasta \
  --gatk_exec /bin/gatk-4.0.6.0/gatk \
  --dbsnp /ref/dbsnp_146.hg38.vcf.gz \
  --onekg /ref/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  --mills Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --interval_list Exome.interval_list
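
The quotes around the --input pattern matter because an unquoted glob is expanded by the shell before Nextflow sees it, so --input would receive only the first matching file and the remaining paths would become stray arguments. The sketch below makes the difference visible; `count_args` is a hypothetical helper that simply reports how many arguments it was given.

```shell
# Print the number of arguments received.
# (Hypothetical helper used only to illustrate shell glob expansion.)
count_args() {
    printf '%s\n' "$#"
}
```

In a directory containing test_1.bam and test_2.bam, `count_args "test_*.bam"` reports 1 (the pattern is passed through intact), whereas `count_args test_*.bam` reports 2 (the shell has already expanded it).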
