This workflow infers alleles, calls novel variation, and constructs sample-specific sequence variation graphs for immunoglobulin(Ig)/other immune-related loci which can be used to perform genetic association tests. The workflow inolves a combination of various graph-construction steps, sequence-to-graph alignment, flow graph decomposition, and unitig calling.
I hope to expand this workflow to enable genome-to-genome analyses/assessing genetic associations between host germline immunovariation and pathogen/metagenomic genetic variation/diversity (i.e., searching for immunological FOOTprints using joint host-pathogen genomic data).
Genetic loci where BIgFOOT performs accurate allele calling:
- IGH
- IGL
- HLA (DQA1/DQB1/... more to come)
Infers alleles - but, like bigoot, I have no evidence they're real (WiP):
- IGK
- TR
- KIR
- Raw fastq(.gz)
- BAM/CRAM alignment Note: you'll need ~65GB of RAM to sucessfully perform sequence-to-graph alignment against the full genome immunovariation graph
BIgFOOT is heavily influenced/relies on methods developed for VG-Flow (v0.0.4).
- Clone me!
git clone https://github.com/dduchen/BIgFOOT.git
- set up conda/mamba environment we'll be needing -- can move some of these after the '#' if they're already in your path (e.g., samtools, we assume you have R)
mamba create --name bigfoot -c bioconda -c conda-forge -c gurobi python=3 fastp graph-tool bazam minimap2 gurobi biopython numpy odgi gfaffix seqkit bbmap minimap2 seqwish blend-bio wfmash samtools pyseer unitig-caller parallel #fastq-dl kmc r-base cd-hit conda activate bigfoot
Ensure you have an active gurobi licence:
gurobi_cl
We also use the following R/bioconductor packages:
- data.table;
- dplyr;
- Biostrings/DECIPHER
- We also use some external tools which need to be accessible in your PATH
tools_dir=~/tools;
# (wherever you normally install+store software)
PATH=$PATH:${tools_dir};
cd ${tools_dir};
bigfoot_source=${tools_dir}/bigfoot # where are we storing all of the reference graph files?
mkdir -p ${bigfoot_source}
wget -P ${bigfoot_source} "https://zenodo.org/records/10869771/files/immunovar_graph_materials.tar.gz?download=1"
cd ${bigfoot_source} ; tar -xvf ${bigfoot_source}/immunovar_graph_materials.tar.gz* --keep-newer-files
Make distance indexes read only
chmod 0444 *.dist
We also need the variation graph toolkit (VG) executable
wget -P ${tools_dir}/ https://github.com/vgteam/vg/releases/download/v1.56.0/vg; chmod +x ${tools_dir}/vg
PATH=${tools_dir}:$PATH
We use Ryan Wick's Assembly-dereplicator package during haplotype selection Assembly-dereplicator.
git clone https://github.com/dduchen/Assembly-Dereplicator.git ${tools_dir}/Assembly-dereplicator
We provide the option of using merged paired-end reads from NGmerge for alignment/inference (optional, not always recommended) NGmerge.git clone https://github.com/dduchen/NGmerge.git ${tools_dir}/NGmerge
Running bigfoot - Example using sequencing/alignment files from ISGR: NA19240
Set up example directory, download relevant files, and then run BIgFOOT pipeline
conda activate bigfoot
(Change this if you've downloaded the github repo somewhere else/have the bigfoot analysis scripts saved elsewere)
bigfoot_dir=${bigfoot_source}/scripts
bigfoot_dir=${tools_dir}/BIgFOOT/scripts ; immunovar_bed=${bigfoot_source}/grch38_custom_immunovar_coords.bed
test_dir=${bigfoot_source}/example/ ; mkdir -p ${test_dir}; cd ${test_dir}
Illumina chemistry: V2, Array: Agilent Sure Select Whole exome capture 50 Mb
#fastq-dl -a SRR507323 -o ${test_dir}/
wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_1.fastq.gz
wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_2.fastq.gzexport sample="SRR507323" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} merged="FALSE" graph="wg_immunovar" valid_alleles=true
################################################################ . ${bigfoot_dir}/preprocess_wg_immunovar_alignment.sh
################################################################
wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989410/NA19240.final.cram
wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989410/NA19240.final.cram tools_dir=${tools_dir} PATH=${tools_dir}:$PATHexport bam_file="NA19240.final.cram" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} ref_build="grch38" ref="${bigfoot_source}/GRCh38_full_analysis_set_plus_decoy_hla.fa" tools_dir=${tools_dir} PATH=${tools_dir}:$PATH merged="FALSE" graph="wg_immunovar" valid_alleles=true
################################################################ . ${bigfoot_dir}/process_from_bam_wg_immunovar_alignment.sh > ${bam_file%.cram}.log ################################################################
Support for CHM13-based BAM/CRAM is planned
graphdir=${bigfoot_source};graph="wg_immunovar";graph_base=${graphdir}/whole_genome_ig_hla_kir_immunovar;immune_graph=${graph_base}".subgraph";
bazam_reads=${i}; sample_id=${bazam_reads%.bazam.fastq.gz};sample_id=${sample_id##*/};
Sequence-to-graph alignment using VG-giraffevg giraffe -i -f ${bazam_reads} -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.bazam.grch38.wg.gam
vg giraffe -f ${sample_id}.unmapped.fastq.gz -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.unmapped.grch38.wg.gam
cat ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam > ${sample_id}.bazam.grch38.combined.gam
Ready for BIgFOOT
- export i=${sample_id}.bazam.grch38.combined.gam workdir=${PWD} graph=${graph} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} valid_alleles=true
################################################################ . ${bigfoot_dir}/filter_immune_subgraph.sh ################################################################
This is still very much a work in progress - many parameters/options exist but have not been fully documented here - for example, to limit inference to the IGenotyper set of alleles, you can set valid_alleles=igenotyper and there are options to skip allelic inference altogether if you're only interested in the reads/unitigs obtained from the immunovariation subgraph.
Please reach out if you feel this tool might be useful in your work - or if you'd like some added functionality - open an issue or email