BIgFOOT: Biomarkers of Immunovariation via Graph FOOTprinting

Current version: 0.0.1

This workflow infers alleles, calls novel variation, and constructs sample-specific sequence variation graphs for immunoglobulin(Ig)/other immune-related loci which can be used to perform genetic association tests. The workflow inolves a combination of various graph-construction steps, sequence-to-graph alignment, flow graph decomposition, and unitig calling.
I hope to expand this workflow to enable genome-to-genome analyses/assessing genetic associations between host germline immunovariation and pathogen/metagenomic genetic variation/diversity (i.e., searching for immunological FOOTprints using joint host-pathogen genomic data).

Genetic loci where BIgFOOT performs accurate allele calling:

IGH
IGL
HLA (DQA1/DQB1/... more to come)

Infers alleles - but, like bigoot, I have no evidence they're real (WiP):

IGK
TR
KIR

Input:

Raw fastq(.gz)
BAM/CRAM alignment Note: you'll need ~65GB of RAM to sucessfully perform sequence-to-graph alignment against the full genome immunovariation graph

Set up conda environment

BIgFOOT is heavily influenced/relies on methods developed for VG-Flow (v0.0.4).

Clone me! git clone https://github.com/dduchen/BIgFOOT.git
set up conda/mamba environment we'll be needing -- can move some of these after the '#' if they're already in your path (e.g., samtools, we assume you have R)
mamba create --name bigfoot -c bioconda -c conda-forge -c gurobi python=3 fastp graph-tool bazam minimap2 gurobi biopython numpy odgi gfaffix seqkit bbmap minimap2 seqwish blend-bio wfmash samtools pyseer unitig-caller parallel #fastq-dl kmc r-base cd-hit conda activate bigfoot
Ensure you have an active gurobi licence:
gurobi_cl
We also use the following R/bioconductor packages:

data.table;
dplyr;
Biostrings/DECIPHER

We also use some external tools which need to be accessible in your PATH
tools_dir=~/tools; # (wherever you normally install+store software)
PATH=$PATH:${tools_dir};
cd ${tools_dir};

Download BIgFOOT graph materials from zenodo

bigfoot_source=${tools_dir}/bigfoot # where are we storing all of the reference graph files?
mkdir -p ${bigfoot_source}
wget -P ${bigfoot_source} "https://zenodo.org/records/10869771/files/immunovar_graph_materials.tar.gz?download=1"
cd ${bigfoot_source} ; tar -xvf ${bigfoot_source}/immunovar_graph_materials.tar.gz* --keep-newer-files
Make distance indexes read only
chmod 0444 *.dist
We also need the variation graph toolkit (VG) executable

wget -P ${tools_dir}/ https://github.com/vgteam/vg/releases/download/v1.56.0/vg; chmod +x ${tools_dir}/vg PATH=${tools_dir}:$PATH

We use Ryan Wick's Assembly-dereplicator package during haplotype selection Assembly-dereplicator.

git clone https://github.com/dduchen/Assembly-Dereplicator.git ${tools_dir}/Assembly-dereplicator
We provide the option of using merged paired-end reads from NGmerge for alignment/inference (optional, not always recommended) NGmerge.
git clone https://github.com/dduchen/NGmerge.git ${tools_dir}/NGmerge

Running bigfoot - Example using sequencing/alignment files from ISGR: NA19240

Yoruba in Ibadan, Nigeria, African Ancestry

Set up example directory, download relevant files, and then run BIgFOOT pipeline

conda activate bigfoot bigfoot_dir=${bigfoot_source}/scripts (Change this if you've downloaded the github repo somewhere else/have the bigfoot analysis scripts saved elsewere)
bigfoot_dir=${tools_dir}/BIgFOOT/scripts ; immunovar_bed=${bigfoot_source}/grch38_custom_immunovar_coords.bed test_dir=${bigfoot_source}/example/ ; mkdir -p ${test_dir}; cd ${test_dir}

Starting from raw reads (WES)

Illumina chemistry: V2, Array: Agilent Sure Select Whole exome capture 50 Mb

#fastq-dl -a SRR507323 -o ${test_dir}/ wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_1.fastq.gz wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_2.fastq.gz

export sample="SRR507323" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} merged="FALSE" graph="wg_immunovar" valid_alleles=true ################################################################ . ${bigfoot_dir}/preprocess_wg_immunovar_alignment.sh ################################################################

Starting from BAM/CRAM (WGS)

wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989410/NA19240.final.cram wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989410/NA19240.final.cram tools_dir=${tools_dir} PATH=${tools_dir}:$PATH

export bam_file="NA19240.final.cram" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} ref_build="grch38" ref="${bigfoot_source}/GRCh38_full_analysis_set_plus_decoy_hla.fa" tools_dir=${tools_dir} PATH=${tools_dir}:$PATH merged="FALSE" graph="wg_immunovar" valid_alleles=true ################################################################ . ${bigfoot_dir}/process_from_bam_wg_immunovar_alignment.sh > ${bam_file%.cram}.log ################################################################
Support for CHM13-based BAM/CRAM is planned

Starting from subset of reads, some manual pre-processing

graphdir=${bigfoot_source};graph="wg_immunovar";graph_base=${graphdir}/whole_genome_ig_hla_kir_immunovar;immune_graph=${graph_base}".subgraph"; bazam_reads=${i}; sample_id=${bazam_reads%.bazam.fastq.gz};sample_id=${sample_id##*/};
Sequence-to-graph alignment using VG-giraffe
vg giraffe -i -f ${bazam_reads} -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.bazam.grch38.wg.gam
vg giraffe -f ${sample_id}.unmapped.fastq.gz -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.unmapped.grch38.wg.gam cat ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam > ${sample_id}.bazam.grch38.combined.gam
Ready for BIgFOOT
- export i=${sample_id}.bazam.grch38.combined.gam workdir=${PWD} graph=${graph} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} valid_alleles=true ################################################################ . ${bigfoot_dir}/filter_immune_subgraph.sh ################################################################

This is still very much a work in progress - many parameters/options exist but have not been fully documented here - for example, to limit inference to the IGenotyper set of alleles, you can set valid_alleles=igenotyper and there are options to skip allelic inference altogether if you're only interested in the reads/unitigs obtained from the immunovariation subgraph.

Please reach out if you feel this tool might be useful in your work - or if you'd like some added functionality - open an issue or email

Name		Name	Last commit message	Last commit date
Latest commit History 111 Commits
custom_beds		custom_beds
scripts		scripts
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
gurobi.log		gurobi.log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BIgFOOT: Biomarkers of Immunovariation via Graph FOOTprinting

Current version: 0.0.1

Input:

Set up conda environment

Download BIgFOOT graph materials from zenodo

Running bigfoot - Example using sequencing/alignment files from ISGR: NA19240

Yoruba in Ibadan, Nigeria, African Ancestry

Starting from raw reads (WES)

Starting from BAM/CRAM (WGS)

Starting from subset of reads, some manual pre-processing

License

dduchen/BIgFOOT

Folders and files

Latest commit

History

Repository files navigation

BIgFOOT: Biomarkers of Immunovariation via Graph FOOTprinting

Current version: 0.0.1

Input:

Set up conda environment

Download BIgFOOT graph materials from zenodo

Running bigfoot - Example using sequencing/alignment files from ISGR: NA19240

Yoruba in Ibadan, Nigeria, African Ancestry

Starting from raw reads (WES)

Starting from BAM/CRAM (WGS)

Starting from subset of reads, some manual pre-processing