A computational method to perform immunoglobulin allele-level genotyping and identification of sample-specific variation via flow variation graphs.

dduchen/BIgFOOT

BIgFOOT: Biomarkers of Immunovariation via Graph FOOTprinting

Current version: 0.0.1

This workflow infers alleles, calls novel variation, and constructs sample-specific sequence variation graphs for immunoglobulin (Ig) and other immune-related loci, which can be used to perform genetic association tests. The workflow involves a combination of graph-construction steps, sequence-to-graph alignment, flow graph decomposition, and unitig calling.
I hope to expand this workflow to enable genome-to-genome analyses, i.e., assessing genetic associations between host germline immunovariation and pathogen/metagenomic genetic variation/diversity (searching for immunological FOOTprints using joint host-pathogen genomic data).

Genetic loci where BIgFOOT performs accurate allele calling:

  • IGH
  • IGL
  • HLA (DQA1/DQB1/... more to come)

Infers alleles - but, like bigfoot, I have no evidence they're real (WiP):

  • IGK
  • TR
  • KIR

Input:

  • Raw fastq(.gz)
  • BAM/CRAM alignment

Note: you'll need ~65GB of RAM to successfully perform sequence-to-graph alignment against the full genome immunovariation graph.
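Since the alignment step is memory-hungry, it can help to check the machine before starting. The snippet below is a minimal sketch, not part of BIgFOOT: `has_enough_ram` is a hypothetical helper, it assumes a Linux system where `/proc/meminfo` is available, and it compares total (not free) memory against the ~65GB figure from the note above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BIgFOOT): succeed if a memory size given
# in kB is at least the given number of GB (1 GB treated as 1024*1024 kB).
has_enough_ram() {
    local mem_kb=$1 min_gb=$2
    [ "$mem_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# On Linux, /proc/meminfo reports MemTotal in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if has_enough_ram "$total_kb" 65; then
        echo "OK: ${total_kb} kB of RAM detected"
    else
        echo "Warning: less than 65 GB of RAM detected (${total_kb} kB)" >&2
    fi
fi
```

On a shared cluster the total memory of the node may not be what your job is allowed to use, so treat this only as a first sanity check.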

Set up conda environment

BIgFOOT is heavily influenced by, and relies on, methods developed for VG-Flow (v0.0.4).

  1. Clone me! git clone https://github.com/dduchen/BIgFOOT.git
  2. Set up the conda/mamba environment we'll be needing -- you can move some of these packages after the '#' if they're already in your PATH (e.g., samtools; we assume you have R)
    mamba create --name bigfoot -c bioconda -c conda-forge -c gurobi python=3 fastp graph-tool bazam minimap2 gurobi biopython numpy odgi gfaffix seqkit bbmap seqwish blend-bio wfmash samtools pyseer unitig-caller parallel #fastq-dl kmc r-base cd-hit
    conda activate bigfoot
    Ensure you have an active gurobi licence:
    gurobi_cl
    We also use the following R/bioconductor packages:
  • data.table;
  • dplyr;
  • Biostrings/DECIPHER
  3. We also use some external tools which need to be accessible in your PATH
    tools_dir=~/tools; # (wherever you normally install+store software)
    PATH=$PATH:${tools_dir};
    cd ${tools_dir};
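
Before moving on, it may be worth confirming that the tools are actually visible on your PATH. This is a minimal sketch: `missing_tools` is a hypothetical helper, and the tool names in the example call are just a few of those mentioned in the setup steps above (`Rscript` stands in for "we assume you have R"); adjust the list to your installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BIgFOOT): print each named tool that
# cannot be found on PATH, one per line. Prints nothing if all are found.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call; the tool list is an assumption based on the setup steps above.
missing=$(missing_tools samtools minimap2 odgi seqkit fastp gurobi_cl Rscript)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
else
    echo "All listed tools found"
fi
```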

Download BIgFOOT graph materials from Zenodo

bigfoot_source=${tools_dir}/bigfoot # where are we storing all of the reference graph files?
mkdir -p ${bigfoot_source}
wget -P ${bigfoot_source} "https://zenodo.org/records/10869771/files/immunovar_graph_materials.tar.gz?download=1"
cd ${bigfoot_source} ; tar -xvf ${bigfoot_source}/immunovar_graph_materials.tar.gz* --keep-newer-files
# Make distance indexes read only
chmod 0444 *.dist
We also need the variation graph toolkit (VG) executable

We use Ryan Wick's Assembly-dereplicator package during haplotype selection.
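
After unpacking, a quick check that the graph indexes are in place can save a failed alignment later. This is a sketch under assumptions: `missing_graph_files` is a hypothetical helper, the four extensions are the ones passed to `vg giraffe` further down in this README, and the graph prefix `whole_genome_ig_hla_kir_immunovar` is taken from the manual pre-processing example; the archive layout itself is not verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BIgFOOT): list the index files that are
# absent for a given graph prefix. Prints nothing if all four exist.
missing_graph_files() {
    local graph_base=$1 ext
    for ext in xg gbwt dist min; do
        [ -e "${graph_base}.${ext}" ] || echo "${graph_base}.${ext}"
    done
}

# Default location is an assumption; it mirrors the bigfoot_source used above.
bigfoot_source=${bigfoot_source:-${HOME}/tools/bigfoot}
graph_base=${bigfoot_source}/whole_genome_ig_hla_kir_immunovar
missing_graph_files "${graph_base}"
```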

Running BIgFOOT - Example using sequencing/alignment files from IGSR: NA19240

Yoruba in Ibadan, Nigeria, African Ancestry

Set up example directory, download relevant files, and then run BIgFOOT pipeline

  • conda activate bigfoot
    bigfoot_dir=${bigfoot_source}/scripts
    (Change this if you've downloaded the github repo somewhere else/have the bigfoot analysis scripts saved elsewhere)
    bigfoot_dir=${tools_dir}/BIgFOOT/scripts ; immunovar_bed=${bigfoot_source}/grch38_custom_immunovar_coords.bed
    test_dir=${bigfoot_source}/example/ ; mkdir -p ${test_dir}; cd ${test_dir}

Starting from raw reads (WES)

Illumina chemistry: V2, Array: Agilent Sure Select Whole exome capture 50 Mb

  • #fastq-dl -a SRR507323 -o ${test_dir}/
    wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_1.fastq.gz
    wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_2.fastq.gz
  • export sample="SRR507323" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} merged="FALSE" graph="wg_immunovar" valid_alleles=true
    ################################################################
    . ${bigfoot_dir}/preprocess_wg_immunovar_alignment.sh
    ################################################################
Starting from BAM/CRAM (WGS)
  • wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989410/NA19240.final.cram
  • export bam_file="NA19240.final.cram" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} ref_build="grch38" ref="${bigfoot_source}/GRCh38_full_analysis_set_plus_decoy_hla.fa" tools_dir=${tools_dir} PATH=${tools_dir}:$PATH merged="FALSE" graph="wg_immunovar" valid_alleles=true
    ################################################################
    . ${bigfoot_dir}/process_from_bam_wg_immunovar_alignment.sh > ${bam_file%.cram}.log
    ################################################################
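
A small note on the log name: `${bam_file%.cram}` only strips a trailing `.cram`, so a `.bam` input would produce a log called `something.bam.log`. If you want the same log naming for both input types, something like the following could be used. This is a sketch only; `log_name` is a hypothetical helper and is not part of the BIgFOOT scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BIgFOOT): strip a trailing .cram or .bam
# from an alignment file name and append .log.
log_name() {
    local f=$1
    f=${f%.cram}
    f=${f%.bam}
    echo "${f}.log"
}

log_name "NA19240.final.cram"   # prints NA19240.final.log
log_name "NA19240.final.bam"    # prints NA19240.final.log
```

The redirect in the command above could then be written as `> "$(log_name "${bam_file}")"`.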

    Support for CHM13-based BAM/CRAM is planned
Starting from subset of reads, some manual pre-processing
  • graphdir=${bigfoot_source};graph="wg_immunovar";graph_base=${graphdir}/whole_genome_ig_hla_kir_immunovar;immune_graph=${graph_base}".subgraph";
    bazam_reads=${i}; # ${i} is assumed to be the path to your *.bazam.fastq.gz read file
    sample_id=${bazam_reads%.bazam.fastq.gz};sample_id=${sample_id##*/};

    Sequence-to-graph alignment using VG-giraffe
  • vg giraffe -i -f ${bazam_reads} -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.bazam.grch38.wg.gam
  • vg giraffe -f ${sample_id}.unmapped.fastq.gz -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.unmapped.grch38.wg.gam
    cat ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam > ${sample_id}.bazam.grch38.combined.gam
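
For reference, here is what the `sample_id` parameter expansions above do, plus a guard you might want before handing the combined GAM to BIgFOOT. Both functions are hypothetical helpers for illustration and are not part of the BIgFOOT scripts; the example path is made up.

```shell
#!/usr/bin/env bash
# Same two expansions as in the pre-processing step, wrapped in a function:
# drop the .bazam.fastq.gz suffix, then drop any leading directory.
sample_id_from_reads() {
    local id=${1%.bazam.fastq.gz}
    echo "${id##*/}"
}

# Succeed only if every file given exists and is non-empty.
all_nonempty() {
    local f
    for f in "$@"; do
        [ -s "$f" ] || return 1
    done
}

sample_id_from_reads "/data/run1/NA19240.bazam.fastq.gz"   # prints NA19240
```

A possible use is `all_nonempty ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam || echo "alignment output missing or empty" >&2` before running the `cat` step.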

    Ready for BIgFOOT
  • export i=${sample_id}.bazam.grch38.combined.gam workdir=${PWD} graph=${graph} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} valid_alleles=true
    ################################################################
    . ${bigfoot_dir}/filter_immune_subgraph.sh
    ################################################################

This is still very much a work in progress. Many parameters/options exist but have not been fully documented here. For example, to limit inference to the IGenotyper set of alleles, you can set valid_alleles=igenotyper. There are also options to skip allelic inference altogether if you're only interested in the reads/unitigs obtained from the immunovariation subgraph.

Please reach out if you feel this tool might be useful in your work - or if you'd like some added functionality - open an issue or email