unsure if constructed models in GRCh38 are correct #50

Open
vifehe opened this issue Jun 26, 2024 · 0 comments

vifehe commented Jun 26, 2024

We intend to use gnomix to infer local ancestry for our data. Since our data is on HG38 (GRCh38), our first step is to train a model for HG38, following the demo script.

As we use phased data, we downloaded 1000G data from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/.

The sample list (sample.smap) is built from the sample map by dropping its first (header) line and keeping the first column:

$ tail -n +2 1000g.smap | cut -f 1 > sample.smap
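
For anyone without `tail`/`cut` at hand, the same step can be sketched in Python. This is a minimal equivalent, assuming the sample map is tab-separated with one header line; the function name `write_sample_ids` is hypothetical and not part of gnomix.

```python
def write_sample_ids(smap_path, out_path):
    """Mimic `tail -n +2 smap | cut -f 1 > out`.

    Skips the first line of the sample map and writes the first
    tab-separated field of every remaining line, one per line.
    Returns the number of sample IDs written.
    """
    written = 0
    with open(smap_path) as src, open(out_path, "w") as dst:
        next(src, None)  # drop the first (header) line, like `tail -n +2`
        for line in src:
            # like `cut -f 1`: a line without a tab is kept whole
            first_field = line.rstrip("\n").split("\t")[0]
            dst.write(first_field + "\n")
            written += 1
    return written

# Example call (file names taken from the commands in this post):
# write_sample_ids("1000g.smap", "sample.smap")
```

The resulting file holds one sample ID per line, which is the format `bcftools view -S` expects.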

and then we run (this is a test for chr1),

$ bcftools view -S sample.smap -o reference_1000g.vcf 1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

in order to get our reference.vcf file.

Now, we run,

$ gnomix.sh 01-1000G-PHASED/1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz /testing0/phased_chr1 chr1 True Genetic-map-b38.gmap2 reference_1000g.vcf 1000g.smap

and we get this output,

#######################################################################

/usr/local/lib/python3.8/dist-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">\n'
warnings.warn('invalid INFO header: %r' % header)
...

----------------------------------- Gnomix -----------------------------------

When using this software, please cite:
Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante,
Daniel Mas Montserrat, Alexander G Ioannidis:
"High Resolution Ancestry Deconvolution for Next Generation Genomic Data"
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1



Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Training...
Reading data...
Building model...
Training base models...
100%|████████████████████████████████████████| 1431/1431 [05:11<00:00, 4.60it/s]
Training smoother...

[12:03:06] WARNING: /workspace/src/learner.cc:480:
Parameters: { use_label_encoder } might not be used.

This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.

Evaluating model...
Re-training base models...
100%|████████████████████████████████████████| 1431/1431 [09:33<00:00, 2.49it/s]
/usr/local/lib/python3.8/dist-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">\n'
warnings.warn('invalid INFO header: %r' % header)
Analyzing model performance...
Estimated train accuracy: 99.34%
Estimated val accuracy: 98.48%
Model, info and analysis saved at /nas/osotolongo/images/testing0/phased_chr1/models/model_chm_chr1

Launching inference...
Loading and processing query file...

  • Number of SNPs from model: 5759060
  • Number of SNPs from file: 5759060
  • Number of intersecting SNPs: 5439307
  • Percentage of model SNPs covered by query file: 94.45%
Traceback (most recent call last):
  File "gnomix.py", line 409, in <module>
    run_inference(base_args, model,
  File "gnomix.py", line 49, in run_inference
    X_query, vcf_idx, fmt_idx = vcf_to_npy(query_vcf_data, model.snp_pos, model.snp_ref, return_idx=True, verbose=verbose)
  File "/home/gnomix/src/utils.py", line 132, in vcf_to_npy
    fill = np.full((n_ind*2, len(snp_pos_fmt)), miss_fill)
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py", line 342, in full
    a = empty(shape, dtype, order)
numpy.core._exceptions.MemoryError: Unable to allocate 275. GiB for an array with shape (6404, 5759060) and data type int64
########################################################################

Now, despite the final error, the pretrained model seems to be in place,

$ ls phased_chr1/models/model_chm_chr1/
analysis config.txt model_chm_chr1.pkl

as stated in the line "Model, info and analysis saved at ....". However, since we are not sure what happens afterwards, we don't know whether this pretrained model is usable. Can you tell us what is making the program fail at this point? Is the model good to use as it is?
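
As a sanity check on the MemoryError, the reported size can be reproduced from the array shape in the traceback. The sketch below is plain arithmetic and does not call gnomix. Reading 6404 as 2 × 3202 haplotypes is an assumption, based on `n_ind*2` in the traceback and the 3202-sample panel used as the query file. The 1-byte figure is shown only for comparison and does not imply gnomix offers such an option.

```python
def array_gib(shape, itemsize_bytes):
    """Size in GiB of a dense array with the given shape and element size."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * itemsize_bytes / 2**30

# Shape taken from the traceback: (n_ind * 2, number of model SNPs).
n_haplotypes = 6404   # assumed to be 2 * 3202 samples in the query VCF
n_snps = 5759060      # "Number of SNPs from model" in the log

int64_gib = array_gib((n_haplotypes, n_snps), 8)  # dtype reported in the error
int8_gib = array_gib((n_haplotypes, n_snps), 1)   # hypothetical 1-byte dtype

print(f"int64: {int64_gib:.1f} GiB")
print(f"int8:  {int8_gib:.1f} GiB")
```

The int64 figure matches the roughly 275 GiB in the error message. Because the size grows linearly with the number of query samples, the allocation fails in the inference step on the full panel, after the line reporting that the model was saved.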
