PerformanceWarning message #6

Closed
guidebortoli opened this issue Oct 19, 2021 · 9 comments

@guidebortoli

guidebortoli commented Oct 19, 2021

Hi,

I am not sure if this is actually an issue that would compromise the analysis, but there is a message at the end of the analysis as follows:



--------------------------------------------------------------------------------
-----------------------------------  Gnomix  -----------------------------------
--------------------------------------------------------------------------------
When using this software, please cite: 
Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante, 
Daniel Mas Montserrat, Alexander G Ioannidis: 
"High Resolution Ancestry Deconvolution for Next Generation Genomic Data" 
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Training...
Reading data...
Building model...
Training base models...
100%|████████████████████████████████████████| 709/709 [01:04<00:00, 10.95it/s]                                      
Training smoother...
Fitting calibrator...
Evaluating model...
Re-training base models...
100%|████████████████████████████████████████| 709/709 [01:20<00:00,  8.83it/s]                                                            
Analyzing model performance...
Estimated train accuracy: 98.42%
Estimated val accuracy: 98.32%
Model, info and analysis saved at chr15/models/model_chm_chr15
--------------------------------------------------------------------------------
Launching inference...
Loading and processing query file...
- Number of SNPs from model: 1087854
- Number of SNPs from file: 30954
- Number of intersecting SNPs: 20036
- Percentage of model SNPs covered by query file: 1.8399999999999999%
Inferring ancestry on query data...
Phasing individual 493/493
Writing phased SNPs to disc...
/Users/debortoli/Downloads/arquivos_compactados/gnomix-main/src/utils.py:316: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  df[data_samples[i]] = genotype_person
Saving results...


Specifically, it is the PerformanceWarning

I am getting a really weird result when running gnomix...
I have an admixed sample (Native American, African and European ancestries), with a global ancestry estimate around 70% EUR, 25% AFR and 5% NAM...
The Gnomix .msp output (I've tested only one chromosome) assigns AFR to all windows in all samples, which is not correct...

Not sure if this PerformanceWarning might be the culprit of this...

Thanks,

@weekend37
Collaborator

Hi Guilherme! This is very strange indeed. I don't expect the performance warning to be the cause, as it only refers to the speed of that operation. So as long as you weren't bothered by the time that step took, it can be ignored.

What I do think is the source of this issue is the reference panel (or potentially the genetic map file). Could you tell me a bit more about the reference file you're using, or even share it with me?
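
On the warning itself: it is raised by the line `df[data_samples[i]] = genotype_person`, which adds one column per sample. Below is a minimal sketch of the alternative that the pandas message suggests, building all columns first and joining them once with `pd.concat(axis=1)`. The sample names, sizes, and random genotypes are made up for illustration; this is not the actual code in `src/utils.py`.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data; names and sizes are illustrative only.
rng = np.random.default_rng(0)
n_snps, n_samples = 1000, 200
data_samples = [f"sample_{i}" for i in range(n_samples)]
genotypes = rng.integers(0, 2, size=(n_samples, n_snps))


def build_by_insertion():
    """Add one column per sample; this is the pattern the warning complains about."""
    df = pd.DataFrame(index=range(n_snps))
    for i in range(n_samples):
        df[data_samples[i]] = genotypes[i]
    return df


def build_by_concat():
    """Collect every column first, then join them in a single concat call."""
    columns = [pd.Series(genotypes[i], name=data_samples[i]) for i in range(n_samples)]
    return pd.concat(columns, axis=1)


# Silence the warning here so the comparison runs quietly.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    slow = build_by_insertion()

fast = build_by_concat()
```

Both frames hold the same values, so the change would only affect how long the writing step takes, not the inferred ancestry.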

@guidebortoli
Author

Hi Helgi,

Please see the zipped file with the gmap and the reference file (a phased HGDP reference file for chr15).

Guilherme

@weekend37
Collaborator

Great, thanks. I'll take a look when I find time.
Also, feel free to include the query file if possible. Then I can actually recreate your issue.

@guidebortoli
Author

Thanks @weekend37 . I sent you an email with the files.

@guidebortoli
Author

Hi @weekend37 , did you get a chance to take a look at this issue?

Thank you! :)

@weekend37
Collaborator

weekend37 commented Oct 21, 2021

Hi Guilherme. Yes, I just finished looking at your files. It seems that you have a large mismatch of SNP positions between your reference and query files. The overlap spans about 2% of the reference file SNPs. Not only that, but the segments that are covered seem to be entirely different (see image). The result is a model trying to estimate the ancestry of segments it has never seen before.

[Image: SNP_distributions]

Please make sure that this is not a mistake. It could, for example, be caused by different builds, like 37 vs 38.

If this is not a mistake and you only have 2% of the reference positions, I would recommend imputing the remaining SNPs in order to get sensible results. Those results will assume that the imputed SNPs are the ground truth and hence may be quite biased, but they will at least be sensible.

-Helgi
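
The coverage figure and the segment mismatch described above can be checked directly from the two lists of SNP positions. The following is a minimal sketch, not Gnomix code: `overlap_summary` is a hypothetical helper, and the toy positions are made up. The last line applies the same ratio to the counts printed in the inference log.

```python
def overlap_summary(ref_positions, query_positions):
    """Summarize how well the query SNP positions cover the reference panel."""
    ref = set(ref_positions)
    query = set(query_positions)
    shared = ref & query
    return {
        "n_ref": len(ref),
        "n_query": len(query),
        "n_shared": len(shared),
        # Share of reference SNPs that also appear in the query file.
        "pct_ref_covered": 100 * len(shared) / len(ref),
        # Position ranges; very different ranges hint at different segments or builds.
        "ref_range": (min(ref), max(ref)),
        "query_range": (min(query), max(query)),
    }


# Toy positions, made up for illustration (not taken from the files in this issue).
ref_positions = [100, 200, 300, 400, 500, 600, 700, 800]
query_positions = [150, 200, 250, 300, 900]
summary = overlap_summary(ref_positions, query_positions)

# Same ratio using the counts from the log: 20036 intersecting of 1087854 model SNPs.
pct_from_log = 100 * 20036 / 1087854
```

Comparing `ref_range` with `query_range` is only a coarse check; a histogram of positions, like the image in this comment, shows where along the chromosome the two files actually overlap.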

@guidebortoli
Author

Thanks @weekend37!
So, I have done the same analysis using XGMix and the output was slightly better, although it was still overestimating the AFR ancestry...
I also ran rfmix2, which gave the best results so far, both when compared with the global ancestry from the Admixture program and in a quick admixture mapping analysis for a known marker related to the phenotype I am investigating (it is known to give a strong signal due to an allele frequency difference of almost 100% between AFR and EUR)...

I am not sure if the 2% overlap could be due to the fact that my study used array genotype data, while the reference is phased WGS data from HGDP...

Regarding the genome build, they are both hg38...

Guilherme

@weekend37
Collaborator

weekend37 commented Oct 21, 2021

Yep. That makes total sense. The default base model for Gnomix is a Logistic Regression (linear model). The default XGMix base model and the rfmix equivalent are XGBoost and Random Forest, respectively, which are both decision-tree-based models that handle missing data much more naturally.

However, I bet they're still struggling a little bit, and no matter what you do, your estimates past the ~0.45x10^8 marker will never have any meaning, regardless of the model (almost like using a reference file from a different chromosome than the one in the query file).

Now, if you decide to trim your query file to that marker so that we're at least dealing with the same segments, I can fix the overlap issue with a feature that I've been meaning to add: training the model only on the query SNPs. With both of those done, you should have no issues.

I'm positive that I can add that feature soon but that would be a Beta version. Let me know if that would be of interest to you.
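
The two steps proposed above (trimming the query at the marker and training only on the query SNPs) amount to filtering positions. Below is a minimal sketch under two assumptions: that "trim to that marker" means keeping query positions at or below roughly 45,000,000, and that restricting training to the query SNPs means keeping only the reference positions that remain in the trimmed query. `trim_and_intersect` is a hypothetical helper and the positions are toy values; this is not the feature as implemented in Gnomix.

```python
def trim_and_intersect(ref_positions, query_positions, max_position):
    """Keep query SNPs up to max_position, then keep the reference SNPs found among them."""
    trimmed_query = [p for p in query_positions if p <= max_position]
    keep = set(trimmed_query)
    ref_subset = [p for p in ref_positions if p in keep]
    return trimmed_query, ref_subset


# Assumed cutoff, read from the "~0.45x10^8 marker" mentioned in this comment.
CUTOFF = 45_000_000

# Toy positions, made up for illustration.
ref_positions = [10_000_000, 20_000_000, 30_000_000, 40_000_000]
query_positions = [20_000_000, 40_000_000, 50_000_000, 60_000_000]

trimmed_query, ref_subset = trim_and_intersect(ref_positions, query_positions, CUTOFF)
```

After this filtering, the model and the query describe the same set of SNPs, so no position needs to be treated as missing at inference time.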

@weekend37
Collaborator

Feature request added in 11.
Marking this as closed.
