Strange results on 23 and me genotype data and pretrained models #22

enabieva · 2022-03-19T10:18:26Z

I'm testing Gnomix on 23andMe genotype data using the pretrained models, and the results look completely incorrect. The individual is of Eastern European and Jewish ancestry, but the predictions (a) consist of short segments and (b) are mostly African, at least on chr22.

The statistics in the pretrained_log file (for chr22) are:
Loading and processing query file...

Number of SNPs from model: 317408
Number of SNPs from file: 2127
Number of intersecting SNPs: 2034
Percentage of model SNPs covered by query file: 0.64%

What could be the source of the error?

vinhdc10998 · 2022-03-25T04:58:38Z

Hi Enabieva,
I'm testing Gnomix on my own genotype data using 1KGP as a reference panel, and I get a result similar to yours: most of the predictions on chromosome 20 are AFR, but my data is EAS.

The statistics from the run:

Number of SNPs from model: 936278
Number of SNPs from file: 938186
Number of intersecting SNPs: 877359
Percentage of model SNPs covered by query file: 93.71000000000001%

So, did you fix the error, or do you have any advice to share? Also, if I want to convert per-marker ancestry calls into a per-sample ancestry, how should I do that? Can I take a vote across the markers to get the result for the sample?

Thank you.
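One way to read the voting question above is as a tally of per-window (or per-marker) labels for a single sample. The following is a minimal sketch under that assumption. The function name `ancestry_summary` and its inputs are hypothetical and do not reflect Gnomix's actual output format. The labels in the example are the ones quoted from the README elsewhere in this thread.

```python
from collections import Counter


def ancestry_summary(window_labels, window_sizes=None):
    """Summarise local-ancestry calls for one sample.

    window_labels: one ancestry label per window (or marker).
    window_sizes:  optional weights, e.g. number of SNPs per window.
                   If omitted, every window counts equally.
    Returns (majority_label, {label: fraction}).
    """
    if not window_labels:
        raise ValueError("window_labels is empty")
    if window_sizes is None:
        window_sizes = [1] * len(window_labels)
    if len(window_labels) != len(window_sizes):
        raise ValueError("labels and sizes must have the same length")

    # Weighted tally of each label.
    totals = Counter()
    for label, size in zip(window_labels, window_sizes):
        totals[label] += size

    total = sum(totals.values())
    fractions = {label: count / total for label, count in totals.items()}
    majority = max(fractions, key=fractions.get)
    return majority, fractions


majority, fractions = ancestry_summary(["EAS", "EAS", "AFR", "EAS"])
print(majority, fractions)
```

For an admixed individual the fractions are probably more informative than the single majority label, since a plain vote discards the minority ancestries.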

enabieva · 2022-03-25T07:48:37Z

Dear vinhdc10998,
No, I haven't. I had at some point tried to train a new model, but it kept crashing.
I did not understand your last question (but I'm new to the field, so probably wouldn't know the answer).

jcgrenier · 2022-04-01T20:52:54Z

Hello! I'm getting the same issue in my tests as well. I'm taking 10,000 random samples from UK Biobank and I'm also getting largely AFR ancestry for them, while they should be largely European. Is it a bug or a problem with the labels?

I found this in the README:

The models named default_model.pkl are trained on hg build 37 references from the following biogeographic regions: Subsaharan African (AFR), East Asian (EAS), European (EUR), Native American (NAT), Oceanian (OCE), South Asian (SAS), and West Asian (WAS) and labels and predicts them as 0, 1, .., 6 respectively. The populations used to train these ancestries are given in the supplementary section of the reference provided at the bottom of this readme.

Thanks a lot!

guidebortoli · 2022-04-01T21:42:59Z

Hi all,
Please take a look at this post I made here (already closed). Ignore the "PerformanceWarning", as it is irrelevant to the problem here...

Briefly, I had the same problem you are all seeing. I was using my own reference (HGDP phased samples on the hg38 build) and a query file of Brazilian samples (from my own project) that are known to be admixed (EUR, AFR and NAM).

I found that Gnomix was overcalling AFR across all windows, which was obviously a problem. I even compared the results with RFMix v2, with the previous version of Gnomix (called XGMix), and with the global ancestry estimated by ADMIXTURE (the latter being consistent with the RFMix v2 results).

The post includes an explanation of extrapolation versus interpolation that more or less solved the problem for me (I still need to test the other chromosomes to see whether it actually worked).

I restricted the reference file to the markers present in my query file. That way the reference file contained only markers that also appear in the query file (even though my query file had more markers than the reference file).

After running Gnomix again, the program reports which markers were extrapolated (in my case, the ones at the boundaries), and I excluded those markers from my subsequent analysis (the loss was minimal compared with the number of markers that remained).

With that in hand, I compared again with RFMix v2 and with the global ancestry generated by ADMIXTURE, and the results correlated really well: r2 of roughly 0.9 for the three ancestries (EUR, AFR and NAM)...
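The marker filtering described above can be sketched roughly as follows, assuming both files are on the same build and that positions for one chromosome have already been read into plain lists. The function name `split_query_markers` is hypothetical. The sketch does not use Gnomix's own report of extrapolated markers; it only flags query markers that fall outside the positional range of the reference, which is one plausible reading of "the ones at the boundaries".

```python
def split_query_markers(reference_positions, query_positions):
    """Partition query marker positions on a single chromosome.

    Returns a dict with three sorted lists:
      shared         - positions present in both reference and query
      outside_range  - query-only positions beyond the reference's first or
                       last marker (would require extrapolation)
      inside_unshared - query-only positions within the reference range
    """
    if not reference_positions:
        raise ValueError("reference_positions is empty")

    ref = set(reference_positions)
    lo, hi = min(ref), max(ref)

    shared, outside, inside = [], [], []
    for pos in sorted(set(query_positions)):
        if pos in ref:
            shared.append(pos)
        elif pos < lo or pos > hi:
            outside.append(pos)
        else:
            inside.append(pos)
    return {"shared": shared, "outside_range": outside, "inside_unshared": inside}


result = split_query_markers([100, 200, 300, 400], [50, 100, 250, 300, 450])
print(result)
```

Keeping only the `shared` positions in both files mirrors the "only markers present in the query" step. The `outside_range` list is the part one would drop to avoid extrapolation at the chromosome ends.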

Hope this helps in some way.

Cheers,

AlexIoannidis · 2022-04-02T00:57:00Z

Great discussion. In general when you see large overestimates of African ancestry, it is a sign that your reference SNP encodings (pre-trained models) are not matching your sample SNP encodings. For example, your sample might be on a different build (b37 vs. b38) than the one the model was trained on. In any case, when your sample SNPs are defined somehow differently than in the reference model, it will now appear to the model that your samples have SNPs with unusual variants that were very rare, or even unobserved, in the training data. Since the ancestry with the most variant diversity is African (a consequence of the out-of-Africa bottleneck), these samples will now be assigned African as the most likely match.

If you simply cannot determine how your SNP definitions are differing from those used in creating the model, you'll have to retrain your own model by obtaining public references (for instance from 1000 genomes) and merging them with your data. It is this merging step that must be done carefully, because if your sample SNP variants are not matched during the merge to the references (same build, same strandedness), you will again see African ancestry everywhere when inferring ancestry on your samples.
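To make the "same build, same strandedness" check concrete, here is a minimal sketch that classifies how the alleles at a shared position differ between the reference and the query. It is not part of Gnomix. The function names `classify_site` and `summarise_sites` and the category names are hypothetical, and it assumes biallelic SNPs given as (REF, ALT) pairs at positions already matched on the same build.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_site(ref_alleles, query_alleles):
    """Compare (REF, ALT) in the reference with (REF, ALT) in the query."""
    r = tuple(a.upper() for a in ref_alleles)
    q = tuple(a.upper() for a in query_alleles)
    if len(r) != 2 or len(q) != 2 or any(a not in COMPLEMENT for a in r + q):
        return "unsupported"  # indels, multi-base or missing alleles

    # A/T and C/G SNPs look identical on both strands, so the alleles
    # alone cannot tell a match from a strand flip.
    if r[1] == COMPLEMENT[r[0]]:
        return "ambiguous" if set(q) == set(r) else "mismatch"

    flipped = tuple(COMPLEMENT[a] for a in q)
    if q == r:
        return "match"
    if q == r[::-1]:
        return "swapped"  # REF and ALT exchanged
    if flipped == r:
        return "strand_flip"
    if flipped == r[::-1]:
        return "strand_flip_swapped"
    return "mismatch"


def summarise_sites(pairs):
    """Count categories over an iterable of (ref_alleles, query_alleles)."""
    return Counter(classify_site(r, q) for r, q in pairs)


sites = [
    (("A", "G"), ("A", "G")),
    (("A", "G"), ("G", "A")),
    (("A", "G"), ("T", "C")),
    (("A", "T"), ("A", "T")),
]
print(summarise_sites(sites))
```

A large share of anything other than `match` at shared positions would be consistent with the encoding mismatch described above, although the sketch cannot by itself say which file is wrong.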

silviaadiz · 2022-07-22T08:21:56Z

Hi, I have followed this thread and switched to imputed data (Michigan Imputation Server, 1000G AMR). It was very helpful, thanks.
The overlap between the pretrained model's SNPs and mine is around 60% (filtered by R2>0.7) and 70% (all imputed SNPs). My sample is mostly admixed (AFR, NAT, EUR).
However, Gnomix is estimating SAS ancestry for all my individuals and chromosomes, except for some regions that are EAS. Could I be having the same extrapolation problem even though it is not estimating AFR?

Thank you!

njbowen · 2022-11-05T03:05:48Z

Any updates on this issue? I am using an hg38-to-hg19 liftover of a 100X whole-genome sequence .vcf from Nebula and getting mostly African ancestry for a European individual on chr2. The images are of hg38 with Gnomix, the hg38-to-hg19 liftover with Gnomix, and the 23andMe chromosome painting.

Got these numbers during the run for the hg38 vcf:
Loading and processing query file...

Number of SNPs from model: 1975963
Number of SNPs from file: 373138
Number of intersecting SNPs: 4109
Percentage of model SNPs covered by query file: 0.21%
Found 2367 (57.6053%) different reference variants. Adjusting...

Got these numbers during the run for the hg38-to-hg19 liftover vcf:
Loading and processing query file...

Number of SNPs from model: 1975963
Number of SNPs from file: 368794
Number of intersecting SNPs: 272903
Percentage of model SNPs covered by query file: 13.81%
Found 3 (0.0011%) different reference variants. Adjusting...

So many more SNPs are covered after the liftover, and far fewer reference variants differ.

But the chromosome 2 images still don't reflect the correct ancestry.
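The two percentages in each log can be reproduced from the counts, which helps when comparing runs. This is a small sketch, not Gnomix code. The function name `coverage_report` is hypothetical, and the choice of denominators (model SNPs for coverage, intersecting SNPs for differing reference variants) is inferred from the logged numbers above rather than taken from the source code.

```python
def coverage_report(n_model, n_intersect, n_ref_differ):
    """Recompute the log percentages from the raw counts."""
    return {
        # share of the model's SNPs that the query file covers
        "model_coverage_pct": round(100 * n_intersect / n_model, 2),
        # share of intersecting SNPs whose reference variant differs
        "ref_differ_pct": round(100 * n_ref_differ / n_intersect, 4),
    }


hg38 = coverage_report(1975963, 4109, 2367)
lifted = coverage_report(1975963, 272903, 3)
print(hg38)
print(lifted)
```

With the counts from the two runs, this gives the same 0.21% and 57.6053% for hg38 and 13.81% and 0.0011% for the liftover, so even after liftover the query covers only about one in seven of the model's SNPs.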

njbowen · 2022-11-11T22:29:04Z

Filtered the overlapping SNPs and retrained the model. The new output looks better, but it paints worst on the genome that the overlaps were taken from to retrain the model... riddle me this, Batman.
