Strange results on 23 and me genotype data and pretrained models #22

Open
enabieva opened this issue Mar 19, 2022 · 8 comments
Comments

@enabieva

I'm testing Gnomix on 23andMe genotype data using the pretrained models, and the results I get seem completely incorrect. The individual is of Eastern European and Jewish ancestry, but the predictions (a) consist of short segments, and (b) most of the segments, at least on Chr 22, are predicted to be African.

The statistics in pretrained_log file (for Chr 22) are:
Loading and processing query file...

  • Number of SNPs from model: 317408
  • Number of SNPs from file: 2127
  • Number of intersecting SNPs: 2034
  • Percentage of model SNPs covered by query file: 0.64%

What could be the source of the error?

@vinhdc10998

Hi Enabieva,
I'm testing Gnomix on my genotype data using 1KGP as a reference panel, and the result I get is similar to yours. Most of the predictions on chromosome 20 are AFR, but my data is EAS.

The statistics from the run:

  • Number of SNPs from model: 936278
  • Number of SNPs from file: 938186
  • Number of intersecting SNPs: 877359
  • Percentage of model SNPs covered by query file: 93.71000000000001%

So, did you fix the error, or can you share any advice? Also, if I want to convert per-marker ancestry into per-sample ancestry, how should I do that? Can I take a vote across the markers to get the result for the sample?

Thank you.
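One way to picture the voting idea in the question above is the sketch below. It is not part of Gnomix: the function names and the input format (a list of integer labels per window for one haplotype) are assumptions made for illustration, and the label order is the one the README gives for the pretrained models (AFR, EAS, EUR, NAT, OCE, SAS, WAS as 0 to 6). Weighting windows by size is optional and also an assumption.

```python
from collections import Counter

# Label order as described in the README for the pretrained models.
POPULATIONS = ["AFR", "EAS", "EUR", "NAT", "OCE", "SAS", "WAS"]

def ancestry_proportions(window_labels, window_sizes=None):
    """Return {population: fraction} for one haplotype.

    window_labels: one integer label per window (hypothetical input format).
    window_sizes: optional weight per window, e.g. number of SNPs in it.
    """
    if window_sizes is None:
        window_sizes = [1] * len(window_labels)
    totals = Counter()
    for label, size in zip(window_labels, window_sizes):
        totals[POPULATIONS[label]] += size
    grand_total = sum(totals.values())
    return {pop: weight / grand_total for pop, weight in totals.items()}

def majority_ancestry(window_labels, window_sizes=None):
    """Return the population with the largest (weighted) share of windows."""
    props = ancestry_proportions(window_labels, window_sizes)
    return max(props, key=props.get)

labels = [2, 2, 0, 2, 1]  # toy example: three EUR windows, one AFR, one EAS
print(ancestry_proportions(labels))
print(majority_ancestry(labels))
```

For admixed samples the proportions are usually more informative than the single majority label, since a vote discards the minority ancestries.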

@enabieva
Author

Dear vinhdc10998,
No, I haven't. I had at some point tried to train a new model, but it kept crashing.
I did not understand your last question (but I'm new to the field, so probably wouldn't know the answer).

@jcgrenier

jcgrenier commented Apr 1, 2022

Hello! I'm getting the same issue with my tests as well. I'm taking 10,000 random samples from UK Biobank and I'm also getting largely AFR ancestry for those samples, while they should be largely European. Is it a bug or a problem with the labels?

I found this in the README :

The models named default_model.pkl are trained on hg build 37 references from the following biogeographic regions: Subsaharan African (AFR), East Asian (EAS), European (EUR), Native American (NAT), Oceanian (OCE), South Asian (SAS), and West Asian (WAS) and labels and predicts them as 0, 1, .., 6 respectively. The populations used to train these ancestries are given in the supplementary section of the reference provided at the bottom of this readme.

Thanks a lot!

@guidebortoli

Hi all,
Please take a look at this post I've made here (already closed). Ignore the "PerformanceWarning", as it is irrelevant to the problem here...

Briefly, I had the same problem you are all seeing. I was using my own reference (HGDP phased samples on the hg38 build) and a query file of Brazilian samples (from my own project) that are known to be admixed (EUR, AFR and NAM).

I found that Gnomix was assigning nearly all windows to AFR, which was obviously a problem. I even compared the results with RFMix v2, with the previous version of Gnomix (called XGMix), and with the global ancestry generated by the ADMIXTURE program (the latter being consistent with the results of RFMix v2).

There is an explanation about extrapolation and interpolation in that post that more or less solved the problem for me (I still need to test the other chromosomes to see whether it actually worked).

In the reference file, I kept only the markers that are present in my query file. That way my reference file contained only markers also found in my query file (even though my query file had more markers than the reference file).
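The filtering step described above can be sketched roughly as follows. This is an illustration only, not the commands actually used in that post: it assumes the sites of each file have already been read into tuples of `(chrom, pos, ref, alt)`, and the function name is hypothetical. In practice the same subsetting is normally done on the VCF files themselves with a dedicated tool.

```python
def restrict_to_shared(reference_sites, query_sites):
    """Keep reference sites whose (chrom, pos) also occur in the query.

    Both inputs are iterables of (chrom, pos, ref, alt) tuples
    (a hypothetical in-memory format). Reference order is preserved.
    """
    query_keys = {(chrom, pos) for chrom, pos, *_ in query_sites}
    return [site for site in reference_sites if (site[0], site[1]) in query_keys]

reference = [("22", 100, "A", "G"), ("22", 200, "C", "T"), ("22", 300, "G", "A")]
query = [("22", 100, "A", "G"), ("22", 300, "G", "A"), ("22", 400, "T", "C")]

shared = restrict_to_shared(reference, query)
print(len(shared), "of", len(reference), "reference sites kept")
```

Matching on position alone does not guarantee that the alleles agree between the two files, so allele and strand consistency still has to be checked separately.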

After running Gnomix again, the program reports which markers were extrapolated (in my case the ones at the boundaries), and I excluded those markers from my subsequent analysis (the loss was minimal compared with the number of markers that remained).

With that in hand, I compared again with RFMix v2 and with the global ancestry generated by ADMIXTURE, and the results correlated really well, r2=0.9ish for the 3 ancestries (EUR, AFR and NAM)...

Hope this helps you in some way.

Cheers,

@AlexIoannidis
Member

AlexIoannidis commented Apr 2, 2022

Great discussion. In general when you see large overestimates of African ancestry, it is a sign that your reference SNP encodings (pre-trained models) are not matching your sample SNP encodings. For example, your sample might be on a different build (b37 vs. b38) than the one the model was trained on. In any case, when your sample SNPs are defined somehow differently than in the reference model, it will now appear to the model that your samples have SNPs with unusual variants that were very rare, or even unobserved, in the training data. Since the ancestry with the most variant diversity is African (a consequence of the out-of-Africa bottleneck), these samples will now be assigned African as the most likely match.

If you simply cannot determine how your SNP definitions are differing from those used in creating the model, you'll have to retrain your own model by obtaining public references (for instance from 1000 genomes) and merging them with your data. It is this merging step that must be done carefully, because if your sample SNP variants are not matched during the merge to the references (same build, same strandedness), you will again see African ancestry everywhere when inferring ancestry on your samples.
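A quick way to look for the kind of encoding mismatch described above is to compare REF/ALT alleles at positions shared by the query and the reference. The sketch below is a generic check, not a Gnomix feature: the function names and the `(REF, ALT)` tuple format are assumptions, and it only handles single-base biallelic SNPs.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(ref, alt):
    """A/T and C/G SNPs look identical on both strands, so flips are undetectable."""
    return COMPLEMENT[ref] == alt

def classify_site(ref_alleles, query_alleles):
    """Compare (REF, ALT) of the reference panel with (REF, ALT) of the query."""
    r_ref, r_alt = ref_alleles
    q_ref, q_alt = query_alleles
    if (q_ref, q_alt) == (r_ref, r_alt):
        return "match"
    if (q_ref, q_alt) == (r_alt, r_ref):
        return "swapped"  # same alleles, REF and ALT exchanged
    flipped = (COMPLEMENT[q_ref], COMPLEMENT[q_alt])
    if flipped == (r_ref, r_alt):
        return "strand_flip"  # query reported on the opposite strand
    if flipped == (r_alt, r_ref):
        return "strand_flip_swapped"
    return "mismatch"  # different alleles, possibly a different build

# Toy data: reference alleles paired with query alleles at shared positions.
pairs = [(("A", "G"), ("A", "G")), (("A", "G"), ("T", "C")), (("C", "T"), ("A", "G"))]
print(Counter(classify_site(ref, qry) for ref, qry in pairs))
```

A large share of anything other than "match" suggests the query and the reference are not encoded the same way. Strand-ambiguous sites (A/T, C/G) cannot be resolved by this comparison and are often simply dropped.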

@silviaadiz

Hi, I have followed this thread and switched to imputed data (Michigan Imputation Server, 1000G AMR). It was very helpful, thanks.
The overlap between the pretrained models' SNPs and mine is around 60% (filtered by R2>0.7) and 70% (all imputed SNPs). My sample is mostly admixed (AFR, NAT, EUR).
However, Gnomix is estimating SAS ancestry for all my individuals and chromosomes, except for some regions that are EAS. Could I be having the same extrapolation problem even though it is not estimating AFR?

Thank you!

@njbowen

njbowen commented Nov 5, 2022

Any updates on this issue? I am using an hg38-to-hg19 liftover of a 100X whole-genome sequence .vcf from Nebula and getting mostly African ancestry for a European individual on chr2. The images show hg38 with Gnomix, the hg38-to-hg19 liftover with Gnomix, and the 23andMe chromosome painting.

I got these numbers during the run for the hg38 vcf:
Loading and processing query file...

  • Number of SNPs from model: 1975963
  • Number of SNPs from file: 373138
  • Number of intersecting SNPs: 4109
  • Percentage of model SNPs covered by query file: 0.21%
  • Found 2367 (57.6053%) different reference variants. Adjusting...

I got these numbers during the run for the hg38-to-hg19 liftover vcf:
Loading and processing query file...

  • Number of SNPs from model: 1975963
  • Number of SNPs from file: 368794
  • Number of intersecting SNPs: 272903
  • Percentage of model SNPs covered by query file: 13.81%
  • Found 3 (0.0011%) different reference variants. Adjusting...

So many more SNPs are covered after liftover, and far fewer reference variants differ.

But the chromosome 2 images don't reflect the correct ancestry.
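The percentages in the two logs above can be reproduced from the raw counts. The sketch below is a sanity check only: the function name is hypothetical, and the reading that the "different reference variants" percentage is taken relative to the intersecting SNPs is inferred from the numbers rather than from the Gnomix source.

```python
def coverage_report(n_model, n_intersect, n_ref_diff=0):
    """Return (% of model SNPs covered, % of intersecting SNPs with a different REF)."""
    pct_covered = 100 * n_intersect / n_model
    pct_ref_diff = 100 * n_ref_diff / n_intersect if n_intersect else 0.0
    return round(pct_covered, 2), round(pct_ref_diff, 4)

# Counts copied from the two logs above.
print("hg38:          ", coverage_report(1975963, 4109, 2367))
print("hg38 -> hg19:  ", coverage_report(1975963, 272903, 3))
```

With these counts the function returns 0.21% and 57.6053% for the hg38 run and 13.81% and 0.0011% for the liftover run, matching the logs. Even after liftover, roughly 86% of the model's SNPs are still absent from the query file.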

[Images: hg38_chr2, hg38_to_hg19_liftover_chr2, 23andMe_chr2]

@njbowen

njbowen commented Nov 11, 2022

I filtered to the overlapping SNPs and retrained the model. The new output looks better, but it paints worse on the genome that the overlaps were taken from to retrain the model... riddle me this, Batman.

[Image: demo_vs_overlapping_models]


7 participants