
Weird output #20

Open
Rosemeis opened this issue Jul 19, 2023 · 5 comments

Comments

@Rosemeis

Hi,

I have tested Neural ADMIXTURE on the 1000 Genomes Project data, but I'm seeing some weird results. It is a simple dataset of the 2504 phase 3 individuals with 500,000 SNPs randomly sampled (MAF > 0.05) across all chromosomes. I see no issues when using standard ADMIXTURE or SCOPE, for example, and the PCA plot output by Neural ADMIXTURE also looks fine. I have uploaded the admixture plots for K = 5, 6, 7, each with 10 runs using different seeds. Neural ADMIXTURE was run with all default parameter settings.

Environment

conda create -n neural python=3.9
conda activate neural
pip install neural-admixture

Command example

neural-admixture train --k 5 --data_path gp.merged.downsampled.bed --save_dir ./ --init_file test_s1_k5 --name tgp.downsampled.neural.s1 --seed 1 > tgp.downsampled.neural.s1.log

[Admixture plots: neural_K5, neural_K6, neural_K7]

[PCA plot: tgp.downsampled.neural.s1_training_pca]

@dmasmont
Member

Hi Jonas Meisner,

Thanks for reaching out! The results do look strange indeed. Could we arrange a way to get the actual data you are using, and perhaps the results that the original ADMIXTURE produces? Please send me an email at dmasmont@stanford.edu

Regards,
Daniel Mas

@Rosemeis
Author

Yes! The data is the newest version of 1KGP:
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/

But that also includes the related individuals, so I have attached the names of the original 2504 individuals, the 500K sampled sites as well as a run from ADMIXTURE with K=5.
tgp.downsampled.zip

Best,
Jonas

@haneenih7

Hey,
Thanks for neural-admixture, it's really fast. I actually had the exact same problem, i.e., the replacement of clusters across runs. Do you know why that is?
Additionally, I do not see an option for bootstrapping like the original ADMIXTURE has:
-B[X] : do bootstrapping [with X replicates]
Am I missing the parameter, or is it true that there is no bootstrapping option?

Thanks!
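(For context, the -B option in the original ADMIXTURE builds each bootstrap replicate by resampling SNPs with replacement while keeping individuals fixed. Below is a minimal numpy sketch of that resampling on toy data; all names and values here are illustrative, not part of either tool.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy genotype matrix: N individuals x M SNPs, genotypes coded 0/1/2.
N, M = 20, 500
G = rng.integers(0, 3, size=(N, M))

def bootstrap_replicate(G, rng):
    """One bootstrap replicate in the ADMIXTURE -B sense:
    resample SNPs (columns) with replacement, keeping individuals fixed."""
    M = G.shape[1]
    idx = rng.integers(0, M, size=M)
    return G[:, idx]

# Ten replicates, each of which would be refit to get parameter variability.
replicates = [bootstrap_replicate(G, rng) for _ in range(10)]
```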

@AlbertDominguez
Collaborator

Hi @Rosemeis and @haneenih7,

Thanks to both for your interest and testing the software!

@Rosemeis, thanks for pointing the issue out and sending us some results with data! Having checked it out, it looks like the initialization of the Q values by the network was very different across runs, and sometimes too far off to be recovered by the network. We didn't see this at all while developing the method, so we suspect it might be due to a change introduced in a dependency. We have also seen that the PCK-Means initialization (Supp. Algorithm 1 in the paper) appears to be more stable, so we have made PCK-Means the default instead of the previous one, PCArchetypal.

Nevertheless, in order to stabilize the results, we have added a "warmup training" phase for initialization, in which we train the encoder in a supervised fashion to estimate Q values, using as labels a function of the distance to the initial values of the P matrix in PCA space. This way, we have a sensible initialization not only for the P matrix but also for the encoder that computes Q. In practice, there is no change to how the algorithm is called from the CLI! Convergence checks are also performed only from epoch 15 onwards, to avoid stopping too early.
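As a rough numpy sketch of the warmup idea (not the actual implementation; the shapes, data, and the linear "encoder" standing in for the real network are all toy/hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: N samples, D PCA dimensions, K ancestral components.
N, D, K = 200, 4, 3

# Samples in PCA space, and the initial P matrix projected into that space.
X = rng.normal(size=(N, D))
P_pca = rng.normal(size=(K, D))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Pseudo-labels: a function of the (negative squared) distance of each
# sample to each initial P component, normalized to sum to one per sample.
d2 = ((X[:, None, :] - P_pca[None, :, :]) ** 2).sum(axis=2)
labels = softmax(-d2)                     # shape (N, K)

# Warmup: fit a tiny linear encoder W so that softmax(X @ W) matches the
# pseudo-labels, via plain gradient descent on the cross-entropy loss.
W = np.zeros((D, K))
lr = 0.1
for _ in range(500):
    Q = softmax(X @ W)
    grad = X.T @ (Q - labels) / N         # cross-entropy gradient
    W -= lr * grad

warm_Q = softmax(X @ W)                   # warm-started Q estimates
```

After this supervised warmup, the encoder starts the main (unsupervised) training from Q values already anchored to the P initialization, rather than from an arbitrary state.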

This is how the results look for the data you provided, using the default parameters you were using (for K=5):

[pong plot: modes_K=5_2023-6-31_20h48m3s]

Let me know if you have any follow-up; happy to discuss the issue a bit more if necessary! To install the upgrades, simply run pip install -U neural-admixture.

@haneenih7, regarding bootstrapping, there is currently no such option in Neural ADMIXTURE. I will open a new issue for the feature, and we will try to include it in the next release, along with cross-validation!

@Rosemeis
Author

Rosemeis commented Aug 13, 2023

Hi @AlbertDominguez

Thanks for looking into the issue!
After the update, the results are looking a bit more consistent.

However, I still see a lot of issues. Here are the neural-admixture runs for K=5:
[pong plot: modes_K=5_2023-7-13_17h45m53s]

Here are the ADMIXTURE runs, all getting the same solution:
[pong plot: mainviz_2023-7-14_0h30m39s]

The results resemble yours, but taking the admixed AFR populations as an example, you can see that there is no European component, a lot of individuals get fully assigned to a single, incorrect ancestry, and the results are generally not very consistent across runs at a finer scale. If I were using ADMIXTURE, I would read this as a sign of convergence problems.
Also, if I compute the log-likelihoods from the Q and F matrices, they are far off from the results of ADMIXTURE, as well as from SCOPE, which optimizes a least-squares criterion.
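(For reference, the log-likelihood I mean is the standard binomial one used by ADMIXTURE-style models; a minimal numpy sketch on simulated toy data, where the function name and dimensions are just illustrative:)

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dimensions: N individuals, M SNPs, K components.
N, M, K = 50, 100, 3

# Hypothetical Q (admixture proportions) and P (allele frequencies).
Q = rng.dirichlet(np.ones(K), size=N)        # (N, K), rows sum to 1
P = rng.uniform(0.05, 0.95, size=(K, M))     # (K, M)

# Simulate genotypes (0, 1, 2) from the model itself.
F = Q @ P                                    # individual allele frequencies
G = rng.binomial(2, F)                       # (N, M)

def admixture_loglik(G, Q, P, eps=1e-10):
    """Binomial log-likelihood of ADMIXTURE-style models:
    sum_ij g_ij*log(f_ij) + (2 - g_ij)*log(1 - f_ij), with F = Q @ P."""
    F = np.clip(Q @ P, eps, 1 - eps)
    return float(np.sum(G * np.log(F) + (2 - G) * np.log(1 - F)))

ll = admixture_loglik(G, Q, P)
```

Note the likelihood is invariant to relabeling the K components (permuting the columns of Q together with the rows of P), which is why only the value itself, not the component order, is comparable across runs.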

neural-admixture (10 runs, K=5)
-1107004412.7
-1107158223.9
-1107499553.3
-1107478393.7
-1107716015.0
-1107295943.5
-1107270413.1
-1107290867.5
-1107093124.2
-1107849482.0

ADMIXTURE (10 runs, K=5)
-1092429529.5
-1092429519.0
-1092429537.5
-1092429534.9
-1092429535.0
-1092429531.4
-1092429520.6
-1092429521.8
-1092429530.0
-1092429521.5

SCOPE (10 runs, K=5)
-1092788577.1
-1092788572.3
-1092789272.9
-1092788574.4
-1092788569.2
-1092788570.0
-1092788569.5
-1092788572.9
-1092789298.4
-1092788573.3

The log-likelihood also doesn't seem to be reported by neural-admixture, which makes it a bit harder to debug. It would also be nice to have a feature to control the number of threads!

Best,
Jonas
