still getting biallelic error (attempting to use supervised training) #25

Open
seetarajpara opened this issue Jul 23, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@seetarajpara

I created a merged bed file and tried to make absolutely sure the file did not include any multiallelic sites:

(nadmenv) seetaraj@a02-01:/scratch1/seetaraj/BCSB_GDA/ORIEN_MOONSHOT_ANCESTRY/moonshot/multisample/run_ancestry$ plink --bfile merged --biallelic-only --make-bed --out biallelic_merged
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to biallelic_merged.log.
Options in effect:
  --bfile merged
  --biallelic-only
  --make-bed
  --out biallelic_merged

257433 MB RAM detected; reserving 128716 MB for main workspace.
78750 variants loaded from .bim file.
68 people (68 males, 0 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 68 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.146956.
78750 variants and 68 people pass filters and QC.
Note: No phenotypes present.
--make-bed to biallelic_merged.bed + biallelic_merged.bim +
biallelic_merged.fam ... done.
(nadmenv) seetaraj@a02-01:/scratch1/seetaraj/BCSB_GDA/ORIEN_MOONSHOT_ANCESTRY/moonshot/multisample/run_ancestry$ neural-admixture train --k 7 --supervised --populations_path ${POPS_PATH} --name moonshot_ancestry --data_path biallelic_merged.bed --save_dir ${SAVE_PATH}
INFO:neural_admixture.entry:Neural ADMIXTURE - Version 1.4.1
INFO:neural_admixture.entry:[CHANGELOG] Mean imputation for missing data was added in version 1.4.0. To reproduce old behaviour, please use `--imputation zero` when invoking the software.
INFO:neural_admixture.entry:[CHANGELOG] Default P initialization was changed to 'pckmeans' in version 1.3.0.
INFO:neural_admixture.entry:[CHANGELOG] Warmup training for initialization of Q was added in version 1.3.0 to improve training stability (only for `pckmeans`).
INFO:neural_admixture.entry:[CHANGELOG] Convergence check changed so it is performed after 15 epochs in version 1.3.0 to improve training stability.
INFO:neural_admixture.entry:[CHANGELOG] Default learning rate was changed to 1e-5 instead of 1e-4 in version 1.3.0 to improve training stability.
INFO:neural_admixture.src.utils:Reading data...
INFO:neural_admixture.src.snp_reader:Input format is BED.
Mapping files:   0%|          | 0/3 [00:00<?, ?it/s]
/home1/seetaraj/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/snp_reader.py:55: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead
  _, _, G = read_plink(str(Path(file).with_suffix("")))
/home1/seetaraj/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/snp_reader.py:55: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead
  _, _, G = read_plink(str(Path(file).with_suffix("")))
Mapping files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 23.31it/s]
WARNING:neural_admixture.src.snp_reader:Data contains missing values. Will perform mean-imputation.
Traceback (most recent call last):
  File "/home1/seetaraj/venv/nadmenv/bin/neural-admixture", line 8, in <module>
    sys.exit(main())
  File "/home1/seetaraj/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/entry.py", line 19, in main
    sys.exit(train.main(arg_list[2:]))
  File "/home1/seetaraj/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/train.py", line 139, in main
    trX, trY, valX, valY = utils.read_data(tr_file, val_file, tr_pops_f, val_pops_f, imputation=args.imputation)
  File "/home1/seetaraj/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/utils.py", line 126, in read_data
    tr_snps = snp_reader.read_data(tr_file, imputation)
  File "/home1/seetaraj/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/snp_reader.py", line 124, in read_data
    assert int(G.min().compute()) == 0 and int(G.max().compute()) == 1, 'Only biallelic SNPs are supported. Please make sure multiallelic sites have been removed.'
AssertionError: Only biallelic SNPs are supported. Please make sure multiallelic sites have been removed.

Please let me know what I can do to troubleshoot. Thank you!
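
One check that may be worth running: as far as I know, a PLINK 1 binary fileset stores exactly two alleles per variant, and `--biallelic-only` acts on VCF import, so running it with `--bfile` would not be expected to drop anything (consistent with 78750 variants before and after). A multiallelic site that was split upstream would instead appear as several records at the same position. Below is a minimal sketch for spotting those in the `.bim`, assuming the standard six-column layout. `duplicated_positions` is a hypothetical helper, and the demo rows are made up purely to show the call.

```python
import os
import tempfile
from collections import Counter


def duplicated_positions(bim_path):
    """Return {(chrom, bp): count} for positions listed more than once in a .bim file."""
    counts = Counter()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            chrom, _variant_id, _cm, bp = fields[:4]
            counts[(chrom, bp)] += 1
    return {pos: n for pos, n in counts.items() if n > 1}


# Made-up .bim rows (hypothetical data); point the function at the real .bim instead.
demo_lines = [
    "1 rs1 0 1000 A G",
    "1 rs2 0 2000 C T",
    "1 rs2b 0 2000 C A",
]
with tempfile.NamedTemporaryFile("w", suffix=".bim", delete=False) as tmp:
    tmp.write("\n".join(demo_lines) + "\n")
demo_result = duplicated_positions(tmp.name)
os.remove(tmp.name)
print(demo_result)
```

An empty result on `biallelic_merged.bim` would suggest the assertion is not triggered by split multiallelic records, which would point back toward how the genotype values themselves are read.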

@seetarajpara
Author

I should add that the same input works fine with the original ADMIXTURE tool. I did have trouble getting that tool to read my pop file, which I can also explain if needed, but the error I'm seeing in neural-admixture seems to come from the BED input file.

@AlbertDominguez
Collaborator

Hello,

Thanks for your interest in our tool! Having checked the other issue you opened, this might be related to missing data that we don't handle properly, or to a bug in the external library we use to read BED files.

Would it be possible to share a very small subset (anonymized if needed, of course) so we can use it to debug the issue? If so, feel free to drop me an email with a link to the data. Otherwise, I can send you some code snippets to run yourself and we can debug that way!
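
A possible starting point for that kind of debugging, sketched under assumptions: summarize which genotype codes actually occur and how much is missing, since the failing assertion compares the matrix minimum and maximum against 0 and 1. The traceback shows the BED being loaded through `read_plink`; the sketch assumes that call yields 0/1/2 dosages with NaN for missing entries. `summarize_genotypes` is a hypothetical helper, not part of neural-admixture, and the demo values are invented.

```python
from collections import Counter


def summarize_genotypes(values):
    """Count observed genotype codes and missing entries in a flat iterable."""
    codes = Counter()
    missing = 0
    for v in values:
        # NaN is the only value that is not equal to itself.
        if v is None or v != v:
            missing += 1
        else:
            codes[float(v)] += 1
    total = missing + sum(codes.values())
    observed = sorted(codes)
    return {
        "codes": dict(codes),
        "missing_fraction": missing / total if total else 0.0,
        "min": observed[0] if observed else None,
        "max": observed[-1] if observed else None,
    }


# Invented demo values (hypothetical): two missing calls out of six.
nan = float("nan")
summary = summarize_genotypes([0.0, 1.0, 2.0, nan, nan, 1.0])
print(summary)
```

On real data one would presumably pass the flattened genotype matrix instead of the demo list. Given the reported genotyping rate of 0.146956, a large missing fraction is expected; any code outside 0/1/2, or an unexpected minimum or maximum, might help narrow down why the check trips.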

@AlbertDominguez AlbertDominguez self-assigned this Jul 24, 2024
@AlbertDominguez AlbertDominguez added the bug Something isn't working label Jul 24, 2024
@seetarajpara
Author

Yes I'd be happy to help with this!! Since I'll be working on genetic ancestry for much of my PhD project it would be good to learn more about this tool and how things work on a deeper level. If possible, would it be ok if I shared this info with you in a few days? I have another major deadline by Friday but I will have more bandwidth after this date. Please let me know your email, I will send you an FTP link as soon as I'm able to. Thank you so much!

@AlbertDominguez
Collaborator

Of course, no worries! Please send the e-mail to albert.dominguezmantes[at]epfl.ch. Thanks!
