Feature Request: Option to subset reference SNPs to query file positions before training #11

Open
weekend37 opened this issue Nov 6, 2021 · 1 comment

@weekend37
Collaborator

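Subsetting the reference SNPs to the positions present in the query file before training would improve model performance in most cases, and would avoid disastrous results when the query data has not been imputed.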
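As a rough illustration of the requested behavior, the sketch below keeps only the reference SNPs whose position also occurs in the query file. It is not gnomix code: the function name `subset_reference_to_query`, its arguments, and the toy data are hypothetical, and it assumes a single chromosome with SNPs matched by position only (a real implementation would probably also check alleles).

```python
def subset_reference_to_query(ref_positions, ref_genotypes, query_positions):
    """Keep reference SNPs whose position also appears in the query.

    ref_positions:   list of reference SNP positions (one chromosome)
    ref_genotypes:   list of per-SNP genotype rows, aligned with ref_positions
    query_positions: iterable of query SNP positions (same chromosome)
    """
    query_set = set(query_positions)  # O(1) membership tests
    keep = [i for i, pos in enumerate(ref_positions) if pos in query_set]
    sub_positions = [ref_positions[i] for i in keep]
    sub_genotypes = [ref_genotypes[i] for i in keep]
    return sub_positions, sub_genotypes


# Toy data (hypothetical): 5 reference SNPs, 2 haplotypes each.
ref_positions = [100, 200, 300, 400, 500]
ref_genotypes = [[0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
query_positions = [200, 400, 450]  # 450 is absent from the reference

sub_positions, sub_genotypes = subset_reference_to_query(
    ref_positions, ref_genotypes, query_positions
)
print(sub_positions)
print(sub_genotypes)
```

Training on the subset means every SNP the model sees is guaranteed to be present in the query, so nothing has to be filled in at inference time.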

weekend37 self-assigned this Nov 6, 2021
@candevrivera2021

Hi. I am running into this issue when training the model from scratch after filtering the SNPs in the query file, and I am not sure what to do.

After building the model, I get this error when running inference on the query samples.
Any advice is appreciated.

Launching inference...
Loading and processing query file...

  • Number of SNPs from model: 333651
  • Number of SNPs from file: 447132
  • Number of intersecting SNPs: 333651
  • Percentage of model SNPs covered by query file: 100.0%
  • Found 24392 (7.3106%) different reference variants. Adjusting...
Inferring ancestry on query data...
/users/cvergara/gnomix/src/utils.py:313: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
  df[data_samples[i]] = genotype_person
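
The message in the log is a pandas `PerformanceWarning`, which pandas raises as a warning rather than an error. It flags code that adds columns to a DataFrame one at a time. The sketch below shows the pattern the warning itself suggests: collect the columns first and join them in a single `pd.concat(axis=1)` call. The names `data_samples` and the per-sample genotype vectors mirror the line shown in the log; the function `build_genotype_frame` and the toy data are hypothetical, not the actual contents of gnomix's `utils.py`.

```python
import pandas as pd


def build_genotype_frame(df, data_samples, genotypes):
    """Append one column per sample in a single concat.

    This replaces a loop of `df[data_samples[i]] = genotype_person`,
    which fragments the frame when there are many samples.
    """
    new_cols = pd.DataFrame(
        {name: geno for name, geno in zip(data_samples, genotypes)},
        index=df.index,  # align the new columns with the existing rows
    )
    return pd.concat([df, new_cols], axis=1)


# Toy data (hypothetical): 3 SNPs, 2 samples.
base = pd.DataFrame({"pos": [100, 200, 300]})
data_samples = ["s1", "s2"]
genotypes = [[0, 1, 1], [1, 0, 0]]

out = build_genotype_frame(base, data_samples, genotypes)
print(out)
```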
