Insane memory usage #32

Open · BEFH opened this issue Jun 14, 2022 · 6 comments

BEFH commented Jun 14, 2022

I have just had a gnomix run die after attempting to use more than 1.4 TB of memory. Yes, terabytes. These are unimputed GSA microarray data phased with Eagle. I am fitting the model myself using the suggested microarray configuration, and for now I am only calculating local ancestry on chromosome 17. Based on the reference overlap, even after generating the local model, it looks like I will either need to filter the reference before model generation or impute the data.
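
The filtering I have in mind is just a site-level intersection of the reference with the array positions, roughly along these lines (a minimal sketch only, not something gnomix does itself; it assumes bcftools is installed, the reference VCF is indexed, and all file names are placeholders):

```python
# Sketch: restrict the reference panel to the sites present in the unimputed
# array query so the model only trains on positions it will actually see.
# Assumes bcftools is on PATH and the reference VCF is tabix-indexed;
# all file names are placeholders.
import subprocess

query = "query_chr17.phased.vcf.gz"    # GSA array data, phased (placeholder)
reference = "reference_chr17.vcf.gz"   # reference panel (placeholder)

# 1) Write the query positions to a CHROM/POS file.
with open("query_sites.txt", "w") as out:
    subprocess.run(
        ["bcftools", "query", "-f", "%CHROM\t%POS\n", query],
        stdout=out, check=True,
    )

# 2) Keep only those positions in the reference before training.
subprocess.run(
    ["bcftools", "view", "-R", "query_sites.txt",
     "-Oz", "-o", "reference_chr17.filtered.vcf.gz", reference],
    check=True,
)
subprocess.run(
    ["bcftools", "index", "-t", "reference_chr17.filtered.vcf.gz"],
    check=True,
)
```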

I suspect the issue is partly sample size: I have 31,705 samples in that cohort. I am also running it on a GDA cohort (10,859 samples) and another cohort of 13 samples, and it did not die on the smallest cohort. I have a couple of questions on how to optimize this:

Firstly, it appears that model generation only uses the reference dataset and not the samples to which the model will be applied. I wrote a script to compare the models generated with different target datasets, and they appear to be identical. Is that the case? I ran without calibration; is that also the case with calibration?
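
The comparison script was nothing fancy, roughly something like this (a sketch; it assumes each trained model is saved as a single Python pickle, and the paths are placeholders):

```python
# Sketch: check whether two trained models are effectively identical by hashing
# the serialized files and, failing that, comparing the unpickled objects.
# Assumes each model is a single pickle file; paths are placeholders.
import hashlib
import pickle

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

path_a = "models/model_chm17_runA.pkl"  # placeholder
path_b = "models/model_chm17_runB.pkl"  # placeholder

if sha256(path_a) == sha256(path_b):
    print("models are byte-identical")
else:
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = pickle.load(fa), pickle.load(fb)
    print("same type:      ", type(a) is type(b))
    print("same attributes:", sorted(vars(a)) == sorted(vars(b)))
```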

Secondly, is there any problem with first generating the models and then applying them to all of the different datasets? Do you have a recommendation for a minimal dataset to use for model generation so that it runs as fast as possible?

@guidebortoli

Hi @BEFH,
Have you been able to work around this?
I am having the same problem running 888 samples and training my own model with 32 reference samples.
The memory usage when it crashes is around 132 GB.

weekend37 self-assigned this Jul 12, 2022
weekend37 (Collaborator) commented Jul 12, 2022

Hey @BEFH, sounds like you have a lot of data on your hands. Nice!

Are you attempting to use one of our trained models or are you training your own?

@guidebortoli

@weekend37 hi,
Do you know how I can circumvent this issue?
Thanks

BEFH (Author) commented Jul 12, 2022

@weekend37, I'm training my own. It's microarray data, so I need to filter the reference to get good overlap. Doing that seems to help, along with splitting the target dataset into multiple sets of samples. I just want to be sure that doing this is not hurting the quality of the reference or the reliability of the accuracy estimates.
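
For the splitting, I batch the query samples and run inference per batch, roughly like this (a sketch; bcftools is assumed to be on PATH, and the file names and batch size are placeholders):

```python
# Sketch: split the query VCF into fixed-size sample batches so each gnomix
# inference run sees a smaller file. Assumes bcftools is on PATH;
# file names and the batch size are placeholders.
import subprocess

query = "query_chr17.phased.vcf.gz"  # placeholder
batch_size = 2000                    # tune to available memory

samples = subprocess.run(
    ["bcftools", "query", "-l", query],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for n, start in enumerate(range(0, len(samples), batch_size)):
    list_file = f"samples_batch{n:03d}.txt"
    with open(list_file, "w") as f:
        f.write("\n".join(samples[start:start + batch_size]) + "\n")
    subprocess.run(
        ["bcftools", "view", "-S", list_file, "-Oz",
         "-o", f"query_chr17.batch{n:03d}.vcf.gz", query],
        check=True,
    )
```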

@arvind0422 (Collaborator)

Hi @BEFH, would you be able to share the config.yaml file you are using (if you are using a custom config)? If not, please let us know which default config you are using. We can explore some of the config options that lower your memory requirements.
Best,
Arvind

@guidebortoli

For me, I ended up pruning my data from 1.5M markers down to 300k, and it worked.
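
One way to do that kind of pruning is roughly the following (a sketch, not my exact commands; it assumes plink 1.9 and bcftools are installed, the VCF has unique variant IDs, and the file names and thresholds are placeholders):

```python
# Sketch: LD-prune the marker set with plink, then subset the original phased
# VCF by variant ID with bcftools so phasing is preserved.
# Assumes plink 1.9 and bcftools on PATH and unique variant IDs;
# file names and pruning thresholds are placeholders.
import subprocess

vcf = "query_phased.vcf.gz"  # placeholder

# 1) Compute an LD-pruned list of variant IDs (writes pruned.prune.in).
subprocess.run(
    ["plink", "--vcf", vcf, "--double-id",
     "--indep-pairwise", "50", "5", "0.2",
     "--out", "pruned"],
    check=True,
)

# 2) Keep only those IDs in the original phased VCF.
subprocess.run(
    ["bcftools", "view", "-i", "ID=@pruned.prune.in",
     "-Oz", "-o", "query_phased.pruned.vcf.gz", vcf],
    check=True,
)
```

Subsetting the original VCF by ID, rather than re-exporting the genotypes from plink, keeps the phasing intact.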
