Insane memory usage #32

Open · BEFH opened this issue Jun 14, 2022 · 6 comments

BEFH commented Jun 14, 2022

I have just had a gnomix run die after attempting to use more than 1.4 TB of memory. Yes, terabytes. These are unimputed GSA microarray data phased with Eagle. I am fitting the model myself using the suggested microarray configuration, and for now I am only calculating local ancestry on chromosome 17. Based on the reference overlap, even after generating the local model, it looks like I will either need to filter the reference before model generation or impute the data.
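
The filtering I have in mind is just a site-level intersection of the reference with the array positions, roughly along these lines (a minimal sketch only, not something gnomix does itself; it assumes bcftools is installed, the reference VCF is indexed, and all file names are placeholders):

```python
# Sketch: restrict the reference panel to the sites present in the unimputed
# array query so the model only trains on positions it will actually see.
# Assumes bcftools is on PATH and the reference VCF is tabix-indexed;
# all file names are placeholders.
import subprocess

query = "query_chr17.phased.vcf.gz"    # GSA array data, phased (placeholder)
reference = "reference_chr17.vcf.gz"   # reference panel (placeholder)

# 1) Write the query positions to a CHROM/POS file.
with open("query_sites.txt", "w") as out:
    subprocess.run(
        ["bcftools", "query", "-f", "%CHROM\t%POS\n", query],
        stdout=out, check=True,
    )

# 2) Keep only those positions in the reference before training.
subprocess.run(
    ["bcftools", "view", "-R", "query_sites.txt",
     "-Oz", "-o", "reference_chr17.filtered.vcf.gz", reference],
    check=True,
)
subprocess.run(
    ["bcftools", "index", "-t", "reference_chr17.filtered.vcf.gz"],
    check=True,
)
```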

I suspect the issue is partly sample size: I have 31,705 samples in that cohort. I am also running it on a GDA cohort (10,859 samples) and another cohort of 13 samples, and it did not die on the smallest cohort. I have a couple of questions on how to optimize this:

Firstly, it appears that model generation only uses the reference dataset and not the samples to which the model will be applied. I wrote a script to compare the models generated with different target datasets, and they appear to be identical. Is that the case? I ran without calibration; is that also the case with calibration?
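
The comparison script was nothing fancy, roughly something like this (a sketch; it assumes each trained model is saved as a single Python pickle, and the paths are placeholders):

```python
# Sketch: check whether two trained models are effectively identical by hashing
# the serialized files and, failing that, comparing the unpickled objects.
# Assumes each model is a single pickle file; paths are placeholders.
import hashlib
import pickle

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

path_a = "models/model_chm17_runA.pkl"  # placeholder
path_b = "models/model_chm17_runB.pkl"  # placeholder

if sha256(path_a) == sha256(path_b):
    print("models are byte-identical")
else:
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = pickle.load(fa), pickle.load(fb)
    print("same type:      ", type(a) is type(b))
    print("same attributes:", sorted(vars(a)) == sorted(vars(b)))
```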

Secondly, is there any problem with first generating the models and then applying them to all of the different datasets? Do you have a recommendation for a minimal dataset to use for model generation so that it runs as fast as possible?

@guidebortoli

Hi @BEFH,
Have you been able to work around this?
I am having the same problem running 888 samples and training my own model with 32 reference samples.
The memory usage when it crashes is around 132 GB.

weekend37 self-assigned this Jul 12, 2022
weekend37 (Collaborator) commented Jul 12, 2022

Hey @BEFH, sounds like you have a lot of data on your hands. Nice!

Are you attempting to use one of our trained models or are you training your own?

@guidebortoli

@weekend37 hi,
Do you know how I can circumvent this issue?
Thanks

BEFH (Author) commented Jul 12, 2022

@weekend37, I'm training my own. It's microarray data, so I need to filter the reference to get good overlap. Doing that seems to help, along with splitting the target dataset into multiple sets of samples. I just want to be sure that doing this is not hurting the quality of the reference or the reliability of the accuracy estimates.
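
For the splitting, I batch the query samples and run inference per batch, roughly like this (a sketch; bcftools is assumed to be on PATH, and the file names and batch size are placeholders):

```python
# Sketch: split the query VCF into fixed-size sample batches so each gnomix
# inference run sees a smaller file. Assumes bcftools is on PATH;
# file names and the batch size are placeholders.
import subprocess

query = "query_chr17.phased.vcf.gz"  # placeholder
batch_size = 2000                    # tune to available memory

samples = subprocess.run(
    ["bcftools", "query", "-l", query],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for n, start in enumerate(range(0, len(samples), batch_size)):
    list_file = f"samples_batch{n:03d}.txt"
    with open(list_file, "w") as f:
        f.write("\n".join(samples[start:start + batch_size]) + "\n")
    subprocess.run(
        ["bcftools", "view", "-S", list_file, "-Oz",
         "-o", f"query_chr17.batch{n:03d}.vcf.gz", query],
        check=True,
    )
```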

@arvind0422 (Collaborator)

Hi @BEFH, would you be able to share the config.yaml file you are using (if you are using a custom config)? If not, please let us know which default config you are using. We can explore some of the config options that lower your memory requirements.
Best,
Arvind

@guidebortoli

For me, I ended up pruning my data from 1.5M markers down to 300k, and it worked.
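
One way to do that kind of pruning is roughly the following (a sketch, not my exact commands; it assumes plink 1.9 and bcftools are installed, the VCF has unique variant IDs, and the file names and thresholds are placeholders):

```python
# Sketch: LD-prune the marker set with plink, then subset the original phased
# VCF by variant ID with bcftools so phasing is preserved.
# Assumes plink 1.9 and bcftools on PATH and unique variant IDs;
# file names and pruning thresholds are placeholders.
import subprocess

vcf = "query_phased.vcf.gz"  # placeholder

# 1) Compute an LD-pruned list of variant IDs (writes pruned.prune.in).
subprocess.run(
    ["plink", "--vcf", vcf, "--double-id",
     "--indep-pairwise", "50", "5", "0.2",
     "--out", "pruned"],
    check=True,
)

# 2) Keep only those IDs in the original phased VCF.
subprocess.run(
    ["bcftools", "view", "-i", "ID=@pruned.prune.in",
     "-Oz", "-o", "query_phased.pruned.vcf.gz", vcf],
    check=True,
)
```

Subsetting the original VCF by ID, rather than re-exporting the genotypes from plink, keeps the phasing intact.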
