
Checkpoint size question #346

Open
DUKaige opened this issue Sep 15, 2020 · 1 comment
Comments


DUKaige commented Sep 15, 2020

I set up a model dominated by an embedding table, set ps_memory_m=64G, and tuned the model size so that placement exactly fills the memory (any larger and it reports a "cannot placement" error). I then took a checkpoint with the saver, but found that each server's checkpoint is only about 20G. Why is the checkpoint so much smaller than ps_memory_m? Which parts of XDL use a lot of memory but do not need to be written into the checkpoint?

deerluffy commented

I'm not sure, but as I recall, sparse embeddings allocate extra memory at initialization time, while only the entries that actually hold values are saved when taking a checkpoint, so the PS memory footprint and the checkpoint size will necessarily differ.
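The gap can be illustrated with a toy sketch (this is not XDL's actual implementation; the capacity, dimension, and serialization format here are all made up for illustration): a parameter server may preallocate a dense backing store sized for worst-case capacity, while the checkpoint only serializes the (feature id → vector) pairs that were actually touched during training.

```python
import pickle
import random

DIM = 8               # embedding dimension (illustrative)
CAPACITY = 1_000_000  # slots preallocated at initialization (illustrative)

# The server reserves memory for the full capacity up front,
# e.g. CAPACITY * DIM fp32 values -> CAPACITY * DIM * 4 bytes.
preallocated_bytes = CAPACITY * DIM * 4

# Only the sparse feature IDs actually seen during training get values.
touched = {fid: [random.random() for _ in range(DIM)]
           for fid in random.sample(range(10**9), k=1000)}

# A checkpoint serializes just the touched entries.
checkpoint_bytes = len(pickle.dumps(touched))

print(preallocated_bytes, checkpoint_bytes)
```

Here the preallocated store is ~32 MB while the serialized entries are a few hundred KB, which is the same shape of mismatch as 64G of ps_memory_m vs a ~20G checkpoint.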
