Question about the DAML model #22

Open
yinzhiqiangluvlzx opened this issue Dec 22, 2020 · 5 comments

@yinzhiqiangluvlzx

Training the DAML model works fine, but loading the checkpoint for testing raises an error:
Traceback (most recent call last):
File "", line 1, in
File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 210, in
fire.Fire()
File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 468, in _Fire
target=component.__name__)
File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 155, in test
model.load(opt.pth_path)
File "G:\yzq\Rec\Neu-Review-Rec\framework\models.py", line 49, in load
self.load_state_dict(torch.load(path),False)
File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model:
size mismatch for predict_net.model.fm_V: copying a param with shape torch.Size([16, 128]) from checkpoint, the shape in current model is torch.Size([128, 10]).
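When running the model I only changed fea=2. Training took two days, and I did not change anything for testing. I searched for this error and could not find anything. Has the author run into this before? Thank you!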

@FKCHAN

FKCHAN commented Dec 22, 2020

--train :
python3 main.py train --dataset=Patio_Lawn_and_Garden_data --model=DAML --num_fea=1 --output=fm

--error
euclidean = (user_local_fea - item_local_fea.permute(0, 1, 3, 2)).pow(2).sum(1).sqrt()
RuntimeError: CUDA out of memory. Tried to allocate 5.94 GiB (GPU 0; 10.76 GiB total capacity; 6.17 GiB already allocated; 3.82 GiB free; 6.18 GiB reserved in total by PyTorch)

Please tell me where the error is.
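The failing line subtracts two tensors that broadcast against each other, so the intermediate result can be far larger than either input. A rough sketch of that size estimate follows. It uses plain shape tuples rather than real tensors, and the batch size, filter count, and document length are hypothetical values chosen for illustration, not the repository's actual settings.

```python
from math import prod


def broadcast_shape(a, b):
    """Return the shape produced by broadcasting two equal-rank shapes."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles shapes of equal rank")
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {a} with {b}")
        # A dimension of size 1 stretches to match the other operand.
        out.append(y if x == 1 else x)
    return tuple(out)


def tensor_gib(shape, bytes_per_element=4):
    """Size in GiB of a dense tensor (4 bytes per element = float32)."""
    return prod(shape) * bytes_per_element / 2**30


# Hypothetical shapes: (batch, filters, doc_len, 1) minus (batch, filters, 1, doc_len).
batch, filters, doc_len = 128, 100, 500
user_shape = (batch, filters, doc_len, 1)
item_shape_permuted = (batch, filters, 1, doc_len)

diff_shape = broadcast_shape(user_shape, item_shape_permuted)
print(diff_shape, f"{tensor_gib(diff_shape):.2f} GiB")
```

Under these assumptions the intermediate grows linearly with the batch size and quadratically with the document length, so lowering `--batch_size` is one way to shrink the allocation.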

@ShomyLiu
Owner

@yinzhiqiangluvlzx Hello, I just tested it and there was no problem. My training command:

python3 main.py train --model=DAML --num_fea=2 --batch_size=16

The test command is:

python3 main.py test --model=DAML --num_fea=2 --batch_size=16 --pth_path='./checkpoints/DAML_Digital_Music_data_default.pth'

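Judging from the error message, some parameters on your side were probably not updated, which makes training and testing inconsistent.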
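One way to locate such an inconsistency is to compare parameter shapes in the checkpoint against those of the freshly built model before calling `load_state_dict`. Note that passing `strict=False` (the `False` argument in the traceback) only tolerates missing or unexpected keys; a size mismatch still raises. The helper below is a hypothetical sketch, not part of the repository. It works on plain name-to-shape dictionaries, which with PyTorch could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Only compare parameters that exist on both sides.
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes copied from the error message in this issue.
checkpoint_shapes = {"predict_net.model.fm_V": (16, 128)}
model_shapes = {"predict_net.model.fm_V": (128, 10)}

for name, (saved, current) in find_shape_mismatches(checkpoint_shapes, model_shapes).items():
    print(f"{name}: checkpoint {saved} vs current model {current}")
```

Any parameter reported here points at an option that differed between the training run and the test run.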

@FKCHAN

FKCHAN commented Dec 23, 2020 via email

@ShomyLiu
Owner

@FKCHAN As already mentioned in that issue, saving a multi-GPU model differs slightly from saving a single-GPU one: https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-torch-nn-dataparallel-models
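The linked tutorial recommends saving `model.module.state_dict()` when the model is wrapped in `torch.nn.DataParallel`, because the wrapper prefixes every parameter name with `module.`. For a checkpoint that was already saved from the wrapped model, a minimal sketch of stripping that prefix follows. The helper name is hypothetical, and the example uses placeholder values instead of real tensors.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DataParallel prefix removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Placeholder values stand in for tensors in this illustration.
saved_from_data_parallel = {
    "module.predict_net.model.fm_V": "tensor-placeholder",
    "module.net.user_net.weight": "tensor-placeholder",
}

single_gpu_state = strip_module_prefix(saved_from_data_parallel)
print(sorted(single_gpu_state))
```

The resulting dictionary can then be passed to `load_state_dict` on an unwrapped model.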

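The plan going forward is to wrap the model with pytorch-lightning so that parallel training is supported better and more simply. I expect to do this before the Spring Festival.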

@FKCHAN

FKCHAN commented Dec 23, 2020 via email
