Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CodeStyle][NPU] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) #44988

Merged
merged 14 commits into from
Aug 17, 2022

Conversation

SigureMo
Copy link
Member

@SigureMo SigureMo commented Aug 8, 2022

PR types

Others

PR changes

Others

Describe

使用 np.testing.assert_allclose(...) 替换 self.assertTrue(np.allclose(...)),以优化单测报错信息

Changes

  • self.assertTrue(np.allclose(...)) -> np.testing.assert_allclose(...) autofix(自动跳过因为 shape 问题失败的 case)
  • 等价情况 self.assertEqual(np.allclose(...), True) -> np.testing.assert_allclose(...) autofix
  • 等价情况 self.assertTrue(numpy.allclose(...)) -> np.testing.assert_allclose(...) autofix
  • 部分原来代码里的错误使用 self.assertTrue(np.allclose(...), True) 移除 True autofix
  • 等价情况 self.assertTrue(np.isclose(...).all()) -> np.testing.assert_allclose(...) autofix
  • 语法树替换后有些 1e-4 变成 0.0001 了,明显 1e-4 更好一些(包含比 1e-4 小的,比 1e-4 大的暂不替换)
  • 清理冗余的 err_msg,部分 err_msg 只是打印了 a 和 b,np.testing.allclose 本就会打印它们,因此可以清理
  • 部分在 rtol=0.5 的情况下 CI 仍然不过时,手动对这些测试增加 atol=1e-8(仍然是保持默认行为),未设过 1e-8 之外的数值
  • 少许没能成功跳过的 shape error,如果确定是静态图返回值问题直接加 [0],如果暂时不确定是不是静态图返回值问题则暂时回退修改,留到下个 PR 处理

目前直接替换后发现大量测试无法通过(375),这基本是没有办法手动一个个修改的……问题主要在以下几个方面:

  • 精度误差(本 PR 将集中解决本问题)

    这是由于两者默认值不同

    • np.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
    • np.testing.assert_allclose(actual, desired, rtol=1e-07, atol=0, equal_nan=True, err_msg='', verbose=True)

    明显 np.testing.assert_allclose 精度要求更高,所以这里暂时 np.allclosertol(即设为默认值)时将值设为 1e-5,使其与原来行为一致,atol 暂时不改看看情况

    修改后测试失败大大降低(375 -> 195),基本降了一半,余量已经很少了,经检查,余量大多都是些误差非常小的(1e-10 以下),针对这些手动加上 atol=1e-8 应可解决即可解决

  • shape 不对齐

    本 PR 暂时不关注本问题,将在下一个 PR 详细描述~

    这是由于 np.allclose 在比较时会自动 broadcast,而 np.testing.allclose 不会,因此需要手动对这些数据进行检查及修改

    修改精度问题后,本问题占了 90% 以上,也就是将近 200 个需要考虑本问题,逐个手动修复,需要进一步评估成本

    目前发现的问题:

    • 静态图返回是一个 list,但有的开发者直接将返回值进行比较,这会在比较时认为静态图的 shape 比预期值多一维度,如下面这个(找个问题好麻烦呀……),该问题占了很大一部分

      res = exe.run(feed={'X': self.x_np}, fetch_list=[out])
      out_ref = ref_softmax(self.x_np, self.axis)
      self.assertTrue(np.allclose(out_ref, res))

    • 误操作,这里的 expected_dx 不需要添加 [0]

      dx = fluid.dygraph.grad(out, [x, z], allow_unused=True)
      dout = np.ones_like(np_y)
      expected_dx = np.matmul(dout, np.transpose(np_y))
      self.assertTrue(np.allclose(dx[0].numpy(), expected_dx[0]))

  • 静态图代码报错(shape 不对齐问题导致测试不通过而无法执行后面的转为静态图模式的代码,进而后续静态图测试全部无法通过,本质上是问题二)

剩余大概 48 个文件是还有需要修改的(存在大量 shape 问题和可能存在少量精度问题和其他问题),将在下个 PR 进行修复。

@paddle-bot
Copy link

paddle-bot bot commented Aug 8, 2022

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added contributor External developers status: proposed labels Aug 8, 2022
@luotao1 luotao1 self-assigned this Aug 9, 2022
@luotao1 luotao1 added the PFCC Paddle Framework Contributor Club,https://github.com/PaddlePaddle/community/tree/master/pfcc label Aug 9, 2022
@luotao1
Copy link
Contributor

luotao1 commented Aug 9, 2022

静态图代码报错。应当可用测试代码开始时 paddle.enable_static(),测试代码结束时 paddle.disable_static() 包裹该段测试代码(还没测试)

可以贴一下哪些单测的静态图代码报错么?

@SigureMo SigureMo changed the title [WIP][CodeStyle] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) [WIP][CodeStyle] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) Aug 10, 2022
@SigureMo SigureMo closed this Aug 10, 2022
@paddle-bot
Copy link

paddle-bot bot commented Aug 10, 2022

很抱歉,经过我们的反复讨论,你的PR暂未达到合入标准,请阅读飞桨原生算子开发规范,你可以重新提交新的PR,我们先将此PR关闭,感谢你的贡献。
Sorry to inform you that through our discussion, your PR fails to meet the merging standard (Reference: Paddle Custom Operator Design Doc). You can also submit an new one. Thank you.

@SigureMo SigureMo reopened this Aug 10, 2022
@SigureMo SigureMo force-pushed the assert-allclose branch 2 times, most recently from ad8a787 to 179c9b6 Compare August 10, 2022 18:38
@SigureMo SigureMo changed the title [WIP][CodeStyle] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) [CodeStyle] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) Aug 10, 2022
@SigureMo SigureMo changed the title [CodeStyle] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) [WIP][CodeStyle] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) Aug 11, 2022
Yulv-git
Yulv-git previously approved these changes Aug 13, 2022
Copy link
Contributor

@Yulv-git Yulv-git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@luotao1
Copy link
Contributor

luotao1 commented Aug 15, 2022

@SigureMo @qili93 需要在IPU/XPU的机器上跑一下看看,跑完后回复这个PR

@luotao1 luotao1 mentioned this pull request Aug 15, 2022
@SigureMo
Copy link
Member Author

@SigureMo @qili93 需要在IPU/XPU的机器上跑一下看看,跑完后回复这个PR

好的,如果有问题我恢复相关修改

@qili93 qili93 changed the title [CodeStyle] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) [CodeStyle][NPU] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) Aug 15, 2022
@qili93
Copy link
Contributor

qili93 commented Aug 16, 2022

@SigureMo

由于对于XPU相关修改比较多,和昆仑同学确认,请提交一条以“xxxx, test=kunlun” 结尾的的commit msg,会自动触发昆仑流水线进行单测检查。

@qili93
Copy link
Contributor

qili93 commented Aug 16, 2022

CI passed for MLU and IPU.

@SigureMo SigureMo dismissed stale reviews from Yulv-git and luotao1 via cb0e196 August 16, 2022 17:55
from paddle._C_ops import final_state_nearest_interp
from paddle import _C_ops
Copy link
Member Author

@SigureMo SigureMo Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是因为原来 NPU 下单测 test_nearest_interp_v2_op_npu 在从 test_nearest_interp_v2_op import numpy 的参考实现(nearest_neighbor_interp_np)时候会报错,报错如下:

2022-08-17 00:32:35 1/1 Test #93: test_nearest_interp_v2_op_npu ....***Failed    3.95 sec
2022-08-17 00:32:35 W0817 00:32:32.922722   813 init.cc:288] AVX is available, Please re-compile on local machine
2022-08-17 00:32:35 Hint: Your machine support AVX, but the installed paddlepaddle doesn't have avx core. Hence, no-avx core with worse preformance will be imported.
2022-08-17 00:32:35 If you like, you could reinstall paddlepaddle by 'python -m pip install --force-reinstall paddlepaddle-gpu[==version]' to get better performance.
2022-08-17 00:32:35 The original error is: cannot import name 'core_avx' from 'paddle.fluid' (/workspace/npu_dev/Paddle/build/python/paddle/fluid/__init__.py)
2022-08-17 00:32:35 /workspace/npu_dev/Paddle/build/python/paddle/fluid/framework.py:189: UserWarning: We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently. 
2022-08-17 00:32:35   "We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently. "
2022-08-17 00:32:35 Traceback (most recent call last):
2022-08-17 00:32:35   File "/workspace/npu_dev/Paddle/tools/test_runner.py", line 69, in <module>
2022-08-17 00:32:35     main()
2022-08-17 00:32:35   File "/workspace/npu_dev/Paddle/tools/test_runner.py", line 51, in main
2022-08-17 00:32:35     module = importlib.import_module(module_name)
2022-08-17 00:32:35   File "/opt/conda/lib/python3.7/importlib/__init__.py", line 127, in import_module
2022-08-17 00:32:35     return _bootstrap._gcd_import(name[level:], package, level)
2022-08-17 00:32:35   File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
2022-08-17 00:32:35   File "<frozen importlib._bootstrap>", line 983, in _find_and_load
2022-08-17 00:32:35   File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
2022-08-17 00:32:35   File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
2022-08-17 00:32:35   File "<frozen importlib._bootstrap_external>", line 728, in exec_module
2022-08-17 00:32:35   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
2022-08-17 00:32:35   File "/workspace/npu_dev/Paddle/build/python/paddle/fluid/tests/unittests/npu/test_nearest_interp_v2_op_npu.py", line 29, in <module>
2022-08-17 00:32:35     from test_nearest_interp_v2_op import nearest_neighbor_interp_np
2022-08-17 00:32:35   File "../test_nearest_interp_v2_op.py", line 25, in <module>
2022-08-17 00:32:35     from paddle._C_ops import final_state_nearest_interp
2022-08-17 00:32:35 ImportError: cannot import name 'final_state_nearest_interp' from 'paddle.fluid.core_noavx.ops' (/workspace/npu_dev/Paddle/build/python/paddle/_C_ops.py)
2022-08-17 00:32:35 Errors while running CTest
2022-08-17 00:32:35 0% tests passed, 1 tests failed out of 1
2022-08-17 00:32:35 Total Test time (real) =   3.96 sec
2022-08-17 00:32:35 The following tests FAILED:
2022-08-17 00:32:35 	 93 - test_nearest_interp_v2_op_npu (Failed)

看起来是刚刚 merge 的 #45148 新增的这一行(这里的 25 行)直接从 _C_ops import 了 final_state_nearest_interp,但在 NPU 下貌似是访问不到的。为了避免这个问题,修改为该文件全局仅 import 到 _C_ops,修改后 NPU 测试全部通过。

@SigureMo
Copy link
Member Author

由于对于XPU相关修改比较多,和昆仑同学确认,请提交一条以“xxxx, test=kunlun” 结尾的的commit msg,会自动触发昆仑流水线进行单测检查。

最新一次 commit 包含了 test=kunlun,目前除了两个非 Required 其余全部都通过了~

Copy link
Contributor

@qili93 qili93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributor External developers PFCC Paddle Framework Contributor Club,https://github.com/PaddlePaddle/community/tree/master/pfcc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants