
gemm acc issue on AVX512 and AVX #15447

Closed
tensor-tang opened this issue Jan 21, 2019 · 7 comments

tensor-tang commented Jan 21, 2019

code base:
#15448

how to reproduce

  1. cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=OFF -DWITH_MKL=ON -DWITH_TESTING=ON -DWITH_FLUID_ONLY=ON -DWITH_DOC=OFF -DWITH_MKLDNN=OFF -DWITH_CONTRIB=OFF
  2. make jit_kernel_test -j
  3. make test ARGS="-R jit_kernel_test -V"

This would pass.

But after deleting the line here and repeating steps 2 and 3, it fails like:

...
73: /home/tangjian/paddle-tj-docker/paddle/fluid/operators/jit/test.cc:42: Failure
73: The difference between target[i] and refer[i] is 5.340576171875e-05, which exceeds FLAGS_acc, where
73: target[i] evaluates to -55.43963623046875,
73: refer[i] evaluates to -55.439689636230469, and
73: FLAGS_acc evaluates to 1.0000000000000001e-05.
73: /home/tangjian/paddle-tj-docker/paddle/fluid/operators/jit/test.cc:42: Failure
73: The difference between target[i] and refer[i] is 8.392333984375e-05, which exceeds FLAGS_acc, where
73: target[i] evaluates to 93.255752563476562,
73: refer[i] evaluates to 93.255836486816406, and
73: FLAGS_acc evaluates to 1.0000000000000001e-05.
73: /home/tangjian/paddle-tj-docker/paddle/fluid/operators/jit/test.cc:42: Failure
73: The difference between target[i] and refer[i] is 6.103515625e-05, which exceeds FLAGS_acc, where
73: target[i] evaluates to -67.680465698242188,
73: refer[i] evaluates to -67.680526733398438, and
73: FLAGS_acc evaluates to 1.0000000000000001e-05.
...

It failed on both 2620v2 (AVX) and 5117 (AVX512).

@tensor-tang tensor-tang changed the title gemm acc issue on AVX512 gemm acc issue on AVX512 and AVX Jan 21, 2019

tensor-tang commented Jan 21, 2019

This is a very urgent issue. Please help fix it with high priority, @jianhang-liu.
Thanks.


luotao1 commented Jan 21, 2019

Could you try #15450? Maybe it is the same problem as #15032 (comment).

@tensor-tang

This has nothing to do with the scope cache; it should actually be an independent MKL issue.

I made a standalone test for this issue:

https://github.com/tensor-tang/benchmark/tree/master/gemm


tensor-tang commented Jan 21, 2019

It passes on 6148 when trying

export MKL_CBWR=AVX

But this will slow things down.

And it still fails on 2620v2, even with:

export MKL_CBWR=AVX
export KMP_DETERMINISTIC_REDUCTION=yes
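For completeness, a sketch of how the two settings combine with the reproduction steps (the make invocation repeats step 3 above):

```shell
# Force MKL onto AVX code paths and pin the OpenMP reduction order,
# then rerun only the failing jit kernel test.
export MKL_CBWR=AVX
export KMP_DETERMINISTIC_REDUCTION=yes
make test ARGS="-R jit_kernel_test -V"
```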

@tensor-tang
Copy link
Contributor Author

Conclusion

For the same MKL version:

Across systems with different instruction sets, MKL's own results already differ by some margin (possibly more than 1e-5). The differences can be aligned with:

export MKL_CBWR=AVX/COMPATIBLE
export KMP_DETERMINISTIC_REDUCTION=yes

The former forces MKL to run with the specified instruction set, which guarantees results identical to a system with that instruction set, but loses some performance because of the instruction-set switch. It also only takes effect when memory is aligned, and the number of MKL threads must be identical.

For different MKL versions:

There is an inherent risk that results cannot be aligned.

@jianhang-liu

For run-by-run numeric reproducibility, MKL_CBWR=AVX/COMPATIBLE should not be needed (that forces AVX or even SSE instructions); it should be set to AUTO, letting MKL choose the instruction set. That is the recommended setting, right?
For processor-by-processor numeric reproducibility, MKL_CBWR must be set to a specific instruction set (AUTO cannot be used).
Version-by-version numeric reproducibility is not achievable.

@tensor-tang

Yes, broken down further it works out exactly like that. Thanks for the supplement.
