
Add half precision float16 data type #5716

Merged: 19 commits merged into PaddlePaddle:develop on Dec 6, 2017

Conversation
Conversation

kexinzhao (Contributor): No description provided.

@kexinzhao kexinzhao changed the title Add half precision float16 data type [WIP] Add half precision float16 data type Nov 17, 2017
@kexinzhao kexinzhao changed the title [WIP] Add half precision float16 data type Add half precision float16 data type Nov 20, 2017

#include <cstdint>

#include <cuda.h>
Contributor:

Need #ifdef PADDLE_WITH_CUDA
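
For illustration, the suggested guard would look roughly like this (a sketch only; PADDLE_WITH_CUDA is the macro named above, and its exact placement in the header is an assumption):

#include <cstdint>

#ifdef PADDLE_WITH_CUDA
#include <cuda.h>  // only pull in CUDA headers for CUDA-enabled builds
#endif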

Contributor Author:

Thanks! Will fix.


namespace fp16_impl {
// Convert from float to half precision in round-to-nearest-even mode
PADDLE_HOSTDEVICE inline float16 float_to_half_rn(float f);
Contributor:

Maybe float_to_half_rn would be better as a member of class float16.

Contributor Author:

Done.

// float16_t is an alias for __fp16 in arm_fp16.h,
// which is included in arm_neon.h.
PADDLE_HOSTDEVICE inline float16(const float16_t& h) {
float16_t tmp = h;
Contributor:

Can this assignment statement be removed?

Contributor Author:

Yes, will fix.


PADDLE_HOSTDEVICE inline explicit float16(bool b) : x(b ? 0x3c00 : 0) {}

PADDLE_HOSTDEVICE inline explicit float16(int8_t val) {
Contributor:

Lines 125-173 could be simplified with a template.
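
For illustration, a single templated constructor along these lines could replace the per-type integer constructors (a hypothetical sketch; it requires <type_traits> and assumes float16 already has a constructor taking float that performs the actual conversion):

// Hypothetical replacement for the integer-type constructors.
template <typename T,
          typename std::enable_if<std::is_integral<T>::value, int>::type = 0>
PADDLE_HOSTDEVICE inline explicit float16(T val)
    : float16(static_cast<float>(val)) {}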

Contributor Author:

Will fix.

#endif

#ifdef PADDLE_ARM
#ifdef __F16C__
Contributor:

Maybe lines 70-72 can be removed. The ARM environment does not seem to define the __F16C__ macro.

Contributor Author:

Thanks! Will fix.

return *reinterpret_cast<float16*>(&tmp);

#elif defined(PADDLE_NEON_64)
float16 res;
Contributor:

Can we use vcvt_f16_f32 and vget_lane_f16? I think this would avoid writing two separate pieces of code for NEON_64 and NEON_32.
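
For illustration, the suggested intrinsic-based conversion could take roughly this shape (a sketch assuming <arm_neon.h> with fp16 storage support; the helper name is hypothetical):

#include <arm_neon.h>

inline float16_t float_to_half_neon(float f) {
  float32x4_t v = vdupq_n_f32(f);   // broadcast the float into all four lanes
  float16x4_t h = vcvt_f16_f32(v);  // convert four floats to four halfs
  return vget_lane_f16(h, 0);       // extract lane 0 as a scalar half
}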

Contributor Author:

Great point! Will fix.


// On ARMv8.2-A CPU
#elif defined(PADDLE_NEON) && defined(PADDLE_ARM_FP16) && \
(PADDLE_GNUC_VER >= 71 || PADDLE_CLANG_VER >= 39)
Contributor:

use of undeclared identifier 'vaddh_f16'

I did not find arm_fp16.h in android-ndk-r15c, whose clang compiler is version 5.0. Did I miss something?

Contributor Author:

Currently clang does not support the ARMv8.2 float16 NEON intrinsics (the in-development clang 6.0 plans to add this support). So I added assembly code for the float16 arithmetic operators on the ARMv8.2 architecture, which should work for both gcc and clang.
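
For reference, an assembly-based half add on ARMv8.2-A has this general shape (a simplified sketch, not the exact code in this PR; register choices and constraints are illustrative):

inline float16 operator+(const float16& a, const float16& b) {
  float16 res;
  asm volatile(
      "ld1 {v0.h}[0], [%[a]]\n"    // load a into lane 0 of v0
      "ld1 {v1.h}[0], [%[b]]\n"    // load b into lane 0 of v1
      "fadd h0, h0, h1\n"          // scalar fp16 add (ARMv8.2-A FP16 extension)
      "st1 {v0.h}[0], [%[res]]\n"  // store the result
      :
      : [a] "r"(&a), [b] "r"(&b), [res] "r"(&res)
      : "memory", "v0", "v1");
  return res;
}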


PADDLE_HOSTDEVICE inline float16(const Eigen::half& h) : x(h.x) {}

#if defined(PADDLE_NEON) && defined(PADDLE_ARM_FP16) && \
Contributor:

Where is PADDLE_ARM_FP16 defined?

Contributor Author:

PADDLE_ARM_FP16 is not defined here. It is intended to be defined by the build system when it detects that the current CPU is ARMv8.2-A (please check this comment: #4853 (comment)).

In that comment, the ARM Compute Library uses SCons as its build tool and defines ARM_COMPUTE_ENABLE_FP16 when the ARMv8.2 architecture is found. I want cmake to do something similar for PADDLE_ARM_FP16, but I didn't find a way. @hedaoyuan Do you know how to do that?

Contributor (@hedaoyuan, Nov 23, 2017):

We can define -DPADDLE_ARM_FP16 in cmake when the architecture is specified as ARMv8.2, like this: https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/configure.cmake#L24

Contributor Author:

Done. I am using the code below to specify -DPADDLE_ARM_FP16:

if(WITH_ARM_FP16)
  add_definitions(-DPADDLE_ARM_FP16)
  add_definitions("-march=armv8.2-a+fp16+simd")
endif(WITH_ARM_FP16)

}

__host__ inline bool operator<(const float16& a, const float16& b) {
#ifdef PADDLE_NEON_64
Contributor:

Is PADDLE_NEON_64 still needed here? This code is already under the PADDLE_ARM_FP16 macro.

Contributor Author:

PADDLE_ARM_FP16 is CPU-architecture related (intended to be defined when the ARMv8.2-A architecture is found). PADDLE_NEON_64 is more about the execution state of ARMv8.2-A, because I believe an ARMv8.2-A CPU can run either in 32-bit state (when arm is defined) or 64-bit state (when aarch64 is defined). GCC provides different sets of ARM intrinsics for arm and aarch64. That's why I define PADDLE_NEON_64 here.
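
For illustration, such execution-state macros are typically derived from the compiler's architecture macros along these lines (a hypothetical sketch; the exact definitions in this PR may differ):

#if defined(__arm__)        // AArch32 execution state
#define PADDLE_NEON_32
#elif defined(__aarch64__)  // AArch64 execution state
#define PADDLE_NEON_64
#endif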

Contributor:

So, when I specify the architecture as ARMv8.2 (for those float16 instructions), can I still compile a 32-bit program?

Contributor Author:

I don't think so.

An ARMv8 CPU only runs in the 32-bit arm state when the operating system is 32-bit, which is the case for the Raspberry Pi 3 Model B.

I don't think anyone would run a 32-bit OS on an ARMv8.2 CPU, so I will delete PADDLE_NEON_64.

Contributor:

A 32-bit OS can run on an ARMv8.2 CPU. Also, a 32-bit program can run on a 64-bit OS (on an ARMv8.2 CPU). My point is that when you compile a program that uses the float16 instructions, it may only compile into a 64-bit program.

Contributor Author:

To reflect this, the current code assumes 64-bit compilation when PADDLE_ARM_FP16 is defined.


// Arithmetic operators
#if defined(PADDLE_CUDA_FP16) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
__device__ inline float16 operator+(const float16& a, const float16& b) {
Contributor:

Do we need to define these CUDA device operations? We can use the half type directly in CUDA kernels.
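
For example, a kernel using half directly with the CUDA intrinsics looks roughly like this (a sketch requiring CUDA >= 7.5 and compute capability >= 5.3; the kernel name and signature are hypothetical):

#include <cuda_fp16.h>

__global__ void add_half(const half* a, const half* b, half* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    c[i] = __hadd(a[i], b[i]);  // intrinsic half-precision addition
  }
}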

Contributor Author (@kexinzhao, Nov 22, 2017):

Please refer to https://github.com/PaddlePaddle/Paddle/pull/5851/files for the different levels of support that different CUDA versions provide for the half type.

By defining these CUDA device operations here, along with the implicit conversion operator between our float16 and half, we can run the following code on CUDA < 9.0 (tested on our nvidia-docker image with CUDA 8.0):

namespace paddle {
__global__ void dummy_kernel() {  // kernel name added only so the snippet is valid
  half a, b, c;
  // compiles for cuda >= 7.5 when defined inside the paddle namespace;
  // gives a compiler error if not put in the paddle namespace for cuda < 9.0
  c = a + b;
}
}  // namespace paddle

So these device operations make code that uses CUDA half arithmetic easy to write and compatible with any CUDA >= 7.5.

However, if we call c = a + b with a, b, and c all being of the half data type, it is much less efficient than c = __hadd(a, b) because of all the unnecessary conversions performed.

So I think we should instead add the following code in the paddle namespace (the add operation, for example):

__device__ inline half operator+(const half& a, const half& b) { 
    return __hadd(a, b); 
}

This way c = a + b works on GPU for any CUDA >= 7.5 (for CUDA 9.0, this paddle::operator+ will be preferred over the counterpart in the global namespace because of name hiding in nested scope).

What do you think @hedaoyuan ?

Contributor:

However, if we call c = a + b with a, b, and c all being of the half data type, it is much less efficient than c = __hadd(a, b) because of all the unnecessary conversions performed.

Yeah, implicit conversion is dangerous; the declaration of the conversion function needs to be marked explicit.

Contributor (@hedaoyuan, Nov 23, 2017):

So, for now, there are two ways to implement the FP16 kernels in CUDA:

  1. Use half and operators like c = __hadd(a, b) when CUDA < 9.0; such a kernel also works when CUDA >= 9.0.
  2. Use half and operators like c = a + b when CUDA >= 9.0.

For the first, an outside contributor may not be familiar with Paddle's type definitions. For the second, if we need kernels written for CUDA >= 9.0 to also work when CUDA < 9.0, I think the approach in your comment is better (define operator+ for the CUDA half type when CUDA < 9.0).

Contributor Author:

Done. Added half arithmetic operators for CUDA >= 7.5 and < 9.0.


#if defined(PADDLE_NEON) && defined(PADDLE_ARM_FP16) && \
(PADDLE_GNUC_VER >= 61 || PADDLE_CLANG_VER >= 34)
PADDLE_HOSTDEVICE inline float16& operator=(const float16_t& rhs) {
Collaborator:

I noticed that the following pattern

#if defined(PADDLE_NEON) && defined(PADDLE_ARM_FP16) && \
    (PADDLE_GNUC_VER >= 61 || PADDLE_CLANG_VER >= 34)

appears three times in this file. Should we define a new macro to improve readability?

#if defined(PADDLE_NEON) && defined(PADDLE_ARM_FP16) && \
    (PADDLE_GNUC_VER >= 61 || PADDLE_CLANG_VER >= 34)
#  define PADDLE_WITH_NATIVE_FP16 
#endif
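
The repeated guards could then be replaced by the new macro, e.g. (a hypothetical sketch; the operator body is only illustrative):

#ifdef PADDLE_WITH_NATIVE_FP16
PADDLE_HOSTDEVICE inline float16& operator=(const float16_t& rhs) {
  x = *reinterpret_cast<const uint16_t*>(&rhs);  // copy the raw fp16 bits
  return *this;
}
#endif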

Contributor Author:

Thanks! Will do.

@reyoung reyoung requested review from qingqing01 and removed request for reyoung November 28, 2017 04:12
#endif // __clang__

#ifdef __CUDACC__
#define PADDLE_HOSTDEVICE __host__ __device__
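
For context, this macro usually comes with a host-only fallback so the same code compiles with a plain C++ compiler (a sketch of the typical pattern; the rest of the file is not shown in this excerpt):

#ifdef __CUDACC__
#define PADDLE_HOSTDEVICE __host__ __device__
#else
#define PADDLE_HOSTDEVICE
#endif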
Collaborator:
Contributor Author:

Thanks! Done.

Contributor @hedaoyuan left a comment:

Awesome! 👍

@kexinzhao kexinzhao merged commit 1d1555e into PaddlePaddle:develop Dec 6, 2017
@kexinzhao kexinzhao deleted the float16 branch December 6, 2017 08:26
@gongweibao gongweibao added the AMP label Feb 10, 2020