This repository has been archived by the owner on Jan 12, 2019. It is now read-only.

Error when build on arm64 with neon support? #11

Open
wuyiwuxin opened this issue Apr 7, 2015 · 8 comments

Comments

@wuyiwuxin

I want to build the dmz code for the arm64 platform with DMZ_HAS_NEON_COMPILETIME = 1, but the build fails. Does dmz support arm64 with NEON? Do you have any suggestions for building it on arm64?

@dgoldman-pdx
Member

@Vincent-Echo We successfully include an arm64/Neon target when we use this card.io-dmz repo to build the card.io-source repo, in order to create the card.io SDK.

Are you building for iOS or for another OS?

What error messages are you getting?

@wuyiwuxin
Author

I am building for iOS. In the file "processor_support.h", DMZ_HAS_NEON_COMPILETIME evaluates to 0 when building for arm64 on iOS:
#if IOS_DMZ
#ifdef _ARM_ARCH_7
#define DMZ_HAS_NEON_COMPILETIME 1
#else
#define DMZ_HAS_NEON_COMPILETIME 0
#endif
#elif CYTHON_DMZ
#define DMZ_HAS_NEON_COMPILETIME 0

This disables NEON support in dmz, so I changed the value to 1. I then get errors in conv.cpp saying that register 'r0' does not exist.
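For context, an error about 'r0' is consistent with 32-bit ARM inline assembly being built for arm64: AArch64 names its general-purpose registers x0-x30 (w0-w30 for the 32-bit views), so ARMv7 register names are rejected, and forcing the macro to 1 cannot work until that assembly is ported. Below is a minimal sketch of a compile-time check that tells the two cases apart. It assumes the compiler predefines the ACLE macro `__ARM_NEON` (or the older `__ARM_NEON__`) and `__aarch64__` / `__arm64__`. The `EXAMPLE_*` macros and `example_neon_flavor` are hypothetical names for illustration and are not part of dmz.

```c
/* Hypothetical macros for illustration; dmz itself uses DMZ_HAS_NEON_COMPILETIME. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#define EXAMPLE_HAS_NEON 1
#else
#define EXAMPLE_HAS_NEON 0
#endif

#if defined(__aarch64__) || defined(__arm64__)
#define EXAMPLE_IS_ARM64 1
#else
#define EXAMPLE_IS_ARM64 0
#endif

/* Reports which code path this translation unit could use. */
const char *example_neon_flavor(void) {
#if EXAMPLE_HAS_NEON && EXAMPLE_IS_ARM64
    return "neon-arm64"; /* needs AArch64 assembly or portable intrinsics */
#elif EXAMPLE_HAS_NEON
    return "neon-arm32"; /* ARMv7 register names such as r0 are valid here */
#else
    return "scalar";     /* no NEON available: plain C fallback */
#endif
}
```

With a split like this, the existing ARMv7 assembly could stay behind the "neon-arm32" branch while an arm64 port is developed separately.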

@dgoldman-pdx
Member

@Vincent-Echo my apologies -- you're quite right that our arm64 build sets DMZ_HAS_NEON_COMPILETIME to 0. I hadn't recalled that.

When we first updated our code to build for arm64 devices, the resulting library performed faster than the existing 32-bit versions of card.io. Therefore, apparently, we didn't even notice that our NEON support was being removed at compile time!

I know that the NEON instruction set did change with the move to the arm64 architecture, so it's not too surprising that our NEON code would need updating as well.

HOWEVER (and I suspect that this is probably the main reason that we did not notice any drop in performance with our 64-bit builds) the Clang compiler has gotten much smarter than it used to be regarding code vectorization.

I strongly suspect that Clang is now automatically generating appropriate vector-processor instructions on its own. That wasn't the case a few years ago, when we needed to explicitly use NEON intrinsics in our code to ensure that time-critical sections would be executed on the vector processor.

If you'd like to try to update our NEON code so that it builds successfully for both 32- and 64-bit architectures, that would be great - we'd love to review a Pull Request with such changes. But my guess is that Clang has now evolved to a point where this won't actually affect the performance of card.io.

@josharian
Member

I very much doubt that Clang's codegen has improved to the point that it will outpace our hand-tuned implementations. Automatic vectorization is very hard, and many of our uses are not the sort of thing that are obviously vectorizable. (The convolution code in particular is not obviously vectorizable, and it was the single slowest operation last time I checked.) I'd love to be wrong about this, of course.

A PR with ARM64 NEON implementations--particularly of the 7x7 sobel convolutions--would be awesome. Yes, processors are now fast enough that it isn't the limiting factor, but it'd still save users' battery life and enable us to do more expensive per-frame things later, say during expiry/name scanning.
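As a rough illustration of what a port that builds for both 32- and 64-bit targets might look like, here is a sketch of one 7-tap row pass written with `arm_neon.h` intrinsics instead of inline assembly, with a scalar fallback. This is not the dmz implementation: the float data type, the function name `example_convolve_row`, and the `EXAMPLE_*` macros are assumptions made for the example, and the real convolution code may use a different data layout entirely.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define EXAMPLE_USE_NEON 1
#else
#define EXAMPLE_USE_NEON 0
#endif

#define EXAMPLE_TAPS 7

/* out[i] = sum over k of kernel[k] * in[i + k], for i in [0, n_out).
   The caller must provide n_out + EXAMPLE_TAPS - 1 readable input samples. */
void example_convolve_row(const float *in, float *out, size_t n_out,
                          const float kernel[EXAMPLE_TAPS]) {
    assert(in != NULL && out != NULL && kernel != NULL);
    size_t i = 0;
#if EXAMPLE_USE_NEON
    /* Four outputs per iteration; these intrinsics are available for
       both ARMv7 NEON and AArch64, so no register names are hard-coded. */
    for (; i + 4 <= n_out; i += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int k = 0; k < EXAMPLE_TAPS; k++) {
            acc = vmlaq_n_f32(acc, vld1q_f32(in + i + k), kernel[k]);
        }
        vst1q_f32(out + i, acc);
    }
#endif
    /* Scalar path: the whole row without NEON, or the leftover tail with it. */
    for (; i < n_out; i++) {
        float sum = 0.0f;
        for (int k = 0; k < EXAMPLE_TAPS; k++) {
            sum += kernel[k] * in[i + k];
        }
        out[i] = sum;
    }
}
```

Intrinsics leave register allocation to the compiler, which is why the same source can target both architectures. Whether this matches the speed of hand-tuned assembly would need to be measured.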

We might also want to check how Eigen's ARM64 NEON support is coming along. That might also have a big impact on card.io performance.

@dgoldman-pdx
Member

I must defer to @josharian's much greater experience in this area!

I did just try the experiment of adding the -fno-vectorize and -fno-slp-vectorize flags when compiling the DMZ layer for card.io. My impression from then running some scans is that those flags (which disable Clang vectorization) did indeed slow things down a perceptible amount -- but that's an unscientific impression. In objective terms, e.g., the frame rate (iPhone 5S) stayed at or near 30 frames/second either way.

However, this experiment introduced me to the -fslp-vectorize-aggressive flag, which I will now add to card.io on general principle. Though I have to admit that adding that flag hasn't so far resulted in any performance change that I can perceive. Hmm... Well, possibly a slight speed-up on iPhone 4S. (Worst frame rate rises from a bit below 20 fps to a bit above, on a handful of test runs.)
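The flags above apply to a whole compilation. For a narrower comparison, Clang also accepts a per-loop pragma, so a single hot loop can be built with and without loop vectorization in the same binary. The sketch below assumes Clang's `#pragma clang loop` syntax; the macro and function names are hypothetical, and the pragma only affects the loop vectorizer, not SLP vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Expands to a Clang loop pragma; other compilers see nothing. */
#if defined(__clang__)
#define EXAMPLE_NO_LOOP_VECTORIZE \
    _Pragma("clang loop vectorize(disable) interleave(disable)")
#else
#define EXAMPLE_NO_LOOP_VECTORIZE
#endif

/* Baseline: the compiler is free to auto-vectorize this loop. */
void example_scale(const float *in, float *out, size_t n, float gain) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * gain;
    }
}

/* Same loop with loop vectorization switched off under Clang. */
void example_scale_novec(const float *in, float *out, size_t n, float gain) {
    assert(in != NULL && out != NULL);
    EXAMPLE_NO_LOOP_VECTORIZE
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * gain;
    }
}
```

Both functions compute the same result, so any timing difference between them can be attributed to the vectorizer rather than to a change in the algorithm.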

tl;dr: yes, a Pull Request that enables our NEON code for arm64 would be of great interest. 😺

@dgoldman-pdx
Member

> We might also want to check how Eigen's ARM64 NEON support is coming along. That might also have a big impact on card.io performance.

I did update our Eigen to the latest version, 3.2.4, a couple of months ago. Their website claims: "Explicit vectorization is performed for SSE 2/3/4, ARM NEON, and AltiVec instruction sets, with graceful fallback to non-vectorized code."

ETA: Hmm. Well, reviewing their various notes and statements, I haven't yet found an explicit statement re arm64 NEON.

@josharian
Member

@dgoldman-ebay to measure the perf impact of a change, things to try include:

  • Create a benchmark. We don't have any but it would be nice to fix that. I don't think Apple has built-in support for them, though, and getting tests to run on a device is a PITA.
  • Add a bit of ratcheting timing code -- i.e. local code that measures the fastest wall-clock time in which a bit of code has ever executed and logs whenever that number gets even lower. Ratchets are a good idea for tight, CPU-bound code, since non-minimum execution times are probably due to context switches and other noise.
  • Use Instruments to measure the relative CPU consumption of one function to another (only helpful when only one function has changed). Instruments' profiling was invaluable while developing and tuning the NEON code in the first place. If the function is slow enough to not have significant measurements in Instruments, we don't care anyway.
  • Ask clang to emit assembly code w/ and w/o the vectorization flags and run them through diff, to determine which functions (if any) the compiler did something different with, and what that was.
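The ratchet idea from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not code from card.io: the `example_*` names are hypothetical, and the clock used here is C11 `timespec_get`, which reads wall-clock time. On a device a monotonic clock would be preferable, since wall-clock time can be adjusted while a measurement is running.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Remembers the fastest run ever observed for one piece of code. */
typedef struct {
    const char *label;
    uint64_t best_ns; /* UINT64_MAX means "no measurement yet" */
} example_ratchet;

#define EXAMPLE_RATCHET_INIT(name) { (name), UINT64_MAX }

/* Current wall-clock time in nanoseconds (C11 timespec_get). */
uint64_t example_now_ns(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Records one measurement. Returns 1 and logs only when it beats the
   previous record; slower runs are treated as noise and ignored. */
int example_ratchet_update(example_ratchet *r, uint64_t elapsed_ns) {
    assert(r != NULL);
    if (elapsed_ns >= r->best_ns) {
        return 0;
    }
    r->best_ns = elapsed_ns;
    fprintf(stderr, "[ratchet] %s: new best %llu ns\n",
            r->label, (unsigned long long)elapsed_ns);
    return 1;
}
```

Typical use would be to keep one static `example_ratchet` per measured function, read `example_now_ns()` before and after the call, and pass the difference to `example_ratchet_update`. The log then only grows when a new minimum is reached, which keeps the output readable during long scanning sessions.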

As for Eigen, ARM64 NEON is different than ARM NEON, so if they don't mention it, they probably don't support it. Another great opportunity for a motivated hacker to make some contributions. :)

@dgoldman-pdx
Member

Thanks, @josharian!

> As for Eigen, ARM64 NEON is different than ARM NEON, so if they don't mention it, they probably don't support it.

Actually, subsequent googling for eigen arm64 neon suggests that they've had it for a few years, but people are still finding and fixing things. Not clear to me, without further investigation, whether that's part of version 3.2.4.
