Questions on how to contribute #1

Closed
jackmott opened this issue May 26, 2017 · 28 comments

@jackmott
Contributor

First, is this project still active and relevant? Or is Rust taking a different direction with SIMD?

Second:

```rust
#[inline(always)]
#[target_feature = "+sse"]
pub fn _mm_sqrt_ps(a: f32x4) -> f32x4 {
    unsafe { sqrtps(a) }
}
```

I assume this exposes a function with the usual Intel intrinsic name and signature, and that sqrtps is some built-in LLVM function name, which is somehow exposed via the target_feature attribute?

Assuming this is still active and relevant, I could contribute; I just need to know where/how to look up the LLVM function names.

@BurntSushi
Member

BurntSushi commented May 26, 2017

First, is this project still active and relevant? Or is Rust taking a different direction with SIMD?

Yes! This repository represents my proposal-in-progress for an RFC to stabilize SIMD support in Rust's standard library.

I assume this exposes a function with the usual Intel intrinsic name and signature, and that sqrtps is some built-in LLVM function name, which is somehow exposed via the target_feature attribute?

Yes, you can see how sqrtps is brought in here: https://github.com/BurntSushi/stdsimd/blob/master/src/x86/sse.rs#L55-L56

The target_feature attribute says, "compile this function with SSE support." AFAIK, it's precisely the same as the __target__("...") attributes supported by Clang and GCC. For example: https://github.com/llvm-mirror/clang/blob/9a8f2b19b0416f6c10976342560479b45eb15724/lib/Headers/xmmintrin.h#L43

Assuming this is still active and relevant, I could contribute; I just need to know where/how to look up the LLVM function names.

Contributions are welcome to fill out the vendor intrinsic APIs. If all goes according to plan, this will eventually get organized into a PR that goes into Rust's standard library. You can see the LLVM extern blocks in each of the files. Getting those names is trickier. I've been using this file I generated from LLVM's sources to derive the intrinsic names, and I've been following along Clang's *intrin.h headers to get hints for how to implement the intrinsics themselves.

In general, every intrinsic should have some documentation associated with it and at least one unit test. If you're working on x86, then the Intel intrinsic guide is helpful for documentation. Some of the intrinsics in Clang's header files also have documentation. (In general, I've found the Intel docs to be pretty lacking.)
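
For illustration, here is roughly how the pieces above fit together for the `_mm_sqrt_ps` example (a sketch based on the linked sse.rs; the exact attribute syntax and the `f32x4` type come from this 2017-era repository and may differ in detail):

```rust
// The extern block binds the LLVM intrinsic by its LLVM name...
#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.x86.sse.sqrt.ps"]
    fn sqrtps(a: f32x4) -> f32x4;
}

/// ...and a thin public wrapper exposes it under the usual Intel name,
/// compiled with SSE enabled via the target_feature attribute.
#[inline(always)]
#[target_feature = "+sse"]
pub fn _mm_sqrt_ps(a: f32x4) -> f32x4 {
    unsafe { sqrtps(a) }
}
```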

@jackmott
Contributor Author

Thanks.
I was thinking of taking a stab at filling in AVX2, would that make sense or are there other things that should be addressed first?

@BurntSushi
Member

@jackmott Nope that makes sense to me! I very much want AVX/AVX2 myself as well and was hoping to have that done for the initial PR. So that would be a big help!

@jackmott
Contributor Author

Can you explain simd.rs a little bit?

@BurntSushi
Member

@jackmott Those are intrinsics defined by rustc, which in turn use LLVM intrinsics. They are described briefly at a high level in the RFC that introduced them: https://github.com/rust-lang/rfcs/blob/master/text/1199-simd-infrastructure.md. The actual translation in rustc is done here: https://github.com/rust-lang/rust/blob/6d841da4a0d7629f826117f99052e3d4a7997a7e/src/librustc_trans/intrinsic.rs#L933

If you take a look at my work on the SSE2 intrinsics, you'll see that I try to avoid using the simd_* platform intrinsics directly to keep them encapsulated. For example, instead of using simd_add, I'd use the Add implementation on the specific SIMD types. No need to be religious about it though!
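
As a hypothetical illustration of that style (not a snippet copied from sse2.rs): rather than calling the `simd_add` platform intrinsic directly, the wrapper leans on the operator impl that the vector type already provides.

```rust
// Hypothetical intrinsic written in the encapsulated style: simd_add would
// work here, but the Add impl on i8x16 keeps the simd_* platform
// intrinsics out of sight.
#[inline(always)]
#[target_feature = "+sse2"]
pub fn _mm_add_epi8(a: i8x16, b: i8x16) -> i8x16 {
    a + b // in optimized builds this should lower to a single paddb
}
```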

@jackmott
Contributor Author

Got it.
Hope you don't mind me asking tons of questions; I'm still new to Rust.
The macros that do the operator overloading for the vector types: will those end up being exposed to consumers of the API, or are they for internal use only?

@BurntSushi
Member

BurntSushi commented May 26, 2017

No problem! Questions are good.

The macros themselves wouldn't be exposed, no. But the SIMD vector types, along with the Add impls and such, are definitely in my plan to expose. Whether that all shakes out isn't quite clear, since this still has to go through the RFC process. But there has been a lot of discussion about this on the internals forum, and my feeling is that this is pretty representative of the closest thing to a consensus that we have.

If you want to see the "public" API, then just run cargo doc and then $BROWSER target/doc/ and follow the yellow brick road to index.html. This should be much easier to read than following the macro soup.
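
For a rough taste of what that exposed API looks like to a user (hypothetical usage; the import path and the `new`/`splat`/`extract` names are assumptions, so check the generated docs for the real surface):

```rust
extern crate stdsimd;

// Hypothetical import path for the exposed vector types.
use stdsimd::simd::f32x4;

fn main() {
    let a = f32x4::new(1.0, 2.0, 3.0, 4.0);
    let b = f32x4::splat(10.0);
    let c = a + b; // operator impls like Add are part of the exposed surface
    assert_eq!(c.extract(0), 11.0);
}
```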

@jackmott
Contributor Author

The allintrinsics document is fascinating; it's starting to make sense.
My plan would be to just go here:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2&expand=3

and go down one by one: refer to the allintrinsics list, add the intrinsic to avx2.rs, and then move on to the next.

Maybe figure out a way to codegen?? :D

@BurntSushi
Member

BurntSushi commented May 26, 2017

@jackmott Yeah, that's pretty much what I do, although I use the XML file from Intel's web site directly: https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/data-3.3.16.xml I also sometimes look at Clang's headers for hints.

Maybe figure out a way to codegen?? :D

There's been a lot of talk about that, but I'm skeptical. Maybe there's a way, but I'd much rather have at least one human apply individual scrutiny to the type signatures of each function. Writing them out is a good forcing function for that. I say this because it's important to get all of the vector types correct. The Intel definitions don't tell you the right integer types to use (there's only __m128, __m128d and __m128i). You might be able to infer some patterns based on the names of the intrinsics, but in my experience there are sometimes exceptions...
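
To make that concrete with a hedged example (the typed signature below is illustrative, not copied from the repository): the Intel guide declares `__m128i _mm_sad_epu8 (__m128i a, __m128i b)`, but the natural lane types differ between the inputs and the result, which is exactly the kind of detail a human has to supply.

```rust
// Hypothetical typed binding: the inputs are unsigned 8-bit lanes, while
// the result packs two 64-bit sums, so __m128i maps to u8x16 on one side
// and u64x2 on the other.
#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.x86.sse2.psad.bw"]
    fn psadbw(a: u8x16, b: u8x16) -> u64x2;
}

pub fn _mm_sad_epu8(a: u8x16, b: u8x16) -> u64x2 {
    unsafe { psadbw(a, b) }
}
```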

@jackmott
Contributor Author

Alright, I think I've added my first AVX2 intrinsic; it builds, anyway. Mind a quick double check?

https://github.com/jackmott/stdsimd/blob/master/src/x86/avx2.rs

I also had to add avx2 to mod.rs:

https://github.com/jackmott/stdsimd/blob/master/src/x86/mod.rs

@BurntSushi
Member

At a brief glance, that looks fine, although I didn't check it against the reference material. Note though that docs/tests should still be added, and I think there are some unused use statements there?

@jackmott
Contributor Author

jackmott commented May 26, 2017

How do I run the tests?
Edit: looks like it's just cargo test? TOO EASY!

@jackmott
Contributor Author

jackmott commented May 26, 2017

For AVX2 add, I'm not seeing a 1-to-1 correspondence between the Intel guide and the LLVM intrinsics; there's no add for i64, for instance. Does LLVM just not support all of them? Or is the document possibly missing some?

Hmm, I see them here:
https://clang.llvm.org/doxygen/avx2intrin_8h_source.html

Oh, is this because those are already part of the type?

@BurntSushi
Member

@jackmott The tricky thing about this stuff is that a vendor intrinsic doesn't necessarily map to a compiler intrinsic. Basically, as LLVM's code generation has improved, it has been removing intrinsics that it can recognize perfectly in the code itself. So if x and y are 256-bit vector types and you add them together, then LLVM will automatically emit the right instruction. Here's an interesting example: https://github.com/BurntSushi/stdsimd/blob/master/src/x86/sse2.rs#L1150-L1156

This does have weird repercussions. For example, when compiling in debug mode, you probably won't end up with the desired CPU instruction. You essentially need to compile with optimizations on to get the right instruction.
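
As a hedged sketch of the case asked about above: there is no dedicated AVX2 add intrinsic for 64-bit lanes to bind, because plain `+` on the vector type is enough for LLVM to pick the right instruction in optimized builds (the signature here is illustrative, not copied from the repository).

```rust
// Hypothetical implementation: no llvm.x86.avx2.* call is needed here;
// LLVM recognizes the vector addition and emits vpaddq when AVX2 is on.
// In debug builds the same code may lower to scalar adds instead.
#[inline(always)]
#[target_feature = "+avx2"]
pub fn _mm256_add_epi64(a: i64x4, b: i64x4) -> i64x4 {
    a + b
}
```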

@jackmott jackmott changed the title from "quick questions" to "Questions on how to contibute" on May 27, 2017
@jackmott
Contributor Author

The debug vs release thing is familiar, .NET's SIMD support does the same thing.

@jackmott
Contributor Author

jackmott commented May 27, 2017

_mm256_alignr_epi8
Any idea what this corresponds to?

Do you want me to wait until the entire AVX2 set is done before doing a PR?
I've got [abs, add, adds, and, andnot, avg, cmpeq, cmpgt] instructions, with comments and tests so far:

https://github.com/jackmott/stdsimd/blob/master/src/x86/avx2.rs

@jackmott
Contributor Author

```rust
#[allow(non_camel_case_types)]
pub type __m128i = ::v128::i8x16;
```

Does it matter that this is aliased to i8x16, or would anything do?
Should I use this same approach for __m256i _mm256_and_si256 (__m256i a, __m256i b)?

@jackmott
Contributor Author

jackmott commented May 27, 2017

The blend functions in AVX2 work like so:

```c
#define _mm256_blend_epi16(V1, V2, M) __extension__ ({       \
  (__m256i)__builtin_shufflevector((__v16hi)(__m256i)(V1),   \
                                   (__v16hi)(__m256i)(V2),   \
                                   (((M) & 0x01) ? 16 : 0),  \
                                   (((M) & 0x02) ? 17 : 1),  \
                                   (((M) & 0x04) ? 18 : 2),  \
                                   (((M) & 0x08) ? 19 : 3),  \
                                   (((M) & 0x10) ? 20 : 4),  \
                                   (((M) & 0x20) ? 21 : 5),  \
                                   (((M) & 0x40) ? 22 : 6),  \
                                   (((M) & 0x80) ? 23 : 7),  \
                                   (((M) & 0x01) ? 24 : 8),  \
                                   (((M) & 0x02) ? 25 : 9),  \
                                   (((M) & 0x04) ? 26 : 10), \
                                   (((M) & 0x08) ? 27 : 11), \
                                   (((M) & 0x10) ? 28 : 12), \
                                   (((M) & 0x20) ? 29 : 13), \
                                   (((M) & 0x40) ? 30 : 14), \
                                   (((M) & 0x80) ? 31 : 15)); })
```

Any general tips on how to implement that such that LLVM will optimize it right?

something like:

```rust
pub fn _mm256_blend_epi16(a: i16x16, b: i16x16, imm8: i32) -> i16x16 {
    let imm8 = imm8 as u32;
    simd_shuffle16(a, b,
        [
            match imm8 & 0x01 {
                0 => 0,
                _ => 16
            },
            match imm8 & 0x02 {
                0 => 1,
                _ => 17
            },
            ...
        ])
}
```

@BurntSushi
Member

@jackmott I think it would be worth your time to carefully review x86/sse2.rs. There are some examples that define macros to get the right code gen. Ultimately, you want to make sure the right instruction is generated. The way I usually do it is stick a sample program in examples/whatever.rs, and then run RUSTFLAGS="--emit asm" cargo build --release --example whatever. Then look in target/release/examples for the asm output.
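
A throwaway example file for that workflow might look something like this (the file name, import paths, and the choice of intrinsic are all placeholders, not taken from the repository):

```rust
// examples/whatever.rs -- hypothetical scratch program for inspecting codegen.
// Build with: RUSTFLAGS="--emit asm" cargo build --release --example whatever
// then grep target/release/examples/*.s for the expected instruction (vpaddd).
extern crate stdsimd;

// These paths are guesses; adjust to wherever the crate exposes the items.
use stdsimd::simd::i32x8;
use stdsimd::vendor::_mm256_add_epi32;

fn main() {
    let a = i32x8::splat(1);
    let b = i32x8::splat(2);
    let c = _mm256_add_epi32(a, b);
    // Using the result keeps the call from being optimized away entirely.
    println!("{}", c.extract(0));
}
```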

Do you want me to wait till the entire avx2 is done before doing a PR?

Up to you. Probably sending it in stages is a good idea. Please also write in a style that is consistent with the rest of the code.

I imagine you'll want a pub type __m256i = ::v256::i8x32;, yes. Whether we actually keep it or not, I'm not sure.

@jackmott jackmott changed the title from "Questions on how to contibute" to "Questions on how to contribute" on May 28, 2017
@jackmott
Contributor Author

The max functions in Clang's .h files are defined like so:

```c
static __inline__ __m256i __DEFAULT_FN_ATTRS
_mm256_max_epi8(__m256i __a, __m256i __b)
{
  return (__m256i)__builtin_ia32_pmaxsb256((__v32qi)__a, (__v32qi)__b);
}
```

but they don't appear in your extracted intrinsics gist, and it seems like ones of this form should. Did it just get missed? Should I just infer the names and then check that they're right with a test?

@BurntSushi
Member

Interesting. Maybe my regexes for extraction are bad? Can you find it in Intel's docs?

@jackmott
Contributor Author

jackmott commented May 28, 2017

The Intel intrinsics reference? Yeah, they are all in there: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2&text=max&expand=3298

@BurntSushi
Member

BurntSushi commented May 28, 2017 via email

@jackmott
Contributor Author

```rust
let a = i8x32::splat(-1);
let r = avx2::_mm256_movemask_epi8(a);
```

In a release build, r = -1; in a debug build, r = 65535.

Have you run into anything like that?

@BurntSushi
Member

I'd have to see the full code I think. There are still some details to iron out with #[target_feature] and the like that cause weird things to happen.

@jackmott
Contributor Author

jackmott commented May 29, 2017

Hmm, another intrinsic from the "Misc" section and a similar result:

#[link_name = "llvm.x86.avx2.mpsadbw"]
fn mpsadbw(a: u8x32, b: u8x32, imm8: i32) -> u16x16;

It works fine in release; in debug I get this error when running cargo test:

LLVM ERROR: Cannot select: intrinsic %llvm.x86.avx2.mpsadbw

It's all up in my fork if you want to glance at it.

@BurntSushi
Member

BurntSushi commented May 29, 2017

@jackmott I don't know for sure, but my guess is that imm8 is an immediate value and therefore must be a constant at compile time. In release mode, it probably works fine because it becomes a constant through various optimizations. But in debug mode, all bets are off. To fix this, you need to explicitly write out each call to the LLVM intrinsic with an explicit constant. You can see examples here https://github.com/jackmott/stdsimd/blob/master/src/x86/sse42.rs and here https://github.com/jackmott/stdsimd/blob/master/src/x86/sse2.rs#L282 --- Yes, they are a pain to write.
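
A rough sketch of what "writing out each call with an explicit constant" means, using the `mpsadbw` binding quoted earlier in this thread (the wrapper below is illustrative, not copied from the repository):

```rust
// Hedged sketch of constifying an immediate argument: the LLVM intrinsic
// wants `imm8` as a compile-time constant, so every reachable value gets
// its own call site with a literal.
pub fn _mm256_mpsadbw_epu8(a: u8x32, b: u8x32, imm8: i32) -> u16x16 {
    macro_rules! call {
        ($imm8:expr) => {
            unsafe { mpsadbw(a, b, $imm8) }
        };
    }
    match imm8 {
        0 => call!(0),
        1 => call!(1),
        2 => call!(2),
        3 => call!(3),
        // ...in practice, one arm for every immediate the instruction accepts...
        _ => call!(7),
    }
}
```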

@BurntSushi
Member

I've started a guide for contributors: https://github.com/BurntSushi/stdsimd/blob/master/CONTRIBUTING.md

I'm going to close this issue for now, but I hope to improve the guide as we learn more!

alexcrichton pushed a commit that referenced this issue May 4, 2018

* Work arounds for LLVM6 code-gen bugs in all/any reductions

This commit adds workarounds for the mask reductions: `all` and `any`.

64-bit wide mask types (`m8x8`, `m16x4`, `m32x2`)

`x86_64` with `MMX` enabled

Before this PR:

```asm
all_8x8:
 push    rbp
 mov     rbp, rsp
 movzx   eax, byte, ptr, [rdi, +, 7]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 6]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 5]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 4]
 movd    xmm2, eax
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   eax, byte, ptr, [rdi, +, 3]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 2]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 1]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi]
 movd    xmm3, eax
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI9_0]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 pand    xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 pand    xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 pand    xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 pop     rbp
 ret
any_8x8:
 push    rbp
 mov     rbp, rsp
 movzx   eax, byte, ptr, [rdi, +, 7]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 6]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 5]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 4]
 movd    xmm2, eax
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   eax, byte, ptr, [rdi, +, 3]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 2]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 1]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi]
 movd    xmm3, eax
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI8_0]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 por     xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 por     xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 por     xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 pop     rbp
 ret
```

After this PR for `m8x8`, `m16x4`, `m32x2`:

```asm
all_8x8:
 push    rbp
 mov     rbp, rsp
 movq    mm0, qword, ptr, [rdi]
 pmovmskb eax, mm0
 cmp     eax, 255
 sete    al
 pop     rbp
 ret
any_8x8:
 push    rbp
 mov     rbp, rsp
 movq    mm0, qword, ptr, [rdi]
 pmovmskb eax, mm0
 test    eax, eax
 setne   al
 pop     rbp
 ret
```

`x86` with `MMX` enabled

Before this PR:

```asm
all_8x8:
 call    L9$pb
L9$pb:
 pop     eax
 mov     ecx, dword, ptr, [esp, +, 4]
 movzx   edx, byte, ptr, [ecx, +, 7]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 6]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 5]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 4]
 movd    xmm2, edx
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   edx, byte, ptr, [ecx, +, 3]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 2]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 1]
 movd    xmm0, edx
 movzx   ecx, byte, ptr, [ecx]
 movd    xmm3, ecx
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [eax, +, LCPI9_0-L9$pb]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 pand    xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 pand    xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 pand    xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 ret
any_8x8:
 call    L8$pb
L8$pb:
 pop     eax
 mov     ecx, dword, ptr, [esp, +, 4]
 movzx   edx, byte, ptr, [ecx, +, 7]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 6]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 5]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 4]
 movd    xmm2, edx
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   edx, byte, ptr, [ecx, +, 3]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 2]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 1]
 movd    xmm0, edx
 movzx   ecx, byte, ptr, [ecx]
 movd    xmm3, ecx
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [eax, +, LCPI8_0-L8$pb]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 por     xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 por     xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 por     xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 ret
```

After this PR:

```asm
all_8x8:
 mov     eax, dword, ptr, [esp, +, 4]
 movq    mm0, qword, ptr, [eax]
 pmovmskb eax, mm0
 cmp     eax, 255
 sete    al
 ret
any_8x8:
 mov     eax, dword, ptr, [esp, +, 4]
 movq    mm0, qword, ptr, [eax]
 pmovmskb eax, mm0
 test    eax, eax
 setne   al
 ret
```

`aarch64`

Before this PR:

```asm
all_8x8:
 ldr     d0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 tst     w8, #0xff
 umov    w10, v0.b[2]
 cset    w8, ne
 tst     w9, #0xff
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[3]
 and     w8, w8, w9
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[4]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[5]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[6]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[7]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 and     w8, w9, w8
 cset    w9, ne
 and     w0, w9, w8
 ret
any_8x8:
 ldr     d0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 orr     w8, w8, w9
 umov    w9, v0.b[2]
 orr     w8, w8, w9
 umov    w9, v0.b[3]
 orr     w8, w8, w9
 umov    w9, v0.b[4]
 orr     w8, w8, w9
 umov    w9, v0.b[5]
 orr     w8, w8, w9
 umov    w9, v0.b[6]
 orr     w8, w8, w9
 umov    w9, v0.b[7]
 orr     w8, w8, w9
 tst     w8, #0xff
 cset    w0, ne
 ret
```

After this PR:

```asm
all_8x8:
 ldr     d0, [x0]
 mov     v0.d[1], v0.d[0]
 uminv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
any_8x8:
 ldr     d0, [x0]
 mov     v0.d[1], v0.d[0]
 umaxv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
```

`ARMv7` + `neon`

Before this PR:

```asm
all_8x8:
 vmov.i8 d0, #0x1
 vldr    d1, [r0]
 vtst.8  d0, d1, d0
 vext.8  d1, d0, d0, #4
 vand    d0, d0, d1
 vext.8  d1, d0, d0, #2
 vand    d0, d0, d1
 vdup.8  d1, d0[1]
 vand    d0, d0, d1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
any_8x8:
 vmov.i8 d0, #0x1
 vldr    d1, [r0]
 vtst.8  d0, d1, d0
 vext.8  d1, d0, d0, #4
 vorr    d0, d0, d1
 vext.8  d1, d0, d0, #2
 vorr    d0, d0, d1
 vdup.8  d1, d0[1]
 vorr    d0, d0, d1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
```

After this PR:

```asm
all_8x8:
 vldr    d0, [r0]
 b       <m8x8 as All>::all

<m8x8 as All>::all:
 vpmin.u8 d16, d0, d16
 vpmin.u8 d16, d16, d16
 vpmin.u8 d0, d16, d16
 b       m8x8::extract

any_8x8:
 vldr    d0, [r0]
 b       <m8x8 as Any>::any

<m8x8 as Any>::any:
 vpmax.u8 d16, d0, d16
 vpmax.u8 d16, d16, d16
 vpmax.u8 d0, d16, d16
 b       m8x8::extract
```

(note: inlining does not work properly on ARMv7)

128-bit wide mask types (`m8x16`, `m16x8`, `m32x4`, `m64x2`)

`x86_64` with SSE2 enabled

Before this PR:

```asm
all_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI9_0]
 movdqa  xmm1, xmmword, ptr, [rdi]
 pand    xmm1, xmm0
 pcmpeqb xmm1, xmm0
 pmovmskb eax, xmm1
 xor     ecx, ecx
 cmp     eax, 65535
 mov     eax, -1
 cmovne  eax, ecx
 and     al, 1
 pop     rbp
 ret
any_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI8_0]
 movdqa  xmm1, xmmword, ptr, [rdi]
 pand    xmm1, xmm0
 pcmpeqb xmm1, xmm0
 pmovmskb eax, xmm1
 neg     eax
 sbb     eax, eax
 and     al, 1
 pop     rbp
 ret
```

After this PR:

```asm
all_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb eax, xmm0
 cmp     eax, 65535
 sete    al
 pop     rbp
 ret
any_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb eax, xmm0
 test    eax, eax
 setne   al
 pop     rbp
 ret
```

`aarch64`

Before this PR:

```asm
all_8x16:
 ldr     q0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 tst     w8, #0xff
 umov    w10, v0.b[2]
 cset    w8, ne
 tst     w9, #0xff
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[3]
 and     w8, w8, w9
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[4]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[5]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[6]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[7]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[8]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[9]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[10]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[11]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[12]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[13]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[14]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[15]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 and     w8, w9, w8
 cset    w9, ne
 and     w0, w9, w8
 ret
any_8x16:
 ldr     q0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 orr     w8, w8, w9
 umov    w9, v0.b[2]
 orr     w8, w8, w9
 umov    w9, v0.b[3]
 orr     w8, w8, w9
 umov    w9, v0.b[4]
 orr     w8, w8, w9
 umov    w9, v0.b[5]
 orr     w8, w8, w9
 umov    w9, v0.b[6]
 orr     w8, w8, w9
 umov    w9, v0.b[7]
 orr     w8, w8, w9
 umov    w9, v0.b[8]
 orr     w8, w8, w9
 umov    w9, v0.b[9]
 orr     w8, w8, w9
 umov    w9, v0.b[10]
 orr     w8, w8, w9
 umov    w9, v0.b[11]
 orr     w8, w8, w9
 umov    w9, v0.b[12]
 orr     w8, w8, w9
 umov    w9, v0.b[13]
 orr     w8, w8, w9
 umov    w9, v0.b[14]
 orr     w8, w8, w9
 umov    w9, v0.b[15]
 orr     w8, w8, w9
 tst     w8, #0xff
 cset    w0, ne
 ret
```

After this PR:

```asm
all_8x16:
 ldr     q0, [x0]
 uminv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
any_8x16:
 ldr     q0, [x0]
 umaxv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
```

`ARMv7` + `neon`

Before this PR:

```asm
all_8x16:
 vmov.i8 q0, #0x1
 vld1.64 {d2, d3}, [r0]
 vtst.8  q0, q1, q0
 vext.8  q1, q0, q0, #8
 vand    q0, q0, q1
 vext.8  q1, q0, q0, #4
 vand    q0, q0, q1
 vext.8  q1, q0, q0, #2
 vand    q0, q0, q1
 vdup.8  q1, d0[1]
 vand    q0, q0, q1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
any_8x16:
 vmov.i8 q0, #0x1
 vld1.64 {d2, d3}, [r0]
 vtst.8  q0, q1, q0
 vext.8  q1, q0, q0, #8
 vorr    q0, q0, q1
 vext.8  q1, q0, q0, #4
 vorr    q0, q0, q1
 vext.8  q1, q0, q0, #2
 vorr    q0, q0, q1
 vdup.8  q1, d0[1]
 vorr    q0, q0, q1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
```

After this PR:

```asm
all_8x16:
 vld1.64 {d0, d1}, [r0]
 b       <m8x16 as All>::all

<m8x16 as All>::all:
 vpmin.u8 d0, d0, d1
 b       <m8x8 as All>::all
any_8x16:
 vld1.64 {d0, d1}, [r0]
 b       <m8x16 as Any>::any

<m8x16 as Any>::any:
 vpmax.u8 d0, d0, d1
 b       <m8x8 as Any>::any
```

The inlining problems are pretty bad on ARMv7 + NEON.

256-bit wide mask types (`m8x32`, `m16x16`, `m32x8`, `m64x4`)

With SSE2 enabled

Before this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI17_0]
 movdqa  xmm1, xmmword, ptr, [rdi]
 pand    xmm1, xmm0
 movdqa  xmm2, xmmword, ptr, [rdi, +, 16]
 pand    xmm2, xmm0
 pcmpeqb xmm2, xmm0
 pcmpeqb xmm1, xmm0
 pand    xmm1, xmm2
 pmovmskb eax, xmm1
 xor     ecx, ecx
 cmp     eax, 65535
 mov     eax, -1
 cmovne  eax, ecx
 and     al, 1
 pop     rbp
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 por     xmm0, xmmword, ptr, [rdi, +, 16]
 movdqa  xmm1, xmmword, ptr, [rip, +, LCPI16_0]
 pand    xmm0, xmm1
 pcmpeqb xmm0, xmm1
 pmovmskb eax, xmm0
 neg     eax
 sbb     eax, eax
 and     al, 1
 pop     rbp
 ret
```

After this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb eax, xmm0
 cmp     eax, 65535
 jne     LBB17_1
 movdqa  xmm0, xmmword, ptr, [rdi, +, 16]
 pmovmskb ecx, xmm0
 mov     al, 1
 cmp     ecx, 65535
 je      LBB17_3
LBB17_1:
 xor     eax, eax
LBB17_3:
 pop     rbp
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb ecx, xmm0
 mov     al, 1
 test    ecx, ecx
 je      LBB16_1
 pop     rbp
 ret
LBB16_1:
 movdqa  xmm0, xmmword, ptr, [rdi, +, 16]
 pmovmskb eax, xmm0
 test    eax, eax
 setne   al
 pop     rbp
 ret
```

With AVX enabled

Before this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 vmovaps ymm0, ymmword, ptr, [rdi]
 vandps  ymm0, ymm0, ymmword, ptr, [rip, +, LCPI25_0]
 vextractf128 xmm1, ymm0, 1
 vpxor   xmm2, xmm2, xmm2
 vpcmpeqb xmm1, xmm1, xmm2
 vpcmpeqd xmm3, xmm3, xmm3
 vpxor   xmm1, xmm1, xmm3
 vpcmpeqb xmm0, xmm0, xmm2
 vpxor   xmm0, xmm0, xmm3
 vinsertf128 ymm0, ymm0, xmm1, 1
 vandps  ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 78
 vandps  ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 229
 vandps  ymm0, ymm0, ymm1
 vpsrld  xmm1, xmm0, 16
 vandps  ymm0, ymm0, ymm1
 vpsrlw  xmm1, xmm0, 8
 vandps  ymm0, ymm0, ymm1
 vpextrb eax, xmm0, 0
 and     al, 1
 pop     rbp
 vzeroupper
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 vmovaps ymm0, ymmword, ptr, [rdi]
 vandps  ymm0, ymm0, ymmword, ptr, [rip, +, LCPI24_0]
 vextractf128 xmm1, ymm0, 1
 vpxor   xmm2, xmm2, xmm2
 vpcmpeqb xmm1, xmm1, xmm2
 vpcmpeqd xmm3, xmm3, xmm3
 vpxor   xmm1, xmm1, xmm3
 vpcmpeqb xmm0, xmm0, xmm2
 vpxor   xmm0, xmm0, xmm3
 vinsertf128 ymm0, ymm0, xmm1, 1
 vorps   ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 78
 vorps   ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 229
 vorps   ymm0, ymm0, ymm1
 vpsrld  xmm1, xmm0, 16
 vorps   ymm0, ymm0, ymm1
 vpsrlw  xmm1, xmm0, 8
 vorps   ymm0, ymm0, ymm1
 vpextrb eax, xmm0, 0
 and     al, 1
 pop     rbp
 vzeroupper
 ret
```

After this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 vmovdqa ymm0, ymmword, ptr, [rdi]
 vxorps  xmm1, xmm1, xmm1
 vcmptrueps ymm1, ymm1, ymm1
 vptest  ymm0, ymm1
 setb    al
 pop     rbp
 vzeroupper
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 vmovdqa ymm0, ymmword, ptr, [rdi]
 vptest  ymm0, ymm0
 setne   al
 pop     rbp
 vzeroupper
 ret
```

---

Closes #362.

* test avx on all x86 targets

* disable assert_instr on avx test

* enable all appropriate features

* disable assert_instr on x86+avx

* the fn_must_use is stable

* fix nbody example on armv7

* fixup

* fixup

* enable 64-bit wide mask MMX optimizations on x86_64 only

* remove coresimd dependency on cfg_if

* allow wasm to fail

* use an env variable to disable assert_instr tests

* disable m32x2 mask MMX optimization on macos

* move cfg_if to coresimd/macros.rs