Questions on how to contribute #1

Closed
jackmott opened this issue May 26, 2017 · 28 comments

@jackmott
Contributor

First, is this project still active and relevant? Or is Rust taking a different direction with SIMD?

Second:

```rust
#[inline(always)]
#[target_feature = "+sse"]
pub fn _mm_sqrt_ps(a: f32x4) -> f32x4 {
    unsafe { sqrtps(a) }
}
```

I assume this exposes a function with the usual Intel intrinsic name and signature, and that sqrtps is some built-in LLVM function name, which is somehow exposed via the target_feature attribute?

Assuming this is still active and relevant, I could contribute; I just need to know where/how to look up the LLVM function names.

@BurntSushi
Member

BurntSushi commented May 26, 2017

First, is this project still active and relevant? Or is Rust taking a different direction with SIMD?

Yes! This repository represents my proposal-in-progress for an RFC to stabilize SIMD support in Rust's standard library.

I assume this exposes a function with the usual Intel intrinsic name and signature, and that sqrtps is some built-in LLVM function name, which is somehow exposed via the target_feature attribute?

Yes, you can see how sqrtps is brought in here: https://github.com/BurntSushi/stdsimd/blob/master/src/x86/sse.rs#L55-L56

The target_feature attribute says, "compile this function with SSE support." AFAIK, it's precisely the same as the __target__("...") attributes supported by Clang and GCC. For example: https://github.com/llvm-mirror/clang/blob/9a8f2b19b0416f6c10976342560479b45eb15724/lib/Headers/xmmintrin.h#L43

Assuming this is still active and relevant, I could contribute; I just need to know where/how to look up the LLVM function names.

Contributions are welcome to fill out the vendor intrinsic APIs. If all goes according to plan, this will eventually get organized into a PR that goes into Rust's standard library. You can see the LLVM extern blocks in each of the files. Getting those names is trickier. I've been using this file I generated from LLVM's sources to derive the intrinsic names, and I've been following along Clang's *intrin.h headers to get hints for how to implement the intrinsics themselves.

In general, every intrinsic should have some documentation associated with it and at least one unit test. If you're working on x86, then the Intel intrinsic guide is helpful for documentation. Some of the intrinsics in Clang's header files also have documentation. (In general, I've found the Intel docs to be pretty lacking.)
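
For illustration, here is roughly how the pieces above fit together for the `_mm_sqrt_ps` example (a sketch based on the linked sse.rs; the exact attribute syntax and the `f32x4` type come from this 2017-era repository and may differ in detail):

```rust
// The extern block binds the LLVM intrinsic by its LLVM name...
#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.x86.sse.sqrt.ps"]
    fn sqrtps(a: f32x4) -> f32x4;
}

/// ...and a thin public wrapper exposes it under the usual Intel name,
/// compiled with SSE enabled via the target_feature attribute.
#[inline(always)]
#[target_feature = "+sse"]
pub fn _mm_sqrt_ps(a: f32x4) -> f32x4 {
    unsafe { sqrtps(a) }
}
```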

@jackmott
Contributor Author

Thanks.
I was thinking of taking a stab at filling in AVX2, would that make sense or are there other things that should be addressed first?

@BurntSushi
Member

@jackmott Nope that makes sense to me! I very much want AVX/AVX2 myself as well and was hoping to have that done for the initial PR. So that would be a big help!

@jackmott
Contributor Author

Can you explain simd.rs a little bit?

@BurntSushi
Member

@jackmott Those are intrinsics defined by rustc, which in turn use LLVM intrinsics. They are described briefly at a high level in the RFC that introduced them: https://github.com/rust-lang/rfcs/blob/master/text/1199-simd-infrastructure.md. The actual translation in rustc is done here: https://github.com/rust-lang/rust/blob/6d841da4a0d7629f826117f99052e3d4a7997a7e/src/librustc_trans/intrinsic.rs#L933

If you take a look at my work on the SSE2 intrinsics, you'll see that I try to avoid using the simd_* platform intrinsics directly to keep them encapsulated. For example, instead of using simd_add, I'd use the Add implementation on the specific SIMD types. No need to be religious about it though!
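
As a hypothetical illustration of that style (not a snippet copied from sse2.rs): rather than calling the `simd_add` platform intrinsic directly, the wrapper leans on the operator impl that the vector type already provides.

```rust
// Hypothetical intrinsic written in the encapsulated style: simd_add would
// work here, but the Add impl on i8x16 keeps the simd_* platform
// intrinsics out of sight.
#[inline(always)]
#[target_feature = "+sse2"]
pub fn _mm_add_epi8(a: i8x16, b: i8x16) -> i8x16 {
    a + b // in optimized builds this should lower to a single paddb
}
```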

@jackmott
Contributor Author

Got it.
Hope you don't mind me asking tons of questions; I'm still new to Rust.
The macros that do the operator overloading for the vector types: will those end up being exposed to consumers of the API, or are they for internal use only?

@BurntSushi
Member

BurntSushi commented May 26, 2017

No problem! Questions are good.

The macros themselves wouldn't be exposed, no. But the SIMD vector types, along with the Add impls and such, are definitely in my plan to expose. Whether that all shakes out isn't quite clear, since this still has to go through the RFC process. But there has been a lot of discussion about this on the internals forum, and my feeling is that this is pretty representative of the closest thing to a consensus that we have.

If you want to see the "public" API, then just run cargo doc and then $BROWSER target/doc/ and follow the yellow brick road to index.html. This should be much easier to read than following the macro soup.
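
For a rough taste of what that exposed API looks like to a user (hypothetical usage; the import path and the `new`/`splat`/`extract` names are assumptions, so check the generated docs for the real surface):

```rust
extern crate stdsimd;

// Hypothetical import path for the exposed vector types.
use stdsimd::simd::f32x4;

fn main() {
    let a = f32x4::new(1.0, 2.0, 3.0, 4.0);
    let b = f32x4::splat(10.0);
    let c = a + b; // operator impls like Add are part of the exposed surface
    assert_eq!(c.extract(0), 11.0);
}
```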

@jackmott
Contributor Author

The allintrinsics document is fascinating; it's starting to make sense.
My plan would be to just go here:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2&expand=3

and go down one by one: refer to the allintrinsics list, add the intrinsic to avx2.rs, and then move on to the next.

Maybe figure out a way to codegen?? :D

@BurntSushi
Member

BurntSushi commented May 26, 2017

@jackmott Yeah, that's pretty much what I do, although I use the XML file from Intel's web site directly: https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/data-3.3.16.xml I also sometimes look at Clang's headers for hints.

Maybe figure out a way to codegen?? :D

There's been a lot of talk about that, but I'm skeptical. Maybe there's a way, but I'd much rather have at least one human apply individual scrutiny to the type signatures of each function. Writing them out is a good forcing function for that. I say this because it's important to get all of the vector types correct. The Intel definitions don't tell you the right integer types to use (there's only __m128, __m128d and __m128i). You might be able to infer some patterns based on the names of the intrinsics, but in my experience there are sometimes exceptions...
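
To make that concrete with a hedged example (the typed signature below is illustrative, not copied from the repository): the Intel guide declares `__m128i _mm_sad_epu8 (__m128i a, __m128i b)`, but the natural lane types differ between the inputs and the result, which is exactly the kind of detail a human has to supply.

```rust
// Hypothetical typed binding: the inputs are unsigned 8-bit lanes, while
// the result packs two 64-bit sums, so __m128i maps to u8x16 on one side
// and u64x2 on the other.
#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.x86.sse2.psad.bw"]
    fn psadbw(a: u8x16, b: u8x16) -> u64x2;
}

pub fn _mm_sad_epu8(a: u8x16, b: u8x16) -> u64x2 {
    unsafe { psadbw(a, b) }
}
```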

@jackmott
Contributor Author

Alright, I think I've added my first AVX2 intrinsic; it builds, anyway. Mind a quick double check?

https://github.com/jackmott/stdsimd/blob/master/src/x86/avx2.rs

I also had to add avx2 to mod.rs:

https://github.com/jackmott/stdsimd/blob/master/src/x86/mod.rs

@BurntSushi
Member

At a brief glance, that looks fine, although I didn't check it against the reference material. Note though that docs/tests should still be added, and I think there are some unused use statements there?

@jackmott
Contributor Author

jackmott commented May 26, 2017

How do I run the tests?
Edit: looks like it's just cargo test? TOO EASY!

@jackmott
Contributor Author

jackmott commented May 26, 2017

For AVX2 add, I'm not seeing a 1-to-1 correspondence between the Intel guide and the LLVM intrinsics; there's no add for i64, for instance. Does LLVM just not support all of them? Or is the document possibly missing some?

Hmm, I see them here:
https://clang.llvm.org/doxygen/avx2intrin_8h_source.html

Oh, is this because those are already part of the type?

@BurntSushi
Member

@jackmott The tricky thing about this stuff is that a vendor intrinsic doesn't necessarily map to a compiler intrinsic. Basically, as LLVM's code generation has improved, it has been removing intrinsics that it can recognize perfectly in the code itself. So if x and y are 256-bit vector types and you add them together, then LLVM will automatically emit the right instruction. Here's an interesting example: https://github.com/BurntSushi/stdsimd/blob/master/src/x86/sse2.rs#L1150-L1156

This does have weird repercussions. For example, when compiling in debug mode, you probably won't end up with the desired CPU instruction. You essentially need to compile with optimizations on to get the right instruction.
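
As a hedged sketch of the case asked about above: there is no dedicated AVX2 add intrinsic for 64-bit lanes to bind, because plain `+` on the vector type is enough for LLVM to pick the right instruction in optimized builds (the signature here is illustrative, not copied from the repository).

```rust
// Hypothetical implementation: no llvm.x86.avx2.* call is needed here;
// LLVM recognizes the vector addition and emits vpaddq when AVX2 is on.
// In debug builds the same code may lower to scalar adds instead.
#[inline(always)]
#[target_feature = "+avx2"]
pub fn _mm256_add_epi64(a: i64x4, b: i64x4) -> i64x4 {
    a + b
}
```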

@jackmott jackmott changed the title from "quick questions" to "Questions on how to contibute" on May 27, 2017
@jackmott
Contributor Author

The debug vs release thing is familiar, .NET's SIMD support does the same thing.

@jackmott
Contributor Author

jackmott commented May 27, 2017

_mm256_alignr_epi8
Any idea what this corresponds to?

Do you want me to wait until the entire AVX2 set is done before doing a PR?
I've got [abs, add, adds, and, andnot, avg, cmpeq, cmpgt] instructions, with comments and tests so far:

https://github.com/jackmott/stdsimd/blob/master/src/x86/avx2.rs

@jackmott
Contributor Author

```rust
#[allow(non_camel_case_types)]
pub type __m128i = ::v128::i8x16;
```

Does it matter that this is aliased to i8x16, or would anything do?
Should I use this same approach for __m256i _mm256_and_si256 (__m256i a, __m256i b)?

@jackmott
Contributor Author

jackmott commented May 27, 2017

The blend functions in AVX2 work like so:

```c
#define _mm256_blend_epi16(V1, V2, M) __extension__ ({       \
  (__m256i)__builtin_shufflevector((__v16hi)(__m256i)(V1),   \
                                   (__v16hi)(__m256i)(V2),   \
                                   (((M) & 0x01) ? 16 : 0),  \
                                   (((M) & 0x02) ? 17 : 1),  \
                                   (((M) & 0x04) ? 18 : 2),  \
                                   (((M) & 0x08) ? 19 : 3),  \
                                   (((M) & 0x10) ? 20 : 4),  \
                                   (((M) & 0x20) ? 21 : 5),  \
                                   (((M) & 0x40) ? 22 : 6),  \
                                   (((M) & 0x80) ? 23 : 7),  \
                                   (((M) & 0x01) ? 24 : 8),  \
                                   (((M) & 0x02) ? 25 : 9),  \
                                   (((M) & 0x04) ? 26 : 10), \
                                   (((M) & 0x08) ? 27 : 11), \
                                   (((M) & 0x10) ? 28 : 12), \
                                   (((M) & 0x20) ? 29 : 13), \
                                   (((M) & 0x40) ? 30 : 14), \
                                   (((M) & 0x80) ? 31 : 15)); })
```

Any general tips on how to implement that such that LLVM will optimize it right?

something like:

```rust
pub fn _mm256_blend_epi16(a: i16x16, b: i16x16, imm8: i32) -> i16x16 {
    let imm8 = imm8 as u32;
    simd_shuffle16(a, b,
        [
            match imm8 & 0x01 {
                0 => 0,
                _ => 16
            },
            match imm8 & 0x02 {
                0 => 1,
                _ => 17
            },
            ...
        ])
}
```

@BurntSushi
Member

@jackmott I think it would be worth your time to carefully review x86/sse2.rs. There are some examples that define macros to get the right code gen. Ultimately, you want to make sure the right instruction is generated. The way I usually do it is stick a sample program in examples/whatever.rs, and then run RUSTFLAGS="--emit asm" cargo build --release --example whatever. Then look in target/release/examples for the asm output.
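
A throwaway example file for that workflow might look something like this (the file name, import paths, and the choice of intrinsic are all placeholders, not taken from the repository):

```rust
// examples/whatever.rs -- hypothetical scratch program for inspecting codegen.
// Build with: RUSTFLAGS="--emit asm" cargo build --release --example whatever
// then grep target/release/examples/*.s for the expected instruction (vpaddd).
extern crate stdsimd;

// These paths are guesses; adjust to wherever the crate exposes the items.
use stdsimd::simd::i32x8;
use stdsimd::vendor::_mm256_add_epi32;

fn main() {
    let a = i32x8::splat(1);
    let b = i32x8::splat(2);
    let c = _mm256_add_epi32(a, b);
    // Using the result keeps the call from being optimized away entirely.
    println!("{}", c.extract(0));
}
```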

Do you want me to wait till the entire avx2 is done before doing a PR?

Up to you. Probably sending it in stages is a good idea. Please also write in a style that is consistent with the rest of the code.

I imagine you'll want a pub type __m256i = ::v256::i8x32;, yes. Whether we actually keep it or not, I'm not sure.

@jackmott jackmott changed the title from "Questions on how to contibute" to "Questions on how to contribute" on May 28, 2017
@jackmott
Contributor Author

The max functions in Clang's .h files are defined like so:

```c
static __inline__ __m256i __DEFAULT_FN_ATTRS
_mm256_max_epi8(__m256i __a, __m256i __b)
{
  return (__m256i)__builtin_ia32_pmaxsb256((__v32qi)__a, (__v32qi)__b);
}
```

but they don't appear in your extracted intrinsics gist, and it seems like ones of this form should. Did it just get missed? Should I just infer the names and then check that they're right with a test?

@BurntSushi
Member

Interesting. Maybe my regexes for extraction are bad? Can you find it in Intel's docs?

@jackmott
Contributor Author

jackmott commented May 28, 2017

The Intel intrinsics reference? Yeah, they are all in there: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2&text=max&expand=3298

@BurntSushi
Member

BurntSushi commented May 28, 2017 via email

@jackmott
Contributor Author

```rust
let a = i8x32::splat(-1);
let r = avx2::_mm256_movemask_epi8(a);
```

In a release build, r = -1; in a debug build, r = 65535.

Have you run into anything like that?

@BurntSushi
Member

I'd have to see the full code I think. There are still some details to iron out with #[target_feature] and the like that cause weird things to happen.

@jackmott
Contributor Author

jackmott commented May 29, 2017

Hmm, another intrinsic from the "Misc" section and a similar result:

#[link_name = "llvm.x86.avx2.mpsadbw"]
fn mpsadbw(a: u8x32, b: u8x32, imm8: i32) -> u16x16;

It works fine in release; in debug I get this error when running cargo test:

LLVM ERROR: Cannot select: intrinsic %llvm.x86.avx2.mpsadbw

It's all up in my fork if you want to glance at it.

@BurntSushi
Member

BurntSushi commented May 29, 2017

@jackmott I don't know for sure, but my guess is that imm8 is an immediate value and therefore must be a constant at compile time. In release mode, it probably works fine because it becomes a constant through various optimizations. But in debug mode, all bets are off. To fix this, you need to explicitly write out each call to the LLVM intrinsic with an explicit constant. You can see examples here https://github.com/jackmott/stdsimd/blob/master/src/x86/sse42.rs and here https://github.com/jackmott/stdsimd/blob/master/src/x86/sse2.rs#L282 --- Yes, they are a pain to write.
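
A rough sketch of what "writing out each call with an explicit constant" means, using the `mpsadbw` binding quoted earlier in this thread (the wrapper below is illustrative, not copied from the repository):

```rust
// Hedged sketch of constifying an immediate argument: the LLVM intrinsic
// wants `imm8` as a compile-time constant, so every reachable value gets
// its own call site with a literal.
pub fn _mm256_mpsadbw_epu8(a: u8x32, b: u8x32, imm8: i32) -> u16x16 {
    macro_rules! call {
        ($imm8:expr) => {
            unsafe { mpsadbw(a, b, $imm8) }
        };
    }
    match imm8 {
        0 => call!(0),
        1 => call!(1),
        2 => call!(2),
        3 => call!(3),
        // ...in practice, one arm for every immediate the instruction accepts...
        _ => call!(7),
    }
}
```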

@BurntSushi
Member

I've started a guide for contributors: https://github.com/BurntSushi/stdsimd/blob/master/CONTRIBUTING.md

I'm going to close this issue for now, but I hope to improve the guide as we learn more!

alexcrichton pushed a commit that referenced this issue May 4, 2018

* Work arounds for LLVM6 code-gen bugs in all/any reductions

This commit adds workarounds for the mask reductions: `all` and `any`.

64-bit wide mask types (`m8x8`, `m16x4`, `m32x2`)

`x86_64` with `MMX` enabled

Before this PR:

```asm
all_8x8:
 push    rbp
 mov     rbp, rsp
 movzx   eax, byte, ptr, [rdi, +, 7]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 6]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 5]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 4]
 movd    xmm2, eax
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   eax, byte, ptr, [rdi, +, 3]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 2]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 1]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi]
 movd    xmm3, eax
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI9_0]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 pand    xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 pand    xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 pand    xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 pop     rbp
 ret
any_8x8:
 push    rbp
 mov     rbp, rsp
 movzx   eax, byte, ptr, [rdi, +, 7]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 6]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 5]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 4]
 movd    xmm2, eax
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   eax, byte, ptr, [rdi, +, 3]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi, +, 2]
 movd    xmm1, eax
 punpcklwd xmm1, xmm0
 movzx   eax, byte, ptr, [rdi, +, 1]
 movd    xmm0, eax
 movzx   eax, byte, ptr, [rdi]
 movd    xmm3, eax
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI8_0]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 por     xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 por     xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 por     xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 pop     rbp
 ret
```

After this PR for `m8x8`, `m16x4`, `m32x2`:

```asm
all_8x8:
 push    rbp
 mov     rbp, rsp
 movq    mm0, qword, ptr, [rdi]
 pmovmskb eax, mm0
 cmp     eax, 255
 sete    al
 pop     rbp
 ret
any_8x8:
 push    rbp
 mov     rbp, rsp
 movq    mm0, qword, ptr, [rdi]
 pmovmskb eax, mm0
 test    eax, eax
 setne   al
 pop     rbp
 ret
```

`x86` with `MMX` enabled

Before this PR:

```asm
all_8x8:
 call    L9$pb
L9$pb:
 pop     eax
 mov     ecx, dword, ptr, [esp, +, 4]
 movzx   edx, byte, ptr, [ecx, +, 7]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 6]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 5]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 4]
 movd    xmm2, edx
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   edx, byte, ptr, [ecx, +, 3]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 2]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 1]
 movd    xmm0, edx
 movzx   ecx, byte, ptr, [ecx]
 movd    xmm3, ecx
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [eax, +, LCPI9_0-L9$pb]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 pand    xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 pand    xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 pand    xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 ret
any_8x8:
 call    L8$pb
L8$pb:
 pop     eax
 mov     ecx, dword, ptr, [esp, +, 4]
 movzx   edx, byte, ptr, [ecx, +, 7]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 6]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 5]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 4]
 movd    xmm2, edx
 punpcklwd xmm2, xmm0
 punpckldq xmm2, xmm1
 movzx   edx, byte, ptr, [ecx, +, 3]
 movd    xmm0, edx
 movzx   edx, byte, ptr, [ecx, +, 2]
 movd    xmm1, edx
 punpcklwd xmm1, xmm0
 movzx   edx, byte, ptr, [ecx, +, 1]
 movd    xmm0, edx
 movzx   ecx, byte, ptr, [ecx]
 movd    xmm3, ecx
 punpcklwd xmm3, xmm0
 punpckldq xmm3, xmm1
 punpcklqdq xmm3, xmm2
 movdqa  xmm0, xmmword, ptr, [eax, +, LCPI8_0-L8$pb]
 pand    xmm3, xmm0
 pcmpeqw xmm3, xmm0
 pshufd  xmm0, xmm3, 78
 por     xmm0, xmm3
 pshufd  xmm1, xmm0, 229
 por     xmm1, xmm0
 movdqa  xmm0, xmm1
 psrld   xmm0, 16
 por     xmm0, xmm1
 movd    eax, xmm0
 and     al, 1
 ret
```

After this PR:

```asm
all_8x8:
 mov     eax, dword, ptr, [esp, +, 4]
 movq    mm0, qword, ptr, [eax]
 pmovmskb eax, mm0
 cmp     eax, 255
 sete    al
 ret
any_8x8:
 mov     eax, dword, ptr, [esp, +, 4]
 movq    mm0, qword, ptr, [eax]
 pmovmskb eax, mm0
 test    eax, eax
 setne   al
 ret
```

`aarch64`

Before this PR:

```asm
all_8x8:
 ldr     d0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 tst     w8, #0xff
 umov    w10, v0.b[2]
 cset    w8, ne
 tst     w9, #0xff
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[3]
 and     w8, w8, w9
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[4]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[5]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[6]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[7]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 and     w8, w9, w8
 cset    w9, ne
 and     w0, w9, w8
 ret
any_8x8:
 ldr     d0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 orr     w8, w8, w9
 umov    w9, v0.b[2]
 orr     w8, w8, w9
 umov    w9, v0.b[3]
 orr     w8, w8, w9
 umov    w9, v0.b[4]
 orr     w8, w8, w9
 umov    w9, v0.b[5]
 orr     w8, w8, w9
 umov    w9, v0.b[6]
 orr     w8, w8, w9
 umov    w9, v0.b[7]
 orr     w8, w8, w9
 tst     w8, #0xff
 cset    w0, ne
 ret
```

After this PR:

```asm
all_8x8:
 ldr     d0, [x0]
 mov     v0.d[1], v0.d[0]
 uminv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
any_8x8:
 ldr     d0, [x0]
 mov     v0.d[1], v0.d[0]
 umaxv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
```

`ARMv7` + `neon`

Before this PR:

```asm
all_8x8:
 vmov.i8 d0, #0x1
 vldr    d1, [r0]
 vtst.8  d0, d1, d0
 vext.8  d1, d0, d0, #4
 vand    d0, d0, d1
 vext.8  d1, d0, d0, #2
 vand    d0, d0, d1
 vdup.8  d1, d0[1]
 vand    d0, d0, d1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
any_8x8:
 vmov.i8 d0, #0x1
 vldr    d1, [r0]
 vtst.8  d0, d1, d0
 vext.8  d1, d0, d0, #4
 vorr    d0, d0, d1
 vext.8  d1, d0, d0, #2
 vorr    d0, d0, d1
 vdup.8  d1, d0[1]
 vorr    d0, d0, d1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
```

After this PR:

```asm
all_8x8:
 vldr    d0, [r0]
 b       <m8x8 as All>::all

<m8x8 as All>::all:
 vpmin.u8 d16, d0, d16
 vpmin.u8 d16, d16, d16
 vpmin.u8 d0, d16, d16
 b       m8x8::extract

any_8x8:
 vldr    d0, [r0]
 b       <m8x8 as Any>::any

<m8x8 as Any>::any:
 vpmax.u8 d16, d0, d16
 vpmax.u8 d16, d16, d16
 vpmax.u8 d0, d16, d16
 b       m8x8::extract
```

(note: inlining does not work properly on ARMv7)

128-bit wide mask types (`m8x16`, `m16x8`, `m32x4`, `m64x2`)

`x86_64` with SSE2 enabled

Before this PR:

```asm
all_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI9_0]
 movdqa  xmm1, xmmword, ptr, [rdi]
 pand    xmm1, xmm0
 pcmpeqb xmm1, xmm0
 pmovmskb eax, xmm1
 xor     ecx, ecx
 cmp     eax, 65535
 mov     eax, -1
 cmovne  eax, ecx
 and     al, 1
 pop     rbp
 ret
any_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI8_0]
 movdqa  xmm1, xmmword, ptr, [rdi]
 pand    xmm1, xmm0
 pcmpeqb xmm1, xmm0
 pmovmskb eax, xmm1
 neg     eax
 sbb     eax, eax
 and     al, 1
 pop     rbp
 ret
```

After this PR:

```asm
all_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb eax, xmm0
 cmp     eax, 65535
 sete    al
 pop     rbp
 ret
any_8x16:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb eax, xmm0
 test    eax, eax
 setne   al
 pop     rbp
 ret
```

`aarch64`

Before this PR:

```asm
all_8x16:
 ldr     q0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 tst     w8, #0xff
 umov    w10, v0.b[2]
 cset    w8, ne
 tst     w9, #0xff
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[3]
 and     w8, w8, w9
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[4]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[5]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[6]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[7]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[8]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[9]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[10]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[11]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[12]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[13]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[14]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[15]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 and     w8, w9, w8
 cset    w9, ne
 and     w0, w9, w8
 ret
any_8x16:
 ldr     q0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 orr     w8, w8, w9
 umov    w9, v0.b[2]
 orr     w8, w8, w9
 umov    w9, v0.b[3]
 orr     w8, w8, w9
 umov    w9, v0.b[4]
 orr     w8, w8, w9
 umov    w9, v0.b[5]
 orr     w8, w8, w9
 umov    w9, v0.b[6]
 orr     w8, w8, w9
 umov    w9, v0.b[7]
 orr     w8, w8, w9
 umov    w9, v0.b[8]
 orr     w8, w8, w9
 umov    w9, v0.b[9]
 orr     w8, w8, w9
 umov    w9, v0.b[10]
 orr     w8, w8, w9
 umov    w9, v0.b[11]
 orr     w8, w8, w9
 umov    w9, v0.b[12]
 orr     w8, w8, w9
 umov    w9, v0.b[13]
 orr     w8, w8, w9
 umov    w9, v0.b[14]
 orr     w8, w8, w9
 umov    w9, v0.b[15]
 orr     w8, w8, w9
 tst     w8, #0xff
 cset    w0, ne
 ret
```

After this PR:

```asm
all_8x16:
 ldr     q0, [x0]
 uminv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
any_8x16:
 ldr     q0, [x0]
 umaxv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
```

`ARMv7` + `neon`

Before this PR:

```asm
all_8x16:
 vmov.i8 q0, #0x1
 vld1.64 {d2, d3}, [r0]
 vtst.8  q0, q1, q0
 vext.8  q1, q0, q0, #8
 vand    q0, q0, q1
 vext.8  q1, q0, q0, #4
 vand    q0, q0, q1
 vext.8  q1, q0, q0, #2
 vand    q0, q0, q1
 vdup.8  q1, d0[1]
 vand    q0, q0, q1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
any_8x16:
 vmov.i8 q0, #0x1
 vld1.64 {d2, d3}, [r0]
 vtst.8  q0, q1, q0
 vext.8  q1, q0, q0, #8
 vorr    q0, q0, q1
 vext.8  q1, q0, q0, #4
 vorr    q0, q0, q1
 vext.8  q1, q0, q0, #2
 vorr    q0, q0, q1
 vdup.8  q1, d0[1]
 vorr    q0, q0, q1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
```

After this PR:

```asm
all_8x16:
 vld1.64 {d0, d1}, [r0]
 b       <m8x16 as All>::all

<m8x16 as All>::all:
 vpmin.u8 d0, d0, d1
 b       <m8x8 as All>::all
any_8x16:
 vld1.64 {d0, d1}, [r0]
 b       <m8x16 as Any>::any

<m8x16 as Any>::any:
 vpmax.u8 d0, d0, d1
 b       <m8x8 as Any>::any
```

The inlining problems are pretty bad on ARMv7 + NEON.

256-bit wide mask types (`m8x32`, `m16x16`, `m32x8`, `m64x4`)

With SSE2 enabled

Before this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rip, +, LCPI17_0]
 movdqa  xmm1, xmmword, ptr, [rdi]
 pand    xmm1, xmm0
 movdqa  xmm2, xmmword, ptr, [rdi, +, 16]
 pand    xmm2, xmm0
 pcmpeqb xmm2, xmm0
 pcmpeqb xmm1, xmm0
 pand    xmm1, xmm2
 pmovmskb eax, xmm1
 xor     ecx, ecx
 cmp     eax, 65535
 mov     eax, -1
 cmovne  eax, ecx
 and     al, 1
 pop     rbp
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 por     xmm0, xmmword, ptr, [rdi, +, 16]
 movdqa  xmm1, xmmword, ptr, [rip, +, LCPI16_0]
 pand    xmm0, xmm1
 pcmpeqb xmm0, xmm1
 pmovmskb eax, xmm0
 neg     eax
 sbb     eax, eax
 and     al, 1
 pop     rbp
 ret
```

After this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb eax, xmm0
 cmp     eax, 65535
 jne     LBB17_1
 movdqa  xmm0, xmmword, ptr, [rdi, +, 16]
 pmovmskb ecx, xmm0
 mov     al, 1
 cmp     ecx, 65535
 je      LBB17_3
LBB17_1:
 xor     eax, eax
LBB17_3:
 pop     rbp
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword, ptr, [rdi]
 pmovmskb ecx, xmm0
 mov     al, 1
 test    ecx, ecx
 je      LBB16_1
 pop     rbp
 ret
LBB16_1:
 movdqa  xmm0, xmmword, ptr, [rdi, +, 16]
 pmovmskb eax, xmm0
 test    eax, eax
 setne   al
 pop     rbp
 ret
```

With AVX enabled

Before this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 vmovaps ymm0, ymmword, ptr, [rdi]
 vandps  ymm0, ymm0, ymmword, ptr, [rip, +, LCPI25_0]
 vextractf128 xmm1, ymm0, 1
 vpxor   xmm2, xmm2, xmm2
 vpcmpeqb xmm1, xmm1, xmm2
 vpcmpeqd xmm3, xmm3, xmm3
 vpxor   xmm1, xmm1, xmm3
 vpcmpeqb xmm0, xmm0, xmm2
 vpxor   xmm0, xmm0, xmm3
 vinsertf128 ymm0, ymm0, xmm1, 1
 vandps  ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 78
 vandps  ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 229
 vandps  ymm0, ymm0, ymm1
 vpsrld  xmm1, xmm0, 16
 vandps  ymm0, ymm0, ymm1
 vpsrlw  xmm1, xmm0, 8
 vandps  ymm0, ymm0, ymm1
 vpextrb eax, xmm0, 0
 and     al, 1
 pop     rbp
 vzeroupper
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 vmovaps ymm0, ymmword, ptr, [rdi]
 vandps  ymm0, ymm0, ymmword, ptr, [rip, +, LCPI24_0]
 vextractf128 xmm1, ymm0, 1
 vpxor   xmm2, xmm2, xmm2
 vpcmpeqb xmm1, xmm1, xmm2
 vpcmpeqd xmm3, xmm3, xmm3
 vpxor   xmm1, xmm1, xmm3
 vpcmpeqb xmm0, xmm0, xmm2
 vpxor   xmm0, xmm0, xmm3
 vinsertf128 ymm0, ymm0, xmm1, 1
 vorps   ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 78
 vorps   ymm0, ymm0, ymm1
 vpermilps xmm1, xmm0, 229
 vorps   ymm0, ymm0, ymm1
 vpsrld  xmm1, xmm0, 16
 vorps   ymm0, ymm0, ymm1
 vpsrlw  xmm1, xmm0, 8
 vorps   ymm0, ymm0, ymm1
 vpextrb eax, xmm0, 0
 and     al, 1
 pop     rbp
 vzeroupper
 ret
```

After this PR:

```asm
all_8x32:
 push    rbp
 mov     rbp, rsp
 vmovdqa ymm0, ymmword, ptr, [rdi]
 vxorps  xmm1, xmm1, xmm1
 vcmptrueps ymm1, ymm1, ymm1
 vptest  ymm0, ymm1
 setb    al
 pop     rbp
 vzeroupper
 ret
any_8x32:
 push    rbp
 mov     rbp, rsp
 vmovdqa ymm0, ymmword, ptr, [rdi]
 vptest  ymm0, ymm0
 setne   al
 pop     rbp
 vzeroupper
 ret
```

---

Closes #362.

* test avx on all x86 targets

* disable assert_instr on avx test

* enable all appropriate features

* disable assert_instr on x86+avx

* the fn_must_use is stable

* fix nbody example on armv7

* fixup

* fixup

* enable 64-bit wide mask MMX optimizations on x86_64 only

* remove coresimd dependency on cfg_if

* allow wasm to fail

* use an env variable to disable assert_instr tests

* disable m32x2 mask MMX optimization on macos

* move cfg_if to coresimd/macros.rs