
Consistently use the highest bit of vector masks when converting to i1 vectors #104693

Open · wants to merge 1 commit into base: master from use-high-bit-of-mask-for-select
Conversation

@jhorstmann (Contributor) commented Nov 21, 2022

This improves the codegen for vector select, gather, scatter and boolean reduction intrinsics and fixes rust-lang/portable-simd#316.

The current behavior of mask operations during llvm codegen is to truncate the mask vector to <N x i1>, telling llvm to use the least significant bit.

Since sse/avx instructions are defined to use the most significant bit, llvm has to insert a left shift before the mask can actually be used.

Similarly on aarch64, mask operations like blend work bit by bit; repeating the least significant bit across the whole lane involves shifting it into the sign position and then comparing against zero.

By shifting before truncating to <N x i1>, we tell llvm that we only consider the most significant bit, removing the need for additional shift instructions in the assembly.
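As a rough illustration (a sketch, not actual compiler output), the emitted IR for a plain select changes like this, using portable-simd's mask32x4 and i32x4 and showing the before/after IR as comments:

#![feature(portable_simd)]
use std::simd::{i32x4, mask32x4};

// Before this change, codegen truncated the mask directly, so only the
// least significant bit of each lane reached the <4 x i1> vector:
//
//   %m1 = trunc <4 x i32> %m to <4 x i1>
//   %r  = select <4 x i1> %m1, <4 x i32> %a, <4 x i32> %b
//
// With this change, the lanes are logically shifted right by 31 first, so
// the <4 x i1> vector is built from the most significant bits:
//
//   %s  = lshr <4 x i32> %m, <i32 31, i32 31, i32 31, i32 31>
//   %m1 = trunc <4 x i32> %s to <4 x i1>
//   %r  = select <4 x i1> %m1, <4 x i32> %a, <4 x i32> %b
pub fn select(m: mask32x4, a: i32x4, b: i32x4) -> i32x4 {
    m.select(a, b)
}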

@rustbot (Collaborator) commented Nov 21, 2022

r? @davidtwco

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Nov 21, 2022
@workingjubilee (Member):

Does this affect codegen for non-x86 platforms?

@workingjubilee (Member):

You may want to study the other simd intrinsic tests and also the assembly or codegen tests.

@jhorstmann (Contributor Author) commented Nov 22, 2022

Does this affect codegen for non-x86 platforms?

Seems to have a very similar effect for target aarch64-unknown-linux-gnu. Example with simple select (using 128 bit registers now):

pub fn select(m: mask32x4, a: i32x4, b: i32x4) -> i32x4 {
    m.select(a, b)
}

Before

portable_simd_test::select:
	ldr q1, [x0]
	ldr q0, [x1]
	ldr q2, [x2]
	shl v1.4s, v1.4s, #31
	cmlt v1.4s, v1.4s, #0
	bif v0.16b, v2.16b, v1.16b
	str q0, [x8]
	ret

After

portable_simd_test::select:
	ldr q0, [x0]
	ldr q1, [x1]
	ldr q2, [x2]
	cmlt v0.4s, v0.4s, #0
	bsl v0.16b, v1.16b, v2.16b
	str q0, [x8]
	ret

Interestingly, there is a big effect on the all() reduction in the following example:

pub unsafe fn mask_all(m: mask8x16) -> bool {
    m.all()
}

Before

portable_simd_test::mask_all:
	.cfi_startproc
	sub sp, sp, #16
	.cfi_def_cfa_offset 16

	ldr q0, [x0]

	mov w8, #65535

	umov w9, v0.b[1]
	umov w11, v0.b[2]
	umov w10, v0.b[0]
	umov w12, v0.b[3]
	umov w13, v0.b[4]
	umov w14, v0.b[5]
	and w9, w9, #0x1
	and w11, w11, #0x1
	and w10, w10, #0x1
	and w12, w12, #0x1
	and w13, w13, #0x1
	and w14, w14, #0x1
	bfi w10, w9, #1, #1
	umov w9, v0.b[6]
	bfi w10, w11, #2, #1
	umov w11, v0.b[7]
	bfi w10, w12, #3, #1
	umov w12, v0.b[8]
	bfi w10, w13, #4, #1
	umov w13, v0.b[9]
	and w9, w9, #0x1
	bfi w10, w14, #5, #1
	umov w14, v0.b[10]
	and w11, w11, #0x1
	orr w9, w10, w9, lsl #6
	umov w10, v0.b[11]
	and w12, w12, #0x1
	orr w9, w9, w11, lsl #7
	umov w11, v0.b[12]
	and w13, w13, #0x1
	orr w9, w9, w12, lsl #8
	umov w12, v0.b[13]
	and w14, w14, #0x1
	orr w9, w9, w13, lsl #9
	umov w13, v0.b[14]
	and w10, w10, #0x1
	orr w9, w9, w14, lsl #10
	and w11, w11, #0x1
	orr w9, w9, w10, lsl #11
	and w10, w12, #0x1
	umov w12, v0.b[15]
	orr w9, w9, w11, lsl #12
	and w11, w13, #0x1
	orr w9, w9, w10, lsl #13
	orr w9, w9, w11, lsl #14
	orr w9, w9, w12, lsl #15
	bics wzr, w8, w9
	cset w0, eq

	add sp, sp, #16
	.cfi_def_cfa_offset 0
	ret

After

portable_simd_test::mask_all:
	movi v0.2d, #0xffffffffffffffff
	ldr q1, [x0]
	cmgt v0.16b, v1.16b, v0.16b
	umaxv b0, v0.16b
	fmov w8, s0
	mvn w8, w8
	and w0, w8, #0x1
	ret

But this does not seem to lead to an improvement for the is_hex function from rust-lang/portable-simd#303.

I haven't checked any other target platforms yet. Thanks for the pointer to the tests, I'll have a look at those.

@workingjubilee (Member):

Cool! That's weird, but also good to know.

@jhorstmann (Contributor Author):

That is_hex function from rust-lang/portable-simd#303 is so close to getting vectorized though.

pub fn is_hex_mask(chunk: &[u8; 16]) -> bool {
    let x = u8x16::from_array(*chunk);
    let m1 = x.simd_gt(splat(b'0' - 1));
    let m2 = x.simd_lt(splat(b'9' + 1));
    let m3 = x.simd_gt(splat(b'a' - 1));
    let m4 = x.simd_lt(splat(b'f' + 1));
    let m = (m1 & m2) | (m3 & m4);
    m.all()
}

With the changes from this PR, changing that last line to either one of the following alternatives gets vectorized:

    (m.to_int().simd_ne(splat(0))).all()
    !((!m).any())

And maybe the weirdest one

    fn mask_all(m: mask8x16) -> bool {
        m.all()
    }

    mask_all(m)

All these seem to help llvm in knowing that the bit pattern is all ones/zeros.

Maybe this is less of a problem on x86 since llvm knows which of the mask instructions only use the high bit, so the pattern of "shift to lowest, trunc to i1, i1 vector op" allows it to directly use the high bit. On aarch64, by contrast, the operations require a full mask, but with the shift this mask can be created with a simpler vector comparison.

@davidtwco (Member):

r? wg-llvm

@rustbot rustbot assigned nagisa and unassigned davidtwco Jan 7, 2023
@bors (Contributor) commented Jan 8, 2023

☔ The latest upstream changes (presumably #106573) made this pull request unmergeable. Please resolve the merge conflicts.

@jhorstmann jhorstmann force-pushed the use-high-bit-of-mask-for-select branch 2 times, most recently from 18516d7 to ca3971d on January 10, 2023 19:00
@nagisa (Member) left a comment

Is there a reason this remains in a draft state? I see that some tests have been added already and they now cover two targets. I believe showing that one or two targets produce better code is enough of a bar for this kind of change.


// assembly-output: emit-asm
// compile-flags: --crate-type=lib -O -C llvm-args=-x86-asm-syntax=intel
// only-x86_64
Member:

The tests could be deduplicated slightly if you made them #[no_core] and used revisions with --target=$TARGET (grep for revisions in the test suite for usage examples).

These tests look like they wouldn’t be too hard to adapt to #[no_core] either, given that Copy and Clone derives are superfluous…

Contributor Author:

I'm looking into combining the tests, but so far I could not get the CHECK annotations to work, or rather now they only work for one revision. Looking at runtest.rs I think the following should work:

// revisions: aarch64 x86_64
// [x86_64] compile-flags: --target=x86_64-unknown-linux-gnu -C llvm-args=-x86-asm-syntax=intel
// [x86_64] needs-llvm-components: x86_64
// [aarch64] compile-flags: --target=aarch64-unknown-linux-gnu
// [aarch64] needs-llvm-components: aarch64
// assembly-output: emit-asm
// compile-flags: --crate-type=lib -O

...

pub unsafe fn mask_reduce_all(m: mask8x16) -> bool {
    // CHECK,NONMSVC,x86_64-NOT: psllw
    // CHECK,NONMSVC,x86_64: pmovmskb
    // CHECK,NONMSVC,x86_64: foo
    // CHECK,NONMSVC,aarch64: umaxv
    // CHECK,NONMSVC,aarch64: foo
    simd_reduce_all(m)
}

The aarch64 checks seem to work, since I can cause them to fail by looking for foo. The x86_64 revision always passes though.

@nagisa (Member) commented Mar 29, 2023

I guess CHECK,NONMSVC,x86_64 comes from the --check-prefix argument to FileCheck – that just means that it will accept all of these prefixes (separated by comma) and ignore all others. But it won't understand a CHECK,NONMSVC,x86_64 prefix.

So writing CHECKs for specific revisions is just replacing CHECK with the name of the revision. So:

    // x86_64-NOT: psllw
    // x86_64: pmovmskb
    // x86_64: foo
    // aarch64: umaxv
    // aarch64: foo

I’m sorry we don't have good docs for this T_T

But I'm really less concerned about the duplication and more about the ability to run the x86_64 tests from e.g. an M1 Mac and vice versa, which would require them to be #[no_core] and drop the only-* annotations.
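For reference, a deduplicated #[no_core] version could look roughly like the sketch below (the header directives, lang items, needs-llvm-components values and CHECK patterns are written from memory and would need double-checking against existing tests in the suite):

// assembly-output: emit-asm
// compile-flags: --crate-type=lib -O
// revisions: x86_64 aarch64
// [x86_64] compile-flags: --target=x86_64-unknown-linux-gnu -C llvm-args=-x86-asm-syntax=intel
// [x86_64] needs-llvm-components: x86
// [aarch64] compile-flags: --target=aarch64-unknown-linux-gnu
// [aarch64] needs-llvm-components: aarch64

#![feature(no_core, lang_items, repr_simd, intrinsics)]
#![no_core]
#![allow(non_camel_case_types)]

// Minimal lang items so the crate builds without core.
#[lang = "sized"]
trait Sized {}
#[lang = "copy"]
trait Copy {}

#[repr(simd)]
pub struct mask8x16([i8; 16]);

extern "rust-intrinsic" {
    fn simd_reduce_all<T>(x: T) -> bool;
}

// CHECK-LABEL: mask_reduce_all
#[no_mangle]
pub unsafe extern "C" fn mask_reduce_all(m: mask8x16) -> bool {
    // Illustrative checks only:
    // x86_64-NOT: psllw
    // x86_64: pmovmskb
    // aarch64: umaxv
    simd_reduce_all(m)
}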

Contributor Author:

Thanks, I'll take another look at this; my previous assumption was that the whole prefix was matched as a single string.

@jhorstmann jhorstmann force-pushed the use-high-bit-of-mask-for-select branch from ca3971d to 9dd9391 on February 28, 2023 20:54
@jhorstmann (Contributor Author):

That was an interesting rebase, with the tests having moved in the meantime. Sorry for taking so long coming back to this PR.

@jhorstmann jhorstmann marked this pull request as ready for review February 28, 2023 20:59
@workingjubilee (Member):

Nagisa said they'll try to get to this soon. In the meantime, my experience is that these kinds of PRs are very prone to having something break during the cross-platform builds, so let's

@bors try

@bors (Contributor) commented Mar 23, 2023

⌛ Trying commit 9dd9391 with merge 8e6cd1d9198b904379afff0d143d7d037b710b61...

@bors (Contributor) commented Mar 23, 2023

☀️ Try build successful - checks-actions
Build commit: 8e6cd1d9198b904379afff0d143d7d037b710b61

@workingjubilee (Member):

Oh hm, that doesn't run the full CI (anymore?) (did it ever?); I should probably just finally start up an aarch64 machine and test this on that then.

@nagisa (Member) commented Mar 29, 2023

Since we already have a try build, it wouldn't hurt to have a
@rust-timer queue

@rust-timer (Collaborator):

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 29, 2023
@nagisa (Member) left a comment

Sorry for the extremely late review.

I see a couple examples where a plain trunc is replaced by a shift with a follow-up trunc. Can you expand more on why that is correct? At the very least Alive2 claims this is not equivalent. Do we always store the mask bit as the MSB when legalizing the argument into a passable type?

// CHECK: select <4 x i1>
// CHECK: [[A:%[0-9]+]] = lshr <4 x i8> %{{m|1}}, <i8 7, i8 7, i8 7, i8 7>
// CHECK: [[B:%[0-9]+]] = trunc <4 x i8> [[A]] to <4 x i1>
// CHECK: select <4 x i1> [[B]]
simd_select(m, a, b)
Member:

I believe machine-code-wise this ends up being a regression on x86_64, doesn't it? Seems like the optimizations proposed here are very dependent on the source test code...

Godbolt

Contributor Author:

Usually (or at least in portable_simd) the size of the mask elements is the same as that of the corresponding data elements. So if this example regresses but the one using 32-bit masks for f32 improves, I think that would be an ok tradeoff.

I think though that llvm should be able to still optimize this. Running the llvm bitcode through opt first transforms it into

define void @select(ptr noalias nocapture noundef writeonly sret(<4 x float>) dereferenceable(16) %v0, ptr noalias nocapture noundef readonly dereferenceable(4) %m, ptr noalias nocapture noundef readonly dereferenceable(16) %a, ptr noalias nocapture noundef readonly dereferenceable(16) %b) unnamed_addr #0 !dbg !6 {
  %v1 = load <4 x i8>, ptr %m, align 4
  %v2 = load <4 x float>, ptr %a, align 16
  %v3 = load <4 x float>, ptr %b, align 16
  %v4.not1 = icmp slt <4 x i8> %v1, zeroinitializer
  %v5 = select <4 x i1> %v4.not1, <4 x float> %v2, <4 x float> %v3
  store <4 x float> %v5, ptr %v0, align 16
  ret void, !dbg !11
}

Which then compiles to even shorter assembly, also removing the shift by 31 bits:

select_opt:                             # @select_opt
        mov     rax, rdi
        vpmovsxbd       xmm0, dword ptr [rsi]
        vmovaps xmm1, xmmword ptr [rcx]
        vblendvps       xmm0, xmm1, xmmword ptr [rdx], xmm0
        vmovaps xmmword ptr [rdi], xmm0
        ret

// SIMD vectors are passed directly, resulting in `%x` being a vector,
// while on others they're passed indirectly, resulting in `%x` being
// a pointer to a vector, and `%1` a vector loaded from that pointer.
// This is controlled by the target spec option `simd_types_indirect`.
Member:

This seems somewhat fickle, but I guess there isn't much in terms of alternatives here…

Contributor Author:

This comment and logic were copied from another codegen test, simd-intrinsic-generic-bitmask.rs. Since the code there has not changed, I assume there is still no better alternative.

@@ -1518,31 +1515,25 @@ fn generic_simd_intrinsic<'ll, 'tcx>(
assert_eq!(underlying_ty, non_ptr(element_ty0));

// The element type of the third argument must be a signed integer type of any width:
match element_ty2.kind() {
ty::Int(_) => (),
let mask_elem_bitwidth = match element_ty2.kind() {
Member:

This changes the simd_scatter intrinsic, but none of the tests that were touched in this PR involve it? Do we have no simd_scatter tests? Please add one or adjust an existing one if there is one.

@@ -1377,31 +1380,25 @@ fn generic_simd_intrinsic<'ll, 'tcx>(

// The element type of the third argument must be a signed integer type of any width:
let (_, element_ty2) = arg_tys[2].simd_size_and_type(bx.tcx());
match element_ty2.kind() {
ty::Int(_) => (),
let mask_elem_bitwidth = match element_ty2.kind() {
Member:

This changes the simd_gather intrinsic, but none of the tests that were touched in this PR involve it? Do we have no simd_gather tests? Please add one or adjust an existing one if there is one.

Member:

We do have existing simd_gather and simd_scatter tests, but I believe none of the existing tests are affected by this change. Is that a problem, exactly?

Member:

I would say that this exposes a hole/deficiency in the test suite coverage. Ideally we take care of it at the same time we find out about this, but I’d be okay with an issue describing tests that should be added or even just not doing anything if y'all are comfortable with not having test coverage here.

That said, seeing test changes would definitely have helped with reviewing this PR.

Contributor Author:

I just realized I still had some uncommitted changes to the tests locally :( will push those in a minute.

There were codegen tests for gather/scatter, but they only checked that the llvm intrinsic was correctly emitted and did not verify how the mask was converted to i1. I adjusted these tests.

What might be missing is more assembly tests for these intrinsics that verify no unnecessary shifts or other instructions get emitted.
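Roughly something like the following could cover gather (only a sketch: the intrinsic declaration, the avx2 target setup and the FileCheck patterns here are assumptions that would need verifying against real compiler output):

// assembly-output: emit-asm
// compile-flags: --crate-type=lib -O -C target-feature=+avx2 -C llvm-args=-x86-asm-syntax=intel
// only-x86_64

#![feature(repr_simd, intrinsics)]
#![allow(non_camel_case_types)]

#[repr(simd)]
pub struct f32x4([f32; 4]);
#[repr(simd)]
pub struct m32x4([i32; 4]);
#[repr(simd)]
pub struct pf32x4([*const f32; 4]);

extern "rust-intrinsic" {
    fn simd_gather<T, P, M>(values: T, pointers: P, mask: M) -> T;
}

// CHECK-LABEL: gather_f32x4
#[no_mangle]
pub unsafe extern "C" fn gather_f32x4(values: f32x4, ptrs: pf32x4, mask: m32x4) -> f32x4 {
    // The mask should be consumed via its sign bits, so no extra left shift
    // is expected before the hardware gather instruction.
    // CHECK-NOT: vpsll
    // CHECK: gather
    simd_gather(values, ptrs, mask)
}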

@apiraino (Contributor) commented May 3, 2023

Switching to waiting on author as it seems there has been a review. @jhorstmann Feel free to request a review with @rustbot ready, thanks!

@rustbot author

@rustbot rustbot removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 3, 2023
@apiraino apiraino removed the I-compiler-nominated Nominated for discussion during a compiler team meeting. label Jul 6, 2023
@bors (Contributor) commented Jul 29, 2023

☔ The latest upstream changes (presumably #114148) made this pull request unmergeable. Please resolve the merge conflicts.

@rust-log-analyzer

This comment has been minimized.

@bors (Contributor) commented Aug 1, 2023

☔ The latest upstream changes (presumably #105545) made this pull request unmergeable. Please resolve the merge conflicts.

@bors (Contributor) commented Oct 31, 2023

☔ The latest upstream changes (presumably #117444) made this pull request unmergeable. Please resolve the merge conflicts.

@Dylan-DPC (Member):

@jhorstmann any updates on this? thanks

@jhorstmann jhorstmann force-pushed the use-high-bit-of-mask-for-select branch from d17819e to 83a5dd4 on March 1, 2024 19:19
@jhorstmann (Contributor Author):

@Dylan-DPC I updated the PR and also adjusted the new masked load/store intrinsics to use the same logic. I also added another assembly test for masked load that shows there are no unneeded shift instructions in the output. The current output is:

load_f64x4:
        vpsllq  ymm0, ymmword ptr [rdi], 63
        vpmovq2m        k1, ymm0
        vmovupd ymm0 {k1} {z}, ymmword ptr [rsi]
        vmovapd ymmword ptr [rdx], ymm0
        vzeroupper
        ret

(https://rust.godbolt.org/z/ThTz6E39s)

@rust-log-analyzer

This comment has been minimized.

require!(
matches!(mask_elem.kind(), ty::Int(_)),
let m_elem_bitwidth = require_int_ty!(
mask_elem.kind(),
InvalidMonomorphization::ThirdArgElementType {
Contributor Author:

I think this should refer to the first argument, but there is a ui test for it and fixing it probably belongs in a separate PR.

@rust-log-analyzer

This comment has been minimized.

jhorstmann added a commit to jhorstmann/rust that referenced this pull request Mar 3, 2024
The tests show that the code generation currently uses the least
significant bits of <iX x N> vector masks when converting to <i1 x N>.
This leads to an additional left shift operation in the assembly for
x86, since mask operations on x86 operate based on the most significant
bit.

The exception is simd_bitmask, which already uses the most-significant
bit.

This additional instruction would be removed by the fix in rust-lang#104693,
which uses the most significant bit for all mask operations.
@jhorstmann (Contributor Author):

I opened #121953 with assembly tests showing the current behavior. If those are accepted as not causing problems for future llvm updates, I will rebase this PR on top of those tests.

@Dylan-DPC Dylan-DPC added S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 12, 2024
jhorstmann added a commit to jhorstmann/rust that referenced this pull request Mar 12, 2024
The tests show that the code generation currently uses the least
significant bits of <iX x N> vector masks when converting to <i1 x N>.
This leads to an additional left shift operation in the assembly for
x86, since mask operations on x86 operate based on the most significant
bit. On aarch64 the left shift is followed by a comparison against zero,
which repeats the sign bit across the whole lane.

The exception, which does not introduce an unneeded shift, is
simd_bitmask, because the code generation already shifts before
truncating.

This additional instruction would be removed by the fix in rust-lang#104693,
which uses the most significant bit for all mask operations.
jhorstmann added a commit to jhorstmann/rust that referenced this pull request Mar 12, 2024
The tests show that the code generation currently uses the least
significant bits of <iX x N> vector masks when converting to <i1 x N>.
This leads to an additional left shift operation in the assembly for
x86, since mask operations on x86 operate based on the most significant
bit. On aarch64 the left shift is followed by a comparison against zero,
which repeats the sign bit across the whole lane.

The exception, which does not introduce an unneeded shift, is
simd_bitmask, because the code generation already shifts before
truncating.

By using the "C" calling convention the tests should be stable regarding
changes in register allocation, but it is possible that future llvm
updates will require updating some of the checks.

This additional instruction would be removed by the fix in rust-lang#104693,
which uses the most significant bit for all mask operations.
workingjubilee added a commit to workingjubilee/rustc that referenced this pull request Mar 12, 2024
…ed-simd-instructions, r=workingjubilee

Add tests for the generated assembly of mask related simd instructions.

The tests show that the code generation currently uses the least significant bits of <iX x N> vector masks when converting to <i1 x N>. This leads to an additional left shift operation in the assembly for x86, since mask operations on x86 operate based on the most significant bit.

The exception is simd_bitmask, which already uses the most-significant bit.

This additional instruction would be removed by the changes in rust-lang#104693, which makes all mask operations consistently use the most significant bits.

By using the "C" calling convention the tests should be stable regarding changes in register allocation, but it is possible that future llvm updates will require updating some of the checks.
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Mar 12, 2024
Rollup merge of rust-lang#121953 - jhorstmann:assembly-tests-for-masked-simd-instructions, r=workingjubilee

@bors (Contributor) commented Mar 12, 2024

☔ The latest upstream changes (presumably #122389) made this pull request unmergeable. Please resolve the merge conflicts.

This improves the codegen for vector `select`, `gather`, `scatter` and
boolean reduction intrinsics and fixes rust-lang/portable-simd#316.

The current behavior of mask operations during llvm codegen is to
truncate the mask vector to <N x i1>, telling llvm to use the least
significant bit.

Since sse/avx instructions are defined to use the most significant bit,
llvm has to insert a left shift before the mask can actually be used.

Similarly on aarch64, mask operations like blend work bit by bit,
repeating the least significant bit across the whole lane involves
shifting it into the sign position and then comparing against zero.

By shifting before truncating to <N x i1>, we tell llvm that we only
consider the most significant bit, removing the need for additional
shift instructions in the assembly.
@jhorstmann jhorstmann force-pushed the use-high-bit-of-mask-for-select branch from a0e0528 to 7252870 on March 13, 2024 09:05
@jhorstmann (Contributor Author):

Rebased on top of the just-merged assembly tests from #121953; this hopefully makes the improvement clearer.

I also tried adding an assume that each lane is either all ones or all zeros, but that did not seem to affect codegen (https://rust.godbolt.org/z/fWbKrc78r):

pub unsafe extern "C" fn mask_reduce_all(m: mask8x16) -> bool {
    //let either_ones_or_zeroes: mask8x16 = simd_eq(m, simd_shr(m, mask8x16([7; 16])));
    let either_ones_or_zeroes: mask8x16 = simd_or(simd_eq(m, mask8x16([0; 16])), simd_eq(m, mask8x16([0xFF_u8 as i8; 16])));

    if simd_reduce_all(either_ones_or_zeroes) {
        simd_reduce_all(m)
    } else {
        unreachable()
    }
}

@Dylan-DPC Dylan-DPC added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 25, 2024
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Development

Successfully merging this pull request may close these issues.

Suboptimal codegen for mask8x32 all