Vectorize base:runtime.memory_* #4063

Draft · wants to merge 7 commits into master
Conversation

Feoramund (Contributor)

The speedups here aren't as profound as with index_byte, but there are some noticeable improvements with larger data sizes.

I'm going to leave this as a draft to await feedback. I'm mostly interested to hear if this has any impact in real-world programs.

I want to note that the original memory_compare* procs had no notion of unaligned loads. Were we just getting by without checking for alignment somehow?

odin test . -o:aggressive -disable-assert -no-bounds-check -define:ODIN_TEST_THREADS=1 -microarch:alderlake

Plain results
[INFO ] --- [2024-08-11 23:38:19] [runtime.odin:259:benchmark_plain_memory_compare_cold()]
        +++      15B | 101ns
        +++      16B | 127ns
        +++      17B | 45ns
        +++      31B | 56ns
        +++      32B | 49ns
        +++      33B | 45ns
        +++      63B | 69ns
        +++      64B | 62ns
        +++      65B | 36ns
        +++     256B | 64ns
        +++     512B | 78ns
        +++  1.00KiB | 99ns
        +++  1.00MiB | 90.804µs
        +++  1.00GiB | 73.176684ms
[INFO ] --- [2024-08-11 23:38:20] [runtime.odin:269:benchmark_plain_memory_compare_hot()]
        +++      15B | 26ns
        +++      16B | 19ns
        +++      17B | 21ns
        +++      31B | 20ns
        +++      32B | 16ns
        +++      33B | 17ns
        +++      63B | 25ns
        +++      64B | 19ns
        +++      65B | 18ns
        +++     256B | 23ns
        +++     512B | 36ns
        +++  1.00KiB | 47ns
        +++  1.00MiB | 38.926µs
        +++  1.00GiB | 73.154996ms
[INFO ] --- [2024-08-11 23:38:20] [runtime.odin:301:benchmark_plain_memory_compare_zero_cold()]
        +++      15B | 126ns
        +++      16B | 33ns
        +++      17B | 30ns
        +++      31B | 40ns
        +++      32B | 21ns
        +++      33B | 25ns
        +++      63B | 39ns
        +++      64B | 21ns
        +++      65B | 34ns
        +++     256B | 61ns
        +++     512B | 45ns
        +++  1.00KiB | 53ns
        +++  1.00MiB | 40.53µs
        +++  1.00GiB | 54.984574ms
[INFO ] --- [2024-08-11 23:38:21] [runtime.odin:311:benchmark_plain_memory_compare_zero_hot()]
        +++      15B | 15ns
        +++      16B | 17ns
        +++      17B | 21ns
        +++      31B | 21ns
        +++      32B | 16ns
        +++      33B | 23ns
        +++      63B | 22ns
        +++      64B | 18ns
        +++      65B | 18ns
        +++     256B | 26ns
        +++     512B | 39ns
        +++  1.00KiB | 52ns
        +++  1.00MiB | 36.948µs
        +++  1.00GiB | 54.865693ms
[INFO ] --- [2024-08-11 23:38:21] [runtime.odin:217:benchmark_plain_memory_equal_cold()]
        +++      15B | 192ns
        +++      16B | 40ns
        +++      17B | 59ns
        +++      31B | 36ns
        +++      32B | 40ns
        +++      33B | 21ns
        +++      63B | 38ns
        +++      64B | 36ns
        +++      65B | 32ns
        +++     256B | 75ns
        +++     512B | 126ns
        +++  1.00KiB | 224ns
        +++  1.00MiB | 226.184µs
        +++  1.00GiB | 240.402243ms
[INFO ] --- [2024-08-11 23:38:23] [runtime.odin:227:benchmark_plain_memory_equal_hot()]
        +++      15B | 21ns
        +++      16B | 22ns
        +++      17B | 22ns
        +++      31B | 31ns
        +++      32B | 28ns
        +++      33B | 24ns
        +++      63B | 35ns
        +++      64B | 32ns
        +++      65B | 30ns
        +++     256B | 69ns
        +++     512B | 122ns
        +++  1.00KiB | 228ns
        +++  1.00MiB | 214.525µs
        +++  1.00GiB | 237.592753ms
SIMD results
[INFO ] --- [2024-08-11 23:38:23] [runtime.odin:279:benchmark_simd_memory_compare_cold()]
        +++      15B | 90ns
        +++      16B | 31ns
        +++      17B | 46ns
        +++      31B | 47ns
        +++      32B | 47ns
        +++      33B | 50ns
        +++      63B | 47ns
        +++      64B | 36ns
        +++      65B | 38ns
        +++     256B | 62ns
        +++     512B | 46ns
        +++  1.00KiB | 43ns
        +++  1.00MiB | 50.884µs
        +++  1.00GiB | 62.515442ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:289:benchmark_simd_memory_compare_hot()]
        +++      15B | 34ns
        +++      16B | 24ns
        +++      17B | 23ns
        +++      31B | 33ns
        +++      32B | 24ns
        +++      33B | 33ns
        +++      63B | 37ns
        +++      64B | 20ns
        +++      65B | 19ns
        +++     256B | 25ns
        +++     512B | 25ns
        +++  1.00KiB | 35ns
        +++  1.00MiB | 29.662µs
        +++  1.00GiB | 62.706172ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:321:benchmark_simd_memory_compare_zero_cold()]
        +++      15B | 158ns
        +++      16B | 46ns
        +++      17B | 30ns
        +++      31B | 49ns
        +++      32B | 32ns
        +++      33B | 25ns
        +++      63B | 39ns
        +++      64B | 103ns
        +++      65B | 60ns
        +++     256B | 56ns
        +++     512B | 48ns
        +++  1.00KiB | 59ns
        +++  1.00MiB | 9.361µs
        +++  1.00GiB | 35.172623ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:331:benchmark_simd_memory_compare_zero_hot()]
        +++      15B | 32ns
        +++      16B | 25ns
        +++      17B | 23ns
        +++      31B | 31ns
        +++      32B | 24ns
        +++      33B | 24ns
        +++      63B | 34ns
        +++      64B | 34ns
        +++      65B | 32ns
        +++     256B | 33ns
        +++     512B | 33ns
        +++  1.00KiB | 37ns
        +++  1.00MiB | 8.932µs
        +++  1.00GiB | 34.838497ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:237:benchmark_simd_memory_equal_cold()]
        +++      15B | 181ns
        +++      16B | 50ns
        +++      17B | 51ns
        +++      31B | 35ns
        +++      32B | 32ns
        +++      33B | 25ns
        +++      63B | 37ns
        +++      64B | 42ns
        +++      65B | 33ns
        +++     256B | 52ns
        +++     512B | 23ns
        +++  1.00KiB | 40ns
        +++  1.00MiB | 45.397µs
        +++  1.00GiB | 62.699833ms
[INFO ] --- [2024-08-11 23:38:25] [runtime.odin:247:benchmark_simd_memory_equal_hot()]
        +++      15B | 30ns
        +++      16B | 27ns
        +++      17B | 24ns
        +++      31B | 29ns
        +++      32B | 31ns
        +++      33B | 34ns
        +++      63B | 36ns
        +++      64B | 12ns
        +++      65B | 16ns
        +++     256B | 23ns
        +++     512B | 23ns
        +++  1.00KiB | 30ns
        +++  1.00MiB | 47.923µs
        +++  1.00GiB | 62.565219ms

@Kelimion (Member)

At first glance it looks like the SIMD version is slower until somewhere between 64 and 256 bytes, so it should probably remain scalar for small sizes.

@Feoramund (Contributor, Author)

At first glance it looks like the SIMD version is slower until somewhere between 64 and 256 bytes, so it should probably remain scalar for small sizes.

Do you mean adding a specific path for lengths < SIMD_SCAN_WIDTH that is explicitly scalar and exits early? It otherwise iterates as a scalar anyway, by virtue of the second loop being unable to perform any further iterations after the first.

I was wondering if it might be a good idea to have variable-width SIMD scanners for something like this too, or defaulting to a smaller register size, like 128 bits.
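Something like the following, perhaps? (A minimal sketch only: SIMD_SCAN_WIDTH mirrors the constant discussed here, while compare_scalar and compare_wide are hypothetical stand-ins, not procs from this PR.)

    package sketch

    // A minimal sketch of the kind of dispatch being discussed; nothing below is
    // code from this PR. SIMD_SCAN_WIDTH is 64 bytes on most 64-bit targets per
    // the discussion above.
    SIMD_SCAN_WIDTH :: 64

    memory_compare :: proc(a, b: rawptr, n: int) -> int {
        if n < SIMD_SCAN_WIDTH {
            // Small inputs: skip SIMD setup entirely and go byte-by-byte.
            return compare_scalar(a, b, n)
        }
        // Larger inputs: the chunked, wide comparison path.
        return compare_wide(a, b, n)
    }

    compare_scalar :: proc(a, b: rawptr, n: int) -> int {
        x, y := ([^]u8)(a), ([^]u8)(b)
        for i in 0..<n {
            if x[i] != y[i] {
                return -1 if x[i] < y[i] else +1
            }
        }
        return 0
    }

    compare_wide :: proc(a, b: rawptr, n: int) -> int {
        // Placeholder standing in for the vectorized path from the PR.
        return compare_scalar(a, b, n)
    }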

@Kelimion (Member)

Kelimion commented Aug 12, 2024

Well, it would be great if the SIMD timings were formatted as relative to the scalar results, e.g. (made-up factors):

[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:289:benchmark_simd_memory_compare_hot()]
        +++     Size |  SIMD (Scalar, relative)
        +++      15B | 34ns (23ns, 1.2x)
        +++      16B | 24ns
        +++      17B | 23ns
        +++      31B | 33ns
        +++      32B | 24ns
        +++      33B | 33ns
        +++      63B | 37ns
        +++      64B | 20ns (23ns, 0.92x)
        +++      65B | 19ns
        +++     256B | 25ns
        +++     512B | 25ns
        +++  1.00KiB | 35ns
        +++  1.00MiB | 29.662µs
        +++  1.00GiB | 62.706172ms

At first blush it looks like the SIMD paths aren't really any faster until around 64 bytes. Once we can compare the numbers side by side instead of scrolling up 2 pages, it'll be possible to see where the scalar cut-off should be.

At a glance it appears that it would be sensible to stay scalar until the next aligned size that's greater than or equal to 64 bytes and go wide after that. Or alternatively, if n <= 64 { scalar } else { wide }.

Would it be a lot of work to perform the two sets of timing first, and then present them in a core:text/table or something so you can see the time in ns for scalar and SIMD side by side, as well as the percentage slowdown/speedup compared to scalar?

Because as you intimated, it's not a clear 10x win across the board, so we need to know where the sensible cut-off actually is.

@Feoramund (Contributor, Author)

At first blush it looks like the SIMD paths aren't really any faster until around 64 bytes.

That's to be expected, since it's iterating as a scalar up until SIMD_SCAN_WIDTH, which should be 64 bytes for most 64-bit platforms. index_byte works the same way.

Would it be a lot of work to perform the two sets of timing first, and then present them in a core:text/table or something so you can see the time in ns for scalar and SIMD side by side, as well as the percentage slowdown/speedup compared to scalar?

I'll see what I can do about a more sensible output later. It's just a matter of getting all the data points at once, then composing the output.
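Roughly along these lines, I imagine (a sketch only: Timing_Pair and print_comparison are made-up names, and this uses core:fmt rather than core:text/table):

    package sketch

    import "core:fmt"
    import "core:time"

    // Hypothetical container for one benchmark size's results.
    Timing_Pair :: struct {
        size:   int,
        scalar: time.Duration,
        simd:   time.Duration,
    }

    // Print both timings side by side along with SIMD time relative to scalar.
    print_comparison :: proc(pairs: []Timing_Pair) {
        fmt.println("size_bytes | simd_ns | scalar_ns | relative")
        for p in pairs {
            simd_ns   := f64(time.duration_nanoseconds(p.simd))
            scalar_ns := f64(time.duration_nanoseconds(p.scalar))
            fmt.printf("%d | %.0f | %.0f | %.2fx\n", p.size, simd_ns, scalar_ns, simd_ns / scalar_ns)
        }
    }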

@Yawning (Contributor)

Yawning commented Aug 13, 2024

I was wondering if it might be a good idea to have variable-width SIMD scanners for something like this too, or defaulting to a smaller register size, like 128 bits.

This depends on "what do we want to do with AVX2", as that is the only currently supported SIMD instruction set that is not 128-bits. ARM SVE/SVE2 allows wider, but we do not support it (and neither does Apple silicon). AVX2 has some.... "interesting" properties on pre-Ice Lake Intel hardware.

I want to note that the original memory_compare* procs had no notion of unaligned loads. Were we just getting by without checking for alignment somehow?

We currently do not test on targets that are picky about non-SIMD alignment: ARM starting with v8 (both 32- and 64-bit) is fine with it, x86/amd64 has always been fine with it, and WASM is fine with it. Looking forward, off the top of my head, unaligned loads would cause problems on ARMv7 (Raspi <= 3), PPC64/POWER, and probably RISC-V.

As a side note, while not applicable to this PR, I'm kind of unconvinced that forcing alignment before entering the bulk path is worth the added complexity/code size (at least on Intel architectures), as:

  • Memory operand alignment restrictions are relaxed with VEX encoding (AVX, unless you are doing VMOVA).
  • MOVA vs MOVU has identical performance on aligned data (Nehalem/Bulldozer or later).
  • The penalty for an unaligned load/store is 1 cycle if it straddles a cache line (64-byte boundary).
  • The penalty for an unaligned load/store is 10 cycles if it straddles a page boundary (Skylake or later, ~200 cycles on older microarchitectures, but those are e-waste).

@Feoramund (Contributor, Author)

This depends on "what do we want to do with AVX2", as that is the only currently supported SIMD instruction set that is not 128-bits. ARM SVE/SVE2 allows wider, but we do not support it (and neither does Apple silicon). AVX2 has some.... "interesting" properties on pre-Ice Lake Intel hardware.

I should clarify that when I said register, I was referring to the chunk. Regarding how we use #simd[SIMD_SCAN_WIDTH]u8, which ends up being [64]u8 (512 bits) on most 64-bit platforms, I was wondering if it might be better to use something like #simd[N]u8, where N could be 8 or 16 in certain cases, since memory comparisons outside of strings may not be long enough to benefit from the typical 64-byte length requirement of the SIMD scanner.
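For illustration, roughly this kind of thing (the names and widths below are placeholders I'm making up here, not anything from the PR):

    package sketch

    SIMD_SCAN_WIDTH :: 64 // the current chunk: 64 bytes (512 bits) on most 64-bit targets

    when ODIN_ARCH == .amd64 || ODIN_ARCH == .arm64 {
        SMALL_CHUNK_WIDTH :: 16 // one 128-bit register's worth
    } else {
        SMALL_CHUNK_WIDTH :: 8
    }

    Wide_Chunk  :: #simd[SIMD_SCAN_WIDTH]u8
    Small_Chunk :: #simd[SMALL_CHUNK_WIDTH]u8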

We currently do not test on targets that are picky about non-SIMD alignment: ARM starting with v8 (both 32- and 64-bit) is fine with it, x86/amd64 has always been fine with it, and WASM is fine with it. Looking forward, off the top of my head, unaligned loads would cause problems on ARMv7 (Raspi <= 3), PPC64/POWER, and probably RISC-V.

Interesting.

As a side note, while not applicable to this PR, I'm kind of unconvinced that forcing alignment before entering the bulk path is worth the added complexity/code size (at least on Intel architectures), as:

  • Memory operand alignment restrictions are relaxed with VEX encoding (AVX, unless you are doing VMOVA).
  • MOVA vs MOVU has identical performance on aligned data (Nehalem/Bulldozer or later).
  • The penalty for an unaligned load/store is 1 cycle if it straddles a cache line (64-byte boundary).
  • The penalty for an unaligned load/store is 10 cycles if it straddles a page boundary (Skylake or later, ~200 cycles on older microarchitectures, but those are e-waste).

If not applicable to this PR, are you saying this comment applies more to the index_byte implementation? Here in this PR, memory_compare_zero at least forces memory alignment, because there's no secondary line of memory to compare against, just zeroes.

Are you suggesting to instead iterate as a scalar until the length is aligned to the scanner width, then do unaligned loads the full way through?

Either way, I find this sort of information fascinating, so thank you for your input. I'm aware of Agner Fog's work, though I haven't read into it too deeply yet. Do you have any other resources on this topic to share?

@Yawning (Contributor)

Yawning commented Aug 14, 2024

nb: I will be explicitly ignoring i386 for the purpose of this discussion, along with ARM SVE2 (because of lack of Odin support).

This depends on "what do we want to do with AVX2", as that is the only currently supported SIMD instruction set that is not 128-bits. ARM SVE/SVE2 allows wider, but we do not support it (and neither does Apple silicon). AVX2 has some.... "interesting" properties on pre-Ice Lake Intel hardware.

I should clarify that when I said register, I was referring to the chunk. Regarding how we use #simd[SIMD_SCAN_WIDTH]u8, which ends up being [64]u8 (512 bits) on most 64-bit platforms, I was wondering if it might be better to use something like #simd[N]u8, where N could be 8 or 16 in certain cases, since memory comparisons outside of strings may not be long enough to benefit from the typical 64-byte length requirement of the SIMD scanner.

So the way this PR (and the previous PR) are written, we are telling LLVM to do 512-bit vector operations. In most situations this causes a disconnect between the chunk size and what the actual hardware is capable of (128-bit SIMD on wasm/ARM/vanilla amd64, 256-bit SIMD on modern amd64, 512-bit SIMD on certain Intel/Ryzen).

So this is roughly what happens, depending on the target microarch:

  • WASM (without the simd128): GP registers to emulate SIMD (and does an ok job)
  • amd64 (no microarch specified): 128-bit vector ops, no VEX encoding
  • Alder Lake: 256-bit vector ops, AVX2
  • Zen 4: 512-bit vector ops, AVX512

This is both good and bad. It is good in the sense that "the best possible thing gets used", and bad in the sense that "we probably should be avoiding AVX512 for now" (anything that predates Ice Lake/Zen 4 downclocks, etc.). This also depends on quite a bit of LLVM magic, and as a matter of preference I tend to favor writing code like this that mirrors the underlying hardware.

If not applicable to this PR, are you saying this comment applies more to the index_byte implementation? Here in this PR, memory_compare_zero at least forces memory alignment, because there's no secondary line of memory to compare against, just zeroes.

Are you suggesting to instead iterate as a scalar until the length is aligned to the scanner width, then do unaligned loads the full way through?

Referring to the previous PR, and my comment about "writing code that is true to the underlying hardware", something like this is what I would have done:

https://gist.github.com/Yawning/2f7826f10a5d7af813d9b9b7121eb117

In each of the SIMD while loops, I just do unaligned loads. On anything recent, intuitively, I doubt the difference is large enough to matter, and simplicity is nice, but this depends on "what is the expected typical input size". With a sufficiently non-suck -microarch (anything Sandy Bridge or later), the unaligned loads get converted to a memory operand.

This does give up on AVX512, but support for that is spotty and it's a foot+gun on things that aren't Ice Lake/Rocket Lake/Zen 4.
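To make the shape concrete, a minimal 128-bit-chunk sketch with unaligned loads (my own illustration with made-up names, not code lifted from the gist):

    package sketch

    import "base:intrinsics"

    // Illustrative only: compare two buffers 16 bytes (one 128-bit chunk) at a
    // time using unaligned loads, with a plain byte loop for the tail.
    memory_equal_16 :: proc(x, y: rawptr, n: int) -> bool {
        a, b := ([^]u8)(x), ([^]u8)(y)
        i := 0
        for ; i + 16 <= n; i += 16 {
            va := intrinsics.unaligned_load((^#simd[16]u8)(&a[i]))
            vb := intrinsics.unaligned_load((^#simd[16]u8)(&b[i]))
            // Each differing lane becomes 0xFF; any nonzero OR means a mismatch.
            if intrinsics.simd_reduce_or(intrinsics.simd_lanes_ne(va, vb)) != 0 {
                return false
            }
        }
        for ; i < n; i += 1 {
            if a[i] != b[i] {
                return false
            }
        }
        return true
    }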

Either way, I find this sort of information fascinating, so thank you for your input. I'm aware of Agner Fog's work, though I haven't read into it too deeply yet. Do you have any other resources on this topic to share?

Intel's optimization reference is good (https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)

ps: Hopefully this doesn't come off as too nitpicky.
