Vectorize base:runtime.memory_* #4063

Draft · wants to merge 7 commits into master
Conversation

Feoramund (Contributor)

The speedups here aren't as profound as with index_byte, but there are some noticeable improvements with larger data sizes.

I'm going to leave this as a draft to await feedback. I'm mostly interested to hear if this has any impact in real-world programs.

I want to note that the original memory_compare* procs had no notion of unaligned loads. Were we just getting by without checking for alignment somehow?

odin test . -o:aggressive -disable-assert -no-bounds-check -define:ODIN_TEST_THREADS=1 -microarch:alderlake

Plain results
[INFO ] --- [2024-08-11 23:38:19] [runtime.odin:259:benchmark_plain_memory_compare_cold()]
        +++      15B | 101ns
        +++      16B | 127ns
        +++      17B | 45ns
        +++      31B | 56ns
        +++      32B | 49ns
        +++      33B | 45ns
        +++      63B | 69ns
        +++      64B | 62ns
        +++      65B | 36ns
        +++     256B | 64ns
        +++     512B | 78ns
        +++  1.00KiB | 99ns
        +++  1.00MiB | 90.804µs
        +++  1.00GiB | 73.176684ms
[INFO ] --- [2024-08-11 23:38:20] [runtime.odin:269:benchmark_plain_memory_compare_hot()]
        +++      15B | 26ns
        +++      16B | 19ns
        +++      17B | 21ns
        +++      31B | 20ns
        +++      32B | 16ns
        +++      33B | 17ns
        +++      63B | 25ns
        +++      64B | 19ns
        +++      65B | 18ns
        +++     256B | 23ns
        +++     512B | 36ns
        +++  1.00KiB | 47ns
        +++  1.00MiB | 38.926µs
        +++  1.00GiB | 73.154996ms
[INFO ] --- [2024-08-11 23:38:20] [runtime.odin:301:benchmark_plain_memory_compare_zero_cold()]
        +++      15B | 126ns
        +++      16B | 33ns
        +++      17B | 30ns
        +++      31B | 40ns
        +++      32B | 21ns
        +++      33B | 25ns
        +++      63B | 39ns
        +++      64B | 21ns
        +++      65B | 34ns
        +++     256B | 61ns
        +++     512B | 45ns
        +++  1.00KiB | 53ns
        +++  1.00MiB | 40.53µs
        +++  1.00GiB | 54.984574ms
[INFO ] --- [2024-08-11 23:38:21] [runtime.odin:311:benchmark_plain_memory_compare_zero_hot()]
        +++      15B | 15ns
        +++      16B | 17ns
        +++      17B | 21ns
        +++      31B | 21ns
        +++      32B | 16ns
        +++      33B | 23ns
        +++      63B | 22ns
        +++      64B | 18ns
        +++      65B | 18ns
        +++     256B | 26ns
        +++     512B | 39ns
        +++  1.00KiB | 52ns
        +++  1.00MiB | 36.948µs
        +++  1.00GiB | 54.865693ms
[INFO ] --- [2024-08-11 23:38:21] [runtime.odin:217:benchmark_plain_memory_equal_cold()]
        +++      15B | 192ns
        +++      16B | 40ns
        +++      17B | 59ns
        +++      31B | 36ns
        +++      32B | 40ns
        +++      33B | 21ns
        +++      63B | 38ns
        +++      64B | 36ns
        +++      65B | 32ns
        +++     256B | 75ns
        +++     512B | 126ns
        +++  1.00KiB | 224ns
        +++  1.00MiB | 226.184µs
        +++  1.00GiB | 240.402243ms
[INFO ] --- [2024-08-11 23:38:23] [runtime.odin:227:benchmark_plain_memory_equal_hot()]
        +++      15B | 21ns
        +++      16B | 22ns
        +++      17B | 22ns
        +++      31B | 31ns
        +++      32B | 28ns
        +++      33B | 24ns
        +++      63B | 35ns
        +++      64B | 32ns
        +++      65B | 30ns
        +++     256B | 69ns
        +++     512B | 122ns
        +++  1.00KiB | 228ns
        +++  1.00MiB | 214.525µs
        +++  1.00GiB | 237.592753ms
SIMD results
[INFO ] --- [2024-08-11 23:38:23] [runtime.odin:279:benchmark_simd_memory_compare_cold()]
        +++      15B | 90ns
        +++      16B | 31ns
        +++      17B | 46ns
        +++      31B | 47ns
        +++      32B | 47ns
        +++      33B | 50ns
        +++      63B | 47ns
        +++      64B | 36ns
        +++      65B | 38ns
        +++     256B | 62ns
        +++     512B | 46ns
        +++  1.00KiB | 43ns
        +++  1.00MiB | 50.884µs
        +++  1.00GiB | 62.515442ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:289:benchmark_simd_memory_compare_hot()]
        +++      15B | 34ns
        +++      16B | 24ns
        +++      17B | 23ns
        +++      31B | 33ns
        +++      32B | 24ns
        +++      33B | 33ns
        +++      63B | 37ns
        +++      64B | 20ns
        +++      65B | 19ns
        +++     256B | 25ns
        +++     512B | 25ns
        +++  1.00KiB | 35ns
        +++  1.00MiB | 29.662µs
        +++  1.00GiB | 62.706172ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:321:benchmark_simd_memory_compare_zero_cold()]
        +++      15B | 158ns
        +++      16B | 46ns
        +++      17B | 30ns
        +++      31B | 49ns
        +++      32B | 32ns
        +++      33B | 25ns
        +++      63B | 39ns
        +++      64B | 103ns
        +++      65B | 60ns
        +++     256B | 56ns
        +++     512B | 48ns
        +++  1.00KiB | 59ns
        +++  1.00MiB | 9.361µs
        +++  1.00GiB | 35.172623ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:331:benchmark_simd_memory_compare_zero_hot()]
        +++      15B | 32ns
        +++      16B | 25ns
        +++      17B | 23ns
        +++      31B | 31ns
        +++      32B | 24ns
        +++      33B | 24ns
        +++      63B | 34ns
        +++      64B | 34ns
        +++      65B | 32ns
        +++     256B | 33ns
        +++     512B | 33ns
        +++  1.00KiB | 37ns
        +++  1.00MiB | 8.932µs
        +++  1.00GiB | 34.838497ms
[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:237:benchmark_simd_memory_equal_cold()]
        +++      15B | 181ns
        +++      16B | 50ns
        +++      17B | 51ns
        +++      31B | 35ns
        +++      32B | 32ns
        +++      33B | 25ns
        +++      63B | 37ns
        +++      64B | 42ns
        +++      65B | 33ns
        +++     256B | 52ns
        +++     512B | 23ns
        +++  1.00KiB | 40ns
        +++  1.00MiB | 45.397µs
        +++  1.00GiB | 62.699833ms
[INFO ] --- [2024-08-11 23:38:25] [runtime.odin:247:benchmark_simd_memory_equal_hot()]
        +++      15B | 30ns
        +++      16B | 27ns
        +++      17B | 24ns
        +++      31B | 29ns
        +++      32B | 31ns
        +++      33B | 34ns
        +++      63B | 36ns
        +++      64B | 12ns
        +++      65B | 16ns
        +++     256B | 23ns
        +++     512B | 23ns
        +++  1.00KiB | 30ns
        +++  1.00MiB | 47.923µs
        +++  1.00GiB | 62.565219ms

@Kelimion (Member)

At first glance it looks like the SIMD version is slower until somewhere between 64 and 256 bytes, so it should probably remain scalar for small sizes.

@Feoramund (Contributor, Author)

At first glance it looks like the SIMD version is slower until somewhere between 64 and 256 bytes, so it should probably remain scalar for small sizes.

Do you mean adding a specific path for lengths < SIMD_SCAN_WIDTH that is explicitly scalar and exits early? It otherwise iterates as a scalar anyway, by virtue of the second loop being unable to perform any further iterations after the first.

I was wondering if it might be a good idea to have variable-width SIMD scanners for something like this too, or defaulting to a smaller register size, like 128 bits.
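Something like the following, perhaps? (A minimal sketch only: SIMD_SCAN_WIDTH mirrors the constant discussed here, while compare_scalar and compare_wide are hypothetical stand-ins, not procs from this PR.)

    package sketch

    // A minimal sketch of the kind of dispatch being discussed; nothing below is
    // code from this PR. SIMD_SCAN_WIDTH is 64 bytes on most 64-bit targets per
    // the discussion above.
    SIMD_SCAN_WIDTH :: 64

    memory_compare :: proc(a, b: rawptr, n: int) -> int {
        if n < SIMD_SCAN_WIDTH {
            // Small inputs: skip SIMD setup entirely and go byte-by-byte.
            return compare_scalar(a, b, n)
        }
        // Larger inputs: the chunked, wide comparison path.
        return compare_wide(a, b, n)
    }

    compare_scalar :: proc(a, b: rawptr, n: int) -> int {
        x, y := ([^]u8)(a), ([^]u8)(b)
        for i in 0..<n {
            if x[i] != y[i] {
                return -1 if x[i] < y[i] else +1
            }
        }
        return 0
    }

    compare_wide :: proc(a, b: rawptr, n: int) -> int {
        // Placeholder standing in for the vectorized path from the PR.
        return compare_scalar(a, b, n)
    }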

@Kelimion (Member)

Kelimion commented Aug 12, 2024

Well, it would be great if the SIMD timings were formatted as relative to the scalar results, e.g. (made-up factors):

[INFO ] --- [2024-08-11 23:38:24] [runtime.odin:289:benchmark_simd_memory_compare_hot()]
        +++     Size |  SIMD (Scalar, relative)
        +++      15B | 34ns (23ns, 1.2x)
        +++      16B | 24ns
        +++      17B | 23ns
        +++      31B | 33ns
        +++      32B | 24ns
        +++      33B | 33ns
        +++      63B | 37ns
        +++      64B | 20ns (23ns, 0.92x)
        +++      65B | 19ns
        +++     256B | 25ns
        +++     512B | 25ns
        +++  1.00KiB | 35ns
        +++  1.00MiB | 29.662µs
        +++  1.00GiB | 62.706172ms

At first blush it looks like the SIMD paths aren't really any faster until around 64 bytes. Once we can compare the numbers side by side instead of scrolling up 2 pages, it'll be possible to see where the scalar cut-off should be.

At a glance it appears that it would be sensible to stay scalar until the next aligned size that's greater than or equal to 64 bytes and go wide after that. Or alternatively, if n <= 64 { scalar } else { wide }.

Would it be a lot of work to perform the two sets of timing first, and then present them in a core:text/table or something so you can see the time in ns for scalar and SIMD side by side, as well as the percentage slowdown/speedup compared to scalar?

Because as you intimated, it's not a clear 10x win across the board, so we need to know where the sensible cut-off actually is.

@Feoramund (Contributor, Author)

At first blush it looks like the SIMD paths aren't really any faster until around 64 bytes.

That's to be expected, since it's iterating as a scalar up until SIMD_SCAN_WIDTH, which should be 64 bytes for most 64-bit platforms. index_byte works the same way.

Would it be a lot of work to perform the two sets of timing first, and then present them in a core:text/table or something so you can see the time in ns for scalar and SIMD side by side, as well as the percentage slowdown/speedup compared to scalar?

I'll see what I can do about a more sensible output later. It's just a matter of getting all the data points at once, then composing the output.
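Roughly along these lines, I imagine (a sketch only: Timing_Pair and print_comparison are made-up names, and this uses core:fmt rather than core:text/table):

    package sketch

    import "core:fmt"
    import "core:time"

    // Hypothetical container for one benchmark size's results.
    Timing_Pair :: struct {
        size:   int,
        scalar: time.Duration,
        simd:   time.Duration,
    }

    // Print both timings side by side along with SIMD time relative to scalar.
    print_comparison :: proc(pairs: []Timing_Pair) {
        fmt.println("size_bytes | simd_ns | scalar_ns | relative")
        for p in pairs {
            simd_ns   := f64(time.duration_nanoseconds(p.simd))
            scalar_ns := f64(time.duration_nanoseconds(p.scalar))
            fmt.printf("%d | %.0f | %.0f | %.2fx\n", p.size, simd_ns, scalar_ns, simd_ns / scalar_ns)
        }
    }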

@Yawning (Contributor)

Yawning commented Aug 13, 2024

I was wondering if it might be a good idea to have variable-width SIMD scanners for something like this too, or defaulting to a smaller register size, like 128 bits.

This depends on "what do we want to do with AVX2", as that is the only currently supported SIMD instruction set that is not 128-bits. ARM SVE/SVE2 allows wider, but we do not support it (and neither does Apple silicon). AVX2 has some.... "interesting" properties on pre-Ice Lake Intel hardware.

I want to note that the original memory_compare* procs had no notion of unaligned loads. Were we just getting by without checking for alignment somehow?

We currently do not test on targets that are picky about non-SIMD alignment: ARM starting with v8 (both 32- and 64-bit) is fine with it, x86/amd64 has always been fine with it, and WASM is fine with it. Looking forward, off the top of my head, unaligned loads would cause problems on ARMv7 (Raspi <= 3), PPC64/POWER, and probably RISC-V.

As a side note, while not applicable to this PR, I'm kind of unconvinced that forcing alignment before entering the bulk path is worth the added complexity/code size (at least on Intel architectures), as:

  • Memory operand alignment restrictions are relaxed with VEX encoding (AVX, unless you are doing VMOVA).
  • MOVA vs MOVU has identical performance on aligned data (Nehalem/Bulldozer or later).
  • The penalty for an unaligned load/store is 1 cycle if it straddles a cache line (64-byte boundary).
  • The penalty for an unaligned load/store is 10 cycles if it straddles a page boundary (Skylake or later, ~200 cycles on older microarchitectures, but those are e-waste).

@Feoramund (Contributor, Author)

This depends on "what do we want to do with AVX2", as that is the only currently supported SIMD instruction set that is not 128-bits. ARM SVE/SVE2 allows wider, but we do not support it (and neither does Apple silicon). AVX2 has some.... "interesting" properties on pre-Ice Lake Intel hardware.

I should clarify that when I said register, I was referring to the chunk. Regarding how we use #simd[SIMD_SCAN_WIDTH]u8, which ends up being [64]u8 (512 bits) on most 64-bit platforms, I was wondering if it might be better to use something like #simd[N]u8, where N could be 8 or 16 in certain cases, since memory comparisons outside of strings may not be long enough to benefit from the typical 64-byte length requirement of the SIMD scanner.
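For illustration, roughly this kind of thing (the names and widths below are placeholders I'm making up here, not anything from the PR):

    package sketch

    SIMD_SCAN_WIDTH :: 64 // the current chunk: 64 bytes (512 bits) on most 64-bit targets

    when ODIN_ARCH == .amd64 || ODIN_ARCH == .arm64 {
        SMALL_CHUNK_WIDTH :: 16 // one 128-bit register's worth
    } else {
        SMALL_CHUNK_WIDTH :: 8
    }

    Wide_Chunk  :: #simd[SIMD_SCAN_WIDTH]u8
    Small_Chunk :: #simd[SMALL_CHUNK_WIDTH]u8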

We currently do not test on targets that are picky about non-SIMD alignment: ARM starting with v8 (both 32- and 64-bit) is fine with it, x86/amd64 has always been fine with it, and WASM is fine with it. Looking forward, off the top of my head, unaligned loads would cause problems on ARMv7 (Raspi <= 3), PPC64/POWER, and probably RISC-V.

Interesting.

As a side note, while not applicable to this PR, I'm kind of unconvinced that forcing alignment before entering the bulk path is worth the added complexity/code size (at least on Intel architectures), as:

  • Memory operand alignment restrictions are relaxed with VEX encoding (AVX, unless you are doing VMOVA).
  • MOVA vs MOVU has identical performance on aligned data (Nehalem/Bulldozer or later).
  • The penalty for an unaligned load/store is 1 cycle if it straddles a cache line (64-byte boundary).
  • The penalty for an unaligned load/store is 10 cycles if it straddles a page boundary (Skylake or later, ~200 cycles on older microarchitectures, but those are e-waste).

If not applicable to this PR, are you saying this comment applies more to the index_byte implementation? Here in this PR, memory_compare_zero at least forces memory alignment, because there's no secondary line of memory to compare against, just zeroes.

Are you suggesting to instead iterate as a scalar until the length is aligned to the scanner width, then do unaligned loads the full way through?

Either way, I find this sort of information fascinating, so thank you for your input. I'm aware of Agner Fog's work, though I haven't read into it too deeply yet. Do you have any other resources on this topic to share?

@Yawning (Contributor)

Yawning commented Aug 14, 2024

nb: I will be explicitly ignoring i386 for the purpose of this discussion, along with ARM SVE2 (because of lack of Odin support).

This depends on "what do we want to do with AVX2", as that is the only currently supported SIMD instruction set that is not 128-bits. ARM SVE/SVE2 allows wider, but we do not support it (and neither does Apple silicon). AVX2 has some.... "interesting" properties on pre-Ice Lake Intel hardware.

I should clarify that when I said register, I was referring to the chunk. Regarding how we use #simd[SIMD_SCAN_WIDTH]u8, which ends up being [64]u8 (512 bits) on most 64-bit platforms, I was wondering if it might be better to use something like #simd[N]u8, where N could be 8 or 16 in certain cases, since memory comparisons outside of strings may not be long enough to benefit from the typical 64-byte length requirement of the SIMD scanner.

So the way this PR (and the previous PR) are written, we are telling LLVM to do 512-bit vector operations. In most situations this causes a disconnect between the chunk size and what the actual hardware is capable of (128-bit SIMD on wasm/ARM/vanilla amd64, 256-bit SIMD on modern amd64, 512-bit SIMD on certain Intel/Ryzen).

So this is roughly what happens, depending on the target microarch:

  • WASM (without the simd128): GP registers to emulate SIMD (and does an ok job)
  • amd64 (no microarch specified): 128-bit vector ops, no VEX encoding
  • Alder Lake: 256-bit vector ops, AVX2
  • Zen 4: 512-bit vector ops, AVX512

This is both good and bad. It is good in the sense that "the best possible thing gets used", and bad in the sense that "we probably should be avoiding AVX512 for now" (anything that predates Ice Lake/Zen 4 downclocks, etc.). This also depends on quite a bit of LLVM magic, and as a matter of preference I tend to favor writing code like this that mirrors the underlying hardware.

If not applicable to this PR, are you saying this comment applies more to the index_byte implementation? Here in this PR, memory_compare_zero at least forces memory alignment, because there's no secondary line of memory to compare against, just zeroes.

Are you suggesting to instead iterate as a scalar until the length is aligned to the scanner width, then do unaligned loads the full way through?

Referring to the previous PR, and my comment about "writing code that is true to the underlying hardware", something like this is what I would have done:

https://gist.github.com/Yawning/2f7826f10a5d7af813d9b9b7121eb117

In each of the SIMD while loops, I just do unaligned loads. On anything recent, intuitively, I doubt the difference is large enough to matter, and simplicity is nice, but this depends on "what is the expected typical input size". With a sufficiently non-suck -microarch (anything Sandy Bridge or later), the unaligned loads get converted to a memory operand.

This does give up on AVX512, but support for that is spotty and it's a foot+gun on things that aren't Ice Lake/Rocket Lake/Zen 4.
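To make the shape concrete, a minimal 128-bit-chunk sketch with unaligned loads (my own illustration with made-up names, not code lifted from the gist):

    package sketch

    import "base:intrinsics"

    // Illustrative only: compare two buffers 16 bytes (one 128-bit chunk) at a
    // time using unaligned loads, with a plain byte loop for the tail.
    memory_equal_16 :: proc(x, y: rawptr, n: int) -> bool {
        a, b := ([^]u8)(x), ([^]u8)(y)
        i := 0
        for ; i + 16 <= n; i += 16 {
            va := intrinsics.unaligned_load((^#simd[16]u8)(&a[i]))
            vb := intrinsics.unaligned_load((^#simd[16]u8)(&b[i]))
            // Each differing lane becomes 0xFF; any nonzero OR means a mismatch.
            if intrinsics.simd_reduce_or(intrinsics.simd_lanes_ne(va, vb)) != 0 {
                return false
            }
        }
        for ; i < n; i += 1 {
            if a[i] != b[i] {
                return false
            }
        }
        return true
    }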

Either way, I find this sort of information fascinating, so thank you for your input. I'm aware of Agner Fog's work, though I haven't read into it too deeply yet. Do you have any other resources on this topic to share?

Intel's optimization reference is good (https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)

ps: Hopefully this doesn't come off as too nitpicky.
