Scatter and gather #12

Open
penzn opened this issue Jun 13, 2020 · 9 comments

@penzn
Contributor

penzn commented Jun 13, 2020

As @Maratyszcza and @lemaitre point out in #7, we should consider scatter and gather operations. This is an issue to track that.

Potential topics to discuss:

  • Emulation (it is only supported by AVX512 and SVE)
  • Compiler support - are there any ways to enable it aside from intrinsics?
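
For reference, the semantics that any emulation would have to reproduce can be written as plain scalar loops. This is only a sketch: `gather_ref` and `scatter_ref` are hypothetical names, not part of any proposal, and the rule that later lanes win on duplicate scatter indices is an assumption made here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for gather: lane i reads base[idx[i]].
std::vector<float> gather_ref(const float* base,
                              const std::vector<std::uint32_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        out[i] = base[idx[i]];
    }
    return out;
}

// Scalar reference for scatter: lane i writes values[i] to base[idx[i]].
// Assumption: with duplicate indices, the highest lane wins.
void scatter_ref(float* base,
                 const std::vector<std::uint32_t>& idx,
                 const std::vector<float>& values) {
    for (std::size_t i = 0; i < idx.size(); ++i) {
        base[idx[i]] = values[i];
    }
}
```
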
@jan-wassenberg

Developers made do quite successfully without it on native hardware, why is it a must for Wasm?

IMHO, I think it is a must for flexible vectors (much less for WASM SIMD in general). If you know the size of your SIMD register

I am surprised to hear it described as a must; I haven't yet seen an application that really required it.
We do know the size of the register, right? We have to have some function that tells us the loop increment, which is (by definition) the register size.

@lemaitre

We do know the size of the register, right? We have to have some function that tells us the loop increment, which is (by definition) the register size.

There seems to be a misconception here: we do not know, as developers, the SIMD width.
We only know how to get it at runtime via a specific instruction (or global).

For instance, one way to emulate scatter (or gather for that matter) is to implement a full in-register transposition.
This means that, at some point, you need as many registers as their width to store the data.
If you transpose floats in SSE, you need 4 registers. With AVX2, 8 registers, and so on.
So you cannot do this trick if you don't know at code time (or compile time) the size of the registers.
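
A minimal sketch of why the width has to be known when the code is written: the transposition below needs `N` registers of `N` lanes, so `N` appears as a template parameter. `Reg`, `transpose`, and `gather_by_transpose` are hypothetical names, and a `std::array` stands in for a real SIMD register.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a SIMD register holding N floats.
template <std::size_t N>
using Reg = std::array<float, N>;

// Transposes N "registers" of N lanes each. N must be a compile-time
// constant because it fixes how many registers the code names.
template <std::size_t N>
std::array<Reg<N>, N> transpose(const std::array<Reg<N>, N>& rows) {
    std::array<Reg<N>, N> cols{};
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t c = 0; c < N; ++c) {
            cols[c][r] = rows[r][c];
        }
    }
    return cols;
}

// Strided gather emulated by transposition: load N contiguous registers,
// then transpose so that register k holds element k of each group of N.
template <std::size_t N>
std::array<Reg<N>, N> gather_by_transpose(const float* base) {
    std::array<Reg<N>, N> rows{};
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t c = 0; c < N; ++c) {
            rows[r][c] = base[r * N + c];
        }
    }
    return transpose<N>(rows);
}
```

With `N = 4` this corresponds to the SSE float case mentioned above; a width only known at runtime cannot be passed as `N`.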

Even the extract pattern for scatter would be problematic, as the extract index would most likely be an immediate (and certainly is on most architectures). So here, either you completely unroll the loop at compile time and check for each index that it is less than the actual width, or we make the extract take runtime indices and hope the code generator will see that the index is actually a compile-time constant...

Neither solution sounds appealing.
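
The first option can be sketched as follows, under stated assumptions: `kMaxLanes` is an arbitrary upper bound chosen for illustration, `Lanes` is an array standing in for a register, and `extract` models an instruction whose lane index is an immediate by taking it as a template argument. None of these names come from the proposal.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed upper bound on the lane count, chosen only for illustration.
constexpr std::size_t kMaxLanes = 16;

// Hypothetical register model; only the first `lanes` entries are valid.
template <typename T>
using Lanes = std::array<T, kMaxLanes>;

// Models an extract whose lane index is an immediate.
template <std::size_t Lane, typename T>
T extract(const Lanes<T>& r) {
    static_assert(Lane < kMaxLanes, "lane index out of range");
    return r[Lane];
}

// Compile-time unrolling: one guarded store per possible lane.
template <std::size_t Lane>
struct ScatterLane {
    static void run(float* base, const Lanes<std::uint32_t>& idx,
                    const Lanes<float>& val, std::size_t lanes) {
        // Runtime check of each immediate index against the actual width.
        if (Lane < lanes) {
            base[extract<Lane>(idx)] = extract<Lane>(val);
        }
        ScatterLane<Lane + 1>::run(base, idx, val, lanes);
    }
};

// End of the unrolled sequence.
template <>
struct ScatterLane<kMaxLanes> {
    static void run(float*, const Lanes<std::uint32_t>&,
                    const Lanes<float>&, std::size_t) {}
};

inline void scatter_by_extract(float* base, const Lanes<std::uint32_t>& idx,
                               const Lanes<float>& val, std::size_t lanes) {
    ScatterLane<0>::run(base, idx, val, lanes);
}
```

The cost is visible in the structure: code for all `kMaxLanes` lanes is generated, and every lane pays a comparison against the runtime width.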

@jan-wassenberg

Yes, to be clear: having the function could allow us to compare the runtime value against a small set of candidates, and use the corresponding code pregenerated for each. Which raises an interesting question: is there some abstraction we can provide that allows developers to know that SVE will always have n*128 bits, and x86 will have {1,2,4}x128 bits?
RiscV V has no such limitation, but if the function returns something the app doesn't expect (e.g. 16K bits) then the app can fall back to some codepath that doesn't do in-register transposition.
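
A rough sketch of that dispatch, assuming the width in bits comes from some runtime query (not shown, and not defined by this thread). `gather_fixed`, `gather_generic`, `gather_dispatch`, and `Path` are hypothetical names; the fixed-width kernels are simple loops standing in for pregenerated code such as an in-register transposition.

```cpp
#include <cstddef>
#include <cstdint>

// Pregenerated kernel for a lane count N known at compile time.
template <std::size_t N>
void gather_fixed(const float* base, const std::uint32_t* idx, float* out) {
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = base[idx[i]];
    }
}

// Fallback that makes no assumption about the lane count.
void gather_generic(const float* base, const std::uint32_t* idx, float* out,
                    std::size_t lanes) {
    for (std::size_t i = 0; i < lanes; ++i) {
        out[i] = base[idx[i]];
    }
}

enum class Path { Fixed, Generic };

// Compares the runtime width against a small set of expected candidates
// and falls back when the value is unexpected (e.g. a very wide vector).
Path gather_dispatch(const float* base, const std::uint32_t* idx, float* out,
                     std::size_t width_bits) {
    switch (width_bits) {
        case 128: gather_fixed<4>(base, idx, out);  return Path::Fixed;
        case 256: gather_fixed<8>(base, idx, out);  return Path::Fixed;
        case 512: gather_fixed<16>(base, idx, out); return Path::Fixed;
        default:
            gather_generic(base, idx, out, width_bits / 32);  // 32-bit lanes
            return Path::Generic;
    }
}
```
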

@penzn
Contributor Author

penzn commented Jun 15, 2020

For Arm and x86 ISAs it would be perfectly legal to say that maximum width is always a multiple of 128 bits, though I am not sure how that would map to RiscV.

@lemaitre

lemaitre commented Jun 15, 2020

According to the RISC-V V spec (https://riscv.github.io/documents/riscv-v-spec/riscv-v-spec.pdf#_implementation_defined_constant_parameters), maximum width should be a power of 2 larger than or equal to 32 bits. (EDIT: I got confused in a previous version of this message)

Only SVE does not require that maximum width should be a power of 2.

@programmerjake

There's also SimpleV, a WIP extension on OpenPower that guarantees availability of any vector length from 1 to 64 (not limited to powers of 2, so e.g. 35 is a valid vector length), and allows (like RISC-V V) the length to be set dynamically. It supports gather-load, scatter-store, and gather register-to-register moves.

@lemaitre

lemaitre commented Apr 9, 2021

Thanks @programmerjake, I was not aware of SimpleV. To me, the interesting point is the guarantee that a vector length of 64 is available. So on SimpleV, we can force the vl to a power of 2 if required.

Also, you mention "gather register-to-register moves". While in hardware it makes sense to group them with gather loads, from the software point of view such a connection is not required, and the usual terminology is shuffle/swizzle.

@programmerjake

Some additional comments on SimpleV on Libre-SOC's mailing list:
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-April/002318.html

@programmerjake

Thanks @programmerjake, I was not aware of SimpleV. To me, the interesting point is the guarantee that a vector length of 64 is available.

Yup! We basically picked 64 as the max since the general purpose integer registers are 64-bits wide allowing 1 predicate bit per vector element.

So on SimpleV, we can force the vl to a power of 2 if required.

Yup, though if you're forcing VL to be bigger than necessary just so it's a power of 2, it will probably run slower, since it's implemented using a hardware-level loop over vector elements.
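
To make the cost concrete, here is a small sketch that counts the extra elements such a loop would visit when VL is rounded up. `round_up_pow2` and `padding_elements` are hypothetical helpers, and the model of one step per element is an assumption based on the description above, not a measurement.

```cpp
#include <cstddef>

// Smallest power of two that is >= n. Intended for small n (1..64 here).
std::size_t round_up_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Elements processed only because VL was padded to a power of two.
std::size_t padding_elements(std::size_t needed_vl) {
    return round_up_pow2(needed_vl) - needed_vl;
}
```

For a needed length of 35, the padded VL is 64, so 29 of the 64 element steps do no useful work.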

Also, you mention "gather register-to-register moves". While in hardware it makes sense to group them with gather loads, from the software point of view such a connection is not required, and the usual terminology is shuffle/swizzle.

Yup!


4 participants