From 47ba7e7bcc9dc37b2a69216efc4b891a29076237 Mon Sep 17 00:00:00 2001 From: gnzlbg Date: Wed, 14 Mar 2018 19:34:10 +0100 Subject: [PATCH] rfc: portable packed SIMD vector types --- text/0000-ppv.md | 1386 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1386 insertions(+) create mode 100644 text/0000-ppv.md diff --git a/text/0000-ppv.md b/text/0000-ppv.md new file mode 100644 index 00000000000..53da42a2281 --- /dev/null +++ b/text/0000-ppv.md @@ -0,0 +1,1386 @@ +- Feature Name: `portable_packed_vector_types` +- Start Date: (fill me in with today's date, YYYY-MM-DD) +- RFC PR: (leave this empty) +- Rust Issue: (leave this empty) + +# Summary +[summary]: #summary + +This RFC adds portable packed SIMD vector types up to 256-bit. + +Future RFCs will attempt to answer some of the unresolved questions and might +potentially cover extensions as they mature in `stdsimd`, like, for example, +portable memory gather and scatter operations, `m1xN` vector masks, masked +arithmetic/bitwise/shift operations, etc. + +# Motivation +[motivation]: #motivation + +The `std::arch` module exposes architecture-specific SIMD types like `_m128` - a +128-bit wide SIMD vector type. How these bits are interpreted depends on the intrinsic +being used. For example, let's sum 8 `f32`s values using the SSE4.1 facilities +in the `std::arch` module. This is one way to do it +([playground](https://play.rust-lang.org/?gist=165e2886b4883ec98d4e8bb4d6a32e22&version=nightly)): + +```rust +unsafe fn add_reduce(a: __m128, b: __m128) -> f32 { + let c = _mm_hadd_ps(a, b); + let c = _mm_hadd_ps(c, _mm_setzero_ps()); + let c = _mm_hadd_ps(c, _mm_setzero_ps()); + std::mem::transmute(_mm_extract_ps(c, 0)) +} + +fn main() { + unsafe { + let a = _mm_set_ps(1., 2., 3., 4.); + let b = _mm_set_ps(5., 6., 7., 8.); + let r = add_reduce(a, b); + assert_eq!(r, 36.); + } +} +``` + +Notice that: + +* one has to put some effort to extrapolate from `add_reduce`'s signature what + types of vectors it actually expects: "`add_reduce` takes 128-bit wide vectors and + returns an `f32` therefore those 128-bit vectors _probably_ must contain 4 packed + f32s because that's the only combination of `f32`s that fits in 128 bits!" + +* it requires a lot of `unsafe` code: the intrinsics are unsafe (which could be + improved via [RFC2122](https://github.com/rust-lang/rfcs/pull/2212)), the + intrinsic API relies on the user performing transmutes, constructing the + vectors is unsafe because it needs to be done via intrinsic calls, etc. + +* it requires a lot of architecture specific knowledge: how the intrinsics are + called, how they are used together + +* this solution only works on `x86` or `x86_64` with SSE4.1 enabled, that is, it + is not portable. + +With portable packed vector types, we can do much better +([playground](https://play.rust-lang.org/?gist=7fb4e3b6c711b5feb35533b50315a5fb&version=nightly)): + +```rust +fn main() { + let a = f32x4::new(1., 2., 3., 4.); + let b = f32x4::new(5., 6., 7., 8.); + let r = (a + b).sum(); + assert_eq!(r, 36.); +} +``` + +These types add zero-overhead over the architecture-specific types for the +operations that they support - if there is an architecture in which this does +not hold for some operation: the implementation has a bug. + +The motivation of this RFC is to provide reasonably high-level, reliable, and +portable access to common SIMD vector types and SIMD operations. + +At a higher level, the actual use cases for these specialty instructions are +boundless. 
SIMD intrinsics are used in graphics, multimedia, linear algebra, +scientific computing, games, cryptography, text search, machine learning, low +latency, and more. There are many crates in the Rust ecosystem using SIMD +intrinsics today, either through `stdsimd`, the `simd` crate, or both. Some +examples include: + +* [`encoding_rs`](https://github.com/hsivonen/encoding_rs) which uses the `simd` + crate to assist with speedy decoding. +* [`bytecount`](https://github.com/llogiq/bytecount) which uses the `simd` crate +with AVX2 extensions to accelerate counting bytes. +* [`regex`](https://github.com/rust-lang/regex) which uses the `stdsimd` crate +with SSSE3 extensions to accelerate multiple substrings search and also its +parent crate `teddy`. + +However, providing portable SIMD algorithms for all application domains is not +the intent of this RFC. + +The purpose of this RFC is to provide users with vocabulary types and +fundamental operations that they can build upon in their own crates to +effectively implement SIMD algorithms in their respective application domains. + +These types are meant to be extended by users with portable (or nonportable) SIMD +operations in their own crates, for example, via extension traits or new types. + +The operations provided in this RFC are thus either: + +**fundamental**: that is, they build the foundation required to write +higher-level SIMD algorithms. These include, amongst others, instantiating +vector types, load/stores from memory, masks and branchless conditional +operations, and type casts and conversions. + +**required**: to be part of the std. These include backend-specific compiler +intrinsics that we might never want to stabilize as well as the implementation of +std library traits which, due to trait coherence, users cannot extend the vector +types with. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +This RFC extends Rust with **portable packed SIMD vector types**, a set of types +used to perform **explicit vectorization**: + +* **SIMD**: stands for Single Instruction, Multiple Data. This RFC uses this + term in the context of hardware instruction set architectures (ISAs) to refer + to: + * SIMD instructions: instructions that (typically) perform operations on + multiple values simultaneously, and + * SIMD registers: the registers that the SIMD instructions take as operands. + These registers (typically) store multiple values that are + operated upon simultaneously by SIMD instructions. + +* **vector** types: types that abstract over memory stored in SIMD registers, + allowing to transfer memory to/from the registers and performing operations + directly on these registers. + +* **packed**: means that these vectors have a compile-time fixed size. It is + the opposite of **scalable** or "Cray vectors", which are SIMD vector types + with a dynamic size, that is, whose size is only known at run-time. + +* **explicit vectorization**: vectorization is the process of producing programs + that operate on multiple values simultaneously (typically) using SIMD + instructions and registers. Automatic vectorization is the process by which + the Rust compiler is, in some cases, able to transform scalar Rust code, that + is, code that does not use SIMD vector types, into machine code that does use + SIMD registers and instructions automatically (without user intervention). 
+  Explicit vectorization is the process by which a Rust **user** manually writes
+  Rust code that states what kind of SIMD registers are to be used and what SIMD
+  instructions are executed on them.
+
+* **portable**: is the opposite of architecture-specific. These types work both
+  correctly and efficiently on all architectures. They are a zero-overhead
+  abstraction, that is, for the operations that these types support, one cannot
+  write better code by hand (otherwise, it is an implementation bug).
+
+* **masks**: are vector types used to **select** the vector elements on which
+  operations are to be performed. This selection is performed by setting or
+  clearing the bits of the mask for a particular lane.
+
+Packed vector types are denoted as follows: `{i,u,f,m}{lane_width}x{#lanes}`, so
+that `i64x8` is a 512-bit vector with eight `i64` lanes and `f32x4` a 128-bit
+vector with four `f32` lanes. Here:
+
+* **lane**: each of the values of a particular type stored in a vector - the
+  vector operations act on these values simultaneously.
+
+* **lane width**: the bit width of a vector lane, that is, the bit width of
+  the objects stored in the vector. For example, the type `f32` is 32 bits wide.
+
+Operations on vector types can be either:
+
+* **vertical**: that is, lane-wise. For example, `a + b` adds each lane of `a`
+  to the corresponding lane of `b`, while `a.lt(b)` returns a boolean mask
+  that indicates whether the less-than (`<`, `lt`) comparison returned `true` or
+  `false` for each of the vector lanes. Most vertical operations are binary
+  operations (they take two input vectors). These operations are typically very
+  fast on most architectures and they are the most widely used in practice.
+
+* **horizontal**: that is, along a single vector - they are unary operations.
+  For example, `a.sum()` adds the elements of a vector together while
+  `a.hmax()` returns the largest element in a vector. These operations
+  (typically) translate to a sequence of multiple SIMD instructions on most
+  architectures and are therefore slower. In many cases they are, however,
+  necessary.
+
+## Example: Average
+
+The first example computes the arithmetic average of the elements in a list.
+Sequentially, we would write it using iterators as follows:
+
+```rust
+/// Arithmetic average of the elements in `xs`.
+fn average_seq(xs: &[f32]) -> f32 {
+    if xs.len() > 0 {
+        xs.iter().sum::<f32>() / xs.len() as f32
+    } else {
+        0.
+    }
+}
+```
+
+The following implementation uses the 256-bit SIMD facilities provided by this
+RFC. As the name suggests, it will be "slow":
+
+```rust
+/// Computes the arithmetic average of the elements in the list.
+///
+/// # Panics
+///
+/// If `xs.len()` is not a multiple of `8`.
+fn average_slow256(xs: &[f32]) -> f32 {
+    // The 256-bit wide floating-point vector type is f32x8. To
+    // avoid handling extra elements in this example we just panic.
+    assert!(xs.len() % 8 == 0,
+            "input length `{}` is not a multiple of 8",
+            xs.len());
+
+    let mut result = 0.0_f32; // This is where we store the result
+
+    // We iterate over the input slice with a step of `8` elements:
+    for i in (0..xs.len()).step_by(8) {
+        // First, we load the next `8` elements into an `f32x8`.
+        // Since we haven't checked whether the input slice
+        // is aligned to the alignment of `f32x8`, we perform
+        // an unaligned memory load.
+        let data = f32x8::load_unaligned(&xs[i..]);
+
+        // With the elements in the vector, we perform a horizontal reduction
+        // and add them to the result.
+        result += data.sum();
+    }
+    result
+}
+```
+
+As mentioned, this operation is "slow". Why is that? The main issue is that, on
+most architectures, horizontal reductions must perform a sequence of SIMD
+operations while vertical operations typically require only a single
+instruction.
+
+We can significantly improve the performance of our algorithm by writing it in
+such a way that the number of horizontal reductions performed is reduced:
+
+```rust
+fn average_fast256(xs: &[f32]) -> f32 {
+    assert!(xs.len() % 8 == 0,
+            "input length `{}` is not a multiple of 8",
+            xs.len());
+
+    // Our temporary result is now an f32x8 vector:
+    let mut result = f32x8::splat(0.);
+    for i in (0..xs.len()).step_by(8) {
+        let data = f32x8::load_unaligned(&xs[i..]);
+        // This adds the data elements to our temporary result using
+        // a vertical lane-wise SIMD operation - this is a single SIMD
+        // instruction on most architectures.
+        result += data;
+    }
+    // Perform a single horizontal reduction at the end:
+    result.sum()
+}
+```
+
+The performance could be further improved by requiring the input data to be
+aligned to the alignment of `f32x8`, and/or by handling the elements before the
+first aligned boundary in a special way.
+
+## Example: scalar-vector multiply even
+
+To showcase the mask and `select` API, the following function multiplies the
+even elements of a vector by a scalar:
+
+```rust
+fn mul_even(a: f32, x: f32x4) -> f32x4 {
+    // Create a mask for the even elements 0 and 2:
+    let m = m32x4::new(true, false, true, false);
+
+    // Perform a full multiplication:
+    let r = f32x4::splat(a) * x;
+
+    // Use the mask to select the even elements from the
+    // multiplication result and the odd elements from
+    // the input:
+    m.select(r, x)
+}
+```
+
+## Example: 4x4 Matrix multiplication
+
+To showcase the `shuffle!` API, the following function implements a 4x4 matrix
+multiplication using 128-bit wide vectors:
+
+```rust
+fn mul4x4(a: [f32x4; 4], b: [f32x4; 4]) -> [f32x4; 4] {
+    let mut r = [f32x4::splat(0.); 4];
+
+    for i in 0..4 {
+        r[i] =
+            a[0] * shuffle!(b[i], [0, 0, 0, 0]) +
+            a[1] * shuffle!(b[i], [1, 1, 1, 1]) +
+            a[2] * shuffle!(b[i], [2, 2, 2, 2]) +
+            a[3] * shuffle!(b[i], [3, 3, 3, 3]);
+    }
+    r
+}
+```
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Vector types
+
+The vector types are named according to the following scheme:
+
+> {element_type}{lane_width}x{number_of_lanes}
+
+where the following element types are introduced by this RFC:
+
+* `i`: signed integer
+* `u`: unsigned integer
+* `f`: float
+* `m`: mask
+
+So that `u16x8` reads "a SIMD vector of eight packed 16-bit wide unsigned
+integers". The width of a vector can be computed by multiplying the
+`{lane_width}` by the `{number_of_lanes}`. For `u16x8`, 16 x 8 = 128, so
+this vector type is 128 bits wide.
+
+This RFC proposes adding all vector types with sizes in the range [16, 256] bits
+to the `std::simd` module, that is:
+
+* 16-bit wide vectors: `i8x2`, `u8x2`, `m8x2`
+* 32-bit wide vectors: `i8x4`, `u8x4`, `m8x4`, `i16x2`, `u16x2`, `m16x2`
+* 64-bit wide vectors: `i8x8`, `u8x8`, `m8x8`, `i16x4`, `u16x4`, `m16x4`,
+  `i32x2`, `u32x2`, `f32x2`, `m32x2`
+* 128-bit wide vectors: `i8x16`, `u8x16`, `m8x16`, `i16x8`, `u16x8`, `m16x8`,
+  `i32x4`, `u32x4`, `f32x4`, `m32x4`, `i64x2`, `u64x2`, `f64x2`, `m64x2`
+* 256-bit wide vectors: `i8x32`, `u8x32`, `m8x32`, `i16x16`, `u16x16`, `m16x16`,
+  `i32x8`, `u32x8`, `f32x8`, `m32x8`, `i64x4`, `u64x4`, `f64x4`, `m64x4`
+
+Note that this list is not comprehensive.
In particular: + +* half-float `f16xN`: these vectors are supported in many architectures (ARM, + AArch64, PowerPC64, RISC-V, MIPS, ...) but their support is blocked on Rust + half-float support. +* AVX-512 vector types, not only 512-bit wide vector types, but also `m1xN` + vector masks. These are blocked on `std::arch` AVX-512 support. +* other vector types: x86, AArch64, PowerPC and others include types like + `i64x1`, `u64x1`, `f64x1`, `m64x1`, `i128x1`, `u128x1`, `m128x1`, ... These + can be always added later as the need for these arises, potentially in combination with the stabilization of the `std::arch` intrinsics for those + architectures. + +## API of portable packed SIMD vector types + +### Traits overview + +All vector types implement the following traits: + +* `Copy` +* `Clone` +* `Default`: zero-initializes the vector. +* `Debug`: formats the vector as `({}, {}, ...)`. +* `PartialEq`: performs a lane-wise comparison between two vectors and + returns `true` if all lanes compare `true`. It is equivalent to + `a.eq(b).all()`. +* `PartialOrd`: compares two vectors lexicographically. +* `From/Into` lossless casts between vectors with the same number of lanes. + +All signed integer, unsigned integer, and floating point vector types implement +the following traits: + +* `{Add,Sub,Mul,Div,Rem}`, + `{Add,Sub,Mul,Div,Rem}Assign`: vertical (lane-wise) arithmetic + operations. + +All signed and unsigned integer vectors and vector masks also implement: + +* `Eq`: equivalent to `PartialEq` +* `Ord`: equivalent to `PartialOrd` +* `Hash`: equivalent to `Hash` for `[element_type; number_of_elements]`. +* `fmt::LowerHex`/`fmt::UpperHex`: formats the vector as hexadecimal. +* `fmt::Octal`: formats the vector as an octal number. +* `fmt::Binary`: formats the vector as binary number. +* `Not`: vertical (lane-wise) negation, +* `Bit{And,Or,Xor}`, `Bit{And,Or,Xor}Assign`: + vertical (lane-wise) bitwise operations. + +All signed and unsigned integer vectors also implement: + +* `{Shl,Shr}`, `{Shl,Shr}Assign`: vertical + (lane-wise) bit-shift operations. + +Note: While IEEE 754-2008 provides total ordering predicates for floating-point +numbers, Rust does not implement `Eq` and `Ord` for the `f32` and `f64` +primitive types. This RFC follows suit and does not propose to implement `Eq` +and `Ord` for vectors of floating-point types. Any future RFC that might want to +extend Rust with a total order for floats should extend the portable +floating-point vector types with it as well. See [this internal +thread](https://users.rust-lang.org/t/how-to-sort-a-vec-of-floats/2838/3) for +more information. + +### Inherent Methods + +#### Construction and element access + +All portable signed integer, unsigned integer, and floating-point vector types +implement the following methods: + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { + +/// Creates a new instance of the vector from `number_of_lanes` +/// values. +pub const fn new(args...: element_type) -> Self; + +/// Returns the number of vector lanes. +pub const fn lanes() -> usize; + +/// Constructs a new instance with each element initialized to +/// `value`. +pub const fn splat(value: element_type) -> Self; + +/// Extracts the value at `index`. +/// +/// # Panics +/// +/// If `index >= Self::lanes()`. +pub fn extract(self, index: usize) -> element_type; + +/// Extracts the value at `index`. +/// +/// If `index >= Self::lanes()` the behavior is undefined. 
+pub unsafe fn extract_unchecked(self, index: usize) -> element_type; + +/// Returns a new vector where the value at `index` is replaced by `new_value`. +/// +/// # Panics +/// +/// If `index >= Self::lanes()`. +#[must_use = error-message] +pub fn replace(self, index: usize, new_value: $elem_ty) -> Self; + +/// Returns a new vector where the value at `index` is replaced by `new_value`. +#[must_use = error-message] +pub unsafe fn replace_unchecked(self, index: usize, + new_value: element_type) -> Self; +} +``` + +#### Loads and Stores + +All portable vector types implement the following methods: + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { + +/// Writes the values of the vector to the `slice`. +/// +/// # Panics +/// +/// If `slice.len() < Self::lanes()` or `&slice[0]` is not +/// aligned to an `align_of::()` boundary. +pub fn store_aligned(self, slice: &mut [element_type]); + +/// Writes the values of the vector to the `slice`. +/// +/// # Panics +/// +/// If `slice.len() < Self::lanes()`. +pub fn store_unaligned(self, slice: &mut [element_type]); + +/// Writes the values of the vector to the `slice`. +/// +/// # Precondition +/// +/// If `slice.len() < Self::lanes()` or `&slice[0]` is not +/// aligned to an `align_of::()` boundary, the behavior is +/// undefined. +pub unsafe fn store_aligned_unchecked(self, slice: &mut [element_type]); + +/// Writes the values of the vector to the `slice`. +/// +/// # Precondition +/// +/// If `slice.len() < Self::lanes()` the behavior is undefined. +pub unsafe fn store_unaligned_unchecked(self, slice: &mut [element_type]); + +/// Instantiates a new vector with the values of the `slice`. +/// +/// # Panics +/// +/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned +/// to an `align_of::()` boundary. +pub fn load_aligned(slice: &[element_type]) -> Self; + +/// Instantiates a new vector with the values of the `slice`. +/// +/// # Panics +/// +/// If `slice.len() < Self::lanes()`. +pub fn load_unaligned(slice: &[element_type]) -> Self; + +/// Instantiates a new vector with the values of the `slice`. +/// +/// # Precondition +/// +/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned +/// to an `align_of::()` boundary, the behavior is undefined. +pub unsafe fn load_aligned_unchecked(slice: &[element_type]) -> Self; + +/// Instantiates a new vector with the values of the `slice`. +/// +/// # Precondition +/// +/// If `slice.len() < Self::lanes()` the behavior is undefined. +pub unsafe fn load_unaligned_unchecked(slice: &[element_type]) -> Self; +} +``` + +#### Binary minmax vertical operations + +All portable signed integer, unsigned integer, and floating-point vectors +implement the following methods: + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { +/// Lane-wise `min`. +/// +/// Returns a vector whose lanes contain the smallest +/// element of the corresponding lane of `self` and `other`. +pub fn min(self, other: Self) -> Self; + +/// Lane-wise `max`. +/// +/// Returns a vector whose lanes contain the largest +/// element of the corresponding lane of `self` and `other`. +pub fn max(self, other: Self) -> Self; +} +``` + +##### Floating-point semantics + +The floating-point semantics follow the IEEE-754 semantics for `minNum` and +`maxNum`. That is: + +If either operand is a `NaN`, returns the other non-NaN operand. Returns `NaN` +only if both operands are `NaN`. If the operands compare equal, returns a value +that compares equal to both operands. 
This means that `min(+/-0.0, +/-0.0)` +could return either `-0.0` or `0.0`. Otherwise, `min` and `max` return the +smallest and largest operand, respectively. + +#### Arithmetic reductions + +##### Integers + +All portable signed and unsigned integer vector types implement the following +methods: + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { + +/// Horizontal wrapping sum of the vector elements. +/// +/// The intrinsic performs a tree-reduction of the vector elements. +/// That is, for a 4 element vector: +/// +/// > (x0.wrapping_add(x1)).wrapping_add(x2.wrapping_add(x3)) +/// +/// If an operation overflows it returns the mathematical result +/// modulo `2^n` where `n` is the number of times it overflows. +pub fn wrapping_sum(self) -> element_type; + +/// Horizontal wrapping product of the vector elements. +/// +/// The intrinsic performs a tree-reduction of the vector elements. +/// That is, for a 4 element vector: +/// +/// > (x0.wrapping_mul(x1)).wrapping_mul(x2.wrapping_mul(x3)) +/// +/// If an operation overflows it returns the mathematical result +/// modulo `2^n` where `n` is the number of times it overflows. +pub fn wrapping_product(self) -> element_type; +} +``` + +##### Floating-point + +All portable floating-point vector types implement the following methods: + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { + +/// Horizontal sum of the vector elements. +/// +/// The intrinsic performs a tree-reduction of the vector elements. +/// That is, for a 8 element vector: +/// +/// > ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7)) +/// +/// If one of the vector element is `NaN` the reduction returns +/// `NaN`. The resulting `NaN` is not required to be equal to any +/// of the `NaN`s in the vector. +pub fn sum(self) -> element_type; + +/// Horizontal product of the vector elements. +/// +/// The intrinsic performs a tree-reduction of the vector elements. +/// That is, for an 8 element vector: +/// +/// > ((x0 * x1) * (x2 * x3)) * ((x4 * x5) * (x6 * x7)) +/// +/// If one of the vector element is `NaN` the reduction returns +/// `NaN`. The resulting `NaN` is not required to be equal to any +/// of the `NaN`s in the vector. +pub fn product(self) -> element_type; +} +``` + +#### Bitwise reductions + +All signed and unsigned integer vectors implement the following methods: + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { +/// Horizontal bitwise `and` of the vector elements. +pub fn horizontal_and(self) -> element_type; + +/// Horizontal bitwise `or` of the vector elements. +pub fn horizontal_or(self) -> element_type; + +/// Horizontal bitwise `xor` of the vector elements. +pub fn horizontal_xor(self) -> element_type; +} +``` + +#### Min/Max reductions + +All portable signed integer, unsigned integer, and floating-point vector types +implement the following methods: + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { +/// Value of the largest element in the vector. +/// +/// # Floating-point +/// +/// If the result contains `NaN`s the result is a +/// `NaN` that is not necessarily equal to any of +/// the `NaN`s in the vector. +pub fn hmax(self) -> element_type; + +/// Value of the smallest element in the vector. +/// +/// # Floating-point +/// +/// If the result contains `NaN`s the result is a +/// `NaN` that is not necessarily equal to any of +/// the `NaN`s in the vector. 
+pub fn hmin(self) -> element_type; +} +``` + +#### Mask construction and element access + +```rust +impl m{lane_width}x{number_of_lanes} { +/// Creates a new vector mask from `number_of_lanes` boolean +/// values. +/// +/// The values `true` and `false` respectively set and clear +/// the mask for a particular lane. +pub const fn new(args...: bool...) -> Self; + +/// Returns the number of vector lanes. +pub const fn lanes() -> usize; + +/// Constructs a new vector mask with all lane-wise +/// masks either set, if `value` equals `true`, or cleared, if +/// `value` equals `false`. +pub const fn splat(value: bool) -> Self; + +/// Returns `true` if the mask for the lane `index` is +/// set and `false` otherwise. +/// +/// # Panics +/// +/// If `index >= Self::lanes()`. +pub fn extract(self, index: usize) -> bool; + +/// Returns `true` if the mask for the lane `index` is +/// set and `false` otherwise. +/// +/// If `index >= Self::lanes()` the behavior is undefined. +pub unsafe fn extract_unchecked(self, index: usize) -> bool; + +/// Returns a new vector mask where mask of the lane `index` is +/// set if `new_value` is `true` and cleared otherwise. +/// +/// # Panics +/// +/// If `index >= Self::lanes()`. +#[must_use = error-message] +pub fn replace(self, index: usize, new_value: bool) -> Self; + +/// Returns a new vector mask where mask of the lane `index` is +/// set if `new_value` is `true` and cleared otherwise. +/// +/// If `index >= Self::lanes()` the behavior is undefined. +#[must_use = error-message] +pub unsafe fn replace_unchecked(self, index: usize, new_value: bool) -> Self; +} +``` + +#### Mask reductions + +All vector masks implement the following methods: + +```rust +impl m{lane_width}x{number_of_lanes} { +/// Are "all" lanes `true`? +pub fn all(self) -> bool; + +/// Is "any" lanes `true`? +pub fn any(self) -> bool; + +/// Are "all" lanes `false`? +pub fn none(self) -> bool; +} +``` + +#### Mask vertical selection + +All vector masks implement the following method: + +```rust +impl m{lane_width}x{number_of_lanes} { +/// Lane-wise selection. +/// +/// The lanes of the result for which the mask is `true` contain +/// the values of `a` while the remaining lanes contain the values of `b`. +pub fn select(self, a: T, b: T) -> T + where T::lanes() == number_of_lanes; // implementation-defined +} +``` + +Note: how `where` clause is enforced is an implementation detail. `stdsimd` +implements this using a sealed trait: + +```rust +pub fn select(self, a: T, b: T) -> T + where T: SelectMask +} +``` + +#### Vertical comparisions + +All vector types implement the following vertical (lane-wise) comparison methods +that returns a mask expressing the result. + +```rust +impl {element_type}{lane_width}x{number_of_lanes} { +/// Lane-wise equality comparison. +pub fn eq(self, other: $id) -> m{lane_width}x{number_of_lanes}; + +/// Lane-wise inequality comparison. +pub fn ne(self, other: $id) -> m{lane_width}x{number_of_lanes}; + +/// Lane-wise less-than comparison. +pub fn lt(self, other: $id) -> m{lane_width}x{number_of_lanes}; + +/// Lane-wise less-than-or-equals comparison. +pub fn le(self, other: $id) -> m{lane_width}x{number_of_lanes}; + +/// Lane-wise greater-than comparison. +pub fn gt(self, other: $id) -> m{lane_width}x{number_of_lanes}; + +/// Lane-wise greater-than-or-equals comparison. +pub fn ge(self, other: $id) -> m{lane_width}x{number_of_lanes}; +} +``` + +For all vector types proposed in this RFC, the `{lane_width}` of the mask +matches that of the vector type. 
However, this will not be the case for the +AVX-512 vector types. + +##### Semantics for floating-point numbers + +* `eq`: yields `true` if both operands are not a `QNAN` and `self` is equal to + `other`, yields `false` otherwise. +* `gt`: yield `true` if both operands are not a `QNAN` and ``self`` is greater + than `other`, yields `false` otherwise. +* `ge`: yields `true` if both operands are not a `QNAN` and `self` is greater + than or equal to `other`, yields `false` otherwise. +* `lt`: yields `true` if both operands are not a `QNAN` and `self` is less than + `other`, yields `false` otherwise. +* `le`: yields `true` if both operands are not a `QNAN` and `self` is less than + or equal to `other`, yields `false` otherwise. +* `ne`: yields `true` if either operand is a `QNAN` or `self` is not equal to + `other`, yields `false` otherwise. + + +### Portable vector shuffles + +``` +/// Shuffles vector elements. +std::simd::shuffle!(...); +``` + +The `shuffle!` macro returns a new vector that contains a shuffle of the elements in +one or two input vectors. That is, there are two versions: + + * `shuffle!(vec, [indices...])`: one-vector version + * `shuffle!(vec0, vec1, [indices...])`: two-vector version + +In the two-vector version, both `vec0` and `vec1` must have the same type. +The element type of the resulting vector is the element type of the input +vector. + +The number of `indices` must be a power-of-two in range `[0, 64)` smaller +than two times the number of lanes in the input vector. The length of the +resulting vector equals the number of indices provided. + +Given a vector with `N` lanes, the indices in range `[0, N)` refer to the `N` elements in the vector. In the two-vector version, the indices in range `[N, 2*N)` refer to elements in the second vector. + +#### Example: shuffles + +The `shuffle!` macro allows reordering the elements of a vector: + +```rust +let x = i32x4::new(1, 2, 3, 4); +let r = shuffle!(x, [2, 1, 3, 0]); +assert_eq!(r, i32x4::new(3, 2, 4, 1)); +``` + +where the resulting vector can also be smaller: + +```rust +let r = shuffle!(x, [1, 3]); +assert_eq!(r, i32x2::new(2, 4)); +``` + +or larger + +``` +let r = shuffle!(x, [1, 3, 2, 2, 1, 3, 2, 2]); +assert_eq!(r, i32x8::new(2, 4, 3, 3, 2, 4, 3, 3)); +``` + +than the input. The length of the result must be, however, limited to the range +`[2, 2 * vec::lanes()]`. + +It also allows shuffling between two vectors + +```rust +let y = i32x4::new(5, 6, 7, 8); +let r = shuffle!(x, y, [4, 0, 5, 1]); +assert_eq!(r, i32x4::new(5, 1, 6, 2)); +``` + +where the indices of the second vector's elements start at the `vec::lanes()` +offset. + +#### Conversions and bitcasts +[casts-and-conversions]: #casts-and-conversions + +There are three different ways to convert between vector types. + +* `From`/`Into`: value-preserving widening-conversion between vectors with the + same number of lanes. That is, `f32x4` can be converted into `f64x4` using + `From`/`Into`, but the opposite is not true because that conversion is not + value preserving. The `From`/`Into` implementations mirror that of the + primitive integer and floating-point types. These conversions can widen the + size of the element type, and thus the size of the SIMD vector type. Signed + vector types are sign-extended lane-wise, while unsigned vector types are + zero-extended lane-wise. The result of these conversions is + endian-independent. + +* `as`: non-value preserving truncating-conversions between vectors with the + same number of lanes. 
That is, `f64x4 as f32x4` performs a lane-wise `as` + cast, truncating the values if they would overflow the destination type. The + result of these conversions is endian-independent. + +* `unsafe mem::transmute`: bit-casts between vectors with the same size, that + is, the vectors do not need to have the same number of lanes. For example, + transmuting a `u8x16` into a `u16x8`. Note that while all bit-patterns of the + `{i,u,f}` vector types represent a valid vector value, there are many vector + mask bit-patterns that do not represent a valid mask. Note also that the + result of `unsafe mem::transmute` is **endian-dependent** (see examples + below). + +It is extremely common to perform "transmute" operations between equally-sized +portable vector types when writing SIMD algorithms. Rust currently does not have +any facilities to express that all bit-patterns of one type are also valid +bit-patterns of another type, and to perform these safe transmutes in an +endian-independent way. + +This forces users to resort to `unsafe { mem::transmute(x) }` and, very likely, +to write non-portable code. + +There is a very interesting discussion about [this in this internal +thread](https://internals.rust-lang.org/t/pre-rfc-frombits-intobits/7071/23) +about potential ways to attack this problem, and there is also an [open issue in +`stdsimd` about endian-dependent +behavior](https://github.com/rust-lang-nursery/stdsimd/issues/393) - if you care +deeply about it please chime in. + +These issues are not specific to portable packed SIMD vector types and fixing +them is not the purpose of this RFC, but these issues are critical for writing +efficient and portable SIMD code reliably and ergonomically. + +# ABI and `std::simd` + +The ABI is first and foremost unspecified and may change at any time. + +All `std::simd` types are forbidden in `extern` functions (or warned against). +Basically the same story as types like `__m128i` and `extern` functions. + +As of today, they will be implemented as pass-via-pointer unconditionally. For +example: + +```rust +fn foo(a: u32x4) { /* ... */ } + +foo(u32x4::splat(3)); +``` + +This example will pass the variable `a` through memory. The function calling +`foo` will place `a` on the stack and then `foo` will read `a` from the stack +to work with it. Note that if `foo` changes the value of `a` this will not be +visible to the caller, they're semantically pass-by-value but implemented as +pass-via-pointers. + +Currently, we aren't aware of any slowdowns of perf hits from this mechanism +(pass through memory instead of by value). If something comes up, leaving the +ABI unspecified allows us to try to address it. + +# Drawbacks +[drawbacks]: #drawbacks + +## Generic vector type requirement for backends + +The `std::arch` module provides architecture-specific vector types where +backends only need to provide vector types for the architectures that they +support. + +This RFC requires backends to provide generic vector types. Most backends support +this in one form or another, but if one future backend does not, this RFC can be +implemented on top of the architecture specific types. + +## Zero-overhead requirement for backends + +A future architecture might have an instruction that performs multiple +operations exposed by this API in one go, like `(a + b).sum()` on an +`f32x4` vector. The zero-overhead requirement makes it a bug if Rust does not +generate optimal code for this situation. 
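+
+As an illustration, a minimal sketch of the pattern in question (the function
+name `add_reduce` is just an example, not an API proposed here): under the
+zero-overhead requirement, a backend for such an architecture would be expected
+to fuse the vertical addition and the horizontal sum into that single
+instruction.
+
+```rust
+// A vertical add followed by a horizontal sum; an ISA with a fused
+// "add and horizontally reduce" instruction should see exactly that
+// one instruction emitted for this function body.
+fn add_reduce(a: f32x4, b: f32x4) -> f32 {
+    (a + b).sum()
+}
+```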
+ +This is not a performance bug that can be easily worked around in `stdsimd` or +`rustc`, making this almost certainly a performance bug in the backend. + +It is reasonable to assume that every optimizing Rust backed will have a +pattern-matching engine powerful enough to perform these +transformations, but it is worth it to keep this requirement in mind. + +## Performance of this API might vary dramatically + +The performance of this API can vary dramatically depending on the architecture +being targeted and the target features enabled. + +First, this is a consequence of portability, and thus a feature. However, that +portability can introduce performance bugs is a real concern. In any case, if +the user is able to write faster code for some architecture, they should fill a +performance bug. + +# Rationale and alternatives +[alternatives]: #alternatives + +### Dynamic values result in poor code generation for some operations + +Some of the fundamental APIs proposed in this RFC, like `vec::{new, extract, +store, replace}` take run-time dynamic parameters. Consider the following +example (see the whole example live at [`rust.godbolt.org`](https://godbolt.org/g/yhiAa2): + +```rust +/// Returns a f32x8 with 0.,1.,2.,3. +fn increasing() -> f32x8 { + let mut x = f32x8::splat(0.); + for i in 0..f32x8::lanes() { + x = x.replace(i, i as f32); + } + x +} +``` + +In release mode, `rustc` generates the following assembly for this function: + +```asm +.LCPI0_0: + .long 0 + .long 1065353216 + .long 1073741824 + .long 1077936128 + .long 1082130432 + .long 1084227584 + .long 1086324736 + .long 1088421888 +example::increasing: + pushq %rbp + movq %rsp, %rbp + vmovaps .LCPI0_0(%rip), %ymm0 + vmovaps %ymm0, (%rdi) + movq %rdi, %rax + popq %rbp + vzeroupper + retq +``` + +which uses two vector loads to load the values into a SIMD register - +digression: this two loads are due to Rust's SIMD vector types ABI and happen +only "isolated" examples. + +If we change this function to accept run-time bounds for the loop + +```rust +/// Returns a f32x4::splat(0.) with the elements in [a, b) initialized +/// with an increasing sequence 0.,1.,2.,3. +fn increasing(a: usize, b: usize) -> f32x4 { + let mut x = f32x4::splat(0.); + for i in a..b { + x = x.replace(i, i as f32); + } + x +} +``` + +then the amount of instruction generated explodes: + +```asm +example::increasing_rt: + pushq %rbp + movq %rsp, %rbp + andq $-32, %rsp + subq $320, %rsp + vxorps %xmm0, %xmm0, %xmm0 + cmpq %rsi, %rdx + jbe .LBB1_34 + movl %edx, %r9d + subl %esi, %r9d + leaq -1(%rdx), %r8 + subq %rsi, %r8 + andq $7, %r9 + je .LBB1_2 + negq %r9 + vxorps %xmm0, %xmm0, %xmm0 + movq %rsi, %rcx +.LBB1_4: + testq %rcx, %rcx + js .LBB1_5 + vcvtsi2ssq %rcx, %xmm2, %xmm1 +...200 lines more... +``` + +This code isn't necessarily horrible, but it is definitely harder to reason about its +performance. This has two main causes: + +* **ISAs do not support these operations**: most (all?) ISAs support operations + like `extract`, `store`, and `replace` with constant indices only. That is, + these operations do not map to single instructions on most ISAs. + +* **these operations are slow**: even for constant indices, these operations are + slow. Often, for each constant index, a different instruction must be + generated, and occasionally, for a particular constant index, the operation + requires multiple instructions. 
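+
+For code that genuinely needs a run-time lane index, going through memory with
+the proposed `store` API keeps the cost easier to reason about than a run-time
+`extract`. A minimal sketch (the helper name `get_lane` is just an example, not
+an API proposed here):
+
+```rust
+// Write the vector to a stack buffer once and index the buffer,
+// instead of performing a run-time-indexed `extract`.
+fn get_lane(v: f32x8, index: usize) -> f32 {
+    let mut buf = [0.0_f32; 8];
+    v.store_unaligned(&mut buf);
+    buf[index] // panics if `index >= 8`, like `extract` would
+}
+```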
+ +So we have a trade-off to make between providing a comfortable API for programs +that really must extract a single value with a run-time index, and providing an +API that provides "reliable" performance. + +The proposed API accepts run-time indices (and values for `new`): + +* **common** SIMD code indexes with compile-time indices: this code gets optimized + reasonably well with the LLVM backend, but the user needs to deal with the + safe-but-checked and `unsafe`-but-unchecked APIs. If we were to only accept + constant indices, the unchecked API would not be necessary, since the checked + API would ensure that the indices are in-bounds at compile-time. + +* **rare** SIMD code indexes with run-time indices: this is code that one should + really avoid writing. The current API makes writing this code extremely easy, + resulting in SIMD code with potentially unexpected performance. Users also + have to deal with two APIs for this, the checked/unchecked APIs, and + also, the memory `load`/`store` APIs that are better suited for this use case. + +Whether the current design is the right design should probably be clarified +during the RFC. An important aspect to consider is that Rust support for +`const`ants is very basic: `const fn`s are getting started, `const` generics are +not there yet, etc. That is, making the API take constant indices might severely +limit the type of code that can be used with these APIs in today's Rust. + +### Binary (vector,scalar) and (scalar,vector) operations + +This RFC can be extended with binary vector-scalar and scalar vector operations +by implementing the following traits for signed integer, unsigned integer, and +floating-point vectors: + +* `{Add,Sub,Mul,Div,Rem}`, + `{Add,Sub,Mul,Div,Rem} for + {element_type}`, `{Add,Sub,Mul,Div,Rem}Assign`: binary + scalar-vector vertical (lane-wise) arithmetic operations. + +and the following trait for signed and unsigned integer vectors: + +* `Bit{And,Or,Xor}`, + `Bit{And,Or,Xor} for {element_type}`, + `Bit{And,Or,Xor}Assign` binary scalar-vector vertical + (lane-wise) bitwise operations. + +* `{Shl,Shr}`, `{Shl,Shr}Assign`: for all integer types `I` in + {`i8`, `i16`, `i32`, `i64`, `i128`, `isize`, `u8`, `u16`, `u32`, `u64`, + `u128`, `usize`}. Note: whether only `element_type` or all integer types + should be allowed is debatable: `stdsimd` currently allows using all integer + types. + +These traits slightly improve the ergonomics of scalar vector operations: + +```rust +let x: f32x4; +let y: f32x4; +let a: f32; +let z = a * x + y; +// instead of: z = f32x4::splat(a) * x + y; +x += a; +// instead of: x += f32x4::splat(a); +``` + +but they do not enable to do anything new that can't be easily done without them +by just using `vec::splat`, and initial feedback on the RFC suggested that they +are an abstraction that hides the cost of splatting the scalar into the vector. + +These traits are implemented in `stdsimd` (and thus available in nightly Rust), +are trivial to implement (`op(vec_ty::splat(scalar), vec)` and `op(vec, +vec_ty::splat(scalar))`), and cannot be "seamlessly" provided by users due to +coherence. + +They are not part of this RFC, but they can be easily added (now or later) if +there is consensus to do so. In the meantime, they can be experimented with on +nightly Rust. If there is consensus to remove them, porting nightly code off +these is also pretty easy. 
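+
+Either way, the ergonomics gap without these impls is small: a user can wrap the
+explicit `splat` in a one-line helper. A minimal sketch (the helper name `axpy`
+is just an example, not part of this RFC):
+
+```rust
+// `a * x + y`, lane-wise, written with the operations this RFC proposes;
+// this is all the scalar-vector trait impls would expand to anyway.
+fn axpy(a: f32, x: f32x4, y: f32x4) -> f32x4 {
+    f32x4::splat(a) * x + y
+}
+```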
+
+### Tiny vector types
+
+Most platforms' SIMD registers have a constant width, and they can be used to
+operate on vectors with a smaller bit width. However, 16 and 32-bit wide
+vectors are "small" by most platforms' standards.
+
+These types are useful for performing SIMD Within A Register (SWAR) operations
+on platforms without SIMD registers. While their performance has not been
+extensively investigated in `stdsimd` yet, any performance issues are
+performance bugs that should be fixed.
+
+### Portable shuffles API
+
+The portable shuffles are exposed via the `shuffle!` macro. Generating the
+sequence of instructions required to perform a shuffle requires the shuffle
+indices to be known at compile time.
+
+In the future, an alternative API based on `const`-generics and/or
+`const`-function-arguments could be added in a backwards-compatible way:
+
+```rust
+impl {element_type}{element_width}x{number_of_lanes} {
+    pub fn shuffle<R, const N: usize>(self, const indices: [isize; N])
+        -> <R as ShuffleResult>::ShuffleResultType
+        where R: ShuffleResult;
+}
+```
+
+Offering this same API today is doable:
+
+```rust
+impl {element_type}{element_width}x{number_of_lanes} {
+    #[rustc_const_argument(2)] // specifies that indices must be a const
+    #[rustc_platform_intrinsic(simd_shuffle2)]
+    // ^^^ specifies that this method should be treated as the
+    // "platform-intrinsic" "simd_shuffle2"
+    pub fn shuffle2<I, R>(self, other: Self, indices: I)
+        -> <R as ShuffleResult>::ShuffleResultType
+        where R: ShuffleResult;
+
+    #[rustc_const_argument(1)]
+    #[rustc_platform_intrinsic(simd_shuffle1)]
+    pub fn shuffle<I, R>(self, indices: I)
+        -> <R as ShuffleResult>::ShuffleResultType
+        where R: ShuffleResult;
+}
+```
+
+If there is consensus for it, the RFC can be easily amended.
+
+# Prior art
+[prior-art]: #prior-art
+
+All of this is implemented in `stdsimd` and can be used on nightly today via the
+`std::simd` module. The `stdsimd` crate is an effort started by @burntsushi to
+put the `rust-lang-nursery/simd` crate into a state suitable for stabilization.
+The `rust-lang-nursery/simd` crate was mainly developed by @huonw and IIRC it is
+heavily inspired by Dart's SIMD, which is where the `f32x4` naming scheme comes
+from. This RFC has been heavily inspired by Dart, and two of the three
+examples used in the motivation come from the [Using SIMD in
+Dart](https://www.dartlang.org/articles/dart-vm/simd) article written by John
+McCutchan.
+
+# Unresolved questions
+[unresolved]: #unresolved-questions
+
+### Interaction with scalable vectors
+
+The vector types proposed in this RFC are packed, that is, their size is fixed
+at compile-time.
+
+Many modern architectures support vector operations of run-time size, often
+called scalable vectors or "Cray vectors". These include, amongst others, the
+NEC SX, ARM SVE, and the RISC-V vector extension. These architectures have
+traditionally relied on auto-vectorization combined with support for explicit
+vectorization annotations, but newer architectures like ARM SVE and RISC-V
+introduce explicit vectorization intrinsics.
+
+This is an example adapted from this [ARM SVE
+paper](https://developer.arm.com/hpc/arm-scalable-vector-extensions-and-application-to-machine-learning)
+to pseudo-Rust:
+
+```rust
+/// Adds `c` to every element of the slice `src`, storing the result in `dst`.
+fn add_constant(dst: &mut [f64], src: &[f64], c: f64) { + assert!(dst.len() == src.len()); + + // Instantiate a dynamic vector (f64xN) with all lanes set to c: + let vc: f64xN = f64xN::splat(c); + + // The number of lanes that each iteration of the loop can process + // is unknown at compile-time (f64xN::lanes() is evaluated at run-time): + for i in (0..src.len()).step_by_with(f64xN::lanes()) { + + // Instantiate a dynamic boolean vector with the + // result of the predicate: `i + lane < src.len()`. + // This boolean vector acts as a mask, so that elements + // "in-bounds" of the slice `src` are initialized to `true`, + // while out-of-bounds elements contain `false`: + let m: bxN = f64xN::while_lt(i, src.len()); + + // Load the elements of the source using the mask: + let vsrc: f64xN = f64xN::load(m, &src[i..]); + + // Add the vector with the constan using the mask: + let vdst: f64xN = vsrc.add(m, vc); + + // Store the result back to memory using the mask: + vdst.store_unaligned(m, &mut dst[i..]); + } +} +``` + +RISC-V proposes a model similar in spirit, but not identical to the ARM SVE one. +It would not be surprising if other popular architectures offered similar but not necessarily identical explicit vectorization models for scalable vectors in the future. + +The main differences between scalable and portable vectors are that: + +* the number of lanes of scalable vectors is a run-time dynamic value +* the scalable vector "objects" are like magical compiler token values +* the induction loop variable must be incremented by the dynamic number of lanes + of the vector type +* most scalable vector operations require a mask indicating which elements of + the vector the operation applies to + +These differences will probably force the API of scalable vector types to be +slightly different than that of packed vector types. + +The current RFC, therefore, assumes no interaction with scalable vector types. + +It does not prevent for portable scalable vector types to be added to Rust in +the future via an orthogonal API, nor it does prevent adding a way to interact +between both of them (e.g. through memory). But at this point in time whether +these things are possible are open research problems. + +### Half-float support + +Many architectures (ARM, AArch64, PowerPC, MIPS, RISC-V) support half-floats +(`f16`) vector types. It is unclear what to do with these at this point in time +since Rust currently lacks language support for half-float. + +### AVX-512 and m1xN masks support + +Currently, `std::arch` provides very limited AVX-512 support and the prototype +implementation of the `m1xN` masks like `m1x64` in `stdsimd` implements them as +512-bit wide vectors when they actually should only be 64-bit wide. + +Finishing the implementation of these types requires work that just has not been +done yet. + +### Fast math + +The performance of the portable operations can in some cases be significantly +improved by making assumptions about the kind of arithmetic that is allowed. + +For example, some of the horizontal reductions benefit from assuming math to be +finite (no `NaN`s) and others from assuming math to be associative (e.g. it +allows tree-like reductions from sums). + +A future RFC could add more reduction variants with different requirements and +performance characteristics, for example, `.sum_unordered()` or +`.hmax_nanless()`, but these are not considered in this RFC because +their interaction with fast-math is unclear. 
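+
+For comparison, NaN handling (though not the reassociation itself) can already
+be controlled in user code with the proposed comparison, `select`, and reduction
+operations. A minimal sketch (the name `hmax_ignore_nan` is just an example, not
+an API proposed here):
+
+```rust
+fn hmax_ignore_nan(x: f32x4) -> f32 {
+    // `NaN != NaN`, so lanes holding a `NaN` compare `false` here:
+    let not_nan = x.eq(x);
+    // Replace the `NaN` lanes with negative infinity before reducing:
+    not_nan.select(x, f32x4::splat(std::f32::NEG_INFINITY)).hmax()
+}
+```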
+
+A potentially better idea would be to allow users to specify the assumptions
+that an optimizing compiler can make about floating-point arithmetic in a
+finer-grained way.
+
+For example, we could design an `#[fp_math]` attribute usable at, for example,
+crate, module, function, and block scope, so that users can specify exactly
+which IEEE 754 restrictions the compiler is allowed to lift where:
+
+```rust
+fn foo(x: f32x4, y: f32x4) -> f32 {
+    let (w, z) =
+        #[fp_math(assume = "associativity")] {
+            // All fp math is associative, reductions can be unordered:
+            let w = x.sum();
+            let z = y.sum();
+            (w, z)
+        };
+
+    let m = f32x4::splat(w + z) * (x + y);
+
+    #[fp_math(assume = "finite")] {
+        // All fp math is assumed finite, the reduction can assume NaNs
+        // aren't present:
+        m.hmax()
+    }
+}
+```
+
+There are obviously many approaches to tackling this problem, but it makes
+sense to have a plan for it before workarounds start getting bolted into
+RFCs like this one.
+
+### Endian-dependent behavior
+
+The results of the indexed operations (`extract`, `replace`, `store`) and the
+`new` method are endian-independent. That is, the following example is
+guaranteed to pass on little-endian (LE) and big-endian (BE) architectures:
+
+```rust
+let v = i32x4::new(0, 1, 2, 3);
+assert_eq!(v.extract(0), 0); // OK in LE and BE
+assert_eq!(v.extract(3), 3); // OK in LE and BE
+```
+
+The result of bit-casting between two equally-sized vectors using
+`mem::transmute` is, however, endian-dependent:
+
+```rust
+let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+let t: i16x8 = unsafe { mem::transmute(x) }; // UNSAFE
+if cfg!(target_endian = "little") {
+    let t_el = i16x8::new(256, 770, 1284, 1798, 2312, 2826, 3340, 3854);
+    assert_eq!(t, t_el); // OK in LE | (would) ERROR in BE
+} else if cfg!(target_endian = "big") {
+    let t_eb = i16x8::new(1, 515, 1029, 1543, 2057, 2571, 3085, 3599);
+    assert_eq!(t, t_eb); // OK in BE | (would) ERROR in LE
+}
+```
+
+which applies to memory loads and stores as well:
+
+```rust
+let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+let mut y: [i16; 8] = [0; 8];
+x.store_unaligned(unsafe {
+    slice::from_raw_parts_mut(&mut y as *mut _ as *mut i8, 16)
+});
+
+if cfg!(target_endian = "little") {
+    let e: [i16; 8] = [256, 770, 1284, 1798, 2312, 2826, 3340, 3854];
+    assert_eq!(y, e);
+} else if cfg!(target_endian = "big") {
+    let e: [i16; 8] = [1, 515, 1029, 1543, 2057, 2571, 3085, 3599];
+    assert_eq!(y, e);
+}
+
+let z = i8x16::load_unaligned(unsafe {
+    slice::from_raw_parts(&y as *const _ as *const i8, 16)
+});
+assert_eq!(z, x);
+```
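+
+When the lane reinterpretation itself is needed, it can be written in an
+endian-independent way by going through memory with an explicit byte order. A
+minimal sketch (the helper name `reinterpret_as_i16x8_le` is just an example,
+not an API proposed here):
+
+```rust
+// Reinterprets the 16 bytes of `x` as 8 little-endian `i16` lanes,
+// producing the same result on LE and BE targets.
+fn reinterpret_as_i16x8_le(x: i8x16) -> i16x8 {
+    let mut bytes = [0_i8; 16];
+    x.store_unaligned(&mut bytes);
+    let mut lanes = [0_i16; 8];
+    for i in 0..8 {
+        let lo = bytes[2 * i] as u8 as u16;
+        let hi = bytes[2 * i + 1] as u8 as u16;
+        lanes[i] = (lo | (hi << 8)) as i16;
+    }
+    i16x8::load_unaligned(&lanes)
+}
+```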