  • Feature Name: portable_packed_vector_types
  • Start Date: (fill me in with today's date, YYYY-MM-DD)
  • RFC PR: (leave this empty)
  • Rust Issue: (leave this empty)

Summary

This RFC adds portable packed SIMD vector types up to 256-bit.

Future RFCs will attempt to answer some of the unresolved questions and might potentially cover extensions as they mature in stdsimd, like, for example, portable memory gather and scatter operations, m1xN vector masks, masked arithmetic/bitwise/shift operations, etc.

Motivation

The std::arch module exposes architecture-specific SIMD types like __m128 - a 128-bit wide SIMD vector type. How these bits are interpreted depends on the intrinsic being used. For example, let's sum eight f32 values using the SSE4.1 facilities in the std::arch module. This is one way to do it (playground):

unsafe fn add_reduce(a: __m128, b: __m128) -> f32 {
    let c = _mm_hadd_ps(a, b);
    let c = _mm_hadd_ps(c, _mm_setzero_ps());
    let c = _mm_hadd_ps(c, _mm_setzero_ps());
    std::mem::transmute(_mm_extract_ps(c, 0))
}

fn main() {
    unsafe {
        let a = _mm_set_ps(1., 2., 3., 4.);
        let b = _mm_set_ps(5., 6., 7., 8.);
        let r = add_reduce(a, b);
        assert_eq!(r, 36.);
    }
}

Notice that:

  • one has to put in some effort to work out from add_reduce's signature what kind of vectors it actually expects: "add_reduce takes 128-bit wide vectors and returns an f32, therefore those 128-bit vectors probably contain 4 packed f32s, because that's the only combination of f32s that fits in 128 bits!"

  • it requires a lot of unsafe code: the intrinsics are unsafe (which could be improved via RFC2122), the intrinsic API relies on the user performing transmutes, constructing the vectors is unsafe because it needs to be done via intrinsic calls, etc.

  • it requires a lot of architecture specific knowledge: how the intrinsics are called, how they are used together

  • this solution only works on x86 or x86_64 with SSE4.1 enabled, that is, it is not portable.

With portable packed vector types, we can do much better (playground):

fn main() {
    let a = f32x4::new(1., 2., 3., 4.);
    let b = f32x4::new(5., 6., 7., 8.);
    let r = (a + b).sum();
    assert_eq!(r, 36.);
}

These types add zero overhead over the architecture-specific types for the operations that they support - if there is an architecture in which this does not hold for some operation, the implementation has a bug.

The motivation of this RFC is to provide reasonably high-level, reliable, and portable access to common SIMD vector types and SIMD operations.

At a higher level, the actual use cases for these specialty instructions are boundless. SIMD intrinsics are used in graphics, multimedia, linear algebra, scientific computing, games, cryptography, text search, machine learning, low latency, and more. There are many crates in the Rust ecosystem using SIMD intrinsics today, either through stdsimd, the simd crate, or both. Some examples include:

  • encoding_rs which uses the simd crate to assist with speedy decoding.
  • bytecount which uses the simd crate with AVX2 extensions to accelerate counting bytes.
  • regex which uses the stdsimd crate with SSSE3 extensions to accelerate multiple substrings search and also its parent crate teddy.

However, providing portable SIMD algorithms for all application domains is not the intent of this RFC.

The purpose of this RFC is to provide users with vocabulary types and fundamental operations that they can build upon in their own crates to effectively implement SIMD algorithms in their respective application domains.

These types are meant to be extended by users with portable (or nonportable) SIMD operations in their own crates, for example, via extension traits or new types.

The operations provided in this RFC are thus either:

  • fundamental: that is, they build the foundation required to write higher-level SIMD algorithms. These include, amongst others, instantiating vector types, reads/writes from memory, masks and branchless conditional operations, and type casts and conversions.

  • required: to be part of std. These include backend-specific compiler intrinsics that we might never want to stabilize as well as the implementations of std library traits which, due to trait coherence, users cannot extend the vector types with.

Guide-level explanation

This RFC extends Rust with portable packed SIMD vector types, a set of types used to perform explicit vectorization:

  • SIMD: stands for Single Instruction, Multiple Data. This RFC uses this term in the context of hardware instruction set architectures (ISAs) to refer to:

    • SIMD instructions: instructions that (typically) perform operations on multiple values simultaneously, and
    • SIMD registers: the registers that the SIMD instructions take as operands. These registers (typically) store multiple values that are operated upon simultaneously by SIMD instructions.
  • vector types: types that abstract over memory stored in SIMD registers, allowing memory to be transferred to/from the registers and operations to be performed directly on these registers.

  • packed: means that these vectors have a compile-time fixed size. It is the opposite of scalable or "Cray vectors", which are SIMD vector types with a dynamic size, that is, whose size is only known at run-time.

  • explicit vectorization: vectorization is the process of producing programs that operate on multiple values simultaneously (typically) using SIMD instructions and registers. Automatic vectorization is the process by which the Rust compiler is, in some cases, able to transform scalar Rust code, that is, code that does not use SIMD vector types, into machine code that does use SIMD registers and instructions automatically (without user intervention). Explicit vectorization is the process by which a Rust user manually writes Rust code that states what kind of SIMD registers are to be used and what SIMD instructions are executed on them.

  • portable: is the opposite of architecture-specific. These types work both correctly and efficiently on all architectures. They are a zero-overhead abstraction, that is, for the operations that these types support, one cannot write better code by hand (otherwise, it is an implementation bug).

  • masks: are vector types used to select vector elements on which operations are to be performed. This selection is performed by setting or clearing the bits of the masks for a particular lane.

Packed vector types are denoted as follows: {i,u,f,m}{lane_width}x{#lanes}, so that i64x8 is a 512-bit vector with eight i64 lanes and f32x4 a 128-bit vector with four f32 lanes. Here:

  • number of lanes (#lanes): the number of values of a particular type stored in a vector - the vector operations act on these values simultaneously.

  • lane width: the bit width of a vector lane, that is, the bit width of the objects stored in the vector. For example, the type f32 is 32 bits wide.

That is, the m8x4 type is a 32-bit wide vector mask with 4 lanes containing an 8-bit wide mask each. Vector masks are mainly used to select the lanes on which vector operations are performed. When a lane has all of its bits set to true, that lane is "selected", and when a lane has all of its bits set to false, that lane is "not selected". The following bit pattern is thus a valid bit-pattern for the m8x4 mask:

00000000_11111111_00000000_11111111

and it selects two eight-bit wide lanes from a 32-bit wide vector type with four lanes. The following bit-pattern is not, however, a valid value of the same mask type:

00000000_11111111_00000000_11110111

because it does not satisfy the invariant that all bits of a lane must be either set or cleared.

Operations on vector types can be either:

  • vertical: that is, lane-wise. For example, a + b adds each lane of a with the corresponding lane of b, while a.lt(b) returns a boolean mask that indicates whether the less-than (<, lt) comparison returned true or false for each of the vector lanes. Most vertical operations are binary operations (they take two input vectors). These operations are typically very fast on most architectures and they are the most widely used in practice.

  • horizontal: that is, along a single vector - they are unary operations. For example, a.sum() adds the elements of a vector together while a.max_element() returns the largest element in a vector. These operations (typically) translate to a sequence of multiple SIMD instructions on most architectures and are therefore slower. In many cases, they are, however, necessary.
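To make the distinction concrete, here is a minimal sketch contrasting a vertical addition with a horizontal reduction, assuming the f32x4 API proposed later in this RFC:

// A minimal sketch contrasting vertical and horizontal operations,
// assuming the `f32x4` API proposed in this RFC.
let a = f32x4::new(1., 2., 3., 4.);
let b = f32x4::new(10., 20., 30., 40.);

// Vertical (lane-wise): typically a single SIMD instruction.
let v = a + b;
assert_eq!(v, f32x4::new(11., 22., 33., 44.));

// Horizontal (along one vector): typically a sequence of instructions.
assert_eq!(v.sum(), 110.);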

Example: Average

The first example computes the arithmetic average of the elements in a slice. Sequentially, we would write it using iterators as follows:

/// Arithmetic average of the elements in `xs`.
fn average_seq(xs: &[f32]) -> f32 {
    if !xs.is_empty() {
        xs.iter().sum::<f32>() / xs.len() as f32
    } else {
        0.
    }
}

The following implementation uses the 256-bit SIMD facilities provided by this RFC. As the name suggests, it will be "slow":

/// Computes the arithmetic average of the elements in the list.
///
/// # Panics
///
/// If `xs.len()` is not a multiple of `8`.
fn average_slow256(xs: &[f32]) -> f32 {
    // The 256-bit wide floating-point vector type is f32x8. To
    // avoid handling extra elements in this example we just panic.
    assert!(xs.len() % 8 == 0, 
            "input length `{}` is not a multiple of 8", 
            xs.len());
    
    let mut result = 0.0_f32;  // This is where we store the result
    
    // We iterate over the input slice with a step of `8` elements:
    for i in (0..xs.len()).step_by(8) {
        // First, we read the next `8` elements into an `f32x8`.
        // Since we haven't checked whether the input slice
        // is aligned to the alignment of `f32x8`, we perform
        // an unaligned memory read.
        let data = f32x8::read_unaligned(&xs[i..i + 8]);

        // With the elements in the vector, we perform a horizontal reduction
        // and add them to the result.
        result += data.sum();
    }
    result / xs.len() as f32
}

As mentioned, this function is "slow" - why is that? The main issue is that, on most architectures, horizontal reductions must perform a sequence of SIMD operations while vertical operations typically require only a single instruction.

We can significantly improve the performance of our algorithm by writing it in such a way that the number of horizontal reductions performed is reduced.

fn average_fast256(xs: &[f32]) -> f32 {
    assert!(xs.len() % 8 == 0, 
            "input length `{}` is not a multiple of 8", 
            xs.len());
    
    // Our temporary result is now a f32x8 vector:
    let mut result = f32x8::splat(0.);
    for i in (0..xs.len()).step_by(8) {
        let data = f32x8::read_unaligned(&xs[i..i + 8]);
        // This adds the data elements to our temporary result using 
        // a vertical lane-wise simd operation - this is a single SIMD
        // instruction on most architectures.
        result += data; 
    }
    // Perform a single horizontal reduction at the end:
    result.sum() / xs.len() as f32
}

The performance could be further improved by requiring the input data to be aligned to the alignment of f32x8, and/or by handling the elements before the next alignment boundary in a special way.

Example: scalar-vector multiply even

To showcase the mask and select APIs, the following function multiplies the even elements of a vector by a scalar:

fn mul_even(a: f32, x: f32x4) -> f32x4 {
    // Create a vector mask for the even elements 0 and 2.
    // The vector mask API uses `bool`s to set or clear 
    // all bits of a lane:
    let m = m32x4::new(true, false, true, false);

    // Perform a full multiplication
    let r = f32x4::splat(a) * x;
    
    // Use the mask to select the even elements from the
    // multiplication result and the odd elements from
    // the input:
    m.select(r, x)
}

Example: 4x4 Matrix multiplication

To showcase the shuffle API, the following function implements a 4x4 matrix multiplication using 128-bit wide vectors.

fn mul4x4(a: [f32x4; 4], b: [f32x4; 4]) -> [f32x4; 4] {
    let mut r = [f32x4::splat(0.); 4];
    
    for i in 0..4 {
        r[i] = 
            a[0] * shuffle!(b[i], [0,0,0,0]) + 
            a[1] * shuffle!(b[i], [1,1,1,1]) +
            a[2] * shuffle!(b[i], [2,2,2,2]) +
            a[3] * shuffle!(b[i], [3,3,3,3]);
    }
    r
}

Reference-level explanation

Vector types

The vector types are named according to the following scheme:

{element_type}{lane_width}x{number_of_lanes}

where the following element types are introduced by this RFC:

  • i: signed integer
  • u: unsigned integer
  • f: float
  • m: mask

So that u16x8 reads "a SIMD vector of eight packed 16-bit wide unsigned integers". The width of a vector can be computed by multiplying the {lane_width} times the {number_of_lanes}. For u16x8, 16 x 8 = 128, so this vector type is 128 bits wide.

This RFC proposes adding all vector types with sizes in the range [16, 256] bits to the std::simd module, that is:

  • 16-bit wide vectors: i8x2, u8x2, m8x2
  • 32-bit wide vectors: i8x4, u8x4, m8x4, i16x2, u16x2, m16x2
  • 64-bit wide vectors: i8x8, u8x8, m8x8, i16x4, u16x4, m16x4, i32x2, u32x2, f32x2, m32x2
  • 128-bit wide vectors: i8x16, u8x16, m8x16, i16x8, u16x8, m16x8, i32x4, u32x4, f32x4, m32x4, i64x2, u64x2, f64x2, m64x2
  • 256-bit wide vectors: i8x32, u8x32, m8x32, i16x16, u16x16, m16x16, i32x8, u32x8, f32x8, m32x8, i64x4, u64x4, f64x4, m64x4

Note that this list is not comprehensive. In particular:

  • half-float f16xN: these vectors are supported in many architectures (ARM, AArch64, PowerPC64, RISC-V, MIPS, ...) but their support is blocked on Rust half-float support.
  • AVX-512 vector types, not only 512-bit wide vector types, but also m1xN vector masks. These are blocked on std::arch AVX-512 support.
  • other vector types: x86, AArch64, PowerPC and others include types like i64x1, u64x1, f64x1, m64x1, i128x1, u128x1, m128x1, ... These can always be added later as the need for them arises, potentially in combination with the stabilization of the std::arch intrinsics for those architectures.

Layout of vector types

The portable packed SIMD vector types introduced in this RFC are layout compatible with the architecture-specific vector types. That is:

union A {
   port: f32x4,
   arch: __m128,
}
let x: __m128 = unsafe { _mm_setr_ps(0.0, 1.0, 2.0, 3.0) };
let y: f32x4 = unsafe { A { arch: x }.port };
assert_eq!(y.extract(0), 0.0);  // OK
assert_eq!(y.extract(1), 1.0);  // OK
assert_eq!(y.extract(2), 2.0);  // OK
assert_eq!(y.extract(3), 3.0);  // OK

The portable packed SIMD vector types are also layout compatible with arrays of equal element type and whose length equals the number of vector lanes. That is:

union A {
   port: f32x4,
   arr: [f32; 4],
}
let x: [f32; 4] = [0.0, 1.0, 2.0, 3.0];
let y: f32x4 = unsafe { A { arr: x }.port };
assert_eq!(y.extract(0), 0.0);  // OK
assert_eq!(y.extract(1), 1.0);  // OK
assert_eq!(y.extract(2), 2.0);  // OK
assert_eq!(y.extract(3), 3.0);  // OK

This transitively makes both portable packed and architecture specific SIMD vector types layout compatible with all other types that are also layout compatible with these array types.

API of portable packed SIMD vector types

Traits overview

All vector types implement the following traits:

  • Copy
  • Clone
  • Default: zero-initializes the vector.
  • Debug: formats the vector as ({}, {}, ...).
  • PartialEq<Self>: performs a lane-wise comparison between two vectors and returns true if all lanes compare true. It is equivalent to a.eq(b).all().
  • PartialOrd<Self>: compares two vectors lexicographically.
  • From/Into lossless casts between vectors with the same number of lanes.

All signed integer, unsigned integer, and floating point vector types implement the following traits:

  • {Add,Sub,Mul,Div,Rem}<RHS=Self,Output=Self>, {Add,Sub,Mul,Div,Rem}Assign<RHS=Self>: vertical (lane-wise) arithmetic operations.

All signed and unsigned integer vectors and vector masks also implement:

  • Eq: equivalent to PartialEq<Self>
  • Ord: equivalent to PartialOrd<Self>
  • Hash: equivalent to Hash for [element_type; number_of_elements].
  • fmt::LowerHex/fmt::UpperHex: formats the vector as hexadecimal.
  • fmt::Octal: formats the vector as an octal number.
  • fmt::Binary: formats the vector as a binary number.
  • Not<Output=Self>: vertical (lane-wise) negation,
  • Bit{And,Or,Xor}<RHS=Self,Output=Self>, Bit{And,Or,Xor}Assign<RHS=Self>: vertical (lane-wise) bitwise operations.

All signed and unsigned integer vectors also implement:

  • {Shl,Shr}<RHS=Self,Output=Self>, {Shl,Shr}Assign<RHS=Self>: vertical (lane-wise) bit-shift operations.

Note: While IEEE 754-2008 provides total ordering predicates for floating-point numbers, Rust does not implement Eq and Ord for the f32 and f64 primitive types. This RFC follows suit and does not propose to implement Eq and Ord for vectors of floating-point types. Any future RFC that might want to extend Rust with a total order for floats should extend the portable floating-point vector types with it as well. See this internal thread for more information.
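For illustration, a minimal sketch of some of these trait impls in use, assuming the vector types and From impls described in this RFC:

// A minimal sketch of the traits above in use, assuming the
// vector types and `From` impls proposed in this RFC.
let a = u8x4::new(1, 2, 3, 4);
let b = u8x4::splat(1) + u8x4::new(0, 1, 2, 3); // vertical addition

// `PartialEq<Self>`: true only if *all* lanes compare equal.
assert_eq!(a, b);

// `From`/`Into`: lossless widening cast with the same number of lanes.
let wide: u16x4 = a.into();
assert_eq!(wide, u16x4::new(1, 2, 3, 4));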

Inherent Methods

Construction and element access

All portable signed integer, unsigned integer, and floating-point vector types implement the following methods:

impl {element_type}{lane_width}x{number_of_lanes} {
/// Creates a new instance of the vector from `number_of_lanes` 
/// values.
pub const fn new(args...: element_type) -> Self;

/// Returns the number of vector lanes.
pub const fn lanes() -> usize;

/// Constructs a new instance with each element initialized to
/// `value`.
pub const fn splat(value: element_type) -> Self;

/// Extracts the value at `index`.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
pub fn extract(self, index: usize) -> element_type;

/// Extracts the value at `index`.
///
/// If `index >= Self::lanes()` the behavior is undefined.
pub unsafe fn extract_unchecked(self, index: usize) -> element_type;

/// Returns a new vector where the value at `index` is replaced by `new_value`.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
#[must_use = error-message]
pub fn replace(self, index: usize, new_value: element_type) -> Self;

/// Returns a new vector where the value at `index` is replaced by `new_value`.
#[must_use = error-message]
pub unsafe fn replace_unchecked(self, index: usize, 
                                new_value: element_type) -> Self;
}
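A minimal sketch of these constructors and accessors in use, assuming the f32x4 API above:

// A minimal sketch of construction and element access.
let x = f32x4::new(0., 1., 2., 3.);
assert_eq!(f32x4::lanes(), 4);
assert_eq!(x.extract(2), 2.);

// `replace` returns a new vector; the original is left untouched.
let y = x.replace(0, 42.);
assert_eq!(y.extract(0), 42.);
assert_eq!(x.extract(0), 0.);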

Reads and Writes

Contiguous reads and writes

All portable vector types implement the following methods:

impl {element_type}{lane_width}x{number_of_lanes} {
/// Writes the values of the vector to the `slice` without 
/// reading or dropping the old value.
///
/// # Panics
///
/// If `slice.len() != Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary.
pub fn write_aligned(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without 
/// reading or dropping the old value.
///
/// # Panics
///
/// If `slice.len() != Self::lanes()`.
pub fn write_unaligned(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without 
/// reading or dropping the old value.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary, the behavior is
/// undefined.
pub unsafe fn write_aligned_unchecked(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without reading 
/// or dropping the old value.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` the behavior is undefined.
pub unsafe fn write_unaligned_unchecked(self, slice: &mut [element_type]);

/// Instantiates a new vector with the values of the `slice` without 
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Panics
///
/// If `slice.len() != Self::lanes()` or `&slice[0]` is not aligned
/// to an `align_of::<Self>()` boundary.
pub fn read_aligned(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without 
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Panics
///
/// If `slice.len() != Self::lanes()`.
pub fn read_unaligned(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without 
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
/// to an `align_of::<Self>()` boundary, the behavior is undefined.
pub unsafe fn read_aligned_unchecked(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without 
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` the behavior is undefined.
pub unsafe fn read_unaligned_unchecked(slice: &[element_type]) -> Self;
}
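For example, a minimal round-trip through a slice, assuming the read/write API above:

// A minimal round-trip through a slice, assuming the API above.
let v = i32x4::new(1, 2, 3, 4);
let mut buf = [0_i32; 4];

// `write_unaligned` panics if `buf.len() != i32x4::lanes()`.
v.write_unaligned(&mut buf);
assert_eq!(buf, [1, 2, 3, 4]);

let w = i32x4::read_unaligned(&buf);
assert_eq!(w, v);
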
Discontinuous masked reads and writes (scatter and gather)

Vector masks implement the following methods:

impl m{lane_width}x{number_of_lanes} {
/// Instantiates a new vector with the values of the `slice` located at 
/// the `offset`s without moving them for which the mask (`self`) is `true`
/// and with the values of `default` otherwise. The memory of the `slice` at 
/// the `offsets` for which the mask is `false` is not read.
///
/// # Precondition
///
/// If `slice.len() < offset.max_element()` the behavior is undefined.
pub unsafe fn read_scattered_unchecked<T, O, D>(self, slice: &[T], offset: O, default: D) -> D
    where <implementation defined> 
        // for exposition only:
        // number_of_lanes == D::lanes() == O::lanes(), 
        // D::element_type == T,
        // O::element_type == usize,
;

/// Writes the elements of the vector `values` for which the mask (`self`) 
/// is `true` to the `slice` at `offset`s without reading or dropping 
/// the old values. No memory is written to the `slice` elements at 
/// the `offset`s for which the mask is `false`.
///
/// If multiple `offset`s have the same value, that is, if multiple lanes 
/// from `values` are to be written to the same memory location, the writes
/// are ordered from least significant to most significant element.
///
/// # Precondition
///
/// If `slice.len() < offset.max_element()` the behavior is undefined.
pub unsafe fn write_scattered_unchecked<T, O, D>(self, slice: &mut [T], offset: O, values: D)
    where <implementation defined> 
        // for exposition only:
        // number_of_lanes == D::lanes() == O::lanes(), 
        // D::element_type == T,
        // O::element_type == usize,
;
}
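As an illustration only: the exact offset-vector type and the where clause bounds are implementation defined, so the usizex4 type used below is hypothetical, but the shapes follow the signatures above:

// Illustration only: the `usizex4` offset vector type is hypothetical;
// the exact `where` bounds are implementation defined.
let xs: [i32; 8] = [10, 11, 12, 13, 14, 15, 16, 17];
let offsets = usizex4::new(0, 2, 4, 6);
let mask = m32x4::new(true, true, false, true);
let default = i32x4::splat(-1);

// Lanes 0, 1, and 3 read `xs[0]`, `xs[2]`, and `xs[6]`;
// lane 2 keeps the value from `default` and its memory is not read.
let r = unsafe { mask.read_scattered_unchecked(&xs, offsets, default) };
assert_eq!(r, i32x4::new(10, 12, -1, 16));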

Vertical arithmetic operations

Vertical (lane-wise) arithmetic operations are provided by the following trait implementations:

  • All signed integer, unsigned integer, and floating point vector types implement:

    • {Add,Sub,Mul,Div,Rem}<RHS=Self,Output=Self>
    • {Add,Sub,Mul,Div,Rem}Assign<RHS=Self>
  • All signed and unsigned integer vectors also implement:

    • {Shl,Shr}<RHS=Self,Output=Self>, {Shl,Shr}Assign<RHS=Self>: vertical (lane-wise) bit-shift operations.
Integer vector semantics

The behavior of these operations for integer vectors is the same as that of the scalar integer types, that is: panic! on overflow when -C overflow-checks=on is enabled, and panic! on division by zero.

Floating-point semantics

The behavior of these operations for floating-point numbers is the same as that of the scalar floating-point types, that is, ±INFINITY on overflow and on division of a non-zero value by zero, NaN for 0.0 / 0.0, etc.

Wrapping arithmetic operations

All signed and unsigned integer vector types implement the whole set of pub fn wrapping_{add,sub,mul,div,rem}(self, Self) -> Self methods which, on overflow, produce the correct mathematical result modulo 2^n.

The div and rem methods panic! on division by zero in debug mode.
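For example, a minimal sketch of the wrapping semantics, assuming the i8x4 type above:

// A minimal sketch of the wrapping semantics, assuming `i8x4`.
let x = i8x4::splat(127);
let one = i8x4::splat(1);

// Each lane wraps around independently: 127 + 1 == -128 (mod 2^8).
assert_eq!(x.wrapping_add(one), i8x4::splat(-128));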

Unsafe wrapping arithmetic operations

All signed and unsigned integer vectors implement pub unsafe fn wrapping_{div,rem}_unchecked(self, Self) -> Self methods which, on overflow, produce the correct mathematical result modulo 2^n.

If any of the vector elements is divided by zero the behavior is undefined.

Saturating arithmetic operations

All signed and unsigned integer vector types implement the whole set of pub fn saturating_{add,sub,mul,div,rem}(self, Self) -> Self methods which saturate on overflow.

The div and rem methods panic! on division by zero in debug mode.

Unsafe saturating arithmetic operations

All signed and unsigned integer vectors implement pub unsafe fn saturating_{div,rem}_unchecked(self, Self) -> Self methods which saturate on overflow.

If any of the vector elements is divided by zero the behavior is undefined.

Binary min/max vertical operations

All portable signed integer, unsigned integer, and floating-point vectors implement the following methods:

impl {element_type}{lane_width}x{number_of_lanes} {
/// Lane-wise `min`.
///
/// Returns a vector whose lanes contain the smallest 
/// element of the corresponding lane of `self` and `other`.
pub fn min(self, other: Self) -> Self;

/// Lane-wise `max`.
///
/// Returns a vector whose lanes contain the largest 
/// element of the corresponding lane of `self` and `other`.
pub fn max(self, other: Self) -> Self;
}
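A minimal sketch of the lane-wise min/max, assuming the i32x4 type above:

// A minimal sketch of lane-wise `min`/`max`, assuming `i32x4`.
let a = i32x4::new(1, 5, 3, 7);
let b = i32x4::new(4, 2, 3, 8);
assert_eq!(a.min(b), i32x4::new(1, 2, 3, 7));
assert_eq!(a.max(b), i32x4::new(4, 5, 3, 8));
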
Floating-point semantics

The floating-point semantics follow the semantics of min and max for the scalar f32 and f64 types.

Floating-point vertical math operations

All portable floating-point vector types implement the following methods:

impl f{lane_width}x{number_of_lanes} {
    /// Square-root
    fn sqrt(self) -> Self;
    /// Reciprocal square-root estimate
    ///
    /// **FIXME**: an upper bound on the error should
    /// be guaranteed before stabilization.
    fn rsqrte(self) -> Self;
    /// Fused multiply add: `self * b + c`
    fn fma(self, b: Self, c: Self) -> Self;
}
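A minimal sketch of these operations in use, assuming the f32x4 type above:

// A minimal sketch of the floating-point math operations, assuming `f32x4`.
let x = f32x4::new(1., 4., 9., 16.);
assert_eq!(x.sqrt(), f32x4::new(1., 2., 3., 4.));

// Fused multiply-add: `x * b + c`, lane-wise.
let r = x.fma(f32x4::splat(2.), f32x4::splat(1.));
assert_eq!(r, f32x4::new(3., 9., 19., 33.));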

Arithmetic reductions

Integers

All portable signed and unsigned integer vector types implement the following methods:

impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal wrapping sum of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for a 4 element vector:
///
/// > (x0.wrapping_add(x1)).wrapping_add(x2.wrapping_add(x3))
///
/// If an operation overflows, it returns the mathematical result
/// modulo `2^n`, where `n` is the bit width of the element type.
pub fn wrapping_sum(self) -> element_type;

/// Horizontal wrapping product of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for a 4 element vector:
///
/// > (x0.wrapping_mul(x1)).wrapping_mul(x2.wrapping_mul(x3))
///
/// If an operation overflows, it returns the mathematical result
/// modulo `2^n`, where `n` is the bit width of the element type.
pub fn wrapping_product(self) -> element_type;
}
Floating-point

All portable floating-point vector types implement the following methods:

impl f{lane_width}x{number_of_lanes} {
/// Horizontal sum of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for an 8 element vector:
///
/// > ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7))
///
/// If one of the vector elements is `NaN`, the reduction returns
/// `NaN`. The resulting `NaN` is not required to be equal to any
/// of the `NaN`s in the vector.
pub fn sum(self) -> element_type;

/// Horizontal product of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for an 8 element vector:
///
/// > ((x0 * x1) * (x2 * x3)) * ((x4 * x5) * (x6 * x7))
///
/// If one of the vector elements is `NaN`, the reduction returns
/// `NaN`. The resulting `NaN` is not required to be equal to any
/// of the `NaN`s in the vector.
pub fn product(self) -> element_type;
}
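A minimal sketch of a floating-point reduction, assuming the f32x4 type above:

// A minimal sketch of the horizontal reductions, assuming `f32x4`.
let x = f32x4::new(1., 2., 3., 4.);
assert_eq!(x.sum(), 10.);
assert_eq!(x.product(), 24.);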

Bitwise reductions

All signed and unsigned integer vectors implement the following methods:

impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal bitwise `and` of the vector elements.
pub fn and(self) -> element_type;

/// Horizontal bitwise `or` of the vector elements.
pub fn or(self) -> element_type;

/// Horizontal bitwise `xor` of the vector elements.
pub fn xor(self) -> element_type;
}

Min/Max reductions

All portable signed integer, unsigned integer, and floating-point vector types implement the following methods:

impl {element_type}{lane_width}x{number_of_lanes} {
/// Largest vector element value.
pub fn max_element(self) -> element_type;

/// Smallest vector element value.
pub fn min_element(self) -> element_type;
}

Note: the semantics of {min,max}_element for floating-point numbers are the same as that of their min/max methods.

Mask construction and element access

impl m{lane_width}x{number_of_lanes} {
/// Creates a new vector mask from `number_of_lanes` boolean
/// values.
///
/// The values `true` and `false` respectively set and clear 
/// the mask for a particular lane.
pub const fn new(args...: bool) -> Self;

/// Returns the number of vector lanes.
pub const fn lanes() -> usize;

/// Constructs a new vector mask with all lane-wise 
/// masks either set, if `value` equals `true`, or cleared, if 
/// `value` equals `false`.
pub const fn splat(value: bool) -> Self;

/// Returns `true` if the mask for the lane `index` is 
/// set and `false` otherwise.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
pub fn extract(self, index: usize) -> bool;

/// Returns `true` if the mask for the lane `index` is 
/// set and `false` otherwise.
///
/// If `index >= Self::lanes()` the behavior is undefined.
pub unsafe fn extract_unchecked(self, index: usize) -> bool;

/// Returns a new vector mask where the mask of the lane `index` is
/// set if `new_value` is `true` and cleared otherwise.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
#[must_use = error-message]
pub fn replace(self, index: usize, new_value: bool) -> Self;

/// Returns a new vector mask where the mask of the lane `index` is
/// set if `new_value` is `true` and cleared otherwise.
///
/// If `index >= Self::lanes()` the behavior is undefined.
#[must_use = error-message]
pub unsafe fn replace_unchecked(self, index: usize, new_value: bool) -> Self;
}

Mask reductions

All vector masks implement the following methods:

impl m{lane_width}x{number_of_lanes} {
/// Are "all" lanes `true`?
pub fn all(self) -> bool;

/// Is "any" lane `true`?
pub fn any(self) -> bool;

/// Are "all" lanes `false`?
pub fn none(self) -> bool;
}
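A minimal sketch of the mask reductions, assuming the m32x4 type above:

// A minimal sketch of the mask reductions, assuming `m32x4`.
let m = m32x4::new(true, false, true, false);
assert!(m.any());
assert!(!m.all());
assert!(!m.none());
assert!(m32x4::splat(false).none());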

Mask vertical selection

All vector masks implement the following method:

impl m{lane_width}x{number_of_lanes} {
/// Lane-wise selection. 
///
/// The lanes of the result for which the mask is `true` contain
/// the values of `a` while the remaining lanes contain the values of `b`.
pub fn select<T>(self, a: T, b: T) -> T
    where <implementation-defined>
        // for exposition only:
        // T::lanes() == number_of_lanes,
;
}

Note: how the where clause is enforced is an implementation detail. stdsimd implements this using a sealed trait:

pub fn select<T>(self, a: T, b: T) -> T
    where T: SelectMask<Self>;

Vertical comparisons

All vector types implement the following vertical (lane-wise) comparison methods that return a mask expressing the result.

impl {element_type}{lane_width}x{number_of_lanes} {
/// Lane-wise equality comparison.
/// Lane-wise equality comparison.
pub fn eq(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise inequality comparison.
pub fn ne(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise less-than comparison.
pub fn lt(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise less-than-or-equals comparison.
pub fn le(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise greater-than comparison.
pub fn gt(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise greater-than-or-equals comparison.
pub fn ge(self, other: Self) -> m{lane_width}x{number_of_lanes};
}

For all vector types proposed in this RFC, the {lane_width} of the mask matches that of the vector type. However, this will not be the case for the AVX-512 vector types.
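A minimal sketch combining a lane-wise comparison with a mask reduction and select, assuming the i32x4 and m32x4 types above:

// A minimal sketch of the vertical comparisons, assuming `i32x4`.
let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(4, 3, 2, 1);

let m = a.lt(b);              // m32x4(true, true, false, false)
assert!(m.any() && !m.all());

// Lane-wise minimum built from the comparison mask:
assert_eq!(m.select(a, b), i32x4::new(1, 2, 2, 1));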

Semantics for floating-point numbers

The semantics of the lane-wise comparisons for floating point numbers are the same as in the scalar case.

Portable vector shuffles

/// Shuffles vector elements.
std::simd::shuffle!(...);

The shuffle! macro returns a new vector that contains a shuffle of the elements in one or two input vectors. There are two versions:

  • shuffle!(vec, indices): one-vector version
  • shuffle!(vec0, vec1, indices): two-vector version

with the following preconditions:

  • vec, vec0, and vec1 must be portable packed SIMD vector types.
  • vec0 and vec1 must have the same type.
  • indices must be a const array of type [usize; N] where N is any power-of-two in range (0, 2 * {vec,vec0,vec1}::lanes()].
  • the values of indices must be in range [0, N) for the one-vector version, and in range [0, 2N) for the two-vector version.

On precondition violation a type error is produced.

The macro returns a new vector whose:

  • element type equals that of the input vectors,
  • length equals N, that is, the length of the indices array

The i-th element of indices with value j in range [0, N) stores the j-th element of the first vector into the i-th element of the result vector.

In the two-vector version, the i-th element of indices with value j in range [N, 2N) stores the j - N-th element of the second vector into the i-th element of the result vector.

Example: shuffles

The shuffle! macro allows reordering the elements of a vector:

let x = i32x4::new(1, 2, 3, 4);
let r = shuffle!(x, [2, 1, 3, 0]);
assert_eq!(r, i32x4::new(3, 2, 4, 1));

where the resulting vector can also be smaller:

let r = shuffle!(x, [1, 3]);
assert_eq!(r, i32x2::new(2, 4));

or larger

let r = shuffle!(x, [1, 3, 2, 2, 1, 3, 2, 2]);
assert_eq!(r, i32x8::new(2, 4, 3, 3, 2, 4, 3, 3));

than the input. The length of the result must be, however, limited to the range [2, 2 * vec::lanes()].

It also allows shuffling between two vectors

let y = i32x4::new(5, 6, 7, 8);
let r = shuffle!(x, y, [4, 0, 5, 1]);
assert_eq!(r, i32x4::new(5, 1, 6, 2));

where the indices of the second vector's elements start at the vec::lanes() offset.

Conversions and bitcasts

Conversions / bitcasts between vector types

There are three different ways to convert between vector types.

  • From/Into: value-preserving widening-conversion between vectors with the same number of lanes. That is, f32x4 can be converted into f64x4 using From/Into, but the opposite is not true because that conversion is not value preserving. The From/Into implementations mirror that of the primitive integer and floating-point types. These conversions can widen the size of the element type, and thus the size of the SIMD vector type. Signed vector types are sign-extended lane-wise, while unsigned vector types are zero-extended lane-wise. The result of these conversions is endian-independent.

  • as: non-value preserving truncating-conversions between vectors with the same number of lanes. That is, f64x4 as f32x4 performs a lane-wise as cast, truncating the values if they would overflow the destination type. The result of these conversions is endian-independent.

  • unsafe mem::transmute: bit-casts between vectors with the same size, that is, the vectors do not need to have the same number of lanes. For example, transmuting a u8x16 into a u16x8. Note that while all bit-patterns of the {i,u,f} vector types represent a valid vector value, there are many vector mask bit-patterns that do not represent a valid mask. Note also that the result of unsafe mem::transmute is endian-dependent (see examples below).

It is extremely common to perform "transmute" operations between equally-sized portable vector types when writing SIMD algorithms. Rust currently does not have any facilities to express that all bit-patterns of one type are also valid bit-patterns of another type, and to perform these safe transmutes in an endian-independent way.

This forces users to resort to unsafe { mem::transmute(x) } and, very likely, to write non-portable code.

There is a very interesting discussion about this in this internal thread about potential ways to attack this problem, and there is also an open issue in stdsimd about endian-dependent behavior - if you care deeply about it please chime in.

These issues are not specific to portable packed SIMD vector types and fixing them is not the purpose of this RFC, but these issues are critical for writing efficient and portable SIMD code reliably and ergonomically.

Other conversions

The layout of the portable packed vector types is compatible with the layout of fixed-size arrays of the same element type and the same number of lanes (e.g. f32x4 is layout compatible with [f32; 4]).

For all signed, unsigned, and floating-point vector types with element type E and number of lanes N, the following implementations exist:

impl From<[E; N]> for ExN;
impl From<ExN> for [E; N];
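A minimal sketch of these array conversions, assuming the f32x4 type above:

// A minimal sketch of the array conversions, assuming `f32x4`.
let arr: [f32; 4] = [0., 1., 2., 3.];
let v = f32x4::from(arr);
assert_eq!(v.extract(2), 2.);

let back: [f32; 4] = v.into();
assert_eq!(back, arr);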

ABI and std::simd

The ABI is first and foremost unspecified and may change at any time.

All std::simd types are forbidden in extern functions (or warned against). Basically the same story as types like __m128i and extern functions.

As of today, they will be implemented as pass-via-pointer unconditionally. For example:

fn foo(a: u32x4) { /* ... */ }

foo(u32x4::splat(3));

This example will pass the variable a through memory. The function calling foo will place a on the stack and then foo will read a from the stack to work with it. Note that if foo changes the value of a, this will not be visible to the caller: the arguments are semantically passed by value but implemented as pass-via-pointer.

Currently, we aren't aware of any slowdowns or perf hits from this mechanism (pass through memory instead of by value). If something comes up, leaving the ABI unspecified allows us to try to address it.

Drawbacks

Generic vector type requirement for backends

The std::arch module provides architecture-specific vector types where backends only need to provide vector types for the architectures that they support.

This RFC requires backends to provide generic vector types. Most backends support this in one form or another, but if one future backend does not, this RFC can be implemented on top of the architecture specific types.

Achieving zero-overhead is outside Rust's control

A future architecture might have an instruction that performs multiple operations exposed by this API in one go, like (a + b).sum() on an f32x4 vector. If that expression does not produce optimal machine code, Rust has a performance bug.

This is not a performance bug that can be easily worked around in stdsimd or rustc, making this, almost certainly, a performance bug in the backend. These performance bugs can be arbitrarily hard to fix, and fixing these might not always be worth it.

That is, while these APIs should make it possible for reasonably-designed optimizing Rust backends to achieve zero-overhead, zero-overhead can only be provided in practice on a best-effort basis.

Performance of this API might vary dramatically

The performance of this API can vary dramatically depending on the architecture being targeted and the target features enabled.

First, this is a consequence of portability, and thus a feature. However, the fact that portability can introduce performance bugs is a real concern. In any case, if a user is able to write faster code by hand for some architecture, they should file a performance bug.

Rationale and alternatives

Dynamic values result in poor code generation for some operations

Some of the fundamental APIs proposed in this RFC, like vec::{new, extract, replace}, take run-time dynamic parameters. Consider the following example (see the whole example live at rust.godbolt.org):

/// Returns an f32x8 with 0., 1., ..., 7.
fn increasing() -> f32x8 {
   let mut x = f32x8::splat(0.);
   for i in 0..f32x8::lanes() {
       x = x.replace(i, i as f32); 
   }
   x 
}

In release mode, rustc generates the following assembly for this function:

.LCPI0_0:
  .long 0
  .long 1065353216
  .long 1073741824
  .long 1077936128
  .long 1082130432
  .long 1084227584
  .long 1086324736
  .long 1088421888
example::increasing:
  pushq %rbp
  movq %rsp, %rbp
  vmovaps .LCPI0_0(%rip), %ymm0
  vmovaps %ymm0, (%rdi)
  movq %rdi, %rax
  popq %rbp
  vzeroupper
  retq

which uses two vector moves to place the values into a SIMD register and return them - digression: these two moves are due to Rust's SIMD vector type ABI and happen only in "isolated" examples like this one.

If we change this function to accept run-time bounds for the loop

/// Returns an f32x4 initialized to zero, with the elements at the
/// indices in `[a, b)` replaced by an increasing sequence.
fn increasing(a: usize, b: usize) -> f32x4 {
   let mut x = f32x4::splat(0.);
   for i in a..b {
       x = x.replace(i, i as f32); 
   }
   x 
}

then the number of instructions generated explodes:

example::increasing_rt:
  pushq %rbp
  movq %rsp, %rbp
  andq $-32, %rsp
  subq $320, %rsp
  vxorps %xmm0, %xmm0, %xmm0
  cmpq %rsi, %rdx
  jbe .LBB1_34
  movl %edx, %r9d
  subl %esi, %r9d
  leaq -1(%rdx), %r8
  subq %rsi, %r8
  andq $7, %r9
  je .LBB1_2
  negq %r9
  vxorps %xmm0, %xmm0, %xmm0
  movq %rsi, %rcx
.LBB1_4:
  testq %rcx, %rcx
  js .LBB1_5
  vcvtsi2ssq %rcx, %xmm2, %xmm1
...200 lines more...

This code isn't necessarily horrible, but it is definitely harder to reason about its performance. This has two main causes:

  • ISAs do not support these operations: most (all?) ISAs support operations like extract, write, and replace with constant indices only. That is, these operations do not map to single instructions on most ISAs.

  • these operations are slow: even with constant indices, these operations are slow. Often, a different instruction must be generated for each constant index, and occasionally, for a particular constant index, the operation requires multiple instructions.

So we have a trade-off to make between providing a comfortable API for programs that really must extract a single value with a run-time index, and providing an API that provides "reliable" performance.

The proposed API accepts run-time indices (and values for new):

  • common SIMD code indexes with compile-time indices: this code gets optimized reasonably well with the LLVM backend, but the user needs to deal with the safe-but-checked and unsafe-but-unchecked APIs. If we were to only accept constant indices, the unchecked API would not be necessary, since the checked API would ensure that the indices are in-bounds at compile-time.

  • rare SIMD code indexes with run-time indices: this is code that one should really avoid writing. The current API makes writing this code extremely easy, resulting in SIMD code with potentially unexpected performance. Users also have to deal with two APIs for this, the checked/unchecked APIs, and also, the memory read/write APIs that are better suited for this use case.

Whether the current design is the right design should probably be clarified during the RFC. An important aspect to consider is that Rust support for constants is very basic: const fns are getting started, const generics are not there yet, etc. That is, making the API take constant indices might severely limit the type of code that can be used with these APIs in today's Rust.

Binary (vector,scalar) and (scalar,vector) operations

This RFC can be extended with binary vector-scalar and scalar vector operations by implementing the following traits for signed integer, unsigned integer, and floating-point vectors:

  • {Add,Sub,Mul,Div,Rem}<RHS={element_type},Output=Self>, {Add,Sub,Mul,Div,Rem}<RHS={vector_type},Output={vector_type}> for {element_type}, {Add,Sub,Mul,Div,Rem}Assign<RHS={element_type}>: binary scalar-vector vertical (lane-wise) arithmetic operations.

and the following trait for signed and unsigned integer vectors:

  • Bit{And,Or,Xor}<RHS={element_type},Output=Self>, Bit{And,Or,Xor}<RHS={vector_type},Output={vector_type}> for {element_type}, Bit{And,Or,Xor}Assign<RHS={element_type}> binary scalar-vector vertical (lane-wise) bitwise operations.

  • {Shl,Shr}<RHS=I>, {Shl,Shr}Assign<RHS=I>: for all integer types I in {i8, i16, i32, i64, i128, isize, u8, u16, u32, u64, u128, usize}. Note: whether only element_type or all integer types should be allowed is debatable: stdsimd currently allows using all integer types.

These traits slightly improve the ergonomics of scalar vector operations:

let x: f32x4;
let y: f32x4;
let a: f32;
let z = a * x + y;
// instead of: z = f32x4::splat(a) * x + y;
x += a;
// instead of: x += f32x4::splat(a);

but they do not enable anything new that cannot easily be done without them by just using vec::splat, and initial feedback on the RFC suggested that they are an abstraction that hides the cost of splatting the scalar into the vector.

These traits are implemented in stdsimd (and thus available in nightly Rust), are trivial to implement (op(vec_ty::splat(scalar), vec) and op(vec, vec_ty::splat(scalar))), and cannot be "seamlessly" provided by users due to coherence.

They are not part of this RFC, but they can be easily added (now or later) if there is consensus to do so. In the meantime, they can be experimented with on nightly Rust. If there is consensus to remove them, porting nightly code off these is also pretty easy.

Tiny vector types

Most platforms' SIMD registers have a constant width, and they can be used to operate on vectors with a smaller bit width. However, 16 and 32-bit wide vectors are "small" by most platforms' standards.

These types are useful for performing SIMD Within A Register (SWAR) operations on platforms without SIMD registers. While their performance has not been extensively investigated in stdsimd yet, any performance issues are performance bugs that should be fixed.

Portable shuffles API

The portable shuffles are exposed via the shuffle! macro. Generating the sequence of instructions required to perform a shuffle requires the shuffle indices to be known at compile time.

In the future, an alternative API based on const-generics and/or const-function-arguments could be added in a backwards compatible way:

impl {element_type}{element_width}x{number_of_lanes} {
    pub fn shuffle<const N: usize, R>(self, const indices: [isize; N]) 
        -> <R as ShuffleResult<element_type, [isize; N]>>::ShuffleResultType
      where R: ShuffleResult<element_type, [isize; N]>;
}

Offering this same API today is doable:

impl {element_type}{element_width}x{number_of_lanes} {
    #[rustc_const_argument(2)] // specifies that indices must be a const
    #[rustc_platform_intrinsic(simd_shuffle2)]
    // ^^^ specifies that this method should be treated as the 
    // "platform-intrinsic" "simd_shuffle1"
    pub fn shuffle2<I, R>(self, other: Self, indices: I)
        -> <R as ShuffleResult<element_type, I>>::ShuffleResultType
      where R: ShuffleResult<element_type, I>;
      
    #[rustc_const_argument(1)]
    #[rustc_platform_intrinsic(simd_shuffle1)]
    pub fn shuffle<I, R>(self, indices: I)
        -> <R as ShuffleResult<element_type, I>>::ShuffleResultType
      where R: ShuffleResult<element_type, I>;
}

If there is consensus for it the RFC can be easily amended.

Prior art

Most of this is implemented in stdsimd and can be used on nightly today via the std::simd module. The stdsimd crate is an effort started by @burntsushi to put the rust-lang-nursery/simd crate into a state suitable for stabilization. The rust-lang-nursery/simd crate was mainly developed by @huonw and IIRC it is heavily inspired by Dart's SIMD, which is where the f32x4 naming scheme comes from. This RFC has been heavily inspired by Dart, and two of the three examples used in the motivation come from the Using SIMD in Dart article written by John McCutchan. Some of the key ideas of this RFC come from LLVM's design, which was originally inspired by GCC's vector extensions, which was probably inspired by something else. Most parts of this RFC are also consistent with the 128-bit SIMD proposal for WebAssembly.

Or in other words: to the author's best knowledge, this RFC does not contain any really novel ideas. Instead, it only draws inspiration from previous designs that have withstood the test of time, and it adapts these designs to Rust.

Unresolved questions

Interaction with Cray vectors

The vector types proposed in this RFC are packed, that is, their size is fixed at compile-time.

Many modern architectures support vector operations of run-time size, often called Cray vectors or scalable vectors. These include, amongst others, the NEC SX, ARM SVE, and RISC-V's vector extension proposal. These architectures have traditionally relied on auto-vectorization combined with support for explicit vectorization annotations, but newer architectures like ARM SVE introduce explicit vectorization intrinsics.

This is an example adapted from this ARM SVE paper to pseudo-Rust:

/// Adds `c` to every element of the slice `src` storing the result in `dst`.
fn add_constant(dst: &mut [f64], src: &[f64], c: f64) {
    assert!(dst.len() == src.len());
    
    // Instantiate a dynamic vector (f64xN) with all lanes set to c:
    let vc: f64xN = f64xN::splat(c);
    
    // The number of lanes that each iteration of the loop can process
    // is unknown at compile-time (f64xN::lanes() is evaluated at run-time):
    for i in (0..src.len()).step_by_with(f64xN::lanes()) {
    
        // Instantiate a dynamic boolean vector with the
        // result of the predicate: `i + lane < src.len()`.
        // This boolean vector acts as a mask, so that elements 
        // "in-bounds" of the slice `src` are initialized to `true`,
        // while out-of-bounds elements contain `false`:
        let m: bxN = f64xN::while_lt(i, src.len());

        // Read the elements of the source using the mask:
        let vsrc: f64xN = f64xN::read_unaligned(m, &src[i..]);
        
        // Add the constant to the vector using the mask:
        let vdst: f64xN = vsrc.add(m, vc);
        
        // Write the result back to memory using the mask:
        vdst.write_unaligned(m, &mut dst[i..]);
    }
}

The RISC-V vector extension proposal introduces a model similar in spirit to ARM SVE. These extensions are, however, not official yet, and it is currently unknown whether GCC and LLVM will expose explicit intrinsics for them. It would not be surprising if they do, and it would not be surprising if similar Cray vector extensions are introduced in other architectures in the future.

The main differences between Cray vectors and portable vectors are that:

  • the number of lanes of Cray vectors is a run-time dynamic value
  • the Cray vector "objects" are like magical compiler token values
  • the induction loop variable must be incremented by the dynamic number of lanes of the vector type
  • most Cray vector operations require a mask indicating which elements of the vector the operation applies to

These differences will probably force the API of Cray vector types to be slightly different than that of packed vector types.

The current RFC, therefore, assumes no interaction with Cray vector types.

It does not prevent portable Cray vector types from being added to Rust in the future via an orthogonal API, nor does it prevent adding a way for both of them to interact (e.g. through memory). But at this point in time, whether these things are possible is an open research problem.

Half-float support

Many architectures (ARM, AArch64, PowerPC, MIPS, RISC-V) support half-floats (f16) vector types. It is unclear what to do with these at this point in time since Rust currently lacks language support for half-float.

AVX-512 and m1xN masks support

Currently, std::arch provides very limited AVX-512 support and the prototype implementation of the m1xN masks like m1x64 in stdsimd implements them as 512-bit wide vectors when they actually should only be 64-bit wide.

Finishing the implementation of these types requires work that just has not been done yet.

Fast math

The performance of the portable operations can in some cases be significantly improved by making assumptions about the kind of arithmetic that is allowed.

For example, some of the horizontal reductions benefit from assuming math to be finite (no NaNs) and others from assuming math to be associative (e.g. it allows tree-like reductions for sums).

A future RFC could add more reduction variants with different requirements and performance characteristics, for example, .wrapping_sum_unordered() or .max_element_nanless(), but these are not considered in this RFC because their interaction with fast-math is unclear.

A potentially better idea would be to allow users to specify the assumptions that an optimizing compiler can make about floating-point arithmetic in a finer grained way.

For example, we could design an #[fp_math] attribute usable at, for example, crate, module, function, and block scope, so that users can exactly specify which IEEE754 restrictions the compiler is allowed to lift where:

fn foo(x: f32x4, y: f32x4) -> f32 {
  let (w, z) = 
  #[fp_math(assume = "associativity")] {
      // All fp math is associative, reductions can be unordered:
      let w = x.sum();
      let z = y.sum();
      (w, z)
  };
  
  let m = f32x4::splat(w + z) * (x + y);
  
  #[fp_math(assume = "finite")] {
      // All fp math is assumed finite, reduction can assume NaNs 
      // aren't present:
      m.max_element()
  }
}

There are obviously many approaches to tackle this problem, but it does make sense to have a plan to tackle them before workarounds start getting bolted into RFCs like this one. There is an internal's post exploring the design space.

Endian-dependent behavior

The results of the indexed operations (extract, replace, write) and of the new method are endian-independent. That is, the following example is guaranteed to pass on little-endian (LE) and big-endian (BE) architectures:

let v = i32x4::new(0, 1, 2, 3);
assert_eq!(v.extract(0), 0); // OK in LE and BE
assert_eq!(v.extract(3), 3); // OK in LE and BE

The result of bit-casting two equally-sized vectors using mem::transmute is, however, endian dependent:

let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
let t: i16x8 = unsafe { mem::transmute(x) }; // UNSAFE
if cfg!(target_endian = "little") {
    let t_el = i16x8::new(256, 770, 1284, 1798, 2312, 2826, 3340, 3854);
    assert_eq!(t, t_el); // OK in LE | (would) ERROR in BE
} else if cfg!(target_endian = "big") {
    let t_eb = i16x8::new(1, 515, 1029, 1543, 2057, 2571, 3085, 3599);
    assert_eq!(t, t_eb); // OK in BE | (would) ERROR in LE
}

which applies to memory read and writes as well:

let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
let mut y: [i16; 8] = [0; 8];
x.write_unaligned( unsafe {
    slice::from_raw_parts_mut(&mut y as *mut _ as *mut i8, 16)
});

if cfg!(target_endian = "little") {
    let e: [i16; 8] = [256, 770, 1284, 1798, 2312, 2826, 3340, 3854];
    assert_eq!(y, e);
} else if cfg!(target_endian = "big") {
    let e: [i16; 8] = [1, 515, 1029, 1543, 2057, 2571, 3085, 3599];
    assert_eq!(y, e);
}

let z = i8x16::read_unaligned(unsafe {
    slice::from_raw_parts(&y as *const _ as *const i8, 16)
});
assert_eq!(z, x);