Optimize AtomicU128::{load, store} #10

Closed
ibraheemdev opened this issue Apr 26, 2022 · 5 comments · Fixed by #16
Labels
C-enhancement Category: A new feature or an improvement for an existing one O-aarch64 Target: Armv8-A, Armv8-R, or later processors in AArch64 mode O-x86 Target: x86/x64 processors

Comments

@ibraheemdev

ibraheemdev commented Apr 26, 2022

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:

  • MOVAPD, MOVAPS, and MOVDQA.
  • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
  • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

AtomicU128::{load, store} can take advantage of this instead of using the more expensive cmpxchg instruction. See this GCC issue/patch for details: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688.
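A rough sketch of what an AVX-based load could look like with stable inline asm (illustrative only: `load_128` and the `Align16` wrapper are made-up names, and the fallback branch is a plain, non-atomic read standing in for the real `cmpxchg16b` path):

```rust
// Illustrative sketch only, not this crate's implementation: a 128-bit load
// via vmovdqa when AVX is detected at run time.
#[repr(C, align(16))] // vmovdqa requires a 16-byte-aligned memory operand
struct Align16(u128);

fn load_128(src: &Align16) -> u128 {
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx") {
        unsafe {
            let result: core::arch::x86_64::__m128i;
            core::arch::asm!(
                "vmovdqa {out}, xmmword ptr [{ptr}]",
                ptr = in(reg) &src.0 as *const u128,
                out = out(xmm_reg) result,
                options(nostack, readonly, preserves_flags),
            );
            return core::mem::transmute(result);
        }
    }
    // Fallback for illustration: NOT atomic; a real implementation would
    // use lock cmpxchg16b here.
    src.0
}

fn main() {
    let v = Align16(0x0123_4567_89ab_cdef_fedc_ba98_7654_3210);
    assert_eq!(load_128(&v), 0x0123_4567_89ab_cdef_fedc_ba98_7654_3210);
    println!("ok");
}
```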

@taiki-e taiki-e added C-enhancement Category: A new feature or an improvement for an existing one O-x86 Target: x86/x64 processors labels Apr 26, 2022
@ibraheemdev ibraheemdev changed the title AtomicU128 can use SSE for loads on supported platforms Optimize AtomicU128::{load, store} Apr 30, 2022
@ibraheemdev

ibraheemdev commented Apr 30, 2022

As of Armv8.4, the LDP/STP instructions are guaranteed to be single-copy atomic for 16-byte accesses:

Changes to single-copy atomicity in Armv8.4: In addition to the single-copy atomicity requirements listed above:

Instructions that are introduced in FEAT_LRCPC are single-copy atomic when all of the following conditions are true:

  • All bytes being accessed are within the same 16-byte quantity aligned to 16 bytes.
  • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, all loads and stores are single-copy atomic when all of the following conditions are true:

  • Accesses are unaligned to their data size but all bytes being accessed are within a 16-byte quantity that is aligned to 16 bytes.
  • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy atomic when all of the following conditions are true:

  • The overall memory access is aligned to 16 bytes.
  • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

See also the relevant LLVM patch.
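As a rough illustration (not the crate's actual implementation; `load_128` and `Align16` are made-up names), a 16-byte-aligned LDP load might look like this in stable inline asm. The result is only guaranteed single-copy atomic when FEAT_LSE2 is implemented, and the non-AArch64 branch below is a plain, non-atomic read for illustration:

```rust
// Illustrative sketch: a 128-bit load via LDP on AArch64. This is only
// single-copy atomic when FEAT_LSE2 is implemented and the address is
// 16-byte aligned.
#[repr(C, align(16))] // the overall memory access must be aligned to 16 bytes
struct Align16(u128);

#[cfg(target_arch = "aarch64")]
fn load_128(src: &Align16) -> u128 {
    let (lo, hi): (u64, u64);
    unsafe {
        core::arch::asm!(
            "ldp {lo}, {hi}, [{ptr}]",
            ptr = in(reg) &src.0 as *const u128,
            lo = out(reg) lo,
            hi = out(reg) hi,
            options(nostack, readonly, preserves_flags),
        );
    }
    // Little-endian: the low 64 bits sit at the lower address.
    ((hi as u128) << 64) | lo as u128
}

#[cfg(not(target_arch = "aarch64"))]
fn load_128(src: &Align16) -> u128 {
    src.0 // plain read; NOT atomic, for illustration only
}

fn main() {
    let v = Align16((1u128 << 100) | 99);
    assert_eq!(load_128(&v), (1u128 << 100) | 99);
    println!("ok");
}
```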

@taiki-e taiki-e added the O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Apr 30, 2022
@taiki-e

taiki-e commented Jun 5, 2022

Btw, I recently learned that powerpc64 (pwr8+) supports 128-bit atomics (LLVM patch; although QEMU doesn't seem to support some of them) and added support for it to another library. I plan to do the same with this library.

@taiki-e

taiki-e commented Jun 18, 2022

UPDATE: This table is outdated. See the atomic128 module's readme for the latest version.

Once #16 is merged, the list of targets that support 128-bit atomics, and the instructions used, is as follows.

| target_arch | load | store | CAS | note |
|---|---|---|---|---|
| x86_64 | cmpxchg16b or vmovdqa | cmpxchg16b or vmovdqa | cmpxchg16b | cmpxchg16b target feature required. vmovdqa requires Intel or AMD CPU with AVX.<br>Both compile-time and run-time detection are supported for cmpxchg16b. vmovdqa is currently run-time detection only.<br>Requires rustc 1.59+ when cmpxchg16b target feature is enabled at compile-time, otherwise requires nightly |
| aarch64 | ldxp/stxp or ldp | ldxp/stxp or stp | ldxp/stxp or casp | casp requires lse target feature, ldp/stp requires lse2 target feature.<br>Both compile-time and run-time detection are supported for lse. lse2 is currently compile-time detection only.<br>Requires rustc 1.59+ |
| powerpc64 | lq | stq | lqarx/stqcx. | Little endian or target CPU pwr8+.<br>Requires nightly |
| s390x | lpq | stpq | cdsg | Requires nightly |

Note: Run-time detection requires the `outline-atomics` optional feature of this crate. EDIT: since 0.3.19, run-time detection is enabled by default.
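The compile-time vs. run-time detection split in the table can be sketched like this (a simplified illustration, not this crate's internals; `has_cmpxchg16b` is a made-up name):

```rust
// Simplified sketch of the detection pattern described in the table:
// answer at compile time when the target feature is statically enabled,
// otherwise fall back to run-time CPUID detection on x86_64.
fn has_cmpxchg16b() -> bool {
    #[cfg(all(target_arch = "x86_64", target_feature = "cmpxchg16b"))]
    {
        true // resolved at compile time, no run-time check needed
    }
    #[cfg(all(target_arch = "x86_64", not(target_feature = "cmpxchg16b")))]
    {
        // Run-time detection (the path gated behind outline-atomics).
        std::is_x86_feature_detected!("cmpxchg16b")
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        false
    }
}

fn main() {
    println!("cmpxchg16b available: {}", has_cmpxchg16b());
}
```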

bors bot added a commit that referenced this issue Jun 19, 2022
16: Use SSE for 128-bit atomic load/store on Intel CPU with AVX r=taiki-e a=taiki-e

x86_64 part of #10

The following are the results of a simple microbenchmark:

```
bench_portable_atomic_arch/u128_load
                        time:   [1.4598 ns 1.4671 ns 1.4753 ns]
                        change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_store
                        time:   [1.3852 ns 1.3937 ns 1.4024 ns]
                        change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load
                        time:   [56.422 us 56.767 us 57.204 us]
                        change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [136.53 us 139.96 us 145.39 us]
                        change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [146.03 us 147.67 us 149.98 us]
                        change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [765.11 us 766.69 us 768.29 us]
                        change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
```

Closes #10

Co-authored-by: Taiki Endo <te316e89@gmail.com>
@bors bors bot closed this as completed in 10b561a Jun 19, 2022
@taiki-e

taiki-e commented Jul 27, 2022

Other optimizations:

(Apart from these, there have also been some minor optimizations regarding inline assembly since #16 was merged.)

@taiki-e

taiki-e commented Dec 11, 2022

AMD is also going to guarantee atomicity of 128-bit SSE: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

> We would update the AMD APM manuals in the next revision.
>
> For all AMD architectures,
>
> > Processors that support AVX extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.
>
> which means all 128b instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.

UPDATE: filed #49

bors bot added a commit that referenced this issue Dec 14, 2022
49: Use SSE for 128-bit atomic load/store on AMD CPU with AVX r=taiki-e a=taiki-e

As mentioned in #10 (comment), AMD is also going to guarantee this.

Refs: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

Co-authored-by: Taiki Endo <te316e89@gmail.com>
bors bot added a commit that referenced this issue Dec 25, 2022
57: Enable outline-atomics by default and provide cfg to disable it r=taiki-e a=taiki-e

This enables `outline-atomics` feature by default and provides `portable_atomic_no_outline_atomics` cfg to disable it.

(outline-atomics enables several optimizations on x86_64 and aarch64. See [this list](#10 (comment)) for details.)

It has previously been pointed out that, due to the nature of cargo features, controlling this via a cargo feature does not work well. As of this release, the `outline-atomics` feature is a no-op, and outline-atomics is enabled by default.

Note: outline-atomics in portable-atomic is currently for 128-bit atomics. outline-atomics for atomics of other sizes is controlled by LLVM's `outline-atomics` target feature.

Closes #25

Co-authored-by: Taiki Endo <te316e89@gmail.com>
@taiki-e taiki-e added O-aarch64 Target: Armv8-A, Armv8-R, or later processors in AArch64 mode and removed O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state labels Sep 2, 2024