Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512 #95568

DeepakRajendrakumaran · 2023-12-04T08:24:21Z

Overview

This PR upgrades the Vector128/256/512 Sum() implementations. The existing Sum() implementations (Vector128 and Vector256) use hadd and are not the most efficient. This commit modifies the existing implementations and adds the Vector512 Sum() implementation.

Vector128

Case 1: byte/ubyte types

Without PR

Not lowered

With PR

       vpsrldq  xmm1, xmm0, 8
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vmovd    eax, xmm0
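
As a reading aid, the shift-and-add sequence above can be modelled in portable C++. This is a minimal sketch, not the JIT's implementation: `psrldq`, `paddb`, and `sum_bytes` are hypothetical helpers named after the instructions they emulate, and a 16-element byte array stands in for an xmm register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Stand-in for one 128-bit register holding sixteen 8-bit lanes (assumption).
using Bytes16 = std::array<std::uint8_t, 16>;

// Emulates `psrldq`: shift the whole 128-bit value right by n bytes, zero-filling.
static Bytes16 psrldq(const Bytes16& v, std::size_t n) {
    assert(n <= 16);
    Bytes16 r{};
    for (std::size_t i = 0; i + n < 16; ++i) {
        r[i] = v[i + n];
    }
    return r;
}

// Emulates `paddb`: lane-wise 8-bit addition with wraparound.
static Bytes16 paddb(const Bytes16& a, const Bytes16& b) {
    Bytes16 r{};
    for (std::size_t i = 0; i < 16; ++i) {
        r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
    return r;
}

// Four shift/add rounds (8, 4, 2, 1 bytes) leave the total in lane 0,
// which the listing then reads out with `vmovd`.
std::uint8_t sum_bytes(Bytes16 v) {
    for (std::size_t shift : {8, 4, 2, 1}) {
        v = paddb(psrldq(v, shift), v);
    }
    return v[0];
}
```

Each round halves the number of lanes that still matter, so 16 byte lanes need four rounds, 8 word lanes need three, and so on, which matches the shrinking listings in the cases that follow.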

Case 2: Int16 types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vpsrldq  xmm1, xmm0, 8
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddw   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 3: Int32 types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vpsrldq  xmm1, xmm0, 8
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 4: long types

Without PR
Not lowered

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vpsrldq  xmm1, xmm0, 8
       vpaddq   xmm0, xmm1, xmm0
       vmovd    rax, xmm0

Case 5: float types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vshufps  xmm1, xmm0, xmm0, -79
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, 3
       vaddps   xmm0, xmm1, xmm0

Case 6: double types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vshufpd  xmm1, xmm0, xmm0, 3
       vaddpd   xmm0, xmm1, xmm0

Vector256

Case 1: byte/ubyte types

Without PR
Not lowered

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 2: Int16 types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddw   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 3: Int32 types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 4: long types

Without PR
Not lowered

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddq   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddq   xmm0, xmm1, xmm0
       vmovd    rax, xmm0

Case 5: float types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextractf128 xmm1, ymm0, 1
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, -79
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, 3
       vaddps   xmm0, xmm1, xmm0

Case 6: double types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextractf128 xmm1, ymm0, 1
       vaddpd   xmm0, xmm1, xmm0
       vshufpd  xmm1, xmm0, xmm0, 3
       vaddpd   xmm0, xmm1, xmm0

Vector512

Case 1: byte/ubyte types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddb   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 2: Int16 types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddw   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddw   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 3: Int32 types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddd   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 4: long types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddq   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddq   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddq   xmm0, xmm1, xmm0
       vmovd    rax, xmm0

Case 5: float types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextractf64x4 ymm1, zmm0, 1
       vaddps   ymm0, ymm1, ymm0
       vextractf128 xmm1, ymm0, 1
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, -79
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, 3
       vaddps   xmm0, xmm1, xmm0

Case 6: double types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextractf64x4 ymm1, zmm0, 1
       vaddpd   ymm0, ymm1, ymm0
       vextractf128 xmm1, ymm0, 1
       vaddpd   xmm0, xmm1, xmm0
       vshufpd  xmm1, xmm0, xmm0, 3
       vaddpd   xmm0, xmm1, xmm0

Instructions and dependencies

Vector128

8 bits integer
vpsrldq - sse2
vpaddb - sse2

16 bits integer
vpsrldq - sse2
vpaddw - sse2

32 bits integer
vpsrldq - sse2
vpaddd - sse2

64 bits integer
vpsrldq - sse2
vpaddq - sse2

float
shufps - sse
addps - sse

double
shufpd - sse2
addpd - sse2

Vector256
The corresponding Vector128 instructions, plus the following:

All integer types
vextracti128 - AVX2

All float types
vextractf128 - AVX

Vector512
The corresponding Vector128 and Vector256 instructions, plus the following:

All integer types
vextracti64x4 - AVX512F

All float types
vextractf64x4 - AVX512F
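
Taken together, the three widths follow one pattern: fold the upper half onto the lower half until a single lane remains. The sketch below models that for 32-bit integer lanes in portable C++. It is an illustration under assumptions, not the JIT code: `sum_by_halving` is a hypothetical name and a `std::vector` stands in for the register.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Folds the upper half of the lanes onto the lower half until one lane is left.
// For 16 lanes (Vector512<int>) the rounds correspond to:
//   16 -> 8  vextracti64x4 + vpaddd ymm
//    8 -> 4  vextracti128  + vpaddd xmm
//    4 -> 2  vpsrldq 8     + vpaddd xmm
//    2 -> 1  vpsrldq 4     + vpaddd xmm
std::int32_t sum_by_halving(std::vector<std::int32_t> lanes) {
    // The lane count must be a non-zero power of two.
    assert(!lanes.empty() && (lanes.size() & (lanes.size() - 1)) == 0);
    while (lanes.size() > 1) {
        const std::size_t half = lanes.size() / 2;
        for (std::size_t i = 0; i < half; ++i) {
            // Add through unsigned so overflow wraps like `vpaddd` instead of being UB.
            const std::uint32_t wrapped = static_cast<std::uint32_t>(lanes[i]) +
                                          static_cast<std::uint32_t>(lanes[i + half]);
            lanes[i] = static_cast<std::int32_t>(wrapped);
        }
        lanes.resize(half);
    }
    return lanes[0];
}
```

Starting with 4 or 8 lanes instead of 16 skips the first one or two rounds, which is why the Vector128 and Vector256 listings are suffixes of the Vector512 ones.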

Performance numbers

On ICX

On SPR

ghost · 2023-12-04T08:24:31Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author:	DeepakRajendrakumaran
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `community-contribution`
Milestone:	-

EgorBo · 2023-12-04T14:53:06Z

Does the new implementation change the resulting value compared to the previous one, due to IEEE 754 rules?
PS: I am curious, is there an algorithm where a horizontal sum could be on the hot path?

DeepakRajendrakumaran · 2023-12-04T17:57:31Z

Does the new implementation change the resulting value compared to the previous one, due to IEEE 754 rules? PS: I am curious, is there an algorithm where a horizontal sum could be on the hot path?

The resulting values should remain the same. I see some test failures. Will verify/fix those before making PR 'ReadyForReview'

Re hadd: my original implementation used hadd (#87851). This comment got me to reconsider the implementation. The main reference points were the available native implementations for 512 (https://godbolt.org/z/hqr8hbYKc) and the discussion here. My understanding based on these is that I need to keep the 'order of adds' consistent for the 128-bit-wide floating-point versions.

Feel free to let me know if you see any issues here.
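
To make the 'order of adds' point concrete, here is a small portable C++ model of the two `vshufps`/`vaddps` rounds from the Vector128 float listing. It is a sketch under assumptions: `shufps`, `addps`, `addf`, and the `sum_*` functions are hypothetical helpers, and the immediates are the ones shown in the listing (-79 is 0xB1 as an unsigned byte). It shows that the shuffle sequence adds in the pairwise order (x0 + x1) + (x2 + x3), and that a strictly left-to-right sum can give a different float result.

```cpp
#include <array>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Forces every intermediate to be rounded to float, as `addps` would.
static float addf(float a, float b) {
    volatile float r = a + b;
    return r;
}

// Emulates `shufps dst, a, b, imm8`: two lanes selected from a, two from b.
static Vec4 shufps(const Vec4& a, const Vec4& b, std::uint8_t imm) {
    return {a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]};
}

// Emulates `addps`: lane-wise float addition.
static Vec4 addps(const Vec4& a, const Vec4& b) {
    return {addf(a[0], b[0]), addf(a[1], b[1]), addf(a[2], b[2]), addf(a[3], b[3])};
}

// The two shuffle/add rounds from the listing; the result is read from lane 0.
float sum_shuffle(const Vec4& v) {
    Vec4 t = addps(shufps(v, v, 0xB1), v);  // swap neighbours: lanes hold x0+x1 and x2+x3
    t = addps(shufps(t, t, 0x03), t);       // bring lane 3 down to lane 0 and add
    return t[0];
}

// Pairwise order, which is also what two rounds of hadd produce.
float sum_pairwise(const Vec4& v) {
    return addf(addf(v[0], v[1]), addf(v[2], v[3]));
}

// Strict left-to-right order, shown only for contrast.
float sum_sequential(const Vec4& v) {
    return addf(addf(addf(v[0], v[1]), v[2]), v[3]);
}
```

For an input such as {1e8f, 1.0f, -1e8f, 1.0f}, the shuffle and pairwise versions agree with each other, while the sequential version rounds differently, which is why the association of the adds matters for floating-point results.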

DeepakRajendrakumaran · 2023-12-05T10:32:13Z

@EgorBo The test failures confuse me. They occur with AVX512F disabled. I looked at the disassembly, and it is the same with my change as on main. I'll look into this, but I'm pointing it out in case I'm missing anything obvious.

   Starting:    System.Runtime.Intrinsics.Tests (parallel test collections = on, max threads = 36)
  ; Assembly listing for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
  ; Emitting BLENDED_CODE for X64 with AVX - Windows
  ; MinOpts code
  ; debuggable code
  ; rbp based frame
  ; fully interruptible
  ; No PGO data
  ; Final local variable assignments
  ;
  ;  V00 this         [V00    ] (  1,  1   )     ref  ->  [rbp+0x10]  do-not-enreg[] this class-hnd <System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests>
  ;  V01 loc0         [V01    ] (  1,  1   )  struct (64) [rbp-0x40]  do-not-enreg[S] must-init <System.Runtime.Intrinsics.Vector512`1[byte]>
  ;  V02 OutArgs      [V02    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
  ;  V03 tmp1         [V03    ] (  1,  1   )  struct (64) [rbp-0x80]  do-not-enreg[HS] hidden-struct-arg "impSpillStackEnsure" <System.Runtime.Intrinsics.Vector512`1[byte]>
  ;  V04 tmp2         [V04    ] (  1,  1   )     int  ->  [rbp-0x84]  do-not-enreg[] "impSpillStackEnsure"
  ;  V05 tmp3         [V05    ] (  1,  1   )     int  ->  [rbp-0x88]  do-not-enreg[] "impSpillStackEnsure"
  ;  V06 tmp4         [V06    ] (  1,  1   )  struct (64) [rbp-0xC8]  do-not-enreg[XS] addr-exposed "by-value struct argument" <System.Runtime.Intrinsics.Vector512`1[byte]>
  ;
  ; Lcl frame size = 240
  
  G_M2536_IG01:  ;; offset=0x0000
         push     rbp
         sub      rsp, 240
         vzeroupper 
         lea      rbp, [rsp+0xF0]
         vxorps   xmm4, xmm4, xmm4
         vmovdqu  ymmword ptr [rbp-0x40], ymm4
         vmovdqu  ymmword ptr [rbp-0x20], ymm4
         mov      gword ptr [rbp+0x10], rcx
  						;; size=37 bbWeight=1 PerfScore 8.08
  G_M2536_IG02:  ;; offset=0x0025
         cmp      dword ptr [(reloc 0x7ff8b3e07b30)], 0
         je       SHORT G_M2536_IG04
  						;; size=9 bbWeight=1 PerfScore 4.00
  G_M2536_IG03:  ;; offset=0x002E
         call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
  						;; size=5 bbWeight=0.50 PerfScore 0.50
  G_M2536_IG04:  ;; offset=0x0033
         nop      
         lea      rcx, [rbp-0x80]
         mov      edx, 1
         call     [System.Runtime.Intrinsics.Vector512:Create(byte):System.Runtime.Intrinsics.Vector512`1[byte]]
         vmovdqu  ymm0, ymmword ptr [rbp-0x80]
         vmovdqu  ymmword ptr [rbp-0x40], ymm0
         vmovdqu  ymm0, ymmword ptr [rbp-0x60]
         vmovdqu  ymmword ptr [rbp-0x20], ymm0
         mov      dword ptr [rbp-0x84], 64
         vmovdqu  ymm0, ymmword ptr [rbp-0x40]
         vmovdqu  ymmword ptr [rbp-0xC8], ymm0
         vmovdqu  ymm0, ymmword ptr [rbp-0x20]
         vmovdqu  ymmword ptr [rbp-0xA8], ymm0
         lea      rcx, [rbp-0xC8]
         call     [System.Runtime.Intrinsics.Vector512:Sum[byte](System.Runtime.Intrinsics.Vector512`1[byte]):byte]
         mov      dword ptr [rbp-0x88], eax
         mov      ecx, dword ptr [rbp-0x84]
         mov      edx, dword ptr [rbp-0x88]
         call     [Xunit.Assert:Equal[byte](byte,byte)]
         nop      
         nop      
  						;; size=111 bbWeight=1 PerfScore 35.00
  G_M2536_IG05:  ;; offset=0x00A2
         add      rsp, 240
         pop      rbp
         ret      
  						;; size=9 bbWeight=1 PerfScore 1.75
  
  ; Total bytes of code 171, prolog size 37, PerfScore 66.43, instruction count 35, allocated bytes for code 171 (MethodHash=03abf617) for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
  ; ============================================================
  
      System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512SByteSumTest [FAIL]
        Assert.Equal() Failure: Values differ
        Expected: 64
        Actual:   100
        Stack Trace:
          C:\Users\deepakra\Dotnet\runtime\src\libraries\System.Runtime.Intrinsics\tests\Vectors\Vector512Tests.cs(4891,0): at System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512SByteSumTest()
             at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
          C:\Users\deepakra\Dotnet\runtime\src\coreclr\System.Private.CoreLib\src\System\Reflection\MethodBaseInvoker.CoreCLR.cs(36,0): at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args)
          C:\Users\deepakra\Dotnet\runtime\src\libraries\System.Private.CoreLib\src\System\Reflection\MethodBaseInvoker.cs(57,0): at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

With AVX512F Enabled - This passes

; Assembly listing for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
 ; Emitting BLENDED_CODE for X64 with AVX512 - Windows
 ; MinOpts code
 ; debuggable code
 ; rbp based frame
 ; fully interruptible
 ; No PGO data
 ; Final local variable assignments
 ;
 ;  V00 this         [V00    ] (  1,  1   )     ref  ->  [rbp+0x10]  do-not-enreg[] this class-hnd <System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests>
 ;  V01 loc0         [V01    ] (  1,  1   )  simd64  ->  [rbp-0x70]  do-not-enreg[S] must-init <System.Runtime.Intrinsics.Vector512`1[byte]>
 ;  V02 OutArgs      [V02    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V03 tmp1         [V03    ] (  1,  1   )  simd64  ->  [rbp-0xB0]  do-not-enreg[S] "impSpillStackEnsure"
 ;  V04 tmp2         [V04    ] (  1,  1   )  simd32  ->  [rbp-0xD0]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V05 tmp3         [V05    ] (  1,  1   )  simd16  ->  [rbp-0xE0]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V06 tmp4         [V06    ] (  1,  1   )  simd16  ->  [rbp-0xF0]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V07 tmp5         [V07    ] (  1,  1   )  simd16  ->  [rbp-0x100]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V08 tmp6         [V08    ] (  1,  1   )  simd16  ->  [rbp-0x110]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V09 tmp7         [V09    ] (  1,  1   )     int  ->  [rbp-0x114]  do-not-enreg[] "impSpillStackEnsure"
 ;  V10 tmp8         [V10    ] (  1,  1   )     int  ->  [rbp-0x118]  do-not-enreg[] "impSpillStackEnsure"
 ;
 ; Lcl frame size = 320

 G_M2536_IG01:  ;; offset=0x0000
        push     rbp
        sub      rsp, 320
        vzeroupper
        lea      rbp, [rsp+0x140]
        vxorps   xmm4, xmm4, xmm4
        vmovdqu  ymmword ptr [rbp-0x70], ymm4
        vmovdqu  ymmword ptr [rbp-0x50], ymm4
        mov      gword ptr [rbp+0x10], rcx
                                               ;; size=37 bbWeight=1 PerfScore 8.08
 G_M2536_IG02:  ;; offset=0x0025
        cmp      dword ptr [(reloc 0x7ff8b75b3b80)], 0
        je       SHORT G_M2536_IG04
                                               ;; size=9 bbWeight=1 PerfScore 4.00
 G_M2536_IG03:  ;; offset=0x002E
        call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
                                               ;; size=5 bbWeight=0.50 PerfScore 0.50
 G_M2536_IG04:  ;; offset=0x0033
        nop
        vmovups  zmm0, zmmword ptr [reloc @RWD00]
        vmovups  zmmword ptr [rbp-0xB0], zmm0
        vmovups  zmm0, zmmword ptr [rbp-0xB0]
        vmovups  zmmword ptr [rbp-0x70], zmm0
        mov      dword ptr [rbp-0x114], 64
        vmovups  zmm0, zmmword ptr [rbp-0x70]
        vextracti64x4 ymm0, zmm0, 1
        vmovdqu  ymm1, ymmword ptr [rbp-0x70]
        vpaddb   ymm0, ymm0, ymm1
        vmovups  ymmword ptr [rbp-0xD0], ymm0
        vmovups  ymm0, ymmword ptr [rbp-0xD0]
        vextracti128 xmm0, ymm0, 1
        vmovdqu  xmm1, xmmword ptr [rbp-0xD0]
        vpaddb   xmm0, xmm0, xmm1
        vmovaps  xmmword ptr [rbp-0xE0], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0xE0], 8
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0xE0]
        vmovaps  xmmword ptr [rbp-0xF0], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0xF0], 4
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0xF0]
        vmovaps  xmmword ptr [rbp-0x100], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0x100], 2
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0x100]
        vmovaps  xmmword ptr [rbp-0x110], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0x110], 1
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0x110]
        vmovd    ecx, xmm0
        movsx    rcx, cl
        mov      dword ptr [rbp-0x118], ecx
        mov      ecx, dword ptr [rbp-0x114]
        mov      edx, dword ptr [rbp-0x118]
        call     [Xunit.Assert:Equal[byte](byte,byte)]
        nop
        nop
                                               ;; size=241 bbWeight=1 PerfScore 57.67
 G_M2536_IG05:  ;; offset=0x0124
        vzeroupper
        add      rsp, 320
        pop      rbp
        ret
                                               ;; size=12 bbWeight=1 PerfScore 2.75
 RWD00         dq      0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h


 ; Total bytes of code 304, prolog size 37, PerfScore 104.60, instruction count 50, allocated bytes for code 316 (MethodHash=03abf617) for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
 ; ============================================================

   Finished:    System.Runtime.Intrinsics.Tests
 === TEST EXECUTION SUMMARY ===
    System.Runtime.Intrinsics.Tests  Total: 1, Errors: 0, Failed: 0, Skipped: 0, Time: 4.176s

tannergooding · 2024-01-02T20:08:37Z

src/coreclr/jit/hwintrinsicxarch.cpp

                break;
            }
-            else if (varTypeIsByte(simdBaseType) || varTypeIsLong(simdBaseType))
+#if defined(TARGET_X86)
+            else if (varTypeIsLong(simdBaseType))


What support is missing for long on 32-bit?

I would have expected this to generally "just work" and for us to be able to use _ToScalar provided SSE4.1 is supported given that GetElement has the required decomposition support for that case.

DeepakRajendrakumaran · Jan 3, 2024

I had a failure with decomposition. I tried again and it passes locally. I enabled that code to get it to run on CI.

DeepakRajendrakumaran · Jan 4, 2024

I need to debug this to pinpoint exactly what's going wrong here.

DeepakRajendrakumaran · Jan 4, 2024

Not sure if this is the right way to handle this - 9e3657f

tannergooding · Jan 5, 2024

That won't quite work unfortunately, and the actual handling is going to be a little bit more complex.

Noting that I'm fine with this being handled as a separate PR (and am happy to do that work, seeing as I'm pretty sure I know what will need to be done here). I had initially thought it might be slightly simpler, but it looks like it's not quite all there.

The "proper" fix likely entails:

* Extracting this logic to a gtNewSimdToScalarNode helper: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L2917-L2937

* Updating various places that are manually doing gtNewSimdHWIntrinsicNode(retType, op1, NI_Vector###_ToScalar, simdBaseJitType, simdSize) to call the introduced helper

* Having the impSpecialIntrinsic handling for NI_Vector###_Sum do:

#if defined(TARGET_X86)
            else if (varTypeIsLong(simdBaseType) && !compOpportunisticallyDependsOn(InstructionSet_SSE41))
            {
                // We need SSE41 to handle long, use software fallback
                break;
            }
#endif // TARGET_X86

At some future point we can ensure that NI_Vector###_GetElement has decomp handling for pre-SSE4.1 as well, it's just slightly more complex given that TYP_INT currently requires SSE4.1 as well. The fix to get that working is likely to just reuse the TYP_FLOAT handling for getting the relevant 32-bit part and then using ToScalar (shifting can then be used for the relevant 8-bit or 16-bit part for byte/short).
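
For readers unfamiliar with the term, "decomposition" on 32-bit x86 means that a 64-bit value is handled as two 32-bit halves. The sketch below only illustrates that idea in portable C++ and is not JIT code: `to_scalar_u64_decomposed` is a hypothetical name, the array stands in for the four 32-bit lanes of an xmm register, and the instruction names in the comments are an assumption about how each half could be read (`movd` is SSE2, `pextrd` is SSE4.1).

```cpp
#include <array>
#include <cstdint>

// Four 32-bit lanes standing in for one 128-bit register (assumption).
using Lanes32 = std::array<std::uint32_t, 4>;

// Reads the 64-bit element 0 as two 32-bit pieces and recombines them.
std::uint64_t to_scalar_u64_decomposed(const Lanes32& v) {
    const std::uint32_t lo = v[0];  // low half, e.g. via movd
    const std::uint32_t hi = v[1];  // high half, e.g. via pextrd with index 1
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Reading the high half is the step that needs an element extract at index 1, which is consistent with the SSE4.1 check in the snippet above.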

DeepakRajendrakumaran · Jan 6, 2024

I've updated the implementation to use the helper. My local tests passed; waiting for the CI tests to pass.

tannergooding · 2024-01-10T20:54:40Z

src/coreclr/jit/gentree.cpp

+//
+GenTree* Compiler::gtNewSimdToScalarNode(var_types type, GenTree* op1, CorInfoType simdBaseJitType, unsigned simdSize)
+{
+


stray empty line:

Suggested change

tannergooding · 2024-01-10T20:57:42Z

CC. @dotnet/jit-contrib, this needs a secondary review but should be good to merge otherwise

…wering for Vector512 (dotnet#95568)
* Updating Sum() implementation.
* Fixing codegen
* Addressing review comments.
* Fix Formatting
* Enabling for long on x86.
* Cleaning up ToScalar implementation

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Dec 4, 2023

ghost added the community-contribution label Dec 4, 2023

DeepakRajendrakumaran force-pushed the Sum_dot branch 2 times, most recently from 26e8e12 to 2408ec9 Compare December 4, 2023 09:59

EgorBo reviewed Dec 5, 2023

src/coreclr/jit/hwintrinsicxarch.cpp

DeepakRajendrakumaran force-pushed the Sum_dot branch 2 times, most recently from 216f087 to 044aebe Compare December 13, 2023 17:28

DeepakRajendrakumaran changed the title ~~Updating Sum() implementation.~~ Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512 Dec 14, 2023

DeepakRajendrakumaran marked this pull request as ready for review December 15, 2023 21:07

tannergooding reviewed Jan 2, 2024

src/coreclr/jit/gentree.cpp (outdated)

tannergooding reviewed Jan 2, 2024

src/coreclr/jit/gentree.cpp (outdated)

tannergooding reviewed Jan 2, 2024

src/coreclr/jit/gentree.cpp

tannergooding reviewed Jan 2, 2024

DeepakRajendrakumaran added 3 commits January 2, 2024 15:44

Updating Sum() implementation.

a072eea

Fixing codegen

7c7a3b5

Addressing review comments.

b814254

DeepakRajendrakumaran force-pushed the Sum_dot branch 2 times, most recently from df322c6 to 6e77487 Compare January 4, 2024 17:58

Fix Formatting

28f571a

DeepakRajendrakumaran force-pushed the Sum_dot branch from 6e77487 to 9e3657f Compare January 4, 2024 19:36

Enabling for long on x86.

9e3657f

build-analysis bot mentioned this pull request Jan 4, 2024

NRE in iOS.Device.Aot.Test when starting up xharness #96403

Closed

DeepakRajendrakumaran force-pushed the Sum_dot branch from d7a3c62 to 6b3eec7 Compare January 6, 2024 00:40

build-analysis bot mentioned this pull request Jan 6, 2024

Checkout failure: "Git fetch failed with exit code 128" dotnet/arcade#9009

Open

2 tasks

DeepakRajendrakumaran force-pushed the Sum_dot branch from 6b3eec7 to 76e307e Compare January 9, 2024 18:09

Cleaning up ToScalar implementation

76e307e

tannergooding reviewed Jan 10, 2024

tannergooding approved these changes Jan 10, 2024

tannergooding mentioned this pull request Jan 13, 2024

Tiered miscompilation of Vector2.Dot(x, x) without SSE4.1 #96939

Closed

BruceForstall approved these changes Jan 17, 2024

BruceForstall merged commit 473a983 into dotnet:main Jan 17, 2024
139 checks passed

BruceForstall mentioned this pull request Jan 18, 2024

Intel architecture improvements for .NET 9 #93196

Closed

33 tasks

github-actions bot locked and limited conversation to collaborators Feb 17, 2024
