Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512 #95568

Merged
merged 6 commits into from
Jan 17, 2024

@DeepakRajendrakumaran (Contributor) commented Dec 4, 2023

Overview

This PR upgrades the Vector128/256/512 Sum() implementations. The existing Sum() implementations (Vector128 and Vector256) use hadd and are not the most efficient. This commit modifies the existing implementations and adds the Vector512 Sum() implementation.

Vector128

Case 1: byte/ubyte types

Without PR

Not lowered

With PR

       vpsrldq  xmm1, xmm0, 8
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vmovd    eax, xmm0
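
The shift-and-add pattern in the listing above can be modeled in portable C++. This is only an illustrative sketch, not the JIT's code: the names `Bytes16`, `shiftRightBytes`, `addBytes`, and `sumBytes16` are hypothetical, and plain arrays stand in for the xmm registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit register holding 16 byte lanes.
using Bytes16 = std::array<std::uint8_t, 16>;

// Models vpsrldq: shift the whole 128-bit value right by `count` bytes,
// filling the vacated high bytes with zero.
Bytes16 shiftRightBytes(const Bytes16& v, std::size_t count) {
    assert(count <= 16);
    Bytes16 out{};
    for (std::size_t i = 0; i + count < 16; ++i) {
        out[i] = v[i + count];
    }
    return out;
}

// Models vpaddb: lane-wise 8-bit addition with wraparound.
Bytes16 addBytes(const Bytes16& a, const Bytes16& b) {
    Bytes16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
    return out;
}

// Four shift-and-add steps (shifts of 8, 4, 2, 1 bytes) leave the total in lane 0.
std::uint8_t sumBytes16(Bytes16 v) {
    for (std::size_t shift = 8; shift >= 1; shift /= 2) {
        v = addBytes(shiftRightBytes(v, shift), v);
    }
    return v[0];  // models reading the low lane, as vmovd does
}
```

Each step halves the number of lanes that still matter, so 16 byte lanes need four steps, 8 word lanes need three, and so on, which matches the shrinking listings for the wider element types.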

Case 2: Int16 types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vpsrldq  xmm1, xmm0, 8
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddw   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 3: Int32 types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vpsrldq  xmm1, xmm0, 8
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 4: long types

Without PR
Not lowered

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vpsrldq  xmm1, xmm0, 8
       vpaddq   xmm0, xmm1, xmm0
       vmovd    rax, xmm0

Case 5: float types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vshufps  xmm1, xmm0, xmm0, -79
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, 3
       vaddps   xmm0, xmm1, xmm0
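
To see what the two shuffle immediates do, the float sequence above can be emulated lane by lane. This is a sketch under assumptions: `Floats4`, `shufps`, `addps`, and `sumFloats4` are hypothetical helper names, and the immediate decoding follows the documented two-bits-per-lane encoding of the shufps instruction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an xmm register holding four float lanes.
using Floats4 = std::array<float, 4>;

// Models (v)shufps dst, a, b, imm8: the low two result lanes pick from `a`,
// the high two from `b`, each selected by a 2-bit field of the immediate.
Floats4 shufps(const Floats4& a, const Floats4& b, int imm) {
    assert(imm >= -128 && imm <= 255);
    const unsigned bits = static_cast<std::uint8_t>(imm);  // -79 becomes 0xB1
    return {a[bits & 3], a[(bits >> 2) & 3], b[(bits >> 4) & 3], b[(bits >> 6) & 3]};
}

// Models (v)addps: lane-wise float addition.
Floats4 addps(const Floats4& x, const Floats4& y) {
    return {x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]};
}

float sumFloats4(Floats4 v) {
    v = addps(shufps(v, v, -79), v);  // swaps neighbours: lanes 0,1 hold a0+a1; lanes 2,3 hold a2+a3
    v = addps(shufps(v, v, 3), v);    // brings lane 3 down: lane 0 holds (a2+a3) + (a0+a1)
    return v[0];
}
```

Under this model the immediate -79 (0xB1) yields the lane order [1, 0, 3, 2], and the immediate 3 moves lane 3 into lane 0, so the additions are grouped pairwise as (a0 + a1) and (a2 + a3) before the final add.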

Case 6: double types

Without PR

With PR

       vmovups  xmm0, xmmword ptr [rcx]
       vshufpd  xmm1, xmm0, xmm0, 3
       vaddpd   xmm0, xmm1, xmm0

Vector256

Case 1: byte/ubyte types

Without PR
Not lowered

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 2: Int16 types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddw   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 3: Int32 types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 4: long types

Without PR
Not lowered
With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextracti128 xmm1, ymm0, 1
       vpaddq   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddq   xmm0, xmm1, xmm0
       vmovd    rax, xmm0

Case 5: float types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextractf128 xmm1, ymm0, 1
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, -79
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, 3
       vaddps   xmm0, xmm1, xmm0

Case 6: double types

Without PR

With PR

       vmovups  ymm0, ymmword ptr [rcx]
       vextractf128 xmm1, ymm0, 1
       vaddpd   xmm0, xmm1, xmm0
       vshufpd  xmm1, xmm0, xmm0, 3
       vaddpd   xmm0, xmm1, xmm0

Vector512

Case 1: byte/ubyte types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddb   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddb   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 1
       vpaddb   xmm0, xmm1, xmm0
       vmovd    eax, xmm0
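
The wider vectors add extract-and-add steps in front of the 128-bit sequence. A minimal portable model, assuming a hypothetical `foldHalves` helper and a `std::vector` in place of the registers, treats every step as folding the upper half onto the lower half. The real byte-shift steps keep the upper lanes around, but the value that ends up in lane 0 is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds the upper half of `v` onto the lower half and returns the lower half.
// This models the extract-and-add steps (vextracti64x4, vextracti128) and,
// as far as lane 0 is concerned, the byte-shift steps (vpsrldq) as well.
std::vector<std::uint8_t> foldHalves(const std::vector<std::uint8_t>& v) {
    assert(v.size() % 2 == 0);
    const std::size_t half = v.size() / 2;
    std::vector<std::uint8_t> out(half);
    for (std::size_t i = 0; i < half; ++i) {
        out[i] = static_cast<std::uint8_t>(v[i] + v[i + half]);  // wraps like vpaddb
    }
    return out;
}

// Repeated folding: 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1 lanes for a 512-bit byte vector.
std::uint8_t sumBytes(std::vector<std::uint8_t> v) {
    assert(!v.empty() && (v.size() & (v.size() - 1)) == 0);  // power-of-two length
    while (v.size() > 1) {
        v = foldHalves(v);
    }
    return v[0];
}
```

For example, 64 lanes that each hold 1 fold down to 64, which is the value the Vector512SByteSumTest output later in this thread expects.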

Case 2: Int16 types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddw   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddw   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 2
       vpaddw   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 3: Int32 types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddd   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0

Case 4: long types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextracti64x4 ymm1, zmm0, 1
       vpaddq   ymm0, ymm1, ymm0
       vextracti128 xmm1, ymm0, 1
       vpaddq   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 8
       vpaddq   xmm0, xmm1, xmm0
       vmovd    rax, xmm0

Case 5: float types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextractf64x4 ymm1, zmm0, 1
       vaddps   ymm0, ymm1, ymm0
       vextractf128 xmm1, ymm0, 1
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, -79
       vaddps   xmm0, xmm1, xmm0
       vshufps  xmm1, xmm0, xmm0, 3
       vaddps   xmm0, xmm1, xmm0

Case 6: double types

Without PR
Not lowered

With PR

       vmovups  zmm0, zmmword ptr [rcx]
       vextractf64x4 ymm1, zmm0, 1
       vaddpd   ymm0, ymm1, ymm0
       vextractf128 xmm1, ymm0, 1
       vaddpd   xmm0, xmm1, xmm0
       vshufpd  xmm1, xmm0, xmm0, 3
       vaddpd   xmm0, xmm1, xmm0

Instructions and dependencies

Vector128

8-bit integer
vpsrldq - sse2
vpaddb - sse2

16-bit integer
vpsrldq - sse2
vpaddw - sse2

32-bit integer
vpsrldq - sse2
vpaddd - sse2

64-bit integer
vpsrldq - sse2
vpaddq - sse2

float
shufps - sse
addps - sse

double
shufpd - sse2
addpd - sse2

Vector256
The corresponding Vector128 instructions plus the following

All integer types
vextracti128 - AVX2

All float types
vextractf128 - AVX

Vector512
The corresponding Vector128 and Vector256 instructions plus the following

All integer types
vextracti64x4 - AVX512F

All float types
vextractf64x4 - AVX512F

Performance numbers

On ICX

[image: performance numbers on ICX]

On SPR

[image: performance numbers on SPR]

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 4, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Dec 4, 2023
@ghost commented Dec 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: DeepakRajendrakumaran
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@EgorBo (Member) commented Dec 4, 2023

Does the new implementation change the resulting value compared to the previous one, due to IEEE 754 rules?
PS: I am curious: is there an algorithm where horizontal sum could be on the hot path?

@DeepakRajendrakumaran (Contributor, Author)

> Does the new implementation change the resulting value compared to the previous one, due to IEEE 754 rules? PS: I am curious: is there an algorithm where horizontal sum could be on the hot path?

The resulting values should remain the same. I see some test failures. Will verify/fix those before making PR 'ReadyForReview'

Re hadd: my original implementation used hadd (#87851). This comment got me to reconsider the implementation. The main reference points were the available native implementations for 512 (https://godbolt.org/z/hqr8hbYKc) and the discussion here. My understanding based on these is that I need to keep the 'order of adds' consistent for the 128-bit-wide floating-point versions.

Feel free to let me know if you see any issues here.
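
Why the order of adds matters for floating point can be shown with a small example. This is a generic illustration of IEEE 754 rounding, not a statement about which order the previous implementation used; `sumSequential` and `sumPairwise` are hypothetical names, and the sketch assumes ordinary single-precision evaluation (no fast-math reassociation).

```cpp
// Left-to-right order: ((a0 + a1) + a2) + a3.
float sumSequential(const float (&a)[4]) {
    float s = a[0] + a[1];
    s = s + a[2];
    s = s + a[3];
    return s;
}

// Pairwise order, as in the shuffle-based float listing: (a0 + a1) + (a2 + a3).
float sumPairwise(const float (&a)[4]) {
    const float lo = a[0] + a[1];
    const float hi = a[2] + a[3];
    return lo + hi;
}
```

With the input {1e8f, 1.0f, -1e8f, 1.0f}, each 1.0f is lost when it is added directly to a value of magnitude 1e8, so the two groupings can round differently and return different totals. That is why a change to the reduction order has to be checked against the expected results for float and double.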

@DeepakRajendrakumaran (Contributor, Author)

@EgorBo The test failures confuse me. They occur with AVX512F disabled. I looked at the disassembly, and it is the same with my change as on main. I'll look into this, but I'm pointing it out in case I'm missing anything obvious.

   Starting:    System.Runtime.Intrinsics.Tests (parallel test collections = on, max threads = 36)
  ; Assembly listing for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
  ; Emitting BLENDED_CODE for X64 with AVX - Windows
  ; MinOpts code
  ; debuggable code
  ; rbp based frame
  ; fully interruptible
  ; No PGO data
  ; Final local variable assignments
  ;
  ;  V00 this         [V00    ] (  1,  1   )     ref  ->  [rbp+0x10]  do-not-enreg[] this class-hnd <System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests>
  ;  V01 loc0         [V01    ] (  1,  1   )  struct (64) [rbp-0x40]  do-not-enreg[S] must-init <System.Runtime.Intrinsics.Vector512`1[byte]>
  ;  V02 OutArgs      [V02    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
  ;  V03 tmp1         [V03    ] (  1,  1   )  struct (64) [rbp-0x80]  do-not-enreg[HS] hidden-struct-arg "impSpillStackEnsure" <System.Runtime.Intrinsics.Vector512`1[byte]>
  ;  V04 tmp2         [V04    ] (  1,  1   )     int  ->  [rbp-0x84]  do-not-enreg[] "impSpillStackEnsure"
  ;  V05 tmp3         [V05    ] (  1,  1   )     int  ->  [rbp-0x88]  do-not-enreg[] "impSpillStackEnsure"
  ;  V06 tmp4         [V06    ] (  1,  1   )  struct (64) [rbp-0xC8]  do-not-enreg[XS] addr-exposed "by-value struct argument" <System.Runtime.Intrinsics.Vector512`1[byte]>
  ;
  ; Lcl frame size = 240
  
  G_M2536_IG01:  ;; offset=0x0000
         push     rbp
         sub      rsp, 240
         vzeroupper 
         lea      rbp, [rsp+0xF0]
         vxorps   xmm4, xmm4, xmm4
         vmovdqu  ymmword ptr [rbp-0x40], ymm4
         vmovdqu  ymmword ptr [rbp-0x20], ymm4
         mov      gword ptr [rbp+0x10], rcx
  						;; size=37 bbWeight=1 PerfScore 8.08
  G_M2536_IG02:  ;; offset=0x0025
         cmp      dword ptr [(reloc 0x7ff8b3e07b30)], 0
         je       SHORT G_M2536_IG04
  						;; size=9 bbWeight=1 PerfScore 4.00
  G_M2536_IG03:  ;; offset=0x002E
         call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
  						;; size=5 bbWeight=0.50 PerfScore 0.50
  G_M2536_IG04:  ;; offset=0x0033
         nop      
         lea      rcx, [rbp-0x80]
         mov      edx, 1
         call     [System.Runtime.Intrinsics.Vector512:Create(byte):System.Runtime.Intrinsics.Vector512`1[byte]]
         vmovdqu  ymm0, ymmword ptr [rbp-0x80]
         vmovdqu  ymmword ptr [rbp-0x40], ymm0
         vmovdqu  ymm0, ymmword ptr [rbp-0x60]
         vmovdqu  ymmword ptr [rbp-0x20], ymm0
         mov      dword ptr [rbp-0x84], 64
         vmovdqu  ymm0, ymmword ptr [rbp-0x40]
         vmovdqu  ymmword ptr [rbp-0xC8], ymm0
         vmovdqu  ymm0, ymmword ptr [rbp-0x20]
         vmovdqu  ymmword ptr [rbp-0xA8], ymm0
         lea      rcx, [rbp-0xC8]
         call     [System.Runtime.Intrinsics.Vector512:Sum[byte](System.Runtime.Intrinsics.Vector512`1[byte]):byte]
         mov      dword ptr [rbp-0x88], eax
         mov      ecx, dword ptr [rbp-0x84]
         mov      edx, dword ptr [rbp-0x88]
         call     [Xunit.Assert:Equal[byte](byte,byte)]
         nop      
         nop      
  						;; size=111 bbWeight=1 PerfScore 35.00
  G_M2536_IG05:  ;; offset=0x00A2
         add      rsp, 240
         pop      rbp
         ret      
  						;; size=9 bbWeight=1 PerfScore 1.75
  
  ; Total bytes of code 171, prolog size 37, PerfScore 66.43, instruction count 35, allocated bytes for code 171 (MethodHash=03abf617) for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
  ; ============================================================
  
      System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512SByteSumTest [FAIL]
        Assert.Equal() Failure: Values differ
        Expected: 64
        Actual:   100
        Stack Trace:
          C:\Users\deepakra\Dotnet\runtime\src\libraries\System.Runtime.Intrinsics\tests\Vectors\Vector512Tests.cs(4891,0): at System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512SByteSumTest()
             at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
          C:\Users\deepakra\Dotnet\runtime\src\coreclr\System.Private.CoreLib\src\System\Reflection\MethodBaseInvoker.CoreCLR.cs(36,0): at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args)
          C:\Users\deepakra\Dotnet\runtime\src\libraries\System.Private.CoreLib\src\System\Reflection\MethodBaseInvoker.cs(57,0): at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

With AVX512F Enabled - This passes

 ; Assembly listing for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
 ; Emitting BLENDED_CODE for X64 with AVX512 - Windows
 ; MinOpts code
 ; debuggable code
 ; rbp based frame
 ; fully interruptible
 ; No PGO data
 ; Final local variable assignments
 ;
 ;  V00 this         [V00    ] (  1,  1   )     ref  ->  [rbp+0x10]  do-not-enreg[] this class-hnd <System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests>
 ;  V01 loc0         [V01    ] (  1,  1   )  simd64  ->  [rbp-0x70]  do-not-enreg[S] must-init <System.Runtime.Intrinsics.Vector512`1[byte]>
 ;  V02 OutArgs      [V02    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V03 tmp1         [V03    ] (  1,  1   )  simd64  ->  [rbp-0xB0]  do-not-enreg[S] "impSpillStackEnsure"
 ;  V04 tmp2         [V04    ] (  1,  1   )  simd32  ->  [rbp-0xD0]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V05 tmp3         [V05    ] (  1,  1   )  simd16  ->  [rbp-0xE0]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V06 tmp4         [V06    ] (  1,  1   )  simd16  ->  [rbp-0xF0]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V07 tmp5         [V07    ] (  1,  1   )  simd16  ->  [rbp-0x100]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V08 tmp6         [V08    ] (  1,  1   )  simd16  ->  [rbp-0x110]  do-not-enreg[S] "fgMakeTemp is creating a new local variable"
 ;  V09 tmp7         [V09    ] (  1,  1   )     int  ->  [rbp-0x114]  do-not-enreg[] "impSpillStackEnsure"
 ;  V10 tmp8         [V10    ] (  1,  1   )     int  ->  [rbp-0x118]  do-not-enreg[] "impSpillStackEnsure"
 ;
 ; Lcl frame size = 320

 G_M2536_IG01:  ;; offset=0x0000
        push     rbp
        sub      rsp, 320
        vzeroupper
        lea      rbp, [rsp+0x140]
        vxorps   xmm4, xmm4, xmm4
        vmovdqu  ymmword ptr [rbp-0x70], ymm4
        vmovdqu  ymmword ptr [rbp-0x50], ymm4
        mov      gword ptr [rbp+0x10], rcx
                                               ;; size=37 bbWeight=1 PerfScore 8.08
 G_M2536_IG02:  ;; offset=0x0025
        cmp      dword ptr [(reloc 0x7ff8b75b3b80)], 0
        je       SHORT G_M2536_IG04
                                               ;; size=9 bbWeight=1 PerfScore 4.00
 G_M2536_IG03:  ;; offset=0x002E
        call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
                                               ;; size=5 bbWeight=0.50 PerfScore 0.50
 G_M2536_IG04:  ;; offset=0x0033
        nop
        vmovups  zmm0, zmmword ptr [reloc @RWD00]
        vmovups  zmmword ptr [rbp-0xB0], zmm0
        vmovups  zmm0, zmmword ptr [rbp-0xB0]
        vmovups  zmmword ptr [rbp-0x70], zmm0
        mov      dword ptr [rbp-0x114], 64
        vmovups  zmm0, zmmword ptr [rbp-0x70]
        vextracti64x4 ymm0, zmm0, 1
        vmovdqu  ymm1, ymmword ptr [rbp-0x70]
        vpaddb   ymm0, ymm0, ymm1
        vmovups  ymmword ptr [rbp-0xD0], ymm0
        vmovups  ymm0, ymmword ptr [rbp-0xD0]
        vextracti128 xmm0, ymm0, 1
        vmovdqu  xmm1, xmmword ptr [rbp-0xD0]
        vpaddb   xmm0, xmm0, xmm1
        vmovaps  xmmword ptr [rbp-0xE0], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0xE0], 8
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0xE0]
        vmovaps  xmmword ptr [rbp-0xF0], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0xF0], 4
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0xF0]
        vmovaps  xmmword ptr [rbp-0x100], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0x100], 2
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0x100]
        vmovaps  xmmword ptr [rbp-0x110], xmm0
        vpsrldq  xmm0, xmmword ptr [rbp-0x110], 1
        vpaddb   xmm0, xmm0, xmmword ptr [rbp-0x110]
        vmovd    ecx, xmm0
        movsx    rcx, cl
        mov      dword ptr [rbp-0x118], ecx
        mov      ecx, dword ptr [rbp-0x114]
        mov      edx, dword ptr [rbp-0x118]
        call     [Xunit.Assert:Equal[byte](byte,byte)]
        nop
        nop
                                               ;; size=241 bbWeight=1 PerfScore 57.67
 G_M2536_IG05:  ;; offset=0x0124
        vzeroupper
        add      rsp, 320
        pop      rbp
        ret
                                               ;; size=12 bbWeight=1 PerfScore 2.75
 RWD00         dq      0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h


 ; Total bytes of code 304, prolog size 37, PerfScore 104.60, instruction count 50, allocated bytes for code 316 (MethodHash=03abf617) for method System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SByteSumTest():this (MinOpts)
 ; ============================================================

   Finished:    System.Runtime.Intrinsics.Tests
 === TEST EXECUTION SUMMARY ===
    System.Runtime.Intrinsics.Tests  Total: 1, Errors: 0, Failed: 0, Skipped: 0, Time: 4.176s

@DeepakRajendrakumaran DeepakRajendrakumaran changed the title Updating Sum() implementation. Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512 Dec 14, 2023
@DeepakRajendrakumaran DeepakRajendrakumaran marked this pull request as ready for review December 15, 2023 21:07
break;
}
else if (varTypeIsByte(simdBaseType) || varTypeIsLong(simdBaseType))
#if defined(TARGET_X86)
else if (varTypeIsLong(simdBaseType))
Member

What support is missing for long on 32-bit?

I would have expected this to generally "just work" and for us to be able to use _ToScalar provided SSE4.1 is supported given that GetElement has the required decomposition support for that case.

@DeepakRajendrakumaran (Contributor, Author) Jan 3, 2024

I had a failure with decomposition. I tried again and it passes locally. I enabled that code to get it to run on CI.

Contributor Author

[image]

I need to debug this to pinpoint exactly what's going wrong here.

Contributor Author

Not sure if this is the right way to handle this - 9e3657f

@tannergooding (Member) Jan 5, 2024

That won't quite work unfortunately and the actual handling is going to be a little bit more complex.

Noting that I'm fine with this being handled as a separate PR (and am happy to do that work, seeing as I'm pretty sure I know what will need to be done here). I had initially thought it might be slightly simpler, but it looks like it's not quite all there.

The "proper" fix likely entails:

  1. Extracting this logic to a gtNewSimdToScalarNode helper: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L2917-L2937
  2. Updating various places that are manually doing gtNewSimdHWIntrinsicNode(retType, op1, NI_Vector###_ToScalar, simdBaseJitType, simdSize) to call the introduced helper
  3. Have the impSpecialIntrinsic handling for NI_Vector###_Sum do:
#if defined(TARGET_X86)
            else if (varTypeIsLong(simdBaseType) && !compOpportunisticallyDependsOn(InstructionSet_SSE41))
            {
                // We need SSE41 to handle long, use software fallback
                break;
            }
#endif // TARGET_X86

At some future point we can ensure that NI_Vector###_GetElement has decomp handling for pre-SSE4.1 as well, it's just slightly more complex given that TYP_INT currently requires SSE4.1 as well. The fix to get that working is likely to just reuse the TYP_FLOAT handling for getting the relevant 32-bit part and then using ToScalar (shifting can then be used for the relevant 8-bit or 16-bit part for byte/short).
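
The decomposition idea described above can be pictured with a small portable model. This is a hypothetical sketch, not JIT code: `Lanes32`, `toScalarLong`, and `getByte` are made-up names, and a four-element array stands in for a 128-bit vector viewed as 32-bit lanes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// A 128-bit vector viewed as four 32-bit lanes (lane 0 is the lowest).
using Lanes32 = std::array<std::uint32_t, 4>;

// Hypothetical model of reading 64-bit element 0 on a 32-bit target:
// fetch the low and high 32-bit lanes separately, then combine them.
std::uint64_t toScalarLong(const Lanes32& v) {
    const std::uint32_t lo = v[0];
    const std::uint32_t hi = v[1];
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical model of reading byte `index` (0..15) without a dedicated
// byte-extract instruction: take the containing 32-bit lane, then shift
// the wanted byte down and truncate.
std::uint8_t getByte(const Lanes32& v, unsigned index) {
    assert(index < 16);
    const std::uint32_t lane = v[index / 4];
    return static_cast<std::uint8_t>(lane >> (8 * (index % 4)));
}
```

The point of the sketch is only that a 64-bit result can be assembled from two 32-bit reads, and that a byte or short can be recovered from a 32-bit read by shifting; how the JIT would actually express this is left to the follow-up work mentioned in the comment.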

Contributor Author

I've updated the implementation to use the helper. Waiting for CI tests to pass. My local tests passed

//
GenTree* Compiler::gtNewSimdToScalarNode(var_types type, GenTree* op1, CorInfoType simdBaseJitType, unsigned simdSize)
{

Member

Stray empty line.

@tannergooding (Member)

CC. @dotnet/jit-contrib, this needs a secondary review but should be good to merge otherwise

@BruceForstall BruceForstall merged commit 473a983 into dotnet:main Jan 17, 2024
139 checks passed
tmds pushed a commit to tmds/runtime that referenced this pull request Jan 23, 2024
…wering for Vector512 (dotnet#95568)

* Updating Sum() implementation.

* Fixing codegen

* Addressing review comments.

* Fix Formatting

* Enabling for long on x86.

* Cleaning up ToScalar implementation
@github-actions github-actions bot locked and limited conversation to collaborators Feb 17, 2024