
Enable NI_Vector256_Create on AVX1 #72522

Merged · 4 commits · Jul 25, 2022

Conversation

EgorBo (Member) commented Jul 20, 2022

Fixes #72506

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Jul 20, 2022
ghost assigned EgorBo Jul 20, 2022
ghost commented Jul 20, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #72506

Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

EgorBo marked this pull request as ready for review July 20, 2022 13:41
EgorBo (Member, Author) commented Jul 20, 2022

@tannergooding PTAL

tannergooding (Member):

An alternative and likely better fix would be to just remove or update the recursive path from the Create calls. Such a path is only used for indirect invocation (reflection, delegates, function pointers, etc.) and it's not really worth optimizing them when the software fallback already does the right thing. Additionally, it simplifies the handling and reduces the risk of other errors due to future ISA or implementation changes.

EgorBo (Member, Author) commented Jul 20, 2022

An alternative and likely better fix would be to just remove or update the recursive path from the Create calls. Such a path is only used for indirect invocation (reflection, delegates, function pointers, etc.) and it's not really worth optimizing them when the software fallback already does the right thing. Additionally, it simplifies the handling and reduces the risk of other errors due to future ISA or implementation changes.

@tannergooding I am not sure I understand how it's better - if we remove the recursive path we won't be able to optimize Vector256.Create on AVX1, so e.g. Avx.Xor(Vector256.Create(...), Vector256.Create(...)) will be terribly slow.

You probably meant that we can drop the recursive path and keep my JIT change? Because if I revert my change and also remove the recursive path, it will be slow.
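The pattern EgorBo is describing might look like the following sketch (the method name and mask value are illustrative, not from the PR; the Avx.Xor and Vector256.Create APIs are real):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class AvxOnlyExample
{
    // On AVX-only hardware (no AVX2), Avx.Xor itself is a single instruction,
    // but if the JIT stops treating Vector256.Create as intrinsic, the mask
    // below is built via a slow software fallback on every call.
    static Vector256<float> NegateAll(Vector256<float> value)
    {
        if (Avx.IsSupported)
        {
            // Hypothetical use: flip the sign bit of every lane.
            Vector256<float> signMask = Vector256.Create(-0.0f);
            return Avx.Xor(value, signMask);
        }
        return value; // scalar fallback elided for brevity
    }
}
```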

tannergooding (Member):

@tannergooding I am not sure I understand how it's better - if we remove the recursive path we won't be able to optimize Vector256.Create on AVX1, so e.g. Avx.Xor(Vector256.Create(...), Vector256.Create(...)) will be terribly slow.

AVX-only hardware is pretty rare (only Sandy Bridge and Ivy Bridge processors), and performance is already going to be problematic in various cases due to other APIs and scenarios not being accelerated.

We also continue hitting various edge cases and hard-to-find bugs due to the complex relationship between what's available on AVX vs what requires AVX2. If we want to provide AVX-only acceleration, then that should likely be looked at more broadly and we should consider the best way to model it to ensure that all the "expected" functions work as users require. It's notably too late in the .NET 7 cycle to do that, however, and it is decently non-trivial since it requires looking at each Vector256.* API and considering when it can light up for each of the supporting types (often float/double being AVX and integer types being AVX2, but there is some intermixing there).

Keeping this so that Vector256.Create is only accelerated when Vector256.IsHardwareAccelerated returns true simplifies the overall handling and ensures that .NET 7 ships more robustly. This will not impact cloud providers (who, to my knowledge, are all on Broadwell or later) and won't impact most end users.
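The guard pattern this invariant enables could be sketched as follows (the method is hypothetical; Vector256.IsHardwareAccelerated and Avx.IsSupported are the real APIs):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SupportTiers
{
    static string DescribeVectorSupport()
    {
        // True only when the full set of Vector256 helpers is fast
        // (AVX2 on x64) - the invariant discussed in this thread.
        if (Vector256.IsHardwareAccelerated)
            return "full Vector256 acceleration (AVX2)";

        // AVX-only hardware (Sandy/Ivy Bridge): raw Avx.* intrinsics,
        // plus Vector256.Create after this PR, are still fast.
        if (Avx.IsSupported)
            return "AVX-only: raw intrinsics plus Vector256.Create";

        return "software fallback";
    }
}
```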

tannergooding (Member) commented Jul 20, 2022

Taking this change "as is" requires running the outerloop ISA tests and validating that no AVX2 instructions are generated by any path that Vector256.Create can hit, including under GT_CNS_VEC and any other tree shapes that morph, valuenum, lowering, or other phases may introduce.

EgorBo (Member, Author) commented Jul 20, 2022

@tannergooding I think it's better to do both: drop the recursive calls (I've just done it) and still accelerate Vector256.Create for AVX1 (also done).

tannergooding (Member):

and still accelerate Vector256.Create for AVX1 (also done).

This is still going to require running the outerloop ISA stress jobs and confirming that other phases aren't expecting to only be hit when AVX2 is enabled.

EgorBo (Member, Author) commented Jul 20, 2022

Because it's possible to work with AVX-only hardware in .NET 6.0 via Vector256.Create + Avx.X APIs, I don't see why we'd want to break it.

EgorBo (Member, Author) commented Jul 20, 2022

/azp list

EgorBo (Member, Author) commented Jul 20, 2022

and still accelerate Vector256.Create for AVX1 (also done).

This is still going to require running the outerloop ISA stress jobs and confirming that other phases aren't expecting to only be hit when AVX2 is enabled.

Do you mean the runtime outerloop pipeline?

tannergooding (Member):

Because it's possible to work with AVX-only in .NET 6.0 via Vector256.Create + Avx.X APIs - I don't see why we want to break it

It's semi-possible and you run into some poor experiences and bad codegen because the JIT can't optimize everything as it needs. We also hit and had to patch several bugs related to mismatches.

Generating AVX-only code (when AVX2 isn't supported) has been one of the biggest issues we've faced because Vector&lt;T&gt; never had such support and various paths don't expect it. The raw hardware intrinsics mostly mitigate this in that they are directly tied to an ISA/base type. The helper APIs (like Create or GetElement) are not directly tied to an ISA or base type; they are generic, and you have to keep in mind in the JIT exactly how each tree shape is handled in the backend.

tannergooding (Member) commented Jul 20, 2022

/azp run runtime-coreclr jitstress-isas-x86

^ this one (edit: thought this would work since the run command was on a single line, apparently it doesn't >.>)

azure-pipelines:

No pipelines are associated with this pull request.

EgorBo (Member, Author) commented Jul 20, 2022

/azp run runtime-coreclr jitstress-isas-x86

azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines:

No pipelines are associated with this pull request.

EgorBo (Member, Author) commented Jul 20, 2022

@tannergooding sure, I understand your point, but as we now have a stronger AOT-only story (NativeAOT), we should be ready for cases where developers limit capabilities to some lower common denominator, e.g. --instruction-set avx (even some modern games distribute binaries for AVX1 only, afaik). So while Ivy Bridge-like machines might be rare for the JIT, that's not the case for AOT.
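For context, the AOT scenario EgorBo mentions is typically expressed at publish time; a sketch of a NativeAOT project fragment (the IlcInstructionSet property is the MSBuild-level counterpart of ilc's --instruction-set switch; treat the exact spelling as an assumption to verify against the NativeAOT docs):

```xml
<!-- Hypothetical project fragment: cap codegen at AVX1 for a NativeAOT
     publish, mirroring the "--instruction-set avx" example above. -->
<PropertyGroup>
  <PublishAot>true</PublishAot>
  <IlcInstructionSet>avx</IlcInstructionSet>
</PropertyGroup>
```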

tannergooding (Member) commented Jul 20, 2022

Sure, the main concern here is destabilizing .NET 7 and the work that might be required to ensure that everything works end to end as expected, particularly given all the changes and issues that had already cropped up and caused the limitation to be put in for .NET 7.

EgorBo (Member, Author) commented Jul 21, 2022

@tannergooding CI looks good; the pri1 ISAs x86 pipelines are broken (#71928).

saucecontrol (Member):

Keeping this so that Vector256.Create is only accelerated when Vector256.IsHardwareAccelerated returns true simplifies the overall handling and ensures that .NET 7 ships more robustly. This will not impact cloud providers (who, to my knowledge, are all on Broadwell or later) and won't impact most end users.

This may not impact Azure or AWS, but some of the smaller hosts definitely have these older machines in the mix. I have a client running .NET apps on a large number of Rackspace Cloud VMs, and all the ones I saw in a recent spot check were Sandy Bridge (Xeon E5-2680 to be precise).

Vector256.Create(float) has been accelerated since 3.0 on AVX-only, and no matter how uncommon that hardware is, disabling acceleration now would be a nasty surprise for anyone that's running currently-fast AVX code paths that would be crippled on update to 7.0.

EgorBo (Member, Author) commented Jul 25, 2022

@tannergooding ping

tannergooding (Member):

Vector256.Create(float) has been accelerated since 3.0 on AVX-only, and no matter how uncommon that hardware is, disabling acceleration now would be a nasty surprise for anyone that's running currently-fast AVX code paths that would be crippled on update to 7.0.

More generally speaking, I'm raising a concern around the number of bugs we've had and continue hitting on trying to split AVX/AVX2 functionality.

There is complexity in ensuring that support exists, that it works well, and that some code path doesn't accidentally pessimize it because it really needs some functionality that was only in AVX2 (not to mention ensuring there isn't a bug, of which we've had a few, where AVX2 is generated when only AVX is supported). This is why, when AMD, Intel, LLVM, and others got together to define the x86-64-v* baselines, they were:

  • v1: CMOV, CMPXCHG8B, SSE, SSE2
  • v2: ..., CMPXCHG16B, POPCNT, SSE3, SSSE3, SSE4.1, SSE4.2
  • v3: ..., AVX, AVX2, BMI1, BMI2, FMA, F16C, LZCNT, MOVBE
  • v4: ..., AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL

This was done in order to keep things "simple" and avoid an explosion of support matrices. Supporting just "SSSE3" or just "AVX" doesn't necessarily make sense, and so Vector256.IsHardwareAccelerated reports "false" when only AVX is supported. And while I agree that not regressing an existing scenario of Vector256.Create(float) working on AVX only is goodness, there is a balance in ensuring that the overall support in the JIT remains robust/correct and that we don't unnecessarily complicate the JIT implementation.

Some examples of where this gets complex, even for Create(float) are that VBROADCASTSS ymm, xmm is AVX2; AVX only exposes VBROADCASTSS ymm, mem32. The same goes for VPERMPS. You then get into more complex cases with integer types where Vector256<int> ends up partially accelerated because GetLower/GetUpper and WithLower/WithUpper can be "always accelerated" as can Load, Store, and Test. But other operations can't be. So it becomes a question of "where do we draw the line" and how complex do we make the import checks around whether something can or cannot be intrinsified in importation.
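The VBROADCASTSS point above can be made concrete with a tiny sketch (the commented instruction forms follow the Intel SDM; the JIT's actual codegen for any given build may differ):

```csharp
using System.Runtime.Intrinsics;

static class BroadcastExample
{
    // With AVX2 the JIT can emit the register-source form:
    //     vbroadcastss ymm, xmm
    // With AVX only, the register form doesn't exist - only
    //     vbroadcastss ymm, m32
    // is available, so x must first be spilled to memory before
    // it can be broadcast to all eight lanes.
    static Vector256<float> Broadcast(float x) => Vector256.Create(x);
}
```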

EgorBo (Member, Author) commented Jul 25, 2022

So it becomes a question of "where do we draw the line" and how complex do we make the import checks around whether something can or cannot be intrinsified in importation.

@tannergooding I agree, but this change still reports Vector256.IsHardwareAccelerated as false, significantly improves perf for the most common operation (Vector256.Create), and fixes the unexpected recursion that resulted in PlatformNotSupportedException. Other, less common portable APIs can be accelerated as we hit them, but they're not as common as Create IMO.

I do think we need to fix #72506 for 7.0 - there are no good reasons to break what worked in 3.1-6.0

tannergooding (Member):

I do think we need to fix #72506 for 7.0

Yes, 100% agree

there are no good reasons to break what worked in 3.1-6.0

Provided we have sufficient coverage and validation that it is fixed and isn't regressing other key scenarios, and isn't going to be something we're likely to decide to regress in the future due to the complexity required in the JIT, etc.

EgorBo (Member, Author) commented Jul 25, 2022

@tannergooding so can you approve it?
The JIT change is literally a few lines of code, and CI is happy.
The rest is just the cleanup you asked me to do: removing the hack that only improved perf for indirect usages (which are slow by definition).

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues.

Vector256.Create() randomly throws PlatformNotSupportedException
4 participants