-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize String.Equals for OrdinalIgnoreCase #77947
Vectorize String.Equals for OrdinalIgnoreCase #77947
Conversation
Tagging subscribers to this area: @dotnet/area-system-globalization Issue DetailsI tried to keep it simple so no AVX fallback + scalar path for trailing elements. Benchmark: [Benchmark]
[ArgumentsSource(nameof(TestData))]
public bool EqualsIgnoreCase(string s1, string s2) =>
s1.Equals(s2, StringComparison.OrdinalIgnoreCase);
public static IEnumerable<object[]> TestData()
{
yield return new object[]
{
@"Hi!", // 3 chars (to make sure overhead is not big)
@"HI!",
};
yield return new object[]
{
@"hello!!!", // 8 chars (switches to SIMD)
@"HELLO!!!",
};
yield return new object[]
{
@"hello world", // 11 chars (1 simd + trailing elements)
@"HELLO WORLD",
};
yield return new object[]
{
@"C:\prj\runtime-main\src\coreclr\CMakeLists.txt",
@"C:\prj\runtime-main\src\CORECLR\CMakeLists.txt",
};
yield return new object[]
{
@"Good bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:", // 226 chars
@"GOOD bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:",
};
}
I expect better difference on ARM64 where the scalar (SWAR) path suffers from 64bit constants everywhere.
|
X86 codegen for the main loop: G_M65434_IG03: ;; offset=0011H
4C63C8 movsxd r9, eax
C4A17A6F0C49 vmovdqu xmm1, xmmword ptr [rcx+2*r9]
C4A17A6F144A vmovdqu xmm2, xmmword ptr [rdx+2*r9]
C5F1EBDA vpor xmm3, xmm1, xmm2
C5E1DBD8 vpand xmm3, xmm3, xmm0
C4E27917DB vptest xmm3, xmm3
755B jne SHORT G_M65434_IG05 ;; non-ASCII
G_M65434_IG04:
C5F9101D99000000 vmovupd xmm3, xmmword ptr [reloc @RWD16]
C5E1FCE1 vpaddb xmm4, xmm3, xmm1
C5E1FCDA vpaddb xmm3, xmm3, xmm2
C5F9102D99000000 vmovupd xmm5, xmmword ptr [reloc @RWD32]
C5D964E5 vpcmpgtb xmm4, xmm4, xmm5
C5E164DD vpcmpgtb xmm3, xmm3, xmm5
C5F9102D99000000 vmovupd xmm5, xmmword ptr [reloc @RWD48]
C5D9DFE5 vpandn xmm4, xmm4, xmm5
C5E1DFDD vpandn xmm3, xmm3, xmm5
C5D9FCC9 vpaddb xmm1, xmm4, xmm1
C5E1FCD2 vpaddb xmm2, xmm3, xmm2
C5F1EFCA vpxor xmm1, xmm1, xmm2
C4E27917C9 vptest xmm1, xmm1
7523 jne SHORT G_M65434_IG07 ;; not equal
83C008 add eax, 8
458D48F8 lea r9d, [r8-08H]
413BC1 cmp eax, r9d
7E93 jle SHORT G_M65434_IG03 ;; next iteration ARM64: G_M65434_IG03: ;; offset=0010H
937F7C64 sbfiz x4, x3, #1, #32
3CE46811 ldr q17, [x0, x4]
3CE46832 ldr q18, [x1, x4]
4EB21E33 orr v19.8h, v17.8h, v18.8h
4E301E73 and v19.8h, v19.8h, v16.8h
6EB3A673 umaxp v19.4s, v19.4s, v19.4s
4E083E64 umov x4, v19.d[0]
F100009F cmp x4, #0
540002A1 bne G_M65434_IG05 ;; non-ascii
G_M65434_IG04:
9C0005F3 ldr q19, [@RWD16]
4E318674 add v20.16b, v19.16b, v17.16b
4E328673 add v19.16b, v19.16b, v18.16b
9C000615 ldr q21, [@RWD32]
4E353694 cmgt v20.16b, v20.16b, v21.16b
4E353673 cmgt v19.16b, v19.16b, v21.16b
9C000635 ldr q21, [@RWD48]
4E741EB4 bic v20.16b, v21.16b, v20.16b
4E731EB3 bic v19.16b, v21.16b, v19.16b
4E318691 add v17.16b, v20.16b, v17.16b
4E328672 add v18.16b, v19.16b, v18.16b
6E321E31 eor v17.16b, v17.16b, v18.16b
6EB1A631 umaxp v17.4s, v17.4s, v17.4s
4E083E24 umov x4, v17.d[0]
F100009F cmp x4, #0
54000201 bne G_M65434_IG07 ;; not equal
11002063 add w3, w3, #8
51002044 sub w4, w2, #8
6B04007F cmp w3, w4
54FFFC8D ble G_M65434_IG03 ;; next iteration Will work separately in the jit to figure out why the memory loads were not hoisted (constant vectors) |
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
…Ordinal.cs Co-authored-by: Stephen Toub <stoub@microsoft.com>
…o/runtime-1 into vectorize-equals-ordinalignorecase
src/tests/JIT/opt/Vectorization/UnrollEqualsStartsWIth_minopts.csproj
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
@stephentoub @tarekgh @GrabYourPitchforks does it look good otherwise? CI is finally green (one more job to finish) |
Decided to delete the runtime MinOpts test - it's too slow under stress modes and OrdinalIngoreCase is already covered with libs tests |
…s-ordinalignorecase
Windows docker scenarios are broken due to dotnet/arcade#11554 ; I am investigating. |
…s-ordinalignorecase
Improvements on Linux-x64 dotnet/perf-autofiling-issues#9737 |
I tried to keep it simple so no AVX path, only SSE and NEON + SWAR path for trailing elements.
Benchmark:
I expect better difference on ARM64 where the scalar (SWAR) path suffers from 64bit constants everywhere.