Vectorize String.Equals for OrdinalIgnoreCase #77947

EgorBo · 2022-11-05T15:49:00Z

I tried to keep it simple: there is no AVX path, only SSE and NEON, plus a SWAR path for trailing elements.

Benchmark:

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public bool EqualsIgnoreCase(string s1, string s2) => 
    s1.Equals(s2, StringComparison.OrdinalIgnoreCase);



public static IEnumerable<object[]> TestData()
{
    yield return new object[]
    {
        @"Hi!", // 3 chars (to make sure overhead is not big)
        @"HI!",
    };

    yield return new object[]
    {
        @"hello!!!", // 8 chars (switches to SIMD)
        @"HELLO!!!",
    };

    yield return new object[]
    {
        @"hello world", // 11 chars (1xV128 + trailing elements)
        @"HELLO WORLD",
    };

    yield return new object[]
    {
        @"C:\prj\runtime-main\src\coreclr\CMakeLists.txt", // 46 chars (5xV128 + trailing elements)
        @"C:\prj\runtime-main\src\CORECLR\CMakeLists.txt",
    };

    yield return new object[]
    {
        @"Good bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:", // 226 chars
        @"GOOD bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:",
    };
}

Method	Toolchain	s1	s2	Mean
EqualsIgnoreCase	\Core_Root\corerun.exe	Hi!	HI!	2.940 ns
EqualsIgnoreCase	\Core_Root_base\corerun.exe	Hi!	HI!	2.901 ns

EqualsIgnoreCase	\Core_Root\corerun.exe	hello!!! [8]	HELLO!!! [8]	3.418 ns
EqualsIgnoreCase	\Core_Root_base\corerun.exe	hello!!! [8]	HELLO!!! [8]	4.829 ns

EqualsIgnoreCase	\Core_Root\corerun.exe	hello world [11]	HELLO WORLD [11]	5.880 ns
EqualsIgnoreCase	\Core_Root_base\corerun.exe	hello world [11]	HELLO WORLD [11]	6.143 ns

EqualsIgnoreCase	\Core_Root\corerun.exe	C:\pr(...)s.txt [46]	C:\pr(...)s.txt [46]	11.422 ns
EqualsIgnoreCase	\Core_Root_base\corerun.exe	C:\pr(...)s.txt [46]	C:\pr(...)s.txt [46]	19.049 ns

EqualsIgnoreCase	\Core_Root\corerun.exe	Good(...)ion: [226]	GOOD(...)ion: [226]	40.289 ns
EqualsIgnoreCase	\Core_Root_base\corerun.exe	Good(...)ion: [226]	GOOD(...)ion: [226]	86.035 ns

I expect a bigger difference on ARM64, where the scalar (SWAR) path suffers from 64-bit constants everywhere.
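For readers unfamiliar with SWAR ("SIMD within a register"), here is a minimal sketch of the idea: four UTF-16 chars are packed into one `ulong` and lowercased with plain integer arithmetic. This is not the runtime's actual helper. The names `SwarSketch`, `Pack`, `ToLowerAscii` and `EqualsIgnoreCaseAscii` are hypothetical, and the sketch assumes every char is ASCII (below 0x80). It does show why each step needs a wide 64-bit constant, which is the cost mentioned above for ARM64.

```csharp
using System;

static class SwarSketch
{
    // Hypothetical helper: packs up to four chars into a ulong, one char per
    // 16-bit lane, mirroring how UTF-16 text is laid out in memory.
    public static ulong Pack(string s)
    {
        ulong v = 0;
        for (int i = 0; i < s.Length && i < 4; i++)
            v |= (ulong)s[i] << (16 * i);
        return v;
    }

    // Lowercases four ASCII chars at once. Assumes every lane is < 0x80,
    // so the additions below never carry into the neighbouring lane.
    public static ulong ToLowerAscii(ulong v)
    {
        // Bit 7 of a lane is set iff the char is >= 'A' (0x41 + 0x3F == 0x80).
        ulong geA = v + 0x003F_003F_003F_003FUL;
        // Bit 7 of a lane is set iff the char is > 'Z' (0x5B + 0x25 == 0x80).
        ulong gtZ = v + 0x0025_0025_0025_0025UL;
        // 0x80 in each lane holding 'A'..'Z', shifted down to the 0x20 case bit.
        ulong caseBit = (geA & ~gtZ & 0x0080_0080_0080_0080UL) >> 2;
        return v | caseBit;
    }

    // Two packed values are equal ignoring ASCII case iff their lowercased forms match.
    public static bool EqualsIgnoreCaseAscii(ulong a, ulong b) =>
        ToLowerAscii(a) == ToLowerAscii(b);
}
```

Non-letter pairs that differ only in the 0x20 bit, such as `@` and a backtick, still compare as different because only lanes in the `A`..`Z` range get the case bit set.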

ghost · 2022-11-05T15:49:12Z

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

EgorBo · 2022-11-05T16:09:31Z

X86 codegen for the main loop:

G_M65434_IG03:              ;; offset=0011H
       4C63C8               movsxd   r9, eax
       C4A17A6F0C49         vmovdqu  xmm1, xmmword ptr [rcx+2*r9]
       C4A17A6F144A         vmovdqu  xmm2, xmmword ptr [rdx+2*r9]
       C5F1EBDA             vpor     xmm3, xmm1, xmm2
       C5E1DBD8             vpand    xmm3, xmm3, xmm0
       C4E27917DB           vptest   xmm3, xmm3
       755B                 jne      SHORT G_M65434_IG05  ;; non-ASCII

G_M65434_IG04: 
       C5F9101D99000000     vmovupd  xmm3, xmmword ptr [reloc @RWD16]
       C5E1FCE1             vpaddb   xmm4, xmm3, xmm1
       C5E1FCDA             vpaddb   xmm3, xmm3, xmm2
       C5F9102D99000000     vmovupd  xmm5, xmmword ptr [reloc @RWD32]
       C5D964E5             vpcmpgtb xmm4, xmm4, xmm5
       C5E164DD             vpcmpgtb xmm3, xmm3, xmm5
       C5F9102D99000000     vmovupd  xmm5, xmmword ptr [reloc @RWD48]
       C5D9DFE5             vpandn   xmm4, xmm4, xmm5
       C5E1DFDD             vpandn   xmm3, xmm3, xmm5
       C5D9FCC9             vpaddb   xmm1, xmm4, xmm1
       C5E1FCD2             vpaddb   xmm2, xmm3, xmm2
       C5F1EFCA             vpxor    xmm1, xmm1, xmm2
       C4E27917C9           vptest   xmm1, xmm1
       7523                 jne      SHORT G_M65434_IG07  ;; not equal
       83C008               add      eax, 8
       458D48F8             lea      r9d, [r8-08H]
       413BC1               cmp      eax, r9d
       7E93                 jle      SHORT G_M65434_IG03  ;; next iteration

ARM64:

G_M65434_IG03:              ;; offset=0010H
        937F7C64          sbfiz   x4, x3, #1, #32
        3CE46811          ldr     q17, [x0, x4]
        3CE46832          ldr     q18, [x1, x4]
        4EB21E33          orr     v19.8h, v17.8h, v18.8h
        4E301E73          and     v19.8h, v19.8h, v16.8h
        6EB3A673          umaxp   v19.4s, v19.4s, v19.4s
        4E083E64          umov    x4, v19.d[0]
        F100009F          cmp     x4, #0
        540002A1          bne     G_M65434_IG05  ;; non-ascii

G_M65434_IG04:              
        9C0005F3          ldr     q19, [@RWD16]
        4E318674          add     v20.16b, v19.16b, v17.16b
        4E328673          add     v19.16b, v19.16b, v18.16b
        9C000615          ldr     q21, [@RWD32]
        4E353694          cmgt    v20.16b, v20.16b, v21.16b
        4E353673          cmgt    v19.16b, v19.16b, v21.16b
        9C000635          ldr     q21, [@RWD48]
        4E741EB4          bic     v20.16b, v21.16b, v20.16b
        4E731EB3          bic     v19.16b, v21.16b, v19.16b
        4E318691          add     v17.16b, v20.16b, v17.16b
        4E328672          add     v18.16b, v19.16b, v18.16b
        6E321E31          eor     v17.16b, v17.16b, v18.16b
        6EB1A631          umaxp   v17.4s, v17.4s, v17.4s
        4E083E24          umov    x4, v17.d[0]
        F100009F          cmp     x4, #0
        54000201          bne     G_M65434_IG07 ;; not equal
        11002063          add     w3, w3, #8
        51002044          sub     w4, w2, #8
        6B04007F          cmp     w3, w4
        54FFFC8D          ble     G_M65434_IG03 ;; next iteration

I will investigate separately in the JIT why the memory loads of the constant vectors were not hoisted out of the loop.
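For orientation, the loop body in both listings can be read as the following C# sketch of a single 8-char iteration. It is a reconstruction from the disassembly, not a copy of the code in Ordinal.cs. The names `Vector128Sketch` and `EqualsIgnoreCase8` are hypothetical, and the constant values 0x3F, 0x99 and 0x20 are my assumption about what the three `@RWD` loads contain, since the listings do not show them. It requires .NET 7 or later for the cross-platform `Vector128` API.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Vector128Sketch
{
    // Compares the first eight UTF-16 chars of each input, ignoring ASCII case.
    // Returns null when either input has a non-ASCII char; the listings branch
    // to a separate path in that case.
    public static bool? EqualsIgnoreCase8(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        Vector128<ushort> v1 = Vector128.Create<ushort>(MemoryMarshal.Cast<char, ushort>(a.Slice(0, 8)));
        Vector128<ushort> v2 = Vector128.Create<ushort>(MemoryMarshal.Cast<char, ushort>(b.Slice(0, 8)));

        // por/pand/ptest (orr/and/umaxp): is any bit outside 0x007F set?
        if (((v1 | v2) & Vector128.Create((ushort)0xFF80)) != Vector128<ushort>.Zero)
            return null;

        Vector128<sbyte> b1 = v1.AsSByte();
        Vector128<sbyte> b2 = v2.AsSByte();

        // Assumed constants: adding 0x3F maps 'A'..'Z' onto signed bytes -128..-103.
        Vector128<sbyte> offset = Vector128.Create((sbyte)0x3F);
        Vector128<sbyte> limit = Vector128.Create(unchecked((sbyte)0x99));
        Vector128<sbyte> caseBit = Vector128.Create((sbyte)0x20);

        // paddb + pcmpgtb (add + cmgt): all-ones in every byte that is NOT 'A'..'Z'.
        Vector128<sbyte> notUpper1 = Vector128.GreaterThan(b1 + offset, limit);
        Vector128<sbyte> notUpper2 = Vector128.GreaterThan(b2 + offset, limit);

        // pandn + paddb (bic + add): add 0x20 to uppercase letters only.
        Vector128<sbyte> lower1 = Vector128.AndNot(caseBit, notUpper1) + b1;
        Vector128<sbyte> lower2 = Vector128.AndNot(caseBit, notUpper2) + b2;

        // pxor + ptest (eor + umaxp): equal iff no bit differs after lowercasing.
        return (lower1 ^ lower2) == Vector128<sbyte>.Zero;
    }
}
```

In this sketch the three `Vector128.Create` constants are written inside the method; the hoisting issue mentioned above concerns the JIT reloading such constants on every loop iteration instead of once before the loop.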

EgorBo · 2022-11-07T20:30:46Z

@stephentoub @tarekgh @GrabYourPitchforks does it look good otherwise? CI is finally green (one more job to finish)

EgorBo · 2022-11-07T21:52:59Z

Decided to delete the runtime MinOpts test: it's too slow under stress modes, and OrdinalIgnoreCase is already covered by the libs tests.

MattGal · 2022-11-08T21:15:32Z

Windows Docker scenarios are broken due to dotnet/arcade#11554; I am investigating.

EgorBo · 2022-11-15T17:22:17Z

Improvements on Linux-x64 dotnet/perf-autofiling-issues#9737
Improvements on Linux-arm64 dotnet/perf-autofiling-issues#9853
dotnet/perf-autofiling-issues#9846
dotnet/perf-autofiling-issues#9860

Vectorize String.Equals for OrdinalIgnoreCase

5f33929

dotnet-issue-labeler bot added the area-System.Globalization label Nov 5, 2022

ghost assigned EgorBo Nov 5, 2022

EgorBo added 2 commits November 5, 2022 16:55

Clean up

2c60579

Clean up

a922987

EgorBo added 2 commits November 5, 2022 17:36

Remove goto, it has no impact on codegen

4be63a0

Enable 8 chars case for SIMD

6ca66fa

gfoidl reviewed Nov 5, 2022

Add comments

042e18a

stephentoub reviewed Nov 5, 2022

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs (outdated)

stephentoub reviewed Nov 5, 2022

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs

EgorBo and others added 3 commits November 5, 2022 22:05

Clean up

ac5edf6

Update src/libraries/System.Private.CoreLib/src/System/Globalization/…

8d4c238

…Ordinal.cs Co-authored-by: Stephen Toub <stoub@microsoft.com>

Merge branch 'vectorize-equals-ordinalignorecase' of github.com:EgorB…

4c89741

…o/runtime-1 into vectorize-equals-ordinalignorecase

EgorBo commented Nov 5, 2022

src/tests/JIT/opt/Vectorization/UnrollEqualsStartsWIth_minopts.csproj (outdated)

EgorBo added 3 commits November 5, 2022 23:11

Update Utf16Utility.cs

fd46ce4

Update Utf16Utility.cs

6a9d467

Update Utf16Utility.cs

e6bc04f

build-analysis bot mentioned this pull request Nov 6, 2022

Tracking issue for CI build timeouts #76454

Closed

watfordsuzy reviewed Nov 7, 2022

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs

GrabYourPitchforks reviewed Nov 7, 2022

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs

stephentoub approved these changes Nov 7, 2022

Delete MinOpts test - it's too slow

435682f

build-analysis bot mentioned this pull request Nov 8, 2022

Checkout failure: "Git fetch failed with exit code 128" dotnet/arcade#9009

Open

Merge branch 'main' of github.com:dotnet/runtime into vectorize-equal…

33a939a

…s-ordinalignorecase

Merge branch 'main' of github.com:dotnet/runtime into vectorize-equal…

93aee4a

…s-ordinalignorecase

EgorBo merged commit 1980c7b into dotnet:main Nov 11, 2022

EgorBo deleted the vectorize-equals-ordinalignorecase branch November 11, 2022 11:02

dakersnar mentioned this pull request Nov 15, 2022

Regression in System.Tests.Perf_Boolean.TryParse #78408

Closed

BruceForstall mentioned this pull request Nov 17, 2022

[pgo] Assertion failed 'tree == stmt->GetRootNode()' during 'Optimize layout' #78322

Closed

dakersnar mentioned this pull request Nov 17, 2022

Regressions in System.Globalization.Tests.StringSearch #78512

Closed

ghost locked as resolved and limited conversation to collaborators Dec 17, 2022

jeffhandley added the blog-candidate Completed PRs that are candidate topics for blog post coverage label Mar 14, 2023
