Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize String.Equals for OrdinalIgnoreCase #77947

Merged
merged 15 commits into from
Nov 11, 2022

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Nov 5, 2022

I tried to keep it simple so no AVX path, only SSE and NEON + SWAR path for trailing elements.

Benchmark:

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public bool EqualsIgnoreCase(string s1, string s2) => 
    s1.Equals(s2, StringComparison.OrdinalIgnoreCase);



public static IEnumerable<object[]> TestData()
{
    yield return new object[]
    {
        @"Hi!", // 3 chars (to make sure overhead is not big)
        @"HI!",
    };

    yield return new object[]
    {
        @"hello!!!", // 8 chars (switches to SIMD)
        @"HELLO!!!",
    };

    yield return new object[]
    {
        @"hello world", // 11 chars (1xV128 + trailing elements)
        @"HELLO WORLD",
    };

    yield return new object[]
    {
        @"C:\prj\runtime-main\src\coreclr\CMakeLists.txt", // 46 chars (5xV128 + trailing elements)
        @"C:\prj\runtime-main\src\CORECLR\CMakeLists.txt",
    };

    yield return new object[]
    {
        @"Good bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:", // 226 chars
        @"GOOD bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:",
    };
}
Method Toolchain s1 s2 Mean
EqualsIgnoreCase \Core_Root\corerun.exe Hi! HI! 2.940 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe Hi! HI! 2.901 ns
EqualsIgnoreCase \Core_Root\corerun.exe hello!!! [8] HELLO!!! [8] 3.418 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe hello!!! [8] HELLO!!! [8] 4.829 ns
EqualsIgnoreCase \Core_Root\corerun.exe hello world [11] HELLO WORLD [11] 5.880 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe hello world [11] HELLO WORLD [11] 6.143 ns
EqualsIgnoreCase \Core_Root\corerun.exe C:\pr(...)s.txt [46] C:\pr(...)s.txt [46] 11.422 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe C:\pr(...)s.txt [46] C:\pr(...)s.txt [46] 19.049 ns
EqualsIgnoreCase \Core_Root\corerun.exe Good(...)ion: [226] GOOD(...)ion: [226] 40.289 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe Good(...)ion: [226] GOOD(...)ion: [226] 86.035 ns

I expect better difference on ARM64 where the scalar (SWAR) path suffers from 64bit constants everywhere.

@ghost
Copy link

ghost commented Nov 5, 2022

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

I tried to keep it simple so no AVX fallback + scalar path for trailing elements.

Benchmark:

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public bool EqualsIgnoreCase(string s1, string s2) => 
    s1.Equals(s2, StringComparison.OrdinalIgnoreCase);



public static IEnumerable<object[]> TestData()
{
    yield return new object[]
    {
        @"Hi!", // 3 chars (to make sure overhead is not big)
        @"HI!",
    };

    yield return new object[]
    {
        @"hello!!!", // 8 chars (switches to SIMD)
        @"HELLO!!!",
    };

    yield return new object[]
    {
        @"hello world", // 11 chars (1 simd + trailing elements)
        @"HELLO WORLD",
    };

    yield return new object[]
    {
        @"C:\prj\runtime-main\src\coreclr\CMakeLists.txt",
        @"C:\prj\runtime-main\src\CORECLR\CMakeLists.txt",
    };

    yield return new object[]
    {
        @"Good bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:", // 226 chars
        @"GOOD bug reports make it easier for maintainers to verify and root cause the underlying problem. The better a bug report, the faster the problem will be resolved. Ideally, a bug report should contain the following information:",
    };
}
Method Toolchain s1 s2 Mean
EqualsIgnoreCase \Core_Root\corerun.exe Hi! HI! 2.940 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe Hi! HI! 2.901 ns
EqualsIgnoreCase \Core_Root\corerun.exe hello!!! HELLO!!! 4.898 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe hello!!! HELLO!!! 4.829 ns
EqualsIgnoreCase \Core_Root\corerun.exe hello world HELLO WORLD 5.880 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe hello world HELLO WORLD 6.143 ns
EqualsIgnoreCase \Core_Root\corerun.exe C:\pr(...)s.txt [46] C:\pr(...)s.txt [46] 11.422 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe C:\pr(...)s.txt [46] C:\pr(...)s.txt [46] 19.049 ns
EqualsIgnoreCase \Core_Root\corerun.exe Good(...)ion: [226] GOOD(...)ion: [226] 40.289 ns
EqualsIgnoreCase \Core_Root_base\corerun.exe Good(...)ion: [226] GOOD(...)ion: [226] 86.035 ns

I expect better difference on ARM64 where the scalar (SWAR) path suffers from 64bit constants everywhere.

Author: EgorBo
Assignees: -
Labels:

area-System.Globalization

Milestone: -

@EgorBo
Copy link
Member Author

EgorBo commented Nov 5, 2022

X86 codegen for the main loop:

G_M65434_IG03:              ;; offset=0011H
       4C63C8               movsxd   r9, eax
       C4A17A6F0C49         vmovdqu  xmm1, xmmword ptr [rcx+2*r9]
       C4A17A6F144A         vmovdqu  xmm2, xmmword ptr [rdx+2*r9]
       C5F1EBDA             vpor     xmm3, xmm1, xmm2
       C5E1DBD8             vpand    xmm3, xmm3, xmm0
       C4E27917DB           vptest   xmm3, xmm3
       755B                 jne      SHORT G_M65434_IG05  ;; non-ASCII

G_M65434_IG04: 
       C5F9101D99000000     vmovupd  xmm3, xmmword ptr [reloc @RWD16]
       C5E1FCE1             vpaddb   xmm4, xmm3, xmm1
       C5E1FCDA             vpaddb   xmm3, xmm3, xmm2
       C5F9102D99000000     vmovupd  xmm5, xmmword ptr [reloc @RWD32]
       C5D964E5             vpcmpgtb xmm4, xmm4, xmm5
       C5E164DD             vpcmpgtb xmm3, xmm3, xmm5
       C5F9102D99000000     vmovupd  xmm5, xmmword ptr [reloc @RWD48]
       C5D9DFE5             vpandn   xmm4, xmm4, xmm5
       C5E1DFDD             vpandn   xmm3, xmm3, xmm5
       C5D9FCC9             vpaddb   xmm1, xmm4, xmm1
       C5E1FCD2             vpaddb   xmm2, xmm3, xmm2
       C5F1EFCA             vpxor    xmm1, xmm1, xmm2
       C4E27917C9           vptest   xmm1, xmm1
       7523                 jne      SHORT G_M65434_IG07  ;; not equal
       83C008               add      eax, 8
       458D48F8             lea      r9d, [r8-08H]
       413BC1               cmp      eax, r9d
       7E93                 jle      SHORT G_M65434_IG03  ;; next iteration

ARM64:

G_M65434_IG03:              ;; offset=0010H
        937F7C64          sbfiz   x4, x3, #1, #32
        3CE46811          ldr     q17, [x0, x4]
        3CE46832          ldr     q18, [x1, x4]
        4EB21E33          orr     v19.8h, v17.8h, v18.8h
        4E301E73          and     v19.8h, v19.8h, v16.8h
        6EB3A673          umaxp   v19.4s, v19.4s, v19.4s
        4E083E64          umov    x4, v19.d[0]
        F100009F          cmp     x4, #0
        540002A1          bne     G_M65434_IG05  ;; non-ascii

G_M65434_IG04:              
        9C0005F3          ldr     q19, [@RWD16]
        4E318674          add     v20.16b, v19.16b, v17.16b
        4E328673          add     v19.16b, v19.16b, v18.16b
        9C000615          ldr     q21, [@RWD32]
        4E353694          cmgt    v20.16b, v20.16b, v21.16b
        4E353673          cmgt    v19.16b, v19.16b, v21.16b
        9C000635          ldr     q21, [@RWD48]
        4E741EB4          bic     v20.16b, v21.16b, v20.16b
        4E731EB3          bic     v19.16b, v21.16b, v19.16b
        4E318691          add     v17.16b, v20.16b, v17.16b
        4E328672          add     v18.16b, v19.16b, v18.16b
        6E321E31          eor     v17.16b, v17.16b, v18.16b
        6EB1A631          umaxp   v17.4s, v17.4s, v17.4s
        4E083E24          umov    x4, v17.d[0]
        F100009F          cmp     x4, #0
        54000201          bne     G_M65434_IG07 ;; not equal
        11002063          add     w3, w3, #8
        51002044          sub     w4, w2, #8
        6B04007F          cmp     w3, w4
        54FFFC8D          ble     G_M65434_IG03 ;; next iteration

Will work separately in the jit to figure out why the memory loads were not hoisted (constant vectors)

EgorBo and others added 3 commits November 5, 2022 22:05
…Ordinal.cs

Co-authored-by: Stephen Toub <stoub@microsoft.com>
…o/runtime-1 into vectorize-equals-ordinalignorecase
@EgorBo
Copy link
Member Author

EgorBo commented Nov 7, 2022

@stephentoub @tarekgh @GrabYourPitchforks does it look good otherwise? CI is finally green (one more job to finish)

@EgorBo
Copy link
Member Author

EgorBo commented Nov 7, 2022

Decided to delete the runtime MinOpts test - it's too slow under stress modes and OrdinalIngoreCase is already covered with libs tests

@MattGal
Copy link
Member

MattGal commented Nov 8, 2022

Windows docker scenarios are broken due to dotnet/arcade#11554 ; I am investigating.

@EgorBo EgorBo merged commit 1980c7b into dotnet:main Nov 11, 2022
@EgorBo EgorBo deleted the vectorize-equals-ordinalignorecase branch November 11, 2022 11:02
@EgorBo
Copy link
Member Author

EgorBo commented Nov 15, 2022

@ghost ghost locked as resolved and limited conversation to collaborators Dec 17, 2022
@jeffhandley jeffhandley added the blog-candidate Completed PRs that are candidate topics for blog post coverage label Mar 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Globalization blog-candidate Completed PRs that are candidate topics for blog post coverage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants