Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Recognize 'bt' bit test idiom #72986

Closed
TheBlackPlague opened this issue Jul 28, 2022 · 9 comments · Fixed by #73120
Closed

JIT: Recognize 'bt' bit test idiom #72986

TheBlackPlague opened this issue Jul 28, 2022 · 9 comments · Fixed by #73120
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors tenet-performance Performance related issue
Milestone

Comments

@TheBlackPlague
Copy link

Description

I've been recently developing a chess engine in C# (.NET Core 6), StockNemo, where when I analyzed the code, RyuJIT was generating assembly far more complex than one would assume it should be. So, I decided to compare it with C++'s GCC compiler (with the -O3 flag to ensure proper optimization, I imagine the equivalent to dotnet's Release configuration) and turns out I was right.

Consider the following code:

readonly ulong Internal = 0x003;

bool GetSetBit(int i) => (Internal >> i & 1UL) == 1UL;

RyuJIT generates the following assembly for the method GetSetBit in release configuration:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       test     al, 1
       setne    al
       movzx    rax, al
       ret      

The similar code in C++ looks like this:

unsigned long long internal = 0x003;

bool get_set_bit(int i)
{
    return (internal >> i & 1ULL) == 1ULL;
}

GCC 12.1 x86-64 generates the following assembly for the method get_set_bit with the -O3 argument:

        mov     rax, QWORD PTR internal[rip]
        bt      rax, rdi
        setc    al
        ret

As one can see, the GCC-generated assembly is better. There is a way to get the same or nearly as simple and fast assembly as C++,
and that's by arranging the method like so, with its C++ counterpart below:

bool GetSetBit(int i) 
{
    byte value = (byte)(Internal >> i & 1UL);
    return Unsafe.As<byte, bool>(ref value);
}
typedef int boolean;
#define true 1
#define false 0

boolean get_set_bit(int i)
{
    return internal >> i & 1ULL;
}

The generated assembly for this by RyuJIT is:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       and      eax, 1
       ret      

...and by GCC:

        mov     rax, QWORD PTR internal[rip]
        mov     ecx, edi
        shr     rax, cl
        and     eax, 1
        ret

This is just one of many functions that have much more complicated assemblies when generated by RyuJIT (compared to GCC). When micro-optimization is necessary (in chess engines, it is), the generated assemblies are to be as performant. This is not the case by default here; one had to repurpose the code to get the exact same thing. Many times, due to missing language features, this just isn't possible.

I'm not trying to shame or undermine the work done for RyuJIT but requesting better code understanding and generation. I love the C# language (which is why I chose to do the project in C# while knowing C++), and I wish that the code be as fast (or, if possible, faster) as C++.

@TheBlackPlague TheBlackPlague added the tenet-performance Performance related issue label Jul 28, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 28, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 28, 2022
@ghost
Copy link

ghost commented Jul 28, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I've been recently developing a chess engine in C# (.NET Core 6), StockNemo, where when I analyzed the code, RyuJIT was generating assembly far more complex than one would assume it should be. So, I decided to compare it with C++'s GCC compiler (with the -O3 flag to ensure proper optimization, I imagine the equivalent to dotnet's Release configuration) and turns out I was right.

Consider the following code:

readonly ulong Internal = 0x003;

bool GetSetBit(int i) => (Internal >> i & 1UL) == 1UL;

RyuJIT generates the following assembly for the method GetSetBit in release configuration:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       test     al, 1
       setne    al
       movzx    rax, al
       ret      

The similar code in C++ looks like this:

unsigned long long internal = 0x003;

bool get_set_bit(int i)
{
    return (internal >> i & 1ULL) == 1ULL;
}

GCC 12.1 x86-64 generates the following assembly for the method get_set_bit with the -O3 argument:

        mov     rax, QWORD PTR internal[rip]
        bt      rax, rdi
        setc    al
        ret

As one can see, the GCC-generated assembly is better. There is a way to get the same or nearly as simple and fast assembly as C++,
and that's by arranging the method like so, with its C++ counterpart below:

bool GetSetBit(int i) 
{
    byte value = (byte)(Internal >> i & 1UL);
    return Unsafe.As<byte, bool>(ref value);
}
typedef int boolean;
#define true 1
#define false 0

boolean get_set_bit(int i)
{
    return internal >> i & 1ULL;
}

The generated assembly for this by RyuJIT is:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       and      eax, 1
       ret      

...and by GCC:

        mov     rax, QWORD PTR internal[rip]
        mov     ecx, edi
        shr     rax, cl
        and     eax, 1
        ret

This is just one of many functions that have much more complicated assemblies when generated by RyuJIT (compared to GCC). When micro-optimization is necessary (in chess engines, it is), the generated assemblies are to be as performant. This is not the case by default here; one had to repurpose the code to get the exact same thing. Many times, due to missing language features, this just isn't possible.

I'm not trying to shame or undermine the work done for RyuJIT but requesting better code understanding and generation. I love the C# language (which is why I chose to do the project in C# while knowing C++), and I wish that the code be as fast (or, if possible, faster) as C++.

Author: TheBlackPlague
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr

Milestone: -

@danmoseley
Copy link
Member

Probably the best thing here is to open separate issues for each category of suboptimal code gen you encounter.

@huoyaoyuan
Copy link
Member

The GCC output uses BT instruction of x86, which should be covered by #27382.

@EgorBo EgorBo changed the title Compiler doesn't fully understand or optimize code as much as C++ compilers do. JIT: Recognize 'bt' bit test idiom Jul 28, 2022
@EgorBo EgorBo added good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors labels Jul 28, 2022
@EgorBo EgorBo added this to the Future milestone Jul 28, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 28, 2022
@TheBlackPlague
Copy link
Author

Probably the best thing here is to open separate issues for each category of suboptimal code gen you encounter.

Thanks for the suggestion. I agree this may be the best way forward, and I shall do that.

@dubiousconst282
Copy link
Contributor

Note that the following pattern is properly recognized:

static bool M(int x, int y) => (x & (1 << y)) != 0;
C.M(Int32, Int32)
    L0000: bt ecx, edx
    L0003: setb al
    L0006: movzx eax, al
    L0009: ret

@TheBlackPlague
Copy link
Author

TheBlackPlague commented Jul 29, 2022

Note that the following pattern is properly recognized:

static bool M(int x, int y) => (x & (1 << y)) != 0;
C.M(Int32, Int32)
    L0000: bt ecx, edx
    L0003: setb al
    L0006: movzx eax, al
    L0009: ret

It seems this is only possible with integers. When translating the code to the same specifications as the issue documentation, it fails:

readonly ulong Internal = 0x003;

bool M(int x) => (Internal & (ulong)(1 << x)) != 0UL;
       mov      eax, 1
       mov      ecx, esi
       shl      eax, cl
       movsxd   rax, eax
       test     qword ptr [rdi+8], rax
       setne    al
       movzx    rax, al
       ret      

@dubiousconst282
Copy link
Contributor

dubiousconst282 commented Jul 29, 2022

Looks like that's the x86 disassembly, it should work if you change to x64: Sharplab

C.M3(UInt64, Int32)
    L0000: bt rcx, rdx
    L0004: setb al
    L0007: movzx eax, al
    L000a: ret

Edit: it won't work if you cast the shift apparently ((ulong)(1 << x)), but that would be incorrect anyway if the shift is >= 32. Try 1ul << x.

@TheBlackPlague
Copy link
Author

TheBlackPlague commented Jul 29, 2022

Looks like that's the x86 disassembly, it should work if you change to x64: Sharplab

C.M3(UInt64, Int32)
    L0000: bt rcx, rdx
    L0004: setb al
    L0007: movzx eax, al
    L000a: ret

Edit: it won't work if you cast the shift apparently ((ulong)(1 << x)), but that would be incorrect anyway if the shift is >= 32. Try 1ul << x.

Indeed that works. However, I still question the necessity of the movzx instruction. GCC removes that, so I believe it shouldn't be required.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 31, 2022
@AndyAyersMS
Copy link
Member

Indeed that works. However, I still question the necessity of the movzx instruction. GCC removes that, so I believe it shouldn't be required.

.NET semantics are different. The return value is always "widened" to a stack type. So, the jit will always ensure that upper bytes of small return values are properly cleared/set.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants