JIT: Recognize 'bt' bit test idiom #72986

TheBlackPlague · 2022-07-28T05:49:57Z

Description

I've recently been developing a chess engine in C# (.NET 6), StockNemo. When I inspected the generated code, RyuJIT was emitting assembly far more complex than I expected. So I compared it with GCC compiling the equivalent C++ (with the -O3 flag to ensure proper optimization, which I imagine is the equivalent of dotnet's Release configuration), and it turns out I was right.

Consider the following code:

readonly ulong Internal = 0x003;

bool GetSetBit(int i) => (Internal >> i & 1UL) == 1UL;

RyuJIT generates the following assembly for the method GetSetBit in release configuration:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       test     al, 1
       setne    al
       movzx    rax, al
       ret

The similar code in C++ looks like this:

unsigned long long internal = 0x003;

bool get_set_bit(int i)
{
    return (internal >> i & 1ULL) == 1ULL;
}

GCC 12.1 x86-64 generates the following assembly for the method get_set_bit with the -O3 argument:

        mov     rax, QWORD PTR internal[rip]
        bt      rax, rdi
        setc    al
        ret
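
For readers unfamiliar with the instruction pair in the GCC output: with a register operand, bt copies the bit selected by the index (taken modulo the operand width) into the carry flag, and setc writes that flag into a byte register. The following is only a minimal C++ sketch of that behavior. The names bt_model, shift_and_mask, and check_bt_model are hypothetical and used purely for illustration, and the value is passed as a parameter here (an assumption) instead of being read from a global.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of `bt rax, rdi` followed by `setc al`.
// With a 64-bit register operand, x86 takes the bit index modulo 64.
constexpr bool bt_model(std::uint64_t value, std::uint64_t index)
{
    return ((value >> (index & 63)) & 1ULL) != 0;
}

// The shift-and-mask spelling from get_set_bit, with the value passed in
// explicitly (an assumption made so the check needs no global state).
constexpr bool shift_and_mask(std::uint64_t value, int i)
{
    return (value >> i & 1ULL) == 1ULL;
}

// Checks that both functions agree for every in-range index (0..63).
inline void check_bt_model()
{
    const std::uint64_t internal_value = 0x003;
    for (int i = 0; i < 64; ++i)
        assert(bt_model(internal_value, static_cast<std::uint64_t>(i)) ==
               shift_and_mask(internal_value, i));
}
```

Because the two functions agree on every valid index, a compiler that recognizes the source pattern can plausibly emit the bit test directly instead of a separate shift and mask.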

As one can see, the GCC-generated assembly is shorter: a single bt plus setc replaces the shift, test, setne, and movzx sequence. There is a way to get C# assembly that is the same as, or nearly as simple and fast as, the C++ one, and that's by arranging the method like so, with its C++ counterpart below:

bool GetSetBit(int i) 
{
    byte value = (byte)(Internal >> i & 1UL);
    return Unsafe.As<byte, bool>(ref value);
}

typedef int boolean;
#define true 1
#define false 0

boolean get_set_bit(int i)
{
    return internal >> i & 1ULL;
}

The generated assembly for this by RyuJIT is:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       and      eax, 1
       ret

...and by GCC:

        mov     rax, QWORD PTR internal[rip]
        mov     ecx, edi
        shr     rax, cl
        and     eax, 1
        ret

This is just one of many functions for which RyuJIT generates much more complicated assembly than GCC. When micro-optimization is necessary (in chess engines, it is), the generated assembly needs to be just as performant. That is not the case by default here; I had to restructure the code to get the same result. Many times, due to missing language features, this just isn't possible.

I'm not trying to shame or undermine the work done on RyuJIT; I'm requesting better code understanding and generation. I love the C# language (which is why I chose to do the project in C# despite knowing C++), and I wish for the code to be as fast as (or, if possible, faster than) C++.

ghost · 2022-07-28T05:50:09Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	TheBlackPlague
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`
Milestone:	-

danmoseley · 2022-07-28T07:16:28Z

Probably the best thing here is to open separate issues for each category of suboptimal code gen you encounter.

huoyaoyuan · 2022-07-28T07:21:05Z

The GCC output uses the x86 BT instruction, which should be covered by #27382.

TheBlackPlague · 2022-07-28T09:15:58Z

> Probably the best thing here is to open separate issues for each category of suboptimal code gen you encounter.

Thanks for the suggestion. I agree this may be the best way forward, and I shall do that.

dubiousconst282 · 2022-07-29T03:44:33Z

Note that the following pattern is properly recognized:

static bool M(int x, int y) => (x & (1 << y)) != 0;

C.M(Int32, Int32)
    L0000: bt ecx, edx
    L0003: setb al
    L0006: movzx eax, al
    L0009: ret
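
The pattern above tests the bit by building a mask, whereas the issue's original code shifts the value down first. As a rough sketch of why a compiler can treat the two alike, the following C++ checks that both spellings agree for every index of a 64-bit value. The function names are hypothetical, and unsigned 64-bit operands are used here (an assumption) to sidestep signed-shift corner cases.

```cpp
#include <cassert>
#include <cstdint>

// Mask-based spelling, analogous to (x & (1 << y)) != 0.
constexpr bool test_with_mask(std::uint64_t x, unsigned y)
{
    return (x & (1ULL << y)) != 0;
}

// Shift-based spelling, analogous to (x >> y & 1) == 1.
constexpr bool test_with_shift(std::uint64_t x, unsigned y)
{
    return ((x >> y) & 1ULL) == 1ULL;
}

// Compile-time spot checks on the value used in the issue (0x003).
static_assert(test_with_mask(0x003, 1) && test_with_shift(0x003, 1), "bit 1 is set");
static_assert(!test_with_mask(0x003, 2) && !test_with_shift(0x003, 2), "bit 2 is clear");

// Runtime check over every valid index (0..63) for a few sample values.
inline void check_spellings_agree()
{
    const std::uint64_t samples[] = {0x003ULL, 0x8000000000000001ULL, 0xDEADBEEFCAFEF00DULL};
    for (std::uint64_t x : samples)
        for (unsigned y = 0; y < 64; ++y)
            assert(test_with_mask(x, y) == test_with_shift(x, y));
}
```

Since the results are identical for in-range indices, which spelling produces the bit test is a matter of what the compiler's pattern matcher recognizes, not of semantics.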

TheBlackPlague · 2022-07-29T05:36:32Z

> Note that the following pattern is properly recognized:
>
> static bool M(int x, int y) => (x & (1 << y)) != 0;
>
> C.M(Int32, Int32)
>     L0000: bt ecx, edx
>     L0003: setb al
>     L0006: movzx eax, al
>     L0009: ret

It seems this only works with int operands. When I adapt the code to match the issue description (a ulong field tested by an int index), the pattern is not recognized:

readonly ulong Internal = 0x003;

bool M(int x) => (Internal & (ulong)(1 << x)) != 0UL;

       mov      eax, 1
       mov      ecx, esi
       shl      eax, cl
       movsxd   rax, eax
       test     qword ptr [rdi+8], rax
       setne    al
       movzx    rax, al
       ret

dubiousconst282 · 2022-07-29T06:26:39Z

Looks like that's the x86 disassembly, it should work if you change to x64: Sharplab

C.M3(UInt64, Int32)
    L0000: bt rcx, rdx
    L0004: setb al
    L0007: movzx eax, al
    L000a: ret

Edit: it won't work if you cast the shift apparently ((ulong)(1 << x)), but that would be incorrect anyway if the shift is >= 32. Try 1ul << x.
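
To illustrate the point about the cast, here is a minimal C++ sketch that emulates what the two C# expressions compute. It assumes C#'s rules that shifting an int uses only the low 5 bits of the count, that shifting a ulong uses the low 6 bits, and that an unchecked int-to-ulong conversion sign-extends (consistent with the movsxd in the disassembly quoted earlier). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Emulates C# `(ulong)(1 << x)`: the shift happens in 32 bits with the
// count masked to 5 bits, then the int result is sign-extended to 64 bits.
// (The uint32 -> int32 conversion is assumed to wrap, as it does on
// mainstream compilers and is guaranteed from C++20 on.)
inline std::uint64_t narrow_mask(int x)
{
    const std::uint32_t shifted = std::uint32_t{1} << (x & 31);
    const std::int32_t as_int = static_cast<std::int32_t>(shifted);
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_int));
}

// Emulates C# `1ul << x`: the shift happens in 64 bits, count masked to 6 bits.
inline std::uint64_t wide_mask(int x)
{
    return std::uint64_t{1} << (x & 63);
}

inline void check_masks()
{
    // Below bit 31 the two masks are identical.
    for (int x = 0; x < 31; ++x)
        assert(narrow_mask(x) == wide_mask(x));

    // At x == 31 sign extension sets bits 31 through 63.
    assert(narrow_mask(31) == 0xFFFFFFFF80000000ULL);

    // At x == 32 the 5-bit count wraps to 0, so bit 0 is selected instead of bit 32.
    assert(narrow_mask(32) == 1ULL);
    assert(wide_mask(32) == 0x100000000ULL);
}
```

Under these assumptions the cast version selects the wrong bits from index 31 upward, so a 64-bit bitboard needs the mask built in 64 bits from the start.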

TheBlackPlague · 2022-07-29T11:58:45Z

> Looks like that's the x86 disassembly, it should work if you change to x64: Sharplab
>
> C.M3(UInt64, Int32)
>     L0000: bt rcx, rdx
>     L0004: setb al
>     L0007: movzx eax, al
>     L000a: ret
>
> Edit: it won't work if you cast the shift apparently ((ulong)(1 << x)), but that would be incorrect anyway if the shift is >= 32. Try 1ul << x.

Indeed that works. However, I still question the necessity of the movzx instruction. GCC removes that, so I believe it shouldn't be required.

AndyAyersMS · 2022-08-02T17:10:55Z

> Indeed that works. However, I still question the necessity of the movzx instruction. GCC removes that, so I believe it shouldn't be required.

.NET semantics are different. The return value is always "widened" to a stack type. So, the jit will always ensure that upper bytes of small return values are properly cleared/set.

TheBlackPlague added the tenet-performance Performance related issue label Jul 28, 2022

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 28, 2022

ghost added the untriaged New issue has not been triaged by the area owner label Jul 28, 2022

TheBlackPlague mentioned this issue Jul 28, 2022

Speed up BitBoard Indexing. TheBlackPlague/StockNemo#46

Merged

EgorBo changed the title ~~Compiler doesn't fully understand or optimize code as much as C++ compilers do.~~ JIT: Recognize 'bt' bit test idiom Jul 28, 2022

EgorBo added good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors labels Jul 28, 2022

EgorBo added this to the Future milestone Jul 28, 2022

ghost removed the untriaged New issue has not been triaged by the area owner label Jul 28, 2022

dubiousconst282 mentioned this issue Jul 31, 2022

Improve morphing of bit comparisons to constant 0/1 #73120

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 31, 2022

AndyAyersMS assigned dubiousconst282 Aug 2, 2022

dubiousconst282 mentioned this issue Aug 30, 2022

JIT: Optimize "X & 1 == 0" to "X & 1" #61412

Closed

AndyAyersMS closed this as completed in #73120 Sep 6, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 6, 2022

ghost locked as resolved and limited conversation to collaborators Oct 6, 2022
