Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nibblemapmacros do a linear scan #93550

Open
fabled opened this issue Oct 16, 2023 · 9 comments
Open

nibblemapmacros do a linear scan #93550

fabled opened this issue Oct 16, 2023 · 9 comments
Assignees
Milestone

Comments

@fabled
Copy link

fabled commented Oct 16, 2023

I am currently researching the dotnet stack unwinding, and came across the nibblemap at https://github.com/dotnet/runtime/blob/main/src/coreclr/inc/nibblemapmacros.h.

It seems that the code is optimized to assume that JIT code generated is very small code blobs which are potentially unaligned. And on the contrast it performs poorly when mapping RIP from the end of a large code blobs. This is because a linear scan of the nibblemap is required to find the function beginning. Such linear scanning is done at e.g. https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L4051-L4056 (but also at other places).

I am also currently looking into reimplementing all of this in ebpf code (in Linux). This puts some constraints on how many iterations of the nibblemap scanning can be done. While some of these can be worked around, I think if some other data structure could serve here better.

I believe optimizing this map would also speed up unwinding in dotnet core runtime itself too. Do you have plans to improve this map, or would you welcome contributed work in this?

The existing implementation could be just improved by adding a special encoding to indicate a jump.

I would like to discuss potential algorithmic / data structures improvements, and potentially contribute work towards making it happen if a mutually acceptable approach can be reached.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 16, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 16, 2023
@vcsjones vcsjones added area-VM-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 16, 2023
@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 16, 2023
@mangod9 mangod9 added this to the Future milestone Oct 16, 2023
@noahfalk
Copy link
Member

I edited the post above to remove a reference to JVM source code. On the .NET project we have to be careful that we don't use GPL source (or have any appearance or question that it might have been used). Sorry, I know it can be awkward limitation, but we need to ensure potential contributors don't read related GPL source so there is no question about the origin of contributions.

@AaronRobinsonMSFT
Copy link
Member

@fabled I am doing some investigation here, nothing too deep though. I know that @davidwrighton was also noodling on this data structure.

@davidwrighton
Copy link
Member

@AaronRobinsonMSFT Yep, I've noodled on this a bit, and come up with a model that implements the find algorithm as a O(1) operation... Its also somewhat complicated by a desire to support the lock free nature of reads of this structure, and iteration, and I'm not convinced it entirely meets all of those needs, but it might be a good choice if we choose to do any work in this space this release.

NOTE: while I've looked at this as a theoretical improvement, I have not looked into whether or not this would provide a performance improvement to details such as stack walking. In my previous explorations of the performance of that system, I identified that the linked list we had for finding the CodeHeap itself was a significant performance problem, but the nibble table walks never showed up in any of my performance tests. (See #79021 where I introduced the RangeSectionMap. The RangeSectionMap also includes a linked list in its algorithm at one point, but the length of that linked list is in practice bounded to a fairly small size which I believe is something like 6.)

I would expect that this algorithm would be unlikely to significantly affect the performance of much code. However, I do understand that EBPF filters have strong guarantees of forward progress implemented by the lack of real support for iteration, which the previous theoretically unbounded iteration algorithm breaks. I think that support for being possible to write an EBPF based stack walker could be quite useful for diagnostics on Linux platforms, and efforts to make that possible seem like a good idea to me. Note the approach below here effectively implements jumps as you suggested, and I believe could be implemented as a single linear set of if statements and operations.

Proposed new nibble map layout lookup algorithm and structure

The intention here is to build a data structure which permits O(1) access once we reach a lookup at the nibble map layer. The dependencies it takes, are that:

  1. No 2 methods may be have start addresses within a single 32byte chunk.
  2. Each start address must be at least 4 byte aligned
  3. No code heap may be larger than 4GB
  4. Each chunk of memory allocated in the codeheap which can be found via the nibble map shall be proceeded by a CodeHeader. These chunks will colloquially be known as jitted methods in this document, although the more precise term is executable memory chunk allocated on the CodeHeap. (The distinction exists as there is a concept known as a CodeFragmentHeap which allocates small chunks of memory within a single allocated "method". That allocation scheme is used by various forms of stubs.)

The data structure must support the following algorithms

  1. Find start of code given pointer
  2. Iterate starts of jitted methods within a heap
  3. Add a new region of jitted memory within heap

There shall be an array of 32bit unsigned integers (pHdrMap), where logically each nibble in the array corresponds to 32 bytes of memory within the jit code region.
There shall be a pointer to the start of the code heap associated with this nibble map (pCodeStart)
There shall be a count of bytes that the code heap represents. (cbCodeHeapSize)

Each 32bit unsigned value may be an array of 8 4-bit nibbles, or an encoded 32bit offset from the start of the CodeHeap memory section to the start of the associated method.

  • Each nibble is indexed from the least significant to the most significant bits in the 32bit value. Each nibble is physically stored as either a 0, to indicate that it is uninitialized, and otherwise represents the numbers 0-14 (This is done by reading the raw value of the nibble and subtracting 1 if it is valid data.
  • Nibble 0 is special in that if it is encoded as a number greater than 8, it indicates that the 32bit value is actually an encoded 32bit offset. The 32bit offset is computed as the 28 most significant bits of the chunk + 2 bits computed from nibble 0 by subtracting 8 from the nibble value (to be used as the next 2 most significant bits), and the lowest bits are 0, as all code start offsets are 4 byte aligned.
  • A nibble may declare that the start region for a method is within itself by having a value between 0 and 7. These represent the offsets within a given 32byte region where the code may start
  • A nibble may also declare that the start region for a method is in either another nibble, or described by the chunk directly preceding the current chunk. This is done by having a value between 8 and 14 which indicates to look in the nibble between 1 and 5 before this nibble. If this reaches into an earlier chunk, restart this scan from the last nibble of the previous chunk. (NOTE: Normally this index only needs to be done once, but for the last nibble, as it can ony represent up to 6 nibbles earlier, if it specifies movement to another chunk, there will be an extra handling of the 0th nibble, which will indicate a need to move to the earlier chunk).

The algorithm to parse this data is as follows. Be aware that this algorithm has not actually be tested/implemented, so this is likely nearly correct, but unlikely to be completely correct.

// This algorithm assumes that pCodePointer has previously been bounds checked to point a some portion of the code heap.
 
uint8_t* FindMethodCode(uint8_t* pCodePointer)
{
    if (pCodePointer < pCodeStart) return NULL;
    if (pCodePointer >= (pCodeStart + cbCodeHeapSize)) return NULL;
 
    bool dataUninitialized;
    uint32_t offsetFromStartOfCodeHeap = (uint32_t)((uint8_t*)pCodePointer - (uint8_t*)pCodeStart);
RestartOffByOne:
    int chunkSearch = offsetFromStartOfCodeHeap / 256;
    uint32_t chunkLogicalStartOffset = chunkSearch * 256;
    uint32_t chunk1 = pHdrMap[chunkSearch]; // This is the only memory read in the entire algorithm. Due to the presence of gotos, this may occur up to 3 times.
    if (IsEncoded32BitOffset(chunk1))
        return Decode32BitOffset(chunk1);
    int32_t nibbleIndex = (int32_t)((offsetFromStartOfCodeHeap / 32) % 8)
    int nibbleRead1 = ReadNibble(chunk1, nibbleIndex, &dataUnitialized);
    if (dataUninitialized) return NULL;
    if (nibbleRead1 < 8)
    {
        // Some code block starts within this nibble
        uint32_t codeOffset = chunkLogicalStartOffset + nibbleIndex * 32 + nibbleRead1 * 4;
        if (codeOffset <= offsetFromStartOfCodeHeap)
        {
            // We've found the start of the code
            return ((uint8_t*)pCodePointer + codeOffset);
        }
        else
        {
            // We've found the start of a chunk of code AFTER pCodePointer. Restart search with a pointer that corresponds to one nibble earlier in the search
            // This restart can only happen if pCodePointer refers to an actual method contained within the heap.
            if ((offsetFromStartOfCodeHeap & ~0x1F) == 0) return NULL;
            offsetFromStartOfCodeHeap = (offsetFromStartOfCodeHeap & ~0x1F) - 1;
            goto RestartOffByOne; // OffByOneNibble
        }
    }
    else
    {
        // This nibble does not have a code start within in, but lookup should check another nibble or chunk.
        int nibbleEarlierToLookAt = nibbleRead - 7;
        int nibbleOfInterest = nibbleIndex - nibbleEarlierToLookAt;
        // Valid results of nibbleOfInterest are the numbers -1 to 5. (It would be nice if 
        if (nibbleOfInterest == -1)
        {
            // We've found that the start of this jitted method can only be described by an earlier chunk entirely.
            // This can only happen once during the lookup, as either the earlier chunk is a 32bit offset chunk, in which case the lookup will only need to look at one chunk
            // Or it is another set of nibbles, in which case the start of the method must be within that set of nibbles.
            offsetFromStartOfCodeHeap = (offsetFromStartOfCodeHeap & ~0xFF) - 1;
            goto RestartOffByOne; // OffByOneChunk
        }
        else
        {
            int nibbleRead2 = ReadNibble(chunk1, nibbleOfInterest, &dataUnitialized);
            if (dataUninitialized) return NULL;
            if (nibbleRead2 < 8)
            {
                uint32_t codeOffset = chunkLogicalStartOffset + nibbleIndex * 32 + nibbleRead1 * 4;
                return ((uint8_t*)pCodePointer + codeOffset);
            }
            else
            {
                assert(nibbleIndex == 7);
                assert(nibbleRead2 == 8);
                // We've found that the start of this jitted method can only be described by an earlier chunk entirely.
                // This can only happen once during the lookup, as either the earlier chunk is a 32bit offset chunk, in which case the lookup will only need to look at one chunk
                // Or it is another set of nibbles, in which case the start of the method must be within that set of nibbles.
                offsetFromStartOfCodeHeap = (offsetFromStartOfCodeHeap & ~0xFF) - 1;
                goto RestartOffByOne; // OffByOneChunk
            }
        }
    }
}

HELPER Methods

// Valid input requires nibbleIndex to be 0-7 inclusive
uint8_t ReadNibble(uint32_t chunk, int nibbleIndex, bool *dataUnitialized)
{
    uint8_t rawNibble = (chunk >> 4 * nibbleIndex) & 0xF;
    *dataUnitialized = rawNibble == 0;
    return rawNibble - 1;
}
 
bool IsEncoded32BitOffset(uint32_t chunk)
{
    // This works as the data structure must not allow "nibbleOfInterest" to be less than -1, and that means no valid encoding for nibble index 0 will be greater than 8.
    uint8_t nibbleZero = ReadNibble(chunk, 0);
    return nibbleZero > 8;
}
 
uint32_t Decode32BitOffset(uint32_t chunk)
{
    uint32_t nibble = ReadNibble(chunk, 0);
    return offset = (chunk & ~0xF) + ((nibble - 8) << 2);
}

@MichalPetryka
Copy link
Contributor

Do I understand correctly that this is for looking up methods from their addresses? If so, would improvements in this area solve the issue I had in #85197 where invalid pointers would crash the VM when looking them up?

@jkotas
Copy link
Member

jkotas commented Oct 26, 2023

If so, would improvements in this area solve the issue I had in #85197 where invalid pointers would crash the VM when looking them up?

This specific improvement would not help with your issue. The data structure proposed here has same function properties as the nibble map, it just has different perf characteristics.

@noahfalk
Copy link
Member

noahfalk commented Oct 26, 2023

@davidwrighton - I spent a little while dissecting what you have and it seemed good to me. This was my simplified mental model of it which others might find useful. I know it doesn't include all the details, but let me know if anything it does include is inaccurate.

  • Every 256 byte region of the code heap is represented by 32 bits of data in the map
  • If there is no method start in a 256 byte region, the 32 bits will be used to encode the offset from the start of the code heap to nearest preceding method start (its not the normal encoding of a 32 bit integer though)
  • If there is one or more method start in a 256 byte region, the 32 bits will be used to encode 8 nibbles. The n'th nibble corresponds to the n'th 32-byte sub-region within the 256 byte region (8*32=256). There is a particular encoding of the 0th nibble which indicates whether to interpret the 32 bits as 8 nibbles or as a single offset.
  • Within a 32 byte sub-region code can start in at most one of the 8 4-byte aligned addresses. The nibble encodes either one of those 8 starting addresses, or 0 indicating it is unintialized, or the remaining 7 values are used when there is no method start in these 32 bytes. We use 7 values rather than just one to provide a hint on which preceding nibble to search next. Linearly searching backwards through the nibbles would work, but skipping some nibbles speeds up the search.
  • The total number of map entries searched is always bounded. If a search in one 32 bit data map entry doesn't find a start address then it must find the start in the preceding entry. If the preceding entry is represented by nibbles that means there was at least one start address in it. Or if the preceding entry is a 32 bit offset that also identifies a method start and terminates the search.

@AaronRobinsonMSFT

This comment was marked as off-topic.

@davidwrighton
Copy link
Member

@AaronRobinsonMSFT No, there is only 1 method start in a 32byte region in the algorithm I presented, at the 256 byte region up to 8 method starts could exist. Now that I've spent a few more moments thinking on this, I suspect we should do some analysis to determine the number of methods which are < 48 and greater than 24 bytes bytes on our 64bit platforms, as you could reduce the amount of memory needed for this sort of map so that each 32bit value represents 64bytes of executable space. If it turns out that we have very few of those, it would probably be preferable to have the system do all of its alignment in terms of 8 bytes instead of 4 bytes, working with 64 byte and 512 byte regions.

@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Oct 27, 2023

@AaronRobinsonMSFT No, there is only 1 method start in a 32byte region in the algorithm I presented,

Apologies, you're right. I misinterpreted the 256 and confused that with DWORDs and other units that are defined in the current implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants