Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Corruption With Ref Locals, Punning, and Pinned Object Heap #76929

Closed
kevin-montrose opened this issue Oct 12, 2022 · 15 comments · Fixed by #86802
Closed

Data Corruption With Ref Locals, Punning, and Pinned Object Heap #76929

kevin-montrose opened this issue Oct 12, 2022 · 15 comments · Fixed by #86802
Assignees
Milestone

Comments

@kevin-montrose
Copy link

kevin-montrose commented Oct 12, 2022

Description

This was a bear to diagnose, and I'm still not 100% on what exactly is happening but the scenario is:

  • Have a bunch of byte[] s allocated on the POH
  • Those byte[]s are referenced by a ConcurrentDictionary
  • Have a bunch of threads getting those byte[]s, and punning it via MemoryMarshal.Cast
    • An earlier version used a ref byte and some unsafe code, but I've removed the unsafe code to eliminate it as a possible cause
  • Have some other threads removing byte[]s from the ConcurrentDictionary
  • After some time, data corruption occurs

I first discovered this as random looking pointers getting written into those byte[] arrays, but in the process of winnowing down to a smaller reproduction null reference exceptions, seg faults, and other "you've corrupted the process"-style errors became more likely. I interpret this as the same corruption happening, but because my punned arrays are smaller the corruption is more likely to hit something else.

I first noticed this in .NET 7 RC (7.0.0-rc.1.22427.1 specifically) but it has also been reproduced in .NET 6.

Reproduction Steps

I have a gist I used to winnow down the repro some.

Latest is copied here:

// drop this into a test project

[StructLayout(LayoutKind.Explicit, Size = Size)]
private struct Punned
{
    internal const int Size = 8;

    [FieldOffset(0)]
    public ulong A;
}

/// <summary>
/// This spawns a bunch of threads, half of which do integrity checks on a punned byte[]
/// and half of which randomly replace a referenced byte[].
/// 
/// Sometimes things just break: either field corruption, null ref, or access violation.
/// 
/// Reproduces in DEBUG builds and RELEASE builds.
/// 
/// Tends to take < 10 iterations, but not more than 100.  You know, on my machine.
/// 
/// Only reproduces if you use the POH, SOH and LOH are fine.
/// </summary>
[Fact]
public void Repro()
{
    // DOES repro with ALLOC_SIZE >= Punned.Size
    //        and with USE_POH == true
    //
    // does not repro if USE_POH == false

    // tweak these to mess with alignment and heap
    const int ALLOC_SIZE = Punned.Size;
    const bool USE_POH = true;

    Assert.True(Punned.Size == Unsafe.SizeOf<Punned>(), "Hey, this isn't right");
    Assert.True(ALLOC_SIZE >= Punned.Size, "Hey, this isn't right");

    const int MAX_KEY = 1_000_000;

    var iter = 0;
    while (true)
    {
        Debug.WriteLine($"Iteration: {iter}");
        iter++;

        var dict = new ConcurrentDictionary<int, byte[]>();

        // allocate
        for (var i = 0; i < MAX_KEY; i++)
        {
            dict[i] = GC.AllocateArray<byte>(Punned.Size, pinned: USE_POH);
        }

        // start all the threads
        using var startThreads = new SemaphoreSlim(0, Environment.ProcessorCount);

        var modifyThreads = new Thread[Environment.ProcessorCount / 2];
        for (var i = 0; i < modifyThreads.Length; i++)
        {
            modifyThreads[i] = ModifyingThread(i, startThreads, dict);
        }

        var checkThreads = new Thread[Environment.ProcessorCount - modifyThreads.Length];
        using var stopCheckThreads = new SemaphoreSlim(0, checkThreads.Length);
        for (var i = 0; i < checkThreads.Length; i++)
        {
            checkThreads[i] = IntegrityThread(i, MAX_KEY / checkThreads.Length, startThreads, stopCheckThreads, dict);
        }

        // let 'em go
        startThreads.Release(modifyThreads.Length + checkThreads.Length);

        // wait for modifying threads to finish...
        for (var i = 0; i < modifyThreads.Length; i++)
        {
            modifyThreads[i].Join();
        }

        // stop check threads..
        stopCheckThreads.Release(checkThreads.Length);
        for (var i = 0; i < checkThreads.Length; i++)
        {
            checkThreads[i].Join();
        }
    }

    static Thread IntegrityThread(
        int threadIx,
        int step,
        SemaphoreSlim startThreads,
        SemaphoreSlim stopThreads,
        ConcurrentDictionary<int, byte[]> dict
    )
    {
        using var threadStarted = new SemaphoreSlim(0, 1);

        var t =
            new Thread(
                () =>
                {
                    threadStarted.Release();

                    startThreads.Wait();

                    while (!stopThreads.Wait(0))
                    {
                        for (var i = 0; i < MAX_KEY; i++)
                        {
                            var keyIx = (threadIx * step + i) % MAX_KEY;

                            ref Punned punned = ref Pun(dict[keyIx]);

                            Check(ref punned);
                        }
                    }
                }
             );
        t.Name = $"{nameof(Repro)} Integrity #{threadIx}";
        t.Start();

        threadStarted.Wait();

        return t;
    }

    static Thread ModifyingThread(int threadIx, SemaphoreSlim startThreads, ConcurrentDictionary<int, byte[]> dict)
    {
        using var threadStarted = new SemaphoreSlim(0, 1);

        var t = new
            Thread(
                () =>
                {
                    threadStarted.Release();

                    var rand = new Random(threadIx);

                    startThreads.Wait();

                    for (var i = 0; i < 1_000_000; i++)
                    {
                        var keyIx = rand.Next(MAX_KEY);

                        var newArr = GC.AllocateArray<byte>(Punned.Size, pinned: USE_POH);
                        Assert.True(newArr.All(x => x == 0));

                        // make sure it comes up reasonable
                        ref Punned punned = ref Pun(newArr);
                        Assert.Equal(0UL, punned.A);

                        // this swaps out the only reference to a byte[]
                        // EXCEPT for any of the checking threads, which only
                        // grab it through a ref
                        dict.AddOrUpdate(keyIx, static (_, passed) => passed, static (_, _, passed) => passed, newArr);
                    }
                }
            );
        t.Name = $"{nameof(Repro)} Modify #{threadIx}";
        t.Start();

        threadStarted.Wait();

        return t;
    }

    static ref Punned Pun(byte[] data)
    {
        var span = data.AsSpan();

        var punned = MemoryMarshal.Cast<byte, Punned>(span);

        return ref punned[0];
    }

    static void Check(ref Punned val)
    {
        // all possible bit patterns are well known
        var a = val.A;
        Assert.True(a == 0);
    }
}

This will fail either in Check, with a null ref in an impossible place (usually AddOrUpdate), or with some variant of "runtime has become corrupt". The NRE is most common with the above, but earlier revisions usually failed in Check.

In my testing this only happens if the POH is used (toggle USE_POH to verify), and at all (legal) sizes for the byte[]s (change ALLOC_SIZE to verify).

Expected behavior

I would expect the attached code to run fine forever.

Actual behavior

Crashes with some sort of data corruption.

Regression?

No, this reproduces (at least in part) on .NET 6.

Known Workarounds

Don't use the POH I guess?

Configuration

This was first noticed on:

  • Microsoft Windows 11 Home Insider Preview: 10.0.25151 N/A Build 25151
  • AMD64 Family 23 Model 96 Stepping 1 AuthenticAMD ~2000 Mhz: AMD Ryzen™ 7 4980U
  • .NET 7: Microsoft.WindowsDesktop.App 7.0.0-rc.1.22427.1

It is also reproducing, at least in part, on .NET 6.

It has been reproduced on a colleagues machine as well, but I don't have the specifics beyond also x64, Windows, and .NET 7 & 6.

Other information

When I've found a corrupted byte[] (instead of a NRE or other crash), it looks very pointer-y but seems to point to memory outside of any heap.

This makes me think some sort of GC bug, perhaps as part of growing or shrinking the POH, but that is ~98% guesswork.

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 12, 2022
@ghost
Copy link

ghost commented Oct 12, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

This was a bear to diagnose, and I'm still not 100% on what exactly is happening but the scenario is:

  • Have a bunch of byte[] s allocated on the POH
  • Those byte[]s are referenced by a ConcurrentDictionary
  • Have a bunch of threads getting those byte[]s, grabbing a ref someByte[0] local, and punning it via MemoryMarshal.Cast
  • Have some other threads removing byte[]s from the ConcurrentDictionary
  • After some time, data corruption occurs

I first discovered this as random looking pointers getting written into those byte[] arrays, but in the process of winnowing down to a smaller reproduction null reference exceptions, seg faults, and other "you've corrupted the process"-style errors became more likely. I interpret this as the same corruption happening, but because my punned arrays are smaller the corruption is more likely to hit something else.

I first noticed this in .NET 7 RC (7.0.0-rc.1.22427.1 specifically) but it has also been reproduced in .NET 6.

Reproduction Steps

I have a gist I used to winnow down the repro some.

Latest is copied here:

// drop this into a test project

[StructLayout(LayoutKind.Explicit, Size = Size)]
private struct Punned
{
    internal const int Size = 8;

    [FieldOffset(0)]
    public ulong A;
}

/// <summary>
/// This spawns a bunch of threads, half of which do integrity checks on a punned byte[]
/// and half of which randomly replace a referenced byte[].
/// 
/// Sometimes things just break: either field corruption, null ref, or access violation.
/// 
/// Reproduces in DEBUG builds and RELEASE builds.
/// 
/// Tends to take < 10 iterations, but not more than 100.  You know, on my machine.
/// 
/// Only reproduces if you use the POH, SOH and LOH are fine.
/// </summary>
[Fact]
public void Repro()
{
    // DOES repro with ALLOC_SIZE >= Punned.Size
    //        and with USE_POH == true
    //
    // does not repro if USE_POH == false

    // tweak these to mess with alignment and heap
    const int ALLOC_SIZE = Punned.Size;
    const bool USE_POH = true;

    Assert.True(Punned.Size == Unsafe.SizeOf<Punned>(), "Hey, this isn't right");
    Assert.True(ALLOC_SIZE >= Punned.Size, "Hey, this isn't right");

    const int MAX_KEY = 1_000_000;

    var iter = 0;
    while (true)
    {
        Debug.WriteLine($"Iteration: {iter}");
        iter++;

        var dict = new ConcurrentDictionary<int, byte[]>();

        // allocate
        for (var i = 0; i < MAX_KEY; i++)
        {
            dict[i] = GC.AllocateArray<byte>(Punned.Size, pinned: USE_POH);
        }

        // start all the threads
        using var startThreads = new SemaphoreSlim(0, Environment.ProcessorCount);

        var modifyThreads = new Thread[Environment.ProcessorCount / 2];
        for (var i = 0; i < modifyThreads.Length; i++)
        {
            modifyThreads[i] = ModifyingThread(i, startThreads, dict);
        }

        var checkThreads = new Thread[Environment.ProcessorCount - modifyThreads.Length];
        using var stopCheckThreads = new SemaphoreSlim(0, checkThreads.Length);
        for (var i = 0; i < checkThreads.Length; i++)
        {
            checkThreads[i] = IntegrityThread(i, MAX_KEY / checkThreads.Length, startThreads, stopCheckThreads, dict);
        }

        // let 'em go
        startThreads.Release(modifyThreads.Length + checkThreads.Length);

        // wait for modifying threads to finish...
        for (var i = 0; i < modifyThreads.Length; i++)
        {
            modifyThreads[i].Join();
        }

        // stop check threads..
        stopCheckThreads.Release(checkThreads.Length);
        for (var i = 0; i < checkThreads.Length; i++)
        {
            checkThreads[i].Join();
        }
    }

    static Thread IntegrityThread(
        int threadIx,
        int step,
        SemaphoreSlim startThreads,
        SemaphoreSlim stopThreads,
        ConcurrentDictionary<int, byte[]> dict
    )
    {
        using var threadStarted = new SemaphoreSlim(0, 1);

        var t =
            new Thread(
                () =>
                {
                    threadStarted.Release();

                    startThreads.Wait();

                    while (!stopThreads.Wait(0))
                    {
                        for (var i = 0; i < MAX_KEY; i++)
                        {
                            var keyIx = (threadIx * step + i) % MAX_KEY;

                            ref Punned punned = ref Pun(dict[keyIx]);

                            Check(ref punned);
                        }
                    }
                }
             );
        t.Name = $"{nameof(Repro)} Integrity #{threadIx}";
        t.Start();

        threadStarted.Wait();

        return t;
    }

    static Thread ModifyingThread(int threadIx, SemaphoreSlim startThreads, ConcurrentDictionary<int, byte[]> dict)
    {
        using var threadStarted = new SemaphoreSlim(0, 1);

        var t = new
            Thread(
                () =>
                {
                    threadStarted.Release();

                    var rand = new Random(threadIx);

                    startThreads.Wait();

                    for (var i = 0; i < 1_000_000; i++)
                    {
                        var keyIx = rand.Next(MAX_KEY);

                        var newArr = GC.AllocateArray<byte>(Punned.Size, pinned: USE_POH);
                        Assert.True(newArr.All(x => x == 0));

                        // make sure it comes up reasonable
                        ref Punned punned = ref Pun(newArr);
                        Assert.Equal(0UL, punned.A);

                        // this swaps out the only reference to a byte[]
                        // EXCEPT for any of the checking threads, which only
                        // grab it through a ref
                        dict.AddOrUpdate(keyIx, static (_, passed) => passed, static (_, _, passed) => passed, newArr);
                    }
                }
            );
        t.Name = $"{nameof(Repro)} Modify #{threadIx}";
        t.Start();

        threadStarted.Wait();

        return t;
    }

    static ref Punned Pun(byte[] data)
    {
        var span = data.AsSpan();

        var punned = MemoryMarshal.Cast<byte, Punned>(span);

        return ref punned[0];
    }

    static void Check(ref Punned val)
    {
        // all possible bit patterns are well known
        var a = val.A;
        Assert.True(a == 0);
    }
}

This will fail either in Check, with a null ref in an impossible place (usually AddOrUpdate), or with some variant of "runtime has become corrupt". The NRE is most common with the above, but earlier revisions usually failed in Check.

In my testing this only happens if the POH is used (toggle USE_POH to verify), and at all (legal) sizes for the byte[]s (change ALLOC_SIZE to verify).

A old colleague had a variant using a byte[][] instead, which might be easier to diagnose. They're also the ones who confirmed the .NET 6 repro. I'll see if they can't add their findings it to this issue later today.

Expected behavior

I would expect the attached code to run fine forever.

Actual behavior

Crashes with some sort of data corruption.

Regression?

No, this reproduces (at least in part) on .NET 6.

Known Workarounds

Don't use the POH I guess?

Configuration

This was first noticed on:

  • Microsoft Windows 11 Home Insider Preview: 10.0.25151 N/A Build 25151
  • AMD64 Family 23 Model 96 Stepping 1 AuthenticAMD ~2000 Mhz: AMD Ryzen™ 7 4980U
  • .NET 7: Microsoft.WindowsDesktop.App 7.0.0-rc.1.22427.1

It is also reproducing, at least in part, on .NET 6.

It has been reproduced on a colleagues machine as well, but I don't have the specifics beyond also x64, Windows, and .NET 7 & 6.

Other information

When I've found a corrupted byte[] (instead of a NRE or other crash), it looks very pointer-y but seems to point to memory outside of any heap.

This makes me think some sort of GC bug, perhaps as part of growing or shrinking the POH, but that is ~98% guesswork.

Author: kevin-montrose
Assignees: -
Labels:

area-GC-coreclr, untriaged

Milestone: -

@vcsjones
Copy link
Member

It has been reproduced on a colleagues machine as well, but I don't have the specifics beyond also x64, Windows, and .NET 7 & 6.

For what it's worth, it reliably crashes in a few seconds with a SIGSEGV on a macOS M1, but only in release mode. This was on .NET 8 from a nightly a few days ago.

More notably, it crashes for me even with USE_POH = false. So this may indicate the problem is not exclusive to the POH.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 12, 2022
@mangod9 mangod9 added this to the 8.0.0 milestone Oct 12, 2022
@mgravell
Copy link
Member

mgravell commented Oct 12, 2022

I was able to repro on net6 and net7-rc2 on Windows, Ryzen 9 5900HS in POH mode (only); I was ultimately unable to make any significant inroads into what was actually happening, other than "bad things"

I tried a few scenarios:

  • as-written: fails after a while with NRE in ConcurrentDictionary<,>.TryUpdateInternal
  • using the indexer-setter on the dictionary dict[keyIx] = newArr; - fails (after more iterations, but often less time - indexer is much faster) with an assertion error on the value - typically seeing values like 2565199768464 where the zero was expected
  • using the alternative overload dict.AddOrUpdate(keyIx, newArr, static (_, passed) => passed); - again much faster, but fails eventually

It certainly feels like a GC tracking error that is specific to POH reachability checks for values only reachable by interior managed pointers; however - multiple simpler scenarios that targeted exactly that: did nothing interesting and worked as expected.

Perhaps relevant:

  • using Unsafe.As instead of the MemoryMarshal.Cast for the punning: made no difference (but a little faster - sacrificing a length test)
  • adding GC.KeepAlive(newArr); after the assignment changed nothing, so presumably the issue isn't related to tracking the inbound value... but this is speculation
  • when I was adding more context to the assertions (to get the 2565199768464 value) it started taking significantly more iterations to fault (but was still unstable); that may suggest some interplay with JIT, or could just be related to inlining changes making it perform differently
  • using a vector as the backend store (and just the array indexer) didn't seem to fault at all; maybe it needs the interior pointer (now the only thing rooting the object) to be somewhere non-trivial in the stack, via the deeper dictionary update path? or maybe it is just that the vector indexer speeds things up so dramatically that we simply aren't hitting the race scenario in the time available

@vcsjones
Copy link
Member

Anecdotal, but this might be JIT related. If I set COMPlus_JITMinOpts=1, it does not reproduce anymore. It's been running several minutes without crashing.

@kevin-montrose
Copy link
Author

kevin-montrose commented Oct 12, 2022

@vcsjones still reproduces for me (.NET 7 RC, using POH, x64) with COMPlus_JITMinOpts=1 😞 - just a couple iterations, though they are slower iterations (as expected).

@lambdageek
Copy link
Member

lambdageek commented Oct 12, 2022

(adding a note to myself to check with Mono)


Update: didn't repro on desktop mono (JIT and interp). I tried some variants with adding enough recursion to confound conservative stack scanning, too. So it seems unlikely it's some logic error in at least the shared bits of CoreLib.

@mgravell
Copy link
Member

Random thought (I'm not at PC to test) - maybe a JIT size miscalculation due to the explicit struct layout? (not as random as it sounds - the last time I found a JIT bug was the "fixed buffers" size miscalculation)

@mangod9
Copy link
Member

mangod9 commented Oct 12, 2022

Usually a heap corruption of this sort could be related to JIT optimizations. @AndyAyersMS in case you have seen cases like this

@kevin-montrose
Copy link
Author

Random thought (I'm not at PC to test) - maybe a JIT size miscalculation due to the explicit struct layout? (not as random as it sounds - the last time I found a JIT bug was the "fixed buffers" size miscalculation)

In my experimentation removing explicit layout, or adding kinda random padding, made no difference. I default to explicit layouts when doing tricky things with structs since I can never remember what's actually guaranteed by the compiler - given how shrunk down this repro is, it could probably be removed safely...

@AndyAyersMS
Copy link
Member

Usually a heap corruption of this sort could be related to JIT optimizations. @AndyAyersMS in case you have seen cases like this

If this repros with minopts and seems to be related to POH it's a bit less likely to be a JIT issue. But still possible.

cc @dotnet/jit-contrib

@EgorBo
Copy link
Member

EgorBo commented Oct 13, 2022

Judging by the stack trace it crashes in ObjectEqualityComparer.Equals

@EgorBo
Copy link
Member

EgorBo commented Oct 13, 2022

Hits an assert with DOTNET_HeapVerify=1 when validating a ConcurrentDictionary's Node object, namely - its members in ValidateObjectMember

@mrsharm mrsharm self-assigned this Oct 19, 2022
@cshung cshung removed their assignment Oct 20, 2022
@cshung
Copy link
Member

cshung commented Dec 18, 2022

The bug is understood, see here for the investigation notes.

@cmadh
Copy link

cmadh commented Jun 8, 2023

I assume this issue is has not been fixed in the last .net 8 preview ?

@ghost ghost locked as resolved and limited conversation to collaborators Sep 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.