Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MemoryCache test failures due to race #72821

Conversation

danmoseley
Copy link
Member

Fix #45868
Fix #33993
Related #50270

MemoryCache entries may have arbitrary size (weight); if the cache has an overall size limit set on it, the cache maintains a sum of the sizes of its entries ("Size") in order to trigger potential compaction and to reject oversize additions. As an implementation decision, presumably for performance, it does not update the Size atomically with updates to the entries in the collection; the Size is made eventually consistent, potentially on another thread. Size is exposed for unit tests only. These tests were assuming Size was immediately consistent, so they were occasionally failing.

  • Verified failures reproduce by adding a sleep before updating the size. (I cannot repro locally without sleeps, but we know that lab machines can be much slower, busier, unluckier, etc)
  • Add retries to all reads of Size, verified failures no longer reproduce.
  • Remove sleeps again.
  • Un-disabled all tests disabled against Test failures: Microsoft.Extensions.Caching.Memory flakiness #33993 and verified all tests pass.
  • Add some comments where I was confused by the implementation.

Note that at least one test disabled against #33993 did not actually read the size. It would have failed due to some unrelated reason - back in 2020. I'm going to assume it was fixed by another change such as possibly #42355.

Thanks @vonzshik for help.

@ghost
Copy link

ghost commented Jul 26, 2022

Tagging subscribers to this area: @dotnet/area-extensions-caching
See info in area-owners.md if you want to be subscribed.

Issue Details

Fix #45868
Fix #33993
Related #50270

MemoryCache entries may have arbitrary size (weight); if the cache has an overall size limit set on it, the cache maintains a sum of the sizes of its entries ("Size") in order to trigger potential compaction and to reject oversize additions. As an implementation decision, presumably for performance, it does not update the Size atomically with updates to the entries in the collection; the Size is made eventually consistent, potentially on another thread. Size is exposed for unit tests only. These tests were assuming Size was immediately consistent, so they were occasionally failing.

  • Verified failures reproduce by adding a sleep before updating the size. (I cannot repro locally without sleeps, but we know that lab machines can be much slower, busier, unluckier, etc)
  • Add retries to all reads of Size, verified failures no longer reproduce.
  • Remove sleeps again.
  • Un-disabled all tests disabled against Test failures: Microsoft.Extensions.Caching.Memory flakiness #33993 and verified all tests pass.
  • Add some comments where I was confused by the implementation.

Note that at least one test disabled against #33993 did not actually read the size. It would have failed due to some unrelated reason - back in 2020. I'm going to assume it was fixed by another change such as possibly #42355.

Thanks @vonzshik for help.

Author: danmoseley
Assignees: -
Labels:

area-Extensions-Caching

Milestone: -

@ghost ghost assigned danmoseley Jul 26, 2022
@danmoseley danmoseley merged commit a660c0f into dotnet:main Jul 26, 2022
@@ -543,7 +543,6 @@ public void SetGetAndRemoveWorksWithObjectKeysWhenDifferentReferences()
}

[Fact]
[ActiveIssue("https://github.com/dotnet/runtime/issues/33993")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What changed in this test to allow it to be re-enabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was the test I was referring to above

Note that at least one test disabled against #33993 did not actually read the size. It would have failed due to some unrelated reason - back in 2020. I'm going to assume it was fixed by another change such as possibly #42355.

it passes for me, so not sure what else I can do.

Copy link
Member

@eerhardt eerhardt Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen that #33993 has been reactivated?

Assert.False() Failure
Expected: False
Actual:   True

at Microsoft.Extensions.Caching.Memory.MemoryCacheSetAndRemoveTests.GetAndSet_AreThreadSafe_AndUpdatesNeverLeavesNullValues() in /_/src/libraries/Microsoft.Extensions.Caching.Memory/tests/MemoryCacheSetAndRemoveTests.cs:line 588

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. I'll re-disable but against a new issue, for clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, that's a pure speculation, but from my understanding, what's what might be happening:

  1. Thread 1 sets a new entry. It happened to find a previous entry under the same key, so it attempts to expire it.

if (coherentState._entries.TryGetValue(entry.Key, out CacheEntry? priorEntry))
{
priorEntry.SetExpired(EvictionReason.Replaced);
}

  1. While expiring the previous entry, jit is (I think) allowed to rewrite it from

if (EvictionReason == EvictionReason.None)
{
EvictionReason = reason;
}
_isExpired = true;

to

var currentEvictionReason = EvictionReason;
_isExpired = true; 
if (currentEvictionReason == EvictionReason.None)
{
    EvictionReason = reason;
}
  1. Thread 2 attempts to get an entry from the cache.

if (coherentState._entries.TryGetValue(key, out CacheEntry? tmp))
{
CacheEntry entry = tmp;
// Check if expired due to expiration tokens, timers, etc. and if so, remove it.
// Allow a stale Replaced value to be returned due to concurrent calls to SetExpired during SetEntry.
if (!entry.CheckExpired(utcNow) || entry.EvictionReason == EvictionReason.Replaced)

While doing so, it first check whether the entry is expired, and if so, the reason for the expiration.

  1. Now we have all the parts. If thread 1 did manage to set _isExpired flag to true, but not yet EvictionReason, then thread 2 might be able to get true from CheckExpired, and read EvictionReason.None from EvictionReason. And this will lead result being null.

If so, we can test this theory by adding a MemoryBarrier between EvictionReason and _isExpired, like this:

if (EvictionReason == EvictionReason.None)
{
	EvictionReason = reason;
}
Interlocked.MemoryBarrier();
_isExpired = true;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is plausible @vonzshik thank you. I'll link from the issue..

@danmoseley danmoseley deleted the AddingReplacementExceedsCapacityRemovesOldEntry branch July 26, 2022 18:12
@AntonLapounov
Copy link
Member

DoNotAddIfSizeOverflows failed in my PR. https://helix.dot.net/api/2019-06-17/jobs/15cb7b68-1aef-4870-ba9d-0b5b476a0e17/workitems/Microsoft.Extensions.Caching.Memory.Tests/console

Microsoft.Extensions.Caching.Memory.CapacityTests.DoNotAddIfSizeOverflows [FAIL]
      Assert.True() Failure
      Expected: True
      Actual:   False
      Stack Trace:
        /_/src/libraries/Microsoft.Extensions.Caching.Memory/tests/CapacityTests.cs(137,0): at Microsoft.Extensions.Caching.Memory.CapacityTests.DoNotAddIfSizeOverflows()

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
5 participants