
WIP. More Efficient MemoryAllocator #1660

Closed
wants to merge 30 commits into from

Conversation


@JimBobSquarePants JimBobSquarePants commented Jun 13, 2021

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

A breaking (due to the removal of unused configuration options) rewrite of ArrayPoolMemoryAllocator, modelled on the design described at #1596 (comment) by @antonfirsov.

Things to consider

  • For Allocate<T> only, we should consider using unmanaged memory on supported platforms for > 2MB allocations, using the approach described by @saucecontrol in More efficient MemoryAllocator #1596 (comment). Done!
  • Remove all factory methods and potentially all configuration from the pool.
  • We need tests for the GC trimming. Will check the runtime tests to see what they have.

Issues

Several Tiff encoding tests fail with the lower buffer chunk threshold set. This is due to TiffBaseColorWriter<TPixel> calling Buffer2D<T>.GetSingleSpan(). @brianpopow I'll need you or @IldarKhayrutdinov to help me there, as I don't know enough about the format to fix it myself. Fixed, thanks @brianpopow!!

protected static Span<T> GetStripPixels<T>(Buffer2D<T> buffer2D, int y, int height)
where T : struct
=> buffer2D.GetSingleSpan().Slice(y * buffer2D.Width, height * buffer2D.Width);

We have a failing test for ResizeKernelMap due to the use of Buffer2D<T>.GetSingleMemory(). @antonfirsov I'll need help there. Fixed!

this.data = memoryAllocator.Allocate2D<float>(this.MaxDiameter, bufferHeight, AllocationOptions.Clean);
this.pinHandle = this.data.GetSingleMemory().Pin();

@JimBobSquarePants JimBobSquarePants added the area:performance and breaking (signifies a binary breaking change) labels Jun 13, 2021
@JimBobSquarePants JimBobSquarePants added this to the 1.1.0 milestone Jun 13, 2021
@JimBobSquarePants JimBobSquarePants linked an issue Jun 13, 2021 that may be closed by this pull request
@JimBobSquarePants JimBobSquarePants requested a review from a team June 14, 2021 01:12
: base(IntPtr.Zero, true)
{
this.SetHandle(Marshal.AllocHGlobal(size));
GC.AddMemoryPressure(this.byteCount = size);
Contributor

@br3aker br3aker Jun 14, 2021

Should this be done? This would trigger gen2 collections sooner, and the object itself would behave as if it were a managed allocation, which I assume is not what we want for large chunks of data.

The Microsoft docs state that "The AddMemoryPressure and RemoveMemoryPressure methods improve performance only for types that exclusively depend on finalizers to release the unmanaged resources". That is not the case here, as the SafeHandle only exists to prevent a memory leak when the user forgets to call Dispose, which is very unlikely to be caused by library code (this should be tested somehow, to be honest).

Otherwise this would lead to memory throttling, as happens with the current implementation.

Contributor

That advice is awkwardly worded. In the case where a user fails to Dispose an object backed by an unmanaged buffer, it does rely on the SafeHandle's finalizer to release the memory, so it is advantageous to advise the GC that it might be able to reclaim the memory by doing its GC thing. This should only result in more frequent gen2 GCs if the memory limits are being reached, which should only happen if the unmanaged buffer is long-lived or if the system is actually low on memory. For properly Disposed ephemeral buffers, the GC will be aware of the added memory pressure but will also see it freed quickly, so no harm.

Contributor

I'm not sure these buffers can be called 'ephemeral'; 2MB is a lot (at least it looks like a lot for something temporary), which should mean a lot of execution time is spent on this memory. The primary idea I had while writing this is that even if the user forgets to dispose memory backed by a handle, it'll be freed by either a gen0/gen1 collection or a full gen2 collection when an OutOfMemoryException would otherwise kick in.

While I'm not so sure about this now, I'm still concerned that this might be overkill due to the f-reachable queue. At the least, I'd want to benchmark this on something big like your parallel bee-heads demo.

Contributor

Unmanaged buffers will only be used for allocations that are over the pooled size limit. They should be used once and then released, hence 'ephemeral'. The question of whether GC.AddMemoryPressure is appropriate ultimately comes down to "is there any way a GC could free this memory?". The answer here is yes, but only if the memory has been leaked because someone didn't dispose. Unfortunately there's no way to know whether someone is going to leak; you can only tell after they've done it, when your finalizer runs.

What the MS docs should have said is something along the lines of "don't tell the GC about memory it couldn't possibly reclaim". If the allocation and free were always 100% deterministic, there would be no point in telling the GC about them. But since we can't know whether the GC could reclaim the memory, it's better to tell it about the allocation than to not.
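
For reference, a minimal sketch of the pattern under discussion: a SafeHandle-backed unmanaged allocation that advises the GC of the off-heap memory it owns. The class and field names here are illustrative, not the PR's actual UnmanagedBuffer<T> internals.

using System;
using System.Runtime.InteropServices;

internal sealed class UnmanagedMemoryHandle : SafeHandle
{
    private readonly int byteCount;

    public UnmanagedMemoryHandle(int size)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        this.SetHandle(Marshal.AllocHGlobal(size));
        this.byteCount = size;

        // Advise the GC that this handle roots unmanaged memory. If the owner forgets
        // to Dispose, a gen2 collection can still reclaim it via the finalizer.
        GC.AddMemoryPressure(size);
    }

    public override bool IsInvalid => this.handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        // Runs on Dispose or from the finalizer queue. FreeHGlobal cannot fail for a
        // pointer that came from AllocHGlobal.
        Marshal.FreeHGlobal(this.handle);
        GC.RemoveMemoryPressure(this.byteCount);
        return true;
    }
}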

Contributor

Once again, thanks for the deep dive! I've never looked at the GC from the "memory it can or cannot reclaim" point of view.

/// Does not directly use any hardware intrinsics, nor does it incur branching.
/// </summary>
/// <param name="value">The value.</param>
private static int Log2SoftwareFallback(uint value)
Contributor

@br3aker br3aker Jun 14, 2021

The Jpeg encoder uses similar logic for internal operations; this could potentially be placed somewhere in the Numerics.cs class for reuse.

P.S.
Jpeg needs to calculate the minimum bit size of a given number, which can be computed via Log2DeBruijn fallback logic on hardware without intrinsics. Not really a game changer, but it would be nice to have these under a single class & single test fixture.

Member Author

@JimBobSquarePants JimBobSquarePants Jun 14, 2021

That uses BitOperations.LeadingZeroCount, does it not? Numerics.MinimumBitsToStore16 is awkwardly named, actually. 16 what?

Contributor

@br3aker br3aker Jun 14, 2021

If the Lzcnt intrinsic is supported, yes. But the fallback code currently uses a LUT of 255 values, with a single possible branch if the value exceeds 255, so 16 bits is the maximum value bit size this method can safely calculate. A DeBruijn sequence should be a bit faster and would eliminate the 16-bit maximum value constraint. The code would be a bit different; it's the table that can be shared between the jpeg & pool code. I'll open a PR with the naming fix & algorithms after finishing my current open PR for testing.
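
For reference, a sketch of the de Bruijn fallback being discussed; it mirrors the lookup-table approach the runtime's BitOperations.Log2 uses when the Lzcnt intrinsic is unavailable (the class name here is illustrative).

using System;

internal static class BitHelpers
{
    private static ReadOnlySpan<byte> Log2DeBruijn => new byte[32]
    {
        0, 9, 1, 10, 13, 21, 2, 29,
        11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7,
        19, 27, 23, 6, 26, 5, 4, 31
    };

    public static int Log2SoftwareFallback(uint value)
    {
        // Smear the highest set bit downwards so the value becomes 2^(n+1) - 1.
        value |= value >> 1;
        value |= value >> 2;
        value |= value >> 4;
        value |= value >> 8;
        value |= value >> 16;

        // Multiply by a de Bruijn constant; the top 5 bits index the lookup table.
        return Log2DeBruijn[(int)((value * 0x07C4ACDDu) >> 27)];
    }
}

The minimum bit size of a non-zero value then falls out as Log2(value) + 1, which covers the full 32-bit range rather than the current 16-bit limit.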

Member Author

Your maths is better than mine! 😀

Cool I’ll leave that to you then. Thanks!

Contributor

No problem! I'll try to do it today so you can use it in this PR before merging.

Member Author

Lovely, thanks!

@JimBobSquarePants
Member Author

JimBobSquarePants commented Jun 14, 2021

@saucecontrol I had a go at profiling.
This PR

[profiler screenshot]

Master

[profiler screenshot]

Here's your original sample.

[profiler screenshot]

/// </summary>
private ArrayPool<byte> largeArrayPool;
private const int DefaultMaxArraysPerBucket = 16;
Member

Since we utilize only one bucket now, an intensive parallel load will defer to the unmanaged stuff almost all the time. I would try benchmarking with different bucket counts.

Member Author

I'm gonna need help writing those tests and improving others to ensure they work in CI.

ThrowInvalidAllocationException<T>(length);
// For anything greater than our pool limit defer to unmanaged memory
// to prevent LOH fragmentation.
memory = new UnmanagedBuffer<T>(length);
Member

Would be great to see some metrics on how many times we go here vs. to the pool.
I can help with the EventSource stuff.

Member Author

I know the BMP codec hits this a few times. I saw that when I made a mistake in the UnmanagedBuffer<T> type. Hopefully my EventSource stuff works ok.
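
A hypothetical sketch of the kind of counters being discussed: a minimal EventSource that lets a trace show how often allocations fall through to unmanaged memory versus being served from the pool. The source name, event ids, and method names are illustrative only, not the PR's actual instrumentation.

using System.Diagnostics.Tracing;

[EventSource(Name = "SixLabors-ImageSharp-Memory")]
internal sealed class MemoryDiagnosticsEventSource : EventSource
{
    public static readonly MemoryDiagnosticsEventSource Log = new MemoryDiagnosticsEventSource();

    // Fired when an allocation is served from the managed pool.
    [Event(1, Level = EventLevel.Informational)]
    public void PooledAllocation(int lengthInBytes) => this.WriteEvent(1, lengthInBytes);

    // Fired when an allocation falls through to unmanaged memory.
    [Event(2, Level = EventLevel.Informational)]
    public void UnmanagedAllocation(int lengthInBytes) => this.WriteEvent(2, lengthInBytes);
}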

Member

@antonfirsov antonfirsov left a comment

In order to have a fair comparison, you need to benchmark the old and the new allocator on the same PC. You can make a copy of the old allocator class in a different namespace for simplicity.

I would try to tweak different parameters and see how the results change.

/// </summary>
private ArrayPool<byte> normalArrayPool;
internal const int DefaultMaxArrayLengthInBytes = 2 * SharedPoolThresholdInBytes;
Member

This is also a parameter we can tweak!

Member Author

Yep. Time is against me now though, it's nearly 3am.

@JimBobSquarePants JimBobSquarePants changed the title WIP: More Efficient MemoryAllocator More Efficient MemoryAllocator Jun 16, 2021
@JimBobSquarePants JimBobSquarePants requested review from antonfirsov and a team June 16, 2021 10:33
@antonfirsov
Member

antonfirsov commented Jun 16, 2021

So I think what's happening here is that with GC allocating the 2MB buffers, you end up re-using the LOH segment holes

The code is still pooling only 16*2=32MB, that's a tiny part of the 1.7GB peak. What we see here must be the result of building discontiguous buffers out of 2MB unmanaged memory blocks.

@JimBobSquarePants can we run an experiment trying to pool significantly more? (128MB, 512MB, 1024MB => maxArraysPerPoolBucket=64, 256, 512)

If we get better results with unmanaged buffers, we may consider dropping the large pool entirely, though I still have concerns about the reliability of SafeHandle finalizers. AFAIK a single unhandled exception in any user finalizer will prevent the finalizer queue from finishing, leading to actual memory leaks. @saucecontrol thoughts?

@JimBobSquarePants
Member Author

Marking this as ready to review. There's a lot of testing and configuration to be done but I think the functionality is pretty much where we need it to be.

@antonfirsov
Member

the functionality is pretty much where we need it to be.

We still need to figure out how much value there is in pooling; if it has no real effect, that's a game changer for the current implementation and API shape.

We also need metrics on throughput; the new memory access patterns may have an impact on cache utilization.

@JimBobSquarePants
Member Author

I’ll leave the memory profiling to the more proficient. I need to take a break, struggling with jet lag

@antonfirsov
Member

antonfirsov commented Jun 16, 2021

I need to take a break, struggling with jet lag

TBH it's also a very difficult time for me right now. There's plenty of work left IMO; I recommend slowing down and declaring this a marathon rather than a sprint, there's no good in doing it in a rush. I'll see what I can do on Saturday.

@saucecontrol
Contributor

saucecontrol commented Jun 16, 2021

The code is still pooling only 16*2=32MB, that's a tiny part of the 1.7GB peak. What we see here must be the result of building discontiguous buffers out of 2MB unmanaged memory blocks.

@antonfirsov Correct. I was attempting to explain why the profile of the initial PR showed much lower total VirtualAlloc numbers despite much higher peak and baseline memory. The switch to unmanaged makes every buffer allocation an actual allocation, giving a higher total VirtualAlloc. But those allocations are released immediately, keeping the instantaneous committed memory lower.

When falling back to unmanaged allocations, keeping the 2MB discontiguous buffer strategy will be a negative perf-wise, in that AllocHGlobal is comparatively expensive. It would be better to keep those allocations contiguous, but it looks like that would take a pretty big refactor. More tuning is needed in either case.

I still have concerns about the reliability of SafeHandle finalizers. AFAIK a single unhandled exception in any user finalizer will prevent the finalizer queue from finishing

That's true, but only because an unhandled exception in a finalizer will crash the process. 😆

Your finalizers in particular will only be calling LocalFree (by way of Marshal.FreeHGlobal), which will always succeed provided the handle you give it was created by LocalAlloc, and GC.RemoveMemoryPressure, which only throws for invalid args.

@JimBobSquarePants
Member Author

JimBobSquarePants commented Jun 17, 2021

@JimBobSquarePants can we run an experiment trying to pool significantly more? (128MB, 512MB, 1024MB => maxArraysPerPoolBucket=64, 256, 512)

@antonfirsov I need to better understand what we want here. As I see it, the larger the pool, the more aggressively we will have to trim it, as it will end up holding onto memory again. With our new logic we'd end up adding 1024MB of arrays to that pool and almost nothing would go to unmanaged memory.

When falling back to unmanaged allocations, keeping the 2MB discontiguous buffer strategy will be a negative perf-wise, in that AllocHGlobal is comparatively expensive. It would be better to keep those allocations contiguous, but it looks like that would take a pretty big refactor. More tuning is needed in either case.

@saucecontrol Actually, it's pretty simple as I recall. I think if we simply adjusted MaxContiguousArrayLengthInBytes to a larger number then MemoryGroup<T>.Allocate would request larger buffers.

@saucecontrol
Contributor

saucecontrol commented Jun 17, 2021

I think if we simply adjusted MaxContiguousArrayLengthInBytes to a larger number then MemoryGroup<T>.Allocate would request larger buffers.

Yeah, but that would mean larger managed buffer requests too, which would then always fail, no? Would be a quick way to test whether going all unmanaged would be ok, though.

One more note on perf around that...

In addition to the fact that AllocHGlobal is expensive, allocating any finalizable object is also expensive. The GC's finalization queue is global and, therefore, takes a lock to update it when you new a finalizable object and when you SuppressFinalize, because these don't happen during a GC pause. That's in addition to it always taking the GC's slow allocation path rather than the fast bump allocator usually used for small objects. It's worth looking into whether those things are in part responsible for the rougher CPU usage graph in the latest trace. Of course bigger chunks would mitigate that as well, so that might show up after a quick change to MaxContiguousArrayLengthInBytes.

Correction: SuppressFinalize only updates the marker bit in the object header. It's removed from the finalization queue during a GC pause.

@JimBobSquarePants
Member Author

Yeah, but that would mean larger managed buffer requests too, which would then always fail, no? Would be a quick way to test whether going all unmanaged would be ok, though.

Ah no. If the requested amount is larger than the pool maximum of 2MB or if the pool is exhausted then we defer to unmanaged buffers.

T[] buffer = null;
int index = SelectBucketIndex(minimumLength);
if (index < this.buckets.Length)
{
    // Search for an array starting at the 'index' bucket. If the bucket is empty, bump up to the
    // next higher bucket and try that one, but only try at most a few buckets.
    const int maxBucketsToTry = 2;
    int i = index;
    do
    {
        // Attempt to rent from the bucket. If we get a buffer from it, return it.
        buffer = this.buckets[i].Rent();
        if (buffer != null)
        {
            if (log.IsEnabled())
            {
                log.BufferRented(buffer.GetHashCode(), buffer.Length, this.Id, this.buckets[i].Id);
            }

            return buffer;
        }
    }
    while (++i < this.buckets.Length && i != index + maxBucketsToTry);
}

// We were unable to return a buffer.
// This can happen for two reasons:
// 1: The pool was exhausted for this buffer size.
// 2: The request was for a size too large for the pool.
// We should now log this. We use the conventional allocation logging since we will
// be advising the GC of the subsequent unmanaged allocation.
if (log.IsEnabled())
{
    const int bufferId = -1;

    // 'buffer' is null at this point, so log the requested length rather than buffer.Length.
    log.BufferRented(bufferId, minimumLength, this.Id, ArrayPoolEventSource.NoBucketId);

    ArrayPoolEventSource.BufferAllocatedReason reason = index >= this.buckets.Length
        ? ArrayPoolEventSource.BufferAllocatedReason.OverMaximumSize
        : ArrayPoolEventSource.BufferAllocatedReason.PoolExhausted;

    log.BufferAllocated(
        bufferId,
        minimumLength,
        this.Id,
        ArrayPoolEventSource.NoBucketId,
        reason);
}

// Return the null buffer.
// Our calling allocator will check for this and use unmanaged memory instead.
return buffer;

byte[] array = pool.Rent(bufferSizeInBytes);

// Our custom GC-aware pool differs from the normal one: it will return null
// if the pool is exhausted or the buffer is too large.
if (array != null)
{
    memory = new Buffer<T>(array, length, pool);
}
else
{
    // Use an unmanaged buffer to prevent LOH fragmentation.
    memory = new UnmanagedBuffer<T>(length);
}

if (options == AllocationOptions.Clean)
{
    memory.GetSpan().Clear();
}

return memory;

@saucecontrol
Contributor

saucecontrol commented Jun 17, 2021

Yeah, that's what I meant. MemoryGroup<T>.Allocate would request chunks > 2MB, which can never be served by a managed bucket, so it would always fall through to unmanaged. As it is now, it will consume as many 2MB managed chunks as it can before falling through to unmanaged (this may or may not actually be a good thing -- more profiling is needed).

What I was picturing was something that requested 2MB (or whatever you pick for your max managed chunk size) at a time from the managed pool, and then when that returns null, request all the rest in one unmanaged allocation. That would require moving that abstraction out of the allocator, though.
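
A rough sketch of that strategy, under the assumption that the group allocator can ask the pool for fixed-size chunks and then put whatever remains into a single, exactly-sized unmanaged block. The delegates stand in for the allocator's real entry points; nothing here is the PR's actual API.

using System;
using System.Buffers;
using System.Collections.Generic;

internal static class HybridGroupAllocator
{
    // rentPooled returns null when the pool cannot serve a chunk; allocateUnmanaged
    // must always succeed (or throw). Both are stand-ins for the real allocator calls.
    public static List<IMemoryOwner<T>> Allocate<T>(
        long totalLength,
        int chunkLength,
        Func<int, IMemoryOwner<T>> rentPooled,
        Func<long, IMemoryOwner<T>> allocateUnmanaged)
        where T : struct
    {
        var owners = new List<IMemoryOwner<T>>();
        long remaining = totalLength;

        // Consume uniform pooled chunks while the pool can serve them.
        while (remaining >= chunkLength)
        {
            IMemoryOwner<T> pooled = rentPooled(chunkLength);
            if (pooled is null)
            {
                break; // Pool exhausted: fall through to a single unmanaged block.
            }

            owners.Add(pooled);
            remaining -= chunkLength;
        }

        if (remaining > 0)
        {
            // Size the unmanaged allocation exactly to what is still needed.
            owners.Add(allocateUnmanaged(remaining));
        }

        return owners;
    }
}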

@JimBobSquarePants
Member Author

What I was picturing was something that requested 2MB (or whatever you pick for your max managed chunk size) at a time from the managed pool, and then when that returns null, request all the rest in one unmanaged allocation. That would require moving that abstraction out of the allocator, though.

Yeah, that gets a bit iffy. I'd really like to keep everything there.

Here's what happens if I change the contiguous length to 24MB

[profiler screenshot]

Vs the current PR state
[profiler screenshot]

Looks like CPU takes a hit there.

@saucecontrol
Contributor

saucecontrol commented Jun 17, 2021

Ouch. Yeah, 24MB is too chunky. You'd allocate 48MB when 25MB is requested, which shows in your total VirtualAlloc number jumping way up again. That's why it would be better for the MemoryGroup to know it's making an unmanaged 'rental' so it can size it exactly to what it needs.

There's a balance there somewhere. Unfortunately it'll be a lot of testing.

@JimBobSquarePants
Member Author

That's why it would be better for the MemoryGroup to know it's making an unmanaged 'rental' so it can size it exactly to what it needs.

I think we can expose the required properties via the allocator interface easily enough. We're breaking it anyway.

Comment on lines +330 to +341
bool lockTaken = false, allocateBuffer = false;
try
{
    this.spinLock.Enter(ref lockTaken);

    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
        allocateBuffer = buffer == null;
    }
}
Contributor

Calculating whether buffer was null can be done outside of the lock scope: if (buffer == null) { }

Suggested change
bool lockTaken = false, allocateBuffer = false;
try
{
    this.spinLock.Enter(ref lockTaken);
    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
        allocateBuffer = buffer == null;
    }
}

bool lockTaken = false;
try
{
    this.spinLock.Enter(ref lockTaken);
    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
    }
}


Contributor

@br3aker br3aker Jun 17, 2021

Note: I don't know if you can suggest changes from different places, so do not commit this; line 353 must be changed to if (buffer == null) for this to work.

Contributor

@JimBobSquarePants hm, I may be wrong of course, but we can easily do that outside of the lock scope, no? That comparison is nothing, but with spinlocks it's better to be fast than asleep :P

// for that slot, in which case we should do so now.
if (allocateBuffer)
{
    if (this.index == 0)
Contributor

@br3aker br3aker Jun 17, 2021

This is not thread safe; the this.index variable is shared state that must only be altered inside the locked scope.

this.index would never be 0 in this piece of code. If the bucket is full (i.e. no buffers are rented), this.index would be equal to zero, but after exactly one rent in the locked scope this.index would be incremented; thus at the if (this.index == 0) line, index would always be >= 1.

Contributor

This can be fixed like this:

bool lockTaken = false;
int takenIndex = -1;
try
{
    this.spinLock.Enter(ref lockTaken);

    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index] = null;
        takenIndex = this.index++;
    }
}
finally
{
    if (lockTaken)
    {
        this.spinLock.Exit(false);
    }
}

if (buffer == null)
{
    if (takenIndex == 0)
    {
        // Stash the time the first item was added.
        this.firstItemMS = (uint)Environment.TickCount;
    }
// ...

Member Author

Good catch. It's also in the wrong method. Will update.

@br3aker
Contributor

br3aker commented Jun 17, 2021

2D buffer renting logic can be optimized.

From Buffer2D<T>.Allocate:

// ...
var buffers = new IMemoryOwner<T>[bufferCount];
for (int i = 0; i < buffers.Length - 1; i++)
{
    buffers[i] = allocator.Allocate<T>(bufferLength, options);
}

if (bufferCount > 0)
{
    buffers[buffers.Length - 1] = allocator.Allocate<T>(sizeOfLastBuffer, options);
}
// ....

If we try to allocate 2D buffers for several images from separate threads at the same time, those threads would be paused on the spinlock with extremely fast response times, which can lead to LOH fragmentation, as we don't know which image/resource will be freed in what order and at what time:

Bucket:
[0]: free
[1]: free
[2]: free
[3]: free
[4]: free
...

Thread1 requests 2 buffers
Thread2 requests 3 buffers, let's say thread2 wins:

time quant 1
Bucket:
[0]: thread2 taken
[1]: free
[2]: free
[3]: free
[4]: free
...
// spin locks can be very fast considering nothing big happens inside locked code
// while thread2 is doing some bureaucracy returning from the allocator method & calling it again with some checks,
// thread1 could easily have taken the lock, which can lead to allocator fragmentation
time quant 2
Bucket:
[0]: thread2 taken
[1]: thread1 taken
[2]: free
[3]: free
[4]: free
...
// possible example
time quant 3-5
Bucket:
[0]: thread2 taken
[1]: thread1 taken
[2]: thread2 taken
[3]: thread1 taken
[4]: thread2 taken
...

And that's only 2 threads, add 8 or even 16 cores, async/await - boom.

Solution proposal: the ability to rent a number of buffers in a single call. This can be easily implemented in the custom GCAwareConfigurableArrayPool, but the .NET TLS pool isn't designed to support it, so that one will operate the same way as it does now, unfortunately. This would still be very fast, and the spinlocks won't be hurt. The renting logic can be simplified to a bucket buffer index increase; with a multi-buffer rent this can be as simple as bucket.nextBufferIndex += numberOfBuffersToRent, with some checks. If we requested 10 buffers but got only 5 because the bucket was exhausted during the process, we can allocate the remaining slots on the fly, and those buffers will participate at the returning stage. This might not be a good thing - our buffer is then jagged, with two halves from the pool and some random LOH memory, and returning them as a full chunk could create fragmented space in the bucket.

This could also decrease fragmentation at the bucket's first allocation. The current implementation allocates new buffers if they were rented as null; batch renting would lead to consecutive allocations, which have a higher chance of being contiguous memory, which might also help with fragmentation. (A rough sketch follows below.)
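
A minimal sketch of the batch-rent idea: take all requested slots under a single lock acquisition so buffers handed to one caller stay adjacent in the bucket. The type and field names echo the discussion but are illustrative.

using System.Threading;

internal sealed class Bucket
{
    private readonly byte[][] buffers;
    private SpinLock spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private int index;

    public Bucket(int capacity) => this.buffers = new byte[capacity][];

    public int RentBatch(byte[][] destination, int count)
    {
        bool lockTaken = false;
        int rented = 0;
        try
        {
            this.spinLock.Enter(ref lockTaken);

            // Hand out consecutive slots in one go instead of one lock round-trip per buffer.
            while (rented < count && this.index < this.buffers.Length)
            {
                destination[rented++] = this.buffers[this.index];
                this.buffers[this.index++] = null;
            }
        }
        finally
        {
            if (lockTaken)
            {
                this.spinLock.Exit(false);
            }
        }

        // The caller allocates for any slots that came back null (never filled yet)
        // and for the (count - rented) buffers the bucket could not supply.
        return rented;
    }
}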


This can be demonstrated with some random 4K image.
Custom log from MemoryGroup.Allocate (allocations smaller than 2MB are omitted):

Allocating 7478016 bytes via 4 buffers
Allocating 14755840 bytes via 8 buffers

Even 2 threads with the same images would mess up the ordering by a lot:
[screenshot of logged thread ids]
Note: this is an array of which buffer was taken by which thread via Thread.ManagedThreadId.
The proposed solution should change this behaviour to something like: 5 5 5 5 6 6 6 6 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6. But the game changer is that they are returned in batches.

P.S.
Sorry for a giant wall of text, couldn't come up with something more compact :(

@JimBobSquarePants JimBobSquarePants changed the title More Efficient MemoryAllocator WIP. More Efficient MemoryAllocator Jun 18, 2021
@JimBobSquarePants
Member Author

I definitely broke something in .NET FX when I removed IManagedByteBuffer. The tests freeze up and never report completion. Will revert to the commit before that and try again.

@JimBobSquarePants JimBobSquarePants marked this pull request as draft June 18, 2021 00:13
Member

@antonfirsov antonfirsov left a comment

Don't get me wrong, this is amazing work; it's great to see proof that unmanaged allocations can reduce memory peaks by more than 50%. However, this is a massive change in the core engine, and the current code and the status of validation are still very far from a state I would consider mergeable.

Here is the list of major problems we need to solve:

  1. As a start, we need to set up systematic benchmarking, which can give us the following metrics for any configuration in a quantitative/comparable manner, enabling data-driven decision making to shape the design and fine-tune the final perf parameters:
  • Peak memory usage
  • Average memory usage
  • CPU utilization
  • Throughput (~time taken to process all the bee heads)

We should be able to put these values into a table so we can easily compare benchmark results for different parameter sets.

  2. With the current tuning, there is almost no pooling happening when it comes to large buffers, meaning that the current PR behavior does not justify the presence of the complex array pooling & trimming logic and the ArrayPoolMemoryAllocator name / API shape. Intuition says that we need some pooling, but ATM we have no idea how much. We need to see how the allocator works with pooling disabled, plus with a list of more aggressive pooling setups.

With our new logic we'd end up adding 1024MB of arrays to that pool and almost nothing would go to unmanaged memory.

I did not propose 1024 as a final value. It is an extreme config in general, but not for the bee heads benchmark, where the memory usage seems to fluctuate around 1.2GB. I'm also more curious about less aggressive, more reasonable pooling settings, but I see no reason for not getting all the data points that help us understand the whole picture and make better decisions.

  3. As pointed out in WIP. More Efficient MemoryAllocator #1660 (comment), we need different sizes for the contiguous blocks coming from pooled arrays VS unmanaged buffers. There are several ways to implement this; probably the most straightforward is to move the code in MemoryGroup.Allocate to a virtual method on MemoryAllocator that can be overridden, and use ideas from WIP. More Efficient MemoryAllocator #1660 (comment) in the override.

  4. Using the ArrayPool<T> abstraction and the entire GCAwareConfigurableArrayPool<T> class to pool 2MB buffers doesn't bring us value, since we are utilizing only one bucket of that pool. We should implement a custom pool class dedicated to the concern of pooling uniform arrays (see the sketch after this comment). It should be relatively easy to refactor it from GCAwareConfigurableArrayPool<T>.Bucket.

  5. We should periodically trim the pool by a certain factor even when there is low memory pressure. This would address the concern of retaining memory unnecessarily (pointed out in General discussion about memory management #1590 and other user complaints), and also enable us to pool much more when there is high load.

  6. We need extensive test coverage to validate our assumptions regarding the utilization of the pools and unmanaged buffers. We also need to test trimming. Unfortunately, it would be too expensive to do with regular Xunit runs, but we can define local-only tests that deliver & validate the logs/metrics proving the trimming is happening in the way we expect.

I understand this is an enormous amount of work, but that has always been the case for all the previous PRs refactoring the memory management engine. I don't see a reason to lower our quality criteria in this case and omit any of the points above, especially since we are about to introduce a breaking change and we want to prevent future breaking changes. Personally, I want to start working on point 1, but even that alone may take several evenings, which makes it very hard to give an ETA. If we feel the allocator work may block V1.1 for too long, we should focus on #1597 first, since it will bring an even bigger improvement at a lower development cost.
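
To make points 4 and 5 concrete, a rough sketch of a pool dedicated to one uniform array size that is trimmed by a fixed factor on a timer, rather than only under memory pressure. Every name and the trim policy here are illustrative assumptions, not the final design.

using System;
using System.Threading;

internal sealed class UniformByteArrayPool : IDisposable
{
    private readonly int arrayLength;
    private readonly byte[][] arrays;
    private readonly Timer trimTimer;
    private SpinLock spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private int count; // Number of arrays currently retained by the pool.

    public UniformByteArrayPool(int arrayLength, int capacity, TimeSpan trimPeriod)
    {
        this.arrayLength = arrayLength;
        this.arrays = new byte[capacity][];
        this.trimTimer = new Timer(_ => this.Trim(0.5), null, trimPeriod, trimPeriod);
    }

    // Returns null when the pool has nothing retained; the caller then allocates
    // (or goes unmanaged) and hands the array back via Return when done.
    public byte[] Rent()
    {
        bool lockTaken = false;
        try
        {
            this.spinLock.Enter(ref lockTaken);
            if (this.count > 0)
            {
                byte[] array = this.arrays[--this.count];
                this.arrays[this.count] = null;
                return array;
            }
        }
        finally
        {
            if (lockTaken) this.spinLock.Exit(false);
        }

        return null;
    }

    public void Return(byte[] array)
    {
        if (array.Length != this.arrayLength) return; // Only uniform arrays are pooled.

        bool lockTaken = false;
        try
        {
            this.spinLock.Enter(ref lockTaken);
            if (this.count < this.arrays.Length)
            {
                this.arrays[this.count++] = array;
            }
        }
        finally
        {
            if (lockTaken) this.spinLock.Exit(false);
        }
    }

    // Drop a fraction of the retained arrays even when there is no memory pressure.
    private void Trim(double factor)
    {
        bool lockTaken = false;
        try
        {
            this.spinLock.Enter(ref lockTaken);
            int target = (int)(this.count * (1 - factor));
            while (this.count > target)
            {
                this.arrays[--this.count] = null;
            }
        }
        finally
        {
            if (lockTaken) this.spinLock.Exit(false);
        }
    }

    public void Dispose() => this.trimTimer.Dispose();
}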

@JimBobSquarePants
Member Author

@antonfirsov I actually agree with all the above. I'm trying to run before I can walk without the relevant experience, and suffering as a result. #1597 should immediately benefit us for V1.1.

What I'm actually going to do is close this and instead introduce a few smaller PRs to do some sanitation work which will allow the allocator changes to be made more easily.

  • Rename all GetSpan/GetMemory methods that require a single span to add a Dangerous prefix, and remove all usage that is not required.
  • Multiple PRs to migrate any calls that use AllocateManagedByteBuffer to use Allocate<byte> (turns out that isn't the breaking thing; the cause is still undetermined).
  • Optimize the Gif encoder to allow us to clear and reuse a cache to save memory churn in multiframe images.

@antonfirsov
Member

I spent some time today trying to figure out how to get the desired metrics out of .ETL files, but I realized there is no easy way that is worth the effort. This should not block systematic comparison, but it will make it even more of a chore 😞

@JimBobSquarePants JimBobSquarePants deleted the js/memory-experiments branch May 21, 2022 12:22