Block & array hybrid storage #2738
Conversation
I'm not sure that we should mention this in release notes or documentation for this release. By default the cache is turned off and this change should be totally transparent for end users. I'd like to have it as an experimental feature and add a public resource manager in the next release (4.4).
# Conflicts:
#   libImaging/Imaging.h
allocate block for wider lines
Finally all green and almost done.
libImaging/Storage.c
Outdated
    block = arena->blocks[arena->blocks_cached];
    // Reallocate if needed
    if (block.size != requested_size){
        block.ptr = realloc(block.ptr, requested_size);
Is this likely to be realloc'ing on every block, unless you're processing a bunch of images of identical bit width?
I believe there are three possible results of reallocation:
- on reduction: most implementations will just decrease several numbers in metadata without an actual reallocation
- on expansion when there is enough space: as above
- on expansion when there is no space: an actual reallocation at a new address

In theory, the longer the program is running, the fewer actual relocations should occur, because after several reallocations all blocks will have enough space ahead. According to my tests, when blocks_max is large enough, only a small fraction of reuses leads to an actual reallocation (when the new ptr != old ptr). The situation should improve with more intelligent block selection, not just taking the topmost block.
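As an aside, the reuse pattern under discussion can be sketched in isolation like this (a standalone toy, not the PR's code; the block sizes are made up) to count how often realloc actually moves a block:

    /* Toy sketch: reuse one "cached" block across several requested
       sizes and count how often realloc had to move it. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t sizes[] = {900 * 1024, 1024 * 1024, 950 * 1024, 1024 * 1024};
        void *cached = malloc(sizes[0]);      /* pretend this block sits in the cache */
        size_t cached_size = sizes[0];
        int moved = 0;

        for (int i = 1; i < 4; i++) {
            if (cached_size != sizes[i]) {    /* reallocate only if the size differs */
                void *p = realloc(cached, sizes[i]);
                if (p == NULL) {              /* on failure the old block stays valid */
                    free(cached);
                    return 1;
                }
                if (p != cached)
                    moved++;                  /* "actual reallocation at a new address" */
                cached = p;
                cached_size = sizes[i];
            }
        }
        printf("blocks moved: %d of 3 reuses\n", moved);
        free(cached);
        return 0;
    }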
I learned a lot about the Linux memory model these days. For example, there are two mechanisms in Linux for allocation: mmap and sbrk. And finally I found something really interesting in the libc sources.
So, as discovered above, in some conditions libc doesn't use mmap for blocks up to 32 megabytes and can reuse them without returning them to the system. Still investigating.
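To see that reuse behaviour in isolation, here is a small glibc-specific experiment I'd expect to show it (a sketch, not from the PR; the 32MB value mirrors the number above):

    /* glibc-specific: raise the mmap threshold so a multi-megabyte
       block is served from the heap (sbrk) and can be handed back on
       the next malloc without a round-trip to the kernel. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD (glibc) */

    int main(void) {
        mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);  /* keep blocks <= 32MB off mmap */

        void *a = malloc(16 * 1024 * 1024);
        printf("first block:  %p\n", a);
        free(a);                            /* likely kept by the allocator, not the kernel */

        void *b = malloc(16 * 1024 * 1024);
        printf("second block: %p\n", b);    /* often the same address: the block was reused */
        free(b);
        return 0;
    }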
Ok, looks like I finally understood everything. From the list of libc malloc variables there are only two significant for our case. Libc won't release freed memory and will reuse it if two conditions are met.
By default, both variables are set to 128Kb, but as said in the documentation, they are adjusted dynamically. I found only one place in the sources where this happens. So, let's check this.

A. master, -s2560x1639 (array storage). As you can see, the result doesn't match the original comment. This is because there are no scale tests. Indeed, the scale tests bumped the thresholds, which can be simulated by allocating and freeing one large temporary object:

    _tmp = b'0123456789' * int(30*1024*1024*0.1)  # hold ~30MB for a moment
    _tmp = 0                                      # release it; libc keeps the memory

D. master, -s2560x1639 (array storage). How about setting the local environment variables manually?

All three methods work for master and array storage. To be continued...
What about block & array hybrid storage? I chose a block size of 1Mb because I think it is large enough to avoid too frequent malloc/free calls and small enough to be allocated even with high memory fragmentation. The C and D cases also bump the thresholds. There are several possible solutions I see.
We shouldn't worry about memory fragmentation, because I'm going to implement a retry with the smallest possible block size (one page, 4Kb) if a large block allocation fails (very much like it was before, when array storage was used when block storage failed); see the sketch below.
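A minimal sketch of that retry idea might look like this (hypothetical names and constants, not the PR's actual implementation):

    /* Try the preferred block size first; if that fails under memory
       pressure, fall back to the smallest block that holds one line. */
    #include <stdlib.h>

    #define PREFERRED_BLOCK_SIZE (1024 * 1024)  /* 1Mb */
    #define PAGE_SIZE 4096                      /* 4Kb, smallest fallback block */

    static void *allocate_block(size_t line_size, size_t *out_size) {
        /* A block never holds less than one full line. */
        size_t size = PREFERRED_BLOCK_SIZE > line_size ? PREFERRED_BLOCK_SIZE : line_size;
        void *ptr = malloc(size);
        if (ptr == NULL) {
            /* Retry with one page, or one line if lines are wider. */
            size = line_size > PAGE_SIZE ? line_size : PAGE_SIZE;
            ptr = malloc(size);
        }
        if (ptr != NULL)
            *out_size = size;
        return ptr;
    }

    int main(void) {
        size_t got = 0;
        void *block = allocate_block(5120 * 4, &got);  /* one RGBA line, 5120px wide */
        free(block);
        return got == 0;
    }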
# Conflicts:
#   docs/releasenotes/4.3.0.rst
increase default block size to previous value
docs/releasenotes/4.3.0.rst
Outdated
    have been removed.

    The ``PIL.Image.core.getcount`` methods have been removed, use
    ``PIL.Image.core.get_stats()['new_count']`` property instead.
You've accidentally removed this chunk of the release notes.
No, it was moved to Core Image API Changes section:
https://github.com/python-pillow/Pillow/pull/2738/files#diff-28d4b8f87f5b7e23cf7f62b2e65d2d4d
Okay, thanks, my mistake
take alignment into account when calculating lines_per_block
I'm finally happy with this. In its current form the PR doesn't introduce any known performance regressions and also provides a rich API for memory management on all levels. It is even a bit faster for large images (like 5120x3200).
libImaging/Storage.c
Outdated
    break;
    if (line_in_block == 0) {
        int required;
        int lines_remained = lines_per_block;
I think this is better named lines_remaining, or lines_requested, or current_block_lines.
Done
So the invariants here are:
If you mean
I believe that for long-living applications where a big enough image (4Mpx) was created at least once, memory consumption will be exactly the same, including the held memory.
For some reason one of the Docker builds sometimes fails.
We're tracking the Docker problem in #2758.
libImaging/Imaging.h
Outdated
    int alignment;      /* Alignment in memory of each line of an image */
    int block_size;     /* Preferred block size */
    int blocks_max;     /* Maximum number of cached blocks */
    int blocks_cached;  /* Current number of block not accociated with images */
blocks not associated
Thanks! Fixed
libImaging/Imaging.h
Outdated
    int stats_new_count;           /* Number of new allocated images */
    int stats_allocated_blocks;    /* Number of allocated blocks */
    int stats_reused_blocks;       /* Number of blocks which was retrieved from pool */
    int stats_reallocated_blocks;  /* Number of blocks which was actually reallocated after retrieving */
were
fix comments
I think that this should probably be called out in the release notes, if only to say that large images are now allocated in a block manner. We don't think that there's a performance regression, but we don't really test everywhere that Pillow is used, especially on small memory devices. I'd also like to include some documentation, at least as it relates to the environment variables. And, I think that we should include a bunch of this thread in an area of the docs for design decisions. Promoting them will help them be useful to future hackers and prevent them from being buried in GitHub threads.
@wiredfool That is fair enough. Do you need any help from me?
Don't think so. The block size is still the same as the old threshold to go to line by line, right?
Yes. The previous THRESHOLD was
First of all, this is just a prototype. (No longer!) Memory allocation from the system is slow. It can take up to 3.6x more time to access just-allocated memory than to access memory which already belongs to the application and was accessed earlier. By allocations from the system I mean system calls to mmap or its analogues on other operating systems.

C example and results:

On Ubuntu 16 with high memory:
On Ubuntu 16 with low memory (512M):
On MacOS 10.11:
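The C example itself didn't survive extraction, but a minimal sketch of the kind of benchmark described could look like this (the SIZE constant is an assumption, though the next paragraph refers to it):

    /* Sketch: allocate, touch, free — twice. The second round shows
       whether memory came back from libc (cheap) or the kernel (slow). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define SIZE (100 * 1024 * 1024)   /* assumed value; try 10MB to see libc reuse */

    static double timed_touch(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        char *mem = malloc(SIZE);
        if (mem == NULL)
            exit(1);
        memset(mem, 1, SIZE);          /* fault in / touch every page */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        free(mem);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void) {
        printf("first round:  %.3fs\n", timed_touch());  /* memory comes from the system */
        printf("second round: %.3fs\n", timed_touch());  /* may reuse memory libc kept */
        return 0;
    }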
Memory allocators (like libc's) sometimes try to predict memory usage patterns to avoid extra allocations from the system. They may not actually return the memory to the system on a free call and instead reuse the same memory on the next malloc. For example, if you reduce SIZE in the sample above to 10 megabytes, access to the new memory will cost the same as access to the used memory on Ubuntu. But this optimization only works in a few cases; with MALLOC_TRIM_THRESHOLD_=-1 set, the allocator drops the optimization.
, allocator drops optimization.Pillow allocates and frees memory a lot. Almost every operation requires to recreate the bitmap. In general this is not good. If I'd write Pillow from the scratch, I'd try to do as much as possible operations (like conversion, composition, filtering) in-place. But this is huge task which requires redesign all internal and external APIs.
So how do we avoid the slowdown on reallocations? Someone once told me that one of the following is always true: either you allocate a huge amount of memory sporadically and relatively rarely, and you shouldn't worry about the time spent on allocations; or you constantly need a huge amount of memory, and then you should just not release it. And so here I am.
As you probably know, there are two image allocators in Pillow: ImagingAllocateBlock and ImagingAllocateArray. The first works for images with < 16 megabytes of data and allocates one large chunk of memory, im->linesize * im->ysize bytes. The second works for large images and allocates a small piece of memory for each line. In both cases the allocated memory size is tied to the image size, which is not good for reusing memory. Additionally, this is a very sharp transition between storage types which can lead to unpredictable performance penalties. So I've reimplemented ImagingAllocateArray so that it allocates a chain of relatively large blocks; every block can be used and reused to store images of any size, as the sketch below shows. As I said, this is just a prototype, which should only show the opportunities of this approach.
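A simplified sketch of that scheme (hypothetical names, not the PR's actual code) might look like:

    /* Image lines are carved out of a chain of fixed-size blocks, so
       block sizes no longer depend on image size and freed blocks can
       be reused for any image. */
    #include <stdlib.h>

    #define BLOCK_SIZE (1024 * 1024)   /* every block has the same preferred size */

    static char **allocate_lines(int ysize, int linesize) {
        char **lines = malloc(ysize * sizeof(char *));
        if (lines == NULL)
            return NULL;

        int lines_per_block = BLOCK_SIZE / linesize;
        if (lines_per_block < 1)
            lines_per_block = 1;               /* very wide lines: one line per block */

        for (int y = 0; y < ysize; y += lines_per_block) {
            int lines_in_block = ysize - y;
            if (lines_in_block > lines_per_block)
                lines_in_block = lines_per_block;
            /* In the real allocator this block would be taken from the
               arena cache when possible, not always freshly malloc'ed. */
            char *block = malloc((size_t)lines_in_block * linesize);
            if (block == NULL)
                return NULL;                   /* cleanup elided for brevity */
            for (int i = 0; i < lines_in_block; i++)
                lines[y + i] = block + (size_t)i * linesize;
        }
        return lines;
    }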
To show how it works, I'll test Pillow-SIMD with the uploadcare:simd/4.3-demo branch. Tests were performed using pillow-perf/testsuite/ with the following command line:

    $ time ./run.py scale filter convert composition rotate_right -n21
Options:
So, what is going on:
master, -s2560x1638
One large block, most allocations are cached by libc. The only exception probably is the scaling to 5478x3505.
master, -s2560x1639
No significant changes, 0-10% slower due to the different storage model. The only exception is Composition, which is fastest in this configuration.
this, -s2560x1638, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=0
Libc doesn't cache this pattern for some reason. System time is bigger than ever, because almost all memory comes from the system.
this, -s2560x1638, MEMORY_BLOCK_SIZE=4MB, MEMORY_MAX_BLOCKS=0
It's magic: libc caches memory again. System time is even less than for master, because scaling to 5478x3505 is also partially cached (which is noticeable from this operation's time). Some operations are still slower than in master. If we want to keep the libc cache, this requires further investigation.
this, -s2560x1638, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=256
All memory is cached by the application. System time is near zero; almost all operations have the same speed as master or are even faster. Scaling to 5478x3505 is up to 90% faster! Rotate->flip is the only one which is noticeably slower.
Ok, that was images which originally fit in the libc cache. What about large images?
Options:
Sys time is about 1%. The most important thing: the performance of almost all operations on large images is almost equal to their performance on medium images.
So I believe this is a huge win. There are lots of things left to do though. Here are some of them: