Block & array hybrid storage #2738

Merged (37 commits), Oct 1, 2017

Conversation

@homm (Member) commented Sep 16, 2017

First of all, this is just a prototype. (Update: no longer.)

Memory allocation from the system is slow. It can take up to 3.6x more time to access just-allocated memory than memory which already belongs to the application and was accessed earlier. By allocation from the system I mean system calls to mmap or its analogues on other operating systems.

C example and results
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define SIZE 50 * 1024 * 1024


int main(int argc, char *argv[]) {
  int i;
  int *p;
  int sum = 0;
  clock_t start;
  /* accumulators must start at zero, otherwise += below is undefined */
  float clock_alloc = 0, clock_new = 0, clock_used = 0;

  for (i = 0; i < 100; i++) {
    /* time the allocation call itself */
    start = clock();
    p = (int *) calloc(1, SIZE);
    clock_alloc += (float)(clock() - start) / CLOCKS_PER_SEC;

    /* first touch: the pages are actually mapped in on write */
    start = clock();
    memset(p, 0, SIZE);
    clock_new += (float)(clock() - start) / CLOCKS_PER_SEC;
    sum += p[512];

    /* second pass over memory that already belongs to the process */
    start = clock();
    memset(p, 0, SIZE);
    clock_used += (float)(clock() - start) / CLOCKS_PER_SEC;
    sum += p[512];

    free(p);
  }
  printf("sum %d\n", sum);
  printf("alloc: %f\nnew %f\nused %f\n", clock_alloc, clock_new, clock_used);
  return 0;
}
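
For reference, the program can be built and run with any C compiler, e.g. (the file name here is my own choice):

$ cc alloc_bench.c -o alloc_bench && ./alloc_bench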

On Ubuntu 16 with plenty of free memory:

alloc: 0.006105
new 0.513503
used 0.224562

On Ubuntu 16 with low memory (512M):

alloc: 0.021838
new 1.585902
used 0.595298

On MacOS 10.11:

alloc: 0.334999
new 0.620930
used 0.264800

Memory allocators (like libc's) sometimes try to predict memory usage patterns to avoid extra allocations from the system. They may not actually return memory to the system on a free call and instead reuse the same memory on the next malloc. For example, if you reduce SIZE in the sample above to 10 megabytes, access to new memory costs the same as access to used memory on Ubuntu. But this optimization only works in a few cases:

  • Only for a limited amount of memory. On my system the optimization only works when the allocated block is smaller than 32 megabytes.
  • Only with the default Ubuntu allocator. It doesn't work with jemalloc or tcmalloc, and it doesn't work on macOS either.
  • Only when no allocator options are set manually. For example, if I set MALLOC_TRIM_THRESHOLD_=-1, the allocator drops this optimization (see the sketch right after this list).
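
Not part of the original comment, just an illustration: the same thresholds can also be pinned from C with glibc's mallopt(), which, like the MALLOC_*_ environment variables, disables their dynamic adjustment. The function name and the values below are examples only.

#include <malloc.h>

/* Glibc-specific sketch: pin the malloc thresholds so that large blocks stay
   on the sbrk heap and freed heap memory is not trimmed back to the system. */
int pin_malloc_thresholds(void) {
    int ok = 1;
    /* serve blocks up to 32 MB from the heap instead of mmap */
    ok &= mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);
    /* keep up to 64 MB of freed heap memory instead of returning it */
    ok &= mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);
    return ok;  /* mallopt() returns 1 on success */
}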

Pillow allocates and frees memory a lot. Almost every operation requires recreating the bitmap. In general this is not good. If I were writing Pillow from scratch, I'd try to do as many operations as possible (like conversion, composition, filtering) in place. But that is a huge task which would require redesigning all internal and external APIs.

So how do we avoid the slowdown on reallocations? Someone once told me that one of the following is always true: either you allocate a huge amount of memory sporadically and relatively rarely, and then you shouldn't worry about the time spent on allocations; or you constantly need a huge amount of memory, in which case just don't release it. And so here I am.

As you probably know, there are two image allocators in Pillow: ImagingAllocateBlock and ImagingAllocateArray. The first is used for images with less than 16 megabytes of data and allocates one large chunk of memory of im->linesize * im->ysize bytes. The second is used for larger images and allocates a small piece of memory for each line. In both cases the allocated memory size is tied to the image size, which is bad for reusing memory. Additionally, the transition between the two storage types is very sharp, which can lead to unpredictable performance penalties. So I've reimplemented ImagingAllocateArray so that it allocates a chain of relatively large blocks. Every block can be used and reused to store images of any size.
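
To make the idea more concrete, here is a simplified sketch of such a chain (my own names and layout, not the actual code from this PR): lines are carved out of fixed-size blocks, so the sizes the allocator sees no longer depend on the image size, and a freed block can back any future image.

#include <stdlib.h>

#define BLOCK_SIZE (1 * 1024 * 1024)  /* example value */

typedef struct {
    char **lines;       /* one pointer per image line */
    void **blocks;      /* chain of fixed-size blocks backing the lines */
    int blocks_count;
} HybridImage;

int allocate_lines(HybridImage *im, int ysize, int linesize) {
    int lines_per_block = BLOCK_SIZE / linesize;
    int blocks_needed, y;

    if (lines_per_block < 1)
        return 0;  /* lines wider than a block need separate handling */
    blocks_needed = (ysize + lines_per_block - 1) / lines_per_block;

    im->blocks_count = 0;
    im->lines = calloc(ysize, sizeof(char *));
    im->blocks = calloc(blocks_needed, sizeof(void *));
    if (!im->lines || !im->blocks)
        return 0;

    for (y = 0; y < ysize; y++) {
        if (y % lines_per_block == 0) {
            /* in the real implementation a cached block would be reused
               here and malloc would only be called on a cache miss */
            im->blocks[im->blocks_count] = malloc(BLOCK_SIZE);
            if (!im->blocks[im->blocks_count])
                return 0;
            im->blocks_count++;
        }
        im->lines[y] = (char *) im->blocks[im->blocks_count - 1]
                       + (y % lines_per_block) * linesize;
    }
    return 1;
}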

As I said, this is just a prototype, which is only meant to show the potential of this approach. To show how it works, I'll test Pillow-SIMD with the uploadcare:simd/4.3-demo branch. The tests were performed using pillow-perf/testsuite/ with the following command line:

$ time ./run.py scale filter convert composition rotate_right -n21

Options:

  • master, -s2560x1638 (maximum possible ImagingAllocateBlock image)
  • master, -s2560x1639 (minimal possible ImagingAllocateArray image)
  • this, -s2560x1638, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=0
  • this, -s2560x1638, MEMORY_BLOCK_SIZE=4MB, MEMORY_MAX_BLOCKS=0
  • this, -s2560x1638, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=256
Scale
                    2560x1638        2560x1639        1M, 0 cache      4M, 0 cache      1M, 256M cache
  26x16 bil       2422.24 Mpx/s    2265.53 Mpx/s    2285.33 Mpx/s    2367.78 Mpx/s    2273.81 Mpx/s
  26x16 bic       1363.72 Mpx/s    1301.00 Mpx/s    1317.74 Mpx/s    1354.06 Mpx/s    1323.69 Mpx/s
  26x16 lzs        757.32 Mpx/s     733.67 Mpx/s     746.93 Mpx/s     745.34 Mpx/s     745.22 Mpx/s
  320x205 bil     1734.85 Mpx/s    1692.50 Mpx/s    1624.60 Mpx/s    1655.49 Mpx/s    1622.20 Mpx/s
  320x205 bic     1100.62 Mpx/s    1070.87 Mpx/s    1051.22 Mpx/s    1065.93 Mpx/s    1039.78 Mpx/s
  320x205 lzs      655.80 Mpx/s     626.91 Mpx/s     635.24 Mpx/s     630.84 Mpx/s     629.06 Mpx/s
  2048x1310 bil    427.71 Mpx/s     423.18 Mpx/s     285.12 Mpx/s     352.44 Mpx/s     428.02 Mpx/s
  2048x1310 bic    324.81 Mpx/s     323.97 Mpx/s     240.43 Mpx/s     283.17 Mpx/s     324.73 Mpx/s
  2048x1310 lzs    244.10 Mpx/s     242.86 Mpx/s     193.53 Mpx/s     219.76 Mpx/s     246.10 Mpx/s
  5478x3505 bil     71.60 Mpx/s      71.22 Mpx/s      71.74 Mpx/s      89.41 Mpx/s     136.74 Mpx/s
  5478x3505 bic     60.72 Mpx/s      59.95 Mpx/s      58.38 Mpx/s      68.88 Mpx/s      96.61 Mpx/s
  5478x3505 lzs     48.61 Mpx/s      48.52 Mpx/s      46.13 Mpx/s      52.46 Mpx/s      67.44 Mpx/s

Filter
                    2560x1638        2560x1639        1M, 0 cache      4M, 0 cache      1M, 256M cache
  Smooth           554.82 Mpx/s     548.69 Mpx/s     410.26 Mpx/s     477.15 Mpx/s     565.05 Mpx/s
  Sharpen          555.19 Mpx/s     548.69 Mpx/s     409.62 Mpx/s     476.83 Mpx/s     565.15 Mpx/s
  Smooth More      233.19 Mpx/s     248.97 Mpx/s     197.17 Mpx/s     234.82 Mpx/s     250.90 Mpx/s

Convert
                    2560x1638        2560x1639        1M, 0 cache      4M, 0 cache      1M, 256M cache
  RGB to L        1915.47 Mpx/s    1904.61 Mpx/s    1919.87 Mpx/s    1909.44 Mpx/s    1922.59 Mpx/s
  RGBA to LA      1423.89 Mpx/s    1340.54 Mpx/s    1454.51 Mpx/s     965.09 Mpx/s    1513.20 Mpx/s
  RGBa to RGBA    1321.21 Mpx/s    1366.67 Mpx/s     799.81 Mpx/s     976.35 Mpx/s    1478.10 Mpx/s
  RGBA to RGBa    1424.35 Mpx/s    1331.21 Mpx/s    1440.10 Mpx/s    1070.28 Mpx/s    1357.51 Mpx/s

Composition
                    2560x1638        2560x1639        1M, 0 cache      4M, 0 cache      1M, 256M cache
  Composition      688.91 Mpx/s     804.44 Mpx/s     441.18 Mpx/s     528.05 Mpx/s     737.72 Mpx/s

Rotate 
                    2560x1638        2560x1639        1M, 0 cache      4M, 0 cache      1M, 256M cache
  Flop            1351.77 Mpx/s    1351.35 Mpx/s     862.28 Mpx/s    1233.29 Mpx/s    1466.64 Mpx/s
  Flip            2024.62 Mpx/s    1709.10 Mpx/s     849.53 Mpx/s    1453.42 Mpx/s    1788.30 Mpx/s
  Rotate 90        528.13 Mpx/s     510.02 Mpx/s     405.26 Mpx/s     534.44 Mpx/s     539.80 Mpx/s
  Rotate 180      1355.31 Mpx/s    1305.54 Mpx/s     744.15 Mpx/s    1157.10 Mpx/s    1428.28 Mpx/s
  Rotate 270       519.48 Mpx/s     497.97 Mpx/s     396.42 Mpx/s     525.94 Mpx/s     530.52 Mpx/s
  Transpose        549.66 Mpx/s     517.56 Mpx/s     403.59 Mpx/s     532.69 Mpx/s     535.89 Mpx/s
  Transverse       553.93 Mpx/s     523.24 Mpx/s     402.98 Mpx/s     527.39 Mpx/s     533.90 Mpx/s

                    2560x1638        2560x1639        1M, 0 cache      4M, 0 cache      1M, 256M cache
real                8.277 s          8.449 s          9.437 s          8.106 s          6.566 s
user                6.596 s          6.700 s          6.796 s          6.748 s          6.516 s
sys                 1.676 s          1.744 s          2.636 s          1.352 s          0.044 s

So, what is going on

master, -s2560x1638
One large block; most allocations are cached by libc. The only exception is probably the scaling to 5478x3505.

master, -s2560x1639
No significant changes; 0-10% slower due to the different storage model. The only exception is Composition, which is fastest in this configuration.

this, -s2560x1638, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=0
For some reason libc doesn't cache this allocation pattern. System time is higher than ever, because almost all memory comes from the system.

this, -s2560x1638, MEMORY_BLOCK_SIZE=4MB, MEMORY_MAX_BLOCKS=0
It's magic: libc caches memory again. System time is even less than for master, because scaling to 5478x3505 is also partially cached (which is noticeable from that operation's time). Some operations are still slower than in master; if we want to keep the libc cache, this requires further investigation.

this, -s2560x1638, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=256
All memory is cached by the application. System time is near zero, and almost all operations are as fast as master or even faster. Scaling to 5478x3505 is up to 90% faster! Rotate→flip is the only operation which is noticeably slower.

Ok, those were images which originally fit in the libc cache. What about large images?

Options:

  • master, -s5120x3200
  • this, -s5120x3200, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=0
  • this, -s5120x3200, MEMORY_BLOCK_SIZE=4MB, MEMORY_MAX_BLOCKS=0
  • this, -s5120x3200, MEMORY_BLOCK_SIZE=1MB, MEMORY_MAX_BLOCKS=512
Scale
                    5120x3200        1M, 0 cache      4M, 0 cache      1M, 512M cache
  51x32 bil       2340.90 Mpx/s    2319.64 Mpx/s    2357.77 Mpx/s    2284.78 Mpx/s
  51x32 bic       1337.79 Mpx/s    1352.16 Mpx/s    1328.48 Mpx/s    1330.10 Mpx/s
  51x32 lzs        838.48 Mpx/s     843.28 Mpx/s     833.11 Mpx/s     840.59 Mpx/s
  640x400 bil     1642.00 Mpx/s    1624.92 Mpx/s    1649.45 Mpx/s    1627.19 Mpx/s
  640x400 bic     1041.84 Mpx/s    1048.03 Mpx/s    1036.70 Mpx/s    1023.68 Mpx/s
  640x400 lzs      707.06 Mpx/s     702.70 Mpx/s     705.72 Mpx/s     695.98 Mpx/s
  4096x2560 bil    275.11 Mpx/s     275.63 Mpx/s     329.98 Mpx/s     427.13 Mpx/s
  4096x2560 bic    231.73 Mpx/s     236.04 Mpx/s     269.45 Mpx/s     331.88 Mpx/s
  4096x2560 lzs    195.94 Mpx/s     195.53 Mpx/s     220.44 Mpx/s     256.39 Mpx/s
  10957x6848 bil    73.54 Mpx/s      72.89 Mpx/s      88.44 Mpx/s     136.26 Mpx/s
  10957x6848 bic    63.27 Mpx/s      61.49 Mpx/s      69.56 Mpx/s      95.17 Mpx/s
  10957x6848 lzs    51.19 Mpx/s      49.63 Mpx/s      54.10 Mpx/s      68.73 Mpx/s

Filter
                    5120x3200        1M, 0 cache      4M, 0 cache      1M, 512M cache
  Smooth           378.58 Mpx/s     381.89 Mpx/s     439.52 Mpx/s     566.00 Mpx/s
  Sharpen          378.15 Mpx/s     381.88 Mpx/s     439.36 Mpx/s     566.16 Mpx/s
  Smooth More      203.46 Mpx/s     205.84 Mpx/s     220.60 Mpx/s     247.51 Mpx/s

Convert
                    5120x3200        1M, 0 cache      4M, 0 cache      1M, 512M cache
  RGB to L        1938.22 Mpx/s    1687.69 Mpx/s    1924.16 Mpx/s    1929.78 Mpx/s
  RGBA to LA      1394.24 Mpx/s    1450.94 Mpx/s    1329.35 Mpx/s    1490.40 Mpx/s
  RGBa to RGBA     706.00 Mpx/s     711.94 Mpx/s     929.12 Mpx/s    1456.63 Mpx/s
  RGBA to RGBa    1388.58 Mpx/s    1443.29 Mpx/s    1318.54 Mpx/s    1465.73 Mpx/s

Composition
                    5120x3200        1M, 0 cache      4M, 0 cache      1M, 512M cache
  Composition      447.56 Mpx/s     452.92 Mpx/s     548.22 Mpx/s     751.94 Mpx/s

Rotate 
                    5120x3200        1M, 0 cache      4M, 0 cache      1M, 512M cache
  Flop             760.20 Mpx/s     795.92 Mpx/s    1042.45 Mpx/s    1472.58 Mpx/s
  Flip             728.54 Mpx/s     737.72 Mpx/s    1037.82 Mpx/s    1776.62 Mpx/s
  Rotate 90        362.92 Mpx/s     335.44 Mpx/s     435.71 Mpx/s     476.47 Mpx/s
  Rotate 180       656.80 Mpx/s     662.20 Mpx/s     883.85 Mpx/s    1427.28 Mpx/s
  Rotate 270       355.48 Mpx/s     351.04 Mpx/s     441.78 Mpx/s     475.15 Mpx/s
  Transpose        357.53 Mpx/s     336.80 Mpx/s     431.85 Mpx/s     482.73 Mpx/s
  Transverse       355.72 Mpx/s     325.59 Mpx/s     422.69 Mpx/s     475.31 Mpx/s

                    5120x3200        1M, 0 cache      4M, 0 cache      1M, 512M cache
real               19.249 s         19.640 s         17.125 s         13.571 s
user               13.352 s         13.604 s         13.468 s         13.416 s
sys                 5.888 s          6.032 s          3.652 s          0.152 s

Sys time is about 1%. The most important thing: for almost all operations, performance on large images is now almost equal to performance on medium images.

So I believe this is a huge win. There are a lot of things left to do though. Here are some of them:

  • Support for images wider than the block size
  • Reduce memory consumption for small images
  • Line alignment
  • Fix errors and tests
  • Collect allocation statistics
  • Core API for all options (block size, max blocks, line alignment) and statistics
  • Tests, of course
  • Configuration through environment variables
  • Free cached blocks on module deinitialization
  • A method for purging the cache
  • Clearing the cache down to a desired size
  • Fallback to the smallest chunk size when a large block cannot be allocated
  • Investigate libc cache behavior so it can be used by default

@homm (Member Author) commented Sep 16, 2017

I'm not sure that we should mention this in the release notes or documentation for this release. By default the cache is turned off, and this change should be totally transparent for end users. I'd like to keep it as an experimental feature and add a public resource manager in the next release (4.4).

@homm homm changed the title [WIP] Block storage [WIP] Block & array hybrid storage Sep 16, 2017
@homm homm added this to the 4.3.0 milestone Sep 16, 2017
@homm (Member Author) commented Sep 18, 2017

Finally all green and almost done

@homm homm changed the title [WIP] Block & array hybrid storage Block & array hybrid storage Sep 18, 2017
@homm homm removed the Do Not Merge label Sep 18, 2017
block = arena->blocks[arena->blocks_cached];
// Reallocate if needed
if (block.size != requested_size){
block.ptr = realloc(block.ptr, requested_size);
Member:
Is this likely to be realloc'ing on every block, unless you're processing a bunch of images of identical bit width?

@homm (Member Author) Sep 20, 2017:
I believe there are three possible outcomes of a reallocation:

  • on reduction: most implementations just decrease a few numbers in metadata without actually reallocating
  • on expansion when there is enough space: as above
  • on expansion when there is not enough space: an actual reallocation at a new address

In theory, the longer the program runs the fewer actual relocations should occur, because after several reallocations all blocks will have enough space ahead.

According to my tests, when blocks_max is large enough, only a small fraction of reuses lead to actual reallocations (i.e. the new ptr != the old ptr).

The situation should improve with more intelligent block selection, not just taking the topmost block.
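
For illustration only (my own names, not the PR's code), the check described above boils down to comparing the pointer realloc() returns with the cached one:

#include <stdlib.h>

/* Reuse a cached block via realloc() and count whether the allocator
   actually had to move it. */
void *reuse_block(void *cached_ptr, size_t new_size,
                  int *reused_count, int *reallocated_count) {
    void *new_ptr = realloc(cached_ptr, new_size);
    if (new_ptr == NULL)
        return NULL;  /* cached_ptr is still valid if realloc fails */
    *reused_count += 1;
    if (new_ptr != cached_ptr)
        *reallocated_count += 1;  /* an actual reallocation happened */
    return new_ptr;
}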

@homm (Member Author) commented Sep 22, 2017

I've learned a lot about the Linux memory model these days. For example, there are two mechanisms for allocation in Linux: mmap and sbrk.

And finally I found something really interesting in the libc sources.

M_MMAP_THRESHOLD
Using mmap segregates relatively large chunks of memory so that
they can be individually obtained and released from the host
system. A request serviced through mmap is never reused by any
other request (at least not directly; the system may just so
happen to remap successive requests to the same locations).

So, as discovered above, under some conditions libc doesn't use mmap for blocks up to 32 megabytes and can reuse them without returning them to the system.

Still investigating.

@homm (Member Author) commented Sep 22, 2017

Ok, looks like I finally understood everything.

Of all the libc malloc variables, only two are significant for our case: mmap_threshold and trim_threshold.

Libc won't release freed memory and will reuse it if two conditions are met:

  1. the memory was allocated with sbrk, not mmap
  2. the memory has not been released (trimmed) yet

mmap_threshold is the minimal size of a block which will be allocated using mmap (or, in other words, the maximum size of a block which will be allocated using sbrk).
trim_threshold is the minimum amount of free sbrk-allocated memory above which libc will actually release memory back to the system.

By default both variables are set to 128 KB, but as the documentation says, they are adjusted dynamically. I found only one place in the sources where this happens: in __libc_free.

So mmap_threshold always holds the size of the largest freed mmapped block, and trim_threshold is twice that. And this is the most important thing: once an application has allocated and freed a large enough block of memory (but not larger than DEFAULT_MMAP_THRESHOLD_MAX, which is 32 MB on 64-bit systems), libc will keep up to two sizes of that block and will reuse the memory when possible.
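
A minimal sketch of that effect (glibc-specific, not code from this PR; the size is an example): allocating and freeing one block of roughly 30 MB once is enough to raise mmap_threshold to about 30 MB and trim_threshold to about 60 MB for the rest of the process lifetime.

#include <stdlib.h>
#include <string.h>

#define WARMUP_SIZE (30 * 1024 * 1024)  /* below DEFAULT_MMAP_THRESHOLD_MAX */

/* The first allocation of this size is served by mmap (it is above the
   initial 128 KB threshold); freeing it bumps mmap_threshold to the block
   size and trim_threshold to twice that, so later blocks of similar size
   stay on the heap and get reused. */
void bump_malloc_thresholds(void) {
    void *p = malloc(WARMUP_SIZE);
    if (p) {
        memset(p, 0, WARMUP_SIZE);  /* touch the pages */
        free(p);
    }
}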

So, let's check this:

A. master, -s2560x1639 (array storage). As you can see, the result doesn't match the one in the original comment. This is because there are no scale tests here. Indeed, in the original runs the scale tests bumped trim_threshold during the 2560x1639→2048x1310 scaling, because 2048x1310 uses block storage.
B. master, -s2560x1639 (array storage). Ok, let's try to bump trim_threshold with something else, for example 'load' operations. They always load a 2560x1600 JPEG image, so block storage is used.
C. master, -s2560x1639 (array storage). And to be sure, let's try to bump the threshold with a bit of Python code run before the tests (in pillow-perf/testsuite/cases/pillow.py):

# allocate and immediately drop a ~30 MB bytes object to bump trim_threshold
_tmp = b'0123456789' * int(30*1024*1024*0.1)
_tmp = 0

D. master, -s2560x1639 (array storage). And how about setting the allocator environment variables manually? MALLOC_MMAP_THRESHOLD_=33554432 MALLOC_TRIM_THRESHOLD_=67108864

Convert
                      A                B                C                D
  RGB to L        1938.82 Mpx/s    1930.94 Mpx/s    1862.29 Mpx/s    1772.63 Mpx/s
  RGBA to LA      1341.36 Mpx/s    1382.56 Mpx/s    1340.54 Mpx/s    1352.18 Mpx/s
  RGBa to RGBA     749.55 Mpx/s    1360.86 Mpx/s    1364.98 Mpx/s    1359.18 Mpx/s
  RGBA to RGBa    1331.11 Mpx/s    1332.83 Mpx/s    1333.33 Mpx/s    1340.95 Mpx/s

Composition
                      A                B                C                D
  Composition      438.39 Mpx/s     805.80 Mpx/s     804.25 Mpx/s     805.95 Mpx/s

Rotate 
                      A                B                C                D
  Flop             789.88 Mpx/s    1353.53 Mpx/s    1343.92 Mpx/s    1357.92 Mpx/s
  Flip             807.68 Mpx/s    1714.76 Mpx/s    1746.24 Mpx/s    1701.50 Mpx/s
  Rotate 90        391.84 Mpx/s     508.53 Mpx/s     510.31 Mpx/s     512.75 Mpx/s
  Rotate 180       713.68 Mpx/s    1307.09 Mpx/s    1314.51 Mpx/s    1307.48 Mpx/s
  Rotate 270       379.61 Mpx/s     500.03 Mpx/s     495.14 Mpx/s     498.19 Mpx/s
  Transpose        389.07 Mpx/s     520.25 Mpx/s     519.09 Mpx/s     518.12 Mpx/s
  Transverse       397.34 Mpx/s     523.96 Mpx/s     512.26 Mpx/s     526.59 Mpx/s

                      A                B                C                D
real                2.073 s          2.569 s          1.538 s          1.495 s
user                1.516 s          2.524 s          1.508 s          1.468 s
sys                 0.556 s          0.048 s          0.032 s          0.024 s

All three methods work for master with array storage.

to be continued...

@homm (Member Author) commented Sep 22, 2017

Now, what about the block & array hybrid storage? I chose a block size of 1 MB because I think it is large enough to avoid too frequent malloc/free calls and small enough to be allocated even with high memory fragmentation.

The C and D cases also bump trim_threshold and fix performance for hybrid storage. But no other operation bumps the threshold anymore, because the largest freed block is now Pillow's largest block, which is 1 MB.

Possible solutions I see:

  1. Leave it as is. Indeed, the problem is only related to a single memory manager and its defaults. We are adding something like our own configurable memory manager, with more explicit tools to improve performance.
  2. Increase the default block size to the old value: 16 MB. This will bump trim_threshold for large images automatically, as before. This will improve compatibility, but in most cases the user will have to change PILLOW_BLOCK_SIZE to a smaller value if they decide to manage memory manually.

We shouldn't worry about memory fragmentation, because I'm going to implement a retry with the smallest possible block size (one page, 4 KB) if a large block allocation fails (much like before, when array storage was used whenever block storage failed).
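
A minimal sketch of that fallback (names are mine, not from this PR): on allocation failure, retry with progressively smaller blocks down to one page instead of failing outright.

#include <stdlib.h>

#define PAGE_SIZE 4096

void *allocate_block_with_fallback(size_t preferred_size, size_t *actual_size) {
    size_t size = preferred_size;
    while (size >= PAGE_SIZE) {
        void *ptr = malloc(size);
        if (ptr) {
            *actual_size = size;
            return ptr;
        }
        size /= 2;  /* retry with a smaller block */
    }
    *actual_size = 0;
    return NULL;
}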

@homm homm removed the Needs Rebase label Sep 23, 2017
have been removed.

The ``PIL.Image.core.getcount`` methods have been removed, use
``PIL.Image.core.get_stats()['new_count']`` property instead.
Member:
You've accidentally removed this chunk of the release notes.

Member:

Okay, thanks, my mistake

take alignment into account when calculating lines_per_block
@homm (Member Author) commented Sep 23, 2017

I'm finally happy with this. In its current form the PR doesn't introduce any known performance regressions and also provides a rich API for memory management at all levels. It is even a bit faster for large images (like 5120x3200).

@homm homm requested a review from wiredfool September 23, 2017 03:12
break;
if (line_in_block == 0) {
int required;
int lines_remained = lines_per_block;
@wiredfool (Member) Sep 28, 2017:

I think this is better named lines_remaining, or lines_requested or current_block_lines

Member Author:
Done

@wiredfool (Member):

So the invariants here are:

  1. Images don't require any more storage than they previously did, except for potentially a handful of bytes due to alignment issues.
  2. Large images are allocated in blocks of a size similar (same order of magnitude) to the previous chunk size, where pre-patch they would be made of one malloc per line.
  3. Small images will still go in one chunk.
  4. The only case where this will use more memory is that it will hold on to a bunch of memory from recently freed images. The high-water mark should be the same.

@homm (Member Author) commented Sep 30, 2017

> except for potentially a handful of bytes due to alignment issues.

If you mean arena->alignment, it is 1 by default and doesn't affect any memory pointers until the user explicitly sets it.

> it will hold on to a bunch of memory from recently freed images

I believe that for long-living applications where a big enough image (4 Mpx) has been created at least once, memory consumption will be exactly the same, including the held-on memory.

@homm (Member Author) commented Sep 30, 2017

For some reason one of the Docker builds sometimes fails.

@hugovk (Member) commented Sep 30, 2017

We're tracking the Docker problem in #2758.

int alignment; /* Alignment in memory of each line of an image */
int block_size; /* Preferred block size */
int blocks_max; /* Maximum number of cached blocks */
int blocks_cached; /* Current number of block not accociated with images */
@radarhere (Member) Sep 30, 2017:

blocks not associated

Member Author:
Thanks! Fixed

int stats_new_count; /* Number of new allocated images */
int stats_allocated_blocks; /* Number of allocated blocks */
int stats_reused_blocks; /* Number of blocks which was retrieved from pool */
int stats_reallocated_blocks; /* Number of blocks which was actually reallocated after retrieving */
Member:
were

@wiredfool (Member):

I think that this should probably be called out in the release notes, if only to say that large images are now allocated in a block manner. We don't think that there's a performance regression, but we don't really test everywhere that Pillow is used, especially on small memory devices.

I'd also like to include some documentation, at least as it relates to the environment variables. And, I think that we should include a bunch of this thread in an area of the docs for design decisions. Promoting them will help them be useful to future hackers and prevent them from being buried in github threads.

@wiredfool wiredfool merged commit c82f9fe into python-pillow:master Oct 1, 2017
@homm homm deleted the block-storage branch October 1, 2017 19:41
@homm (Member Author) commented Oct 1, 2017

@wiredfool That is fair enough. Do you need any help from me?

@wiredfool (Member):

Don't think so. The block size is still the same as the old threshold to go to line by line, right?

@homm (Member Author) commented Oct 1, 2017

Yes. The previous THRESHOLD was 2048*2048*4L, which is 16*1024*1024, the same as the new default block_size.
