Inflate fast NEON optimization #345

Open
wants to merge 2 commits into
base: master

Adenilson

This uses SIMD to perform wide loads/stores in inflate_fast, which should improve performance on ARM by
18% to 30%, depending on the data.

It also includes the fix for the InflateBack() corner case (details in: https://bugs.chromium.org/p/chromium/issues/detail?id=769880).

This optimization has been shipping in Chromium since M62 (it landed in the repository around September/October 2017).

Adenilson Cavalcanti added 2 commits April 5, 2018 11:05
In inflate_fast() the output pointer always has plenty of room to write. This
means that, so long as the target is capable, wide unaligned loads and stores
can be used to transfer several bytes at once. When the reference distance is
too short, simply unroll the data a little to increase the distance.
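
As a rough illustration of the idea described above, here is a minimal portable C sketch. It is not the code from the patch, which targets NEON. The names `lz_copy`, `copy_chunk` and `CHUNK` are hypothetical, fixed-size `memcpy` calls stand in for the wide unaligned loads and stores, and the caller is assumed to guarantee at least `CHUNK - 1` writable bytes past the end of the match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16 /* bytes moved per wide copy (one 128-bit register) */

/* Move CHUNK bytes at once. A fixed-size memcpy is usually compiled to a
 * single unaligned wide load and store. The regions must not overlap. */
static void copy_chunk(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, CHUNK);
}

/* Hypothetical helper: replay `len` bytes that start `dist` bytes behind
 * `out`, as an LZ77 match does. Returns the new output position.
 * Assumes `dist` bytes of history exist before `out` and that at least
 * CHUNK - 1 bytes past out + len are writable scratch space. */
static unsigned char *lz_copy(unsigned char *out, size_t dist, size_t len)
{
    unsigned char *end = out + len;

    assert(dist > 0);

    /* Short distance: append one more copy of the history, then double the
     * distance. The data behind `out` stays periodic, so a larger multiple
     * of the original distance reproduces the same bytes. */
    while (dist < CHUNK && out < end) {
        memcpy(out, out - dist, dist); /* source and target do not overlap */
        out += dist;
        dist *= 2;
    }

    /* Distance is now at least CHUNK, so every wide copy reads only bytes
     * that were already written. The last copy may run past `end`. */
    while (out < end) {
        copy_chunk(out, out - dist);
        out += CHUNK;
    }
    return end;
}
```

Calling `lz_copy(buf + 3, 3, 40)` on a buffer that starts with `abc` repeats that pattern for 40 more bytes. The sketch may also scribble up to `CHUNK - 1` bytes beyond the match, which is why the spare room in the output buffer matters.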

For reference, please see:
https://chromium.googlesource.com/chromium/src/+/78104f4d73e3bbb4155fa804d00ed66682180556
P.S.: this is still missing the fix for the inflate_back corner case.

Change-Id: I5216424ab584e069b77ddf04000a313d5ca99839
This handles the case where a zlib user relies on the InflateBack
API to decompress content.

The NEON optimization assumes that it can perform wide stores, sometimes
overwriting data on the output pointer (but never overflowing the buffer
end as it has enough room for the write).

For infback there are no such guarantees (i.e. no extra wiggle room),
which can result in illegal operations. This patch fixes the potential
issue by falling back to the non-optimized code for such cases.
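
To make the trade-off concrete, here is a small C sketch of that kind of guard. It is only an illustration under assumptions: `fill_run` and `WIDE` are hypothetical names, the sketch checks the available room per call, and the actual patch may decide when to fall back in a different way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WIDE 16 /* bytes written per wide store */

/* Hypothetical helper: write `len` copies of `value` at `dst`, where
 * `avail` is the number of writable bytes starting at `dst`.
 * Returns 1 if the wide path was taken, 0 if it fell back. */
static int fill_run(unsigned char *dst, size_t avail,
                    unsigned char value, size_t len)
{
    unsigned char pattern[WIDE];
    size_t rounded = (len + WIDE - 1) / WIDE * WIDE;
    size_t i;

    assert(len <= avail);

    /* Not enough slack for whole wide stores: use the exact path,
     * which never touches a byte past dst + len. */
    if (rounded > avail) {
        memset(dst, value, len);
        return 0;
    }

    /* Enough slack: whole wide stores, possibly overwriting bytes
     * between dst + len and dst + rounded (still inside the buffer). */
    memset(pattern, value, WIDE);
    for (i = 0; i < rounded; i += WIDE)
        memcpy(dst + i, pattern, WIDE);
    return 1;
}
```

With a 32-byte buffer, a 5-byte run takes the wide path and overwrites 16 bytes. With only 5 writable bytes, the same request takes the exact path and leaves everything after the run untouched.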

It also adds comments about the entry assumptions in inflate and
writes a defined value into the write buffer to mark where
the real data ends (helpful while debugging).

For reference, please see:
https://chromium.googlesource.com/chromium/src/+/0bb11040792edc5b28fcb710fc4c01fedd98c97c

Change-Id: Iffbda9eb5e08a661aa15c6e3d1c59b678cc23b2c
@Adenilson
Author

Ideally this should be applied first, followed by updated (WIP) versions of the checksum patches (i.e. optimized crc32 and adler32).

@Adenilson
Author

@madler any suggestions?

@Adenilson
Author

For further details concerning the optimization, please see:
https://bugs.chromium.org/p/chromium/issues/detail?id=697280

@@ -0,0 +1,311 @@
/* inffast.c -- fast decoding
* Copyright (C) 1995-2017 Mark Adler
* For conditions of distribution and use, see copyright notice in zlib.h
Author

I wonder if we should point out clearly that this is a modified inffast.c (i.e. by adding the respective copyright notice).

@@ -0,0 +1,1582 @@
/* inflate.c -- zlib decompression
* Copyright (C) 1995-2016 Mark Adler
Author

I wonder if we should point out clearly that this is a modified inflate.c (i.e. by adding the respective copyright notice).

@Adenilson
Author

Some benchmarking data from a run on an ARM CPU (big core A72, snappy data set) shows an average performance improvement of 31%:

a) Vanilla
(xenial)adenilson@localhost:~/canonical-fork/build$ time taskset -c 3 ./zlib_bench gzip ~/corpora/snappy/testdata/*
/home/adenilson/corpora/snappy/testdata/alice29.txt :
GZIP: [b 1M] bytes 152089 -> 54426 35.8% comp 7.1 ( 7.2) MB/s uncomp 127.7 (127.9) MB/s
/home/adenilson/corpora/snappy/testdata/asyoulik.txt :
GZIP: [b 1M] bytes 125179 -> 48949 39.1% comp 6.5 ( 6.5) MB/s uncomp 120.5 (120.6) MB/s
/home/adenilson/corpora/snappy/testdata/baddata1.snappy :
GZIP: [b 1M] bytes 27512 -> 22920 83.3% comp 18.6 ( 18.7) MB/s uncomp 88.2 ( 88.3) MB/s
/home/adenilson/corpora/snappy/testdata/baddata2.snappy :
GZIP: [b 1M] bytes 27483 -> 23000 83.7% comp 18.6 ( 18.6) MB/s uncomp 88.4 ( 88.4) MB/s
/home/adenilson/corpora/snappy/testdata/baddata3.snappy :
GZIP: [b 1M] bytes 28384 -> 23705 83.5% comp 18.5 ( 18.5) MB/s uncomp 87.9 ( 87.9) MB/s
/home/adenilson/corpora/snappy/testdata/fireworks.jpeg :
GZIP: [b 1M] bytes 123093 -> 122927 99.9% comp 21.8 ( 21.8) MB/s uncomp 314.5 (314.8) MB/s
/home/adenilson/corpora/snappy/testdata/geo.protodata :
GZIP: [b 1M] bytes 118588 -> 15143 12.8% comp 34.4 ( 34.7) MB/s uncomp 237.2 (237.3) MB/s
/home/adenilson/corpora/snappy/testdata/html :
GZIP: [b 1M] bytes 102400 -> 13711 13.4% comp 27.3 ( 27.5) MB/s uncomp 220.2 (220.4) MB/s
/home/adenilson/corpora/snappy/testdata/html_x_4 :
GZIP: [b 1M] bytes 409600 -> 53299 13.0% comp 24.3 ( 24.5) MB/s uncomp 220.7 (221.1) MB/s
/home/adenilson/corpora/snappy/testdata/kppkn.gtb :
GZIP: [b 1M] bytes 184320 -> 38789 21.0% comp 5.2 ( 5.3) MB/s uncomp 162.3 (162.5) MB/s
/home/adenilson/corpora/snappy/testdata/lcet10.txt :
GZIP: [b 1M] bytes 426754 -> 144904 34.0% comp 7.2 ( 7.2) MB/s uncomp 129.4 (129.6) MB/s
/home/adenilson/corpora/snappy/testdata/paper-100k.pdf :
GZIP: [b 1M] bytes 102400 -> 81276 79.4% comp 22.1 ( 22.1) MB/s uncomp 146.2 (146.4) MB/s
/home/adenilson/corpora/snappy/testdata/plrabn12.txt :
GZIP: [b 1M] bytes 481861 -> 195220 40.5% comp 5.3 ( 5.3) MB/s uncomp 117.1 (117.4) MB/s
/home/adenilson/corpora/snappy/testdata/urls.10K :
GZIP: [b 1M] bytes 702087 -> 222381 31.7% comp 14.0 ( 14.0) MB/s uncomp 141.4 (141.5) MB/s

b) inflate_fast
(xenial)adenilson@localhost:~/canonical-fork/build$ time taskset -c 3 ./zlib_bench gzip ~/corpora/snappy/testdata/*
/home/adenilson/corpora/snappy/testdata/alice29.txt :
GZIP: [b 1M] bytes 152089 -> 54426 35.8% comp 7.2 ( 7.2) MB/s uncomp 177.1 (177.2) MB/s
/home/adenilson/corpora/snappy/testdata/asyoulik.txt :
GZIP: [b 1M] bytes 125179 -> 48949 39.1% comp 6.5 ( 6.5) MB/s uncomp 164.5 (164.6) MB/s
/home/adenilson/corpora/snappy/testdata/baddata1.snappy :
GZIP: [b 1M] bytes 27512 -> 22920 83.3% comp 18.8 ( 18.8) MB/s uncomp 90.8 ( 91.0) MB/s
/home/adenilson/corpora/snappy/testdata/baddata2.snappy :
GZIP: [b 1M] bytes 27483 -> 23000 83.7% comp 18.8 ( 18.8) MB/s uncomp 90.7 ( 90.7) MB/s
/home/adenilson/corpora/snappy/testdata/baddata3.snappy :
GZIP: [b 1M] bytes 28384 -> 23705 83.5% comp 18.7 ( 18.7) MB/s uncomp 90.4 ( 90.5) MB/s
/home/adenilson/corpora/snappy/testdata/fireworks.jpeg :
GZIP: [b 1M] bytes 123093 -> 122927 99.9% comp 21.8 ( 21.9) MB/s uncomp 311.1 (311.3) MB/s
/home/adenilson/corpora/snappy/testdata/geo.protodata :
GZIP: [b 1M] bytes 118588 -> 15143 12.8% comp 34.9 ( 35.1) MB/s uncomp 299.1 (299.1) MB/s
/home/adenilson/corpora/snappy/testdata/html :
GZIP: [b 1M] bytes 102400 -> 13711 13.4% comp 27.7 ( 27.7) MB/s uncomp 284.6 (284.9) MB/s
/home/adenilson/corpora/snappy/testdata/html_x_4 :
GZIP: [b 1M] bytes 409600 -> 53299 13.0% comp 24.7 ( 24.8) MB/s uncomp 284.9 (285.5) MB/s
/home/adenilson/corpora/snappy/testdata/kppkn.gtb :
GZIP: [b 1M] bytes 184320 -> 38789 21.0% comp 5.3 ( 5.3) MB/s uncomp 222.0 (222.1) MB/s
/home/adenilson/corpora/snappy/testdata/lcet10.txt :
GZIP: [b 1M] bytes 426754 -> 144904 34.0% comp 7.2 ( 7.3) MB/s uncomp 180.0 (180.1) MB/s
/home/adenilson/corpora/snappy/testdata/paper-100k.pdf :
GZIP: [b 1M] bytes 102400 -> 81276 79.4% comp 20.2 ( 21.8) MB/s uncomp 147.9 (149.5) MB/s
/home/adenilson/corpora/snappy/testdata/plrabn12.txt :
GZIP: [b 1M] bytes 481861 -> 195220 40.5% comp 5.3 ( 5.3) MB/s uncomp 163.4 (163.7) MB/s
/home/adenilson/corpora/snappy/testdata/urls.10K :
GZIP: [b 1M] bytes 702087 -> 222381 31.7% comp 14.0 ( 14.0) MB/s uncomp 175.1 (175.2) MB/s
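
For anyone who wants to reproduce the per-file comparison, here is a small C sketch that turns the two runs above into percentage gains. The throughput values are the first `uncomp` figure of each line, in the order listed. The names `gain_percent` and `mean_gain_percent` are hypothetical, and taking the arithmetic mean of the per-file gains is only one possible way to summarize the data; how the average quoted earlier was derived is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Decompression throughput in MB/s, taken from the "uncomp" column. */
static const double vanilla[] = {
    127.7, 120.5, 88.2, 88.4, 87.9, 314.5, 237.2,
    220.2, 220.7, 162.3, 129.4, 146.2, 117.1, 141.4
};
static const double patched[] = {
    177.1, 164.5, 90.8, 90.7, 90.4, 311.1, 299.1,
    284.6, 284.9, 222.0, 180.0, 147.9, 163.4, 175.1
};

/* Relative throughput change in percent; positive means faster. */
static double gain_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

/* Arithmetic mean of the per-file gains (an assumed summary method). */
static double mean_gain_percent(const double *before, const double *after,
                                size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += gain_percent(before[i], after[i]);
    return sum / (double)n;
}
```

Calling `mean_gain_percent(vanilla, patched, 14)` averages over every file, including the nearly incompressible ones (`baddata*.snappy`, `fireworks.jpeg`, `paper-100k.pdf`) where the wide copies have little to do.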

@Adenilson
Author

@madler any comment?

@Adenilson
Author

@madler ping?

antonlacon added a commit to antonlacon/LibreELEC.tv that referenced this pull request Aug 20, 2018
This introduces arm/neon optimizations to zlib.

The first two patches are a NEON optimization of zlib's
inflate function. They increase decompression speed and have been
shipping in Chromium since release 62 (Oct. 2017). The patches have
been pulled from a PR to zlib upstream: madler/zlib#345.

Patches 003 and 004 have been pulled from Fedora Core's aarch64 zlib
package. They improve zlib compression speed and have been there for
4 months.

Patch 005 is pulled from a PR to zlib upstream: madler/zlib#251. It has
been shipping in Chromium since release 63, and increases
decompression speed.

Patch 006 is my own to allow 005 to merge without conflict with the
previous patches.

Signed-off-by: Ian Leonard <antonlacon@gmail.com>
GerHobbelt pushed a commit to GerHobbelt/zlib that referenced this pull request Nov 20, 2021
* Remove old zlib readme.
* Remove old zlib change history from inflate.c.
* Remove old treebuild.xml and zlib pdf.
@PolynomialDivision

Can you rebase on the latest master? :)
