Add support for IBM Z hardware-accelerated deflate #410

iii-i · 2019-03-15T10:11:28Z

Note: this PR is based on #750 and https://github.com/iii-i/zlib/releases/tag/crc32vx-v6 in order to simplify integration into distributions, which normally want all three changes.

IBM Z mainframes starting from version z15 provide the DFLTCC instruction,
which implements the deflate algorithm in hardware, with estimated
compression and decompression performance orders of magnitude faster
than the current zlib and a compression ratio comparable with that of
level 1.

This patch adds DFLTCC support to zlib. It can be enabled using the
following build commands:

$ ./configure --dfltcc
$ make

When built like this, zlib compresses in hardware at level 1 and in
software at all other levels. Decompression always happens in
hardware. To enable DFLTCC compression for levels 1-6 (i.e., to have
it used by default), either configure with
--dfltcc-level-mask=0x7e or export DFLTCC_LEVEL_MASK=0x7e at run
time.
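
To illustrate how a level mask such as 0x7e relates to levels 1-6, here is a minimal Python sketch. It assumes that bit N of the mask enables hardware compression for level N, which is consistent with the value above but is an inference rather than something stated in the patch. The helper names are hypothetical.

```python
def level_mask(levels):
    """Build a mask with bit N set for every level N (assumed encoding)."""
    mask = 0
    for level in levels:
        mask |= 1 << level
    return mask


def levels_in_mask(mask):
    """List the zlib levels (0-9) whose bit is set in the mask."""
    return [level for level in range(10) if mask & (1 << level)]


# Levels 1-6 give the value quoted for --dfltcc-level-mask.
print(hex(level_mask(range(1, 7))))
# Under the same assumption, "level 1 only" would be 0x2.
print(levels_in_mask(0x2))
```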

Two DFLTCC compression calls produce the same results only when they
both are made on machines of the same generation, and when the
respective buffers have the same offset relative to the start of the
page. Therefore care should be taken when using hardware compression
when reproducible results are desired. One such use case - reproducible
software builds - is handled explicitly: when the SOURCE_DATE_EPOCH
environment variable is set, the hardware compression is disabled.
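
For code that tests compression on such systems, one way to stay independent of the exact byte stream is to check the round trip instead of comparing compressed bytes. The following is a minimal sketch using Python's standard zlib module. It is not part of the patch, and `roundtrip_ok` is a hypothetical helper.

```python
import zlib


def roundtrip_ok(data: bytes, level: int = 1) -> bool:
    """Return True if decompressing the compressed data yields the input.

    This holds for any valid deflate implementation, hardware or software,
    whereas the compressed bytes themselves may differ between them.
    """
    return zlib.decompress(zlib.compress(data, level)) == data


sample = b"reproducible builds " * 256
print(roundtrip_ok(sample))
```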

DFLTCC does not support every single zlib feature, in particular:

* `inflate(Z_BLOCK)` and `inflate(Z_TREES)`
* `inflateMark()`
* `inflatePrime()`
* `inflateSyncPoint()`

When used, these functions will either switch to software, or, in case
this is not possible, gracefully fail.

This patch tries to add DFLTCC support in the least intrusive way.
All SystemZ-specific code is placed into a separate file, but
unfortunately there is still a noticeable amount of changes in the
main zlib code. Below is the summary of these changes.

DFLTCC takes as arguments a parameter block, an input buffer, an output
buffer and a window. Since DFLTCC requires the parameter block to be
doubleword-aligned, and it's reasonable to allocate it alongside the
deflate and inflate states, the ZALLOC_STATE(), ZFREE_STATE() and
ZCOPY_STATE() macros are introduced in order to encapsulate the
allocation details. The same is true for the window, for which
the ZALLOC_WINDOW() and TRY_FREE_WINDOW() macros are introduced.
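
As an illustration of what doubleword alignment means for such an allocation, here is a small Python sketch of the usual round-up arithmetic. A doubleword is 8 bytes on IBM Z. The function name and the example addresses are hypothetical, and this is not the code behind the macros above.

```python
DOUBLEWORD = 8  # bytes in a doubleword on IBM Z


def align_up(addr: int, alignment: int = DOUBLEWORD) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


# An allocator that returns an arbitrary address can over-allocate by
# alignment - 1 bytes and place the parameter block at the aligned address.
print(hex(align_up(0x1001)))
print(hex(align_up(0x1008)))
```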

Software and hardware window formats do not match, therefore,
deflateSetDictionary(), deflateGetDictionary(),
inflateSetDictionary() and inflateGetDictionary() need special
handling, which is triggered using the new
DEFLATE_SET_DICTIONARY_HOOK(), DEFLATE_GET_DICTIONARY_HOOK(),
INFLATE_SET_DICTIONARY_HOOK() and INFLATE_GET_DICTIONARY_HOOK()
macros.

deflateResetKeep() and inflateResetKeep() now update the DFLTCC
parameter block, which is allocated alongside zlib state, using
the new DEFLATE_RESET_KEEP_HOOK() and INFLATE_RESET_KEEP_HOOK()
macros.

The new DEFLATE_PARAMS_HOOK() macro switches between the hardware
and the software deflate implementations when the deflateParams()
arguments demand this.

The new INFLATE_PRIME_HOOK(), INFLATE_MARK_HOOK() and
INFLATE_SYNC_POINT_HOOK() macros make the respective unsupported
calls gracefully fail.

The algorithm implemented in the hardware has a different compression
ratio than the one implemented in software. In order for
deflateBound() to return the correct results for the hardware
implementation, the new DEFLATE_BOUND_ADJUST_COMPLEN() and
DEFLATE_NEED_CONSERVATIVE_BOUND() macros are introduced.

Actual compression and decompression are handled by the new
DEFLATE_HOOK() and INFLATE_TYPEDO_HOOK() macros. Since inflation
with DFLTCC manages the window on its own, calling updatewindow() is
suppressed using the new INFLATE_NEED_UPDATEWINDOW() macro.

In addition to the compression, DFLTCC computes the CRC-32 and Adler-32
checksums, therefore, whenever it's used, the software checksumming is
suppressed using the new DEFLATE_NEED_CHECKSUM() and
INFLATE_NEED_CHECKSUM() macros.

DFLTCC will refuse to write an End-of-block Symbol if there is no input
data, thus in some cases it is necessary to do this manually. In order
to achieve this, send_bits(), bi_reverse(), bi_windup() and
flush_pending() are promoted from local to ZLIB_INTERNAL.
Furthermore, since the block and the stream termination must be handled
in software as well, enum block_state is moved to deflate.h.
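
To show what writing an End-of-block Symbol by hand amounts to, the following Python sketch builds the smallest raw deflate stream: a final fixed-Huffman block containing only the end-of-block symbol. It relies on the deflate format (RFC 1951), not on the patch's own code, and the function name is hypothetical.

```python
import zlib


def empty_final_fixed_block() -> bytes:
    """Encode a final fixed-Huffman block holding only the EOB symbol."""
    bits = []
    bits.append(1)       # BFINAL = 1: this is the last block
    bits += [1, 0]       # BTYPE = 01 (fixed Huffman), least significant bit first
    bits += [0] * 7      # end-of-block symbol 256: 7-bit code 0000000
    while len(bits) % 8:  # pad with zero bits up to a byte boundary
        bits.append(0)
    out = bytearray()
    for i in range(0, len(bits), 8):
        # Deflate packs bits into each byte starting from the least significant bit.
        out.append(sum(bit << n for n, bit in enumerate(bits[i:i + 8])))
    return bytes(out)


block = empty_final_fixed_block()
# wbits=-15 asks zlib to treat the input as a raw deflate stream.
print(block.hex(), zlib.decompress(block, wbits=-15))
```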

Since the first call to dfltcc_inflate() already needs the window,
and it might not be allocated yet, inflate_ensure_window() is
factored out of updatewindow() and made ZLIB_INTERNAL.

iii-i · 2019-04-05T09:02:28Z

Gentle ping.

iii-i · 2019-05-06T09:10:11Z

Fixed a bug when Z_SYNC_FLUSH usage led to incomplete EOBS write.
Fixed a "goto fail" bug in dfltcc_deflate_get_dictionary().
Replaced getenv() with secure_getenv().
Added sys/sdt.h feature test.

iii-i · 2019-06-04T10:35:42Z

Fixed incorrect usage of STFLE instruction (reported against gzip here, fixed in gzip here, corresponding zlib-ng pull request here).
Simplified hooks by removing #undef statements.

iii-i · 2019-07-08T13:08:49Z

Fixed 31-bit build:
- Added machine mode hint for STFLE.
- Adjusted offset calculations.
Fixed sys/sdt.h feature test.
Added an entry to contrib/README.contrib.

When nginx is used with zlib patched with [1], which provides integration with the future IBM Z hardware deflate acceleration, it ends up computing CRC32 twice: once in hardware, which always does this, and once in software by explicitly calling crc32(). The crc32() calls were added in changesets 133:b27548f540ad ("nginx-0.0.1-2003-09-24-23:51:12 import") and 134:d57c6835225c ("nginx-0.0.1-2003-09-26-09:45:21 import") as part of the gzip wrapping feature, because back then zlib did not support it. However, gzip wrapping has since been implemented in zlib v1.2.0.4, and nginx already uses it for log compression. This patch replaces the hand-written gzip wrapping with the one provided by zlib. It simplifies the code and avoids computing CRC32 twice when using hardware acceleration. [1] madler/zlib#410
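
The idea of letting zlib produce the gzip wrapper can be sketched with Python's standard zlib module (this is not nginx code, and `gzip_with_zlib` is a hypothetical helper). With `wbits=31`, zlib emits the gzip header and trailer itself, so the caller never calls `crc32()` separately. The sketch calls it only to confirm what the trailer contains.

```python
import struct
import zlib


def gzip_with_zlib(data: bytes, level: int = 1) -> bytes:
    """Compress data into a gzip stream using zlib's own gzip wrapping."""
    # wbits=31 selects a 32 KiB window plus a gzip header and trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    return compressor.compress(data) + compressor.flush()


payload = b"hello, nginx " * 100
stream = gzip_with_zlib(payload)

# The gzip trailer is the last 8 bytes: CRC-32 and input size, little-endian.
crc, isize = struct.unpack("<II", stream[-8:])
print(stream[:2].hex(), crc == zlib.crc32(payload), isize == len(payload))
```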

kloczek · 2019-08-23T05:25:32Z

The only issue with this PR is that it looks like it is controlled by CFLAGS injection instead of a proper autoconf --with{,out}-foo option.

Patch series "S390 hardware compression support for kernel zlib", v2.

With the IBM z15 mainframe, the new DFLTCC instruction is available. It implements the deflate algorithm in hardware (Nest Acceleration Unit - NXU) with estimated compression and decompression performance orders of magnitude faster than the current zlib.

This patchset adds s390 hardware compression support to kernel zlib. The code is based on the userspace zlib implementation: madler/zlib#410. The coding style is also preserved for future maintainability. Only a limited set of userspace zlib functions is represented in the kernel. Apart from that, all memory allocation should be performed in advance. Thus, the workarea structures are extended with the parameter lists required for the DEFLATE CONVERSION CALL instruction.

Since kernel zlib itself does not support gzip headers, only the Adler-32 checksum is processed (it can also be produced by the DFLTCC facility). As implemented for userspace, kernel zlib will compress in hardware on level 1, and in software on all other levels. Decompression will always happen in hardware (when enabled).

Two DFLTCC compression calls produce the same results only when they both are made on machines of the same generation, and when the respective buffers have the same offset relative to the start of the page. Therefore care should be taken when using hardware compression when reproducible results are desired. However, it always produces standard-conforming output, which can be inflated anyway.

The new kernel command line parameter 'dfltcc' is introduced to configure s390 zlib hardware support:

Format: { on | off | def_only | inf_only | always }

* on: s390 zlib hardware support for compression on level 1 and decompression (default)
* off: no s390 zlib hardware support
* def_only: s390 zlib hardware support for deflate only (compression on level 1)
* inf_only: s390 zlib hardware support for inflate only (decompression)
* always: same as 'on', but ignores the selected compression level and always uses hardware support (used for debugging)

The main purpose of integrating NXU support into the kernel zlib is the use of hardware deflate in the btrfs filesystem with on-the-fly compression enabled. Apart from that, hardware support can also be used during boot for decompressing the kernel or the ramdisk image.

With the patch for btrfs expanding the zlib buffer from 1 to 4 pages (patch 6), the following performance results have been achieved using the ramdisk with btrfs. These are relative numbers based on throughput rate and compression ratio for zlib level 1:

| Input data | Deflate rate (NXU/Software) | Inflate rate (NXU/Software) | Compression ratio (NXU/Software) |
| --- | --- | --- | --- |
| stream of zeroes | 1.46 | 1.02 | 1.00 |
| random ASCII data | 10.44 | 3.00 | 0.96 |
| ASCII text (dickens) | 6.21 | 3.33 | 0.94 |
| binary data (vmlinux) | 8.37 | 3.90 | 1.02 |

This means that s390 hardware deflate can provide up to 10 times faster compression (on level 1) and up to 4 times faster decompression (refers to all compression levels) for btrfs zlib.

Disclaimer: Performance results are based on IBM internal tests using the dd command-line utility on btrfs on a Fedora 30 based internal driver in native LPAR on a z15 system. Results may vary based on individual workload, configuration and software levels.

This patch (of 6):

Create zlib_dfltcc library with the s390 DEFLATE CONVERSION CALL implementation and related compression functions. Update zlib_deflate functions with the hooks for s390 hardware support and adjust workspace structures with extra parameter lists required for hardware deflate.

Link: http://lkml.kernel.org/r/20191209152948.37080-2-zaslonko@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Co-developed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

Patch series "S390 hardware support for kernel zlib", v3.

With the IBM z15 mainframe, the new DFLTCC instruction is available. It implements the deflate algorithm in hardware (Nest Acceleration Unit - NXU) with estimated compression and decompression performance orders of magnitude faster than the current zlib.

This patchset adds s390 hardware compression support to kernel zlib. The code is based on the userspace zlib implementation: madler/zlib#410. The coding style is also preserved for future maintainability. Only a limited set of userspace zlib functions is represented in the kernel. Apart from that, all memory allocation should be performed in advance. Thus, the workarea structures are extended with the parameter lists required for the DEFLATE CONVERSION CALL instruction.

Since kernel zlib itself does not support gzip headers, only the Adler-32 checksum is processed (it can also be produced by the DFLTCC facility). As implemented for userspace, kernel zlib will compress in hardware on level 1, and in software on all other levels. Decompression will always happen in hardware (when enabled).

Two DFLTCC compression calls produce the same results only when they both are made on machines of the same generation, and when the respective buffers have the same offset relative to the start of the page. Therefore care should be taken when using hardware compression when reproducible results are desired. However, it always produces standard-conforming output, which can be inflated anyway.

The new kernel command line parameter 'dfltcc' is introduced to configure s390 zlib hardware support:

Format: { on | off | def_only | inf_only | always }

* on: s390 zlib hardware support for compression on level 1 and decompression (default)
* off: no s390 zlib hardware support
* def_only: s390 zlib hardware support for deflate only (compression on level 1)
* inf_only: s390 zlib hardware support for inflate only (decompression)
* always: same as 'on', but ignores the selected compression level and always uses hardware support (used for debugging)

The main purpose of integrating NXU support into the kernel zlib is the use of hardware deflate in the btrfs filesystem with on-the-fly compression enabled. Apart from that, hardware support can also be used during boot for decompressing the kernel or the ramdisk image.

With the patch for btrfs expanding the zlib buffer from 1 to 4 pages (patch 6), the following performance results have been achieved using the ramdisk with btrfs. These are relative numbers based on throughput rate and compression ratio for zlib level 1:

| Input data | Deflate rate (NXU/Software) | Inflate rate (NXU/Software) | Compression ratio (NXU/Software) |
| --- | --- | --- | --- |
| stream of zeroes | 1.46 | 1.02 | 1.00 |
| random ASCII data | 10.44 | 3.00 | 0.96 |
| ASCII text (dickens) | 6.21 | 3.33 | 0.94 |
| binary data (vmlinux) | 8.37 | 3.90 | 1.02 |

This means that s390 hardware deflate can provide up to 10 times faster compression (on level 1) and up to 4 times faster decompression (refers to all compression levels) for btrfs zlib.

Disclaimer: Performance results are based on IBM internal tests using the dd command-line utility on btrfs on a Fedora 30 based internal driver in native LPAR on a z15 system. Results may vary based on individual workload, configuration and software levels.

This patch (of 9):

Create zlib_dfltcc library with the s390 DEFLATE CONVERSION CALL implementation and related compression functions. Update zlib_deflate functions with the hooks for s390 hardware support and adjust workspace structures with extra parameter lists required for hardware deflate.

Link: http://lkml.kernel.org/r/20200103223334.20669-2-zaslonko@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Co-developed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

iii-i · 2020-05-11T10:38:09Z

Added support for switching between software and hardware compression.
Added --dfltcc configure flag (the old way of building it still works).

iii-i · 2020-08-05T16:03:34Z

Fix missing EOBS in raw streams.
Parse environment variables and facility bits only once.

RHEL, SLES and Ubuntu for IBM zSystems (aka s390x) ship with a zlib optimization [1] that significantly improves deflate and inflate performance on this platform by using a specialized CPU instruction. This instruction not only compresses the data, but also computes a checksum. At the moment Python's gzip support performs compression and checksum calculation separately, which creates unnecessary overhead on s390x. The reason is that Python needs to write specific values into the gzip header, and when this support was introduced in 1997, there was indeed no better way to do this. Since v1.2.2.1 (2011) zlib provides inflateGetHeader() and deflateSetHeader() functions for that, so Python does not have to deal with the exact header and trailer formats anymore.

Add new interfaces to zlibmodule.c that make use of these functions:

* Add mtime argument to zlib.compress().
* Add mtime and fname arguments to zlib.compressobj().
* Add gz_header_mtime and gz_header_done properties to ZlibDecompressor.

In Python modules, replace raw streams with gzip streams, make use of the new interfaces, and remove all mentions of crc32. In addition to the new interfaces above, there is one change in behavior that users can see: for malformed gzip headers and trailers, decompression now raises zlib.error instead of BadGzipFile. However, this is allowed by today's spec.

[1] madler/zlib#410

ljavorsk · 2023-08-22T10:25:50Z

Hi, could you please rebase your patches on top of zlib 1.3, and on top of #750 once it's rebased as well?

Some s390x environments include madler/zlib#410 and a more pessimistic compressBound: (sourceLen * 16 + 2308) / 8 + 6. Let us adjust the recently enabled tests accordingly.
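
To get a feel for how much more pessimistic this bound is, the following Python sketch compares it with the formula used by stock zlib's `compressBound()` (as found in zlib 1.2.x `compress.c`). The function names are hypothetical, and the DFLTCC formula is taken verbatim from the message above.

```python
def software_compress_bound(source_len: int) -> int:
    """Upper bound used by stock zlib's compressBound()."""
    return (source_len + (source_len >> 12) + (source_len >> 14)
            + (source_len >> 25) + 13)


def dfltcc_compress_bound(source_len: int) -> int:
    """More pessimistic bound quoted above for DFLTCC-enabled zlib."""
    return (source_len * 16 + 2308) // 8 + 6


for n in (0, 1000, 4096):
    print(n, software_compress_bound(n), dfltcc_compress_bound(n))
```

Tests that size an output buffer from a hard-coded stock bound would need to allow for the larger value on such systems.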

Optimized functions for Power will make use of GNU indirect functions, an extension to support different implementations of the same function, which can be selected during runtime. This will be used to provide optimized functions for different processor versions. Since this is a GNU extension, we placed the definition of the Z_IFUNC macro under `contrib/gcc`. This can be reused by other archs as well. Author: Matheus Castanho <msc@linux.ibm.com> Author: Rogerio Alves <rcardoso@linux.ibm.com> Signed-off-by: Manjunath Matti <mmatti@linux.ibm.com>

This commit adds an optimized version for the crc32 function based on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com> It makes use of vector instructions to speed up CRC32 algorithm. Author: Rogerio Alves <rcardoso@linux.ibm.com> Signed-off-by: Manjunath Matti <mmatti@linux.ibm.com>

Clang 7 changed the behavior of vec_xxpermdi in order to match GCC's behavior. After this change, code that used to work on Clang 6 stopped working on Clang >= 7. Tested on Clang 6, 7, 8 and 9. Reference: https://bugs.llvm.org/show_bug.cgi?id=38192 Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com> Signed-off-by: Manjunath Matti <mmatti@linux.ibm.com>

Use vector extensions when compiling for s390x and binutils knows about them. At runtime, check whether kernel supports vector extensions (it has to be not just the CPU, but also the kernel) and choose between the regular and the vectorized implementations.

iii-i · 2023-09-25T12:16:55Z

Rebased.

This commit fixes the test failures with the zlib shipped in Ubuntu jammy s390x. According to <https://packages.ubuntu.com/jammy-updates/zlib1g> (`zlib_1.2.11.dfsg-2ubuntu9.2.debian.tar.xz`), the zlib deb package applies the patch 410.patch (madler/zlib#410) and is configured with `./configure --dfltcc` on Ubuntu jammy s390x. The `--dfltcc` option enables the deflate algorithm in hardware. It produces a different (but still valid) compressed byte stream, which causes the test failures in ruby/zlib. As a workaround, set the environment variable `DFLTCC=0` for the failing tests to disable the hardware implementation in zlib on s390x. Note that these tests need to run in a child Ruby process via `assert_separately`, so that they run with the `DFLTCC=0` set by the parent Ruby process. ruby/zlib@9f3b9c470c
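
The child-process workaround described above can be sketched as follows. This is an illustration in Python rather than Ruby, and `run_with_dfltcc_disabled` is a hypothetical helper name, not something from ruby/zlib. It assumes, as the commit message implies, that the variable has to be present in the environment of the process that actually uses zlib.

```python
import os
import subprocess
import sys


def run_with_dfltcc_disabled(code):
    """Run a Python snippet in a child process whose environment has DFLTCC=0.

    Hypothetical helper for illustration; the parent's environment is copied
    and only the DFLTCC variable is overridden.
    """
    env = dict(os.environ, DFLTCC="0")
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return result.stdout


if __name__ == "__main__":
    # The child sees the variable set by the parent.
    print(run_with_dfltcc_disabled("import os; print(os.environ['DFLTCC'])"))
```

On a machine without the patched zlib the variable is simply ignored, so the same test invocation can be used on every platform.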

Upgrade the Ubuntu version in use from 20.04 (Focal) to 22.04 (Jammy), aligning with the RubyCI "s390x (Ubuntu)" server. https://rubyci.org/ Note that Travis CI supports Ubuntu 22.04 (Jammy). https://docs.travis-ci.com/user/reference/jammy/ Set the `DFLTCC=0` environment variable as a workaround to avoid the zlib-related test failures in `make test-all` and `make test-spec`. The failures can happen when the zlib library applies the patch madler/zlib#410, which enables the hardware deflate algorithm and produces a different compressed byte stream.

junaruga · 2023-10-02T15:33:02Z

Hello, I am working on the Ruby language project. Let me just share our situation. Since I observed test failures in the Ruby zlib library caused by this patch on Ubuntu jammy (22.04) s390x, we are tracking the issue in the following ticket. We are running the tests with DFLTCC=0 as a workaround for now.

https://bugs.ruby-lang.org/issues/19909

iii-i · 2023-10-09T09:09:00Z

Thanks for sharing the bugtracker link. All the failures look expected to me, since the tests check for exact compressed byte sequences or lengths. This may also happen if one uses a different version or implementation of zlib, such as zlib-ng. Long-term I think it would be beneficial to rework the tests to try decompressing the data instead.
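
A minimal sketch of such a test, written in Python for illustration (the Ruby tests themselves are not shown in this thread, and `assert_round_trip` is a hypothetical helper name): instead of comparing against a fixed byte sequence or length, it only checks that decompression restores the input.

```python
import zlib


def assert_round_trip(data, level=6):
    """Compress, decompress, and check that the original data comes back.

    The exact compressed bytes are deliberately not inspected, since they may
    differ between zlib versions and implementations.
    """
    compressed = zlib.compress(data, level)
    assert zlib.decompress(compressed) == data
    return compressed


if __name__ == "__main__":
    assert_round_trip(b"hello, world\n" * 100)
    assert_round_trip(b"")
```

A test written this way should pass regardless of whether the stream came from software zlib, zlib-ng, or the hardware path.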

RHEL, SLES and Ubuntu for IBM zSystems (aka s390x) ship with a zlib optimization [1] that significantly improves deflate and inflate performance on this platform by using a specialized CPU instruction. This instruction not only compresses the data, but also computes a checksum. At the moment Python's gzip support performs compression and checksum calculation separately, which creates unnecessary overhead on s390x. The reason is that Python needs to write specific values into the gzip header; and when this support was introduced in 1997, there was indeed no better way to do this.

Since v1.2.2.1 (2004) zlib provides the inflateGetHeader() and deflateSetHeader() functions for that, so Python does not have to deal with the exact header and trailer formats anymore.

Add the new interfaces to zlibmodule.c that make use of these functions:

* Add mtime argument to zlib.compress().
* Add mtime and fname arguments to zlib.compressobj().
* Add gz_header_mtime and gz_header_done properties to ZlibDecompressor.

In Python modules, replace raw streams with gzip streams, make use of the new interfaces, and remove all mentions of crc32.

[1] madler/zlib#410
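
To illustrate the two approaches being contrasted, here is a rough sketch using only the existing `zlib` module API. The function names `gzip_manual` and `gzip_by_zlib` are made up for this example and the header is the simplest possible one; it is not the code from the proposed change. The first function frames a raw deflate stream by hand and computes the CRC-32 separately, while the second lets zlib produce the whole gzip stream, including the checksum in the trailer.

```python
import gzip
import struct
import zlib


def gzip_manual(data, mtime=0):
    """Raw deflate stream with a hand-written gzip header and trailer."""
    # Magic bytes, CM=8 (deflate), no flags, MTIME, XFL=0, OS=255 (unknown).
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0, mtime, 0, 255)
    comp = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw stream
    body = comp.compress(data) + comp.flush()
    # Checksum is computed in a separate pass over the input.
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF,
                          len(data) & 0xFFFFFFFF)
    return header + body + trailer


def gzip_by_zlib(data):
    """Let zlib emit header, deflate data, and trailer in one stream."""
    comp = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip
    return comp.compress(data) + comp.flush()


if __name__ == "__main__":
    payload = b"example payload " * 64
    assert gzip.decompress(gzip_manual(payload)) == payload
    assert gzip.decompress(gzip_by_zlib(payload)) == payload
```

Both outputs are valid gzip streams and end with the same 8-byte trailer (CRC-32 and input length); the difference is only in who computes the checksum.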

@rhpvorderman

RHEL, SLES and Ubuntu for IBM zSystems (aka s390x) ship with a zlib optimization [1] that significantly improves deflate performance by using a specialized CPU instruction. This instruction not only compresses the data, but also computes a checksum. At the moment Python's gzip support performs compression and checksum calculation separately, which creates unnecessary overhead. The reason is that Python needs to write specific values into the gzip header, so it uses a raw stream instead of a gzip stream, and zlib does not compute a checksum for raw streams.

The challenge with using gzip streams instead of raw streams is dealing with the zlib-generated gzip header, which we would rather generate manually. Implement the method proposed by @rhpvorderman: use Z_BLOCK on the first deflate() call in order to stop before the first deflate block is emitted. The data that is emitted up until this point is the zlib-generated gzip header, which should be discarded.

Expose this new functionality by adding a boolean gzip_trailer argument to zlib.compress() and zlib.compressobj(). Make use of it in gzip.compress() and GzipFile. The performance improvement varies depending on the data being compressed, but it's in the ballpark of 40%.

An alternative approach is to use the deflateSetHeader() function, introduced in zlib v1.2.2.1 (2004). This also works, but the change was deemed too intrusive [2].

[1] madler/zlib#410
[2] python#103478

@rhpvorderman

RHEL, SLES and Ubuntu for IBM zSystems (aka s390x) ship with a zlib optimization [1] that significantly improves deflate performance by using a specialized CPU instruction. This instruction not only compresses the data, but also computes a checksum. At the moment Python's gzip support performs compression and checksum calculation separately, which creates unnecessary overhead. The reason is that Python needs to write specific values into the gzip header, so it uses a raw stream instead of a gzip stream, and zlib does not compute a checksum for raw streams.

The challenge with using gzip streams instead of raw streams is dealing with the zlib-generated gzip header, which we would rather generate manually. Implement the method proposed by @rhpvorderman: use Z_BLOCK on the first deflate() call in order to stop before the first deflate block is emitted. The data that is emitted up until this point is the zlib-generated gzip header, which should be discarded.

Expose this new functionality by adding a boolean gzip_trailer argument to zlib.compress() and zlib.compressobj(). Make use of it in gzip.compress(), GzipFile and TarFile. The performance improvement varies depending on the data being compressed, but it's in the ballpark of 40%.

An alternative approach is to use the deflateSetHeader() function, introduced in zlib v1.2.2.1 (2004). This also works, but the change was deemed too intrusive [2].

[1] madler/zlib#410
[2] python#103478

Many distros ship zlib with the IBM Z deflate hardware acceleration patch [1]. Sometimes it's desirable to disable the acceleration, for example, for reproducibility. This can be done by exporting DFLTCC=0.

llvm-lit clears this environment variable, which causes compress-debug-sections-zlib.test to fail on z15 and later machines. Add DFLTCC to the list of variables to keep.

[1] madler/zlib#410

Reviewed By: abrachet

Differential Revision: https://reviews.llvm.org/D130253

iii-i mentioned this pull request Mar 26, 2019

Add support for IBM Z hardware-accelerated deflate zlib-ng/zlib-ng#330

Merged

iii-i force-pushed the dfltcc branch from 0fde796 to 5d3c036 Compare May 6, 2019 09:07

iii-i force-pushed the dfltcc branch from 5d3c036 to 1d9542d Compare May 15, 2019 11:10

iii-i force-pushed the dfltcc branch from 1d9542d to 305e427 Compare June 4, 2019 10:30

iii-i force-pushed the dfltcc branch from 305e427 to 230b515 Compare July 8, 2019 13:07

iii-i force-pushed the dfltcc branch from 230b515 to 91ccefa Compare May 11, 2020 10:36

iii-i force-pushed the dfltcc branch from 91ccefa to aff7084 Compare August 5, 2020 16:01

dr-m added a commit to MariaDB/server that referenced this pull request Sep 11, 2023

MDEV-21679 fixup for s390x

ef569c3

Some s390x environments include madler/zlib#410 and a more pessimistic compressBound: (sourceLen * 16 + 2308) / 8 + 6. Let us adjust the recently enabled tests accordingly.
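
For a sense of scale, the bound quoted above can be compared with the one used by unpatched zlib. The sketch below is plain arithmetic in Python; the function names are hypothetical, the DFLTCC formula is taken from the commit message, and the stock formula is quoted from memory of zlib's `compressBound()`, so treat it as an assumption.

```python
def stock_compress_bound(source_len):
    """compressBound() as in unpatched zlib (assumed from zlib's compress.c)."""
    return (source_len + (source_len >> 12) + (source_len >> 14)
            + (source_len >> 25) + 13)


def dfltcc_compress_bound(source_len):
    """The more pessimistic bound quoted in the commit message above."""
    return (source_len * 16 + 2308) // 8 + 6  # integer division, as in C


if __name__ == "__main__":
    for n in (0, 1000, 1 << 20):
        print(n, stock_compress_bound(n), dfltcc_compress_bound(n))
```

With these formulas the pessimistic bound is roughly twice the input size, which is why tests that hard-code the stock bound need adjusting.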

Manjunath S Matti added 3 commits September 15, 2023 11:38

junaruga mentioned this pull request Sep 15, 2023

Ubuntu jammy s390x: Test failures ruby/zlib#60

Closed

iii-i added 2 commits September 18, 2023 11:59

s390x: vectorize crc32

559c8ee

Use vector extensions when compiling for s390x and binutils knows about them. At runtime, check whether kernel supports vector extensions (it has to be not just the CPU, but also the kernel) and choose between the regular and the vectorized implementations.

iii-i force-pushed the dfltcc branch from cce6624 to 481ee63 Compare September 25, 2023 12:15

junaruga mentioned this pull request Sep 25, 2023

Workaround: Fix test failures on Ubuntu jammy s390x. ruby/zlib#63

Merged

iii-i mentioned this pull request Nov 17, 2023

gh-103477: Write gzip trailer with zlib python/cpython#112199

Open
