Add optimized variant of CMN for HWC to HWC case #4992

klecki · 2023-08-11T17:27:07Z

Category: New feature, Refactoring

Description:

This PR generalizes the optimized variants of Hwc2Chw kernels, by extracting the loading (from gmem to smem) and writing the output (from smem to gmem) as separate function, that are used as common parts between kernels.
As input layout is the same, the same loading (and cropping) can be applied.
The output writing for CHW and HWC are different, but they stay the same between the cropping and no-cropping variant.

The sketch of the kernel is again described in the docstring.

In a followup, even more specialized version for HWC->HWC+pad for fp16 output will be provided.

For HWC->HWC planar storage of the tile in shared memory can be further evaluated.

This version provides up to 2x speedups when running as the only operator within pipeline (for non-slicing cases).

TODO: measure the impact of calculating the outputs for FP16 in FP16 values or FP32 values.
Rough benchmarks show no observable difference.

Additional information:

Affected modules and functionalities:

CMN, SFN, etc

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki · 2023-08-11T19:01:34Z

!build

dali-automaton · 2023-08-11T19:05:54Z

CI MESSAGE: [9336336]: BUILD STARTED

dali-automaton · 2023-08-11T20:54:41Z

CI MESSAGE: [9336336]: BUILD PASSED

szalpal · 2023-08-17T10:29:05Z

dali/kernels/slice/slice_hwc2chw_normalize_gpu.h

+ * @brief Specialized version of SliceFlipNormalize that reads a HWC u8 image (with 3 channels)
+ * and outputs a HWC or CHW normalized float image, that can be cropped in Y, X coordinates,
+ * mirrored in X coordinate, and the channels can be padded.
 *
- * Optionally allows for cropping the input in y, x (HW) coordinates, flipping in x (W) coordinate
- * and padding the channels to the multiple of 2.
+ * Cropping the input in y, x (HW) coordinates, flipping in x (W) coordinate
+ * and padding the channels (from 3 to 4 in HWC->HWC variant) are optional, optimized implementation
+ * will be selected when those features are not used across the batch.
 *
- * The input is assumed to be u8.
+ * Overview of the kernel:
+ * The image is processed in flattened coordinates. The Y, X stays the same between the interleaved
+ * input layout and planar output layout. Assuming 3-channel input, we can look at the input as
+ * a sequential stream of values, where we distribute them (sequentially) into 3 output planes.
+ * Use a thread block size, that is divisible both by channel number (for the output loop),
+ * and 4 (for input loop).
+ * The processing steps:
+ * 1. [Input loop] Load the linear chunk of input into shared memory, utilizing 4-byte aligned loads
+ *    and cast it to float.
+ *   a. Unaligned prologue loop - reads the first chunk till we get to address that is aligned with
+ *      32 * 4.
+ *   b. Main loop - do as many aligned 4byte reads as possible
+ *   c. Epilogue loop - read the remaining values that were not possible to read as one 4byte read.
+ * 2. Synchronize
+ * 3. [Output loop] Each thread corresponds to a (Y, X) sequential offset into a plane, computes
+ *    the values for all the channels and writes them.
+ *   a. Optionally, mirroring is performed by inverting the X-coordinate in the output offset.
+ *   b. Padding the output channels is performed by filling additional planes with fill values.
 *
- * @tparam Out output type
+ *
+ * @tparam Out output type - fp16 and fp32 allowed.


👍🏻 For the documentation

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki · 2023-08-18T08:59:39Z

!build

dali-automaton · 2023-08-18T09:05:19Z

CI MESSAGE: [9422606]: BUILD STARTED

dali-automaton · 2023-08-18T11:20:32Z

CI MESSAGE: [9422606]: BUILD FAILED

dali-automaton · 2023-08-21T11:55:23Z

CI MESSAGE: [9422606]: BUILD PASSED

This commit generalizes the optimized variants of Hwc2Chw kernels, by extracting the loading (from gmem to smem) and writing the output (from smem to gmem) as separate functions, that are used as common parts between kernels. As input layout is the same, the same loading (and cropping) can be applied. The output writing for CHW and HWC are different, but they stay the same between the cropping and no-cropping variant. The sketch of the kernel is described in the docstring. For HWC->HWC planar storage of the tile in shared memory can be further evaluated. This version provides up to 2x speedups when running as the only operator within pipeline (for non-slicing cases). The computations are done in float as in the original kernel, as the benchmarks shown no difference compared to using fp16. Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki force-pushed the hwc2hwc_planar branch from b61b4e7 to e562a46 Compare August 11, 2023 18:27

Add HWC -> HWC optimized variants for CMN

fc3fdf2

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki force-pushed the hwc2hwc_planar branch from e562a46 to fc3fdf2 Compare August 11, 2023 18:36

klecki changed the title ~~Add optimized Hwc2Hwc variants for CMN~~ Add optimized variant of CMN for HWC to HWC case Aug 11, 2023

klecki marked this pull request as ready for review August 11, 2023 18:44

jantonguirao assigned awolant and szalpal Aug 14, 2023

szalpal reviewed Aug 17, 2023

View reviewed changes

Use compute in float, same as original kernel

fdbefb3

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki force-pushed the hwc2hwc_planar branch from d642dd0 to fdbefb3 Compare August 17, 2023 12:43

awolant approved these changes Aug 17, 2023

View reviewed changes

szalpal approved these changes Aug 22, 2023

View reviewed changes

klecki merged commit 301d1a6 into NVIDIA:main Aug 22, 2023
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optimized variant of CMN for HWC to HWC case #4992

Add optimized variant of CMN for HWC to HWC case #4992

klecki commented Aug 11, 2023 •

edited

Loading

klecki commented Aug 11, 2023

dali-automaton commented Aug 11, 2023

dali-automaton commented Aug 11, 2023

szalpal Aug 17, 2023

klecki commented Aug 18, 2023

dali-automaton commented Aug 18, 2023

dali-automaton commented Aug 18, 2023

dali-automaton commented Aug 21, 2023

Add optimized variant of CMN for HWC to HWC case #4992

Add optimized variant of CMN for HWC to HWC case #4992

Conversation

klecki commented Aug 11, 2023 • edited Loading

Category: New feature, Refactoring

Description:

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

klecki commented Aug 11, 2023

dali-automaton commented Aug 11, 2023

dali-automaton commented Aug 11, 2023

szalpal Aug 17, 2023

Choose a reason for hiding this comment

klecki commented Aug 18, 2023

dali-automaton commented Aug 18, 2023

dali-automaton commented Aug 18, 2023

dali-automaton commented Aug 21, 2023

klecki commented Aug 11, 2023 •

edited

Loading