
Add optimized variant of CMN for HWC to HWC case #4992

Merged: 2 commits into NVIDIA:main on Aug 22, 2023

Conversation

@klecki (Contributor) commented on Aug 11, 2023

Category: New feature, Refactoring

Description:

This PR generalizes the optimized variants of the Hwc2Chw kernels by extracting the loading (from gmem to smem) and the writing of the output (from smem to gmem) into separate functions that are used as common parts of the kernels.
As the input layout is the same, the same loading (and cropping) logic can be applied.
The output writing for CHW and HWC differs, but it stays the same between the cropping and non-cropping variants.
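A minimal sketch of this factoring is shown below. All identifiers here (LoadTile, StoreChw, StoreHwc, kTileSize) are invented for illustration and are not DALI's actual names; cropping, mirroring, normalization, and channel padding are omitted.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Illustrative constants: the tile length is divisible both by the channel
// count (for the output loop) and by 4 (for the vectorized input loop).
constexpr int kChannels = 3;
constexpr int kTileSize = 256 * kChannels * 4;

// Common load stage: gmem (interleaved u8) -> smem (float).
__device__ void LoadTile(float *tile, const uint8_t *in, int64_t base, int64_t n) {
  for (int i = threadIdx.x; i < kTileSize && base + i < n; i += blockDim.x)
    tile[i] = in[base + i];  // implicit u8 -> float conversion
}

// CHW store stage: de-interleave the tile into output planes.
template <typename Out>
__device__ void StoreChw(Out *out, const float *tile, int64_t base, int64_t n,
                         int64_t plane_size) {
  for (int i = threadIdx.x; i * kChannels < kTileSize && base + i * kChannels < n;
       i += blockDim.x) {
    int64_t xy = base / kChannels + i;  // flattened (Y, X) offset within a plane
    for (int c = 0; c < kChannels; c++)
      out[c * plane_size + xy] = tile[i * kChannels + c];
  }
}

// HWC store stage: the layout is preserved, so the copy stays linear.
template <typename Out>
__device__ void StoreHwc(Out *out, const float *tile, int64_t base, int64_t n) {
  for (int i = threadIdx.x; i < kTileSize && base + i < n; i += blockDim.x)
    out[base + i] = tile[i];
}

// The two kernels differ only in the store stage they compose with.
template <typename Out>
__global__ void Hwc2ChwSketch(Out *out, const uint8_t *in, int64_t n, int64_t plane_size) {
  __shared__ float tile[kTileSize];
  int64_t base = static_cast<int64_t>(blockIdx.x) * kTileSize;
  LoadTile(tile, in, base, n);
  __syncthreads();
  StoreChw(out, tile, base, n, plane_size);
}

template <typename Out>
__global__ void Hwc2HwcSketch(Out *out, const uint8_t *in, int64_t n) {
  __shared__ float tile[kTileSize];
  int64_t base = static_cast<int64_t>(blockIdx.x) * kTileSize;
  LoadTile(tile, in, base, n);
  __syncthreads();
  StoreHwc(out, tile, base, n);
}
```

With this split, the HWC->HWC path only needs its own store stage; the load (and cropping) stage is reused as-is.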

The sketch of the kernel is again described in the docstring.

In a follow-up, an even more specialized version for the HWC->HWC+pad case with fp16 output will be provided.

For HWC->HWC, planar storage of the tile in shared memory can be evaluated further.

This version provides up to 2x speedups when running as the only operator within the pipeline (for non-slicing cases).

TODO: measure the impact of calculating FP16 outputs in FP16 vs FP32 arithmetic.
Rough benchmarks show no observable difference.
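For illustration, the two alternatives from the TODO could look as follows; the helper names are hypothetical, and mean/inv_stddev stand in for the kernel's normalization arguments:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Option 1: do the arithmetic in fp32 and convert only on the store,
// as the original kernel does.
__device__ __half NormalizeInF32(uint8_t v, float mean, float inv_stddev) {
  return __float2half((static_cast<float>(v) - mean) * inv_stddev);
}

// Option 2: convert early and do the arithmetic in fp16
// (half-precision intrinsics require sm_53 or newer).
__device__ __half NormalizeInF16(uint8_t v, __half mean, __half inv_stddev) {
  return __hmul(__hsub(__ushort2half_rn(v), mean), inv_stddev);
}
```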

Additional information:

Affected modules and functionalities:

CMN (CropMirrorNormalize), SFN (SliceFlipNormalize), etc.

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
The tests covering the possible parameters were extended with the layout parameter; they compare against the CPU implementation.
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki changed the title from "Add optimized Hwc2Hwc variants for CMN" to "Add optimized variant of CMN for HWC to HWC case" on Aug 11, 2023
@klecki marked this pull request as ready for review on August 11, 2023 18:44
@klecki (Contributor, Author) commented on Aug 11, 2023

!build

@dali-automaton (Collaborator)

CI MESSAGE: [9336336]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [9336336]: BUILD PASSED

Comment on lines +36 to +64

   * @brief Specialized version of SliceFlipNormalize that reads a HWC u8 image (with 3 channels)
   * and outputs a HWC or CHW normalized float image, that can be cropped in Y, X coordinates,
   * mirrored in X coordinate, and the channels can be padded.
   *
-  * Optionally allows for cropping the input in y, x (HW) coordinates, flipping in x (W) coordinate
-  * and padding the channels to the multiple of 2.
+  * Cropping the input in y, x (HW) coordinates, flipping in x (W) coordinate
+  * and padding the channels (from 3 to 4 in the HWC->HWC variant) are optional; an optimized
+  * implementation will be selected when those features are not used across the batch.
   *
   * The input is assumed to be u8.
   * Overview of the kernel:
   * The image is processed in flattened coordinates. The Y, X stays the same between the interleaved
   * input layout and planar output layout. Assuming 3-channel input, we can look at the input as
   * a sequential stream of values, where we distribute them (sequentially) into 3 output planes.
   * Use a thread block size that is divisible both by the channel number (for the output loop)
   * and by 4 (for the input loop).
   * The processing steps:
   * 1. [Input loop] Load the linear chunk of input into shared memory, utilizing 4-byte aligned
   *    loads, and cast it to float.
   *    a. Unaligned prologue loop - reads the first chunk until we get to an address that is
   *       aligned with 32 * 4.
   *    b. Main loop - do as many aligned 4-byte reads as possible.
   *    c. Epilogue loop - read the remaining values that could not be read as one 4-byte read.
   * 2. Synchronize.
   * 3. [Output loop] Each thread corresponds to a (Y, X) sequential offset into a plane, computes
   *    the values for all the channels and writes them.
   *    a. Optionally, mirroring is performed by inverting the X coordinate in the output offset.
   *    b. Padding the output channels is performed by filling additional planes with fill values.
   *
-  * @tparam Out output type
+  * @tparam Out output type - fp16 and fp32 allowed.
A Member commented:

👍🏻 For the documentation
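As a companion to the docstring quoted above, here is a minimal sketch of the three-phase input loop (step 1). The function name is invented, the alignment target is simplified to 4 bytes rather than the 32 * 4 mentioned in the docstring, and cropping is omitted.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Sketch only: load `count` u8 values starting at in + base into smem as float,
// using vectorized uchar4 reads for the aligned middle of the range.
__device__ void LoadTileAligned(float *tile, const uint8_t *in, int64_t base, int64_t count) {
  const uint8_t *src = in + base;
  // a. Prologue: scalar reads until the source address is 4-byte aligned.
  int64_t prologue = (4 - (reinterpret_cast<uintptr_t>(src) & 3)) & 3;
  if (prologue > count) prologue = count;
  for (int64_t i = threadIdx.x; i < prologue; i += blockDim.x)
    tile[i] = src[i];
  // b. Main loop: as many aligned 4-byte (uchar4) reads as possible.
  const uchar4 *src4 = reinterpret_cast<const uchar4 *>(src + prologue);
  int64_t main_count = (count - prologue) / 4;
  for (int64_t i = threadIdx.x; i < main_count; i += blockDim.x) {
    uchar4 v = src4[i];
    int64_t o = prologue + i * 4;
    tile[o + 0] = v.x;
    tile[o + 1] = v.y;
    tile[o + 2] = v.z;
    tile[o + 3] = v.w;
  }
  // c. Epilogue: scalar reads for the values that did not fit a 4-byte read.
  for (int64_t i = prologue + main_count * 4 + threadIdx.x; i < count; i += blockDim.x)
    tile[i] = src[i];
}
```

After a __syncthreads(), the output loop (step 3) reads the converted values back from shared memory and writes each plane, as in the store-stage sketch in the PR description above.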

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor, Author) commented on Aug 18, 2023

!build

@dali-automaton (Collaborator)

CI MESSAGE: [9422606]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [9422606]: BUILD FAILED

@dali-automaton (Collaborator)

CI MESSAGE: [9422606]: BUILD PASSED

@klecki klecki merged commit 301d1a6 into NVIDIA:main Aug 22, 2023
5 checks passed
JanuszL pushed a commit to JanuszL/DALI that referenced this pull request Oct 13, 2023
This commit generalizes the optimized variants of the Hwc2Chw kernels by extracting the loading
(from gmem to smem) and the writing of the output (from smem to gmem) into separate functions
that are used as common parts of the kernels.
As the input layout is the same, the same loading (and cropping) logic can be applied.
The output writing for CHW and HWC differs, but it stays the same between
the cropping and non-cropping variants.

The sketch of the kernel is described in the docstring.

For HWC->HWC, planar storage of the tile in shared memory can be evaluated further.

This version provides up to 2x speedups when running as the only operator within the pipeline
(for non-slicing cases).

The computations are done in float as in the original kernel, as the benchmarks showed
no difference compared to using fp16.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>