Add optimized variant of CMN for HWC to HWC case #4992
Merged
Conversation
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
klecki changed the title from "Add optimized Hwc2Hwc variants for CMN" to "Add optimized variant of CMN for HWC to HWC case" on Aug 11, 2023
!build
CI MESSAGE: [9336336]: BUILD STARTED
CI MESSAGE: [9336336]: BUILD PASSED
szalpal
reviewed
Aug 17, 2023
Comment on lines +36 to +64
 * @brief Specialized version of SliceFlipNormalize that reads an HWC u8 image (with 3 channels)
 * and outputs an HWC or CHW normalized float image.
 *
 * Cropping the input in y, x (HW) coordinates, flipping in the x (W) coordinate,
 * and padding the channels (from 3 to 4 in the HWC->HWC variant) are optional; an optimized
 * implementation is selected when those features are not used across the batch.
 *
 * The input is assumed to be u8.
 *
 * Overview of the kernel:
 * The image is processed in flattened coordinates. The Y, X extents stay the same between the
 * interleaved input layout and the planar output layout. Assuming 3-channel input, we can view
 * the input as a sequential stream of values that we distribute (sequentially) into 3 output
 * planes. A thread block size is used that is divisible both by the channel count (for the
 * output loop) and by 4 (for the input loop).
 *
 * The processing steps:
 * 1. [Input loop] Load a linear chunk of input into shared memory using 4-byte aligned loads,
 *    casting it to float:
 *    a. Unaligned prologue loop - reads the first chunk until we reach an address aligned to
 *       32 * 4.
 *    b. Main loop - performs as many aligned 4-byte reads as possible.
 *    c. Epilogue loop - reads the remaining values that could not be covered by a 4-byte read.
 * 2. Synchronize.
 * 3. [Output loop] Each thread corresponds to a (Y, X) sequential offset into a plane; it
 *    computes the values for all the channels and writes them:
 *    a. Optionally, mirroring is performed by inverting the X-coordinate in the output offset.
 *    b. Channel padding is performed by filling the additional planes with fill values.
 *
 * @tparam Out output type - fp16 and fp32 allowed.
👍🏻 For the documentation
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
awolant
approved these changes
Aug 17, 2023
!build
CI MESSAGE: [9422606]: BUILD STARTED
CI MESSAGE: [9422606]: BUILD FAILED
CI MESSAGE: [9422606]: BUILD PASSED
szalpal
approved these changes
Aug 22, 2023
JanuszL pushed a commit to JanuszL/DALI that referenced this pull request on Oct 13, 2023
This commit generalizes the optimized variants of the Hwc2Chw kernels by extracting the loading (from gmem to smem) and the writing of the output (from smem to gmem) as separate functions that are used as common parts between kernels. As the input layout is the same, the same loading (and cropping) can be applied. The output writing for CHW and HWC differs, but stays the same between the cropping and no-cropping variants. The sketch of the kernel is described in the docstring. For HWC->HWC, planar storage of the tile in shared memory can be evaluated further. This version provides up to 2x speedups when running as the only operator within the pipeline (for non-slicing cases). The computations are done in float as in the original kernel, as benchmarks showed no difference compared to using fp16.
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
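The prologue/main/epilogue split used by the extracted loading function (aligned 4-byte reads of the u8 stream, cast to float) can be illustrated host-side. This is a sketch only: the function name is hypothetical, the aligned 32-bit load is simulated with `memcpy`, and the byte extraction assumes a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (not the actual DALI device code) of loading a byte
 * stream with aligned 4-byte reads: single-byte reads until the pointer is
 * 4-byte aligned (prologue), then whole 32-bit words (main loop), then the
 * leftover tail bytes (epilogue). */
void load_as_float(const uint8_t *src, int n, float *dst) {
  int i = 0;
  /* Prologue: byte reads until src + i is 4-byte aligned. */
  while (i < n && ((uintptr_t)(src + i) & 3u) != 0) {
    dst[i] = (float)src[i];
    i++;
  }
  /* Main loop: as many aligned 4-byte reads as possible. */
  for (; i + 4 <= n; i += 4) {
    uint32_t word;
    memcpy(&word, src + i, 4);  /* stands in for one aligned 32-bit load */
    /* Unpack the 4 bytes (little-endian assumed for this illustration). */
    dst[i + 0] = (float)(word & 0xFFu);
    dst[i + 1] = (float)((word >> 8) & 0xFFu);
    dst[i + 2] = (float)((word >> 16) & 0xFFu);
    dst[i + 3] = (float)((word >> 24) & 0xFFu);
  }
  /* Epilogue: remaining bytes that don't fill a full word. */
  for (; i < n; i++)
    dst[i] = (float)src[i];
}
```

In the actual kernel the same pattern runs per thread block against global memory, which is why the block size must divide evenly into the 4-byte input loop.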
Category: New feature, Refactoring
Description:
This PR generalizes the optimized variants of the Hwc2Chw kernels by extracting the loading (from gmem to smem) and the writing of the output (from smem to gmem) as separate functions that are used as common parts between kernels.
As input layout is the same, the same loading (and cropping) can be applied.
The output writing for CHW and HWC are different, but they stay the same between the cropping and no-cropping variant.
The sketch of the kernel is again described in the docstring.
In a followup, even more specialized version for HWC->HWC+pad for fp16 output will be provided.
For HWC->HWC, planar storage of the tile in shared memory can be evaluated further.
This version provides up to 2x speedups when running as the only operator within the pipeline (for non-slicing cases).
TODO: measure the impact of calculating FP16 outputs in FP16 vs FP32 values.
Rough benchmarks show no observable difference.
Additional information:
Affected modules and functionalities:
CMN, SFN, etc
Key points relevant for the review:
Tests:
The tests covering the possible parameters were extended with the layout argument and compare against the CPU implementation.
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A