Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize audio resampling for ARM NEON. #3745

Merged
merged 4 commits into from
Mar 21, 2022

Conversation

mzient
Copy link
Contributor

@mzient mzient commented Mar 17, 2022

Signed-off-by: Michał Zientkiewicz mzient@gmail.com

Category:

Other Performance optimization (main purpose)
Bug fix (additional, in the same file)

Description:

Implements vectorized single-channel audio resampling for ARM NEON.

Additional information:

Vectorization is done differently than on SSE (pairwise loads are faster on ARM).
Other changes include improved handling of floor function.
Bug fix: SSE implementation used rounding instead of truncation - fixed.

Affected modules and functionalities:

Audio resampling (audio decoder).

Key points relevant for the review:

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2651

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@JanuszL JanuszL self-assigned this Mar 17, 2022
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [4171295]: BUILD STARTED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
vgetq_lane_s32(i, 2),
vgetq_lane_s32(i, 3)
};
float32x2_t c0 = vld1_f32(&lookup[idx[0]]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

f4 = vfmaq_f32(f4, vld1q_f32(in_block_ptr + i), w4);
x4 = vaddq_f32(x4, vdupq_n_f32(4));
}
// Reduce elements in f4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add the same comment in L214 - it was unclear for me why is happening there.

float f = 0;
int i = i0;

#ifdef __ARM_NEON
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can extract this vectorized parts into separate functions, instead having a lot of variants for different architectures in one body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I'd have to re-evaluate the performance. There's quite a bit of variables that are modified in the loop and they all would need to be passed by reference.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [4171320]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [4173226]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [4173226]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [4180682]: BUILD STARTED

int i = i_ref;
float32x4_t x4 = vaddq_f32(vdupq_n_f32(i - in_pos), _0123);

for (; i + 3 <= i1; i += 4) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you change that as well like in L174?

Copy link
Contributor Author

@mzient mzient Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Fixed.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [4180791]: BUILD STARTED

@JanuszL JanuszL self-requested a review March 18, 2022 16:26
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [4180791]: BUILD PASSED

#ifdef __ARM_NEON

inline float32x4_t vsetq_f32(float x0, float x1, float x2, float x3) {
float32x4_t x;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very little nitpick, could be float32x4_t x = vdubpq_n_f32(x0);

Copy link
Contributor

@prak-nv prak-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, asked out of curiosity

#ifdef __SSE2__
inline __m128 operator()(__m128 x) const {
__m128 fi = _mm_add_ps(x * _mm_set1_ps(scale), _mm_set1_ps(center));
__m128i i = _mm_cvtps_epi32(fi);
__m128i i = _mm_cvttps_epi32(fi);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was doing conversion without truncate an issue before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was a bug - I detected it now that there's a slight difference between ARM and SSE and dug deeper to find out that it's SSE that's wrong.

@mzient mzient merged commit 0fdc119 into NVIDIA:main Mar 21, 2022
@JanuszL JanuszL mentioned this pull request Mar 30, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
* Vectorize audio resampling for ARM NEON.
* Fix rounding mode in SSE vectorization.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
* Vectorize audio resampling for ARM NEON.
* Fix rounding mode in SSE vectorization.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants