-
Notifications
You must be signed in to change notification settings - Fork 615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize audio resampling for ARM NEON. #3745
Conversation
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [4171295]: BUILD STARTED |
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
vgetq_lane_s32(i, 2), | ||
vgetq_lane_s32(i, 3) | ||
}; | ||
float32x2_t c0 = vld1_f32(&lookup[idx[0]]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
dali/kernels/signal/resampling.h
Outdated
f4 = vfmaq_f32(f4, vld1q_f32(in_block_ptr + i), w4); | ||
x4 = vaddq_f32(x4, vdupq_n_f32(4)); | ||
} | ||
// Reduce elements in f4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add the same comment in L214 - it was unclear for me why is happening there.
dali/kernels/signal/resampling.h
Outdated
float f = 0; | ||
int i = i0; | ||
|
||
#ifdef __ARM_NEON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can extract this vectorized parts into separate functions, instead having a lot of variants for different architectures in one body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, but I'd have to re-evaluate the performance. There's quite a bit of variables that are modified in the loop and they all would need to be passed by reference.
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [4171320]: BUILD STARTED |
CI MESSAGE: [4173226]: BUILD STARTED |
CI MESSAGE: [4173226]: BUILD FAILED |
CI MESSAGE: [4180682]: BUILD STARTED |
dali/kernels/signal/resampling.h
Outdated
int i = i_ref; | ||
float32x4_t x4 = vaddq_f32(vdupq_n_f32(i - in_pos), _0123); | ||
|
||
for (; i + 3 <= i1; i += 4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you change that as well like in L174?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Fixed.
e91d112
to
907fd20
Compare
CI MESSAGE: [4180791]: BUILD STARTED |
CI MESSAGE: [4180791]: BUILD PASSED |
#ifdef __ARM_NEON | ||
|
||
inline float32x4_t vsetq_f32(float x0, float x1, float x2, float x3) { | ||
float32x4_t x; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very little nitpick, could be float32x4_t x = vdubpq_n_f32(x0);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, asked out of curiosity
#ifdef __SSE2__ | ||
inline __m128 operator()(__m128 x) const { | ||
__m128 fi = _mm_add_ps(x * _mm_set1_ps(scale), _mm_set1_ps(center)); | ||
__m128i i = _mm_cvtps_epi32(fi); | ||
__m128i i = _mm_cvttps_epi32(fi); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was doing conversion without truncate an issue before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was a bug - I detected it now that there's a slight difference between ARM and SSE and dug deeper to find out that it's SSE that's wrong.
* Vectorize audio resampling for ARM NEON. * Fix rounding mode in SSE vectorization. Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Vectorize audio resampling for ARM NEON. * Fix rounding mode in SSE vectorization. Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz mzient@gmail.com
Category:
Other Performance optimization (main purpose)
Bug fix (additional, in the same file)
Description:
Implements vectorized single-channel audio resampling for ARM NEON.
Additional information:
Vectorization is done differently than on SSE (pairwise loads are faster on ARM).
Other changes include improved handling of floor function.
Bug fix: SSE implementation used rounding instead of truncation - fixed.
Affected modules and functionalities:
Audio resampling (audio decoder).
Key points relevant for the review:
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-2651