comba mul/sqr: remove W array #441

minad · 2019-10-30T17:41:14Z

This is just an experiment to gather some feedback and to do the preliminary work needed if we want full-width digits. It is based on #434.

* The output dp must not alias the input dps.
* We avoid the unnecessary copying at the end.
* No huge W array is allocated on the stack.
* If the digit representation is reworked so that the full bit width is used, comba can be used for more digits. In that case, however, allocating the W array on the stack is not feasible.
* One additional advantage is that the code gets simpler.
* The disadvantage is that the user must ensure that no aliasing happens if the fast multiplier is to be used. Do we want that? I would argue that aliasing (which impedes optimization in this case) should be discouraged. However, this is a deviation from how things have been until now. We should document the change so that code which suddenly performs worse can be updated.
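To make the aliasing constraint concrete, here is a minimal sketch of a comba-style multiply that stores each finished column straight into the output instead of a temporary W array. It assumes toy 28-bit digits in 32-bit words with a 64-bit accumulator; `digit_t`, `word_t` and `comba_mul_noalias` are illustrative names, not the library's actual types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types, not libtommath's: 28-bit digits in 32-bit words. */
typedef uint32_t digit_t;
typedef uint64_t word_t;
#define DIGIT_BIT  28
#define DIGIT_MASK ((digit_t)((1u << DIGIT_BIT) - 1u))

/* Column-wise (comba) product c = a * b on little-endian digit arrays.
 * c needs room for na + nb digits and must not overlap a or b: column ix
 * is stored into c[ix] while later columns still read a[0..] and b[0..]. */
static void comba_mul_noalias(digit_t *c, const digit_t *a, size_t na,
                              const digit_t *b, size_t nb)
{
   word_t acc = 0;
   size_t ix;

   assert(c != a && c != b);
   assert(na > 0 && nb > 0);
   /* With 56-bit products, a 64-bit accumulator holds at most 255 terms. */
   assert(na < 256 || nb < 256);

   for (ix = 0; ix < na + nb - 1; ix++) {
      /* Sum a[i] * b[ix - i] over every i that keeps both indices valid. */
      size_t lo = (ix < nb) ? 0 : ix - (nb - 1);
      size_t hi = (ix < na) ? ix : na - 1;
      size_t i;
      for (i = lo; i <= hi; i++) {
         acc += (word_t)a[i] * (word_t)b[ix - i];
      }
      c[ix] = (digit_t)(acc & DIGIT_MASK); /* written directly, no W array */
      acc >>= DIGIT_BIT;                   /* carry into the next column */
   }
   c[na + nb - 1] = (digit_t)(acc & DIGIT_MASK);
}
```

If `c` were the same buffer as `a`, the store to `c[0]` would clobber `a[0]` before the next column reads it, which is why the caller has to guarantee distinct buffers here.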

ALTERNATIVE 1:
The comba functions could allocate a temporary if
the inputs alias the output.

ALTERNATIVE 2:
We could move the allocation of a temporary outside
of both s_mp_(sqr|mul) and s_mp_(sqr|mul)_comba
into mp_mul/mp_sqr. Then the s_mp functions wouldn't
accept aliases anymore. But we would also speed up the
plain old s_mp_mul/s_mp_sqr functions in the non-aliasing
case (no allocations!).
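Roughly, ALTERNATIVE 2 could look like the sketch below: the front end checks for aliasing once and allocates a temporary only in that case, while the low-level kernel simply assumes distinct buffers. The names `mul_frontend` and `kernel_mul_noalias`, the raw digit arrays, and the schoolbook kernel are all stand-ins for illustration; the real functions operate on `mp_int` and would differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy digit type: 28-bit digits stored in 32-bit words. */
typedef uint32_t digit_t;
#define DIGIT_BIT  28
#define DIGIT_MASK ((digit_t)((1u << DIGIT_BIT) - 1u))

/* Stand-in for a low-level s_mp_* kernel (plain schoolbook here).
 * It writes na + nb digits into c and requires c != a and c != b. */
static void kernel_mul_noalias(digit_t *c, const digit_t *a, size_t na,
                               const digit_t *b, size_t nb)
{
   size_t i, j;
   assert(c != a && c != b);
   memset(c, 0, (na + nb) * sizeof(*c));
   for (i = 0; i < na; i++) {
      uint64_t carry = 0;
      for (j = 0; j < nb; j++) {
         uint64_t t = (uint64_t)a[i] * b[j] + c[i + j] + carry;
         c[i + j] = (digit_t)(t & DIGIT_MASK);
         carry = t >> DIGIT_BIT;
      }
      c[i + nb] = (digit_t)carry;
   }
}

/* Stand-in for the public entry point: the only place that handles
 * aliasing. Returns 0 on success, -1 if the temporary cannot be allocated.
 * c must have capacity for na + nb digits even when it aliases an input. */
static int mul_frontend(digit_t *c, const digit_t *a, size_t na,
                        const digit_t *b, size_t nb)
{
   if (c == a || c == b) {
      digit_t *tmp = malloc((na + nb) * sizeof(*tmp));
      if (tmp == NULL) {
         return -1;
      }
      kernel_mul_noalias(tmp, a, na, b, nb);
      memcpy(c, tmp, (na + nb) * sizeof(*tmp));
      free(tmp);
      return 0;
   }
   /* Non-aliasing case: no allocation and no final copy. */
   kernel_mul_noalias(c, a, na, b, nb);
   return 0;
}
```

With this split, every kernel behind the front end benefits from the allocation-free path, and the cost of aliasing is paid in exactly one place.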

Now that I think about it, I think ALTERNATIVE 2 sounds like a good
solution!

We also have to think about mul_high_comba/reduce/montgomery_reduce
etc, since the reduce functions mutate the input and therefore alias.

Ping @czurnieden


czurnieden

Put it all on the heap instead?
Yes, not a bad idea.

Alternative 2? I don't see much of a difference but alt. 2 seems to be cleaner.

> We also have to think about mul_high_comba/reduce/montgomery_reduce
> etc, since the reduce functions mutate the input and therefore alias.

Are you thinking about changing that behaviour?

sjaeckel · 2019-11-01T17:15:54Z

FYI RSA in libtomcrypt is ~10% slower with this version of mul&sqr!

czurnieden · 2019-11-01T18:15:45Z

> FYI RSA in libtomcrypt is ~10% slower with this version of mul&sqr!

Some slowdown was expected, but 10% is a bit much.
Mmh…

minad · 2019-11-02T10:42:09Z

@sjaeckel OK, but this is not the final version. In the end we would change the usage pattern of the library a bit, discouraging aliasing, and your ltc RSA code would then have to be changed accordingly.

What you are observing is that comba is simply not being used, and that is what causes the slowdown. It is all expected.

But this is just an experiment, so there is nothing to worry about. If we make such intrusive changes, we definitely have to benchmark.
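As a rough illustration of why callers that alias lose the fast path, here is a sketch of the kind of dispatch rule involved. It assumes that the experimental branch only selects comba when the output is distinct from both inputs; `choose_mul_path`, `mul_path` and the `comba_limit` parameter are hypothetical names, not code from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the two multiplication paths. */
typedef enum { PATH_COMBA, PATH_BASELINE } mul_path;

/* Hypothetical dispatch rule: comba is only eligible when the output does
 * not alias an input and the operand is below the comba digit limit. */
static mul_path choose_mul_path(const void *c, const void *a, const void *b,
                                size_t digits, size_t comba_limit)
{
   bool aliased;
   assert(c != NULL && a != NULL && b != NULL);
   aliased = (c == a) || (c == b);
   if (!aliased && digits < comba_limit) {
      return PATH_COMBA;
   }
   return PATH_BASELINE; /* in-place calls always end up here */
}
```

Under this rule, a caller that squares or multiplies a value in place never reaches comba, regardless of operand size, which would be consistent with the slowdown reported above.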

minad · 2019-11-02T10:42:49Z

I'm closing this for now so as not to confuse you with things I am trying out :)

minad requested a review from czurnieden October 30, 2019 17:41

minad added experiment feedback required labels Oct 30, 2019

minad force-pushed the simplifications branch from 718dbc8 to ca8cb27 Compare October 30, 2019 19:01

minad added 3 commits October 30, 2019 20:03

rename mul/sqr functions for consistency, comba instead of fast suffix

3daf7e9

regen files

32f93b1

czurnieden approved these changes Oct 30, 2019


try alternative 2 for mp_sqr

44caa78

minad force-pushed the rework-comba branch from 8756f6f to 44caa78 Compare October 30, 2019 19:30

minad closed this Nov 2, 2019

minad mentioned this pull request Nov 6, 2019

remove W array from s_mp_mul_comba and s_mp_sqr_comba #447

Closed

sjaeckel mentioned this pull request Feb 21, 2022

Reduce stack usage for embedded targets #511

Closed

sjaeckel mentioned this pull request Oct 27, 2022

add MP_SMALL_STACK_SIZE option #538

Merged
