Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

comba mul/sqr: remove W array #441

Closed
wants to merge 4 commits into from
Closed

comba mul/sqr: remove W array #441

wants to merge 4 commits into from

Conversation

minad
Copy link
Member

@minad minad commented Oct 30, 2019

This is just an experiment to gather some feedback and preliminary work needed if we want full width digits. It is based on #434.

  • Output dp must not alias the input dps.

  • We avoid the unnecessary copying in the end.

  • No huge W array is allocated on the stack.

  • If the digit representation is reworked such that the full bitwidth is
    used, comba can be used for more digits.
    In that case however allocating the W array on the stack is not
    feasible.

  • One additional advantage is that the code gets simpler

  • The disadvantage is that the the user must ensure that no aliasing
    happens if the fast multiplier is to be used. Do we want that?
    I would argue that using aliasing (which impedes optimization in
    this case) should be discouraged.
    However this is a deviation from how things have been until now.
    We should document the change, such that code which suddenly
    performs worse can be updated.

ALTERNATIVE 1:
The comba functions could allocate a temporary if
the inputs alias the output.

ALTERNATIVE 2:
We could move the allocation of a temporary outside
of both s_mp_(sqr|mul) and s_mp_(sqr|mul)_comba
into mp_mul/mp_sqr. Then the s_mp functions wouldn't
accept aliases anymore. But we would also speed up the
plain old s_mp_mul/s_mp_sqr functions in the non-aliasing
case (no allocations!).

Now that I think about it, I think ALTERNATIVE 2 sounds like a good
solution!

We also have to think about mul_high_comba/reduce/montgomery_reduce
etc, since the reduce functions mutate the input and therefore alias.

Ping @czurnieden

* Output dp must not alias the input dps.

* We avoid the unnecessary copying in the end.

* No huge W array is allocated on the stack.

* If the digit representation is reworked such that the full bitwidth is
  used, comba can be used for more digits.
  In that case however allocating the W array on the stack is not
  feasible.

* One additional advantage is that the code gets simpler

* The disadvantage is that the the user must ensure that no aliasing
  happens if the fast multiplier is to be used. Do we want that?
  I would argue that using aliasing (which impedes optimization in
  this case) should be discouraged.
  However this is a deviation from how things have been until now.
  We should document the change, such that code which suddenly
  performs worse can be updated.

ALTERNATIVE 1:
The comba functions could allocate a temporary if
the inputs alias the output.

ALTERNATIVE 2:
We could move the allocation of a temporary outside
of both s_mp_(sqr|mul) and s_mp_(sqr|mul)_comba
into mp_mul/mp_sqr. Then the s_mp functions wouldn't
accept aliases anymore. But we would also speed up the
plain old s_mp_mul/s_mp_sqr functions in the non-aliasing
case (no allocations!).

Now that I think about it, I think ALTERNATIVE 2 sounds like a good
solution!

We also have to think about mul_high_comba/reduce/montgomery_reduce
etc, since the reduce functions mutate the input and therefore alias.
Copy link
Contributor

@czurnieden czurnieden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it all on the heap instead?
Yes, not a bad idea.

Alternative 2? I don't see much of a difference but alt. 2 seems to be cleaner.

We also have to think about mul_high_comba/reduce/montgomery_reduce
etc, since the reduce functions mutate the input and therefore alias.

Do you think about changing that behaviour?

@sjaeckel
Copy link
Member

sjaeckel commented Nov 1, 2019

FYI RSA in libtomcrypt is ~10% slower with this version of mul&sqr!

@czurnieden
Copy link
Contributor

FYI RSA in libtomcrypt is ~10% slower with this version of mul&sqr!

A bit was expected but 10% are a bit much.
Mmh…

@minad
Copy link
Member Author

minad commented Nov 2, 2019

@sjaeckel ok but this is not the final version. In the end we would change the usage pattern of the library a bit, discouraging aliasing. And then your ltc rsa version would have to be changed.

What you are observing is that comba is just not used, that's giving the slowdown. It is all expected.

But this is just an experiment. So nothing to worry. If we make such intrusive changes, we definitely have to benchmark.

@minad
Copy link
Member Author

minad commented Nov 2, 2019

I close this for now, to not confuse you about things I am trying :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants