<random>: Optimize uniform_int_distribution with multiply-shift #178

Closed
StephanTLavavej opened this issue Oct 17, 2019 · 11 comments · Fixed by #3012
Labels
fixed (Something works now, yay!) · performance (Must go faster)

Comments

@StephanTLavavej
Member

I reimplemented uniform_int_distribution long ago:

STL/stl/inc/xutility, lines 4878 to 4942 in 53cdb9f:

template <class _Diff, class _Urng>
class _Rng_from_urng { // wrap a URNG as an RNG
public:
    using _Ty0   = make_unsigned_t<_Diff>;
    using _Ty1   = typename _Urng::result_type;
    using _Udiff = conditional_t<sizeof(_Ty1) < sizeof(_Ty0), _Ty0, _Ty1>;

    explicit _Rng_from_urng(_Urng& _Func) : _Ref(_Func), _Bits(CHAR_BIT * sizeof(_Udiff)), _Bmask(_Udiff(-1)) {
        for (; (_Urng::max)() - (_Urng::min)() < _Bmask; _Bmask >>= 1) {
            --_Bits;
        }
    }

    _Diff operator()(_Diff _Index) { // adapt _Urng closed range to [0, _Index)
        for (;;) { // try a sample random value
            _Udiff _Ret  = 0; // random bits
            _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]

            while (_Mask < _Udiff(_Index - 1)) { // need more random bits
                _Ret <<= _Bits - 1; // avoid full shift
                _Ret <<= 1;
                _Ret |= _Get_bits();
                _Mask <<= _Bits - 1; // avoid full shift
                _Mask <<= 1;
                _Mask |= _Bmask;
            }

            // _Ret is [0, _Mask], _Index - 1 <= _Mask, return if unbiased
            if (_Ret / _Index < _Mask / _Index || _Mask % _Index == _Udiff(_Index - 1)) {
                return static_cast<_Diff>(_Ret % _Index);
            }
        }
    }

    _Udiff _Get_all_bits() {
        _Udiff _Ret = 0;

        for (size_t _Num = 0; _Num < CHAR_BIT * sizeof(_Udiff); _Num += _Bits) { // don't mask away any bits
            _Ret <<= _Bits - 1; // avoid full shift
            _Ret <<= 1;
            _Ret |= _Get_bits();
        }

        return _Ret;
    }

    _Rng_from_urng(const _Rng_from_urng&) = delete;
    _Rng_from_urng& operator=(const _Rng_from_urng&) = delete;

private:
    _Udiff _Get_bits() { // return a random value within [0, _Bmask]
        for (;;) { // repeat until random value is in range
            _Udiff _Val = _Ref() - (_Urng::min)();

            if (_Val <= _Bmask) {
                return _Val;
            }
        }
    }

    _Urng& _Ref; // reference to URNG
    size_t _Bits; // number of random bits generated by _Get_bits()
    _Udiff _Bmask; // 2^_Bits - 1
};

This technique described by Daniel Lemire should be a significant improvement: https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
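For readers unfamiliar with the technique, here is a minimal sketch of the multiply-shift reduction from that post (the 32-bit case with a 64-bit intermediate; the names are illustrative, not the STL's):

    #include <cstdint>

    // Map a uniform 32-bit value x into [0, range): since x < 2^32, the product
    // x * range is below range * 2^32, so its high 32 bits fall in [0, range).
    // A rejection step (as in Lemire's paper) is still needed to make the
    // result exactly unbiased.
    uint32_t multiply_shift(uint32_t x, uint32_t range) {
        return static_cast<uint32_t>((static_cast<uint64_t>(x) * range) >> 32);
    }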

@StephanTLavavej StephanTLavavej added enhancement Something can be improved vNext Breaks binary compatibility performance Must go faster labels Oct 17, 2019
@chris0x44
Contributor

I would like to work on this. In the hope of not wasting everyone's time with needless iterations, I have a few questions.

  • Is there an additional style guide apart from clang-format and the _Ugly rule?
  • Since the calculation described by Daniel Lemire requires a promotion to the next larger integer type, should this fall back to normal modulo operation for types of unsigned long long?
  • Is less code preferred over fast(er) code? In this case, the power-of-2 check _Mask % _Index == _Udiff(_Index - 1) could be extracted from the loop into a special case, which would also reduce the work done inside the for (;;) loop, since it wouldn't have to repeatedly check whether the index is a power of 2. Similar to the following:
    static constexpr bool _Is_power_of2(_Udiff _Value) {
        return _Value && !(_Value & (_Value - 1));
    }

    _Diff operator()(_Diff _Index) {
        if (_Is_power_of2(_Index)) {
            ...
            return static_cast<_Diff>(_Reduce_by_power2(_Ret, _Index));
        }
        for (;;) { ...
            if (_Ret / _Index < _Mask / _Index) {
                return static_cast<_Diff>(_Ret % _Index);
            }
        }
    }

On a side note, am I correct in assuming that _Urng::max and _Urng::min are not guaranteed to be constexpr? Otherwise this might offer a way to work around the special handling to avoid full shifts.

@CaseyCarter
Member

CaseyCarter commented Oct 29, 2019

  • Is there an additional style guide apart from clang-format and the _Ugly rule?

There will eventually be, but there is none yet. For now all we can say is "try to imitate the style that's already there, and be prepared for us to communicate much of the style to you in code reviews".

  • Since the calculation described by Daniel Lemire requires a promotion to the next larger integer type, should this fall back to normal modulo operation for types of unsigned long long?

It depends on whether hardware 64-bit division is faster than a bespoke 64 x 64 = 128 bit multiply (See _Big_multiply in <ratio>).
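For context, a portable sketch of the kind of 64 x 64 = 128-bit high-word multiply being discussed (on x64 this would be a single _umul128; the function below is just an illustration, not the STL's _Big_multiply):

    #include <cstdint>

    // High 64 bits of a 64x64-bit product, assembled from four 32x32->64 multiplies.
    uint64_t mulhi64(uint64_t a, uint64_t b) {
        const uint64_t a_lo = a & 0xFFFFFFFFu;
        const uint64_t a_hi = a >> 32;
        const uint64_t b_lo = b & 0xFFFFFFFFu;
        const uint64_t b_hi = b >> 32;

        const uint64_t lo_lo = a_lo * b_lo;
        const uint64_t hi_lo = a_hi * b_lo;
        const uint64_t lo_hi = a_lo * b_hi;
        const uint64_t hi_hi = a_hi * b_hi;

        // Fold the cross terms together with the carry out of the low 64 bits.
        const uint64_t mid = hi_lo + (lo_lo >> 32) + (lo_hi & 0xFFFFFFFFu);
        return hi_hi + (mid >> 32) + (lo_hi >> 32);
    }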

  • Is less code preferred over fast(er) code?

In general, more and faster is preferable to less and slower, but there are no absolutes. (Your example certainly looks reasonable.)

On a side note, am I correct in assuming that _Urng::max and _Urng::min are not guaranteed to be constexpr? Otherwise this might offer a way to work around the special handling to avoid full shifts.

G::max() and G::min() must be constant expressions for G to model uniform_random_bit_generator (and consequently to satisfy the uniform random bit generator requirements) per [rand.req.urng].
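A small illustration of that requirement (C++20 spelling; the engine here is just an example):

    #include <random>

    // mt19937 satisfies the concept, and its min()/max() are usable in constant
    // expressions, e.g. to set up the bit mask at compile time.
    static_assert(std::uniform_random_bit_generator<std::mt19937>);
    static_assert(std::mt19937::max() > std::mt19937::min());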

@chris0x44
Contributor

Thanks for the clarification. I'll check which solution for the 64-bit case is faster and create a PR. With regard to testing, I assume that additional test cases for the changed code, and where to put them, can be discussed in the review and extended once the test suites are also available here?

@StephanTLavavej
Member Author

I'll check which solution for the 64-bit case is faster and create a PR.

Thanks for looking into this! Based on my experience with charconv/Ryu, I'm pretty sure that umul128 will be great on x64, and decent on x86 compared to runtime division (which was horrible until very recent architectures, where it is less horrible).

With regard to testing, I assume that additional test cases for the changed code, and where to put them, can be discussed in the review and extended once the test suites are also available here?

Yeah. We're hopefully about a month away from getting our test suites online.

Note that this enhancement is tagged vNext because I believe that it will break ABI - specifically, changing the observable behavior of uniform_int_distribution would lead to mix-and-match issues if we attempted to ship such a change in the current release series. Our vNext branch won't be ready for probably several months, but it will be available eventually.

(Unless the new technique happens to generate exactly the same values as the current technique, while storing the exact same data members, which seems unlikely.)

@chris0x44
Contributor

I did some work on the 64-bit path and think I found a pretty neat solution to work around the 128-bit operations. The "divisor" in the multiply-shift operation is a power of two, which allows us to integrate it into the shift value. Determining log2 of a known power of 2 and then shifting is pretty fast compared to 128-bit operations. A rough test showed log2 to be ~60% faster.

By the way, are there specific guidelines for comments? I'm a bit unsure how detailed they should be.

Below is the adaptation of the multiply-shift operation. Any suggestions for improvement or better style are welcome. :)

    static constexpr size_t _Log2OfPwr2(uint64_t __power_two_val) noexcept {
        size_t __log_val = (((__power_two_val & 0x00000000FFFFFFFFULL) != 0) - 1) & 32;
        __log_val += (((__power_two_val & 0x0000FFFF0000FFFFULL) != 0) - 1) & 16;
        __log_val += (((__power_two_val & 0x00FF00FF00FF00FFULL) != 0) - 1) & 8;
        __log_val += (((__power_two_val & 0x0F0F0F0F0F0F0F0FULL) != 0) - 1) & 4;
        __log_val += (((__power_two_val & 0x3333333333333333ULL) != 0) - 1) & 2;
        __log_val += (((__power_two_val & 0x5555555555555555ULL) != 0) - 1) & 1;
        return __log_val;
    }

    static constexpr _Udiff _Reduce_by_power2(_Udiff _Value, _Udiff _Mod) noexcept {
        constexpr auto _Udiff_bits = CHAR_BIT * sizeof(_Udiff);
        if constexpr (sizeof(_Udiff) < sizeof(unsigned int)) {
            // avoid unnecessary promotion for smaller types
            return (_Value * _Mod) >> _Udiff_bits;
        } else if constexpr (sizeof(_Udiff) == sizeof(unsigned int)) {
            // ensure promotion to ull to keep all bits
            return static_cast<_Udiff>((static_cast<unsigned long long>(_Value) * _Mod) >> _Udiff_bits);
        }
        
        // modification of multiply-shift to avoid costly expansion to 128-bit
        return _Value >> (_Udiff_bits - _Log2OfPwr2(_Mod));
    }

    static constexpr bool _Is_power_of2(_Udiff _Value) noexcept {
        return _Value && !(_Value & (_Value - 1));
    }

    _Diff operator()(_Diff _Index) { // adapt _Urng closed range to [0, _Index)
        if (_Is_power_of2(_Index)) {
            _Udiff _Ret  = 0; // random bits
            _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]

            while (_Mask < _Udiff(_Index - 1)) { // need more random bits
                _Ret <<= _Bits - 1; // avoid full shift
                _Ret <<= 1;
                _Ret |= _Get_bits();
                _Mask <<= _Bits - 1; // avoid full shift
                _Mask <<= 1;
                _Mask |= _Bmask;
            }

            // _Ret is [0, _Mask], _Index - 1 <= _Mask
            return static_cast<_Diff>(_Reduce_by_power2(_Ret, _Index));
        }

        for (;;) { // try a sample random value
            _Udiff _Ret  = 0; // random bits
            _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]

            while (_Mask < _Udiff(_Index - 1)) { // need more random bits
                _Ret <<= _Bits - 1; // avoid full shift
                _Ret <<= 1;
                _Ret |= _Get_bits();
                _Mask <<= _Bits - 1; // avoid full shift
                _Mask <<= 1;
                _Mask |= _Bmask;
            }

            // _Ret is [0, _Mask], _Index - 1 <= _Mask, return if unbiased
            if (_Ret / _Index < _Mask / _Index) {
                return static_cast<_Diff>(_Ret % _Index);
            }
        }
    }

With regard to breaking ABI, there are currently no changes to data members or additional non-static methods in my change.

@CaseyCarter
Member

Note that this enhancement is tagged vNext because I believe that it will break ABI - specifically, changing the observable behavior of uniform_int_distribution would lead to mix-and-match issues if we attempted to ship such a change in the current release series.

This doesn't seem more "ABI breaking" than any other behavior change: If we avoid changing the representation of uniform_int_distribution, then programs will continue to link and run whether they get the old version or the new version. They may not run correctly if they specifically depend on the old behavior or the new behavior, but that's not a problem I'd call an "ABI break".

@StephanTLavavej StephanTLavavej removed the vNext Breaks binary compatibility label Nov 1, 2019
@StephanTLavavej
Member Author

Ok, I'm convinced. (Note that adding non-static member functions is totally fine for ABI; only virtual member functions are a big deal.)

@CaseyCarter
Member

This doesn't seem more "ABI breaking" than any other behavior change

I should clarify this for the lurkers: when we change the behavior of an internal function - something with an ugly name that users don't know about or call directly - we do consider it to be an ABI break. Users can't be expected to know that vector::push_back calls vector::_Frobnicate, or that they may need to recompile an OBJ somewhere with codegen for a vector::_Frobnicate specialization because we want to change what _Frobnicate does. In a case like that, we'll "bump" the name to _Frobnicate2 to avoid new code / old code linking and all is well.

With "pretty" functions that users do call the burden is on them to recompile anything that touches e.g. uniform_int_distribution::operator() if they want to ensure the new behavior.

@chris0x44
Contributor

Leaving this here to avoid confusion about my previous suggestion. It seems I took a wrong turn with my approach. The operation for an index that is a power of 2 should just be _Value & (_Index - 1), handled as a special case; the Lemire optimization applies to the non-power-of-2 cases.
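For the record, a rough sketch of what that combined shape could look like for the 32-bit case (illustrative names only; not the code that eventually landed):

    #include <cstdint>

    // Map uniform 32-bit draws from gen() into [0, range), range >= 1, without bias.
    // Power-of-two ranges reduce to a mask; other ranges use Lemire's multiply-shift
    // with the usual rejection step to discard the biased region.
    template <class Gen32> // Gen32::operator() returns uniform uint32_t values
    uint32_t bounded(Gen32& gen, uint32_t range) {
        if ((range & (range - 1)) == 0) { // power of two: mask the low bits
            return gen() & (range - 1);
        }

        uint64_t product = static_cast<uint64_t>(gen()) * range;
        uint32_t low     = static_cast<uint32_t>(product);

        if (low < range) { // only then can this draw fall in the biased region
            const uint32_t threshold = (0u - range) % range; // 2^32 mod range
            while (low < threshold) {
                product = static_cast<uint64_t>(gen()) * range;
                low     = static_cast<uint32_t>(product);
            }
        }

        return static_cast<uint32_t>(product >> 32);
    }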

@horenmar
Contributor

@CaseyCarter does this mean

This doesn't seem more "ABI breaking" than any other behavior change: If we avoid changing the representation of uniform_int_distribution, then programs will continue to link and run whether they get the old version or the new version. They may not run correctly if they specifically depend on the old behavior or the new behavior, but that's not a problem I'd call an "ABI break".

that a proposal to standardize the outputs of distributions in <random> would not be considered a breaking change by MS's STL team?

@CaseyCarter
Member

does this mean that a proposal to standardize the outputs of distributions in <random> would not be considered a breaking change by MS's STL team?

There are many different kinds of breaking change. Assuming that we could standardize the outputs of distributions without storing additional state, that proposal would be a behavioral change - and therefore source-breaking - but not ABI breaking.

@chris0x44 chris0x44 mentioned this issue Dec 3, 2019
@StephanTLavavej StephanTLavavej removed the enhancement Something can be improved label Feb 6, 2020
@CaseyCarter CaseyCarter changed the title <random>: Optimize uniform_int_distribution with multiply-shift <random>: Optimize uniform_int_distribution with multiply-shift Aug 9, 2022
@StephanTLavavej StephanTLavavej added the fixed Something works now, yay! label Sep 22, 2022