<random>: Optimize uniform_int_distribution with multiply-shift #178

Closed
StephanTLavavej opened this issue Oct 17, 2019 · 11 comments · Fixed by #3012
Labels
fixed (Something works now, yay!) · performance (Must go faster)

Comments

@StephanTLavavej
Member

I reimplemented uniform_int_distribution long ago:

STL/stl/inc/xutility, lines 4878 to 4942 in 53cdb9f:

template <class _Diff, class _Urng>
class _Rng_from_urng { // wrap a URNG as an RNG
public:
    using _Ty0   = make_unsigned_t<_Diff>;
    using _Ty1   = typename _Urng::result_type;
    using _Udiff = conditional_t<sizeof(_Ty1) < sizeof(_Ty0), _Ty0, _Ty1>;

    explicit _Rng_from_urng(_Urng& _Func) : _Ref(_Func), _Bits(CHAR_BIT * sizeof(_Udiff)), _Bmask(_Udiff(-1)) {
        for (; (_Urng::max)() - (_Urng::min)() < _Bmask; _Bmask >>= 1) {
            --_Bits;
        }
    }

    _Diff operator()(_Diff _Index) { // adapt _Urng closed range to [0, _Index)
        for (;;) { // try a sample random value
            _Udiff _Ret  = 0; // random bits
            _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]

            while (_Mask < _Udiff(_Index - 1)) { // need more random bits
                _Ret <<= _Bits - 1; // avoid full shift
                _Ret <<= 1;
                _Ret |= _Get_bits();
                _Mask <<= _Bits - 1; // avoid full shift
                _Mask <<= 1;
                _Mask |= _Bmask;
            }

            // _Ret is [0, _Mask], _Index - 1 <= _Mask, return if unbiased
            if (_Ret / _Index < _Mask / _Index || _Mask % _Index == _Udiff(_Index - 1)) {
                return static_cast<_Diff>(_Ret % _Index);
            }
        }
    }

    _Udiff _Get_all_bits() {
        _Udiff _Ret = 0;

        for (size_t _Num = 0; _Num < CHAR_BIT * sizeof(_Udiff); _Num += _Bits) { // don't mask away any bits
            _Ret <<= _Bits - 1; // avoid full shift
            _Ret <<= 1;
            _Ret |= _Get_bits();
        }

        return _Ret;
    }

    _Rng_from_urng(const _Rng_from_urng&) = delete;
    _Rng_from_urng& operator=(const _Rng_from_urng&) = delete;

private:
    _Udiff _Get_bits() { // return a random value within [0, _Bmask]
        for (;;) { // repeat until random value is in range
            _Udiff _Val = _Ref() - (_Urng::min)();

            if (_Val <= _Bmask) {
                return _Val;
            }
        }
    }

    _Urng& _Ref; // reference to URNG
    size_t _Bits; // number of random bits generated by _Get_bits()
    _Udiff _Bmask; // 2^_Bits - 1
};

This technique described by Daniel Lemire should be a significant improvement: https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
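For readers unfamiliar with the technique, here is a minimal sketch of the multiply-shift reduction from that post (the 32-bit case with a 64-bit intermediate; the names are illustrative, not the STL's):

    #include <cstdint>

    // Map a uniform 32-bit value x into [0, range): since x < 2^32, the product
    // x * range is below range * 2^32, so its high 32 bits fall in [0, range).
    // A rejection step (as in Lemire's paper) is still needed to make the
    // result exactly unbiased.
    uint32_t multiply_shift(uint32_t x, uint32_t range) {
        return static_cast<uint32_t>((static_cast<uint64_t>(x) * range) >> 32);
    }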

@StephanTLavavej StephanTLavavej added enhancement Something can be improved vNext Breaks binary compatibility performance Must go faster labels Oct 17, 2019
@chris0x44
Contributor

I would like to work on this. In the hope of not wasting everyone's time with needless iterations, I have a few questions.

  • Is there an additional style guide apart from clang-format and the _Ugly rule?
  • Since the calculation described by Daniel Lemire requires a promotion to the next larger integer type, should this fall back to normal modulo operation for types of unsigned long long?
  • Is less code preferred over fast(er) code? In this case, the power-of-2 check _Mask % _Index == _Udiff(_Index - 1) could be extracted from the loop into a special case, which would also reduce the work done inside the for (;;) loop, since it wouldn't have to repeatedly check whether the index is a power of 2. Similar to the following:
    static constexpr bool _Is_power_of2(_Udiff _Value) {
        return _Value && !(_Value & (_Value - 1));
    }

    _Diff operator()(_Diff _Index) {
        if (_Is_power_of2(_Index)) {
            ...
            return static_cast<_Diff>(_Reduce_by_power2(_Ret, _Index));
        }
        for (;;) { ...
            if (_Ret / _Index < _Mask / _Index) {
                return static_cast<_Diff>(_Ret % _Index);
            }
        }
    }

On a side note, am I correct in assuming that _Urng::max and _Urng::min are not guaranteed to be constexpr? Otherwise this might offer a way to work around the special handling to avoid full shifts.

@CaseyCarter
Member

CaseyCarter commented Oct 29, 2019

  • Is there an additional style guide apart from clang-format and the _Ugly rule?

There will eventually be, but there is none yet. For now all we can say is "try to imitate the style that's already there, and be prepared for us to communicate much of the style to you in code reviews".

  • Since the calculation described by Daniel Lemire requires a promotion to the next larger integer type, should this fall back to normal modulo operation for types of unsigned long long?

It depends on whether hardware 64-bit division is faster than a bespoke 64 x 64 = 128 bit multiply (See _Big_multiply in <ratio>).
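For context, a portable sketch of the kind of 64 x 64 = 128-bit high-word multiply being discussed (on x64 this would be a single _umul128; the function below is just an illustration, not the STL's _Big_multiply):

    #include <cstdint>

    // High 64 bits of a 64x64-bit product, assembled from four 32x32->64 multiplies.
    uint64_t mulhi64(uint64_t a, uint64_t b) {
        const uint64_t a_lo = a & 0xFFFFFFFFu;
        const uint64_t a_hi = a >> 32;
        const uint64_t b_lo = b & 0xFFFFFFFFu;
        const uint64_t b_hi = b >> 32;

        const uint64_t lo_lo = a_lo * b_lo;
        const uint64_t hi_lo = a_hi * b_lo;
        const uint64_t lo_hi = a_lo * b_hi;
        const uint64_t hi_hi = a_hi * b_hi;

        // Fold the cross terms together with the carry out of the low 64 bits.
        const uint64_t mid = hi_lo + (lo_lo >> 32) + (lo_hi & 0xFFFFFFFFu);
        return hi_hi + (mid >> 32) + (lo_hi >> 32);
    }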

  • Is less code preferred over fast(er) code?

In general, more and faster is preferable to less and slower, but there are no absolutes. (Your example certainly looks reasonable.)

On a side note, am I correct in assuming that _Urng::max and _Urng::min are not guaranteed to be constexpr? Otherwise this might offer a way to work around the special handling to avoid full shifts.

G::max() and G::min() must be constant expressions for G to model uniform_random_bit_generator (and consequently to satisfy the uniform random bit generator requirements) per [rand.req.urng].
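A small illustration of that requirement (C++20 spelling; the engine here is just an example):

    #include <random>

    // mt19937 satisfies the concept, and its min()/max() are usable in constant
    // expressions, e.g. to set up the bit mask at compile time.
    static_assert(std::uniform_random_bit_generator<std::mt19937>);
    static_assert(std::mt19937::max() > std::mt19937::min());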

@chris0x44
Contributor

Thanks for the clarification. I'll check which solution for the 64-bit case is faster and create a PR. With regard to testing, I assume that additional test cases for the changed code, and where to put them, can be discussed in the review and extended once the test suites are also available here?

@StephanTLavavej
Member Author

I'll check which solution for the 64-bit case is faster and create a PR.

Thanks for looking into this! Based on my experience with charconv/Ryu, I'm pretty sure that umul128 will be great on x64, and decent on x86 compared to runtime division (which was horrible until very recent architectures, where it is less horrible).

With regard to testing, I assume that additional test cases for the changed code, and where to put them, can be discussed in the review and extended once the test suites are also available here?

Yeah. We're hopefully about a month away from getting our test suites online.

Note that this enhancement is tagged vNext because I believe that it will break ABI - specifically, changing the observable behavior of uniform_int_distribution would lead to mix-and-match issues if we attempted to ship such a change in the current release series. Our vNext branch won't be ready for probably several months, but it will be available eventually.

(Unless the new technique happens to generate exactly the same values as the current technique, while storing the exact same data members, which seems unlikely.)

@chris0x44
Contributor

I did some work on the 64-bit path and think I found a pretty neat solution to work around the 128-bit operations. The "divisor" in the multiply-shift operation is a power of two, which allows us to integrate it into the shift value. Determining log2 of a known power of 2 and then shifting is pretty fast compared to 128-bit operations. A rough test showed log2 to be ~60% faster.

By the way, are there specific guidelines for comments? I'm a bit unsure how detailed they should be.

Below is the adaptation of the multiply-shift operation. Any suggestions for improvement or better style are welcome. :)

    static constexpr size_t _Log2OfPwr2(uint64_t __power_two_val) noexcept {
        size_t __log_val = (((__power_two_val & 0x00000000FFFFFFFFULL) != 0) - 1) & 32;
        __log_val += (((__power_two_val & 0x0000FFFF0000FFFFULL) != 0) - 1) & 16;
        __log_val += (((__power_two_val & 0x00FF00FF00FF00FFULL) != 0) - 1) & 8;
        __log_val += (((__power_two_val & 0x0F0F0F0F0F0F0F0FULL) != 0) - 1) & 4;
        __log_val += (((__power_two_val & 0x3333333333333333ULL) != 0) - 1) & 2;
        __log_val += (((__power_two_val & 0x5555555555555555ULL) != 0) - 1) & 1;
        return __log_val;
    }

    static constexpr _Udiff _Reduce_by_power2(_Udiff _Value, _Udiff _Mod) noexcept {
        constexpr auto _Udiff_bits = CHAR_BIT * sizeof(_Udiff);
        if constexpr (sizeof(_Udiff) < sizeof(unsigned int)) {
            // avoid unnecessary promotion for smaller types
            return (_Value * _Mod) >> _Udiff_bits;
        } else if constexpr (sizeof(_Udiff) == sizeof(unsigned int)) {
            // ensure promotion to ull to keep all bits
            return static_cast<_Udiff>((static_cast<unsigned long long>(_Value) * _Mod) >> _Udiff_bits);
        }
        
        // modification of multiply-shift to avoid costly expansion to 128-bit
        return _Value >> (_Udiff_bits - _Log2OfPwr2(_Mod));
    }

    static constexpr bool _Is_power_of2(_Udiff _Value) noexcept {
        return _Value && !(_Value & (_Value - 1));
    }

    _Diff operator()(_Diff _Index) { // adapt _Urng closed range to [0, _Index)
        if (_Is_power_of2(_Index)) {
            _Udiff _Ret  = 0; // random bits
            _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]

            while (_Mask < _Udiff(_Index - 1)) { // need more random bits
                _Ret <<= _Bits - 1; // avoid full shift
                _Ret <<= 1;
                _Ret |= _Get_bits();
                _Mask <<= _Bits - 1; // avoid full shift
                _Mask <<= 1;
                _Mask |= _Bmask;
            }

            // _Ret is [0, _Mask], _Index - 1 <= _Mask
            return static_cast<_Diff>(_Reduce_by_power2(_Ret, _Index));
        }

        for (;;) { // try a sample random value
            _Udiff _Ret  = 0; // random bits
            _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]

            while (_Mask < _Udiff(_Index - 1)) { // need more random bits
                _Ret <<= _Bits - 1; // avoid full shift
                _Ret <<= 1;
                _Ret |= _Get_bits();
                _Mask <<= _Bits - 1; // avoid full shift
                _Mask <<= 1;
                _Mask |= _Bmask;
            }

            // _Ret is [0, _Mask], _Index - 1 <= _Mask, return if unbiased
            if (_Ret / _Index < _Mask / _Index) {
                return static_cast<_Diff>(_Ret % _Index);
            }
        }
    }

With regard to breaking ABI, there are currently no changes to data members or additional non-static methods in my change.

@CaseyCarter
Member

Note that this enhancement is tagged vNext because I believe that it will break ABI - specifically, changing the observable behavior of uniform_int_distribution would lead to mix-and-match issues if we attempted to ship such a change in the current release series.

This doesn't seem more "ABI breaking" than any other behavior change: If we avoid changing the representation of uniform_int_distribution, then programs will continue to link and run whether they get the old version or the new version. They may not run correctly if they specifically depend on the old behavior or the new behavior, but that's not a problem I'd call an "ABI break".

@StephanTLavavej StephanTLavavej removed the vNext Breaks binary compatibility label Nov 1, 2019
@StephanTLavavej
Member Author

Ok, I'm convinced. (Note that adding non-static member functions is totally fine for ABI; only virtual member functions are a big deal.)

@CaseyCarter
Member

This doesn't seem more "ABI breaking" than any other behavior change

I should clarify this for the lurkers: when we change the behavior of an internal function - something with an ugly name that users don't know about or call directly - we do consider it to be an ABI break. Users can't be expected to know that vector::push_back calls vector::_Frobnicate, or that they may need to recompile an OBJ somewhere with codegen for a vector::_Frobnicate specialization because we want to change what _Frobnicate does. In a case like that, we'll "bump" the name to _Frobnicate2 to avoid new code / old code linking and all is well.

With "pretty" functions that users do call the burden is on them to recompile anything that touches e.g. uniform_int_distribution::operator() if they want to ensure the new behavior.

@chris0x44
Contributor

Leaving this here to avoid confusion about my previous suggestion. It seems I took a wrong turn with my approach. The operation for an index that is a power of 2 should just be _Value & (_Index - 1), handled as a special case; the Lemire optimization applies to the non-power-of-2 cases.
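For the record, a rough sketch of what that combined shape could look like for the 32-bit case (illustrative names only; not the code that eventually landed):

    #include <cstdint>

    // Map uniform 32-bit draws from gen() into [0, range), range >= 1, without bias.
    // Power-of-two ranges reduce to a mask; other ranges use Lemire's multiply-shift
    // with the usual rejection step to discard the biased region.
    template <class Gen32> // Gen32::operator() returns uniform uint32_t values
    uint32_t bounded(Gen32& gen, uint32_t range) {
        if ((range & (range - 1)) == 0) { // power of two: mask the low bits
            return gen() & (range - 1);
        }

        uint64_t product = static_cast<uint64_t>(gen()) * range;
        uint32_t low     = static_cast<uint32_t>(product);

        if (low < range) { // only then can this draw fall in the biased region
            const uint32_t threshold = (0u - range) % range; // 2^32 mod range
            while (low < threshold) {
                product = static_cast<uint64_t>(gen()) * range;
                low     = static_cast<uint32_t>(product);
            }
        }

        return static_cast<uint32_t>(product >> 32);
    }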

@horenmar
Contributor

@CaseyCarter does this mean

This doesn't seem more "ABI breaking" than any other behavior change: If we avoid changing the representation of uniform_int_distribution, then programs will continue to link and run whether they get the old version or the new version. They may not run correctly if they specifically depend on the old behavior or the new behavior, but that's not a problem I'd call an "ABI break".

that a proposal to standardize the outputs of distributions in <random> would not be considered a breaking change by MS's STL team?

@CaseyCarter
Member

does this mean that a proposal to standardize the outputs of distributions in <random> would not be considered a breaking change by MS's STL team?

There are many different kinds of breaking change. Assuming that we could standardize the outputs of distributions without storing additional state, that proposal would be a behavioral change - and therefore source-breaking - but not ABI breaking.

@chris0x44 chris0x44 mentioned this issue Dec 3, 2019
@StephanTLavavej StephanTLavavej removed the enhancement Something can be improved label Feb 6, 2020
@CaseyCarter CaseyCarter changed the title <random>: Optimize uniform_int_distribution with multiply-shift <random>: Optimize uniform_int_distribution with multiply-shift Aug 9, 2022
@StephanTLavavej StephanTLavavej added the fixed Something works now, yay! label Sep 22, 2022