
Added compile time optimization for bytewise slice comparison #39422

Closed
wants to merge 4 commits into from

Conversation

@saschagrunert

saschagrunert commented Jan 31, 2017

Inspiration for this change is my crate fastcmp. The idea is to add an additional step before calling directly memcmp, which will optimize the PartialEq trait for slices up to 256 bytes.

I think this relates somehow to #16913.
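For illustration (this is not code from the PR): the core idea is that when a slice's length is known, the comparison can be done as one wide integer load per chunk instead of a byte-by-byte memcmp. The sketch below shows the 8-byte case using the safe `u64::from_ne_bytes` (an API stabilized after this PR), which sidesteps the alignment question entirely.

```rust
// Minimal sketch of the wide-load idea for a fixed-length slice.
// `from_ne_bytes` copies the bytes into a u64, so no raw pointer
// casts or alignment assumptions are involved.
fn eq8(a: &[u8; 8], b: &[u8; 8]) -> bool {
    u64::from_ne_bytes(*a) == u64::from_ne_bytes(*b)
}

fn main() {
    println!("{}", eq8(b"abcdefgh", b"abcdefgh")); // true
    println!("{}", eq8(b"abcdefgh", b"abcdefgx")); // false
}
```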

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@aturon
Member

aturon commented Jan 31, 2017

cc @rust-lang/libs @bluss: thoughts on using build scripts in core, and this optimization?

@saschagrunert Do you have any benchmark results for this change? Generally we like to see evidence of significant speedup before introducing optimizations like this, since they impose a maintenance burden.

@TimNN
Contributor

TimNN commented Jan 31, 2017

I'm not sure how relevant this is, but as far as I can tell this will cause unaligned loads (which should generally be avoided as far as I know).

I also think it would be nice if the build.rs file would document what the code generated by slice_compare looks like.

@bluss
Member

bluss commented Jan 31, 2017

Good point. Unaligned loads must be avoided, otherwise the code is not portable.
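An editorial illustration of the portability concern (not code from the PR): a raw `*const u64` load is only defined when the pointer is suitably aligned, and strict-alignment targets (some ARM, SPARC, MIPS) fault on misaligned loads. Portable code must either check alignment before taking the wide-load fast path or use dedicated unaligned-load primitives.

```rust
use std::mem;

// Returns true if `p` satisfies the alignment requirement of `T`.
fn is_aligned_to<T>(p: *const u8) -> bool {
    (p as usize) % mem::align_of::<T>() == 0
}

fn main() {
    let buf = [0u64; 2]; // u64 storage is guaranteed 8-byte aligned
    let p = buf.as_ptr() as *const u8;
    println!("{}", is_aligned_to::<u64>(p));                 // true
    println!("{}", is_aligned_to::<u64>(p.wrapping_add(1))); // false
}
```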

@alexcrichton
Member

From a build perspective using a build script seems fine to me. I'd be slightly worried about code bloat in libcore, but we can evaluate that later.

@brson
Contributor

brson commented Feb 1, 2017

I don't have a problem per se with build scripts in libcore, but this is quite a lot of ceremony for one little optimization. I do think that bloat in core is a problem. I'm not clear on what this patch is doing, but if we add a bunch of code for optimization purposes, we will definitely be called upon to remove it conditionally someday.

@saschagrunert
Author

saschagrunert commented Feb 1, 2017

@aturon Yes, the benchmarks come from the fastcmp crate (including their source); here are results for a 256-byte slice comparison on my local machine:

test fast_compare_equal    ... bench:          14 ns/iter (+/- 9) = 18285 MB/s
test fast_compare_unequal  ... bench:          14 ns/iter (+/- 0) = 18285 MB/s
test slice_compare_equal   ... bench:          35 ns/iter (+/- 29) = 7314 MB/s
test slice_compare_unequal ... bench:          37 ns/iter (+/- 3) = 6918 MB/s

I definitely see the code bloat point, since the compilation time increases drastically if I raise the supported slice length.

The optimization improves on the usual memcmp by comparing multiple bytes at a time (u64, u32, ...) instead of every single byte. One might think that the following has the same effect:

fn slice_compare(a: *const i8, b: *const i8, len: usize) -> bool {
    let mut bits = len * 8;
    let mut offset = 0;
    while bits > 0 {
        if bits >= 128 {
            if !cmp!(a, b, u128, offset) {
                return false;
            }
            bits -= 128;
            offset += 16;
        } else if bits >= 64 {
            if !cmp!(a, b, u64, offset) {
                return false;
            }
            bits -= 64;
            offset += 8;
        } else if bits >= 32 {
            if !cmp!(a, b, u32, offset) {
                return false;
            }
            bits -= 32;
            offset += 4;
        } else if bits >= 16 {
            if !cmp!(a, b, u16, offset) {
                return false;
            }
            bits -= 16;
            offset += 2;
        } else if bits >= 8 {
            if !cmp!(a, b, u8, offset) {
                return false;
            }
            bits -= 8;
            offset += 1;
        } else {
            unreachable!();
        }
    }
    true
}

But with the generated source code, an optimization for slices of known length is possible at compile time:

macro_rules! slice_compare (
    ($a:expr, $b:expr, $len:expr) => {{
        match $len {
            1 => cmp!($a, $b, u8, 0),
            2 => cmp!($a, $b, u16, 0),
            3 => cmp!($a, $b, u16, 0) && cmp!($a, $b, u8, 2),
            4 => cmp!($a, $b, u32, 0),
            ....

u128 support could be added later on as well.

@bluss
Member

bluss commented Feb 1, 2017

How do we properly evaluate if the code size expansion is worth it? I'd look at existing benchmarks in larger programs.

@saschagrunert
Author

saschagrunert commented Feb 1, 2017

I am not sure how the regex crate uses slice comparisons, but here are the benchmark results:

 name                                     nightly ns/iter        memcmp_optimization ns/iter  diff ns/iter   diff %
 misc::anchored_literal_long_match        26 (15000 MB/s)        25 (15600 MB/s)                        -1   -3.85%
 misc::anchored_literal_long_non_match    30 (13000 MB/s)        25 (15600 MB/s)                        -5  -16.67%
 misc::anchored_literal_short_match       26 (1000 MB/s)         25 (1040 MB/s)                         -1   -3.85%
 misc::anchored_literal_short_non_match   31 (838 MB/s)          25 (1040 MB/s)                         -6  -19.35%
 misc::easy0_1K                           20 (52550 MB/s)        19 (55315 MB/s)                        -1   -5.00%
 misc::easy0_1MB                          26 (40330884 MB/s)     24 (43691791 MB/s)                     -2   -7.69%
 misc::easy0_32                           20 (2950 MB/s)         19 (3105 MB/s)                         -1   -5.00%
 misc::easy0_32K                          20 (1639750 MB/s)      19 (1726052 MB/s)                      -1   -5.00%
 misc::easy1_1K                           56 (18642 MB/s)        58 (18000 MB/s)                         2    3.57%
 misc::easy1_1MB                          58 (18079241 MB/s)     58 (18079241 MB/s)                      0    0.00%
 misc::easy1_32                           63 (825 MB/s)          56 (928 MB/s)                          -7  -11.11%
 misc::easy1_32K                          65 (504430 MB/s)       55 (596145 MB/s)                      -10  -15.38%
 misc::hard_1K                            70 (15014 MB/s)        69 (15231 MB/s)                        -1   -1.43%
 misc::hard_1MB                           74 (14170310 MB/s)     73 (14364424 MB/s)                     -1   -1.35%
 misc::hard_32                            70 (842 MB/s)          69 (855 MB/s)                          -1   -1.43%
 misc::hard_32K                           70 (468500 MB/s)       74 (443175 MB/s)                        4    5.71%
 misc::literal                            19 (2684 MB/s)         20 (2550 MB/s)                          1    5.26%
 misc::long_needle1                       2,789 (35855 MB/s)     2,757 (36271 MB/s)                    -32   -1.15%
 misc::long_needle2                       714,451 (139 MB/s)     712,581 (140 MB/s)                 -1,870   -0.26%
 misc::match_class                        80 (1012 MB/s)         88 (920 MB/s)                           8   10.00%
 misc::match_class_in_range               34 (2382 MB/s)         32 (2531 MB/s)                         -2   -5.88%
 misc::match_class_unicode                433 (371 MB/s)         353 (456 MB/s)                        -80  -18.48%
 misc::medium_1K                          21 (50095 MB/s)        21 (50095 MB/s)                         0    0.00%
 misc::medium_1MB                         26 (40330923 MB/s)     26 (40330923 MB/s)                      0    0.00%
 misc::medium_32                          21 (2857 MB/s)         21 (2857 MB/s)                          0    0.00%
 misc::medium_32K                         21 (1561714 MB/s)      22 (1490727 MB/s)                       1    4.76%
 misc::no_exponential                     467 (214 MB/s)         510 (196 MB/s)                         43    9.21%
 misc::not_literal                        136 (375 MB/s)         126 (404 MB/s)                        -10   -7.35%
 misc::one_pass_long_prefix               76 (342 MB/s)          74 (351 MB/s)                          -2   -2.63%
 misc::one_pass_long_prefix_not           74 (351 MB/s)          79 (329 MB/s)                           5    6.76%
 misc::one_pass_short                     53 (320 MB/s)          53 (320 MB/s)                           0    0.00%
 misc::one_pass_short_not                 60 (283 MB/s)          60 (283 MB/s)                           0    0.00%
 misc::reallyhard2_1K                     100 (10400 MB/s)       102 (10196 MB/s)                        2    2.00%
 misc::reallyhard_1K                      2,201 (477 MB/s)       2,485 (422 MB/s)                      284   12.90%
 misc::reallyhard_1MB                     2,200,632 (476 MB/s)   2,180,884 (480 MB/s)              -19,748   -0.90%
 misc::reallyhard_32                      143 (412 MB/s)         142 (415 MB/s)                         -1   -0.70%
 misc::reallyhard_32K                     68,128 (481 MB/s)      68,124 (481 MB/s)                      -4   -0.01%
 misc::replace_all                        176                    176                                     0    0.00%
 misc::reverse_suffix_no_quadratic        7,015 (1140 MB/s)      5,647 (1416 MB/s)                  -1,368  -19.50%
 regexdna::find_new_lines                 18,309,005 (277 MB/s)  18,342,658 (277 MB/s)              33,653    0.18%
 regexdna::subst1                         1,193,106 (4260 MB/s)  1,190,372 (4270 MB/s)              -2,734   -0.23%
 regexdna::subst10                        1,193,920 (4257 MB/s)  1,195,145 (4253 MB/s)               1,225    0.10%
 regexdna::subst11                        1,207,184 (4210 MB/s)  1,184,599 (4291 MB/s)             -22,585   -1.87%
 regexdna::subst2                         1,198,529 (4241 MB/s)  1,208,560 (4206 MB/s)              10,031    0.84%
 regexdna::subst3                         1,242,987 (4089 MB/s)  1,182,336 (4299 MB/s)             -60,651   -4.88%
 regexdna::subst4                         1,205,986 (4215 MB/s)  1,196,506 (4248 MB/s)              -9,480   -0.79%
 regexdna::subst5                         1,190,987 (4268 MB/s)  1,192,494 (4262 MB/s)               1,507    0.13%
 regexdna::subst6                         1,196,724 (4247 MB/s)  1,295,290 (3924 MB/s)              98,566    8.24%
 regexdna::subst7                         1,336,715 (3802 MB/s)  1,184,260 (4292 MB/s)            -152,455  -11.41%
 regexdna::subst8                         1,213,462 (4189 MB/s)  1,228,853 (4136 MB/s)              15,391    1.27%
 regexdna::subst9                         1,195,252 (4253 MB/s)  1,255,660 (4048 MB/s)              60,408    5.05%
 regexdna::variant1                       4,655,147 (1091 MB/s)  3,895,812 (1304 MB/s)            -759,335  -16.31%
 regexdna::variant2                       8,402,572 (604 MB/s)   7,505,751 (677 MB/s)             -896,821  -10.67%
 regexdna::variant3                       9,959,632 (510 MB/s)   8,639,892 (588 MB/s)           -1,319,740  -13.25%
 regexdna::variant4                       10,024,492 (507 MB/s)  8,669,005 (586 MB/s)           -1,355,487  -13.52%
 regexdna::variant5                       8,442,707 (602 MB/s)   6,991,045 (727 MB/s)           -1,451,662  -17.19%
 regexdna::variant6                       8,220,317 (618 MB/s)   6,714,895 (757 MB/s)           -1,505,422  -18.31%
 regexdna::variant7                       8,395,103 (605 MB/s)   6,626,063 (767 MB/s)           -1,769,040  -21.07%
 regexdna::variant8                       8,530,544 (595 MB/s)   6,797,057 (747 MB/s)           -1,733,487  -20.32%
 regexdna::variant9                       8,422,626 (603 MB/s)   6,632,114 (766 MB/s)           -1,790,512  -21.26%
 sherlock::before_after_holmes            1,309,904 (454 MB/s)   1,256,609 (473 MB/s)              -53,295   -4.07%
 sherlock::before_holmes                  100,387 (5926 MB/s)    99,688 (5967 MB/s)                   -699   -0.70%
 sherlock::everything_greedy              3,029,729 (196 MB/s)   3,074,262 (193 MB/s)               44,533    1.47%
 sherlock::everything_greedy_nl           1,119,313 (531 MB/s)   1,126,589 (528 MB/s)                7,276    0.65%
 sherlock::holmes_cochar_watson           197,225 (3016 MB/s)    181,311 (3281 MB/s)               -15,914   -8.07%
 sherlock::holmes_coword_watson           678,915 (876 MB/s)     679,120 (876 MB/s)                    205    0.03%
 sherlock::ing_suffix                     540,007 (1101 MB/s)    531,012 (1120 MB/s)                -8,995   -1.67%
 sherlock::ing_suffix_limited_space       1,472,288 (404 MB/s)   1,472,371 (404 MB/s)                   83    0.01%
 sherlock::letters                        30,811,901 (19 MB/s)   32,437,455 (18 MB/s)            1,625,554    5.28%
 sherlock::letters_lower                  29,862,497 (19 MB/s)   31,551,544 (18 MB/s)            1,689,047    5.66%
 sherlock::letters_upper                  2,449,580 (242 MB/s)   2,483,428 (239 MB/s)               33,848    1.38%
 sherlock::line_boundary_sherlock_holmes  1,239,588 (479 MB/s)   1,244,218 (478 MB/s)                4,630    0.37%
 sherlock::name_alt1                      43,106 (13801 MB/s)    44,331 (13420 MB/s)                 1,225    2.84%
 sherlock::name_alt2                      154,165 (3859 MB/s)    168,084 (3539 MB/s)                13,919    9.03%
 sherlock::name_alt3                      169,525 (3509 MB/s)    169,755 (3504 MB/s)                   230    0.14%
 sherlock::name_alt3_nocase               1,600,117 (371 MB/s)   1,797,026 (331 MB/s)              196,909   12.31%
 sherlock::name_alt4                      205,502 (2895 MB/s)    224,941 (2644 MB/s)                19,439    9.46%
 sherlock::name_alt4_nocase               290,542 (2047 MB/s)    290,104 (2050 MB/s)                  -438   -0.15%
 sherlock::name_alt5                      159,470 (3730 MB/s)    160,301 (3711 MB/s)                   831    0.52%
 sherlock::name_alt5_nocase               810,481 (734 MB/s)     746,462 (797 MB/s)                -64,019   -7.90%
 sherlock::name_holmes                    53,321 (11157 MB/s)    50,460 (11790 MB/s)                -2,861   -5.37%
 sherlock::name_holmes_nocase             238,374 (2495 MB/s)    231,151 (2573 MB/s)                -7,223   -3.03%
 sherlock::name_sherlock                  85,498 (6958 MB/s)     95,866 (6205 MB/s)                 10,368   12.13%
 sherlock::name_sherlock_holmes           40,648 (14636 MB/s)    40,546 (14673 MB/s)                  -102   -0.25%
 sherlock::name_sherlock_holmes_nocase    210,272 (2829 MB/s)    205,370 (2896 MB/s)                -4,902   -2.33%
 sherlock::name_sherlock_nocase           206,834 (2876 MB/s)    194,601 (3057 MB/s)               -12,233   -5.91%
 sherlock::name_whitespace                103,576 (5743 MB/s)    104,336 (5702 MB/s)                   760    0.73%
 sherlock::no_match_common                28,759 (20686 MB/s)    28,794 (20661 MB/s)                    35    0.12%
 sherlock::no_match_really_common         462,169 (1287 MB/s)    466,814 (1274 MB/s)                 4,645    1.01%
 sherlock::no_match_uncommon              28,706 (20725 MB/s)    28,738 (20701 MB/s)                    32    0.11%
 sherlock::quotes                         644,007 (923 MB/s)     640,884 (928 MB/s)                 -3,123   -0.48%
 sherlock::repeated_class_negation        103,471,458 (5 MB/s)   106,326,778 (5 MB/s)            2,855,320    2.76%
 sherlock::the_lower                      791,502 (751 MB/s)     796,159 (747 MB/s)                  4,657    0.59%
 sherlock::the_nocase                     583,382 (1019 MB/s)    583,712 (1019 MB/s)                   330    0.06%
 sherlock::the_upper                      56,158 (10593 MB/s)    56,568 (10517 MB/s)                   410    0.73%
 sherlock::the_whitespace                 1,447,638 (410 MB/s)   1,432,960 (415 MB/s)              -14,678   -1.01%
 sherlock::word_ending_n                  2,306,066 (257 MB/s)   2,326,694 (255 MB/s)               20,628    0.89%
 sherlock::words                          11,681,622 (50 MB/s)   11,993,532 (49 MB/s)              311,910    2.67%

Around 20% is the performance improvement I got with the same technique in C projects.

// The compare macro for the bytewise slice comparison
macro_rules! cmp (
    ($left:expr, $right:expr, $var:ident, $offset:expr) => {{
        unsafe { *($left.offset($offset) as *const $var) == *($right.offset($offset) as *const $var) }
    }}
);
Review comment (Member):

This is a possibly unaligned load (for example, a *const u64 load that is not necessarily well aligned for u64). We can't merge the code like this, because it is not portable and will crash on certain platforms.

Review comment (Contributor):

Fixing this issue will likely have an impact on the performance. In order to get the best possible performance on platforms which allow unaligned memory access, we might need/want to perform this optimisation at a lower level (i.e. replace the implementation of memcmp and/or the strategy used in LLVM to replace memcmp with inline comparisons).

Review comment (Author):

I am currently not sure how to fix these unaligned loads. Is there something like get_unaligned in the compiler?
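Editorial note: the standard library later gained `std::ptr::read_unaligned` (it did not exist when this PR was open), which answers this question. It reads a value through a possibly misaligned pointer, compiling to a single load on architectures that tolerate misalignment and to byte-wise loads elsewhere. A hypothetical `cmp!`-style helper built on it could look like this:

```rust
use std::ptr;

// Hypothetical chunk comparison that is sound even when the pointers
// are not aligned for `T`, thanks to `ptr::read_unaligned`.
unsafe fn cmp_chunk<T: PartialEq>(a: *const u8, b: *const u8, offset: usize) -> bool {
    ptr::read_unaligned(a.add(offset) as *const T)
        == ptr::read_unaligned(b.add(offset) as *const T)
}

fn main() {
    let a = *b"0123456789";
    let b = *b"0123456789";
    // Offset 1 is almost certainly misaligned for u64; still sound here.
    let eq = unsafe { cmp_chunk::<u64>(a.as_ptr(), b.as_ptr(), 1) };
    println!("{}", eq); // true
}
```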

@aturon
Member

aturon commented Feb 3, 2017

Regarding the bloat, I want to cc #39492, which flags a non-trivial bloat in libcore that affects embedded device development. The libs team will soon be considering whether to set up a general policy of optimizing core for code size. We should consider the impact of this PR in that light as well.

@aturon aturon added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Feb 3, 2017
@aturon
Member

aturon commented Feb 3, 2017

Nominating for libs team discussion. I feel like we need some clearer policies around these kinds of optimization tradeoffs.

@arthurprs
Contributor

It'd be interesting to graph code size vs OPTLEN, and maybe speed. I suspect we can lower it to 64 and still reap most of the benefits while being more comfortable with the extra bloat.

@nagisa
Member

nagisa commented Feb 9, 2017

There’s little to be done here other than providing an efficient implementation of memcmp (possibly with alignment argument, diverging from C memcmp somewhat) and using that to do comparisons like these.

Now that rustc supports things like #[target_feature] it is trivial to provide SSE/NEON-optimised implementation along with the fallback as well.
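A sketch of that suggestion (hypothetical, and using `#[target_feature]` plus `is_x86_feature_detected!`, both stabilized after this PR): a CPU-feature-gated path with a portable scalar fallback. The SSE2 body is elided here and simply reuses the scalar path; a real implementation would compare 16-byte chunks with SIMD intrinsics.

```rust
// Portable fallback: plain byte-by-byte comparison.
fn eq_scalar(a: &[u8], b: &[u8]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x == y)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn eq_sse2(a: &[u8], b: &[u8]) -> bool {
    // A real implementation would use _mm_cmpeq_epi8 on 16-byte chunks;
    // elided here, so this just delegates to the scalar path.
    eq_scalar(a, b)
}

// Runtime dispatch: pick the SIMD path when the CPU supports it.
fn slices_equal(a: &[u8], b: &[u8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            return unsafe { eq_sse2(a, b) };
        }
    }
    eq_scalar(a, b)
}

fn main() {
    println!("{}", slices_equal(b"hello", b"hello")); // true
    println!("{}", slices_equal(b"hello", b"other")); // false
}
```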

@alexcrichton
Member

We discussed this during libs triage today; our conclusion was that we should probably at least fix the unaligned load business before evaluating performance, and after that we can decide whether this needs to live in core or can live externally.

@alexcrichton
Member

It may also be worth investigating performance across platforms, I'd expect memcmp to be much faster with glibc, for example, than with OSX

@saschagrunert
Author

I had the chance to think about the issue with this pull request, and I really think the optimisation would fit better into LLVM. So we could add a memcmp intrinsic there...

@nagisa
Member

nagisa commented Feb 17, 2017

It may also be worth investigating performance across platforms, I'd expect memcmp to be much faster with glibc, for example, than with OSX

I personally feel like it is a much better idea to just spend the time and write our own rmemcmp. Even if it ends up not being as efficient as the hand-tuned assembly glibc uses, it will certainly be better than the naive byte-by-byte code we’re running now. We also have the pieces we need for quite an efficient implementation – #[target_feature] and corresponding cfgs, for instance.

I feel this is one of the places @BurntSushi would have interesting things to say, as they spent a considerable amount of time implementing efficient string search (i.e. somewhat related) for their regex crate.


Adding a memcmp intrinsic to LLVM would be very non-trivial, sadly.

@aturon
Member

aturon commented Mar 14, 2017

I'm going to close this PR for the time being, pending the actions outlined here. Thanks!

@aturon aturon closed this Mar 14, 2017
Labels: T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)

10 participants