
Missed loop optimization for &Vec argument in .zip() #36920

Closed
bluss opened this issue Oct 3, 2016 · 15 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

bluss (Member) commented Oct 3, 2016

This is a case where two almost identical functions have very different performance: the slice version is autovectorized and the Vec version is not. Note that specialization is involved, but both cases use exactly the same code path (the slice iterator).

#![crate_type="lib"]

// Slice version: autovectorizes.
#[no_mangle]
pub fn dot_s(xs: &[u32], ys: &[u32]) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

// Vec version: identical body, but does not autovectorize.
#[no_mangle]
pub fn dot_v(xs: &Vec<u32>, ys: &Vec<u32>) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

Using rustc 1.14.0-nightly (289f3a4 2016-09-29)

Note: the Vec vs slice difference sometimes shows up in real code that uses .zip inline, depending on the exact context.
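As a minimal workaround sketch (not part of the original report; the wrapper name below is made up): deref-coerce the Vecs to slices at the call site, so the vectorizing slice version above is the one that runs.

// Illustrative wrapper: as_slice() (or equivalently `&xs[..]` / `&*xs`)
// turns &Vec<u32> into &[u32], so the autovectorized dot_s path is used.
pub fn dot_via_slices(xs: &Vec<u32>, ys: &Vec<u32>) -> u32 {
    dot_s(xs.as_slice(), ys.as_slice())
}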

@bluss bluss added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Oct 3, 2016
arielb1 (Contributor) commented Oct 3, 2016

Note: a similar version with xs: &mut [u32] (and therefore no noalias metadata) still vectorizes:

#[no_mangle]
pub fn dot_mut_s(xs: &mut [u32], ys: &mut [u32]) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += (*x) * (*y);
    }
    s
}

Maybe the problem is the nonnull metadata (vs. a call to assume)? The only difference between the two should be the call to Vec::deref.

bluss (Member, Author) commented Oct 3, 2016

It may just be the double indirection in &Vec vs &[u32]?

I've seen double-indirection losses like that in closures too (references captured by reference vs. by move).

Tried it: yes, double indirection is the problem. No autovectorization:

#[no_mangle]
pub fn dot_ref_s(xs: &&[u32], ys: &&[u32]) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(*ys) {
        s += x * y;
    }
    s
}

Playground link
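As a hedged side note (a sketch, not tested in this thread; the function name is illustrative): peeling off the outer reference once, before the loop, gives the body the same single indirection as dot_s, which should let it vectorize the same way.

#[no_mangle]
pub fn dot_ref_s_deref(xs: &&[u32], ys: &&[u32]) -> u32 {
    // Copy the fat pointers out of the outer references up front; &[u32] is
    // Copy, so this is cheap. The loop below is then identical to dot_s.
    let (xs, ys): (&[u32], &[u32]) = (*xs, *ys);
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}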

bluss (Member, Author) commented Oct 3, 2016

Inside ndarray I carefully use move closures to capture references as they are, and this double-indirection issue seems to be the reason that matters.
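To illustrate the capture difference (a sketch, not code from ndarray; the names are made up): a borrowing closure captures the reference by reference, so its body reaches the slice through &&[u32], while a move closure copies the &[u32] itself and keeps a single level of indirection.

fn closure_captures(xs: &[u32]) -> u32 {
    // Borrowing closure: `xs` is captured by reference, so inside the body
    // the slice is reached through `&&[u32]` (double indirection).
    let by_ref = || xs.iter().sum::<u32>();

    // `move` closure: `&[u32]` is Copy, so the reference itself is copied
    // into the closure and the body sees a plain `&[u32]`.
    let by_move = move || xs.iter().sum::<u32>();

    by_ref() + by_move()
}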

arielb1 (Contributor) commented Oct 3, 2016

The problem indeed looks like a dubious null check surviving:

; Function Attrs: readonly uwtable
define i32 @dot_ref_s({ i32*, i64 }* noalias nocapture readonly dereferenceable(16), { i32*, i64 }* noalias nocapture readonly dereferenceable(16)) unnamed_addr #2 personality i32 (i32, i32, i64, %"8.unwind::libunwind::_Unwind_Exception"*, %"8.unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
  %2 = getelementptr inbounds { i32*, i64 }, { i32*, i64 }* %0, i64 0, i32 0
  %3 = getelementptr inbounds { i32*, i64 }, { i32*, i64 }* %0, i64 0, i32 1
  %4 = load i32*, i32** %2, align 8
  %5 = load i64, i64* %3, align 8
  %6 = getelementptr inbounds { i32*, i64 }, { i32*, i64 }* %1, i64 0, i32 0
  %7 = getelementptr inbounds { i32*, i64 }, { i32*, i64 }* %1, i64 0, i32 1
  %8 = load i32*, i32** %6, align 8
  %9 = load i64, i64* %7, align 8
  %10 = icmp ule i64 %5, %9
  %_0.0.sroa.speculated.i.i.i = select i1 %10, i64 %5, i64 %9
  %11 = icmp eq i64 %_0.0.sroa.speculated.i.i.i, 0
  br i1 %11, label %bb6, label %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit.preheader"

"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit.preheader": ; preds = %entry-block
  br label %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit"

"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit": ; preds = %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit.preheader", %bb7
  %s.017 = phi i32 [ %18, %bb7 ], [ 0, %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit.preheader" ]
  %iter.sroa.37.016 = phi i64 [ %14, %bb7 ], [ 0, %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit.preheader" ]
  %12 = getelementptr inbounds i32, i32* %4, i64 %iter.sroa.37.016
  %switchtmp = icmp eq i32* %12, null ; <- ********************************** NULL CHECK HERE
  br i1 %switchtmp, label %bb6.loopexit, label %bb7

bb6.loopexit:                                     ; preds = %bb7, %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit"
  %s.0.lcssa.ph = phi i32 [ %s.017, %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit" ], [ %18, %bb7 ]
  br label %bb6

bb6:                                              ; preds = %bb6.loopexit, %entry-block
  %s.0.lcssa = phi i32 [ 0, %entry-block ], [ %s.0.lcssa.ph, %bb6.loopexit ]
  ret i32 %s.0.lcssa

bb7:                                              ; preds = %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit"
  %13 = getelementptr inbounds i32, i32* %8, i64 %iter.sroa.37.016
  %14 = add nuw i64 %iter.sroa.37.016, 1
  %15 = load i32, i32* %12, align 4
  %16 = load i32, i32* %13, align 4
  %17 = mul i32 %16, %15
  %18 = add i32 %17, %s.017
  %19 = icmp ult i64 %14, %_0.0.sroa.speculated.i.i.i
  br i1 %19, label %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17h91c35d8ab9003ffeE.exit", label %bb6.loopexit
}

arielb1 (Contributor) commented Oct 3, 2016

Fixing this is blocked on https://llvm.org/bugs/show_bug.cgi?id=30597.

arielb1 (Contributor) commented Oct 4, 2016

The LLVM PR fixes the Vec case (tested locally). Fixing the &&[u32] case requires us to emit more attributes.

arielb1 added a commit to arielb1/rust that referenced this issue Oct 5, 2016
cc rust-lang#36920 (in addition to LLVM PR30597, should fix the &&[i32] case)
bluss (Member, Author) commented Oct 12, 2016

rustc 0b2c356 plus that LLVM backport revision fixes all the test cases in this issue. Awesome. Now to see what else it fixes.

bluss (Member, Author) commented Oct 12, 2016

ndarray changes (a selection of benchmarks; the rest seemed unchanged). Benchmarking is a bit noisy, so I had to pick a few and try to verify them.

comparing

  • old: rustc 1.14.0-nightly (a3bc191 2016-10-10)
  • new: rustc 1.14.0-dev 0b2c356 + that llvm backport revision
name                           old-best ns/iter  new-best ns/iter    diff ns/iter   diff %
iter_sum_2d_regular            178               177                           -1   -0.56%
iter_sum_2d_transpose_by_row   3,163             2,603                       -560  -17.70%
iter_sum_2d_transpose_regular  4,783             5,056                        273    5.71%
mean_axis0                     5,474             4,736                       -738  -13.48%
mean_axis1                     2,977             2,918                        -59   -1.98%
scalar_sum_2d_cutout           505               504                           -1   -0.20%
scalar_sum_2d_float            417               416                           -1   -0.24%
scalar_sum_2d_float_cutout     543               544                            1    0.18%
scalar_sum_2d_regular          169               169                            0    0.00%
sum_axis0                      5,057             4,550                       -507  -10.03%
sum_axis1                      2,768             2,720                        -48   -1.73%

I believe the improvements in mean_axis0, sum_axis0 and iter_sum_2d_transpose_by_row are legitimate and that iter_sum_2d_transpose_regular genuinely regresses for some reason. I don't think the rest can be verified to change in either direction.

It's nice overall, though.

arielb1 mentioned this issue Oct 12, 2016
arielb1 (Contributor) commented Oct 12, 2016

Any code changes in iter_sum_2d_transpose_regular?

arielb1 added a commit to arielb1/rust that referenced this issue Oct 12, 2016
Contains backport for my patch, [InstCombine] Transform !range metadata to !nonnull when combining loads.

Fixes rust-lang#36920.
bluss (Member, Author) commented Oct 12, 2016

I haven't really found what the difference is. Do you think I should look at the unoptimized or the optimized LLVM IR? It's not a big concern, since this is already the slow case for the iterator.

arielb1 (Contributor) commented Oct 13, 2016

The optimized IR. Could you post it?

bluss (Member, Author) commented Oct 13, 2016

Optimized IR here; there are two files, "old" and "new", from the same versions as in the previous comment (#36920 (comment)).

bluss (Member, Author) commented Oct 13, 2016

I haven't spotted the difference yet. There's a lot of noise from what I think is the autovectorization of the fast path, which isn't taken for the transposed array. Instead it should run the general/slow case, which calls Dimension::next_for() (that call is at least visible in the IR).

arielb1 (Contributor) commented Oct 13, 2016

The IRs look identical up to α-renaming.

bors added a commit that referenced this issue Oct 20, 2016
LLVM: Add triple for Fuchsia

Update subproject commit.

Fixes #36920
bluss (Member, Author) commented Oct 21, 2016

This is now in the latest nightly, and all four test cases produce the same nice code. Thanks a lot for working on it.
