Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterating with step_by(1) is much slower than without #57517

Open
mvlabat opened this issue Jan 11, 2019 · 8 comments · Fixed by #64121
Open

Iterating with step_by(1) is much slower than without #57517

mvlabat opened this issue Jan 11, 2019 · 8 comments · Fixed by #64121
Labels
E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@mvlabat
Copy link

mvlabat commented Jan 11, 2019

Greetings!

I'd like to report that I get some significant (negative) performance impact when calling step_by(1) on an iterator. I created a repository with a detailed description of the issue and benchmarks to reproduce it: https://github.com/mvlabat/step_by_one

These are the functions I tested:

pub fn iter_default_step(mut arr: Vec<i32>) -> Vec<i32> {
    for e in arr.iter_mut() {
        *e = e.wrapping_add(3);
    }
    arr
}

pub fn iter_step(mut arr: Vec<i32>, iter_step: usize) -> Vec<i32> {
    for e in arr.iter_mut().step_by(iter_step) {
        *e = e.wrapping_add(3);
    }
    arr
}

Calling iter_step(vec![1; LARGE_ENOUGH], 1) computes 1.75x slower than iter_default_step(vec![1; LARGE_ENOUGH])
with const LARGE_ENOUGH: usize = 10_000_000;.

I'm running Macbook Pro 2015 with Intel i5-5257U CPU (2.70GHz).

My rustc version: 1.33.0-nightly (c2d381d39 2019-01-10).

These are the exact benchmark results I got:

running 4 tests
test tests::bench_iter_default_step ... bench:   8,702,963 ns/iter (+/- 1,648,782)
test tests::bench_iter_step_1       ... bench:  15,267,083 ns/iter (+/- 1,236,220)
test tests::bench_iter_step_16      ... bench:   9,053,772 ns/iter (+/- 380,422)
test tests::bench_iter_step_64      ... bench:   8,711,169 ns/iter (+/- 327,562)

In the repository README there are also links to the generated asm code of these two functions.

@nikic
Copy link
Contributor

nikic commented Jan 11, 2019

Godbolt link: https://rust.godbolt.org/z/scU_bY

I've slightly changed the code to actually modify the array (wrapping_add returns the value -- it does not operate in-place).

One bit that stands out especially to me is this:

bb4: ; preds = %bb8, %start
  %iter.sroa.0.0 = phi i64 [ %1, %start ], [ %iter.sroa.0.1, %bb8 ]
  %iter.sroa.13.0 = phi i1 [ false, %start ], [ true, %bb8 ]
  br i1 %iter.sroa.13.0, label %bb2.i, label %bb1.i

I would have expected something (possibly jump threading?) to optimize away the %iter.sroa.13.0 phi.

@nikic nikic added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jan 11, 2019
@mvlabat
Copy link
Author

mvlabat commented Jan 11, 2019

Yeah, I forgot to use the returned addition value. Changed that in my code already too )

@scottmcm
Copy link
Member

I wonder if this would be fixed with the change discussed in #52065. Also, try using .for_each() instead of a for loop; this might just be one of the many places where internal iteration is just faster.

@nikic
Copy link
Contributor

nikic commented Sep 9, 2019

I don't think #64121 fixes this issue, as it affects internal iteration only, while this one is about external iteration.

@nikic nikic reopened this Sep 9, 2019
@timvermeulen
Copy link
Contributor

@nikic You're right, but now it's just a matter of "for_each is faster than a for loop for step_by" which is not at all surprising if you consider that StepBy::next branches based on the first_take field. It's not something we're able to fix in the general case.

@mlindner
Copy link

mlindner commented May 18, 2020

There's another example of this coming up here in this thread: https://www.reddit.com/r/learnrust/comments/glxa5r/rust_vs_cpp_speed_in_implementation/

Rust vs C++ example where C++ takes 600-700ms and Rust takes 4300ms.

@the8472
Copy link
Member

the8472 commented May 23, 2023

Looks like it's fixed on beta and nightly. It hasn't reached stable yet.

https://rust.godbolt.org/z/6qjvoE719

@nikic nikic added the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label May 23, 2023
@nikic
Copy link
Contributor

nikic commented May 23, 2023

I believe step_by was one of the iterators with improved optimization in LLVM 16. I thought I added a codegen test for that, but apparently not.

We should add a test for this, but probably not the one from OP, but something with simpler IR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants