Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust stable version 4x times performance loss in avx2 #674

Closed
zzau13 opened this issue Feb 1, 2019 · 20 comments
Closed

Rust stable version 4x times performance loss in avx2 #674

zzau13 opened this issue Feb 1, 2019 · 20 comments

Comments

@zzau13
Copy link
Contributor

zzau13 commented Feb 1, 2019

Lost 4 times the performance in avx2 with the same sources of previous version. This error is random from the new stable version 1.32.0 (9fda7c223 2019-01-16).

The algorithm in avx2 uses cmpeq_epi8, cmpgt_epi8, or and add_epi8 for comparisons and loadu, load, set1_epi8, movemask_epi8 for move memory.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 4, 2019

Weird. I think it would be better to fill this issue in rust-lang/rust, since it appears to be a compiler issue.

Do you have a small self-contained snippet of code that reproduces the issue?

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 4, 2019

The problem is in the cmpgt_epi8 and the cmpeq_epi8, do not give the expected result, in the version of sse2 and avx2. So the search algorithm by mask many times more than necessary.

It is closely related to rust-lang/rust#50154. But it also happens with a debug build. As seen in this zzau13/v_escape#19, with cmpeq_epi8, test fails in Travis but not in gitlab ci.

There are several cases, I will reduce them to the minimum cases and put it in a repository in the minimum possible time.

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 7, 2019

https://github.com/rust-iendo/bench_stdsimd code that fails the only difference is that the logic of writing by a counter has changed. I can not make it fail, it's cut and paste. https://travis-ci.org/rust-iendo/bench_stdsimd and https://travis-ci.org/rust-iendo/v_htmlescape/builds/485079457 .

I will be working on this repository until I fail.

https://travis-ci.org/rust-iendo/v_htmlescape. Fail avx2 sometimes in release.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 13, 2019

@botika I think it will be best to wait if it is fixed in the next beta, and if so, we can close this. I don't think we will backport any fixes for the stable versions.

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 14, 2019

Yes, now I was going. It is fixed as of rust version 1.34.0. Thanks. I close

@zzau13 zzau13 closed this as completed Feb 14, 2019
@zzau13 zzau13 reopened this Feb 15, 2019
@zzau13
Copy link
Contributor Author

zzau13 commented Feb 15, 2019

I'm wrong these changes do not solve my performance problems. https://travis-ci.org/rust-iendo/v_htmlescape/jobs/493398064#L1355

They have solved the tests in release. The safest thing is to always return a -1.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 15, 2019

@botika it would be great to have a single crate without any dependencies that reproduces this issue. I can take over minimizing from there.

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 17, 2019

I do not really use any external libraries in runtime. I use syn, quote, proc_macro and nom(parser) for generate macro calls and statics. So if you can try to get the objdump of the function _escape.

#[macro_use]
extern crate v_escape;

// generate struct and function _escape
 new_escape!(
            MyEs,
            "#0->zero || #1->one || #2->two || #3->three || #4->four || #5->five || \
             #6->six || #7->seven || #8->eight || #9->nine", print = true
);
// you can use unsafe _escape(&[u8], fmt::Formatter) -> fmt::Result

fn main() {
    print!("{}", escape("foo bar more than 32 bytes....");
}

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 18, 2019

I guess what I am saying is that I want to help you, but I have very little time to invest in this. If the code that reproduces this does have dependencies, is larger than a couple of 100 LOC, does not compile in the playground, etc. then I am going to need more time that's what minimally necessary to figure out what's wrong and fix it, and for that, I'm going to have to find a long time slot for this, and that might take some time.

It would be great if you could, e.g., expand the result of all those macros, and chunk it down to a dependency free code / function that shows the issue. If you provide a test case that feeds it raw input bytes or similar that trigger the issue, that would be great.

With something like that, I might be able to find out what's wrong and fix it in a 15min or 30 min slot, which would mean I'd get to this sooner. Hope that makes sense.

If you don't manage to chunk this down more, I'd try to look into this ASAP anyways, but am currently not sure when that will be.

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 19, 2019

I am new to these issues but I can try to debug. @gnzlbg Any recommendations before putting me?

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 19, 2019

You can try to use https://github.com/dtolnay/cargo-expand to view the result of your snippet in #674 (comment) after macro expansion, then you can manually move all functions from v_escape that you need into the main.rs file until you can remove that dependency.

So in this first step, you'd end up with a pretty big main.rs file, without any dependencies.

Then you can try to remove all the unused stuff, and simplify the code while making sure that the issue is still reproduced.

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 20, 2019

OK thank you very much. I have already taken it apart, with the case of the add + cmpgt which is the one that should give performance problems. https://github.com/rust-iendo/bench_stdsimd/blob/master/src/debug.rs

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2019

@botika

escape (https://github.com/rust-iendo/bench_stdsimd/blob/master/src/debug.rs#L46 ) is missing a #[target_feature(enable = "avx2")]

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 22, 2019

Ok, whatever, I thought it was indifferent as long as it launched with a processor with this flag.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 22, 2019

It is indifferent if you compile with RUSTFLAGS=-C target-feature=+avx2. Were you doing that ? If not, then #[target_feature(enable)] should make a difference. Otherwise, it should not make a difference at all for this example. Does that solve the performance problem for you?

@zzau13
Copy link
Contributor Author

zzau13 commented Feb 22, 2019

It does not solve it, the attribute is set in the library. https://github.com/rust-iendo/v_htmlescape/blob/master/v_escape/src/ranges/mod.rs#L13

@zzau13
Copy link
Contributor Author

zzau13 commented Mar 7, 2019

I put a daily CRON in the travis. The results, according to my criteria, reflect that the error is located in the function range and very surely the overflow in the i8.

I do not know, I attached the link. https://travis-ci.org/rust-iendo/v_htmlescape/jobs/502697796#L1525

Functions

  • range: _mm256_add_epi8 with overflow and _mm256_cmpgt_epi8
  • eq: _mm256_cmpeq_epi8

v_htmlescape makes 2 range and a eq, while v_latexescape makes 3 range and goes a lot slower

Thanks for your time.

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 7, 2019

This is still blocked on the test case being reduced even further.

@zzau13
Copy link
Contributor Author

zzau13 commented Mar 7, 2019

Ok, I will try to reduce it to the minimum. Thanks again.

@zzau13
Copy link
Contributor Author

zzau13 commented Apr 4, 2019

In version 1.35 the problem is not observed, https://travis-ci.org/rust-iendo/v_htmlescape/jobs/515338745#L1355 . Step to close.

thanks for your patience.

@zzau13 zzau13 closed this as completed Apr 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants