Computing the maximum value in an array is slow with SIMD. #2822

Closed
juj opened this issue Sep 24, 2014 · 1 comment


juj commented Sep 24, 2014

Here's a commit that tests using SSE1 to compute the maximum value in a large array: juj@c62525e
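For readers who don't want to open the commit, here is a minimal sketch of the kind of kernel being compared: a scalar maximum and an SSE1 maximum built on `_mm_max_ps`. This is not the code from the commit; the function names `max_scalar` and `max_sse1` and the `HAVE_SSE1` macro are made up for illustration, and the sketch falls back to the scalar loop when SSE is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>

// Only pull in the SSE1 intrinsics when the target supports them.
// HAVE_SSE1 is a hypothetical macro name used by this sketch only.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE1 1
#endif

// Scalar baseline: one compare per element.
float max_scalar(const float* data, std::size_t n) {
    assert(n > 0);
    float m = data[0];
    for (std::size_t i = 1; i < n; ++i)
        if (data[i] > m) m = data[i];
    return m;
}

// SSE1 version without unrolling: four lanes of running maxima.
float max_sse1(const float* data, std::size_t n) {
    assert(n > 0);
#ifdef HAVE_SSE1
    if (n < 4) return max_scalar(data, n);
    __m128 best = _mm_loadu_ps(data);           // lanes start as data[0..3]
    std::size_t i = 4;
    for (; i + 4 <= n; i += 4)
        best = _mm_max_ps(best, _mm_loadu_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, best);                 // horizontal reduction in scalar code
    float m = max_scalar(lanes, 4);
    for (; i < n; ++i)                          // leftover elements when n % 4 != 0
        if (data[i] > m) m = data[i];
    return m;
#else
    return max_scalar(data, n);                 // no SSE available: same result, no speedup
#endif
}
```

Unaligned loads are used so the sketch works on any `float` buffer; a benchmark that controls its own allocation could use aligned loads instead. Note that `_mm_max_ps` and the scalar compare can disagree when the input contains NaNs.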

SIMD support in Emscripten is still at a very early stage, but I'm writing this down so that we can track the numbers as development progresses and performance improves.

Running natively on a MacBook Pro, built with clang++ benchmark_sse1.cpp -O3 -o a, the results are:

N: 16777216
Block scalar took 0.027743 msecs. Result: 1.000000.
Block scalar 4 unroll took 0.029196 msecs (1.052x of scalar) Result: 1.000000.
Block SSE1 no unroll took 0.004371 msecs (0.158x of scalar) Result: 1.000000.
Block SSE1 Unroll 2 took 0.005398 msecs (0.195x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 took 0.004678 msecs (0.169x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 pf took 0.003755 msecs (0.135x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 took 0.003934 msecs (0.142x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 pf took 0.004002 msecs (0.144x of scalar) Result: 1.000000.

Running in today's Firefox Nightly, 35.0a1 (2014-09-24), built with the command line em++ -O3 tests/benchmark_sse1.cpp -o a.html -s TOTAL_MEMORY=268435456, the output is:

N: 16777216
Block scalar took 0.032746 msecs. Result: 1.000000.
Block scalar 4 unroll took 0.030144 msecs (0.921x of scalar) Result: 1.000000.
Block SSE1 no unroll took 1.347598 msecs (41.153x of scalar) Result: 1.000000.
Block SSE1 Unroll 2 took 1.327801 msecs (40.548x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 took 1.333699 msecs (40.728x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 took 1.432141 msecs (43.734x of scalar) Result: 1.000000.

So running natively, the SSE1 versions are 5-7 times faster than scalar, whereas in the Emscripten build they are 40-43 times slower than scalar. For one thing, the generated JS does not yet validate as asm.js, so slow performance is to be expected, but it's a good start!
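The "Unroll N" rows refer to unrolled variants of the SSE1 loop. As a rough illustration of what 4x unrolling can look like (again a sketch, not the code from the commit), the version below keeps four independent accumulators so that consecutive `_mm_max_ps` operations do not depend on each other. The names `max_sse1_unroll4` and `UNROLL_HAVE_SSE1` are hypothetical, and the prefetching ("pf") variants are not shown.

```cpp
#include <cassert>
#include <cstddef>

// UNROLL_HAVE_SSE1 is a hypothetical macro name used by this sketch only.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define UNROLL_HAVE_SSE1 1
#endif

// SSE1 maximum, unrolled 4x: 16 floats are consumed per loop iteration.
float max_sse1_unroll4(const float* data, std::size_t n) {
    assert(n > 0);
    float m = data[0];
    std::size_t i = 0;
#ifdef UNROLL_HAVE_SSE1
    if (n >= 16) {
        // Four independent running maxima, seeded from the first 16 elements.
        __m128 m0 = _mm_loadu_ps(data);
        __m128 m1 = _mm_loadu_ps(data + 4);
        __m128 m2 = _mm_loadu_ps(data + 8);
        __m128 m3 = _mm_loadu_ps(data + 12);
        for (i = 16; i + 16 <= n; i += 16) {
            m0 = _mm_max_ps(m0, _mm_loadu_ps(data + i));
            m1 = _mm_max_ps(m1, _mm_loadu_ps(data + i + 4));
            m2 = _mm_max_ps(m2, _mm_loadu_ps(data + i + 8));
            m3 = _mm_max_ps(m3, _mm_loadu_ps(data + i + 12));
        }
        // Merge the four accumulators, then reduce the four lanes.
        __m128 best = _mm_max_ps(_mm_max_ps(m0, m1), _mm_max_ps(m2, m3));
        float lanes[4];
        _mm_storeu_ps(lanes, best);
        m = lanes[0];
        for (int k = 1; k < 4; ++k)
            if (lanes[k] > m) m = lanes[k];
    }
#endif
    // Scalar tail; also the whole computation when n < 16 or SSE is unavailable.
    for (; i < n; ++i)
        if (data[i] > m) m = data[i];
    return m;
}
```

Whether unrolling helps depends on the compiler and the target; in the native numbers above the unrolled variants are not consistently faster than the plain SSE1 loop.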


juj commented Sep 3, 2015

Testing on the current incoming branch and today's Firefox Nightly, I'm seeing:

N: 16777216
Block scalar took 0.014260 msecs. Result: 1.000000.
Block scalar 4 unroll took 0.009430 msecs (0.661x of scalar) Result: 1.000000.
Block SSE1 no unroll took 0.013255 msecs (0.930x of scalar) Result: 1.000000.
Block SSE1 Unroll 2 took 0.010540 msecs (0.739x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 took 0.009025 msecs (0.633x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 took 0.007760 msecs (0.544x of scalar) Result: 1.000000.

So using min & max now roughly doubles performance in the best case (0.544x of scalar with 16x unrolling), which is much better than the 40-43x slowdown from before. Running natively, I get about a 7x speedup with the SIMD version (and 2x compared to SIMD.js). Closing, since this is now at least in the correct ballpark.

juj closed this as completed Sep 3, 2015