Computing the maximum value in an array is slow with SIMD. #2822

juj · 2014-09-24T19:23:17Z

Here's a commit that tests using SSE1 to compute the maximum value in a large array: juj@c62525e

SIMD support in Emscripten is still at a very early stage, but I'm writing this down so that we can track the numbers as the implementation progresses and gets better.
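For readers without the commit at hand, here is a minimal sketch of the kind of kernel being compared: a scalar max loop and an SSE1 loop built on _mm_max_ps with no unrolling. This is not the code from the commit. The function names max_scalar and max_sse1 are hypothetical, and the sketch assumes a non-empty array with no NaN values (std::max and _mm_max_ps treat NaN differently).

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics: _mm_loadu_ps, _mm_max_ps, _mm_storeu_ps
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical scalar baseline: one comparison per element.
float max_scalar(const float* data, std::size_t n) {
    assert(n > 0);
    float m = data[0];
    for (std::size_t i = 1; i < n; ++i)
        m = std::max(m, data[i]);
    return m;
}

// Hypothetical SSE1 version: keeps four running maxima in one register,
// then reduces the four lanes and handles any leftover tail elements.
float max_sse1(const float* data, std::size_t n) {
    assert(n > 0);
    if (n < 4) return max_scalar(data, n);

    // Unaligned loads, so the input pointer needs no special alignment.
    __m128 vmax = _mm_loadu_ps(data);
    std::size_t i = 4;
    for (; i + 4 <= n; i += 4)
        vmax = _mm_max_ps(vmax, _mm_loadu_ps(data + i));

    // Horizontal reduction of the four lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, vmax);
    float m = std::max(std::max(lanes[0], lanes[1]),
                       std::max(lanes[2], lanes[3]));

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i)
        m = std::max(m, data[i]);
    return m;
}
```

The unrolled variants in the results below would presumably keep several independent accumulators like vmax in flight per iteration; that detail is an assumption, not something taken from the commit.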

Running natively on a MacBook Pro, built with clang++ benchmark_sse1.cpp -O3 -o a, the results are:

N: 16777216
Block scalar took 0.027743 msecs. Result: 1.000000.
Block scalar 4 unroll took 0.029196 msecs (1.052x of scalar) Result: 1.000000.
Block SSE1 no unroll took 0.004371 msecs (0.158x of scalar) Result: 1.000000.
Block SSE1 Unroll 2 took 0.005398 msecs (0.195x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 took 0.004678 msecs (0.169x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 pf took 0.003755 msecs (0.135x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 took 0.003934 msecs (0.142x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 pf took 0.004002 msecs (0.144x of scalar) Result: 1.000000.

Building with em++ -O3 tests/benchmark_sse1.cpp -o a.html -s TOTAL_MEMORY=268435456 and running in today's Firefox Nightly, 35.0a1 (2014-09-24), the output is:

N: 16777216
Block scalar took 0.032746 msecs. Result: 1.000000.
Block scalar 4 unroll took 0.030144 msecs (0.921x of scalar) Result: 1.000000.
Block SSE1 no unroll took 1.347598 msecs (41.153x of scalar) Result: 1.000000.
Block SSE1 Unroll 2 took 1.327801 msecs (40.548x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 took 1.333699 msecs (40.728x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 took 1.432141 msecs (43.734x of scalar) Result: 1.000000.

So running natively, the SSE1 version is 5-7 times faster than scalar, whereas in the Emscripten build it is 40-43 times slower than scalar. The generated JS does not yet validate as asm.js, so slow performance is to be expected, but that's a good start!
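To make the arithmetic explicit: the "x of scalar" column in the output is a time ratio (variant time divided by scalar time), so values below 1 mean faster than scalar, and the speedup is its reciprocal. A small sketch, with hypothetical helper names that are not part of the benchmark:

```cpp
// Time ratio as printed by the benchmark: < 1 means faster than scalar.
double ratio_of_scalar(double variant_time, double scalar_time) {
    return variant_time / scalar_time;
}

// Speedup over scalar: the reciprocal of the ratio above.
double speedup_over_scalar(double variant_time, double scalar_time) {
    return scalar_time / variant_time;
}
```

For example, the native "SSE1 Unroll 2" row gives speedup_over_scalar(0.005398, 0.027743), a little over 5, and the "SSE1 Unroll 4 pf" row gives speedup_over_scalar(0.003755, 0.027743), a little over 7, which is where the 5-7x range comes from. In the Emscripten run, ratio_of_scalar(1.347598, 0.032746) is about 41, a slowdown.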

juj · 2015-09-03T14:40:16Z

Testing on the current incoming branch and today's Firefox Nightly, I'm seeing:

N: 16777216
Block scalar took 0.014260 msecs. Result: 1.000000.
Block scalar 4 unroll took 0.009430 msecs (0.661x of scalar) Result: 1.000000.
Block SSE1 no unroll took 0.013255 msecs (0.930x of scalar) Result: 1.000000.
Block SSE1 Unroll 2 took 0.010540 msecs (0.739x of scalar) Result: 1.000000.
Block SSE1 Unroll 4 took 0.009025 msecs (0.633x of scalar) Result: 1.000000.
Block SSE1 Unroll 16 took 0.007760 msecs (0.544x of scalar) Result: 1.000000.

So using min & max now roughly doubles performance over scalar in the best case (0.544x of the scalar time with 16x unrolling), which is much better than the 40-43x slowdown from before. Running natively, I get about 7x performance with the SIMD version (and 2x compared to SIMD.js). Closing, since this is now in the correct ballpark at least.

juj added performance SIMD labels Sep 24, 2014

juj mentioned this issue Jun 20, 2015

SSE2 emmintrin.h #3542

Merged

juj closed this as completed Sep 3, 2015
