using Vector capabilities of the CPU for sha256 in ssz merkelization of lists #213

g11tech · 2021-11-22T08:03:51Z

In discussion with @potuz, we found that there is scope for using the capabilities of SIMD-enabled processors. Use case: SSZ merkleization of lists, for which @potuz has reported a 10x improvement.

goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12                  160       7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12              15      74012328 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4               68      26677965 ns/op
BenchmarkHashBalancePrysm-4              7     165434686 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4             121       9711482 ns/op
BenchmarkHashBalancePrysm-4             10     103716714 ns/op
PASS
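
For reference, the speedups implied by the three benchmark pairs above can be worked out directly from the ns/op figures. This is only arithmetic on the numbers already posted; the dictionary keys reuse the benchmark names and the helper `speedup` is a hypothetical name introduced here.

```python
# ns/op figures copied from the benchmark output above:
# (vectorized run, Prysm baseline run)
results = {
    "HashBalanceShani (Ryzen 5 3600)": (7_629_704, 74_012_328),
    "HashBalanceAVX (i5-3570)": (26_677_965, 165_434_686),
    "HashBalanceAVX2 (i5-7200U)": (9_711_482, 103_716_714),
}


def speedup(vectorized_ns, baseline_ns):
    """Return how many times faster the vectorized run is than the baseline."""
    return baseline_ns / vectorized_ns


for name, (vectorized_ns, baseline_ns) in results.items():
    print(f"{name}: {speedup(vectorized_ns, baseline_ns):.2f}x")
```

By these figures the gain is roughly 9.7x on the Ryzen 5 3600, 6.2x on the i5-3570 and 10.7x on the i5-7200U, so "10x" holds for two of the three machines.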

Reference Links:
https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L635-L659
https://github.com/potuz/mammon/blob/main/ssz/hasher.hpp#L27

Based on this, I dug around and realized that AssemblyScript has support for SIMD vector processing: https://v8.dev/features/simd
There are two ways this can be done:

Via compiler flags for auto-optimization of vector loops for a single digest
Via AssemblyScript wrapper functions that vectorize the computation to process multiple digests in parallel (the approach followed by @potuz in his reference implementation; more optimal wherever a multi-digest, SIMD-compatible workload is available)

Task:

Investigate and get familiar with the SIMD support directives in AssemblyScript
Investigate and develop, if possible, loop parallelization for SIMD
Investigate and develop multiple digest feeds
Integrate the multi-digest support in SSZ merkleization of lists


dapplion · 2021-11-23T09:44:27Z

Oh damn! 🚀 How can I check if my host supports SIMD?

potuz · 2021-11-23T09:47:37Z

> Oh damn! 🚀 How can I check if my host supports SIMD?

cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55
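
As an alternative to the `cpuid` tool, a Linux host exposes the same feature bits as flag names in `/proc/cpuinfo`. The sketch below is a minimal illustration, assuming an x86 Linux machine; `parse_cpu_flags` and `simd_support` are hypothetical helper names, not part of any library mentioned in this thread. Note that Linux reports SSE3 under the flag name `pni`.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags):
    """Map a few SIMD-related flag names to whether the CPU reports them."""
    wanted = ("sse2", "pni", "ssse3", "sse4_1", "avx", "avx2", "sha_ni")
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Not a Linux host, or /proc is unavailable: report nothing as supported.
        text = ""
    print(simd_support(parse_cpu_flags(text)))
```

This only reports what the host CPU offers; whether a given runtime actually uses those instructions is a separate question.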

dapplion · 2021-11-23T09:54:49Z

> > Oh damn! 🚀 How can I check if my host supports SIMD?
>
> cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55

Thank you!

$ cpuid
CPU 0:
...
   feature information (1/edx):
...
      SSE extensions                         = true
      SSE2 extensions                        = true
   feature information (1/ecx):
...
      SSE4.1 extensions                       = true
      SSE4.2 extensions                       = true

😍 how common is support on modern CPUs?

potuz · 2021-11-23T10:07:56Z

> > > Oh damn! 🚀 How can I check if my host supports SIMD?
> >
> > cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55
>
> Thank you!
>
> $ cpuid
> CPU 0:
> ...
>    feature information (1/edx):
> ...
>       SSE extensions                         = true
>       SSE2 extensions                        = true
>    feature information (1/ecx):
> ...
>       SSE4.1 extensions                       = true
>       SSE4.2 extensions                       = true
>
> 😍 how common is support on modern CPUs?

I'm making the case to implement this in Prysm expecting at least SSE3, which has been the standard since 2004/5. I don't expect there is a single CPU out there without SSE3 that is actually staking. In practical terms, I don't think there's a single one without AVX. This is speaking of Intel; I haven't looked into ARM assembly yet.

dapplion · 2021-11-23T10:12:49Z

That's huge then! Would love to see this in Lodestar.

I did some comparisons with Lighthouse on our hashing throughput, and somehow Lodestar is 5x slower when benchmarking hashing a full state, but when benchmarking hashing a single 64-byte value the performance is the same. It would be worth researching this further to get the most out of this improvement @g11tech

ChainSafe/lodestar#2206

potuz · 2021-11-23T10:15:35Z

> That's huge then! Would love to see this in Lodestar.
>
> I did some comparisons with Lighthouse on our hashing throughput and somehow Lodestar is x5 slower when bench-marking hashing a full state but when bench-marking hashing a single 64 bytes value performance is the same. Would be worth to research forward to get the most of this improvement
>
> ChainSafe/lodestar#2206

This requires both changes in the assembly, to return buffers with all roots at the same time, and changes in the hashing logic, to call the whole block at the same time instead of pairwise leaves. I put out a stupid implementation in the design document; surely it can be improved, but this is already giving those x10 benches against production Prysm on large lists:
https://hackmd.io/@potuz/BJyrx9DOF
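
To make the change in hashing logic concrete, here is a minimal sketch of hashing a whole tree level in one call instead of one call per pair of nodes. It is not the implementation from the design document: `hash_pairs` is a hypothetical stand-in that loops over `hashlib.sha256`, where a real multi-digest routine would process several 64-byte blocks per SIMD instruction. It also assumes a power-of-two number of 32-byte chunks and omits SSZ details such as zero-hash padding and mixing in the list length.

```python
import hashlib


def hash_pairs(layer):
    """Hash every 64-byte block of `layer`; return the concatenated 32-byte digests.

    Stand-in for a SIMD multi-digest routine: the interface (one buffer in,
    one buffer of roots out) is the point, not the plain loop inside.
    """
    assert len(layer) % 64 == 0
    out = bytearray()
    for i in range(0, len(layer), 64):
        out += hashlib.sha256(layer[i:i + 64]).digest()
    return bytes(out)


def merkle_root_batched(chunks):
    """Merkle root of 32-byte chunks, making one hashing call per tree level."""
    n = len(chunks)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two chunk count"
    assert all(len(c) == 32 for c in chunks)
    layer = b"".join(chunks)
    while len(layer) > 32:
        layer = hash_pairs(layer)  # the whole level is hashed at once
    return layer


def merkle_root_pairwise(chunks):
    """Reference version: one hashing call per pair of sibling nodes."""
    nodes = list(chunks)
    while len(nodes) > 1:
        nodes = [
            hashlib.sha256(nodes[i] + nodes[i + 1]).digest()
            for i in range(0, len(nodes), 2)
        ]
    return nodes[0]
```

Both functions produce the same root; the batched form simply hands the hasher a contiguous buffer per level, which is the shape a vectorized routine needs.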

potuz · 2022-01-03T19:04:06Z

We'll most probably be using https://github.com/prysmaticlabs/hashtree. It's in the very early stages of development, but I'd be happy to see some benchmarks from Lodestar if you could test it. If you decide to use it, I'll be happy to provide bindings or whatever you need.

g11tech self-assigned this Nov 22, 2021

g11tech mentioned this issue Nov 23, 2021

investigate: the benchmark diff in lodestar's ssz state hashing compared with lighthouse #217

Closed
