
investigate low performance and CPU hotspot transferring 4GiB over loopback #166

Open · raulk opened this issue Mar 19, 2021 · 14 comments

@raulk (Member) commented Mar 19, 2021

This is an initial observation from running the Whitenoise tests with a perfect network and no interruption rate, over the loopback interface. In other words, these are ideal conditions for a data transfer.

The total payload size transferred was 4 GiB.

The transfer stack was go-data-transfer with graphsync, conducting a push from the sender to the receiver.

The hardware used was a MacBook Pro 16", 2.3 GHz 8-core Intel Core i9, 16 GB 2667 MHz DDR4.

  • The import of the random 4GiB payload into the badger store took 22109ms.
  • The transfer as perceived from the client took 117000ms (almost 2 minutes).
  • The average data transfer rate is ~35 MB/s -- this is very poor for loopback.

Attached is the run result. IMPORTANT: this is really a gzip archive, but GitHub only allows uploading zips, so change the extension to .gz before extracting!

c1ah45gi7qkqo2qcbicg.zip

Of special interest are the events files, which capture the subscription stream from go-data-transfer from both the sender and receiver. Very useful for debugging.

There are heap profiles and CPU profiles included too (using Testground's new profiling feature) for further digging.

So far, I found what appears to be a large hotspot on the receiving side. Addressing this hotspot could allow the receiver to ACK data sooner, and thus speed up the data flow.

The hotspot seems to be in the "CID lists" feature, which I'm not familiar with.

[profile image: profile001]

NOTE: If you run the test, make sure you use the feat/tmp-dir branch of Testground, as Whitenoise depends on the new temporary directory feature. This is also noted in the Whitenoise README.

@dirkmc (Contributor) commented Mar 22, 2021

When go-data-transfer receives a block, it adds the CID to a list.
If the data transfer is restarted, e.g. because the connection went down, the receiving side sends the list of all CIDs it has received to the sender, so that the sender can skip over those CIDs and only resend the missing ones.

I'm not sure why this is showing up in the profile. Are you running multiple tests, with a new datastore each time? The CID list is kept on disk, so it may be that each time you do a test you're writing more and more CIDs to the list, and each time you start the run you're reading the CIDs from all previous tests from the file.

Thinking about CID lists, we may want to consider storing the list as a Bloom filter. Then if there's a restart during the transfer, we wouldn't need to send all the individual CIDs as a list; we could just send the Bloom filter.
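
For illustration, here is a minimal, hand-rolled sketch of that idea (not the go-data-transfer implementation; the ReceivedCIDFilter type, the sizing, and the FNV-based hashing are assumptions made for this sketch). The receiver adds each block's CID to a fixed-size Bloom filter as it arrives; on a restart it would send just the filter bits, and the sender would skip blocks whose CIDs test positive, accepting a small false-positive rate (blocks wrongly skipped would need a follow-up request):

```go
package cidfilter

import (
	"hash/fnv"

	"github.com/ipfs/go-cid"
)

// ReceivedCIDFilter is a hypothetical fixed-size Bloom filter over received
// block CIDs, as an alternative to an ever-growing on-disk CID list.
type ReceivedCIDFilter struct {
	bits []uint64 // m bits packed into 64-bit words
	m    uint64   // number of bits
	k    uint64   // number of hash probes per CID
}

func NewReceivedCIDFilter(mBits, kHashes uint64) *ReceivedCIDFilter {
	return &ReceivedCIDFilter{bits: make([]uint64, (mBits+63)/64), m: mBits, k: kHashes}
}

// indexes derives k bit positions from two FNV hashes of the CID bytes
// (the standard double-hashing construction for Bloom filters).
func (f *ReceivedCIDFilter) indexes(c cid.Cid) []uint64 {
	h1 := fnv.New64a()
	h1.Write(c.Bytes())
	h2 := fnv.New64()
	h2.Write(c.Bytes())
	a, b := h1.Sum64(), h2.Sum64()|1
	idx := make([]uint64, f.k)
	for i := uint64(0); i < f.k; i++ {
		idx[i] = (a + i*b) % f.m
	}
	return idx
}

// Add marks a CID as received.
func (f *ReceivedCIDFilter) Add(c cid.Cid) {
	for _, i := range f.indexes(c) {
		f.bits[i/64] |= 1 << (i % 64)
	}
}

// MayHave reports whether a CID was (probably) received already; false
// positives are possible, false negatives are not.
func (f *ReceivedCIDFilter) MayHave(c cid.Cid) bool {
	for _, i := range f.indexes(c) {
		if f.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}
```

The appeal is that the restart message stays a fixed size (roughly 1.2 MiB per million CIDs at a ~1% false-positive rate) instead of growing linearly with the number of blocks already transferred.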

@raulk (Member, Author) commented Mar 22, 2021

This is just a single test, nothing fancy. Things should be configured the way we set them up in Lotus. Could you try running the basic composition in the whitenoise tests and see if you can reproduce?

@raulk (Member, Author) commented Mar 22, 2021

Yeah, and sending all CIDs scales very poorly. We should never do that!

@hannahhoward (Collaborator) commented Jun 2, 2021

Hi @raulk -- you're decoding the list of received CIDs every time you get a transfer event. For now, can you remove this line and run again? https://github.com/raulk/whitenoise/blob/master/testplan/main.go#L128. I think the hotspot here may not reflect regular use. I agree the CID lists need attention.
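
For anyone skimming, the shape of the problem is roughly the sketch below (illustrative only; Event, loadReceivedCIDs, and the subscription shape are stand-ins, not the actual whitenoise code). The point is that the on-disk CID list gets read and decoded inside the per-event callback, so every event pays a file read that grows with the number of blocks received:

```go
package main

import (
	"fmt"
	"log"
)

// Event stands in for a go-data-transfer subscription event (placeholder).
type Event struct{ Code int }

// loadReceivedCIDs stands in for reading and decoding the on-disk list of
// received CIDs (hypothetical helper; the real call does file I/O).
func loadReceivedCIDs(path string) ([]string, error) {
	return nil, nil
}

// subscribe shows the hotspot pattern: the CID list is re-read from disk on
// every event. Removing the call (or loading the list once, outside the
// loop) takes the file I/O and decode out of the hot per-event path.
func subscribe(events <-chan Event, cidListPath string) {
	for evt := range events {
		cids, err := loadReceivedCIDs(cidListPath) // per-event file read + decode
		if err != nil {
			log.Printf("load received CIDs: %v", err)
			continue
		}
		fmt.Printf("event %d: %d CIDs received so far\n", evt.Code, len(cids))
	}
}

func main() {
	events := make(chan Event, 3)
	for i := 1; i <= 3; i++ {
		events <- Event{Code: i}
	}
	close(events)
	subscribe(events, "/tmp/received-cids")
}
```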

@hannahhoward (Collaborator) commented Jun 2, 2021

That's a file I/O read and decode on every event.

@hannahhoward (Collaborator)

BTW, alternatives to storing CID lists could be as simple as "skip the first N blocks" if we're just trying to resume.
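
A rough sketch of that alternative (names hypothetical; it only works if both sides traverse the DAG in the same deterministic order and the receiver commits blocks in that order): resume state collapses to a single counter instead of a CID list.

```go
package main

import "fmt"

// resumeState is a hypothetical replacement for the received-CID list: with a
// deterministic selector traversal, "how far did we get" is just a count of
// blocks committed in traversal order.
type resumeState struct {
	BlocksReceived uint64
}

// shouldSend reports whether the sender still needs to transmit the block at
// the given traversal index after a restart.
func (r resumeState) shouldSend(blockIndex uint64) bool {
	return blockIndex >= r.BlocksReceived
}

func main() {
	// The receiver restarts having durably stored the first 1000 blocks.
	st := resumeState{BlocksReceived: 1000}
	fmt.Println(st.shouldSend(999))  // false -- already received, skip
	fmt.Println(st.shouldSend(1000)) // true  -- resume from here
}
```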

@dirkmc (Contributor) commented Jun 4, 2021

raulk/whitenoise#1 updates whitenoise to point at #217, which eliminates the received-CIDs bottleneck.

It seems that the transfer is still relatively slow on my machine: 1m41s to transfer 4 GB (~40 MB/s).

My machine specs:

  • MacBook Pro (13-inch, 2019, Four Thunderbolt 3 ports)
  • 2.8 GHz Intel Core i7
  • 16 GB 2133 MHz LPDDR3

The bottleneck is no longer coming from reading received CIDs. It seems likely it's from hashing.
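
One quick way to test that hypothesis outside the transfer stack (a rough benchmark sketch, not part of whitenoise): measure raw SHA2-256 throughput on the same machine and compare it with the observed ~40 MB/s. Go's crypto/sha256 typically manages a few hundred MB/s per core on this class of Intel hardware, so if the raw number is far above 40 MB/s, the slowdown is more likely hashing being serialized with other work (or blocks being hashed more than once) than the hash function itself:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

func main() {
	// Hash 1 GiB of zeroed data in 256 KiB chunks, roughly one digest per
	// "block", to get a ceiling for single-core hashing throughput.
	const chunk = 256 << 10
	const total = 1 << 30
	buf := make([]byte, chunk)

	start := time.Now()
	for n := 0; n < total; n += chunk {
		_ = sha256.Sum256(buf)
	}
	elapsed := time.Since(start)
	fmt.Printf("hashed %d MiB in %s (%.0f MiB/s)\n",
		total>>20, elapsed, float64(total)/(1<<20)/elapsed.Seconds())
}
```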

Next steps:

  • Update to graphsync 0.8
  • Profile and compare

CPU profile:
[CPU profile image: pprof001]
prof-rcvd-cids-fix.zip

@aarshkshah1992 (Collaborator)

@dirkmc I re-ran your branch on my machine and the hashing bottleneck doesn't really show up on the sender or the receiver.

However, I will re-run this with GS v0.8.0 tomorrow to see if it makes a meaningful difference.
gstest.zip

@dirkmc (Contributor) commented Jun 8, 2021

@aarshkshah1992 your suspicions seem to be correct: updating to graphsync v0.8.0 doesn't seem to make much difference. I still get a transfer speed of about 40 MB/s. The CPU profile also looks similar:

[CPU profile image: graphsync-v0.8]

badger-gs-v0.8.zip

@dirkmc (Contributor) commented Jun 8, 2021

Still with graphsync v0.8 but now with an in-memory datastore:

  • The transfer is significantly faster: 4 GB takes 45s (~95 MB/s)
  • The major bottleneck seems to be hashing:

[CPU profile image: pprof-in-mem-gs-v0.8]

in-mem-gsv0.8.zip
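
For anyone reproducing the in-memory run, the substitution is small. A hedged sketch using go-datastore and go-ipfs-blockstore (the helper name is mine, and the actual whitenoise wiring differs):

```go
package stores

import (
	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// newInMemoryBlockstore builds a thread-safe, purely in-memory blockstore to
// stand in for the Badger-backed one while profiling, so datastore I/O drops
// out of the picture and CPU hotspots such as hashing are easier to isolate.
func newInMemoryBlockstore() blockstore.Blockstore {
	mapDS := dssync.MutexWrap(ds.NewMapDatastore())
	return blockstore.NewBlockstore(mapDS)
}
```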

@dirkmc (Contributor) commented Jun 10, 2021

Results for graphsync v0.7 with an in-memory datastore:

The transfer time is comparable to graphsync v0.8: 4 GB takes 45s (~95 MB/s).
The CPU profile also looks similar:

[CPU profile image: gs-v0.7-cpu-pprof]

gs-0.7-inmem-prof.zip

@hannahhoward (Collaborator) commented Jun 15, 2021

So folks, I'm just going to leave my results here.

Here are my three runs. These were run on a Mac Mini M1 with 16 GB of RAM and an SSD. I closed most of my apps before running. I used Badger only.

Right off @raulk's master branch, running v0.8.0 (v0.8.1-rc), with the batcher. Total time: 8.005984208s for 4 GB, which comes in at about 500 MB/s. Full testground output:
c34ihmimvdvm12r7rd2g.zip

Dirk's v0.7.0 branch, using Badger. Total time: 23.006475083s for 4 GB, which comes in at about 173 MB/s. Full testground output:

c34ii6amvdvm12r7rd30.zip

My own branch in PR raulk/whitenoise#4, with Badger. Total time: 8.004791959s, again for 4 GB transferred, again at about 500 MB/s. It actually makes no difference to the total time, but it resolves the CPU bottleneck.

c34ig0qmvdvm12r7rd10.zip

So, both 0.8.1 (unmodified, the master branch of this repo) and my branch with the SHA fix show a nearly threefold increase on my machine.

  • The big difference between versions appeared after I closed all the apps on my machine -- I suspect that even with Badger we are consuming a lot of memory and going to swap.
  • Running with memory stores on both client and provider saturates my available RAM -- this makes things run much slower for me.

The only thing I can offer is: if you are running locally, close your other apps. This made a huge difference for me.

Happy to post heap profile images, though they are in the ZIP files.

@dirkmc (Contributor) commented Jun 16, 2021

To remove any noise from other processes running on a laptop, I ran the profiles on a cloud machine: an AWS c5d.12xlarge (48 vCPU / 96 GB RAM).

I ran

I've attached

@hannahhoward (Collaborator)

Some thoughts on the discrepancy between your linux node and my node:

  • The CPU profiles for go-graphsync on the sender side in your last comment show hashing in v0.8 at around 50% of its v0.7 level, for both the memory and Badger stores.
  • I suspect that if you were to run with the branch in raulk/whitenoise#4 ("Use custom graphsync with PR outstanding"), you would find the same for the receiver.
  • For whatever reason, this makes a difference on M1 but not Intel?
