
[GraphBolt] Make unique_and_compact deterministic #7217

Merged · 10 commits · Mar 25, 2024

Conversation

RamonZhou (Collaborator)

Description

Doc: https://docs.google.com/document/d/1fWX6GhQySniFCSmXnKw8mxZQ26WngD5jEi8vt2H41Ss
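For context, unique_and_compact deduplicates sampled node IDs and remaps them to a compact range; this PR makes the resulting order reproducible across runs. A simplified sequential sketch of one plausible deterministic semantics (first-occurrence order; illustrative only, not the GraphBolt implementation):

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Sketch: deduplicate `ids` in first-occurrence order and remap each id
    // to its index in the unique list. A deterministic parallel version must
    // reproduce exactly this ordering no matter how threads interleave.
    std::pair<std::vector<int64_t>, std::vector<int64_t>> UniqueAndCompact(
        const std::vector<int64_t>& ids) {
      std::unordered_map<int64_t, int64_t> to_compact;
      std::vector<int64_t> unique_ids, compacted;
      compacted.reserve(ids.size());
      for (int64_t id : ids) {
        auto [it, inserted] = to_compact.emplace(id, unique_ids.size());
        if (inserted) unique_ids.push_back(id);
        compacted.push_back(it->second);
      }
      return {unique_ids, compacted};
    }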

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL equals a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it
  • The related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot (Collaborator) commented Mar 15, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot (Collaborator) commented Mar 15, 2024

Commit ID: 498b415

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin (Collaborator) commented Mar 15, 2024

LGTM, as per our discussion with @peizhou001.

@mfbalin (Collaborator) commented Mar 15, 2024

How much slower is this new implementation compared to the old one?

@dgl-bot (Collaborator) commented Mar 15, 2024

Commit ID: b2242bc

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@RamonZhou (Collaborator, Author)

> How much slower is this new implementation compared to the old one?

@mfbalin The time complexity should be the same. I also tested on my local machine; there's no significant slowdown in any of the pinned-cuda, cpu-cuda, and cuda-cuda modes.

@mfbalin (Collaborator) commented Mar 15, 2024

@RamonZhou Could you also test the advanced PyG example with cpu-pinned-cuda? That one stresses CPU sampling and unique_and_compact the most.

3 resolved review threads on graphbolt/src/concurrent_id_hash_map.cc (outdated)
@RamonZhou (Collaborator, Author)

@mfbalin I tested it, and the average epoch time is 7.94s before and 7.99s after.

@mfbalin (Collaborator) commented Mar 18, 2024

Thank you. Let's test it again before we merge it; the changes made during the review process may impact performance.

@frozenbugs (Collaborator)

We may also add an option to decide which path to take, depending on whether the user needs determinism or not.

@dgl-bot (Collaborator) commented Mar 18, 2024

Commit ID: 9d86ab56849b587abfd77ebc3220ab28d473633e

Build ID: 3

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot (Collaborator) commented Mar 18, 2024

Commit ID: e4cfb3a

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@peizhou001 (Collaborator)

> We may also add an option to decide which path to take, depending on whether the user needs determinism or not.

Maybe not? Since it keeps the same speed while making the result stable, there is no reason a user would want a non-deterministic result without a performance benefit.

@mfbalin (Collaborator) commented Mar 18, 2024

@RamonZhou how is the performance now on the advanced PyG example?

@mfbalin (Collaborator) left a review comment

LGTM. If the performance is the same as before or even better (we now insert everything in a single parallel for, so it should be a little faster), we can go ahead and merge it.

@mfbalin (Collaborator) commented Mar 18, 2024

I see around a 5% performance loss with this PR on my machine with the advanced PyG example (--mode=cpu-pinned-cuda).

@mfbalin (Collaborator) commented Mar 19, 2024

@RamonZhou it might be best to measure the performance with a smaller feature dimension, in case feature fetching is the bottleneck on your system, e.g. features = features[:, :32].

@RamonZhou (Collaborator, Author)

@mfbalin I tested again with the latest master branch merged in. It's 7.911s (before) vs. 7.919s (after).

@dgl-bot (Collaborator) commented Mar 19, 2024

Commit ID: c52435d

Build ID: 5

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@RamonZhou (Collaborator, Author) commented Mar 20, 2024

I did another test and the results are (features are sliced with [:, :32]):

  • Old version: 5.817s
  • Current PR (with the if inside the loop): 5.802s
  • Current PR (without the if inside the loop): 5.783s

So it seems that deleting the if might be faster!
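For readers following along, here is a minimal sketch of the two insertion variants being compared (illustrative names only; the actual code is in graphbolt/src/concurrent_id_hash_map.cc and may differ). CompareAndSwap atomically claims a slot for a key and returns the slot's previous value; kEmptyKey marks an unoccupied slot:

    #include <cstdint>
    #include <cstddef>

    // Sketch, not the actual GraphBolt code. CompareAndSwap wraps GCC's
    // __sync_val_compare_and_swap; kEmptyKey marks an unoccupied slot.
    static constexpr int64_t kEmptyKey = -1;

    inline int64_t CompareAndSwap(int64_t* ptr, int64_t old_val, int64_t new_val) {
      return __sync_val_compare_and_swap(ptr, old_val, new_val);
    }

    // Variant with the `if` inside the loop: peek at the slot first and
    // attempt the (relatively expensive) CAS only when it looks empty.
    void InsertWithIf(int64_t* slot, size_t mask, int64_t key) {
      size_t pos = static_cast<size_t>(key) & mask;
      while (true) {
        int64_t old = slot[pos];
        if (old == kEmptyKey) old = CompareAndSwap(&slot[pos], kEmptyKey, key);
        if (old == kEmptyKey || old == key) return;  // claimed, or already present
        pos = (pos + 1) & mask;                      // linear probing
      }
    }

    // Variant without the `if`: attempt the CAS unconditionally. One fewer
    // branch per probe, matching the slightly faster timing reported above.
    void InsertNoIf(int64_t* slot, size_t mask, int64_t key) {
      size_t pos = static_cast<size_t>(key) & mask;
      while (true) {
        int64_t old = CompareAndSwap(&slot[pos], kEmptyKey, key);
        if (old == kEmptyKey || old == key) return;
        pos = (pos + 1) & mask;
      }
    }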

@mfbalin (Collaborator) commented Mar 20, 2024

> So it seems that deleting the if might be faster!

The fact that all the numbers are so close is suspicious to me. Could we test with no feature fetch and no forward pass?

@RamonZhou (Collaborator, Author)

Without feature fetching and the forward pass: 4.319s (before) and 4.707s (after), about 9% slower.

@mfbalin (Collaborator) commented Mar 20, 2024

Do you want to look into implementing an atomic load before the CAS loop and see if it helps? Or do you want to merge this first and continue optimizing later?

@RamonZhou (Collaborator, Author)

@mfbalin I tried to implement it, but I think the atomic load can just be a CompareAndSwap: old_val = CompareAndSwap(&(hash_map_data[val_pos]), empty_key, value); is a way to load it atomically. But in the end it's the same as the current code, since we are doing the same thing in the loop.

@mfbalin (Collaborator) commented Mar 22, 2024

https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html
You can use __sync_fetch_and_or with the 2nd argument set to 0 to perform an atomic load operation. CompareAndSwap is one of the most expensive atomic operations.
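A hedged sketch of that suggestion, reusing the illustrative names from the sketch above (again, not the PR's actual code): __sync_fetch_and_or(ptr, 0) atomically returns *ptr while OR-ing in nothing, so it acts as a cheap atomic load that lets a thread skip the CAS when the slot is already taken:

    // Sketch: atomic load via __sync_fetch_and_or before falling back to CAS.
    void InsertWithAtomicLoad(int64_t* slot, size_t mask, int64_t key) {
      size_t pos = static_cast<size_t>(key) & mask;
      while (true) {
        int64_t cur = __sync_fetch_and_or(&slot[pos], 0);  // atomic load
        if (cur == kEmptyKey) {
          cur = CompareAndSwap(&slot[pos], kEmptyKey, key);
          if (cur == kEmptyKey) return;  // we claimed the empty slot
        }
        if (cur == key) return;  // another thread inserted the same key
        pos = (pos + 1) & mask;  // a different key lives here: keep probing
      }
    }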

@RamonZhou (Collaborator, Author)

I see. I tried __sync_fetch_and_or and tested; the performance is about the same (4.717s).

@dgl-bot (Collaborator) commented Mar 25, 2024

Commit ID: 4104c82

Build ID: 6

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@RamonZhou (Collaborator, Author)

Let's merge this first and optimize in the future.

RamonZhou merged commit 3c39153 into dmlc:master on Mar 25, 2024 (2 checks passed).