[GraphBolt][io_uring] Refactor and enable tests #7506

mfbalin · 2024-07-05T01:03:17Z

Description

GraphBolt could not be built with liburing support because the USE_LIBURING option was not being passed to the GraphBolt build.
liburing is not used anywhere outside GraphBolt, so I moved its configuration steps into the GraphBolt CMake file.
Removed an unnecessary copy at the end and replaced the unsafe manual allocations (malloc and free) with buffers from torch.
Enabled the tests on systems that support io_uring.
Made the disk read operation async so that it returns a future. It also takes a num_threads argument that sets the maximum number of threads that will read from disk. No async read operation is exposed in the Python DiskBasedFeature yet; it will be introduced in follow-up PRs.
Eliminated a few memory leaks by switching to safer, modern alternatives (no manual free or delete is required now).
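
As a rough illustration of the async read described above, here is a minimal Python sketch. It is not the GraphBolt implementation, which is C++ on top of io_uring. The function name `read_rows_async`, the fixed-size row layout, and the use of `concurrent.futures` are all assumptions made for this example; only the idea of a `num_threads` limit and a returned future comes from the PR description.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def read_rows_async(path, row_size, indices, num_threads=4) -> Future:
    """Hypothetical helper: start reading fixed-size rows and return a Future.

    The Future resolves to a list of `bytes`, one entry per requested index,
    in the same order as `indices`.
    """
    indices = list(indices)
    # At most `num_threads` workers read from disk at the same time.
    readers = ThreadPoolExecutor(max_workers=max(1, num_threads))

    def read_one(index):
        # Each task opens its own handle so that seeks do not interfere.
        with open(path, "rb") as f:
            f.seek(index * row_size)
            return f.read(row_size)

    def gather():
        try:
            return list(readers.map(read_one, indices))
        finally:
            readers.shutdown(wait=False)

    # One extra coordinator thread collects the results, so the caller gets
    # a single Future back immediately instead of blocking on the reads.
    coordinator = ThreadPoolExecutor(max_workers=1)
    future = coordinator.submit(gather)
    coordinator.shutdown(wait=False)  # already-submitted work still runs
    return future
```

A caller would run `future = read_rows_async(path, 4, [2, 0])`, continue with other work, and call `future.result()` only when the rows are actually needed.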

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
I've leveraged the tools to beautify the Python and C++ code.
The PR is complete and small. Read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
Related issue is referenced in this PR
If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot · 2024-07-05T01:03:45Z

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch];
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot · 2024-07-05T01:31:26Z

Commit ID: 67a6dd4

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-07-05T01:35:09Z

Commit ID: a6a0817

Build ID: 2

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-07-05T02:04:33Z

Commit ID: fd4e5f8

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

mfbalin · 2024-07-05T02:16:24Z

@Rhett-Ying Did the tests get enabled for our Linux CI or not? I can't parse the CI output.

EDIT: I forgot to enable building by default.

dgl-bot · 2024-07-05T03:07:02Z

Commit ID: ad3804ce90f2faea5979583b9aa8d0c23285d311

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

mfbalin · 2024-07-05T03:09:16Z

The tests passed on Linux and were skipped on Windows. I have no way to check whether they will also be skipped on a system that does not support io_uring, though. I am hoping that they will be.
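
One way to picture the "skip when io_uring is unavailable" check, combined with the thread-safe, cached detection that a later commit message in this PR mentions, is the Python sketch below. Everything in it is an assumption for illustration: the names `_probe_io_uring`, `CachedProbe`, and `io_uring_supported` are hypothetical, and the kernel-version heuristic (io_uring first shipped in Linux 5.1) stands in for whatever probe the C++ code actually performs.

```python
import platform
import re
import threading


def _probe_io_uring():
    """Heuristic stand-in for a real probe (an assumption, not DGL code).

    io_uring first appeared in Linux 5.1, so older kernels and non-Linux
    systems are reported as unsupported. A newer kernel can still have
    io_uring disabled, which this heuristic cannot see.
    """
    if platform.system() != "Linux":
        return False
    match = re.match(r"(\d+)\.(\d+)", platform.release())
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (5, 1)


class CachedProbe:
    """Run `probe` at most once, even when called from many threads."""

    def __init__(self, probe):
        self._probe = probe
        self._lock = threading.Lock()
        self._result = None  # None means "not probed yet"

    def __call__(self):
        with self._lock:
            if self._result is None:
                self._result = bool(self._probe())
            return self._result


# Hypothetical module-level entry point that a test suite could consult.
io_uring_supported = CachedProbe(_probe_io_uring)
```

A test could then be guarded with something like `unittest.skipUnless(io_uring_supported(), "io_uring not available")`, so that it is skipped rather than failed on systems without support.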

dgl-bot · 2024-07-06T22:28:34Z

Commit ID: f6b5d34

Build ID: 5

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-07-06T23:30:44Z

Commit ID: 9190573

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-07-07T00:00:02Z

Commit ID: e5e1316

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-07-07T03:02:49Z

Commit ID: 2efbbed

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-07-07T03:32:13Z

Commit ID: 6771e7f

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-07-09T05:48:03Z

Commit ID: ef18d3c

Build ID: 10

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-07-09T05:55:12Z

Commit ID: 41172f2

Build ID: 11

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

dgl-bot · 2024-07-09T06:29:35Z

Commit ID: a962b5b3bebe462bb572ab3c3869720514677850

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-07-09T06:45:35Z

Commit ID: 8060a83

Build ID: 13

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-07-09T06:53:00Z

Commit ID: 6ccc993

Build ID: 14

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

dgl-bot · 2024-07-09T07:26:54Z

Commit ID: d54017b

Build ID: 15

Status: ❌ CI test failed in Stage [Torch GPU Example test].

Report path: link

Full logs path: link

mfbalin · 2024-07-09T07:28:34Z

@Rhett-Ying

=================================== FAILURES ===================================
__________________________________ test_sign ___________________________________

    def test_sign():
        script = os.path.join(EXAMPLE_ROOT, "sign.py")
        out = subprocess.run(["python", str(script)], capture_output=True)
        assert (
            out.returncode == 0
        ), f"stdout: {out.stdout.decode('utf-8')}\nstderr: {out.stderr.decode('utf-8')}"
        stdout = out.stdout.decode("utf-8")
>       assert float(stdout[-5:]) > 0.7
E       AssertionError: assert 0.689 > 0.7
E        +  where 0.689 = float('.689\n')

tests/examples/test_sparse_examples.py:101: AssertionError
- generated xml file: /home/ubuntu/jenkins/workspace/dgl_PR-7506/pytest_backend.xml -
============================ slowest 100 durations =============================
25.62s call     tests/examples/test_sparse_examples.py::test_twirls
17.99s call     tests/examples/test_sparse_examples.py::test_hgnn
17.00s call     tests/examples/test_sparse_examples.py::test_hypergraphatt
13.43s call     tests/examples/test_sampling_examples.py::test_node_classification
13.05s call     tests/examples/test_sparse_examples.py::test_gcnii
8.61s call     tests/examples/test_sampling_examples.py::test_link_prediction
7.34s call     tests/examples/test_sparse_examples.py::test_gat
5.64s call     tests/examples/test_sparse_examples.py::test_gcn
5.42s call     tests/examples/test_sparse_examples.py::test_appnp
5.02s call     tests/examples/test_sparse_examples.py::test_c_and_s
4.79s call     tests/examples/test_sparse_examples.py::test_sign
4.69s call     tests/examples/test_sparse_examples.py::test_sgc

(24 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED tests/examples/test_sparse_examples.py::test_sign - AssertionError: assert 0.689 > 0.7
 +  where 0.689 = float('.689\n')
=================== 1 failed, 11 passed in 128.68s (0:02:08) ===================
FAIL: sparse examples on gpu

mfbalin · 2024-07-09T07:28:39Z

@dgl-bot

Rhett-Ying · 2024-07-09T07:47:17Z

This failure happens rarely. Just re-run it.

dgl-bot · 2024-07-09T07:59:01Z

Commit ID: d54017b

Build ID: 16

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

[GraphBolt] Improve liburing code

67a6dd4

mfbalin requested review from frozenbugs and peizhou001 July 5, 2024 01:03

mfbalin requested a review from Rhett-Ying July 5, 2024 01:30

mfbalin changed the title ~~[GraphBolt] Improve liburing code~~ [GraphBolt][io_uring] Improvements and enable tests Jul 5, 2024

Enable tests and add check for io_uring.

fd4e5f8

mfbalin force-pushed the gb_liburing_improvements branch from a6a0817 to fd4e5f8 Compare July 5, 2024 01:34

enable liburing by default.

f3ccd34

Merge branch 'master' into gb_liburing_improvements

f6b5d34

mfbalin added 2 commits July 6, 2024 23:25

add an option to limit the thread count of io uring.

9190573

result tensor options should come from input.

e5e1316

mfbalin added 2 commits July 7, 2024 02:58

Make read async for OnDiskNpyArray

2efbbed

use new so that it is uninitialized.

6771e7f

mfbalin changed the title ~~[GraphBolt][io_uring] Improvements and enable tests~~ [GraphBolt][io_uring] Refactor and enable tests Jul 7, 2024

frozenbugs reviewed Jul 9, 2024

python/dgl/graphbolt/impl/torch_based_feature_store.py

mfbalin added 2 commits July 9, 2024 01:38

Merge branch 'master' into gb_liburing_improvements

ef18d3c

Replace Linux check with io_uring check.

41172f2

frozenbugs approved these changes Jul 9, 2024

mfbalin added 3 commits July 9, 2024 02:06

make io_uring detection thread-safe and cached.

9132a1a

address part of the reviews.

4614e94

Merge branch 'master' into gb_liburing_improvements

8060a83

refactor implementation from async wrapper.

6ccc993

fix compile error.

d54017b

Rhett-Ying reviewed Jul 9, 2024

graphbolt/CMakeLists.txt

mfbalin mentioned this pull request Jul 9, 2024

[Misc] Fix spurious test failure. #7510

Merged

8 tasks

mfbalin merged commit 5265137 into dmlc:master Jul 9, 2024
2 checks passed

mfbalin deleted the gb_liburing_improvements branch July 9, 2024 08:06

This pull request was closed.
