[GraphBolt][io_uring] Refactor and enable tests #7506

Merged
mfbalin merged 15 commits into dmlc:master from gb_liburing_improvements
Jul 9, 2024

Conversation

mfbalin
Collaborator

@mfbalin mfbalin commented Jul 5, 2024

Description

  1. GraphBolt could not be built with liburing support because the USE_LIBURING option was not being passed to the GraphBolt build.
  2. liburing is not used anywhere except GraphBolt, so I moved its configuration steps into the GraphBolt CMake file.
  3. Removed an unnecessary copy at the end of the read and replaced the unsafe manual allocation (malloc and free) with buffers allocated by torch.
  4. Enabled the tests on systems that support io_uring.
  5. Made the disk read operation asynchronous: it now returns a future and takes a num_threads argument that sets the maximum number of threads reading from disk. The Python DiskBasedFeature does not expose an async read operation yet; that will be introduced in follow-up PRs.
  6. Eliminated a few memory leaks by switching to safer, modern alternatives (no manual free or delete is required now).
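
Item 5 describes a read that returns a future and caps the number of reader threads. As a rough illustration of that interface shape only, here is a Python sketch that uses a thread pool rather than io_uring. It is not GraphBolt's C++ implementation: `read_rows_async`, `_read_chunk`, and the fixed-size row layout are assumptions made for this example.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Sequence


def _read_chunk(path: str, row_bytes: int, chunk: Sequence[int]) -> List[bytes]:
    # Hypothetical helper: each worker opens its own handle so that seeks
    # performed by different threads do not interfere with each other.
    with open(path, "rb") as f:
        rows = []
        for index in chunk:
            f.seek(index * row_bytes)
            rows.append(f.read(row_bytes))
        return rows


def read_rows_async(
    path: str, row_bytes: int, indices: Sequence[int], num_threads: int = 4
) -> Future:
    """Start reading fixed-size rows and return a Future of list[bytes].

    num_threads is an upper bound on the number of concurrent readers.
    """
    indices = list(indices)
    num_threads = max(1, min(num_threads, len(indices)))
    # Split the indices into contiguous chunks so the output order is kept.
    size = max(1, -(-len(indices) // num_threads))
    chunks = [indices[s:s + size] for s in range(0, len(indices), size)]
    result: Future = Future()

    def run() -> None:
        try:
            with ThreadPoolExecutor(max_workers=num_threads) as pool:
                parts = list(
                    pool.map(lambda c: _read_chunk(path, row_bytes, c), chunks)
                )
            result.set_result([row for part in parts for row in part])
        except Exception as exc:  # surface I/O errors through the future
            result.set_exception(exc)

    # The caller gets the future immediately; the read runs in the background.
    threading.Thread(target=run, daemon=True).start()
    return result
```

The caller can do other work after `read_rows_async(...)` returns and only block when it calls `.result()` on the returned future.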

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Jul 5, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@mfbalin mfbalin requested a review from Rhett-Ying July 5, 2024 01:30
@mfbalin mfbalin changed the title [GraphBolt] Improve liburing code [GraphBolt][io_uring] Improvements and enable tests Jul 5, 2024
@dgl-bot
Collaborator

dgl-bot commented Jul 5, 2024

Commit ID: 67a6dd4

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 5, 2024

Commit ID: a6a0817

Build ID: 2

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 5, 2024

Commit ID: fd4e5f8

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jul 5, 2024

@Rhett-Ying Did the tests get enabled for our Linux CI or not? I can't parse the CI output.

EDIT: I forgot to enable building by default.

@dgl-bot
Collaborator

dgl-bot commented Jul 5, 2024

Commit ID: ad3804ce90f2faea5979583b9aa8d0c23285d311

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jul 5, 2024

The tests passed on Linux and were skipped on Windows. However, I don't have a way to test whether they will be skipped on a system that does not support io_uring. I am hoping that it works.
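
One way to approximate the skip condition from Python is a kernel-version check, since io_uring first appeared in Linux 5.1. The sketch below is a heuristic written for illustration, not the condition this PR uses: `kernel_may_support_io_uring` and `requires_io_uring` are hypothetical names, and a new enough kernel can still have io_uring disabled by system configuration or a sandbox.

```python
import platform
import re
import unittest
from typing import Optional


def kernel_may_support_io_uring(
    system: Optional[str] = None, release: Optional[str] = None
) -> bool:
    """Heuristic: True on Linux with kernel version >= 5.1.

    The arguments default to the running system; they are parameters only so
    that the check can be exercised without access to other machines.
    """
    system = platform.system() if system is None else system
    release = platform.release() if release is None else release
    if system != "Linux":
        return False
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (5, 1)


# Hypothetical decorator: skip a test unless the heuristic passes.
requires_io_uring = unittest.skipUnless(
    kernel_may_support_io_uring(), "io_uring is probably not available"
)
```

Passing explicit `system` and `release` strings makes it possible to check the skip logic for platforms that are not available in CI.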

@dgl-bot
Collaborator

dgl-bot commented Jul 6, 2024

Commit ID: f6b5d34

Build ID: 5

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 6, 2024

Commit ID: 9190573

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 7, 2024

Commit ID: e5e1316

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 7, 2024

Commit ID: 2efbbed

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 7, 2024

Commit ID: 6771e7f

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt][io_uring] Improvements and enable tests [GraphBolt][io_uring] Refactor and enable tests Jul 7, 2024
@dgl-bot
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: ef18d3c

Build ID: 10

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

graphbolt/src/cnumpy.cc: 6 review threads (outdated, resolved)
@dgl-bot
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: 41172f2

Build ID: 11

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: a962b5b3bebe462bb572ab3c3869720514677850

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: 8060a83

Build ID: 13

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: 6ccc993

Build ID: 14

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: d54017b

Build ID: 15

Status: ❌ CI test failed in Stage [Torch GPU Example test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jul 9, 2024

@Rhett-Ying

=================================== FAILURES ===================================
__________________________________ test_sign ___________________________________

    def test_sign():
        script = os.path.join(EXAMPLE_ROOT, "sign.py")
        out = subprocess.run(["python", str(script)], capture_output=True)
        assert (
            out.returncode == 0
        ), f"stdout: {out.stdout.decode('utf-8')}\nstderr: {out.stderr.decode('utf-8')}"
        stdout = out.stdout.decode("utf-8")
>       assert float(stdout[-5:]) > 0.7
E       AssertionError: assert 0.689 > 0.7
E        +  where 0.689 = float('.689\n')

tests/examples/test_sparse_examples.py:101: AssertionError
- generated xml file: /home/ubuntu/jenkins/workspace/dgl_PR-7506/pytest_backend.xml -
============================ slowest 100 durations =============================
25.62s call     tests/examples/test_sparse_examples.py::test_twirls
17.99s call     tests/examples/test_sparse_examples.py::test_hgnn
17.00s call     tests/examples/test_sparse_examples.py::test_hypergraphatt
13.43s call     tests/examples/test_sampling_examples.py::test_node_classification
13.05s call     tests/examples/test_sparse_examples.py::test_gcnii
8.61s call     tests/examples/test_sampling_examples.py::test_link_prediction
7.34s call     tests/examples/test_sparse_examples.py::test_gat
5.64s call     tests/examples/test_sparse_examples.py::test_gcn
5.42s call     tests/examples/test_sparse_examples.py::test_appnp
5.02s call     tests/examples/test_sparse_examples.py::test_c_and_s
4.79s call     tests/examples/test_sparse_examples.py::test_sign
4.69s call     tests/examples/test_sparse_examples.py::test_sgc

(24 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED tests/examples/test_sparse_examples.py::test_sign - AssertionError: assert 0.689 > 0.7
 +  where 0.689 = float('.689\n')
=================== 1 failed, 11 passed in 128.68s (0:02:08) ===================
FAIL: sparse examples on gpu
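
The failing assertion in the log above takes the last five characters of the script's stdout and converts them to a float, which is why `'.689\n'` shows up as 0.689. The sketch below reproduces that parse and contrasts it with a regex-based alternative. The output strings are made up for illustration (the real output format of sign.py is not shown here), and `parse_last_number` is a hypothetical helper, not something the test suite contains.

```python
import re


def parse_trailing_slice(stdout: str) -> float:
    # Same parse as the test: float() tolerates the trailing newline.
    return float(stdout[-5:])


def parse_last_number(stdout: str) -> float:
    # Hypothetical alternative: take the last numeric token in the output,
    # regardless of how many digits it has.
    tokens = re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", stdout)
    if not tokens:
        raise ValueError("no number found in output")
    return float(tokens[-1])


# Both agree when the value ends in exactly three decimals before the newline.
print(parse_trailing_slice("Test acc: 0.689\n"), parse_last_number("Test acc: 0.689\n"))

# With more digits, the fixed-width slice silently drops the leading part.
print(parse_trailing_slice("acc: 1.0000\n"), parse_last_number("acc: 1.0000\n"))
```

Either way, the parsed value here is 0.689, which is below the 0.7 threshold, so the failure reflects the accuracy of that particular run rather than a parsing problem.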

@mfbalin
Collaborator Author

mfbalin commented Jul 9, 2024

@dgl-bot

@Rhett-Ying
Collaborator

This failure happens rarely. Just re-run it.

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: d54017b

Build ID: 16

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin mentioned this pull request Jul 9, 2024
8 tasks
@mfbalin mfbalin merged commit 5265137 into dmlc:master Jul 9, 2024
2 checks passed
@mfbalin mfbalin deleted the gb_liburing_improvements branch July 9, 2024 08:06
This pull request was closed.