[DO NOT MERGE] Some draft code for dataset preprocess optimization #7236

Draft
wants to merge 3 commits into
base: master

Conversation

caojy1998
Collaborator

@caojy1998 caojy1998 commented Mar 22, 2024

Description

This code is a draft of the preprocessing algorithm. Here we handle the first part, the external sort algorithm, and leave the construction of the CSC graph for future exploration.
The function _graph_data_to_fused_csc_sampling_graph consists of two parts:

  1. Load the COO-format graph stored as CSV (~20 s).
  2. Convert the COO format to CSC format (~2 s).

In our approach:

  1. Read the CSV file part by part, convert each part to NumPy format, and store it back to disk.
  2. Read the NumPy file and perform a merge sort to turn the COO into a sorted COO. (This step needs further optimization: it currently reads the whole NumPy file, whereas it should read the file part by part to reduce memory consumption.)
  3. Store the sorted COO to disk part by part for later reading.
  4. Read the sorted COO and convert it to CSC using a for loop.
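
The chunked read, merge sort, and part-by-part write described above could be sketched roughly as follows. This is not the code in this PR, only a minimal illustration under stated assumptions: the CSV has two integer columns (src, dst) and no header, the input is non-empty, edges are sorted by destination first (since the target is CSC), and the helper names `csv_to_sorted_runs` and `merge_runs` are hypothetical.

```python
import heapq
import itertools
import os

import numpy as np


def csv_to_sorted_runs(csv_path, run_dir, chunk_rows=1_000_000):
    """Read a (src, dst) CSV chunk by chunk and save each chunk as a sorted .npy run."""
    run_paths = []
    with open(csv_path) as f:
        while True:
            # Pull at most chunk_rows lines so only one chunk is in memory at a time.
            lines = list(itertools.islice(f, chunk_rows))
            if not lines:
                break
            chunk = np.loadtxt(lines, delimiter=",", dtype=np.int64, ndmin=2)
            # lexsort uses the last key as the primary key: sort by dst, then src.
            order = np.lexsort((chunk[:, 0], chunk[:, 1]))
            path = os.path.join(run_dir, f"run_{len(run_paths)}.npy")
            np.save(path, chunk[order])
            run_paths.append(path)
    return run_paths


def merge_runs(run_paths, out_path):
    """K-way merge of sorted runs into one sorted (src, dst) array stored on disk."""
    # Memory-map the runs so rows are paged in lazily instead of loaded at once.
    runs = [np.load(p, mmap_mode="r") for p in run_paths]
    total = sum(len(r) for r in runs)
    out = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=np.int64, shape=(total, 2)
    )

    def rows(run):
        # Yield (dst, src) tuples so tuple comparison matches the sort order.
        for src, dst in run:
            yield (int(dst), int(src))

    for i, (dst, src) in enumerate(heapq.merge(*(rows(r) for r in runs))):
        out[i, 0] = src
        out[i, 1] = dst
    out.flush()
    return out_path
```

The per-row Python loop in `merge_runs` is slow and is kept only for readability; a practical version would merge in blocks. The point of the sketch is that neither the CSV nor the intermediate NumPy data has to be held in memory as a whole.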

The first three steps of our approach have been optimized; the fourth (the CSC conversion) has not.
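
For reference, the remaining COO-to-CSC step can be illustrated as below. This is a sketch, not the PR's implementation: it assumes the edges are already sorted by destination, that node IDs lie in `[0, num_nodes)`, and both function names are hypothetical. The first function uses an explicit for loop, as the description does; the second shows one possible vectorized alternative with `np.bincount`.

```python
import numpy as np


def sorted_coo_to_csc_loop(src, dst, num_nodes):
    """Build CSC (indptr, indices) from dst-sorted COO edges with a for loop."""
    indptr = np.zeros(num_nodes + 1, dtype=np.int64)
    for d in dst:
        # Count the in-edges of each destination node.
        indptr[d + 1] += 1
    # The prefix sum turns per-node counts into column offsets.
    indptr = np.cumsum(indptr)
    # Because edges are sorted by dst, the src column is already the indices array.
    indices = np.asarray(src, dtype=np.int64).copy()
    return indptr, indices


def sorted_coo_to_csc_vectorized(src, dst, num_nodes):
    """Same result as the loop version, using bincount instead of a Python loop."""
    counts = np.bincount(dst, minlength=num_nodes)
    indptr = np.concatenate(([0], np.cumsum(counts))).astype(np.int64)
    indices = np.asarray(src, dtype=np.int64).copy()
    return indptr, indices
```

Both functions need the full `dst` column in memory; a part-by-part variant would accumulate the counts per chunk before taking the prefix sum.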

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code changes to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Mar 22, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Mar 22, 2024

Commit ID: 42dc1d351c92d402fdc3d6f469f643b9a6dad6f4

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Mar 22, 2024

Commit ID: 35eccce22eaadfa8ec9575daeb2f493c17e004e1

Build ID: 2

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Mar 22, 2024

Commit ID: 3336d10f95009171694c4c8d9c354e9a5c057453

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@Rhett-Ying Rhett-Ying self-requested a review March 25, 2024 08:04
@Rhett-Ying Rhett-Ying marked this pull request as draft March 25, 2024 08:04
@Rhett-Ying
Collaborator

@caojy1998 are there any detailed performance numbers for the current improvement?

@dgl-bot
Collaborator

dgl-bot commented Mar 26, 2024

Commit ID: b6884e401e826b443f3333f296404b35c350b768

Build ID: 4

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

3 participants