
[GraphBolt] lower accuracy in GraphBolt compared to DGL #6941

Closed
1 of 6 tasks
Rhett-Ying opened this issue Jan 12, 2024 · 5 comments
Labels: Work Item (Work items tracked in project tracker)
Rhett-Ying (Collaborator) commented Jan 12, 2024

🔨Work Item

IMPORTANT:

  • This template is only for dev team to track project progress. For feature request or bug report, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

We've found that training with GraphBolt consistently underperforms its DGL counterpart. This happens in node classification on ogbn-products (drops <2%) and ogbn-mag (drops <6%), especially on ogbn-mag. Refer to the daily regression for exact numbers.

GraphSAGE + ogbn-products: https://github.com/dmlc/dgl/blob/master/examples/sampling/node_classification.py
RGCN + ogbn-mag: https://github.com/dmlc/dgl/tree/master/examples/sampling/graphbolt/rgcn

Since a peer team hits this issue as well, the gb.BuiltinDataset and example code are probably fine, though they cannot be ruled out entirely. The root cause may lie in FusedCSCSamplingGraph (from_dglgraph()?), ItemSampler (does it shuffle over the whole set?), or the sampling itself.
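One hypothesis here, whether ItemSampler actually shuffles over the whole item set each epoch, can be sanity-checked with a simple coverage test. This is a minimal sketch in plain Python, not the GraphBolt API; `check_epoch_coverage` is a hypothetical helper, and the shuffle/split below stands in for one ItemSampler epoch:

```python
import random
from collections import Counter

def check_epoch_coverage(item_ids, batches):
    """Verify that one epoch of batches covers every item exactly once."""
    seen = [i for batch in batches for i in batch]
    return Counter(seen) == Counter(item_ids)

# Toy stand-in for an ItemSampler epoch: shuffle ALL ids, then split into batches.
ids = list(range(100))
perm = random.sample(ids, k=len(ids))          # shuffle over the whole set
batches = [perm[i:i + 32] for i in range(0, len(perm), 32)]
print(check_epoch_coverage(ids, batches))      # prints True for a correct sampler
```

If a sampler only shuffles within a window or drops items, the check fails, which would bias which nodes are seen per epoch.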

Action items:

  • Get the accuracy statistics of the examples from the daily regression results for GraphBolt and DGL, and compute mean/std. @caojy1998 the ASV web UI could be used for a quick check.
  • Calculate the label distribution of the whole dataset and of each batch, then compare them to see whether they are aligned.
  • Compute statistics for the DGL and GB datasets: degree distribution, centrality.
  • Get the hit distribution of each node during sampling/training.
  • To verify ItemSampler, try DGL mini-batching + GB sampling?
  • What about sampling neighbors on GPU? @mfbalin
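The label-distribution comparison in the action items can be sketched in plain Python (a hedged sketch with synthetic labels; `label_dist` and `max_dist_gap` are hypothetical helpers, not DGL/GraphBolt API):

```python
import random
from collections import Counter

def label_dist(labels):
    """Normalized label histogram."""
    counts = Counter(labels)
    total = len(labels)
    return {k: v / total for k, v in counts.items()}

def max_dist_gap(full_labels, batch_labels):
    """Largest per-class absolute frequency gap between a batch and the full set."""
    full = label_dist(full_labels)
    batch = label_dist(batch_labels)
    classes = set(full) | set(batch)
    return max(abs(full.get(k, 0.0) - batch.get(k, 0.0)) for k in classes)

# Synthetic example: 10 balanced classes. A shuffled batch should stay close
# to the global distribution; a sorted (non-shuffled) batch will not.
labels = [i % 10 for i in range(10_000)]
shuffled = random.sample(labels, k=len(labels))
print(max_dist_gap(labels, shuffled[:1024]))         # small gap
print(max_dist_gap(labels, sorted(labels)[:1024]))   # large gap
```

A consistently large gap per batch would point at the mini-batching/shuffling path rather than the graph conversion.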

Dependent work items or issues

mfbalin (Collaborator) commented Jan 12, 2024

I ran the examples with GPU sampling using #6861; note that my current directory is dgl/sampling/graphbolt/lightning. @Rhett-Ying

mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ python ../rgcn/hetero_rgcn.py --dataset=ogbn-mag
Downloading datasets/ogbn-mag.zip from https://data.dgl.ai/dataset/graphbolt/ogbn-mag.zip...
datasets/ogbn-mag.zip: 100%|████████████████████████████████████████████████████████████████████████████████████████| 519M/519M [00:04<00:00, 114MB/s]
Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Loaded dataset: node_classification
node_num for rel_graph_embed: {'author': tensor(1134649), 'field_of_study': tensor(59965), 'institution': tensor(8740)}
Number of embedding parameters: 154029312
Number of model parameters: 337460
Start to train...
Training~Epoch 01: 615it [00:34, 17.71it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 23.79it/s]
Finish evaluating on validation set.
Epoch: 01, Loss: 2.6385, Valid accuracy: 39.45%, Time 34.7261
Training~Epoch 02: 615it [00:34, 17.65it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 23.86it/s]
Finish evaluating on validation set.
Epoch: 02, Loss: 2.0427, Valid accuracy: 42.69%, Time 34.8499
Training~Epoch 03: 615it [00:33, 18.52it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 28.95it/s]
Finish evaluating on validation set.
Epoch: 03, Loss: 1.7699, Valid accuracy: 41.81%, Time 33.2093
Testing...
Inference: 11it [00:00, 28.80it/s]
Test accuracy 40.7664
mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ python ../node_classification.py 
Training in pinned-cuda mode.
Loading data...
Downloading datasets/ogbn-products.zip from https://data.dgl.ai/dataset/graphbolt/ogbn-products.zip...
datasets/ogbn-products.zip: 100%|█████████████████████████████████████████████████████████████████████████████████| 1.52G/1.52G [00:14<00:00, 108MB/s]
Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Training...
Training: 193it [00:06, 31.57it/s]
Evaluating: 39it [00:00, 40.34it/s]
Epoch 00000 | Loss 1.8010 | Accuracy 0.8300 | Time 7.1117
Training: 193it [00:05, 34.02it/s]
Evaluating: 39it [00:00, 40.95it/s]
Epoch 00001 | Loss 0.7753 | Accuracy 0.8601 | Time 6.6312
Training: 193it [00:05, 34.56it/s]
Evaluating: 39it [00:00, 41.44it/s]
Epoch 00002 | Loss 0.6226 | Accuracy 0.8722 | Time 6.5305
Training: 193it [00:05, 36.79it/s]
Evaluating: 39it [00:00, 44.94it/s]
Epoch 00003 | Loss 0.5483 | Accuracy 0.8795 | Time 6.1186
Training: 193it [00:05, 37.67it/s]
Evaluating: 39it [00:00, 44.68it/s]
Epoch 00004 | Loss 0.5007 | Accuracy 0.8833 | Time 6.0020
Training: 193it [00:05, 35.73it/s]
Evaluating: 39it [00:00, 39.28it/s]
Epoch 00005 | Loss 0.4729 | Accuracy 0.8884 | Time 6.3999
Training: 193it [00:05, 34.40it/s]
Evaluating: 39it [00:00, 40.66it/s]
Epoch 00006 | Loss 0.4486 | Accuracy 0.8903 | Time 6.5741
Training: 193it [00:05, 35.86it/s]
Evaluating: 39it [00:00, 44.92it/s]
Epoch 00007 | Loss 0.4274 | Accuracy 0.8938 | Time 6.2544
Training: 193it [00:05, 35.95it/s]
Evaluating: 39it [00:00, 44.25it/s]
Epoch 00008 | Loss 0.4148 | Accuracy 0.8942 | Time 6.2540
Training: 193it [00:05, 36.11it/s]
Evaluating: 39it [00:00, 45.44it/s]
Epoch 00009 | Loss 0.4030 | Accuracy 0.8975 | Time 6.2067
Testing...
598it [00:02, 219.17it/s]
598it [00:02, 211.76it/s]
598it [00:02, 265.77it/s]
Test accuracy 0.7580
mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ python node_classification.py 
The dataset is already preprocessed.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type               | Params
-------------------------------------------------
0 | layers    | ModuleList         | 206 K 
1 | dropout   | Dropout            | 0     
2 | train_acc | MulticlassAccuracy | 0     
3 | val_acc   | MulticlassAccuracy | 0     
-------------------------------------------------
206 K     Trainable params
0         Non-trainable params
206 K     Total params
0.828     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/mfbalin/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:432: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home/mfbalin/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:432: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 23: : 192it [00:07, 26.87it/s, v_num=167, train_acc=0.900, num_nodes/0=2.56e+5, num_edges/0=3.94e+5, num_nodes/1=39882.0, num_edges/1=43571.0, num_nodes/2=4404.0, num_edges/2=4020.0, num_nodes/3=411.0, val_acc=0.911]

Rhett-Ying (Collaborator, Author):
The GPU sampling results shown in #6941 (comment) are still lower than DGL's...

Rhett-Ying (Collaborator, Author):
The RGCN accuracy drop was caused by incorrect fanouts; fixed in #6959.
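As a hedged illustration of why fanouts matter (a toy graph and toy sampler, not the DGL/GraphBolt API), consuming a per-layer fanout list in the wrong order changes how many nodes each hop samples, even when the product of fanouts is the same:

```python
import random

def sample_layers(adj, seeds, fanouts, rng):
    """Toy layer-wise neighbor sampling: up to fanouts[i] neighbors per node at hop i."""
    frontier = list(seeds)
    sizes = []
    for fanout in fanouts:
        nxt = []
        for v in frontier:
            nbrs = adj.get(v, [])
            nxt.extend(rng.sample(nbrs, min(fanout, len(nbrs))))
        sizes.append(len(nxt))
        frontier = nxt
    return sizes

# Toy graph: every node has 30 neighbors, so hop sizes are fanout products.
adj = {v: [(v + i) % 1000 for i in range(1, 31)] for v in range(1000)}
rng = random.Random(0)
print(sample_layers(adj, range(8), [25, 10], rng))   # [200, 2000]
print(sample_layers(adj, range(8), [10, 25], rng))   # [80, 2000]
```

The two runs visit the same number of leaf nodes but build different intermediate neighborhoods, so a model whose layers expect one ordering trains on the wrong receptive field when the fanouts are reversed or misapplied.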

mfbalin (Collaborator) commented Feb 2, 2024

Do we still have the accuracy problem for the examples?

frozenbugs (Collaborator) commented Feb 4, 2024

No, it has all been fixed.
