
[GraphBolt] lower accuracy in GraphBolt compared to DGL #6941

Closed
1 of 6 tasks
Rhett-Ying opened this issue Jan 12, 2024 · 5 comments
Labels: Work Item (Work items tracked in project tracker)
Rhett-Ying (Collaborator) commented Jan 12, 2024

🔨Work Item

IMPORTANT:

  • This template is only for dev team to track project progress. For feature request or bug report, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

We've found that training with GraphBolt consistently underperforms its DGL counterpart. This happens in node classification on ogbn-products (drops <2%) and ogbn-mag (drops <6%), especially on ogbn-mag. Refer to the daily regression for exact numbers.

GraphSAGE + ogbn-products: https://github.com/dmlc/dgl/blob/master/examples/sampling/node_classification.py
RGCN + ogbn-mag: https://github.com/dmlc/dgl/tree/master/examples/sampling/graphbolt/rgcn

Since a peer team hits this issue as well, the gb.BuiltinDataset and example code are probably fine, though they cannot be ruled out entirely. The root cause may lie in FusedCSCSamplingGraph (from_dglgraph()?), ItemSampler (does it shuffle over the whole set?), or the sampling itself.
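One hypothesis here, whether ItemSampler actually shuffles over the whole item set each epoch, can be sanity-checked with a simple coverage test. This is a minimal sketch in plain Python, not the GraphBolt API; `check_epoch_coverage` is a hypothetical helper, and the shuffle/split below stands in for one ItemSampler epoch:

```python
import random
from collections import Counter

def check_epoch_coverage(item_ids, batches):
    """Verify that one epoch of batches covers every item exactly once."""
    seen = [i for batch in batches for i in batch]
    return Counter(seen) == Counter(item_ids)

# Toy stand-in for an ItemSampler epoch: shuffle ALL ids, then split into batches.
ids = list(range(100))
perm = random.sample(ids, k=len(ids))          # shuffle over the whole set
batches = [perm[i:i + 32] for i in range(0, len(perm), 32)]
print(check_epoch_coverage(ids, batches))      # prints True for a correct sampler
```

If a sampler only shuffles within a window or drops items, the check fails, which would bias which nodes are seen per epoch.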

Action items:

  • Get the accuracy statistics of the examples from the daily regression results for GraphBolt and DGL, and compute mean/std. @caojy1998 the ASV web UI could be used for a quick check.
  • Calculate the label distribution of the whole dataset and of each batch, then compare them to see whether they are aligned.
  • Compute statistics for the DGL and GB datasets: degree distribution, centrality.
  • Get the hit distribution of each node during sampling/training.
  • To verify ItemSampler, try DGL mini-batching + GB sampling?
  • What about sampling neighbors on GPU? @mfbalin
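The label-distribution comparison in the action items can be sketched in plain Python (a hedged sketch with synthetic labels; `label_dist` and `max_dist_gap` are hypothetical helpers, not DGL/GraphBolt API):

```python
import random
from collections import Counter

def label_dist(labels):
    """Normalized label histogram."""
    counts = Counter(labels)
    total = len(labels)
    return {k: v / total for k, v in counts.items()}

def max_dist_gap(full_labels, batch_labels):
    """Largest per-class absolute frequency gap between a batch and the full set."""
    full = label_dist(full_labels)
    batch = label_dist(batch_labels)
    classes = set(full) | set(batch)
    return max(abs(full.get(k, 0.0) - batch.get(k, 0.0)) for k in classes)

# Synthetic example: 10 balanced classes. A shuffled batch should stay close
# to the global distribution; a sorted (non-shuffled) batch will not.
labels = [i % 10 for i in range(10_000)]
shuffled = random.sample(labels, k=len(labels))
print(max_dist_gap(labels, shuffled[:1024]))         # small gap
print(max_dist_gap(labels, sorted(labels)[:1024]))   # large gap
```

A consistently large gap per batch would point at the mini-batching/shuffling path rather than the graph conversion.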

Dependent work items or issues

mfbalin (Collaborator) commented Jan 12, 2024

I ran the examples with GPU sampling using #6861; note that my current directory is dgl/sampling/graphbolt/lightning. @Rhett-Ying

mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ python ../rgcn/hetero_rgcn.py --dataset=ogbn-mag
Downloading datasets/ogbn-mag.zip from https://data.dgl.ai/dataset/graphbolt/ogbn-mag.zip...
datasets/ogbn-mag.zip: 100%|████████████████████████████████████████████████████████████████████████████████████████| 519M/519M [00:04<00:00, 114MB/s]
Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Loaded dataset: node_classification
node_num for rel_graph_embed: {'author': tensor(1134649), 'field_of_study': tensor(59965), 'institution': tensor(8740)}
Number of embedding parameters: 154029312
Number of model parameters: 337460
Start to train...
Training~Epoch 01: 615it [00:34, 17.71it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 23.79it/s]
Finish evaluating on validation set.
Epoch: 01, Loss: 2.6385, Valid accuracy: 39.45%, Time 34.7261
Training~Epoch 02: 615it [00:34, 17.65it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 23.86it/s]
Finish evaluating on validation set.
Epoch: 02, Loss: 2.0427, Valid accuracy: 42.69%, Time 34.8499
Training~Epoch 03: 615it [00:33, 18.52it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 28.95it/s]
Finish evaluating on validation set.
Epoch: 03, Loss: 1.7699, Valid accuracy: 41.81%, Time 33.2093
Testing...
Inference: 11it [00:00, 28.80it/s]
Test accuracy 40.7664
mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ python ../node_classification.py 
Training in pinned-cuda mode.
Loading data...
Downloading datasets/ogbn-products.zip from https://data.dgl.ai/dataset/graphbolt/ogbn-products.zip...
datasets/ogbn-products.zip: 100%|█████████████████████████████████████████████████████████████████████████████████| 1.52G/1.52G [00:14<00:00, 108MB/s]
Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Training...
Training: 193it [00:06, 31.57it/s]
Evaluating: 39it [00:00, 40.34it/s]
Epoch 00000 | Loss 1.8010 | Accuracy 0.8300 | Time 7.1117
Training: 193it [00:05, 34.02it/s]
Evaluating: 39it [00:00, 40.95it/s]
Epoch 00001 | Loss 0.7753 | Accuracy 0.8601 | Time 6.6312
Training: 193it [00:05, 34.56it/s]
Evaluating: 39it [00:00, 41.44it/s]
Epoch 00002 | Loss 0.6226 | Accuracy 0.8722 | Time 6.5305
Training: 193it [00:05, 36.79it/s]
Evaluating: 39it [00:00, 44.94it/s]
Epoch 00003 | Loss 0.5483 | Accuracy 0.8795 | Time 6.1186
Training: 193it [00:05, 37.67it/s]
Evaluating: 39it [00:00, 44.68it/s]
Epoch 00004 | Loss 0.5007 | Accuracy 0.8833 | Time 6.0020
Training: 193it [00:05, 35.73it/s]
Evaluating: 39it [00:00, 39.28it/s]
Epoch 00005 | Loss 0.4729 | Accuracy 0.8884 | Time 6.3999
Training: 193it [00:05, 34.40it/s]
Evaluating: 39it [00:00, 40.66it/s]
Epoch 00006 | Loss 0.4486 | Accuracy 0.8903 | Time 6.5741
Training: 193it [00:05, 35.86it/s]
Evaluating: 39it [00:00, 44.92it/s]
Epoch 00007 | Loss 0.4274 | Accuracy 0.8938 | Time 6.2544
Training: 193it [00:05, 35.95it/s]
Evaluating: 39it [00:00, 44.25it/s]
Epoch 00008 | Loss 0.4148 | Accuracy 0.8942 | Time 6.2540
Training: 193it [00:05, 36.11it/s]
Evaluating: 39it [00:00, 45.44it/s]
Epoch 00009 | Loss 0.4030 | Accuracy 0.8975 | Time 6.2067
Testing...
598it [00:02, 219.17it/s]
598it [00:02, 211.76it/s]
598it [00:02, 265.77it/s]
Test accuracy 0.7580
mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ python node_classification.py 
The dataset is already preprocessed.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type               | Params
-------------------------------------------------
0 | layers    | ModuleList         | 206 K 
1 | dropout   | Dropout            | 0     
2 | train_acc | MulticlassAccuracy | 0     
3 | val_acc   | MulticlassAccuracy | 0     
-------------------------------------------------
206 K     Trainable params
0         Non-trainable params
206 K     Total params
0.828     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/mfbalin/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:432: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home/mfbalin/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:432: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 23: : 192it [00:07, 26.87it/s, v_num=167, train_acc=0.900, num_nodes/0=2.56e+5, num_edges/0=3.94e+5, num_nodes/1=39882.0, num_edges/1=43571.0, num_nodes/2=4404.0, num_edges/2=4020.0, num_nodes/3=411.0, val_acc=0.911]

Rhett-Ying (Collaborator, Author):
The GPU sampling results shown in #6941 (comment) are still lower than DGL's...

Rhett-Ying (Collaborator, Author):
The RGCN accuracy drop was caused by incorrect fanouts; fixed in #6959.
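As a hedged illustration of why fanouts matter (a toy graph and toy sampler, not the DGL/GraphBolt API), consuming a per-layer fanout list in the wrong order changes how many nodes each hop samples, even when the product of fanouts is the same:

```python
import random

def sample_layers(adj, seeds, fanouts, rng):
    """Toy layer-wise neighbor sampling: up to fanouts[i] neighbors per node at hop i."""
    frontier = list(seeds)
    sizes = []
    for fanout in fanouts:
        nxt = []
        for v in frontier:
            nbrs = adj.get(v, [])
            nxt.extend(rng.sample(nbrs, min(fanout, len(nbrs))))
        sizes.append(len(nxt))
        frontier = nxt
    return sizes

# Toy graph: every node has 30 neighbors, so hop sizes are fanout products.
adj = {v: [(v + i) % 1000 for i in range(1, 31)] for v in range(1000)}
rng = random.Random(0)
print(sample_layers(adj, range(8), [25, 10], rng))   # [200, 2000]
print(sample_layers(adj, range(8), [10, 25], rng))   # [80, 2000]
```

The two runs visit the same number of leaf nodes but build different intermediate neighborhoods, so a model whose layers expect one ordering trains on the wrong receptive field when the fanouts are reversed or misapplied.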

mfbalin (Collaborator) commented Feb 2, 2024

Do we still have the accuracy problem for the examples?

frozenbugs (Collaborator) commented Feb 4, 2024

No, it has all been fixed.
