
v0.7.0

Released by @BarclayII on 22 Jul.

This is a new major release with various system optimizations, new features and enhancements, new models and bugfixes.

Important: Change on PyPI Installation

DGL pip wheels are no longer shipped on PyPI. Use the following command to install DGL with pip:

  • pip install dgl -f https://data.dgl.ai/wheels/repo.html for CPU.
  • pip install dgl-cuXX -f https://data.dgl.ai/wheels/repo.html for CUDA.
  • pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html for CPU nightly builds.
  • pip install --pre dgl-cuXX -f https://data.dgl.ai/wheels-test/repo.html for CUDA nightly builds.
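
For example, to install the stable wheel built for CUDA 11.1 (assuming a cu111 build is published for your platform), replace the cuXX placeholder with your CUDA version: pip install dgl-cu111 -f https://data.dgl.ai/wheels/repo.html.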

This does not impact conda installation.

GPU-based Neighbor Sampling

DGL now supports uniform neighbor sampling and MFG conversion on GPU, contributed by @nv-dlasalle from NVIDIA. Experiments with GraphSAGE on the ogbn-products graph show a >10x speedup (from 113s down to 11s per epoch) on a g3.16x instance. The related documentation has been updated accordingly.
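
Below is a minimal sketch (not taken from the updated docs; the graph, seed nodes, and fan-outs are illustrative stand-ins) of how GPU-based sampling is typically enabled: keep both the graph and the seed nodes on the GPU, pass the device to the dataloader, and use single-process loading.

```python
import torch
import dgl

device = torch.device('cuda')

# Stand-in graph and seed nodes; in practice these come from your dataset.
g = dgl.rand_graph(10000, 100000).to(device)      # graph structure on the GPU
train_nids = torch.arange(1000, device=device)    # seed nodes on the GPU

sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
dataloader = dgl.dataloading.NodeDataLoader(
    g, train_nids, sampler,
    device=device,      # sampling and MFG construction happen on the GPU
    num_workers=0,      # GPU sampling requires single-process loading
    batch_size=1024,
    shuffle=True)

for input_nodes, output_nodes, blocks in dataloader:
    # The MFGs in `blocks` are already on the GPU; run forward/backward here.
    pass
```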

New Tutorials for Multi-GPU and Distributed Training

The release brings two new tutorials about multi-GPU training for node classification and graph classification, respectively. There is also a new tutorial about distributed training across multiple machines. All of them are available at https://docs.dgl.ai/.


Improved CPU Message Passing Kernel

The update includes a new CPU implementation of the core GSpMM kernel for GNN message passing, thanks to @sanchit-misra from Intel. The new kernel performs tiling on the sparse CSR matrix and leverages Intel’s LibXSMM for kernel generation, which gives up to a 4.4x speedup over the old kernel. Please read their paper (https://arxiv.org/abs/2104.06700) for details.
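
For context, GSpMM is the kernel that executes builtin message passing such as copy-and-sum. A small sketch (the random graph and 16-dimensional features are illustrative) of a call that exercises it on CPU:

```python
import torch
import dgl
import dgl.function as fn

g = dgl.rand_graph(1000, 5000)                   # CPU graph
g.ndata['h'] = torch.randn(g.num_nodes(), 16)

# copy_u + sum is lowered to a single fused GSpMM call on the graph's sparse matrix.
g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h_new'))
print(g.ndata['h_new'].shape)                    # torch.Size([1000, 16])
```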

More Efficient NodeEmbedding for Multi-GPU and Distributed Training

DGL now utilizes NCCL to synchronize the gradients of sparse node embeddings (dgl.nn.NodeEmbedding) during training (credit to @nv-dlasalle from NVIDIA). The NCCL feature is available in both dgl.optim.SparseAdam and dgl.optim.SparseAdagrad. Experiments show a 20% speedup (from 47.2s down to 39.5s per epoch) on a g4dn.12xlarge instance (4x T4 GPUs) when training RGCN on the ogbn-mag graph. The optimization is automatically turned on when NCCL backend support is detected.

The sparse optimizers for dgl.distributed.DistEmbedding now use a synchronized gradient update strategy. We also add a new optimizer, dgl.distributed.optim.SparseAdam, and dgl.distributed.SparseAdagrad has moved to dgl.distributed.optim.SparseAdagrad.
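
As a rough sketch of the API pair involved (single process shown; the graph, embedding name, dimensions, and loss are placeholders), dgl.nn.NodeEmbedding holds the sparse embeddings outside the dense optimizer and dgl.optim.SparseAdam updates only the rows touched in each minibatch:

```python
import torch
import dgl

g = dgl.rand_graph(10000, 100000)

# One learnable 128-dim embedding per node, kept out of the dense optimizer.
emb = dgl.nn.NodeEmbedding(g.num_nodes(), 128, name='node_emb',
                           init_func=lambda t: torch.nn.init.uniform_(t, -0.1, 0.1))
optimizer = dgl.optim.SparseAdam(params=[emb], lr=0.01)

nids = torch.arange(1024)        # nodes appearing in the current minibatch
feat = emb(nids)                 # gather embeddings; gradients stay sparse
loss = feat.pow(2).mean()        # placeholder loss
loss.backward()
optimizer.step()                 # under multi-GPU training, gradients are exchanged via NCCL
```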

Sparse-sparse Matrix Multiplication and Addition Support

We add two new APIs, dgl.adj_product_graph and dgl.adj_sum_graph, which perform sparse-sparse matrix multiplication and addition as graph operations, respectively. They run on both CPU and GPU with autograd support. See the Graph Transformer Networks example for a usage of these functions.
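
A minimal sketch of both APIs (random graphs and the weight name 'w' are chosen purely for illustration), with gradients flowing back to the input edge weights:

```python
import torch
import dgl

A = dgl.rand_graph(100, 500)
B = dgl.rand_graph(100, 800)
A.edata['w'] = torch.randn(A.num_edges(), requires_grad=True)
B.edata['w'] = torch.randn(B.num_edges(), requires_grad=True)

C = dgl.adj_product_graph(A, B, 'w')    # adjacency(C) = adjacency(A) @ adjacency(B)
D = dgl.adj_sum_graph([A, B], 'w')      # adjacency(D) = adjacency(A) + adjacency(B)

# Gradients propagate back to A.edata['w'] and B.edata['w'].
(C.edata['w'].sum() + D.edata['w'].sum()).backward()
```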

PyTorch Lightning Compatibility

DGL is now compatible with PyTorch Lightning for single-GPU training or training with DistributedDataParallel. See this example of training GraphSAGE with PyTorch Lightning.

We thank @justusschock for making DGL DataLoaders compatible with PyTorch Lightning (#2886).
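
A compact sketch (the module, layer sizes, and the 'feat'/'label' field names are illustrative, not the shipped example) of how DGL minibatch training typically plugs into a LightningModule:

```python
import torch
import torch.nn.functional as F
import dgl
import pytorch_lightning as pl

class SAGELightning(pl.LightningModule):
    def __init__(self, in_feats, n_hidden, n_classes):
        super().__init__()
        self.conv1 = dgl.nn.SAGEConv(in_feats, n_hidden, 'mean')
        self.conv2 = dgl.nn.SAGEConv(n_hidden, n_classes, 'mean')

    def forward(self, blocks, x):
        # Assumes a two-layer neighbor sampler producing two MFGs per batch.
        h = F.relu(self.conv1(blocks[0], x))
        return self.conv2(blocks[1], h)

    def training_step(self, batch, batch_idx):
        input_nodes, output_nodes, blocks = batch
        x = blocks[0].srcdata['feat']        # input features sliced into the first MFG
        y = blocks[-1].dstdata['label']      # labels of the seed nodes
        loss = F.cross_entropy(self(blocks, x), y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# The training dataloader can be a dgl.dataloading.NodeDataLoader passed to
# pl.Trainer(...).fit(model, train_dataloader).
```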

New Models


DGL 0.7 adds 19 new model examples, bringing the total number of examples to 90+. Users can now use the search bar on https://www.dgl.ai/ to quickly locate examples by tagged keywords. Below is the list of new models added.

New Datasets

New Functionalities

  • KD-Tree, Brute-force family, and NN-descent implementation of KNN (#2767, #2892, #2941) (@lygztq)
  • BLAS-based KNN implementation on GPU (#2868, @milesial)
  • A new API dgl.sample_neighbors_biased for biased neighbor sampling where each node has a tag, and each tag has its own (unnormalized) probability (#1665, #2987, @soodoshll). We also provide two helper functions sort_csr_by_tag and sort_csc_by_tag to sort the internal storage of a graph based on tags to enable this kind of neighbor sampling (#1664, @soodoshll).
  • Distributed sparse Adam node embedding optimizer (#2733)
  • Heterogeneous graph’s multi_update_all now supports user-defined cross-type reducers (#2891, @Secbone)
  • dgl.DistGraph now supports in_degrees and out_degrees (#2918)
  • A new API dgl.sampling.node2vec_random_walk for node2vec random walks (#2992, @Smilexuhc); see the sketch after this list
  • dgl.node_subgraph, dgl.edge_subgraph, dgl.in_subgraph and dgl.out_subgraph all have a relabel_nodes argument to allow graph compaction (i.e. removing the nodes with no edges). (#2929)
  • Allow direct slicing of a batched graph without constructing a new data structure. (#2349, #2851, #2965)
  • Allow setting the distributed node embeddings with NodeEmbedding.all_set_embedding() (#3047)
  • Graphs can be directly created from CSR or CSC representations on either CPU or GPU (#3045). See the API doc of dgl.graph for more details and the sketch after this list.
  • A new dgl.reorder API to permute a graph according to RCMK, METIS or custom strategy (#3063)
  • dgl.nn.GraphConv now has a left normalization which divides the outgoing messages by out-degrees, equivalent to random-walk normalization (#3114)
  • EdgeDataLoader now accepts exclude='self' to exclude only the edges sampled in the current minibatch during neighbor sampling, for cases where reverse edges are not available (#3122)
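
Two quick sketches for items above (toy data; all sizes and parameters are illustrative, and the CSR tuple format follows the dgl.graph API doc referenced in the list): a node2vec random walk, and building a graph directly from a CSR triplet.

```python
import torch
import dgl

# Node2vec-style random walks: p (return) and q (in-out) bias the walk.
g = dgl.rand_graph(100, 1000)
walks = dgl.sampling.node2vec_random_walk(g, torch.arange(10), p=0.5, q=2.0, walk_length=5)
print(walks.shape)            # (10, walk_length + 1)

# Construct a graph from a CSR triplet (indptr, indices, edge IDs);
# an empty edge-ID array means consecutive IDs.
indptr = torch.tensor([0, 2, 3, 3])
indices = torch.tensor([1, 2, 0])
csr_g = dgl.graph(('csr', (indptr, indices, [])))
print(csr_g.num_nodes(), csr_g.num_edges())    # 3 3
```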

Performance Optimizations

  • Check if a COO is sorted to avoid sync during forward/backward and parallelize sorted COO/CSR conversion. (#2645, @nv-dlasalle)
  • Faster uniform sampling with replacement (#2953)
  • Eliminated constructor/destructor and IsNullArray overheads in random walks (#2990, @AjayBrahmakshatriya)
  • GatedGCNConv shortcut with one edge type (#2994)
  • Hierarchical Partitioning in distributed training with 25% speedup (#3000, @soodoshll)
  • Reduced memory usage in node_split and edge_split during partitioning (#3132, @JingchengYu94)

Other Enhancements

  • Graph partitioning now returns ID mapping from old nodes/edges to new ones (#2857)
  • Better error message when idx_list is out of bounds (#2848)
  • Kill training jobs on remote machines in distributed training when receiving KeyboardInterrupt (#2881)
  • Provide a dgl.multiprocessing namespace for multiprocess training with fork and OpenMP (#2905)
  • GAT supports multidimensional input features (#2912)
  • Users can now specify graph format for distributed training (#2948)
  • CI now runs on Kubernetes (#2957)
  • to_heterogeneous(to_homogeneous(hg)) now returns the same hg; see the sketch after this list. (#2958)
  • remove_nodes and remove_edges now preserve batch information. (#3119)
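
A small sketch (toy heterograph) of the round trip mentioned in the list:

```python
import torch
import dgl

hg = dgl.heterograph({
    ('user', 'follows', 'user'): (torch.tensor([0, 1]), torch.tensor([1, 2])),
    ('user', 'plays', 'game'): (torch.tensor([0, 2]), torch.tensor([0, 1])),
})
g = dgl.to_homogeneous(hg)                           # single node/edge type, with type IDs stored
hg2 = dgl.to_heterogeneous(g, hg.ntypes, hg.etypes)  # recovers the original typed structure
print(hg2.canonical_etypes)
```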

Bug Fixes

Deprecations

  • The preserve_nodes argument of dgl.edge_subgraph is deprecated in favor of relabel_nodes.
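
A migration sketch on a toy graph (if I read the change correctly, the old preserve_nodes=True corresponds to relabel_nodes=False, i.e. keep all nodes without compaction):

```python
import torch
import dgl

g = dgl.rand_graph(10, 40)
# Old: dgl.edge_subgraph(g, edges, preserve_nodes=True)
sg = dgl.edge_subgraph(g, torch.arange(5), relabel_nodes=False)
print(sg.num_nodes(), sg.num_edges())   # 10 5
```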