
Add versioning to all DGLDatasets to force reloading when codes are changed #4293

jermainewang commented Jul 25, 2022

🚀 Feature

Add versioning to all DGLDatasets to detect:

  • Changes to the raw dataset files stored in S3. If so, re-download the dataset.
  • Changes to the preprocessing code. If so, ignore the previous local cache and reprocess the data.

Motivation

Brought up by #3987, which asks to revert the default reordering behavior of DGL built-in datasets. The issue is that even if we implement the request, users may still load cached datasets from local disk that do not reflect the latest change. We therefore need a versioning mechanism to detect such changes.

cc @mufeili

Alternatives

Use a different dataset folder for each DGL version. The downside is excessive disk usage, since stale caches would accumulate over time.

mufeili commented Mar 7, 2023

Review of the current practice

The table below tracks the current cache-versioning practice for each existing dataset and the cases it fails to handle.

| Dataset | Current versioning mechanism | Missing versioning behavior | Other issues |
| --- | --- | --- | --- |
| YelpDataset | Whether the graph is reordered | | |
| WikiCSDataset | | | The graph is always reordered, which should be optional and have a default value of `False`. |
| LegacyTUDataset | Append a hash value to the cache file name, i.e., `f"legacy_tu_{dataset_name}_{hash_value}.bin"`, which encodes `name`, `use_pandas`, `hidden_size`, `max_allow_node` | | Since multiple datasets can be loaded with this interface, it makes more sense to have one cache file per dataset rather than a single cache file that gets overwritten whenever a different dataset is loaded. |
| TUDataset | Use a different file name for each dataset that employs this interface, i.e., `f"tu_{dataset_name}.bin"` | | |
| SST/SSTDataset | Use one file per mode, which is one of `["train", "dev", "test", "tiny"]` | | |
| BAShapeDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `seed` | |
| BACommunityDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `num_inter_edges`, `seed` | |
| TreeCycleDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `cycle_size`, `perturb_ratio`, `seed` | |
| TreeGridDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `grid_size`, `perturb_ratio`, `seed` | |
| BA2MotifDataset | | | |
| SBMMixture/SBMMixtureDataset | Append a hash value to the cache file name, i.e., `f"graphs_{hash_value}.bin"`, which encodes `n_graphs`, `n_nodes`, `n_communities`, `k`, `avg_deg`, `pq`, `rng` | | |
| RedditDataset | Use two separate directories to cache the variant with self-loops and the variant without self-loops | | |
| AIFBDataset | | Handle different pre-processing options, including `insert_reverse` | |
| MUTAGDataset | | Handle different pre-processing options, including `insert_reverse` | |
| BGSDataset | | Handle different pre-processing options, including `insert_reverse` | |
| AMDataset | | Handle different pre-processing options, including `insert_reverse` | |
| QM9EdgeDataset/QM9Edge | | | |
| QM9Dataset/QM9 | | | |
| QM7bDataset/QM7b | | | |
| PPIDataset/LegacyPPIDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| PATTERNDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| MiniGCDataset | Append a hash value to the cache file name, i.e., `f"dgl_graph_{hash_value}.bin"`, which encodes `num_graphs`, `min_num_v`, `max_num_v`, `seed` | | |
| FB15k237Dataset | | Handle different pre-processing options, including `reverse` | |
| FB15kDataset | | Handle different pre-processing options, including `reverse` | |
| WN18Dataset | | Handle different pre-processing options, including `reverse` | |
| KarateClub/KarateClubDataset | | | |
| ICEWS18/ICEWS18Dataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| All datasets that inherit GNNBenchmarkDataset | | | The graph is always reordered, which should be optional and have a default value of `False`. |
| GINDataset | Append a hash value to the cache file name, i.e., `f"gin_{data_name}_{hash_value}.bin"`, which encodes `name`, `self_loop`, `degree_as_nlabel` | | |
| GDELT/GDELTDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| FraudYelpDataset | Append a hash value to the cache file name, i.e., `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | | |
| FraudAmazonDataset | Append a hash value to the cache file name, i.e., `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | | |
| FlickrDataset | Whether the graph is reordered | | |
| FakeNewsDataset | | Handle different pre-processing options, including `feature_name` | |
| CLUSTERDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| BitcoinOTC/BitcoinOTCDataset | | | |
| CiteseerGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` | |
| CoraGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` | |
| PubmedGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` | |
| CoraBinary | | | |
| AsNodePredDataset | Append a hash value to the cache file name, i.e., `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `target_ntype`, `dataset.name` | | |
| AsLinkPredDataset | Append a hash value to the cache file name, i.e., `f"graph_{hash_value}.bin"`, which encodes `neg_ratio`, `split_ratio`, `dataset.name` | | |
| AsGraphPredDataset | Append a hash value to the cache file name, i.e., `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `dataset.name` | | |
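
Several rows above share the same mechanism: hash the preprocessing options and embed the digest in the cache file name. A minimal sketch of that pattern follows; the helper is illustrative, not DGL's internal code:

```python
import hashlib


def cache_filename(prefix, **options):
    """Build a cache file name whose hash fragment encodes every
    preprocessing option, so different configurations never collide
    on the same cache file."""
    # Sort the items so the digest does not depend on keyword order.
    key = repr(sorted(options.items()))
    hash_value = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{hash_value}.bin"


# Mirrors the GINDataset row above, producing a name of the form
# f"gin_{data_name}_{hash_value}.bin".
print(cache_filename("gin_MUTAG", name="MUTAG", self_loop=True,
                     degree_as_nlabel=False))
```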

In addition, the versioning mechanism should detect:

  • If the current version of DGL differs from the one used to generate the cache file
  • If the raw dataset files stored in S3 have changed

In both cases, the cache files need to be regenerated.

Proposal

In general, hashing is an effective way to prevent loading an undesired cache file. Its downside is that the number of cache files can grow very large when there are many possible combinations of preprocessing options. One solution is to instead save a single file storing only the hash code (or the preprocessing steps themselves) and use it as a sanity check whenever data loading is attempted; if the check fails, the data is re-processed from scratch.
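
A minimal sketch of that sanity-check idea, assuming a small JSON side file next to the cache; the file layout and function names are illustrative, not a committed design. Storing the DGL version in the same file also covers the version check listed above:

```python
import hashlib
import json
import os

import dgl


def _fingerprint(options):
    """Hash the preprocessing options into a short digest."""
    key = repr(sorted(options.items()))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]


def save_cache_info(info_path, options):
    """Record everything the cached data depends on in one small file."""
    info = {
        "dgl_version": dgl.__version__,
        "options_hash": _fingerprint(options),
    }
    with open(info_path, "w") as f:
        json.dump(info, f)


def cache_is_valid(info_path, options):
    """True only if the cache was produced by the same DGL version with
    the same preprocessing options; otherwise re-process from scratch."""
    if not os.path.exists(info_path):
        return False
    with open(info_path) as f:
        info = json.load(f)
    return (info.get("dgl_version") == dgl.__version__
            and info.get("options_hash") == _fingerprint(options))
```

Because only the side file carries the fingerprint, each dataset keeps a single cache file regardless of how many option combinations exist, avoiding the file explosion that pure hash-named caches can cause.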
