
Add versioning to all DGLDatasets to force reloading when codes are changed #4293

jermainewang commented Jul 25, 2022

🚀 Feature

Add versioning to all DGLDatasets to detect:

  • Changes to the raw dataset files stored in S3. If so, re-download the dataset.
  • Changes to the preprocessing code. If so, ignore the previous local cache and reprocess the data.

Motivation

Brought up by #3987, which asks to revert the default reordering behavior of DGL built-in datasets. The issue is that even if we implement the request, users may still load cached datasets from local disk that do not reflect the latest change. We therefore need a versioning mechanism to detect such changes.

cc @mufeili

Alternatives

Use a different dataset folder for each DGL version. The downside is excessive disk usage, since stale caches would accumulate over time.

mufeili commented Mar 7, 2023

Review of the current practice

The table below tracks the current cache-versioning practice for each existing dataset and the cases it fails to handle.

| Dataset | Current versioning mechanism | Missing versioning behavior | Other issues |
| --- | --- | --- | --- |
| YelpDataset | Whether the graph is reordered | | |
| WikiCSDataset | | | The graph is always reordered, which should be optional and have a default value of `False`. |
| LegacyTUDataset | Append a hash value to the cache file name, i.e., `f"legacy_tu_{dataset_name}_{hash_value}.bin"`, which encodes `name`, `use_pandas`, `hidden_size`, `max_allow_node` | | Since multiple datasets can be loaded with this interface, it makes more sense to have one cache file per dataset rather than a single cache file that gets overwritten whenever a different dataset is loaded. |
| TUDataset | Use a different file name for each dataset that employs this interface, i.e., `f"tu_{dataset_name}.bin"` | | |
| SST/SSTDataset | Use one file per mode, which is one of `["train", "dev", "test", "tiny"]` | | |
| BAShapeDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `seed` | |
| BACommunityDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `num_inter_edges`, `seed` | |
| TreeCycleDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `cycle_size`, `perturb_ratio`, `seed` | |
| TreeGridDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `grid_size`, `perturb_ratio`, `seed` | |
| BA2MotifDataset | | | |
| SBMMixture/SBMMixtureDataset | Append a hash value to the cache file name, i.e., `f"graphs_{hash_value}.bin"`, which encodes `n_graphs`, `n_nodes`, `n_communities`, `k`, `avg_deg`, `pq`, `rng` | | |
| RedditDataset | Use two separate directories to cache the variant with self-loops and the variant without self-loops | | |
| AIFBDataset | | Handle different pre-processing options, including `insert_reverse` | |
| MUTAGDataset | | Handle different pre-processing options, including `insert_reverse` | |
| BGSDataset | | Handle different pre-processing options, including `insert_reverse` | |
| AMDataset | | Handle different pre-processing options, including `insert_reverse` | |
| QM9EdgeDataset/QM9Edge | | | |
| QM9Dataset/QM9 | | | |
| QM7bDataset/QM7b | | | |
| PPIDataset/LegacyPPIDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| PATTERNDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| MiniGCDataset | Append a hash value to the cache file name, i.e., `f"dgl_graph_{hash_value}.bin"`, which encodes `num_graphs`, `min_num_v`, `max_num_v`, `seed` | | |
| FB15k237Dataset | | Handle different pre-processing options, including `reverse` | |
| FB15kDataset | | Handle different pre-processing options, including `reverse` | |
| WN18Dataset | | Handle different pre-processing options, including `reverse` | |
| KarateClub/KarateClubDataset | | | |
| ICEWS18/ICEWS18Dataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| All datasets that inherit GNNBenchmarkDataset | | | The graph is always reordered, which should be optional and have a default value of `False`. |
| GINDataset | Append a hash value to the cache file name, i.e., `f"gin_{data_name}_{hash_value}.bin"`, which encodes `name`, `self_loop`, `degree_as_nlabel` | | |
| GDELT/GDELTDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| FraudYelpDataset | Append a hash value to the cache file name, i.e., `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | | |
| FraudAmazonDataset | Append a hash value to the cache file name, i.e., `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | | |
| FlickrDataset | Whether the graph is reordered | | |
| FakeNewsDataset | | Handle different pre-processing options, including `feature_name` | |
| CLUSTERDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | | |
| BitcoinOTC/BitcoinOTCDataset | | | |
| CiteseerGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` | |
| CoraGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` | |
| PubmedGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` | |
| CoraBinary | | | |
| AsNodePredDataset | Append a hash value to the cache file name, i.e., `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `target_ntype`, `dataset.name` | | |
| AsLinkPredDataset | Append a hash value to the cache file name, i.e., `f"graph_{hash_value}.bin"`, which encodes `neg_ratio`, `split_ratio`, `dataset.name` | | |
| AsGraphPredDataset | Append a hash value to the cache file name, i.e., `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `dataset.name` | | |
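
Several rows above share the same mechanism: hash the preprocessing options and embed the digest in the cache file name. A minimal sketch of that pattern follows; the helper is illustrative, not DGL's internal code:

```python
import hashlib


def cache_filename(prefix, **options):
    """Build a cache file name whose hash fragment encodes every
    preprocessing option, so different configurations never collide
    on the same cache file."""
    # Sort the items so the digest does not depend on keyword order.
    key = repr(sorted(options.items()))
    hash_value = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{hash_value}.bin"


# Mirrors the GINDataset row above, producing a name of the form
# f"gin_{data_name}_{hash_value}.bin".
print(cache_filename("gin_MUTAG", name="MUTAG", self_loop=True,
                     degree_as_nlabel=False))
```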

In addition, the versioning mechanism should detect:

  • If the current version of DGL differs from the one used to generate the cache file
  • If the raw dataset files stored in S3 have changed

In both cases, the cache files need to be regenerated.

Proposal

In general, hashing is an effective way to prevent loading an undesired cache file. Its downside is that the number of cache files can grow very large when there are many possible combinations of preprocessing options. One solution is to instead save a single file storing only the hash code (or the preprocessing steps themselves) and use it as a sanity check whenever data loading is attempted; if the check fails, the data is re-processed from scratch.
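
A minimal sketch of that sanity-check idea, assuming a small JSON side file next to the cache; the file layout and function names are illustrative, not a committed design. Storing the DGL version in the same file also covers the version check listed above:

```python
import hashlib
import json
import os

import dgl


def _fingerprint(options):
    """Hash the preprocessing options into a short digest."""
    key = repr(sorted(options.items()))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]


def save_cache_info(info_path, options):
    """Record everything the cached data depends on in one small file."""
    info = {
        "dgl_version": dgl.__version__,
        "options_hash": _fingerprint(options),
    }
    with open(info_path, "w") as f:
        json.dump(info, f)


def cache_is_valid(info_path, options):
    """True only if the cache was produced by the same DGL version with
    the same preprocessing options; otherwise re-process from scratch."""
    if not os.path.exists(info_path):
        return False
    with open(info_path) as f:
        info = json.load(f)
    return (info.get("dgl_version") == dgl.__version__
            and info.get("options_hash") == _fingerprint(options))
```

Because only the side file carries the fingerprint, each dataset keeps a single cache file regardless of how many option combinations exist, avoiding the file explosion that pure hash-named caches can cause.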
