The dgl.data.TUDataset class returns labels in {0,2} for some binary classes. Can we instead return {0, 1}? #2165

Closed
henrykenlay opened this issue Sep 9, 2020 · 2 comments

Comments

@henrykenlay
Contributor

🚀 Feature

The graph class labels returned by some binary TUDatasets are {0, 2} (such as dgl.data.TUDataset('BZR')), while others are {0, 1} (such as dgl.data.TUDataset('MCF-7')). It would be good if this behaviour were more consistent.

Motivation

With consistent behaviour, users wouldn't have to wrap the class in a preprocessing layer to make the labels match standard conventions (for example, a binary classifier with a logistic output trained with binary cross-entropy loss expects labels in {0, 1}).

I did notice that in the notes section of the docs it reads

Graphs may have node labels, node attributes, edge labels, and edge attributes, varing from different dataset. This class does not perform additional process.

However, this isn't actually the case: the raw BZR graph labels are {-1, 1}, and they are preprocessed by subtracting the minimum label from all labels, which gives {0, 2}.
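
To illustrate how a minimum shift turns {-1, 1} into {0, 2}, here is a minimal sketch. The helper name `shift_to_zero` and the sample values are hypothetical and used only for illustration; this is not the DGL source, and it assumes the existing preprocessing amounts to subtracting the smallest label.

```python
import numpy as np

def shift_to_zero(labels):
    """Shift labels so the smallest one becomes 0 (illustrative helper, not DGL code)."""
    labels = np.asarray(labels)
    return labels - labels.min()

# Made-up labels in the style of the raw BZR graph labels, which take values in {-1, 1}.
raw_labels = np.array([-1, 1, 1, -1])
shifted = shift_to_zero(raw_labels)

# The shift preserves the gap between classes, so the result is {0, 2}, not {0, 1}.
print(sorted(set(shifted.tolist())))
```

Because the shift only moves the minimum to zero, any dataset whose raw class ids are not consecutive keeps its gaps.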

Alternatives

Do not modify the graph labels at all, as per the docs.

Pitch

Preprocessing the labels so that they are {0, ..., n-1}, where n is the number of classes, would be the easiest for the user. An additional argument could be added to let the user access the raw labels if needed.
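
One possible way to do the relabelling described in the pitch is sketched below with `numpy.unique`. The function name `relabel_consecutive` is hypothetical, and this is not necessarily how the eventual fix was implemented. Returning the sorted raw class values alongside the new labels is one way to keep the raw labels recoverable.

```python
import numpy as np

def relabel_consecutive(labels):
    """Map arbitrary integer class ids to 0..n-1 (illustrative helper, not DGL code).

    Returns the relabelled array and the sorted raw class values, so that
    classes[new_labels] recovers the raw labels.
    """
    labels = np.asarray(labels)
    classes, inverse = np.unique(labels, return_inverse=True)
    # reshape keeps the input shape regardless of how NumPy shapes the inverse
    return inverse.reshape(labels.shape), classes

# Made-up labels in the style of the raw BZR graph labels ({-1, 1}).
raw_labels = np.array([-1, 1, 1, -1])
new_labels, classes = relabel_consecutive(raw_labels)
print(new_labels.tolist(), classes.tolist())
```

Unlike a minimum shift, this mapping depends only on the order of the distinct values, so gaps in the raw class ids disappear.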

Additional context

I can put in a pull request if this change seems reasonable.

@classicsong
Contributor

classicsong commented Sep 10, 2020

hi, HenryKenlay:
Can you help provide this feature? Currently, we just let the label id start from 0:

dgl/python/dgl/data/tu.py

Lines 320 to 330 in b10b541

    for filename, field_name in self.attr_dict.items():
        try:
            data = loadtxt(self._file_path(filename),
                           delimiter=',').astype(int)
            if 'label' in filename:
                data = F.tensor(self._idx_from_zero(data))
            else:
                data = F.tensor(data)
            getattr(g, field_name[0])[field_name[1]] = data
        except IOError:
            pass

Your help is really appreciated. 

henrykenlay added a commit to henrykenlay/dgl that referenced this issue Sep 10, 2020
BarclayII added a commit that referenced this issue Sep 11, 2020
* [Bugfix] fix TUDataset labelling issue (#2165)

* update docstring according to discussion

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
@BarclayII
Collaborator

Fixed in #2173

zhjwy9343 pushed a commit to zhjwy9343/dgl that referenced this issue Sep 17, 2020
* [Bugfix] fix TUDataset labelling issue (dmlc#2165)

* update docstring according to discussion

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>