[Bugfix] fix TUDataset labelling issue (#2165) #2173

henrykenlay · 2020-09-10T14:00:14Z

Description

Fixes issue #2165.

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change,
or have been fixed to be compatible with this change
Related issue is referred in this PR

Changes

The graph labels are now processed with the new static method _idx_reset instead of the old static method _idx_from_zero. The new method remaps the labels to {0, ..., n-1}, where n is the number of distinct labels. Previously, datasets with labels {-1, 1} had them mapped to {0, 2}; this change maps them to {0, 1}.
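The difference can be sketched in plain NumPy. The helper names `shift_from_min` and `remap_to_range` below are hypothetical stand-ins, not the actual `_idx_from_zero` and `_idx_reset` methods, and the old behaviour is assumed to be a shift by the minimum label, which is consistent with {-1, 1} becoming {0, 2}.

```python
import numpy as np

def shift_from_min(labels):
    # Hypothetical stand-in for the old behaviour (an assumption):
    # shift every label so that the smallest one becomes 0.
    return labels - labels.min()

def remap_to_range(labels):
    # Hypothetical stand-in for the new behaviour: map the n distinct
    # label values to 0..n-1, preserving their sorted order.
    uniques = np.unique(labels)  # sorted distinct values
    return np.searchsorted(uniques, labels)

raw = np.array([-1, 1, 1, -1])
old = shift_from_min(raw)   # gaps in the raw labels survive the shift
new = remap_to_range(raw)   # labels are contiguous
print(old.tolist(), new.tolist())
```

Code that relied on the old output, for example by sizing a classifier head as `labels.max() + 1`, would see a different number of classes after this change.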

I changed the notes, since this class does perform additional processing on the labels (both before and after this fix).

hetong007

One comment on the doc. Otherwise LGTM.

hetong007 · 2020-09-10T15:39:49Z

python/dgl/data/tu.py

@@ -276,7 +276,7 @@ class TUDataset(DGLBuiltinDataset):
    Notes
    -----
    Graphs may have node labels, node attributes, edge labels, and edge attributes,
-    varing from different dataset. This class does not perform additional process.
+    varing from different dataset. 


I suggest we clearly describe the changes here, so that people will know instantly by reading the doc. Ideally people would know how to adapt their own code just by reading this doc.

How about the following appended to the existing notes:

Labels are mapped to :math:`\{0,...,n-1\}` where :math:`n` is the number of labels (some datasets have raw labels :math:`\{-1, 1\}` which will be mapped to :math:`\{0, 1\}`). In previous versions, the minimum label was added so that :math:`\{-1, 1\}` was mapped to :math:`\{0, 2\}`.

That sounds good to me

VoVAllen

Thanks for your contribution. The implementation looks very neat :)

BarclayII · 2020-09-11T05:20:02Z

python/dgl/data/tu.py

+        """Maps n unique labels to {0, ..., n-1} in an ordered fashion."""
+        labels = np.unique(idx_tensor)
+        relabel_map = {x: i for i, x in enumerate(labels)}
+        new_idx_tensor = np.vectorize(relabel_map.get)(idx_tensor)


Just out of curiosity: did you see a speedup from np.vectorize'ing the get method? If the gain is substantial then we probably want to standardize on it in the future.

I didn't check how it compares to a list comprehension wrapped in np.array (I try to avoid type conversions if I can), but the docs for np.vectorize state it's not designed for speed:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

There's some further discussion on stackoverflow which may be of interest:

https://stackoverflow.com/questions/35215161/most-efficient-way-to-map-function-over-numpy-array

https://stackoverflow.com/questions/16992713/translate-every-element-in-numpy-array-according-to-key
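For comparison, here is a sketch of a loop-free alternative using the `return_inverse` option of `np.unique`, shown next to the dictionary-based approach from this PR. The function names `idx_reset_vectorize` and `idx_reset_inverse` are hypothetical, and no timings were measured here, so any speed difference is something to benchmark rather than assume.

```python
import numpy as np

def idx_reset_vectorize(idx_tensor):
    # Approach from this PR: build a dict from each distinct label to its
    # rank, then apply dict.get element-wise via np.vectorize.
    labels = np.unique(idx_tensor)
    relabel_map = {x: i for i, x in enumerate(labels)}
    return np.vectorize(relabel_map.get)(idx_tensor)

def idx_reset_inverse(idx_tensor):
    # Alternative sketch: np.unique can return, for every element, the index
    # of its value in the sorted unique array, which is the new label.
    _, inverse = np.unique(idx_tensor, return_inverse=True)
    # Reshape explicitly so the result matches the input shape regardless of
    # how a given NumPy version shapes the inverse for multi-dimensional input.
    return inverse.reshape(idx_tensor.shape)

raw = np.array([[-1], [1], [1], [-1]])
a = idx_reset_vectorize(raw)
b = idx_reset_inverse(raw)
print(np.array_equal(a, b))
```

Both functions produce the same ordered relabelling; the second avoids the Python-level loop that `np.vectorize` performs internally.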

* [Bugfix] fix TUDataset labelling issue (dmlc#2165)
* [Bugfix] fix TUDataset labelling issue (dmlc#2165)
* update docstring according to discussion

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>

henrykenlay and others added 3 commits September 10, 2020 14:44

[Bugfix] fix TUDataset labelling issue (dmlc#2165)

0db85f0

[Bugfix] fix TUDataset labelling issue (dmlc#2165)

69b9240

Merge branch 'master' into master

6f18cc3

classicsong requested review from VoVAllen and hetong007 September 10, 2020 15:17

hetong007 reviewed Sep 10, 2020


VoVAllen approved these changes Sep 11, 2020


BarclayII reviewed Sep 11, 2020


update docstring according to discussion

7e774bc

BarclayII mentioned this pull request Sep 11, 2020

[Patch Release] 0.5.2 #2178

Closed

Merge branch 'master' into master

14c5006

BarclayII merged commit 8a227bf into dmlc:master Sep 11, 2020

BarclayII mentioned this pull request Sep 11, 2020

The dgl.data.TUDataset class returns labels in {0,2} for some binary classes. Can we instead return {0, 1}? #2165

Closed
