[Bugfix] fix TUDataset labelling issue (#2165) #2173
Conversation
One comment on the doc. Otherwise LGTM.
python/dgl/data/tu.py (outdated)
@@ -276,7 +276,7 @@ class TUDataset(DGLBuiltinDataset):
     Notes
     -----
     Graphs may have node labels, node attributes, edge labels, and edge attributes,
-    varing from different dataset. This class does not perform additional process.
+    varing from different dataset.
I suggest we clearly describe the change here, so that people know instantly by reading the doc. Ideally, people would know how to adapt their own code just by reading it.
How about the following appended to the existing notes:

Labels are mapped to :math:`\{0, ..., n-1\}` where :math:`n` is the number of labels (some datasets have raw labels :math:`\{-1, 1\}` which will be mapped to :math:`\{0, 1\}`). In previous versions, the minimum label was subtracted, so that :math:`\{-1, 1\}` was mapped to :math:`\{0, 2\}`.
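A minimal standalone sketch of the two mappings described in the note (the function names mirror the PR's static methods, but these are illustrative re-implementations, not the actual dgl code):

```python
import numpy as np

def idx_from_zero(idx):
    # Old behavior: shift by the minimum label, so {-1, 1} -> {0, 2}.
    return idx - np.min(idx)

def idx_reset(idx):
    # New behavior: rank-order the unique labels, so {-1, 1} -> {0, 1}.
    labels = np.unique(idx)
    relabel_map = {x: i for i, x in enumerate(labels)}
    return np.vectorize(relabel_map.get)(idx)

raw = np.array([-1, 1, 1, -1])
print(idx_from_zero(raw))  # [0 2 2 0]
print(idx_reset(raw))      # [0 1 1 0]
```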
That sounds good to me
Thanks for your contribution. The implementation looks very neat :)
"""Maps n unique labels to {0, ..., n-1} in an ordered fashion.""" | ||
labels = np.unique(idx_tensor) | ||
relabel_map = {x: i for i, x in enumerate(labels)} | ||
new_idx_tensor = np.vectorize(relabel_map.get)(idx_tensor) |
Just out of curiosity: what speed-up did you observe from np.vectorize'ing the get method? If the gain is substantial, then we probably want to standardize on it in the future.
I didn't test how it compares to a list comprehension wrapped in np.array (I try to avoid type conversions if I can), but the docs for np.vectorize state it's not designed for speed: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." There's some further discussion on Stack Overflow which may be of interest.
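As a side note, one fully vectorized alternative (my own suggestion, not part of this PR) is np.searchsorted against the sorted unique labels; since np.unique returns labels in sorted order, each element's position in that array is exactly its new index, with no per-element Python calls:

```python
import numpy as np

idx_tensor = np.array([-1, 1, -1, 1, 1])

# np.vectorize route: essentially a Python-level loop over elements.
labels = np.unique(idx_tensor)
relabel_map = {x: i for i, x in enumerate(labels)}
via_vectorize = np.vectorize(relabel_map.get)(idx_tensor)

# np.searchsorted route: labels from np.unique are sorted, so the
# insertion position of each element equals its rank among the labels.
via_searchsorted = np.searchsorted(labels, idx_tensor)

assert (via_vectorize == via_searchsorted).all()
```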
Description
Fixes issue #2165.
Checklist
Please feel free to remove inapplicable items for your PR.
Changes
The graph labels are processed using the new static method _idx_reset instead of the old static method _idx_from_zero. The new method remaps the labels to {0, ..., n-1}, where n is the number of labels. Previously, some datasets had labels {-1, 1} which were being mapped to {0, 2}; this change maps them to {0, 1}. I also changed the notes, since this class does perform additional processing on the labels (both before and after this fix).