Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] fix TUDataset labelling issue (#2165) #2173

Merged
merged 5 commits into from
Sep 11, 2020

Conversation

henrykenlay
Copy link
Contributor

Description

Fixes issue #2165.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the my best knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred in this PR

Changes

The graph labels are processed using the new static method _idx_reset instead of the old static method _idx_from_zero. This new method remaps the labels to {0, ..., n-1} where n is the number of labels. Previously some datasets have labels {-1, 1} which were being mapped to {0, 2}, this change maps them to {0, 1}.

I changed the notes since this class does perform additional process to the labels (before and after this fix).

Copy link
Member

@hetong007 hetong007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment on the doc. Otherwise LGTM.

@@ -276,7 +276,7 @@ class TUDataset(DGLBuiltinDataset):
Notes
-----
Graphs may have node labels, node attributes, edge labels, and edge attributes,
varing from different dataset. This class does not perform additional process.
varing from different dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we clearly describe the changes here, so that people will know instantly by reading the doc. Ideally people would know how to adapt their own code just by reading this doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the following appended to the existing notes:

Labels are mapped to :math:\{0,...,n-1\} where :math:n is the number of labels (some datasets have raw labels :math:\{-1, 1\} which will be mapped to :math:\{0, 1\}). In previous versions, the minimum label was added so that :math:\{-1, 1\} was mapped to :math:\{0, 2\}.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me

Copy link
Collaborator

@VoVAllen VoVAllen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution. The implementation looks very neat :)

"""Maps n unique labels to {0, ..., n-1} in an ordered fashion."""
labels = np.unique(idx_tensor)
relabel_map = {x: i for i, x in enumerate(labels)}
new_idx_tensor = np.vectorize(relabel_map.get)(idx_tensor)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for curiosity: how did you experience the speed up by np.vectorize'ing the get method? If the gain is substantial then we probably want to standardize it in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see how it compares to a list comprehension wrapped in np.array (I try and avoid type conversions if I can) but the docs for np.vectorize state its not designed for speed:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

There's some further discussion on stackoverflow which may be of interest:

@BarclayII BarclayII mentioned this pull request Sep 11, 2020
@BarclayII BarclayII merged commit 8a227bf into dmlc:master Sep 11, 2020
zhjwy9343 pushed a commit to zhjwy9343/dgl that referenced this pull request Sep 17, 2020
* [Bugfix] fix TUDataset labelling issue (dmlc#2165)

* [Bugfix] fix TUDataset labelling issue (dmlc#2165)

* update docstring according to discussion

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants