Bug in splitting data for new datasets for node classification with many classes #4

Open
robertjankowski opened this issue Jun 6, 2023 · 0 comments

Thanks for a very nice framework to work with GNNs!

I ran into an issue when loading my own dataset. It is in the function split_data:

https://github.com/AnoushkaVyas/GraphZoo/blob/3934f145c25a79388feb3adf02e9f1ea56f6e424/graphzoo/dataloader/dataloader.py#LL203C4-L203C4

The code assumes only two labels, i.e., 0 and everything else. My dataset has many classes, and with this function I was getting incorrect accuracy. In a sanity check where every node was assigned a label at random, the accuracy came out higher than the expected chance level of 1/(number of classes).
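
To illustrate how a split written for binary labels can inflate accuracy, here is a minimal sketch. It assumes, purely for illustration, that "positive" nodes are picked with `labels.nonzero()` and "negative" nodes with `(1 - labels).nonzero()`; this is a hypothetical pattern and not a copy of the linked split_data. Under that assumption, any node with a label of 2 or more ends up in both groups, so the same node can appear in more than one split.

```python
import numpy as np

# Hypothetical binary-style partition (an assumption, not the linked code).
labels = np.array([0, 1, 2, 3, 2, 0, 1, 3])

pos_idx = labels.nonzero()[0]        # nodes whose label is not 0
neg_idx = (1 - labels).nonzero()[0]  # nodes whose label is not 1

# Every node with label >= 2 satisfies both conditions.
overlap = np.intersect1d(pos_idx, neg_idx)
print("nodes in both groups:", overlap)
print("their labels:", labels[overlap])
```

If the two groups are then split separately and concatenated, the overlapping nodes can land in both the training and the test indices, which would explain above-chance accuracy on random labels.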

I simplified the split_data function a bit; it can now be defined as:

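import numpy as np

# Shuffle all node indices, then cut them into train/val/test by proportion,
# independent of the label values.
train_prop = 1 - val_prop - test_prop
vals = np.arange(len(labels))
np.random.shuffle(vals)
idx_train, idx_val, idx_test = np.split(vals, [int(train_prop*len(labels)), int((1-test_prop)*len(labels))])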
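
For anyone who wants to try this outside the library, the snippet above can be wrapped into a self-contained function. This is only a sketch: the name `split_indices`, its signature, and the `seed` argument are my own assumptions and not part of GraphZoo. It uses a local random generator so the global NumPy seed is left untouched, and it rounds the split sizes so that floating-point products do not truncate a count by one.

```python
import numpy as np

def split_indices(num_nodes, val_prop, test_prop, seed=0):
    """Randomly partition node indices into train/val/test.

    Hypothetical helper (not GraphZoo API). The split ignores label
    values, so it works for any number of classes.
    """
    rng = np.random.default_rng(seed)          # local generator
    perm = rng.permutation(num_nodes)          # shuffled node indices
    n_val = int(round(val_prop * num_nodes))
    n_test = int(round(test_prop * num_nodes))
    n_train = num_nodes - n_val - n_test       # remainder goes to train
    idx_train, idx_val, idx_test = np.split(perm, [n_train, n_train + n_val])
    return idx_train, idx_val, idx_test

# Example with 100 nodes and 10 randomly assigned classes.
labels = np.random.default_rng(1).integers(0, 10, size=100)
idx_train, idx_val, idx_test = split_indices(
    len(labels), val_prop=0.1, test_prop=0.2, seed=42
)
print(len(idx_train), len(idx_val), len(idx_test))
```

Note that this split is not stratified, so rare classes may be missing from the smaller splits; that would need a per-class shuffle on top of this.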