Bug in splitting data for new datasets for node classification with many classes #4

robertjankowski · 2023-06-06T15:11:49Z

Thanks for a very nice framework to work with GNNs!

I ran into an issue when loading my own dataset, specifically in the function split_data:

https://github.com/AnoushkaVyas/GraphZoo/blob/3934f145c25a79388feb3adf02e9f1ea56f6e424/graphzoo/dataloader/dataloader.py#LL203C4-L203C4

The code assumes binary labels, i.e., 0 versus everything else. I was working with many different classes, and with this function I was getting incorrect accuracy. In a sanity check where each node was assigned a label at random, the accuracy came out higher than the expected chance level of 1/(number of classes).

I simplified the split_data function a bit, and now it can be defined as:

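import numpy as np

# Split all node indices at random, regardless of class label.
train_prop = 1 - val_prop - test_prop
vals = np.arange(len(labels))
np.random.shuffle(vals)
# Cut points: end of the train slice and end of the validation slice.
idx_train, idx_val, idx_test = np.split(vals, [int(train_prop*len(labels)), int((1-test_prop)*len(labels))])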
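
For anyone who wants to try this as a standalone function, here is a minimal sketch of the same idea. The name `split_data_multiclass`, the `seed` parameter, and the return order are my assumptions for illustration and are not taken from GraphZoo. It computes integer split sizes first instead of multiplying proportions inside `np.split`, which avoids off-by-one cut points caused by floating-point rounding.

```python
import numpy as np


def split_data_multiclass(labels, val_prop, test_prop, seed=None):
    """Label-agnostic random split of node indices (hypothetical helper).

    Returns (idx_train, idx_val, idx_test). The return order is an
    assumption; adapt it to whatever the calling code expects.
    """
    n = len(labels)
    # Integer split sizes; the training set takes whatever is left over.
    n_val = int(val_prop * n)
    n_test = int(test_prop * n)
    n_train = n - n_val - n_test
    if n_train <= 0:
        raise ValueError("val_prop + test_prop leaves no nodes for training")

    # A seeded generator makes the split reproducible.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)

    # Cut the shuffled indices at the two boundaries.
    idx_train, idx_val, idx_test = np.split(perm, [n_train, n_train + n_val])
    return idx_train, idx_val, idx_test
```

Because the split never looks at the label values, every node index ends up in exactly one of the three sets, whatever the number of classes.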
