Adding new datasets to dgl.data #2876

Closed
YingtongDou opened this issue Apr 27, 2021 · 9 comments

@YingtongDou

🚀 Feature

Add a new graph dataset for node classification (fraud detection) and a new graph dataset for graph classification (fake news detection) as built-in datasets in dgl.data.

Motivation

The first graph dataset includes two homogeneous multi-relational graphs extracted from Yelp and Amazon, where nodes represent fraudulent reviews or fraudulent reviewers. It was first proposed in a CIKM'20 paper and has been used by a recent WWW'21 paper as a benchmark. Another paper also uses the dataset as an example for studying non-homophilous graphs. The dataset is built upon industrial data and has rich relational information and unique properties like class imbalance and feature inconsistency, which make it a good instance for investigating how GNNs perform on real-world noisy graphs.

The second graph dataset is composed of two sets of tree-structured fake/real news propagation graphs extracted from Twitter. Unlike most benchmark datasets for the graph classification task, the graphs in this dataset are tree-structured: the root node represents the news, and the leaf nodes are Twitter users who retweeted it. In addition, the node features are encodings of users' historical tweets produced by different pretrained language models. The dataset could help GNNs learn how to fuse multi-modal information and learn representations for tree-structured graphs. It would be a good addition to the current graph classification benchmarks.

Alternatives

N/A

Pitch

Add the above two datasets as built-in datasets in dgl.data.

Additional context

N/A

@classicsong
Contributor

Hi, could you follow this guideline to prepare your dataset? https://docs.dgl.ai/en/latest/guide/data-dataset.html#guide-data-pipeline-dataset

BTW, we can help upload the dataset to DGL's dataset S3.
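
For reference, the guide asks contributors to subclass dgl.data.DGLDataset and implement process(), __getitem__(), and __len__(). A minimal skeleton might look like the sketch below; the class name, placeholder graph, and feature sizes are illustrative only, not the final design.

import dgl
import torch
from dgl.data import DGLDataset

class FraudDetectionDataset(DGLDataset):
    """Hypothetical skeleton following the DGL dataset pipeline guide."""

    def __init__(self, raw_dir=None, force_reload=False, verbose=False):
        super().__init__(name='fraud_detection',
                         url=None,  # placeholder; the real URL would point to the hosted data
                         raw_dir=raw_dir,
                         force_reload=force_reload,
                         verbose=verbose)

    def process(self):
        # Build the DGLGraph from the raw files in self.raw_dir and attach
        # node features, labels, and fixed train/val/test masks.
        src, dst = torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0])  # placeholder edges
        self.graph = dgl.graph((src, dst))
        self.graph.ndata['feature'] = torch.randn(self.graph.num_nodes(), 32)
        self.graph.ndata['label'] = torch.zeros(self.graph.num_nodes(), dtype=torch.long)

    def __getitem__(self, idx):
        # A node classification dataset typically holds a single graph.
        return self.graph

    def __len__(self):
        return 1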

@YingtongDou
Author

> Hi, could you follow this guideline to prepare your dataset? https://docs.dgl.ai/en/latest/guide/data-dataset.html#guide-data-pipeline-dataset
>
> BTW, we can help upload the dataset to DGL's dataset S3.

Thanks for your suggestions!
We will follow the instructions to prepare the dataset and ping you in this issue after finishing it.

@sbyebss

sbyebss commented May 4, 2021

Hi! In this tutorial, https://docs.dgl.ai/en/0.6.x/new-tutorial/6_load_data.html, under the "Creating a Dataset for Node Classification" section:

[screenshot of the train/validation/test mask assignment code from the tutorial]

This part of the code simply splits the data from head to tail into three sections and doesn't add any randomness. This is misleading, because normally we need to shuffle the data when splitting it into training, validation, and test sets.
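
(For readers without the screenshot: to the best of my recollection, the tutorial's split is a purely positional assignment along the lines of the sketch below; the exact variable names and ratios may differ.)

import torch

n_nodes = 2708                      # e.g. number of nodes in the graph
n_train = int(n_nodes * 0.6)
n_val = int(n_nodes * 0.2)

train_mask = torch.zeros(n_nodes, dtype=torch.bool)
val_mask = torch.zeros(n_nodes, dtype=torch.bool)
test_mask = torch.zeros(n_nodes, dtype=torch.bool)

# The first 60% of node IDs go to training, the next 20% to validation,
# and the rest to test -- no shuffling anywhere.
train_mask[:n_train] = True
val_mask[n_train:n_train + n_val] = True
test_mask[n_train + n_val:] = True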

@BarclayII
Collaborator

> This part of the code simply splits the data from head to tail into three sections and doesn't add any randomness. This is misleading, because normally we need to shuffle the data when splitting it into training, validation, and test sets.

Normally we want to fix the training, validation, and test sets across all experiments for fair comparison, so indeed we do not want to introduce randomness during the dataset split.

@sbyebss

sbyebss commented May 11, 2021

Isn't setting a random seed a better way to control randomness? In my experience, if you simply cut the data into three sections, the validation and test accuracy can vary a lot.

@YingtongDou
Author

> Isn't setting a random seed a better way to control randomness? In my experience, if you simply cut the data into three sections, the validation and test accuracy can vary a lot.

A random seed works, but I think storing fixed train/validation/test IDs is safer and more standard.
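
A minimal sketch of that approach, assuming the split is generated once with a seeded shuffle and the resulting indices are saved and shipped with the dataset (the file name below is just an example):

import numpy as np

n_nodes = 2708
rng = np.random.default_rng(seed=42)   # the seed only matters when the split is first created
perm = rng.permutation(n_nodes)

n_train = int(n_nodes * 0.6)
n_val = int(n_nodes * 0.2)

# Save the indices once and ship them with the dataset, so every
# experiment reuses exactly the same train/validation/test split.
np.savez('fixed_split.npz',
         train_idx=perm[:n_train],
         val_idx=perm[n_train:n_train + n_val],
         test_idx=perm[n_train + n_val:])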

@saharshleo

Where can I find an article or tutorial on node classification with an imbalanced dataset?

Thank you.

@BarclayII
Collaborator

@saharshleo You might find the RECT example helpful.

@saharshleo

saharshleo commented Jul 1, 2021

@BarclayII Thank you for your response!

There is a small update: I am now working on edge classification with imbalanced classes. I have modified RECT-L as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class MLPPredictor(nn.Module):
    def __init__(self, in_features, out_classes):
        super().__init__()
        self.W = nn.Linear(in_features * 2, out_classes)

    def apply_edges(self, edges):
        # Concatenate source and destination node representations
        # and project them to an edge score.
        h_u = edges.src['h']
        h_v = edges.dst['h']
        score = self.W(torch.cat([h_u, h_v], 1))
        return {'score': score}

    def forward(self, graph, h):
        # h contains the node representations computed from the GNN defined
        # in the node classification section (Section 5.1).
        with graph.local_scope():
            graph.ndata['h'] = h
            graph.apply_edges(self.apply_edges)
            return graph.edata['score']

class RECT_L(nn.Module):
    def __init__(self, g, in_feats, n_hidden, activation, dropout=0.0):
        super(RECT_L, self).__init__()
        self.g = g
        self.gcn_1 = GraphConv(in_feats, n_hidden, activation=activation)
        self.fc = nn.Linear(n_hidden, in_feats)
        self.dropout = dropout
        nn.init.xavier_uniform_(self.fc.weight.data)

        self.pred = MLPPredictor(in_feats, 1)

    def forward(self, inputs):
        h_1 = self.gcn_1(self.g, inputs)
        h_1 = F.dropout(h_1, p=self.dropout, training=self.training)
        preds = self.fc(h_1)

        preds = self.pred(self.g, preds)
        preds = torch.sigmoid(preds)
        return preds

    # Detach the return variables
    def embed(self, inputs):
        h_1 = self.gcn_1(self.g, inputs)
        return h_1.detach()

  • The MLPPredictor class is the same as the one given here
  • in_feats = g.ndata['features'].shape[1]
  • hidden_feats = 200
  • activation = nn.PReLU()

Also I am using binary cross entropy as loss function:

class_weights = [90.0]   # removed when training without class weights
loss = F.binary_cross_entropy_with_logits(logits[train_mask], edge_labels[train_mask],
                                          pos_weight=torch.FloatTensor(class_weights))

I have tried the loss function both with and without class weights, but there is no impact on the predictions. After a certain number of epochs, the model predicts only the majority class.
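
As a side note for anyone reproducing this: F.binary_cross_entropy_with_logits applies the sigmoid internally, so it expects raw (pre-sigmoid) scores. Below is a minimal, self-contained sketch of weighted BCE on edge scores, using placeholder tensors rather than the dataset above:

import torch
import torch.nn.functional as F

# Placeholder tensors standing in for edge scores and binary edge labels.
num_edges = 1000
logits = torch.randn(num_edges, 1)       # raw scores -- no sigmoid applied beforehand
edge_labels = torch.zeros(num_edges, 1)
edge_labels[:10] = 1.0                   # roughly 1% positive (minority) class

# pos_weight up-weights the positive (minority) class; a common heuristic
# is the ratio of negative to positive examples.
pos_weight = (edge_labels == 0).sum() / edge_labels.sum()

loss = F.binary_cross_entropy_with_logits(logits, edge_labels,
                                          pos_weight=pos_weight.unsqueeze(0))
print(loss.item())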
