Is there a snorkel_labels_train.xlsx file anywhere? #108

Open
jambo6 opened this issue Oct 29, 2021 · 5 comments

jambo6 commented Oct 29, 2021

I'd like to utilise these labels for another project. It seems the folder

snorkeling/disease_gene/disease_associates_gene/data/sentences

should also have snorkel_labels_train.xlsx to go along with its test and dev files. Does this exist and if so is there any chance of getting access?

Contributor

danich1 commented Nov 1, 2021

> should also have snorkel_labels_train.xlsx to go along with its test and dev files. Does this exist and if so is there any chance of getting access?

So this folder only contains sentences that were hand labeled for this project. The train version isn't available, as it is supposed to consist of all the remaining documents within PubTator. The resulting file would be too big for GitHub to host on LFS (max file size is 2GB).

Currently, the main way to get those sentences is to download a snapshot of PubTator Central and extract the sentences into a database. Otherwise, I have a snapshot of the database used for this project that you could import (118GB); however, we would need to figure out how to transfer a file that large. My overall recommendation is to use the first option, as you would have the most current version for whichever project you are going to work on.
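A minimal sketch of what the first option could look like, assuming the snapshot is in the plain PubTator text format (`PMID|t|title`, `PMID|a|abstract`, then tab-separated annotation lines, with documents separated by blank lines). This is not the pipeline used in this repository: the function names, the `sentence` table schema, and the naive regex sentence splitter are all hypothetical stand-ins for illustration.

```python
import re
import sqlite3


def parse_pubtator(text):
    """Yield (pmid, document_text, annotations) per PubTator-format document."""
    for block in re.split(r"\n\s*\n", text.strip()):
        pmid, passages, annotations = None, {}, []
        for line in block.splitlines():
            m = re.match(r"^(\d+)\|([ta])\|(.*)$", line)
            if m:
                pmid = m.group(1)
                passages[m.group(2)] = m.group(3)
                continue
            fields = line.split("\t")
            # Annotation line: pmid, start, end, mention, type, [identifier]
            if len(fields) >= 5 and fields[1].isdigit() and fields[2].isdigit():
                annotations.append((int(fields[1]), int(fields[2]), fields[4]))
        if pmid is not None:
            # Offsets are relative to title + one separator character + abstract.
            doc = (passages.get("t", "") + " " + passages.get("a", "")).strip()
            yield pmid, doc, annotations


def split_sentences(text):
    """Naive splitter (assumption): break on whitespace after '.', '!' or '?'."""
    start = 0
    for m in re.finditer(r"(?<=[.!?])\s+", text):
        yield start, m.start(), text[start:m.start()]
        start = m.end()
    if start < len(text):
        yield start, len(text), text[start:]


def load_sentences(conn, text, types=("Disease", "Gene")):
    """Store sentences that contain at least one mention of every type in `types`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence "
        "(pmid TEXT, char_start INTEGER, char_end INTEGER, text TEXT)"
    )
    for pmid, doc, annotations in parse_pubtator(text):
        for start, end, sent in split_sentences(doc):
            found = {t for a_start, a_end, t in annotations
                     if a_start >= start and a_end <= end}
            if all(t in found for t in types):
                conn.execute("INSERT INTO sentence VALUES (?, ?, ?, ?)",
                             (pmid, start, end, sent))
    conn.commit()


# Made-up example document, for illustration only.
SAMPLE = (
    "123|t|BRCA1 and breast cancer.\n"
    "123|a|Mutations in BRCA1 are associated with breast cancer. "
    "This sentence has no entities.\n"
    "123\t0\t5\tBRCA1\tGene\t672\n"
    "123\t10\t23\tbreast cancer\tDisease\tMESH:D001943\n"
    "123\t38\t43\tBRCA1\tGene\t672\n"
    "123\t64\t77\tbreast cancer\tDisease\tMESH:D001943\n"
)

conn = sqlite3.connect(":memory:")
load_sentences(conn, SAMPLE)
rows = conn.execute("SELECT pmid, text FROM sentence ORDER BY char_start").fetchall()
```

For a full snapshot you would stream the file document by document instead of holding it in memory, and swap the regex splitter for a proper sentence segmenter.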

Author

jambo6 commented Nov 2, 2021

I was after the hand labelled train/dev/test sentences to bolster my dataset for a similar RE project, not the entire pubtator db. Would it be okay for me to use these and if so, is there a straightforward method to download just these sentences with hand labellings?

Contributor

danich1 commented Nov 2, 2021

> I was after the hand labelled train/dev/test sentences to bolster my dataset for a similar RE project, not the entire pubtator db. Would it be okay for me to use these and if so, is there a straightforward method to download just these sentences with hand labellings?

Sure. Can't guarantee that train.xlsx exists or has a lot of sentences annotated but here are the quick links to the available data atm:

Compound Treats Disease Train
Compound Treats Disease Dev
Compound Treats Disease Test

Disease Associates Gene Dev
Disease Associates Gene Test

Gene interacts Gene Train
Gene interacts Gene Dev
Gene interacts Gene Test

Compound binds Gene would take a bit for me to get to you, so let me know if you need it.

Author

jambo6 commented Nov 4, 2021

So do there not exist handcrafted labels for Disease Associates Gene Train?

Contributor

danich1 commented Nov 4, 2021

I forgot to upload it to this repository, but here is your requested file:
Disease Associates Gene Train
