feature request: extract pdf references #10200

Closed
ghost opened this issue Aug 22, 2023 · 6 comments · Fixed by #10437

Comments

@ghost

ghost commented Aug 22, 2023

Hi

I don't know if this is already possible, but I think a useful feature would be to extract all citations (to other papers) from a PDF file and automatically add them to the current (or another) library.
There could be a right-click menu item "Extract PDF references to library/new library" for entries that have a PDF file in their 'file' field.
The goal is to quickly build a library of related papers based on the references inside papers you already have in your database.
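
To make the "add those to the library" step concrete, here is a minimal sketch of merging extracted references into an existing library without creating duplicates. It is not JabRef code: the dictionary shape (`title`, `doi`) and the function names `normalize_doi`, `entry_key`, and `merge_references` are assumptions made purely for illustration.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes; return None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def entry_key(entry):
    """Identify an entry by DOI if present, else by its alphanumeric title."""
    doi = normalize_doi(entry.get("doi"))
    if doi:
        return ("doi", doi)
    title = entry.get("title") or ""
    return ("title", "".join(ch for ch in title.lower() if ch.isalnum()))


def merge_references(library, extracted):
    """Append unseen references to `library` (hypothetical list of dicts).

    Returns the list of entries that were actually added.
    """
    seen = {entry_key(e) for e in library}
    added = []
    for ref in extracted:
        key = entry_key(ref)
        if key == ("title", "") or key in seen:
            continue  # skip unusable or already-known references
        seen.add(key)
        added.append(ref)
    library.extend(added)
    return added
```

A real implementation would likely need fuzzier matching, since an entry stored with a DOI and an extracted reference without one would not be recognized as the same paper here.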

@ghost
Author

ghost commented Aug 22, 2023

Once all references are extracted, it could also help answer the question "what to read next", because you would no longer need to open the papers themselves to find their references.

@ThiloteE
Member

Great idea. I want to have this too.

What we currently use to extract PDF metadata is Grobid. It offers the following functionality:

  • Header extraction and parsing from article in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
  • References extraction and parsing from articles in PDF format, around .87 F1-score against an independent PubMed Central set of 1943 PDFs containing 90,125 references, and around .90 on a similar bioRxiv set of 2000 PDFs (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).
  • Citation contexts recognition and resolution of the full bibliographical references of the article. The accuracy of citation contexts resolution is between .76 and .91 F1-score depending on the evaluation collection (this corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
  • Full text extraction and structuring from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).
  • PDF coordinates for extracted information, allowing to create "augmented" interactive PDF based on bounding boxes of the identified structures.
  • Parsing of references in isolation (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model).
  • Parsing of names (e.g. person title, forenames, middle name, etc.), in particular author names in header, and author names in references (two distinct models).
  • Parsing of affiliation and address blocks.
  • Parsing of dates, ISO normalized day, month, year.
  • Consolidation/resolution of the extracted bibliographical references using the biblio-glutton service or the CrossRef REST API. In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
  • Extraction and parsing of patent and non-patent references in patent publications.

I would not know how to implement it, but Grobid would be the starting point.
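
As a rough illustration of that starting point: Grobid returns extracted references as TEI XML (for example from its `/api/processReferences` REST endpoint), with one `biblStruct` element per reference. The sketch below only covers parsing such a response into simple records. The `SAMPLE` document is a hand-written, simplified stand-in for real Grobid output, and `parse_references` is a hypothetical helper, not part of Grobid or JabRef.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace used by Grobid output

# Hand-written, simplified example of a TEI reference list (assumption).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><back><div><listBibl>
<biblStruct xml:id="b0">
  <analytic>
    <title level="a" type="main">A Sample Paper</title>
    <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
    <idno type="DOI">10.1000/example</idno>
  </analytic>
  <monogr>
    <title level="j">Journal of Examples</title>
    <imprint><date type="published" when="2020"/></imprint>
  </monogr>
</biblStruct>
</listBibl></div></back></text></TEI>"""


def parse_references(tei_xml):
    """Turn each TEI biblStruct into a dict with title, authors, year, doi."""
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.iter(f"{TEI}biblStruct"):
        title_el = bibl.find(f".//{TEI}title")  # first title in document order
        doi_el = bibl.find(f".//{TEI}idno[@type='DOI']")
        date_el = bibl.find(f".//{TEI}date")
        authors = []
        for pers in bibl.iter(f"{TEI}persName"):
            parts = [
                el.text.strip()
                for el in pers
                if el.tag in (f"{TEI}forename", f"{TEI}surname") and el.text
            ]
            if parts:
                authors.append(" ".join(parts))
        when = date_el.get("when") if date_el is not None else None
        refs.append({
            "title": title_el.text if title_el is not None else None,
            "authors": authors,
            "year": when[:4] if when else None,
            "doi": doi_el.text if doi_el is not None else None,
        })
    return refs
```

Calling `parse_references(SAMPLE)` yields a list with one record per reference, which could then be converted into library entries. Real responses contain more fields (pages, volume, publisher) and occasionally incomplete references, so missing elements are mapped to `None` rather than raising.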

@koppor
Member

koppor commented Sep 5, 2023

The feature is also important in the context of reviews organized by some IEEE groups. They are waiting for this feature; once implemented, it would save them plenty of time. I have therefore raised its priority.

@aqurilla
Contributor

I would like to take up this issue!

@ThiloteE
Member

We also use Apache's PDFBox.

@ThiloteE
Member

A similar wish came up again in the forum: https://discourse.jabref.org/t/creating-bibtex-or-doi-list-from-bibliography/4109
