feature request: extract pdf references #10200

ghost · 2023-08-22T00:41:36Z

Hi

I don't know if it's possible already, but I think a useful feature would be to extract all citations (to other papers) from a PDF file and automatically add them to the current (or another) library.
There could be a right-click menu item "Extract PDF references to library/new library" when clicking an entry that has a PDF file in its 'file' field.
The goal is to quickly build a library of related papers based on the references inside papers you already have in your database.

ghost · 2023-08-22T00:43:34Z

Once all references are extracted, it could also help answer the question "what to read next", because you would no longer need to open the papers themselves to find their references.

ThiloteE · 2023-08-22T07:24:10Z

Great idea. I want to have this too.

What we currently use to extract PDF metadata is Grobid. It offers the following functionality:

Header extraction and parsing from article in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).

References extraction and parsing from articles in PDF format, around .87 F1-score against an independent PubMed Central set of 1943 PDFs containing 90,125 references, and around .90 on a similar bioRxiv set of 2000 PDFs (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).

Citation contexts recognition and resolution of the full bibliographical references of the article. The accuracy of citation contexts resolution is between .76 and .91 F1-score depending on the evaluation collection (this corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).

Full text extraction and structuring from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).

PDF coordinates for extracted information, allowing to create "augmented" interactive PDF based on bounding boxes of the identified structures.

Parsing of references in isolation (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model).

Parsing of names (e.g. person title, forenames, middle name, etc.), in particular author names in header, and author names in references (two distinct models).

Parsing of affiliation and address blocks.

Parsing of dates, ISO normalized day, month, year.

Consolidation/resolution of the extracted bibliographical references using the biblio-glutton service or the CrossRef REST API. In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.

Extraction and parsing of patent and non-patent references in patent publications.

I would not know how to implement this, but Grobid would be the starting point.
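To make the "starting point" a bit more concrete, here is a minimal sketch of the step that would follow a Grobid call: turning the TEI XML that Grobid returns for a PDF's reference list into simple records that could later become library entries. It is in Python for brevity and is not how JabRef does or should implement this. The sample XML is hand-written in the shape Grobid's TEI output usually takes, not real output, and the names `SAMPLE_TEI` and `extract_references` are hypothetical. Fetching the TEI from a Grobid server is left out on purpose.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in TEI XML documents.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Assumption: a hand-written sample shaped like Grobid's reference output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic>
        <title level="a" type="main">An example article</title>
        <author><persName>
          <forename type="first">Ada</forename><surname>Lovelace</surname>
        </persName></author>
        <idno type="DOI">10.1000/example.1</idno>
      </analytic>
      <monogr>
        <title level="j">Journal of Examples</title>
        <imprint><date type="published" when="2020"/></imprint>
      </monogr>
    </biblStruct>
    <biblStruct xml:id="b1">
      <monogr>
        <title level="m" type="main">A book without a DOI</title>
        <imprint><date type="published" when="1999-05"/></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def extract_references(tei_xml):
    """Return one dict per <biblStruct>: title, authors, year, DOI."""
    root = ET.fromstring(tei_xml)
    references = []
    for bibl in root.iterfind(".//tei:biblStruct", NS):
        # The first title marked type="main" is taken as the entry title.
        title = bibl.find(".//tei:title[@type='main']", NS)
        doi = bibl.find(".//tei:idno[@type='DOI']", NS)
        date = bibl.find(".//tei:date[@when]", NS)
        authors = []
        for name in bibl.iterfind(".//tei:author/tei:persName", NS):
            parts = [
                name.findtext("tei:forename", "", NS),
                name.findtext("tei:surname", "", NS),
            ]
            authors.append(" ".join(p for p in parts if p))
        references.append({
            "title": title.text if title is not None else None,
            "authors": authors,
            # Keep only the year part of an ISO date such as "1999-05".
            "year": date.get("when")[:4] if date is not None else None,
            "doi": doi.text if doi is not None else None,
        })
    return references


if __name__ == "__main__":
    for ref in extract_references(SAMPLE_TEI):
        print(ref)
```

Entries that come back with a DOI could be completed through a DOI lookup, while entries without one would have to rely on the parsed fields alone.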

koppor · 2023-09-05T19:48:10Z

The feature is also important in the context of reviews organized by some IEEE groups. They are waiting for this feature; once it is implemented, it will save them plenty of time. Therefore, I have given it a higher priority.

aqurilla · 2023-09-25T01:42:50Z

I would like to take up this issue!

ThiloteE · 2023-09-25T07:13:50Z

We also use Apache's PDFBox.

ThiloteE · 2023-12-18T22:11:33Z

A similar wish came up again in the forum: https://discourse.jabref.org/t/creating-bibtex-or-doi-list-from-bibliography/4109

ThiloteE added the "type: feature" and "needs-refinement" labels Aug 22, 2023

ghost mentioned this issue Sep 4, 2023

Enable the exploration of bib entries' relations (cited by and citing) #10324

Merged

ThiloteE assigned aqurilla Sep 25, 2023

aqurilla mentioned this issue Oct 1, 2023

[WIP] Extract PDF References #10437

Merged

ThiloteE added the external files label Oct 13, 2023

koppor mentioned this issue Mar 11, 2024

Create documentation for "Extract references from PDF" JabRef/user-documentation#484

Open

calixtus closed this as completed in #10437 Mar 12, 2024

koppor mentioned this issue Apr 7, 2024

Add logic for parsing references from last page of PDF #11156

Merged
