Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...) #7929

btut · 2021-07-20T08:51:19Z

We want to be able to import PDFs into JabRef and infer all Bib-data from the file itself.
This is done by using multiple importers. If they disagree about the metadata, we need a way to merge the conflicting data. As of #7947, this is done by prioritization of importers.
For pro-users, we want to have the option to do the merge manually using an n-way merge dialog. It can be triggered by clicking a button next to a linked (offline, pdf) file.

The dialog looks like a table. Each source will be represented by a column, each field by a row. There will be an additional, editable, column that represents the final entry.

Users can:

Enter the information manually in the final entry column
Select one source to copy all its fields to the final entry column
Select a field to copy its content to the corresponding row in the final entry column
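
A minimal sketch of the data model behind such a table, assuming fields are plain string key/value pairs. `MergeTableSketch` and all of its method names are hypothetical and only illustrate the three user actions listed above; they are not the classes used in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: one column per source, one row per field, plus an editable final entry.
final class MergeTableSketch {

    // Source name -> (field name -> value); insertion order gives the column order.
    private final Map<String, Map<String, String>> sources = new LinkedHashMap<>();
    // The editable "final entry" column.
    private final Map<String, String> finalEntry = new LinkedHashMap<>();

    void addSource(String name, Map<String, String> fields) {
        sources.put(name, new LinkedHashMap<>(fields));
    }

    // Action 1: the user types a value directly into the final entry column.
    void setManually(String field, String value) {
        finalEntry.put(field, value);
    }

    // Action 2: copy every field of one source into the final entry column.
    void copyAllFrom(String source) {
        finalEntry.putAll(sources.getOrDefault(source, Map.of()));
    }

    // Action 3: copy a single field of one source into the matching row.
    void copyField(String source, String field) {
        String value = sources.getOrDefault(source, Map.of()).get(field);
        if (value != null) {
            finalEntry.put(field, value);
        }
    }

    // Returns a copy so callers cannot modify the model by accident.
    Map<String, String> result() {
        return new LinkedHashMap<>(finalEntry);
    }
}
```

In this sketch, later actions simply overwrite earlier ones for the same field, so a manual edit after `copyAllFrom` wins.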

Sources will be the existing PDF importers and the importers implemented in #7947.

In a second step, we want to use this functionality to clean up bib entries. Users may select an existing bib entry and 'enhance' it by analyzing linked PDF files. In that case, the original entry will be displayed in an additional source column.

TODO:

GUI glitches (text moves vertically sometimes when selecting a text)

Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

Implemented an importer that queries Grobid for the metadata of a PDF. The necessary Grobid functionality (retrieving BibTeX for a PDF) is not yet available in Grobid, but we opened a PR that implements it (kermitt2/grobid#800).

btut

I annotated some thoughts I had during implementation, any comments are highly welcome!

src/main/java/org/jabref/logic/importer/util/GrobidService.java

src/test/java/org/jabref/logic/importer/fileformat/GrobidPdfMetadataImporterTest.java

src/main/java/org/jabref/logic/importer/util/GrobidService.java

src/main/java/org/jabref/logic/importer/fetcher/GrobidCitationFetcher.java

Siedlerchr · 2021-07-20T14:49:08Z

Have you looked into the Apache Tika class? Could this be an option?
https://cwiki.apache.org/confluence/display/TIKA/GrobidJournalParser

btut · 2021-07-20T15:09:39Z

Have you looked into the Apache Tika class? Could this be an option?

I did not until now. I don't really see a benefit. I think using a decent library for the pdf-transfer is enough.

Siedlerchr · 2021-07-20T15:48:51Z

Merge dialog, I like the general idea, but it has to be usable with small screens.
Refs also #6190

calixtus · 2021-07-20T18:57:16Z

Maybe it's not even necessary to display the text fields on the right, since the information displayed is redundant to the toggle button selected.

btut · 2021-07-20T20:39:01Z

Maybe it's not even necessary to display the text fields on the right, since the information displayed is redundant to the toggle button selected.

The idea is:

if there are errors, I want to be able to fix them immediately
later, I would like to add the option to add fields that no importer detected (via an add button at the bottom)

tobiasdiez · 2021-07-21T07:14:58Z

That looks nice! Good work.

I agree with Christoph, that the merge dialog needs to work also on smaller screens. For this, the n-merge approach might be problematic and probably the user is overwhelmed anyway by the information of > 3 choices. It may be worth a consideration to merge some of the extracted metadata automatically (say based on some priority list) and then only present the user two choices. For example, merge grobid + firstpage automatically, and then have a user interface with only the options "grobid/firstpage" and "xmp".

That brings me to the question what is the "firstpage" extractor? Is this our self-written importer that only barely works with a small selection of IEEE documents (if I remember correctly)? In this case, I would actually argue that we just remove the FirstPage extractor and replace it completely by grobid.

Based on the merging "grobid + firstpage" I got another idea: what about automatically merging all extracted metadata (grobid + xmp) per default, with the choice in the preferences to activate the advanced merge dialog? In the end, it's a tradeoff of how quickly users can import PDFs vs the quality of the extracted metadata. And if grobid is good enough, then I would argue that it's justified to value speed over a small quality improvement. It would be a bit of a magical feeling if you drop a PDF into JabRef, and an entry is directly created with almost-perfect metadata without any further user interaction. But on the other hand, there are certain user circles that have good-quality xmp metadata stored in their PDFs, so for them it would be frustrating if this is automatically overwritten by the subpar grobid metadata. For those users, the merge dialog has a lot of value and they can activate it in the preferences. What do you think?

Final small remark: For me "Grobid" is an implementation detail and most users don't know (and don't need to know) what this is. So I propose to replace it by some generic name in the UI, something like "Automatically extracted".

btut · 2021-07-21T08:18:12Z

That looks nice! Good work.

Thanks :)

I agree with Christoph, that the merge dialog needs to work also on smaller screens.

Sure. This is just a first draft, lots of details to work on. Once all that empty space is gone, the whole thing will be much smaller I hope.

For this, the n-merge approach might be problematic and probably the user is overwhelmed anyway by the information of > 3 choices.

Actually, there will be even more than three. Grobid, Xmp, Firstpage, embedded bibtex, DOI fetcher (if any other importer detects a DOI, I would use the fetchers to get information from the DOI and add another column).

It may be worth a consideration to merge some of the extracted metadata automatically (say based on some priority list) and then only present the user two choices. For example, merge grobid + firstpage automatically, and then have a user interface with only the options "grobid/firstpage" and "xmp".

That's the plan for later on. I wanted to first implement the merge to get a feeling for how well Grobid works, how likely it is for XMP metadata / embedded bibtex to be present...
I thought about having a user-sortable priority list in preferences, where users can also disable an importer. This could be just as overwhelming though. Do you think a static priority list defined by us would be ok?
The goal would be to step through that priority list and always keep the first value for a field we find (so if Grobid gives Author + Title and XMP gives Title and Year, we would use Author and Title from Grobid and Year from XMP, provided Grobid has higher priority).
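
The "first value wins" rule described here could be sketched roughly as follows. This is only an illustration under assumptions: entries are modeled as plain `Map<String, String>` rather than JabRef's own entry type, and `PriorityMerge` is a hypothetical name, not code from this PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges importer results by a fixed priority order.
final class PriorityMerge {

    private PriorityMerge() {
    }

    /**
     * Expects candidates ordered from highest to lowest priority.
     * For each field, the first non-blank value encountered is kept;
     * lower-priority sources only fill fields that are still missing.
     */
    static Map<String, String> merge(List<Map<String, String>> candidatesByPriority) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> candidate : candidatesByPriority) {
            for (Map.Entry<String, String> field : candidate.entrySet()) {
                String value = field.getValue();
                if (value != null && !value.isBlank()) {
                    merged.putIfAbsent(field.getKey(), value);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Mirrors the example above: Grobid has higher priority than XMP.
        Map<String, String> grobid = Map.of("author", "Doe, Jane", "title", "Title from Grobid");
        Map<String, String> xmp = Map.of("title", "Title from XMP", "year", "2021");
        System.out.println(merge(List.of(grobid, xmp)));
    }
}
```

With this ordering, author and title come from the Grobid map and the year from the XMP map. Disabling an importer would amount to leaving its result out of the list.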

That brings me to the question what is the "firstpage" extractor? Is this our self-written importer that only barely works with a small selection of IEEE documents (if I remember correctly)? In this case, I would actually argue that we just remove the FirstPage extractor and replace it completely by grobid.

Yes, that's the one. AFAIK it works very well for the small set of data it works with, right? I did not check the implementation yet; will it fail if the PDF is not in IEEE format? If so, I would opt to use it if it does not fail.

Based on the merging "grobid + firstpage" I got another idea: what about automatically merging all extracted metadata (grobid + xmp) per default, with the choice in the preferences to activate the advanced merge dialog? In the end, it's a tradeoff of how quickly users can import PDFs vs the quality of the extracted metadata. And if grobid is good enough, then I would argue that it's justified to value speed over a small quality improvement. It would be a bit of a magical feeling if you drop a PDF into JabRef, and an entry is directly created with almost-perfect metadata without any further user interaction. But on the other hand, there are certain user circles that have good-quality xmp metadata stored in their PDFs, so for them it would be frustrating if this is automatically overwritten by the subpar grobid metadata. For those users, the merge dialog has a lot of value and they can activate it in the preferences. What do you think?

See above. This sounds like an argument to make the priority list sortable.

Final small remark: For me "Grobid" is an implementation detail and most users don't know (and don't need to know) what this is. So I propose to replace it by some generic name in the UI, something like "Automatically extracted".

True, but all others are 'automatically extracted' as well. Before starting this project, I didn't know about XMP either. Maybe we should just use generic names for the options (Source 1, Source 2) and show details on mouse-over?

Siedlerchr · 2021-07-21T08:19:10Z

Regarding the custom first page importer, it also checks if it finds a DOI on the page and then simply calls the DOI fetcher with that DOI.

btut · 2021-07-21T08:31:12Z

Regarding the custom first page importer, it also checks if it finds a DOI on the page and then simply calls the DOI fetcher with that DOI.

In that case, I would disable that feature from the first-page importer and do DOI lookup last. So if any importer finds a DOI, use the DOI fetcher.
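
As a rough sketch of "do the DOI lookup last": run all importers first, then, if any result carries a DOI, call the fetcher once and add its result as one more candidate. Everything here is an assumption for illustration: `DoiLookupLast` is a hypothetical name, entries are plain string maps, and the fetcher is passed in as a function instead of using JabRef's actual fetcher classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: appends a DOI-based candidate after all importers have run.
final class DoiLookupLast {

    private DoiLookupLast() {
    }

    static List<Map<String, String>> withDoiCandidate(
            List<Map<String, String>> importerResults,
            Function<String, Optional<Map<String, String>>> doiFetcher) {
        List<Map<String, String>> candidates = new ArrayList<>(importerResults);

        // Take the first non-blank DOI that any importer detected.
        Optional<String> doi = importerResults.stream()
                .map(result -> result.get("doi"))
                .filter(value -> value != null && !value.isBlank())
                .findFirst();

        // Query the fetcher at most once; an empty result adds nothing.
        doi.flatMap(doiFetcher).ifPresent(candidates::add);
        return candidates;
    }
}
```

Keeping the lookup in one place means no single importer needs its own DOI handling, which matches the suggestion to drop that feature from the first-page importer.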

tobiasdiez · 2021-07-21T13:52:01Z

Actually, there will be even more than three. Grobid, Xmp, Firstpage, embedded bibtex, DOI fetcher (if any other importer detects a DOI, I would use the fetchers to get information from the DOI and add another column).

I'm a bit worried that displaying all of this information in one screen will be overwhelming. Maybe condense it into "Extracted" (Grobid, Firstpage, DOI merged) vs "Embedded" (XMP, bibtex merged)?

Do you think a static priority list defined by us would be ok?

Yes! Don't make everything configurable, especially not at the beginning.

This sounds like an argument to make the priority list sortable.

This was more an argument/idea to merge all information by default. In the end, the user doesn't care what kind of sources we used, he just wants to have a high-quality entry when he puts a pdf into JabRef.

calixtus

Some remarks were made in PR #8002.
After fixing them, this should be ready too.
Otherwise LGTM.

calixtus · 2021-08-21T07:44:37Z

Before merging this PR, merge #8002!
This one closes #8002.

CHANGELOG.md

src/main/java/org/jabref/gui/mergeentries/MultiMergeEntriesViewModel.java

calixtus · 2021-08-21T18:45:48Z

Two green checkmarks. Codacy is complaining about 4-space tabs instead of 2, but in every other file it's done like here. So merging now. 🎉

GrobidPdfMetadataImporter implemented

22f0241

Implemented an importer that queries Grobid for the metadata of a PDF. The necessary Grobid functionality (retrieving BibTeX for a PDF) is not yet available in Grobid, but we opened a PR that implements it (kermitt2/grobid#800).

btut self-assigned this Jul 20, 2021

btut commented Jul 20, 2021

btut added 2 commits July 20, 2021 17:24

Fixed class when accessing resources

8effaa9

Draft of merge dialog

5d487d2

Default to first available entry

96cd5cf

Changed layout

8b5510e

Headers and the entry editor are now placed in VBox/HBox containers around the table that displays the options. Users can (if necessary) scroll in h and v directions.

btut added 12 commits July 21, 2021 16:00

Checkstyle

3a4a01a

Bind buttons with equal content together

8314855

Use TextArea only for multiline fields

05964bc

Use SplitPane

0f64b1c

Fixed scaling of labels

1260cf9

Add tooltip for toggle buttons

97fb43d

Implemented loading BibEntries in background

733415f

Implemented DOI Lookup button

620424c

Changed Button content to TextFlow

46bf75a

To be able to use the DiffHighlighter, but the TextFlow is ugly because it grows beyond its boundaries if the text is too long.

Change DOI button to icon

f036112

Use FileHelper method to get extension

a5d216c

Use ellipsing text flow

d9dc84e

btut added the status: ready-for-review label Aug 20, 2021

btut requested review from Siedlerchr, calixtus and koppor August 20, 2021 12:56

btut added 2 commits August 20, 2021 16:14

Merge branch 'main' of github.com:JabRef/jabref into useGrobidPreference

70a2e3e

Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport

51eb15d

btut marked this pull request as ready for review August 20, 2021 14:15

btut added 2 commits August 20, 2021 16:22

Fixed missing import (introduced by merge)

61b3b5b

Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport

3dbafbe

calixtus approved these changes Aug 21, 2021

btut added 4 commits August 21, 2021 13:36

Extract given-clause in test

69af125

Improved readability

f676003

Changelog

1e97104

Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport

43aaa05

Siedlerchr reviewed Aug 21, 2021

CHANGELOG.md (outdated, resolved)

btut added 2 commits August 21, 2021 14:18

Changelog update

55a4653

Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport

f0afe0c

Siedlerchr reviewed Aug 21, 2021

src/main/java/org/jabref/gui/mergeentries/MultiMergeEntriesViewModel.java (outdated, resolved)

Renamed Entry to EntrySource

87eded9

Siedlerchr approved these changes Aug 21, 2021

Merge branch 'main' of github.com:JabRef/jabref into improvement/pdfM…

32c0a3d

…etadataImport

calixtus merged commit fd1cab0 into JabRef:main Aug 21, 2021

btut deleted the improvement/pdfMetadataImport branch August 21, 2021 19:33

koppor removed the status: ready-for-review label Aug 29, 2021

calixtus mentioned this pull request Sep 22, 2021

Improve entry merge dialog (3-way merge) #6190

Closed

3 tasks

This was referenced Oct 7, 2021

Add documentation on advanced PDF merge dialog JabRef/user-documentation#368

Open

Support for multi-paper PDFs (AKA proceedings) #8128

Open

HoussemNasri mentioned this pull request Jul 5, 2022

[GSOC22] - A - Implement a fully functional three way merge UI #8945

Merged

6 tasks
