Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...) #7929

Merged
merged 128 commits into from
Aug 21, 2021

Conversation

btut
Copy link
Contributor

@btut btut commented Jul 20, 2021

We want to be able to import PDFs into JabRef and infer all Bib-data from the file itself.
This is done by using multiple importers. If they disagree about the metadata, we need a way to merge the conflicting data. As of #7947, this is done by prioritization of importers.
For pro-users, we want to have the option to do the merge manually using an n-way merge dialog. It can be triggered by clicking a button next to a linked (offline, pdf) file.
Screenshot from 2021-08-16 12-34-56

The dialog looks like a table. Each source will be represented by a column, each field by a row. There will be an additional, editable, column that represents the final entry.

Peek 2021-08-17 14-00

Users can:

  • Enter the information manually in the final entry column
  • Select one source to copy all it's fields to the final entry column
  • Select a field to copy it's content to the corresponding row in the final entry column

Sources will be the existing PDF importers and the importers implemented in #7947.

In a second step, we want to use this functionality to clean-up bib entries. Users may select an existing bib-entry and 'enhance' it by analyzing linked pdf files. In that case, the original entry will be displayed by an additional source-column.

TODO:

  • GUI glitches (text moves vertically sometimes when selecting a text)
  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

Implemented an Importer that querries Grobid for metadata of a pdf.
The necessary Grobid functionality (retrieving BibTeX for a pdf) is not yet
available in Grobid, but we opened a PR that implements it
(kermitt2/grobid#800).
@btut btut self-assigned this Jul 20, 2021
Copy link
Contributor Author

@btut btut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I annotated some thoughts I had during implementation, any comments are highly welcome!

@Siedlerchr
Copy link
Member

have you looked into the Apache Tika class? Could this be an option?
https://cwiki.apache.org/confluence/display/TIKA/GrobidJournalParser

@btut
Copy link
Contributor Author

btut commented Jul 20, 2021

have you looked into the Apache Tika class? Could this be an option?

I did not until now. I don't really see a benefit. I think using a decent library for the pdf-transfer is enough.

@btut

This comment has been minimized.

@Siedlerchr
Copy link
Member

Merge dialog, I like the general idea, but it has to be usable with small screens.
Refs also #6190

@calixtus
Copy link
Member

Maybe its not even necessary to display the textfields on the right, since the information displayed is redundant to the togglebutton selected.

@btut
Copy link
Contributor Author

btut commented Jul 20, 2021

Maybe its not even necessary to display the textfields on the right, since the information displayed is redundant to the togglebutton selected.

The idea is:

  • if there are errors, I want to be able to fix them immediately
  • later, I would like to add the option to add fields that no importer detected (via an add button at the bottom)

@tobiasdiez
Copy link
Member

tobiasdiez commented Jul 21, 2021

That looks nice! Good work.

I agree with Christoph, that the merge dialog needs to work also on smaller screens. For this, the n-merge approach might be problematic and probably the user is overwhelmed anyway by the information of > 3 choices. It may be worth a consideration to merge some of the extracted metadata automatically (say based on some priority list) and then only present the user two choices. For example, merge grobid + firstpage automatically, and then have a user interface with only the options "grobid/firstpage" and "xmp".

That brings me to the question what is the "firstpage" extractor? Is this our self-written importer that only barely works with a small selection of IEEE documents (if I remember correctly)? In this case, I would actually argue that we just remove the FirstPage extractor and replace it completely by grobid.

Based on the merging "grobid + firstpage" I got another idea: what about automatically merging all extracted metadata (grobid + xmp) per default, with the choice in the preferences to activate the advanced merge dialog? In the end, it's a tradeoff of how quickly users can import PDFs vs the quality of the extracted metadata. And if grobid is good enough, then I would argue that it's justified to value speed over a small quality improvement. It would be a bit of a magical feeling if you drop a PDF into JabRef, and an entry is directly created with almost-perfect metadata without any further user interaction. But on the other hand, there are certain user circles that have good-quality xmp metadata stored in their PDFs, so for them it would be frustrating if this is automatically overwritten by the subpar grobid metadata. For those users, the merge dialog has a lot of value and they can activate it in the preferences. What do you think?

Final small remark: For me "Grobid" is an implementation detail and most users don't know (and don't need to know) what this is. So I propose to replace it by some generic name in the UI, something like "Automatically extracted".

@btut
Copy link
Contributor Author

btut commented Jul 21, 2021

That looks nice! Good work.

Thanks :)

I agree with Christoph, that the merge dialog needs to work also on smaller screens.

Sure. This is just a first draft, lots of details to work on. Once all that empty space is gone, the whole thing will be much smaller I hope.

For this, the n-merge approach might be problematic and probably the user is overwhelmed anyway by the information of > 3 choices.

Actually, there will be even more than three. Grobid, Xmp, Firstpage, embedded bibtex, DOI fetcher (if any other importer detects a DOI, I would use the fetchers to get information from the DOI and add another column).

It may be worth a consideration to merge some of the extracted metadata automatically (say based on some priority list) and then only present the user two choices. For example, merge grobid + firstpage automatically, and then have a user interface with only the options "grobid/firstpage" and "xmp".

Thats the plan for later on. I wanted to first implement the merge to get a feeling for how good Grobid works, how likely it is for XMP metadata / embedded bibtex to be present...
I though about having a user-sortable priority list in preferences, where users can also disable an importer. This could be just as overwhelming though. Do you think a static priority list defined by us would be ok?
The goal would be to step through that priority list and always keep the first value for a field we find (so if Grobid gives Author + Title and XMP gives Title and Year, we would use Author and Title from Grobid and Year from XMP, provided Gobid has higher priority).

That brings me to the question what is the "firstpage" extractor? Is this our self-written importer that only barely works with a small selection of IEEE documents (if I remember correctly)? In this case, I would actually argue that we just remove the FirstPage extractor and replace it completely by grobid.

Yes, thats the one. AFAIK it works very well for the small set of data it works with, right? I did not check the implementation yet, will it fail if the pdf is not in IEEE format? If so, I would opt to use it if it does not fail.

Based on the merging "grobid + firstpage" I got another idea: what about automatically merging all extracted metadata (grobid + xmp) per default, with the choice in the preferences to activate the advanced merge dialog? In the end, it's a tradeoff of how quickly users can import PDFs vs the quality of the extracted metadata. And if grobid is good enough, then I would argue that it's justified to value speed over a small quality improvement. It would be a bit of a magical feeling if you drop a PDF into JabRef, and an entry is directly created with almost-perfect metadata without any further user interaction. But on the other hand, there are certain user circles that have good-quality xmp metadata stored in their PDFs, so for them it would be frustrating if this is automatically overwritten by the subpar grobid metadata. For those users, the merge dialog has a lot of value and they can activate it in the preferences. What do you think?

See above. This sounds like an argument to make the priority list sortable.

Final small remark: For me "Grobid" is an implementation detail and most users don't know (and don't need to know) what this is. So I propose to replace it by some generic name in the UI, something like "Automatically extracted".

True, but all others are 'automatically extracted' as well. Before starting this project, I didn't know about XMP either. Maybe we should just use generic names for the options (Source 1, Source 2) and show details on mouse-over?

@Siedlerchr
Copy link
Member

Regarding the custom first page importer, it also checks if it finds a DOI on the page and then simply calls the DOI fetcher with that doi

@btut
Copy link
Contributor Author

btut commented Jul 21, 2021

Regarding the custom first page importer, it also checks if it finds a DOI on the page and then simply calls the DOI fetcher with that doi

In that case, I would disable that feature from the first-page importer and do DOI lookup last. So if any importer finds a DOI, use the DOI fetcher.

Headers and the entry editor are now placed in VBox/HBox containers around the
table that displays the options.
Users can (if necessary) scroll in h and v directions.
@tobiasdiez
Copy link
Member

Actually, there will be even more than three. Grobid, Xmp, Firstpage, embedded bibtex, DOI fetcher (if any other importer detects a DOI, I would use the fetchers to get information from the DOI and add another column).

I'm a bit worried that displaying all of this information in one screen will be overwhelming. Maybe condense it into "Extracted" (Grobid, Firstpage, DOI merged) vs "Embedded" (XMP, bibtex merged) ?

Do you think a static priority list defined by us would be ok?

Yes! Don't make everything configurable, especially not at the beginning.

This sounds like an argument to make the priority list sortable.

This was more an argument/idea to merge all information by default. In the end, the user doesn't care what kind of sources we used, he just wants to have a high-quality entry when he puts a pdf into JabRef.

@btut btut added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Aug 20, 2021
@btut btut marked this pull request as ready for review August 20, 2021 14:15
Copy link
Member

@calixtus calixtus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some remarks were made in PR #8002 .
After fixing them, this should be ready too.
Otherweise lgtm.

@calixtus
Copy link
Member

Before merging this PR, merge #8002 !
This one closes #8002 .

CHANGELOG.md Outdated Show resolved Hide resolved
@calixtus
Copy link
Member

Two green checkmarks. Codacy is complaining about 4-space tabs instead of 2, but in every other file its done like here. So merging now. 🎉

@calixtus calixtus merged commit fd1cab0 into JabRef:main Aug 21, 2021
@btut btut deleted the improvement/pdfMetadataImport branch August 21, 2021 19:33
@koppor koppor removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Aug 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants