
Improve duplicate detection #1884

Closed
wants to merge 4 commits into from

Conversation

@stefan-kolb (Member) commented Aug 29, 2016

  • Strict comparison does not care about the entry type. Is this intended?
    @Test
    public void noStrictDuplicateForDifferentTypes() {
        BibEntry e1 = new BibEntry("1", "article");
        BibEntry e2 = new BibEntry("2", "journal");
        assertEquals(0, DuplicateCheck.compareEntriesStrictly(e1, e2), 0.01);
    }
  • Correlate by words uses a strange algorithm; it only works well for words appended at the end of the string.
     * Compare two strings on the basis of word-by-word correlation analysis.
     * TODO: strange algorithm as when there are only words inserted this gives a bad value, e.g.,
     * a test -> this a test (0.0)
     * characterization -> characterization of me (1.0)
  • We need a small benchmark for the duplicate testing, i.e., a database that contains all kinds of expected duplicates. (A possible test harness is sketched below.)
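A minimal sketch of what such a benchmark harness could look like, as a JUnit 4 parameterized test. The BenchmarkDatabase.loadExpectedPairs() helper and the DuplicateCheck.isDuplicate call are assumptions for illustration, not JabRef's actual API:

    import java.util.Collection;

    import net.sf.jabref.logic.bibtex.DuplicateCheck; // package path is an assumption
    import net.sf.jabref.model.entry.BibEntry;

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.Parameterized;
    import org.junit.runners.Parameterized.Parameters;

    import static org.junit.Assert.assertEquals;

    @RunWith(Parameterized.class)
    public class DuplicateBenchmarkTest {

        private final BibEntry first;
        private final BibEntry second;
        private final boolean expectedDuplicate;

        public DuplicateBenchmarkTest(BibEntry first, BibEntry second, boolean expectedDuplicate) {
            this.first = first;
            this.second = second;
            this.expectedDuplicate = expectedDuplicate;
        }

        @Parameters
        public static Collection<Object[]> pairs() {
            // Hypothetical helper: reads a curated BibTeX database in which
            // each pair of entries is annotated with the expected verdict.
            return BenchmarkDatabase.loadExpectedPairs();
        }

        @Test
        public void verdictMatchesExpectation() {
            // The isDuplicate signature is an assumption; adapt to the real API.
            assertEquals(expectedDuplicate, DuplicateCheck.isDuplicate(first, second));
        }
    }

Such a database would also give future changes to the scoring a regression baseline.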

@@ -16,25 +18,28 @@
import net.sf.jabref.model.entry.FieldProperty;
import net.sf.jabref.model.entry.InternalBibtexFields;

import info.debatty.java.stringsimilarity.Levenshtein;
Contributor

Is it worth having this build dependency? It basically does what the removed code below does and adds quite a few other unused methods. (I know @koppor was also skeptical.)

Member Author

We already have this dependency inside the code; that's why I reused it. As for the general dependency question, I'm not 100% sure, since we might use more distance measures and the library is probably not very large. So why should I reinvent the wheel here?
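For reference, a minimal usage sketch of the Levenshtein class from java-string-similarity (following the library's documented example):

    import info.debatty.java.stringsimilarity.Levenshtein;

    public class LevenshteinDemo {
        public static void main(String[] args) {
            Levenshtein levenshtein = new Levenshtein();
            // distance() returns the number of single-character edits
            // needed to turn one string into the other.
            System.out.println(levenshtein.distance("My string", "My $tring")); // 1.0
        }
    }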

Contributor

Yes, from that perspective it is certainly OK, and we can of course revive the old code later if required. Oh, here I found it: koppor#131

Member

Just go ahead, no worries. We can migrate to another library if the functionality is the same; if not, we just keep the dependency. The only reason for migrating is koppor#135.

Member

Since JabRef 3.6 is close to being available in Debian/unstable, there is no need for a library migration.

@oscargus (Contributor)

Good that you're giving it a go!

double[] req;
if (var == null) {
if (requiredFields == null) {
Member

Are we really returning null? I think we omitted null, so line 79 is also not covered by tests at all.

@koppor (Member) commented Sep 12, 2016

Strict comparison: We should also consider the type.

Correlate by words: I think you did not find any other algorithm? Can't an existing algorithm be used? Something like counting the shared and the differing words and dividing by max(len1, len2)? (A sketch of this idea follows below.)
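A minimal sketch of that idea as a hypothetical helper (not JabRef's actual correlateByWords): count the words both strings share and divide by the larger word count, so insertions anywhere in the string lower the score, not only at the end.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public final class WordCorrelation {

        // Shared words divided by the larger word count, in [0, 1].
        public static double correlateByWords(String first, String second) {
            Set<String> wordsFirst = new HashSet<>(Arrays.asList(first.toLowerCase().split("\\s+")));
            Set<String> wordsSecond = new HashSet<>(Arrays.asList(second.toLowerCase().split("\\s+")));
            Set<String> shared = new HashSet<>(wordsFirst);
            shared.retainAll(wordsSecond);
            int longer = Math.max(wordsFirst.size(), wordsSecond.size());
            return (longer == 0) ? 1.0 : ((double) shared.size()) / longer;
        }

        public static void main(String[] args) {
            // The two TODO examples quoted above:
            System.out.println(correlateByWords("a test", "this a test"));                      // ~0.67 rather than 0.0
            System.out.println(correlateByWords("characterization", "characterization of me")); // ~0.33 rather than 1.0
        }
    }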

Small benchmark: Can we do that as a separate issue? Maybe in the koppor-repo? 😇

}

@Article{dupJournalTechreport,
author = {Kolb, Stefan and Wirtz, Guido},
Member

Can we also add a test where the authors are written in a different serialization, i.e., Firstname Lastname? (An example entry is sketched below.)
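For illustration, a variant of the entry above with the same authors serialized in Firstname Lastname form (the entry key is made up):

@Article{dupJournalTechreportFirstLast,
  author = {Stefan Kolb and Guido Wirtz},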

@tobiasdiez (Member)

What is the state of this PR?

Another improvement to the duplicate algorithm concerns how DOI, ISBN and edition fields should be weighted:

  • Entries with the same DOI or ISBN are almost surely duplicates, even though they might have different titles or authors (Citavi does something similar)
  • Entries that are almost identical except for edition or volume are probably not duplicates, but just different versions of the same book/article (a sketch of this weighting follows below)
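A minimal sketch of that weighting over plain field maps (a hypothetical helper, not JabRef's actual DuplicateCheck):

    import java.util.Map;
    import java.util.Objects;
    import java.util.Optional;

    public final class IdentifierWeighting {

        // Returns a definite verdict when strong identifiers decide the case,
        // or Optional.empty() to fall back to the normal similarity score.
        public static Optional<Boolean> decideByStrongFields(Map<String, String> one, Map<String, String> two) {
            // Same DOI or ISBN: almost surely duplicates, even if title or author differ.
            for (String id : new String[] {"doi", "isbn"}) {
                String value = one.get(id);
                if ((value != null) && value.equalsIgnoreCase(two.get(id))) {
                    return Optional.of(true);
                }
            }
            // Differing edition or volume: probably different versions of the
            // same work, so not duplicates. (A real check would first confirm
            // that the remaining fields are almost identical.)
            for (String versionField : new String[] {"edition", "volume"}) {
                if (one.containsKey(versionField) && two.containsKey(versionField)
                        && !Objects.equals(one.get(versionField), two.get(versionField))) {
                    return Optional.of(false);
                }
            }
            return Optional.empty();
        }
    }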

@lenhard (Member) commented Dec 21, 2016

And again: @stefan-kolb, is there any progress here? I think at some point we should either try to proceed with the old PRs or scrap them.

@stefan-kolb (Member Author)

No time right now. There are some points that we should fix, however. You can close it if you want, though.
