
Improve duplicate detection #1884

Closed
wants to merge 4 commits into from

Conversation

@stefan-kolb (Member) commented Aug 29, 2016

  • Strict comparison does not care about the entry type. Is this intended?
    @Test
    public void noStrictDuplicateForDifferentTypes() {
        BibEntry e1 = new BibEntry("1", "article");
        BibEntry e2 = new BibEntry("2", "journal");
        assertEquals(0, DuplicateCheck.compareEntriesStrictly(e1, e2), 0.01);
    }
  • Correlate by words uses a strange algorithm; it only works well for words appended at the end of the string.
     * Compare two strings on the basis of word-by-word correlation analysis.
     * TODO: strange algorithm as when there are only words inserted this gives a bad value, e.g.,
     * a test -> this a test (0.0)
     * characterization -> characterization of me (1.0)
  • We need a small benchmark for the duplicate testing, i.e., a database that contains all kinds of expected duplicates. (A possible test harness is sketched below.)
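A minimal sketch of what such a benchmark harness could look like, as a JUnit 4 parameterized test. The BenchmarkDatabase.loadExpectedPairs() helper and the DuplicateCheck.isDuplicate call are assumptions for illustration, not JabRef's actual API:

    import java.util.Collection;

    import net.sf.jabref.logic.bibtex.DuplicateCheck; // package path is an assumption
    import net.sf.jabref.model.entry.BibEntry;

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.Parameterized;
    import org.junit.runners.Parameterized.Parameters;

    import static org.junit.Assert.assertEquals;

    @RunWith(Parameterized.class)
    public class DuplicateBenchmarkTest {

        private final BibEntry first;
        private final BibEntry second;
        private final boolean expectedDuplicate;

        public DuplicateBenchmarkTest(BibEntry first, BibEntry second, boolean expectedDuplicate) {
            this.first = first;
            this.second = second;
            this.expectedDuplicate = expectedDuplicate;
        }

        @Parameters
        public static Collection<Object[]> pairs() {
            // Hypothetical helper: reads a curated BibTeX database in which
            // each pair of entries is annotated with the expected verdict.
            return BenchmarkDatabase.loadExpectedPairs();
        }

        @Test
        public void verdictMatchesExpectation() {
            // The isDuplicate signature is an assumption; adapt to the real API.
            assertEquals(expectedDuplicate, DuplicateCheck.isDuplicate(first, second));
        }
    }

Such a database would also give future changes to the scoring a regression baseline.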

@@ -16,25 +18,28 @@
import net.sf.jabref.model.entry.FieldProperty;
import net.sf.jabref.model.entry.InternalBibtexFields;

import info.debatty.java.stringsimilarity.Levenshtein;
Contributor

Is it worth having this build dependency? It basically does what the removed code below does and adds quite a few other unused methods. (I know @koppor was also skeptical.)

Member Author

We already have this dependency inside the code; that's why I reused it. As for the general dependency question, I'm not 100% sure, since we might use more distance measures and the library is probably not very large. So why should I reinvent the wheel here?
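For reference, a minimal usage sketch of the Levenshtein class from java-string-similarity (following the library's documented example):

    import info.debatty.java.stringsimilarity.Levenshtein;

    public class LevenshteinDemo {
        public static void main(String[] args) {
            Levenshtein levenshtein = new Levenshtein();
            // distance() returns the number of single-character edits
            // needed to turn one string into the other.
            System.out.println(levenshtein.distance("My string", "My $tring")); // 1.0
        }
    }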

Contributor

Yes, from that perspective it is certainly OK, and we can of course revive the old code later if required. Oh, here I found it: koppor#131

Member

Just go ahead, no worries. We can migrate to another library if the functionality is the same; if not, we just keep the dependency. The only reason for migrating is koppor#135.

Member

Since JabRef 3.6 is close to being available in Debian/unstable, there is no need for a library migration.

@oscargus (Contributor)

Good that you're giving it a go!

double[] req;
if (var == null) {
if (requiredFields == null) {
Member

Are we really returning null? I think we omitted null, so line 79 is also not covered by tests at all.

@koppor (Member) commented Sep 12, 2016

Strict comparison: We should also consider the type.

Correlate by words: I think you did not find any other algorithm? Can't an existing algorithm be used? Something like counting the shared and the differing words and dividing by max(len1, len2)? (A sketch of this idea follows below.)
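A minimal sketch of that idea as a hypothetical helper (not JabRef's actual correlateByWords): count the words both strings share and divide by the larger word count, so insertions anywhere in the string lower the score, not only at the end.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public final class WordCorrelation {

        // Shared words divided by the larger word count, in [0, 1].
        public static double correlateByWords(String first, String second) {
            Set<String> wordsFirst = new HashSet<>(Arrays.asList(first.toLowerCase().split("\\s+")));
            Set<String> wordsSecond = new HashSet<>(Arrays.asList(second.toLowerCase().split("\\s+")));
            Set<String> shared = new HashSet<>(wordsFirst);
            shared.retainAll(wordsSecond);
            int longer = Math.max(wordsFirst.size(), wordsSecond.size());
            return (longer == 0) ? 1.0 : ((double) shared.size()) / longer;
        }

        public static void main(String[] args) {
            // The two TODO examples quoted above:
            System.out.println(correlateByWords("a test", "this a test"));                      // ~0.67 rather than 0.0
            System.out.println(correlateByWords("characterization", "characterization of me")); // ~0.33 rather than 1.0
        }
    }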

Small benchmark: Can we do that as a separate issue? Maybe in the koppor-repo? 😇

}

@Article{dupJournalTechreport,
author = {Kolb, Stefan and Wirtz, Guido},
Member

Can we also add a test where the authors are written in a different serialization, i.e., Firstname Lastname? (An example entry is sketched below.)
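For illustration, a variant of the entry above with the same authors serialized in Firstname Lastname form (the entry key is made up):

@Article{dupJournalTechreportFirstLast,
  author = {Stefan Kolb and Guido Wirtz},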

@tobiasdiez (Member)

What is the state of this PR?

Another improvement to the duplicate algorithm concerns how DOI, ISBN and edition fields should be weighted:

  • Entries with the same DOI or ISBN are almost surely duplicates, even though they might have different titles or authors (Citavi does something similar)
  • Entries that are almost identical except for edition or volume are probably not duplicates, but just different versions of the same book/article (a sketch of this weighting follows below)
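A minimal sketch of that weighting over plain field maps (a hypothetical helper, not JabRef's actual DuplicateCheck):

    import java.util.Map;
    import java.util.Objects;
    import java.util.Optional;

    public final class IdentifierWeighting {

        // Returns a definite verdict when strong identifiers decide the case,
        // or Optional.empty() to fall back to the normal similarity score.
        public static Optional<Boolean> decideByStrongFields(Map<String, String> one, Map<String, String> two) {
            // Same DOI or ISBN: almost surely duplicates, even if title or author differ.
            for (String id : new String[] {"doi", "isbn"}) {
                String value = one.get(id);
                if ((value != null) && value.equalsIgnoreCase(two.get(id))) {
                    return Optional.of(true);
                }
            }
            // Differing edition or volume: probably different versions of the
            // same work, so not duplicates. (A real check would first confirm
            // that the remaining fields are almost identical.)
            for (String versionField : new String[] {"edition", "volume"}) {
                if (one.containsKey(versionField) && two.containsKey(versionField)
                        && !Objects.equals(one.get(versionField), two.get(versionField))) {
                    return Optional.of(false);
                }
            }
            return Optional.empty();
        }
    }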

@lenhard (Member) commented Dec 21, 2016

And again: @stefan-kolb, is there any progress here? I think at some point we should either try to proceed with the old PRs or scrap them.

@stefan-kolb (Member Author)

No time right now. There are some points that we should fix, however. You can close it if you want, though.
