Improve duplicate detection #1884
Conversation
stefan-kolb commented Aug 29, 2016 (edited)
- The strict comparison does not take the entry type into account. Is this intended?
- Correlate by words uses a strange algorithm: it only works well for words appended at the end of the string.
- We need a small benchmark for duplicate testing, i.e., a database that contains all kinds of expected duplicates.
@@ -16,25 +18,28 @@
 import net.sf.jabref.model.entry.FieldProperty;
 import net.sf.jabref.model.entry.InternalBibtexFields;

+import net.sf.jabref.model.entry.FieldProperty;
+import info.debatty.java.stringsimilarity.Levenshtein;
Is it worth having this build dependency? It basically does what the removed code below does and adds quite a few unused methods. (I know @koppor was also skeptical.)
We already have this dependency inside the code, which is why I reused it. Regarding the general dependency question, I'm not 100% sure, as we might use more distance measures later and the library is probably not very large. So why should I reinvent the wheel here?
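For context, both the removed in-house code and the library's `Levenshtein` class compute the classic Levenshtein edit distance. A minimal self-contained sketch (illustrative only, not JabRef's actual implementation):

```java
// Sketch of the classic Levenshtein edit distance: the minimum number of
// single-character insertions, deletions, and substitutions needed to turn
// one string into the other. Not JabRef's actual code.
public class LevenshteinSketch {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b[0..j) from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting a[0..i)
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + substitutionCost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Rolling two rows instead of a full matrix keeps memory linear in the length of the second string, which matters when comparing many entry fields pairwise.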
Yes, from that perspective it is certainly OK, and we can of course revive the old code later if required. Oh, here I found it: koppor#131
Just go ahead, no worries. We can migrate to another library later if the functionality is the same; if not, we just keep the dependency. The only reason for migrating is koppor#135.
Since JabRef 3.6 is close to being available in Debian/unstable, there is no need for a library migration.
Good that you're giving it a go!
 double[] req;
-if (var == null) {
+if (requiredFields == null) {
Are we really returning null? I think we omitted null, so line 79 is not covered by tests at all either.
- Strict comparison: we should also consider the type.
- Correlate by words: I think you did not find any other algorithm? Can't an existing algorithm be used? Something like counting the similar and the different words and dividing by max(len1, len2)?
- Small benchmark: can we do that as a separate issue? Maybe in the koppor repo? 😇
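The word-overlap metric suggested above (shared words divided by the larger word count) could be sketched like this; class and method names are illustrative, not JabRef API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hedged sketch of the suggested word-correlation measure: the number of
// words the two strings share, divided by the larger word count. Unlike a
// suffix-based comparison, this is insensitive to word order and position.
public class WordCorrelationSketch {
    public static double similarity(String s1, String s2) {
        Set<String> words1 = new HashSet<>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        Set<String> words2 = new HashSet<>(Arrays.asList(s2.toLowerCase().split("\\s+")));
        int maxSize = Math.max(words1.size(), words2.size());
        Set<String> common = new HashSet<>(words1);
        common.retainAll(words2); // intersection: words present in both strings
        return (double) common.size() / maxSize;
    }
}
```

Because it uses sets, reordered titles score 1.0, while a title with extra words is penalized proportionally to the longer title's length.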
}
@Article{dupJournalTechreport,
  author = {Kolb, Stefan and Wirtz, Guido},
Can we also add a test where the authors are written in a different serialization, i.e., Firstname Lastname?
What is the state of this PR? Another improvement to the duplicate algorithm concerns how the DOI, ISBN, and edition fields should be weighted:
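One way such weighting could look is a weighted average over per-field similarities, with exact identifiers counting more than free-text fields. The weights, field names, and method names below are hypothetical, not JabRef's actual scheme:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of field weighting for duplicate scoring: identifier
// fields such as doi and isbn contribute more to the overall score than
// free-text fields. All weights are illustrative assumptions.
public class WeightedFieldScoreSketch {
    private static final Map<String, Double> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("doi", 3.0);     // exact identifiers are strong duplicate signals
        WEIGHTS.put("isbn", 3.0);
        WEIGHTS.put("title", 1.5);
        WEIGHTS.put("author", 1.5);
        WEIGHTS.put("edition", 1.0); // differing editions may be distinct entries
    }

    // fieldScores maps a field name to its per-field similarity in [0, 1];
    // the result is the weighted average over the fields that are present.
    public static double score(Map<String, Double> fieldScores) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> e : fieldScores.entrySet()) {
            double w = WEIGHTS.getOrDefault(e.getKey(), 1.0);
            weighted += w * e.getValue();
            totalWeight += w;
        }
        return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
    }
}
```

Averaging only over present fields avoids penalizing entries that simply lack an optional field such as edition.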
And again: @stefan-kolb Is there any progress here? I think at some point we should either try to proceed with the old PRs or scrap them.
No time right now. There are some points that we should fix, however. You can close it if you want, though.