Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for issue 5850: Journal abbreviations in UTF-8 not recognized #7639

Merged
merged 30 commits into from
Apr 23, 2021
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
2caa8e0
fix issue #5850 for encoding problem
MrGhabi Apr 17, 2021
26d5100
add a blank line for build.gradle
MrGhabi Apr 17, 2021
aec8447
initial as main branch for build.gradle
MrGhabi Apr 17, 2021
304adc0
initial as main branch for build.gradle
MrGhabi Apr 17, 2021
8a8df28
add the change of fix information of issue 5850
MrGhabi Apr 17, 2021
590940a
Fix check style
MrGhabi Apr 17, 2021
ee1cac7
Update CHANGELOG.md
MrGhabi Apr 17, 2021
c6f0cc2
Add the utf8 check for biblatex and ascii check for bibtex
MrGhabi Apr 17, 2021
cc099d7
Merge remote-tracking branch 'origin/fix-for-issue-5850' into fix-for…
MrGhabi Apr 17, 2021
a18a3af
add the new localization string the l10 files
MrGhabi Apr 17, 2021
fe69305
fix error
MrGhabi Apr 17, 2021
673cc42
add the statement only in en.properties
MrGhabi Apr 18, 2021
7e04a98
Merge remote-tracking branch 'origin/fix-for-issue-5850' into fix-for…
MrGhabi Apr 18, 2021
f3bf4ac
revert changes
MrGhabi Apr 18, 2021
083e3ea
Update JabRef_da.properties
MrGhabi Apr 18, 2021
b1b5999
Update JabRef_ru.properties
MrGhabi Apr 18, 2021
9e94837
Update build.gradle
MrGhabi Apr 18, 2021
e07e530
Update JabRef_fa.properties
MrGhabi Apr 18, 2021
b1a4f58
Update JabRef_no.properties
MrGhabi Apr 18, 2021
85d2198
Update JabRef_pl.properties
MrGhabi Apr 18, 2021
7e44819
Update JabRef_pt.properties
MrGhabi Apr 18, 2021
a81d2ec
Update JabRef_vi.properties
MrGhabi Apr 18, 2021
d980120
Update JabRef_zh_TW.properties
MrGhabi Apr 18, 2021
d7b1917
reset the default charset
MrGhabi Apr 19, 2021
cec382e
Merge remote-tracking branch 'origin/fix-for-issue-5850' into fix-for…
MrGhabi Apr 19, 2021
02cc61e
reset the default charset
MrGhabi Apr 19, 2021
a4aff23
add the javaDoc of UTF8Checker
MrGhabi Apr 19, 2021
e8e02a9
add the javaDoc of UTF8CheckerTest and IntegrityCheckTest
MrGhabi Apr 19, 2021
5092817
Remove the unwieldy Junit tests
MrGhabi Apr 21, 2021
7bfe74a
Merge branch 'main' into fix-for-issue-5850
Siedlerchr Apr 22, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ Note that this project **does not** adhere to [Semantic Versioning](http://semve
- We fixed an issue where opening BibTex file (doubleclick) from Folder with spaces not working. [#6487](https://github.com/JabRef/jabref/issues/6487)
- We fixed an issue with saving large `.bib` files [#7265](https://github.com/JabRef/jabref/issues/7265)
- We fixed an issue with very large page numbers [#7590](https://github.com/JabRef/jabref/issues/7590)
- We fixed an issue where journal abbreviations in UTF-8 not recognized [#5850](https://github.com/JabRef/jabref/issues/5850)
MrGhabi marked this conversation as resolved.
Show resolved Hide resolved

### Removed

Expand Down
1 change: 0 additions & 1 deletion build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -758,4 +758,3 @@ task downloadDependencies {
}
}


3 changes: 1 addition & 2 deletions src/main/java/org/jabref/logic/integrity/IntegrityCheck.java
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ public IntegrityCheck(BibDatabaseContext bibDatabaseContext,
} else {
entryCheckers.addAll(List.of(
new JournalInAbbreviationListChecker(StandardField.JOURNAL, journalAbbreviationRepository),
new ASCIICharacterChecker(),
new UTF8Checker(),
new NoBibtexFieldChecker(),
new BibTeXEntryTypeChecker())
);
Expand All @@ -59,7 +59,6 @@ List<IntegrityMessage> check() {
for (BibEntry entry : database.getEntries()) {
result.addAll(checkEntry(entry));
}

result.addAll(checkDatabase(database));

return result;
Expand Down
44 changes: 44 additions & 0 deletions src/main/java/org/jabref/logic/integrity/UTF8Checker.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
package org.jabref.logic.integrity;

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.jabref.logic.l10n.Localization;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.Field;

public class UTF8Checker implements EntryChecker {

/**
* Detect any non UTF-8 encoded field
*/
@Override
public List<IntegrityMessage> check(BibEntry entry) {
List<IntegrityMessage> results = new ArrayList<>();
for (Map.Entry<Field, String> field : entry.getFieldMap().entrySet()) {
Charset charset = Charset.forName(System.getProperty("file.encoding"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would extract this out of the loop, as it doesn't depend on the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to use System.getProperty("file.encoding") and not say the encoding specified in the Library properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since different users have different charsets due to the operating system or the default settings of the computer. And System.getProperty("file.encoding") is used get the default charset. If the charset is not UTF-8, we should give a warning about that.
And the reason not to use the Library properties & Database properties: Maybe the user doesn't know the default charset in his computer or he set the charset for jabref, but we should give a warning about that since Non-UTF-8 charset may cause to garbled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks for the explanation.

But doesn't this give a lot of false positives? Say I have my library encoded in Charset A, and my systems default is Charset B. If all characters in my database are properly encoded with Charset A, then I shouldn't get any warnings even though some of the characters may not be encodable in Charset B, right?

But I also have to admit that I do not yet understand the use-case from the user perspective, so maybe I'm missing something obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about that. In my test, if one user's default charset is A then his paste-board, his input is all encoded in A. So when he input something maybe just garbled. Maybe there is an example: #7629
So the scenario may be rare. That's the reason I don't choose to get the charset by Charset charset = bibDatabaseContext.getMetaData().getEncoding().orElse(preferences.getDefaultEncoding());

By the way, I have a question about the design. If the bibtex is only allowed in ascii in design, why do we allow the user to save it into different charsets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, reviewer! @tobiasdiez After thinking for a long time and doing some tests, I think maybe it's better to give 2 kinds of warning:

  1. In BibLatex, if the Library charset is not UTF-8, then give a warning Non-UTF-8 field found.
  2. In both BibLatex and BibTeX, if the System env is not UTF-8, give the warning Non-UTF-8 env, may cause garbled.
    And I'm eagerly waiting for your suggestions and reply!

boolean utfOnly = UTF8EncodingChecker(field.getValue().getBytes(charset));
if (!utfOnly) {
results.add(new IntegrityMessage(Localization.lang("Non-UTF-8 encoded found"), entry,
field.getKey()));
}
}
return results;
}

public static boolean UTF8EncodingChecker(byte[] data) {
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
try {
decoder.decode(ByteBuffer.wrap(data));
} catch (CharacterCodingException ex) {
return false;
}
return true;
}
}
39 changes: 39 additions & 0 deletions src/test/java/org/jabref/logic/integrity/UTF8CheckerTest.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
package org.jabref.logic.integrity;

import java.util.Collections;
import java.util.List;

import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.StandardField;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

public class UTF8CheckerTest {

private final BibEntry entry = new BibEntry();

@Test
void fieldAcceptsUTF8() {
UTF8Checker checker = new UTF8Checker();
entry.setField(StandardField.TITLE, "Only ascii characters!'@12");
assertEquals(Collections.emptyList(), checker.check(entry));
}

@Test
void fieldDoesNotAcceptUmlauts() {
System.getProperties().put("file.encoding", "GBK");
UTF8Checker checker = new UTF8Checker();
String NonUTF8 = "";
try {
NonUTF8 = new String("你好,这条语句使用GBK字符集".getBytes(), "GBK");
}catch (Exception e){
e.printStackTrace();
}
entry.setField(StandardField.MONTH, NonUTF8);
assertEquals(List.of(new IntegrityMessage("Non-UTF-8 encoded found", entry, StandardField.MONTH)), checker.check(entry));
}


}