Fix for issue 5850: Journal abbreviations in UTF-8 not recognized #7639

MrGhabi · 2021-04-17T09:42:43Z

Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

Reproduce the issue:

New library
New article

In the BibTeX source tab, add the following:

@article{杨芙清2005软件工程技术发展思索,
  title={软件工程技术发展思索},
  author={杨芙清},
  journal={软件学报},
  volume={16},
  number={1},
  year={2005},
  publisher={Citeseer}
}

Click "Check integrity".

The main reason for this bug is that the integrity check only accepts ASCII characters. That works well for English citations, but JabRef has users across the world who use different charsets.

Screenshots (before and after the fix)

The way to fix:

1. First, I found that the bug is related to the class ASCIICharacterChecker.java.
2. In this class, the line

boolean asciiOnly = CharMatcher.ascii().matchesAllOf(field.getValue());

produces a warning for any non-ASCII character.
3. Then I removed this step from IntegrityCheck.
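The ASCII-only test described above can be sketched without Guava. This is a minimal illustration using only the JDK; the class and method names are hypothetical and are not part of JabRef.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not JabRef code: a JDK-only equivalent of an ASCII-only test.
final class AsciiCheckSketch {

    // Returns true when every character of the value fits into 7-bit ASCII.
    static boolean isAsciiOnly(String value) {
        // A fresh encoder per call, because CharsetEncoder instances are not thread-safe.
        return StandardCharsets.US_ASCII.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        // The journal name from the example entry, written with Unicode escapes.
        String journal = "\u8f6f\u4ef6\u5b66\u62a5";
        System.out.println(isAsciiOnly("Journal of Software"));
        System.out.println(isAsciiOnly(journal));
    }
}
```

With a check of this kind, any entry whose fields contain Chinese characters is flagged, which matches the behaviour reported in the issue.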

I still want to give a warning about characters that are not UTF-8 encoded:

Get the default encoding from the system (a warning is needed when a field cannot be decoded as UTF-8, which can happen when the user's default charset is not UTF-8).
Check whether the value of each field (a string) can be decoded as UTF-8.
If not, give the warning "Non-UTF-8 encoded found".
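The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: the class and method names are hypothetical, and Charset.defaultCharset() stands in for reading the file.encoding system property.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not JabRef code.
final class Utf8DecodeSketch {

    // Encodes the value with the source charset and reports whether
    // the resulting bytes form valid UTF-8.
    static boolean decodesAsUtf8(String value, Charset sourceCharset) {
        // Note: getBytes replaces characters the charset cannot represent.
        byte[] bytes = value.getBytes(sourceCharset);
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset systemDefault = Charset.defaultCharset();
        // "Mueller" with a u-umlaut, written with a Unicode escape.
        System.out.println(decodesAsUtf8("M\u00fcller", systemDefault));
    }
}
```

Pure ASCII values pass under most charsets, because their byte representation is usually identical to UTF-8.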

To check this, first set a non-UTF-8 default charset (for example GBK) for the whole environment.
The integrity check then gives the following warning:

Siedlerchr · 2021-04-17T10:22:42Z

src/main/java/org/jabref/logic/integrity/UTF8Checker.java

+    public List<IntegrityMessage> check(BibEntry entry) {
+        List<IntegrityMessage> results = new ArrayList<>();
+        for (Map.Entry<Field, String> field : entry.getFieldMap().entrySet()) {
+            Charset charset = Charset.forName(System.getProperty("file.encoding"));


I would extract this out of the loop, as it doesn't depend on the loop.
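A rough sketch of that suggestion: the charset is resolved once, before the loop. A plain Map<String, String> stands in for the entry's field map, and the class, method, and parameter names are hypothetical rather than taken from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not JabRef code.
final class HoistedCharsetSketch {

    // Returns the names of all fields whose bytes are not valid UTF-8.
    static List<String> check(Map<String, String> fields, String charsetName) {
        // Resolved once: the lookup does not depend on the loop variable.
        Charset charset = Charset.forName(charsetName);
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (!isValidUtf8(field.getValue().getBytes(charset))) {
                flagged.add(field.getKey());
            }
        }
        return flagged;
    }

    // Strict UTF-8 validation: malformed input raises an exception.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("author", "M\u00fcller", "year", "2005");
        System.out.println(check(fields, System.getProperty("file.encoding")));
    }
}
```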

What's the reason to use System.getProperty("file.encoding") and not say the encoding specified in the Library properties?

Different users have different default charsets, depending on the operating system or the computer's settings, and System.getProperty("file.encoding") returns that default charset. If the charset is not UTF-8, we should give a warning about that.
As for why I don't use the library properties or database properties: the user may not know the default charset of their computer, or may have set a charset for JabRef explicitly, but we should still give a warning, since a non-UTF-8 charset may lead to garbled text.

Ok thanks for the explanation.

But doesn't this give a lot of false positives? Say I have my library encoded in Charset A, and my systems default is Charset B. If all characters in my database are properly encoded with Charset A, then I shouldn't get any warnings even though some of the characters may not be encodable in Charset B, right?

But I also have to admit that I do not yet understand the use-case from the user perspective, so maybe I'm missing something obvious.
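The false-positive scenario can be made concrete with a small sketch. The two charsets are chosen for illustration only (ISO-8859-1 as the library's "Charset A", US-ASCII as the system's "Charset B"), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not JabRef code.
final class CharsetMismatchSketch {

    // True when the charset can represent every character of the value.
    static boolean encodableIn(String value, Charset charset) {
        return charset.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        Charset libraryCharset = StandardCharsets.ISO_8859_1; // "Charset A"
        Charset systemCharset = StandardCharsets.US_ASCII;    // "Charset B"
        String author = "M\u00fcller"; // u-umlaut via Unicode escape

        // Fine for the library, but a check against the system default would flag it.
        System.out.println(encodableIn(author, libraryCharset));
        System.out.println(encodableIn(author, systemCharset));
    }
}
```

A check that consults only the system default would warn about this value even though the library itself stores it without loss.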

I have thought about that. In my tests, if a user's default charset is A, then the clipboard and everything they type are encoded in A, so whatever they enter may end up garbled. An example may be #7629.
So the scenario you describe may be rare. That's why I chose not to get the charset via Charset charset = bibDatabaseContext.getMetaData().getEncoding().orElse(preferences.getDefaultEncoding());

By the way, I have a question about the design: if BibTeX is only supposed to contain ASCII, why do we allow the user to save the library in different charsets?

Hi @tobiasdiez! After thinking about this for a long time and running some tests, I think it may be better to give two kinds of warnings:

In biblatex mode, if the library charset is not UTF-8, give the warning Non-UTF-8 field found.

In both biblatex and BibTeX mode, if the system environment is not UTF-8, give the warning Non-UTF-8 env, may cause garbled.
I'm eagerly waiting for your suggestions and reply!

CHANGELOG.md

Siedlerchr

Thanks! This already looks good. Please have a look at the checkstyle issues.

tobiasdiez · 2021-04-17T10:51:55Z

I might not be up to date, but I always thought UTF-8 characters are only allowed in biblatex and that BibTeX only handles ASCII characters. Did this change?

MrGhabi · 2021-04-17T12:37:05Z

I might not be up to date, but I always thought UTF-8 characters are only allowed in biblatex and that BibTeX only handles ASCII characters. Did this change?

Yes, some journals and papers use non-ASCII characters in their names and so on (just like the BibTeX entry I added above), and it may be difficult to deal with them in JabRef. The details are in the issue. So I think it may be better to treat them equally.

tobiasdiez · 2021-04-17T13:33:56Z

I don't really have experience with, say, Chinese names (as authors or journals) in BibTeX. But the only evidence I could find always suggested biblatex, since BibTeX doesn't support UTF-8; see e.g. https://tex.stackexchange.com/questions/100092/how-to-include-a-chinese-paper-in-reference-via-bibtex.

So does it make more sense to keep the ASCII check for BibTeX and add the new UTF-8 check for biblatex?

Siedlerchr · 2021-04-17T13:43:40Z

I agree with @tobiasdiez we need the utf8 check for biblatex and the ascii checker for bibtex then.
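One way to picture the agreed split is a check that depends on the library mode. The following is only a sketch: the Mode enum, the method names, and the warning strings are hypothetical placeholders, not JabRef's types or localized messages.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch, not JabRef code.
final class ModeDependentCheckSketch {

    // Stand-in for the library mode; not JabRef's own type.
    enum Mode { BIBTEX, BIBLATEX }

    // Returns a warning text, or an empty Optional when the value passes.
    static Optional<String> check(String value, Mode mode, Charset encoding) {
        if (mode == Mode.BIBTEX) {
            // BibTeX: keep the ASCII-only rule.
            boolean asciiOnly = StandardCharsets.US_ASCII.newEncoder().canEncode(value);
            return asciiOnly ? Optional.empty() : Optional.of("non-ASCII character found");
        }
        // biblatex: only require that the bytes are valid UTF-8.
        return isValidUtf8(value.getBytes(encoding))
                ? Optional.empty()
                : Optional.of("non-UTF-8 field found");
    }

    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String journal = "\u8f6f\u4ef6\u5b66\u62a5"; // journal name from the example entry
        System.out.println(check(journal, Mode.BIBTEX, StandardCharsets.UTF_8));
        System.out.println(check(journal, Mode.BIBLATEX, StandardCharsets.UTF_8));
    }
}
```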

MrGhabi · 2021-04-17T13:56:03Z

I don't really have experience with, say, Chinese names (as authors or journals) in BibTeX. But the only evidence I could find always suggested biblatex, since BibTeX doesn't support UTF-8; see e.g. https://tex.stackexchange.com/questions/100092/how-to-include-a-chinese-paper-in-reference-via-bibtex.

So does it make more sense to keep the ASCII check for BibTeX and add the new UTF-8 check for biblatex?

I agree with @tobiasdiez we need the utf8 check for biblatex and the ascii checker for bibtex then.

Good idea! I will refactor my code accordingly. (After reading more about BibTeX and biblatex, I agree with you.)

And maybe the UTF-8 check for biblatex is not a bug fix but an enhancement? (laugh) I will focus on it!

MrGhabi · 2021-04-17T14:36:26Z

Hi reviewers! I have added the UTF-8 check for biblatex and restored the ASCII check for BibTeX!
Is there anything else I should do to improve the submission?

Siedlerchr · 2021-04-17T14:40:43Z

So far it looks good. You only need to add the new localization string to the l10n files; see here for more details: https://devdocs.jabref.org/getting-into-the-code/code-howtos#using-localization-correctly

MrGhabi · 2021-04-17T16:27:16Z

Hi reviewers! I added this string to all language files, but I relied on Google Translate for most of the translations, so please double-check them for errors.

Siedlerchr · 2021-04-17T17:16:45Z

You only need to add it to the English file. All other translations are managed by Crowdin.

MrGhabi · 2021-04-17T17:19:16Z

You only need to add it to the English file. All other translations are managed by Crowdin.

Hmm, so I need to revert the changes to all files except the English one, right?

MrGhabi · 2021-04-18T01:38:55Z

I have changed that. Hope everything goes well...

MrGhabi · 2021-04-19T07:59:21Z

I added the Javadoc for UTF8Checker and fixed a small problem in my JUnit test.

Siedlerchr · 2021-04-19T08:22:32Z

src/test/java/org/jabref/logic/integrity/UTF8CheckerTest.java

+        String NonUTF8 = "";
+        try {
+            NonUTF8 = new String("你好，这条语句使用GBK字符集".getBytes(), "GBK");
+        } catch (Exception e) {


You can simply remove the catch here and add throws Exception to the test method.
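The point of that suggestion, sketched without JUnit: declaring the checked exception on the method removes the need for a try/catch in the body. As a possible alternative (an assumption, not something proposed in this review), the String constructor that takes a Charset throws no checked exception at all. The class and method names are hypothetical, and ISO-8859-1 is used instead of GBK so the sketch runs on any JVM.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not JabRef code.
final class CheckedExceptionSketch {

    // Decoding by charset name can throw a checked exception,
    // so the method declares it instead of catching it.
    static String decodeWithName(byte[] bytes, String charsetName)
            throws UnsupportedEncodingException {
        return new String(bytes, charsetName);
    }

    // Decoding with a Charset object needs neither try/catch nor a throws clause.
    static String decodeWithCharset(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "M\u00fcller".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeWithName(bytes, "ISO-8859-1").length());
        System.out.println(decodeWithCharset(bytes, StandardCharsets.ISO_8859_1).length());
    }
}
```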

add 2 Junit Test for UTF8Checker.UTF8EncodingChecker in UTF8CheckerTest add 2 Junit Test for IntegrityCheck in IntegrityCheckTest

MrGhabi · 2021-04-19T14:07:22Z

Hi reviewers! I have added 2 JUnit tests for UTF8Checker and 2 for IntegrityCheck. I'm not sure whether these test cases are redundant or follow the project's conventions, so please give me some advice if there are problems!

src/test/java/org/jabref/logic/integrity/IntegrityCheckTest.java

Siedlerchr · 2021-04-23T07:36:35Z

Thanks a lot for your contribution!

MrGhabi added 5 commits April 17, 2021 16:38

fix issue #5850 for encoding problem

2caa8e0

add a blank line for build.gradle

26d5100

initial as main branch for build.gradle

aec8447

initial as main branch for build.gradle

304adc0

add the change of fix information of issue 5850

8a8df28

Siedlerchr reviewed Apr 17, 2021

CHANGELOG.md (outdated)

Siedlerchr requested changes Apr 17, 2021

Siedlerchr added the status: ready-for-review label Apr 17, 2021

Fix check style

590940a

MrGhabi and others added 3 commits April 17, 2021 22:03

Update CHANGELOG.md

ee1cac7

Co-authored-by: Christoph <siedlerkiller@gmail.com>

Add the utf8 check for biblatex and ascii check for bibtex

c6f0cc2

Merge remote-tracking branch 'origin/fix-for-issue-5850' into fix-for…

cc099d7

…-issue-5850

add the new localization string the l10 files

a18a3af

fix error

fe69305

MrGhabi and others added 5 commits April 18, 2021 09:09

add the statement only in en.properties

673cc42

Merge remote-tracking branch 'origin/fix-for-issue-5850' into fix-for…

7e04a98

…-issue-5850

revert changes

f3bf4ac

Update JabRef_da.properties

083e3ea

Update JabRef_ru.properties

b1b5999

MrGhabi added 7 commits April 18, 2021 09:35

Update build.gradle

9e94837

Update JabRef_fa.properties

e07e530

Update JabRef_no.properties

b1a4f58

Update JabRef_pl.properties

85d2198

Update JabRef_pt.properties

7e44819

Update JabRef_vi.properties

a81d2ec

Update JabRef_zh_TW.properties

d980120

Siedlerchr approved these changes Apr 18, 2021

MrGhabi added 4 commits April 19, 2021 14:28

reset the default charset

d7b1917

Merge remote-tracking branch 'origin/fix-for-issue-5850' into fix-for…

cec382e

…-issue-5850

reset the default charset

02cc61e

add the javaDoc of UTF8Checker

a4aff23

Siedlerchr reviewed Apr 19, 2021

add the javaDoc of UTF8CheckerTest and IntegrityCheckTest

e8e02a9

add 2 Junit Test for UTF8Checker.UTF8EncodingChecker in UTF8CheckerTest add 2 Junit Test for IntegrityCheck in IntegrityCheckTest

MrGhabi requested a review from tobiasdiez April 19, 2021 12:26

Siedlerchr requested changes Apr 20, 2021

src/test/java/org/jabref/logic/integrity/IntegrityCheckTest.java (outdated)

Remove the unwieldy Junit tests

5092817

MrGhabi requested a review from Siedlerchr April 21, 2021 23:00

Siedlerchr approved these changes Apr 22, 2021

Merge branch 'main' into fix-for-issue-5850

7bfe74a

Siedlerchr merged commit 434250d into JabRef:main Apr 23, 2021

k3KAW8Pnf7mkmdSMPHz27 mentioned this pull request Dec 25, 2021

Changed encoding used by IntegrityCheck #8359

Merged
