Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trouble with accent or other special characters - Export encoding format #87

Open
pauline-chauvet opened this issue Dec 13, 2022 · 20 comments

Comments

@pauline-chauvet
Copy link

While exporting data written with special characters in .nccsv, the encoding makes it look unpleasant for the user.
for example:
-Original
"Observation temps réel des bouées Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"
-Becomes
"Observation temps r\u00e9el des bou\u00e9es Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"

Right now, the encoding is ASCII: ".nccsv - Download a NetCDF-3-like 7-bit ASCII NCCSV .csv file with COARDS/CF/ACDD metadata."

Will it be possible in the future to have it in UTF8?

@BobSimons
Copy link
Collaborator

The underlying problem is that one of the goals of nccsv was to have it be readable by spreadsheet programs like Excel. Traditionally they just supported ASCII in csv files (I don't know about now).
So if I changed the nccsv specification to support UTF-8 (which would obviously be good in many ways), it would break that basic goal.

I will say: although it isn't proper, a solution is to use the ISO 8859-1 character set in the nccsv files that you create. That gives you all of the common accented characters in European languages (with the exception of the Euro character and others above Unicode #255). ERDDAP already reads .nccsv files as if they have the ISO 8859-1 character set. (There is a programmer's saying "Be strict in what you write but liberal when you read.") Doing this may cause trouble down the road (e.g., if someone ever tries to load one of these files into Excel), but you may be willing to accept that risk.

Is that a reasonable/acceptable solution for your purposes?

@n-a-t-e
Copy link

n-a-t-e commented Dec 14, 2022 via email

@BobSimons
Copy link
Collaborator

Sorry, my bad. I skipped over the first line of the original question "While exporting data written with special characters in .nccsv" and answered the question as if pauline-chauvet were making files to be imported into ERDDAP. Instead, you and pauline-chauvet are interested in the exported nccsv files. But the bulk of my answer remains. A goal of nccsv was to define a csv standard that could be read into spreadsheets and traditionally that has meant that they needed to be ASCII files. Yes, you're right, the encoding system is ugly compared to the UTF-8 characters, but it worked.

That said, I see that I (or my successor) should revisit this to see if all (or most) modern spreadsheet programs can now import UTF-8 CSV files. If so, then yes, I (or my successor) will change the nccsv specification to use UTF-8 and change ERDDAP to read and write UTF-8 CSV files.

Since I no longer work for NOAA and no longer have access to Excel, you (and others) can help. Can you please make a tiny UTF-8 CSV file with ASCII, 8859-1 (e.g., accented), and UTF-8-specific (e.g., the Euro symbol \u20ac) characters, then import the csv file into various common spreadsheet programs (Excel, Google Docs, Open Office?, Libre Office?, others?) to make sure each can read the file correctly? For each program, post a message here with

  • The name and version number of the program
  • The steps needed to import the UTF-8 csv files correctly, e.g., "File : Open : ...",
    That would help me tremendously.

@n-a-t-e
Copy link

n-a-t-e commented Dec 14, 2022

Thanks for taking a look. I tried opening this CSV file in various tools:

id,symbol,food,price
1,🐟,pêches,€4
2,🍕,pizza,¥1
3,🥦,légumes,₹2
  • Google sheets
    Imports perfectly by default

  • LibreOffice Community 7.2.5.2 (Mac)
    Imports perfectly by default

  • Apple Numbers 10.3.5 (Mac)
    Imports perfectly by default

  • Excel 16.68 (Mac)
    By default it loads the csv but doesnt parse any non-ascii characters well

    To import as unicode, Create blank workbook, then go to
    Data->Get External Data->Text file, which opens the "Text Import Wizard":

    1. Set File Origin to "Unicode (UTF-8)"
    2. Set delimiter to Comma
    3. Finish
    4. Select "Create new Sheet"

    Then it imports all characters perfectly

@BobSimons
Copy link
Collaborator

That is great! Thank you very much for doing that.

Can someone else verify that it works in Windows Excel?

@thogar-computer
Copy link

Tested on Excel (Microsoft® Excel® for Microsoft 365 MSO (Version 2211 Build 16.0.15831.20098) 64-bit) for windows 10 using same file as above.

We can import direct from web and excel will auto pick the UTF-8 encoding
image

or data can be loaded via the from text/CSV option from an blank excel sheet and encoding is auto picked again.
image

both result in
image

when directly opening the csv in excel, it does not detect the encoding correctly and therefore the data is shown incorrectly.

@pauline-chauvet
Copy link
Author

Also tested on Microsoft® Excel® pour Microsoft 365 MSO (Version 2211 Build 16.0.15831.20098) 64 bits
and same result than thogar-computer
image

Tested with google sheet online (opened directly from my google drive)
It works using the "import" tool !
image

@pauline-chauvet
Copy link
Author

And it also works with LibreOffice Calc Version: 7.4.3.2 (x64) / LibreOffice Community CPU threads: 8; OS: Windows 10.0 Build 19044;

image

@BobSimons
Copy link
Collaborator

Excellent! Thank you all very much for doing the tests.

Okay. I will make the changes to the NCCSV specification and to the way the ERDDAP reads and writes NCCSV files. Hopefully, it will be in the next release.

@pauline-chauvet
Copy link
Author

This is great !
Merci beaucoup =)

@kutoso
Copy link

kutoso commented Oct 5, 2023

I was wondering if the encoding can be change from UTF-8 to UTF-8-BOM ?
I believe this adjustment might help resolve the issue in Excel. As previously mentioned, when opening the CSV file directly in Excel, the encoding detection is not accurate, resulting in incorrectly displayed data. Currently, the excel import tool is required to visualize the data correctly. By employing the UTF-8-BOM encoding, the need for the import tool might be eliminated, as the data is expected to be displayed accurately without any additional steps.

@BobSimons
Copy link
Collaborator

For the uninitiated, BOM = Byte Order Mark.

I didn't know there was a problem reading the UTF-8 nccsv files into Excel. Do other people have this problem?

I didn't have Excel when I made the changes above and so relied on the tests from those users. (I still don't have Excel.)

I'm disappointed in Excel. It should be able to detect and read UTF-8 without a BOM. See the top answer for
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-with-bom
which quotes the Unicode spec section 2.6 as saying "Use of a BOM is neither required nor recommended for UTF-8".
See also all the mentions of "byte order" in the UTF-8 page in Wikipedia
https://en.wikipedia.org/wiki/UTF-8
which agree but mention some software exceptions. Note that the JSON spec says BOM must not be used.

So I think adding the BOM is a bad idea, in general. It seems like trouble to add it just to fix the Excel use case (and probably then cause problems elsewhere), but that is a primary use case.

Comments?
Does anyone else have a solution for automatically importing the nccsv files correctly in Excel?

@robitaillej
Copy link

Reading UTF-8 into Excel isn't necessarily a problem. However, as previously mentioned in this conversation, users must use the import wizard to load csv without BOM correctly.
I understand that using a BOM can cause issues with certain software applications. Nevertheless, one of the primary objectives of the nccsv format is to ensure correct compatibility with spreadsheet programs like Excel and Google Sheets. Therefore, wouldn't specifying the BOM make sense if it facilitates the workflow for our primary users (those who rely on Excel or Google Sheets)?

@BobSimons
Copy link
Collaborator

Although I don't like BOMs and don't like going against the utf-8 file recommendations, your point about spreadsheets being a primary objective is strong. (MS/Excel are non-standard trouble yet again.)

Can you (or someone) please check: if a utf-8 nccsv file has a BOM, can it be read correctly in all of the common spreadsheet programs (see above), hopefully automatically?

@n-a-t-e
Copy link

n-a-t-e commented Oct 23, 2023

I converted my test file above to UTF-8 with BOM: https://gist.github.com/n-a-t-e/005339cdb23905cc4118db57f41cfb72

I tested it in:

Google sheets

  • opens perfectly by default

LibreOffice Community 7.2.5.2 (Mac)

  • opens perfectly by default

Apple Numbers 10.3.5 (Mac)

  • opens perfectly by default

Excel 16.77.1 (Mac)

  • Opening the BOM csv file brings up an import wizard, which it doesn't seem to do for non-BOM files. It prompts for column delimiter and sadly defaults to "tab" instead of "comma". So the user would have to know to change this. Then it opens properly, looks good.

  • In the images below, the BOM csv forces the import wizard to be used, and looks good. The non-BOM csv isn't prompted to import, and displays garbage (unless you manually run the import wizard).

Screen Shot 2023-10-23 at 9 36 40 AM

@kutoso
Copy link

kutoso commented Oct 23, 2023

Hi, In my case I created an csv-utf8-BOM file and I could open it with all this spreadsheet programs.

Excel (windows- version 2309)
image

Notepad++ v8.4.7 (64-bit)
image

Sublime
image

Bloc-notes Version22H2 v 19045.3570
image

OpenOffice 4.1.14 (an import wizard pop-up and then it is ok)
image

WordPad
image

Google sheet
image

@BobSimons
Copy link
Collaborator

@n-a-t-e, that is good news for the non-Excel imports, but it is a bummer that Excel doesn't read the unicode+BOM file automatically. That was the main goal.

@kutoso (or @n-a-t-e), why do you think your test of unicode+BOM didn't trigger the Import Wizard?

@n-a-t-e
Copy link

n-a-t-e commented Oct 25, 2023

I think I should clarify my previous long message- The BOM file works great in Excel on Mac and Windows (tested both).

There is a tiny quirk, at least in the Mac version -

  • if you double click on the file to open it, works perfecly
  • if you open it via the file menu it opens to a wizard which guesses the delimiter wrong, so the user has to select "Comma". Still better than the non-BOM version by a lot

@BobSimons
Copy link
Collaborator

Thanks for clarifying. So BOM sounds like a good idea.

@ChrisPJohn, it is up to you, but it sounds to me like a good idea to:

  • Make a 0.1(?) increment of the nccsv standard document which strongly recommends (but doesn't require) that .nccsv files have a BOM. Please keep the existing version and also add the new one as the current version. You might include little hints like how to open the file in Excel to avoid the Import Wizard (or what to do if you see the wizard) and how to add a BOM to a file that doesn't have one (there must be a way with 'cat').
  • Make it so that when ERDDAP reads nccsv files, it handles files with and without a BOM.
  • Make it so that nccsv files created with ERDDAP have a BOM.

Thank you @kutoso, @robitaillej, and @n-a-t-e. This is a good improvement.

@ChrisJohnNOAA
Copy link
Contributor

That sounds like a solid plan. I'll add this to the todo list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants