Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix encoding issues on windows for some formats #361

Conversation

theroggy
Copy link
Member

@theroggy theroggy commented Feb 23, 2024

I noticed in some geofileops tests that using pyogrio to write/read dataframes to ".csv" files gives encoding issues.

This PR fixes those.

Odd detail: I only saw this behaviour on the github CI windows systems: locally I couldn't reproduce. Apparently:

  • when running tests on my local windows locale.getprefferedencoding() returns "UTF-8", even though when I run locale.getprefferedencoding() in a seperate script it returns the typical one for windows: "cp1252".
  • in the tests on the github windows machines, locale.getprefferedencoding() returns "cp1252".

@theroggy theroggy marked this pull request as draft February 23, 2024 18:35
@theroggy theroggy changed the title ENH: improve handling of encoding on windows BUF: fix encoding issues on windows for some formats. Feb 24, 2024
@theroggy theroggy changed the title BUF: fix encoding issues on windows for some formats. BUF: fix encoding issues on windows for some formats Feb 24, 2024
@theroggy theroggy changed the title BUF: fix encoding issues on windows for some formats BUG: fix encoding issues on windows for some formats Feb 24, 2024
@theroggy theroggy marked this pull request as ready for review February 24, 2024 01:42
@theroggy
Copy link
Member Author

theroggy commented Feb 24, 2024

XLSX gave UTF-8 decoding errors when reading a file written after the change. GDAL does say that XLSX needs UTF-8 (via the OLCStringsAsUTF8 capability), but this only worked for existing files, not for new files/layers being created.

After reporting this via OSGeo/gdal#9295 and some more debugging and searching, this has already been fixed in GDAL in OSGeo/gdal#9301. So, for GDAL >= 3.8.5 this will be fixed.

@theroggy theroggy marked this pull request as draft February 24, 2024 12:06
@theroggy theroggy marked this pull request as ready for review February 24, 2024 20:23
@theroggy theroggy marked this pull request as draft February 24, 2024 20:44
pyogrio/_io.pyx Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Show resolved Hide resolved
pyogrio/_io.pyx Show resolved Hide resolved
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Still a bit unsure of the correct behavior for shapefiles when encoding is not provided by the user, but otherwise this looks reasonable.

pyogrio/_io.pyx Show resolved Hide resolved
pyogrio/_io.pyx Show resolved Hide resolved
@theroggy
Copy link
Member Author

theroggy commented Feb 29, 2024

Thanks for working on this! Still a bit unsure of the correct behavior for shapefiles when encoding is not provided by the user, but otherwise this looks reasonable.

I'm not sure about the best way forward either. I now implemented it the same as the behaviour in fiona and the default of gdal, but e.g. just always using UTF-8 sound reasonable to me as well.

Not sure why fiona and gdal default to the 1990's encoding, but possibly some applications e.g. don't use the ".cpg" properly when reading leading to more risk of compatibility issues?

@jorisvandenbossche
Copy link
Member

From an ESRI page (https://support.esri.com/en-us/knowledge-base/read-and-write-shapefile-and-dbase-files-encoded-in-var-000013192):

Shapefiles can now be stored in UTF-8. However, shapefiles encoded in UTF-8 are only recognized in ArcMap, ArcCatalog and ArcGIS Pro.

Now, I don't know what this "only" means (i.e. what other ESRI products are not included in that list).

Anyway, given we were already using UTF-8 before and didn't yet get any complaints about that, I would personally keep that. That seems like a better default nowadays (while the default in fiona and GDAL probably stems from many years ago)

@jorisvandenbossche
Copy link
Member

However, I would then maybe do that for all platforms, including Windows?

@theroggy
Copy link
Member Author

theroggy commented Feb 29, 2024

However, I would then maybe do that for all platforms, including Windows?

I agree. If we would go for "UTF-8", I would also vote to do it for all platforms.

EDIT: interesting detail: on the same ESRI page, in the "Summary", they state this:

The default code page in a shapefile (.DBF) is set to UTF-8 (UNICODE). This is the default for current internationalization practices.

@brendan-ward
Copy link
Member

All right - let's go with UTF-8 as the default for shapefiles on all platforms and revisit (by setting to ISO-8859-1) if we get errors reported by users.

@theroggy
Copy link
Member Author

theroggy commented Mar 1, 2024

UTF-8 is now the default for Shapefile on all platforms...

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

Can you please add a test that Shapefile is always written to UTF-8 by default (since it wasn't necessarily set that way on Windows before) unless encoding is passed by user.

CHANGES.md Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @theroggy ! Apologies for the very slow final review!

Planning to merge once CI is green so that we can have this in place before changes needed for #380

@brendan-ward brendan-ward merged commit 04a71e9 into geopandas:main Apr 4, 2024
19 checks passed
@theroggy theroggy deleted the ENH-improve-handling-of-encoding-on-windows branch April 4, 2024 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants