Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: support for reading and writing datetimes with timezones #253

Merged
merged 55 commits into from
Oct 20, 2023

Conversation

m-richards
Copy link
Member

@m-richards m-richards commented Jun 3, 2023

This is an initial proof of concept into reading and writing datetimes with timezones by reading and writing strings from gdal, and parsing the strings on the pandas side, very directly following discussion and suggestions in #123 123.

I'm trying not to affect the numpy read api, as a this functionality doesn't make sense from the numpy layer, as it doesn't support timezones.

There is something missing from this, the numpy api should still be able to read datetimes with timezones and convert them correctly to UTC (currently timezones are ignored complete), but I think that can be a separate PR (perhaps it should even come before this).

There's also probably a fair bit of work to handle driver support for this, I noticed issues locally trying to write geopackages (using WSL with an old version of gdal). I had a quick look at the tests in fiona, I was hoping for a simple list of supported formats, but it's quite complex there). I figure though that's a secondary issue after figuring out the implementation.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for looking into this! Added some comments / suggestions

pyogrio/_io.pyx Show resolved Hide resolved
pyogrio/_io.pyx Show resolved Hide resolved
Comment on lines 27 to 29
res = pd.to_datetime(ser, format="ISO8601")
if res.dtype != "object":
res = res.dt.as_unit("ms")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to try to detect if there are timezones or not (eg check the first (few) strings to see if they contain a "+" or "-"), because if so we have to pass utc=True to not get object dtype:

In [20]: pd.to_datetime(["2021/01/02 01:02:03+01:00", "2021/01/02 01:02:03+01:00"], format="ISO8601")
Out[20]: DatetimeIndex(['2021-01-02 01:02:03+01:00', '2021-01-02 01:02:03+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None)

In [21]: pd.to_datetime(["2021/01/02 01:02:03+01:00", "2021/01/02 01:02:03+02:00"], format="ISO8601")
Out[21]: Index([2021-01-02 01:02:03+01:00, 2021-01-02 01:02:03+02:00], dtype='object')

(with longer timeseries, it is very normal to have mixed offsets for the same timezone, i.e. values with and without DST)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good point, I checked with what we did in geopandas for the fiona IO and did the same thing as there to be consistent -> read without utc=True, if still object dtype then read with utc=True and see if that makes a difference.

I could also look at check for "+" and "-" as you suggest, I'd imagine that's faster, but maybe there are some edge cases?

Comment on lines 178 to 179
) # TODO this shouldn't be called datetime as string in the pandas layer,
# should it even be accessible?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we have some users asking for it, I wouldn't expose it in read_dataframe (so only in raw.read), just to keep it simpler in read_dataframe

".*parsing datetimes with mixed time zones will raise.*",
FutureWarning,
)
res = pd.to_datetime(ser, **datetime_kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still might want to add a try/except around this, for when this would start raising an error in the future (also in that case we should fall back to the utc=True path below

m-richards and others added 2 commits September 30, 2023 19:48
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@brendan-ward
Copy link
Member

Apologies for the very slow review here; I am slowly trying to get up to speed and looking for other alternatives to adding dtype info to the meta.

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @m-richards for all your work on this!

I'll readily admit I don't work with datetimes very much, so the details around that are a bit beyond my experience. However, your approach here seems like a reasonable way of getting around the limitations of datetime handling in numpy vs pandas. Adding dtypes to meta seems reasonable, as does ignoring it on in raw.write().

It might be a good idea to add some notes about this to introduction.md to guide users about when to use datetime_as_string and otherwise how the returned values will be dependent on the version of Pandas used (if I follow correctly).

pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Show resolved Hide resolved
pyogrio/geopandas.py Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
@@ -355,8 +401,24 @@ def write_dataframe(
# TODO: may need to fill in pd.NA, etc
field_data = []
field_mask = []
# dict[str, np.array(datetime.datetime)] special case for dt-tz fields
timezone_cols_metadata = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timezone_cols_metadata = {}
gdal_tz_offsets = {}

Slight preference to prefix params specifically intended to pass things down to GDAL as gdal_, and since the only thing being stored here is the GDAL offset rather than other metadata, this rename seems a bit more clear (provided I follow correctly).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you do follow correctly, the new name makes more sense. I think the name I had predates me passing down the offset only.

pyogrio/raw.py Show resolved Hide resolved
expected = pd.to_datetime(raw_expected, format="ISO8601").as_unit("ms")
else:
expected = pd.to_datetime(raw_expected)
expected = pd.Series(expected, name="col")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expected = pd.Series(expected, name="col")
expected = pd.Series(expected, name="datetime_col")

pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_raw_io.py Outdated Show resolved Hide resolved
@m-richards
Copy link
Member Author

Thanks @m-richards for all your work on this!

Thanks for the detailed review, I hope I have addressed most of the suggestions now.

It might be a good idea to add some notes about this to introduction.md to guide users about when to use datetime_as_string and otherwise how the returned values will be dependent on the version of Pandas used (if I follow correctly).

Happy to do this, but just want to double check the structuring. I am not super familiar with the pyogrio documentation, but introduction.md seems to all describe using read_dataframe. At present this pull request doesn't expose datetime_as_string to read_dataframe, only in raw.py::read so I don't know if it makes sense to document / whether it should be documented elsewhere?

On return types, happy to add this, but I saw there is already somewhat of a discussion of this at known_issues.md - which I now realise I need to update as part of this pull request too. Happy to have an example in the introduction as well though if you think that makes sense.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

At present this pull request doesn't expose datetime_as_string to read_dataframe, only in raw.py::read so I don't know if it makes sense to document / whether it should be documented elsewhere?

I would personally say it's good to keep this raw only, and then having the keyword explained in the docstrings seems good enough to me.

And the section in the "known_issues" about the datetime gotchas is already a nice and quite complete section I think. Maybe we could just move it "introduction", as now it's actually no longer a known "issue" with pyogrio, but general limitations of reading/writing datetimes because of GDAL's data model.

@brendan-ward
Copy link
Member

Apologies for the poor direction. Leaving datetime_as_string at raw level only is fine. I'd suggest moving the datimetime section from known issues to the introduction.

pyogrio/_io.pyx Outdated Show resolved Hide resolved
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @m-richards ! Getting really close. Looks like there are a few uncommitted suggestions, but after those are in this looks good to me.

m-richards and others added 3 commits October 18, 2023 08:54
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@m-richards
Copy link
Member Author

Thanks for the updates @m-richards ! Getting really close. Looks like there are a few uncommitted suggestions, but after those are in this looks good to me.

I think they're all done now, don't know how I missed them before.

@brendan-ward brendan-ward merged commit 2f6a679 into geopandas:main Oct 20, 2023
14 of 15 checks passed
@brendan-ward
Copy link
Member

Thanks @m-richards ! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants