Bugfix: support writing NAT in datetime column #146

Merged: 9 commits from fix-writing-datetime-nat into geopandas:main on Sep 13, 2022

Conversation

@theroggy (Member) commented Aug 17, 2022

closes #147

@theroggy marked this pull request as draft August 17, 2022 19:21
@jorisvandenbossche (Member) left a comment

Thanks, looks good!

pyogrio/_io.pyx (review comment on an outdated diff)
datetime.second + datetime.microsecond / 10**6,
0
)
if field_value is None or np.isnat(field_value):
A reviewer (Member) commented:

Can it ever be None?

I am also wondering if we can do np.isnat more efficiently (knowing that NaT is basically the smallest int64 value). It might not matter much for performance, though, since below we convert to a Python datetime object, which will probably be much slower and dominate the time anyway (so if we wanted to improve performance here, we should probably look there first).
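For illustration, a minimal sketch of the int64 equivalence mentioned here (a plain numpy illustration, not pyogrio code): NaT is stored as the minimum int64 value, so comparing the array's integer view against that sentinel gives the same mask as np.isnat.

    import numpy as np

    # NaT is represented internally as the smallest int64 value, so a check on
    # the integer view of the array is equivalent to np.isnat (sketch only).
    values = np.array(["2022-08-17", "NaT"], dtype="datetime64[ms]")
    as_int = values.view("int64")
    nat_sentinel = np.iinfo(np.int64).min

    print(as_int == nat_sentinel)                                    # [False  True]
    print(np.array_equal(np.isnat(values), as_int == nat_sentinel))  # True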

@theroggy (Member Author) replied Aug 18, 2022:

Can it ever be None?

Not sure. I was also wondering whether a check for NaN would be useful. The only case I could imagine is saving a string column into an existing datetime column (once append is supported), or something like that, but it sounds a bit far-fetched.
Possibly the same goes for None: it should never occur in a numpy datetime column?

I am also wondering if we can do np.isnat more efficiently (knowing that NaT is basically the smallest int64 value). It might not matter much for performance, though, since below we convert to a Python datetime object, which will probably be much slower and dominate the time anyway (so if we wanted to improve performance here, we should probably look there first).

Indeed. When I implemented the datetime write support I noticed the performance was indeed bad. I experimented a bit back then with speeding it up by parsing the datetime using only C functions, but I didn't get it working (immediately). So the Python datetime conversion is the big bottleneck; np.isnat itself should normally be fast, I think, because the cdef (cimported) version of numpy is used as well?
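For reference, a rough sketch of the per-value conversion path being discussed (a hypothetical illustration, not the actual pyogrio/_io.pyx code): every non-NaT value is turned into a Python datetime object before its individual fields are passed on, and that per-value object creation is what dominates the write time.

    import numpy as np

    values = np.array(["2022-08-17T19:21:00", "NaT"], dtype="datetime64[ms]")

    for value in values:
        # Missing values are skipped here; they would be written as null fields.
        if np.isnat(value):
            continue
        # The expensive step: build a Python datetime object for every value,
        # then pull the individual fields out of it.
        dt = value.astype("datetime64[us]").item()
        fields = (dt.year, dt.month, dt.day, dt.hour, dt.minute,
                  dt.second + dt.microsecond / 10**6)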

The reviewer (Member) replied:

In the pandas cython libraries, we have a bunch of code to convert a datetime64 numpy int to the different fields, but that's quite a lot of code that we can't just copy/paste (and it's not public, so we also can't import it from pandas).

One possible avenue would be to convert the full array up front to the different fields (there are public features in pandas for that). That would help performance, but would increase memory usage.
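A sketch of that "convert the full array up front" idea using only public pandas features (an assumed shape, not a committed design): the field arrays are computed vectorized in one pass, trading extra memory for speed.

    import numpy as np
    import pandas as pd

    values = np.array(["2022-08-17T19:21:00", "NaT"], dtype="datetime64[ns]")
    idx = pd.DatetimeIndex(values)

    # Vectorized field extraction; each of these is an extra full-length array.
    years, months, days = idx.year, idx.month, idx.day
    hours, minutes = idx.hour, idx.minute
    seconds = idx.second + idx.microsecond / 10**6
    is_missing = idx.isna()  # NaT positions, to be written as null fields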

@theroggy (Member Author) replied:

I did some more testing regarding the "field_value is None" check, and apparently:

  • if you assign None to a datetime64 numpy array, it is automatically converted to NaT
  • if you assign np.nan to a datetime64 numpy array, you get an error

So I removed the "field_value is None" check and added a None case to the unit test, so the test demonstrates this behaviour.
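A quick way to reproduce the behaviour described above (a sketch, not the unit test that was added):

    import numpy as np

    arr = np.array(["2022-08-17"], dtype="datetime64[ms]")

    # Assigning None: numpy converts it to NaT automatically.
    arr[0] = None
    print(np.isnat(arr[0]))  # True

    # Assigning np.nan: numpy refuses to cast the float into a datetime64 slot.
    try:
        arr[0] = np.nan
    except (TypeError, ValueError) as exc:
        print("error:", exc)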

@theroggy marked this pull request as ready for review August 18, 2022 08:00
@theroggy (Member Author) commented:

The docker builds fail with the following error. I'm not familiar with building wheels... any ideas?

Building wheels for collected packages: pyogrio
  Building editable for pyogrio (pyproject.toml): started
  Building editable for pyogrio (pyproject.toml): finished with status 'done'
  Created wheel for pyogrio: filename=pyogrio-refs_pull_146_merge-0.editable-cp38-cp38-linux_x86_64.whl size=4841 sha256=eb576dc15932f04cd06541d26b3d36c010f2542a8827cd32e5b0e394b47d1a4e
  Stored in directory: /tmp/pip-ephem-wheel-cache-arlfft3k/wheels/0e/fe/78/aff354d48d7cbd9cae3d0b91e271d65df869a6d7f6b694513e
  WARNING: Built editable for pyogrio is invalid: Metadata 1.2 mandates PEP 440 version, but 'refs-pull-146-merge' is not
Failed to build pyogrio
ERROR: Could not build wheels for pyogrio, which is required to install pyproject.toml-based projects
Error: Process completed with exit code 1.

@jorisvandenbossche (Member) replied:

The docker builds fail with the following error. I'm not familiar with building wheels... any ideas?

That has been fixed in the meantime; if you merge in the main branch the builds should pass.

@brendan-ward (Member) left a comment

Thanks for working on this @theroggy! One minor comment, but otherwise this looks good to me.

CHANGES.md (review comment on an outdated diff, resolved)
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
@brendan-ward merged commit ddcc3cc into geopandas:main Sep 13, 2022
@theroggy deleted the fix-writing-datetime-nat branch March 9, 2023 12:23
Linked issue (may be closed by merging this pull request): BUG: writing NAT values to a datetime column gives an error (#147)