Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: enable support for writing to memory #397

Merged
merged 12 commits into from
May 2, 2024
Merged

Conversation

brendan-ward
Copy link
Member

@brendan-ward brendan-ward commented Apr 30, 2024

Resolves #249

This enables basic writing to memory using an io.BytesIO object. This does not currently support append / adding layers; I tried working that out unsuccessfully in earlier commits, but we could return to this in a later PR. It seems like it should be possible in theory.

This specifically blacklists ESRI Shapefile and OpenFileGDB because they write multiple files, which can't be returned successfully as a single memory file.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 30, 2024

Some things we still need to add / figure out:

  • append / add layers (still trying to figure out how to make this work)

FWIW, I would personally first focus on getting the plain writing to memory work without append (just raise for now if the mode is append)
(fiona also raises an error if mode="a" for writing to file-like objects)

@brendan-ward
Copy link
Member Author

Fiona has specific tests for appending so I assumed it was possible and expected behavior, but indeed in the interest of getting this out sooner we can more strictly limit this to write-only mode.

@jorisvandenbossche
Copy link
Member

Fiona has specific tests for appending so I assumed it was possible and expected behavior

I assume this only works when directly using the MemoryFile interface. Because at that point you have the same object, where you first write to, and then append to.
But when creating a new memfile, you can't append to it (test). And that is essentially what happens in geopandas when people pass a BytesIO object to to_file, we each time create a new MemoryFile object through fiona.open(), and thus only pure writing is supported (fiona explicitly raise in case of append mode in open() for a writeable object).

So longer term that is also something we could do, but it might requiring creating some similar custom object like fiona's MemoryFile?

pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/raw.py Outdated
raise NotImplementedError("append is not supported for in-memory files")

else:
path = str(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this still call vsi_path(..)? (what it did before in the deleted code below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took it out because I wasn't convinced that it would work properly and my instead emit confusing errors (though stringifying something that would have gotten a VSI path would be confusing too). We don't have any tests that prove writing to /vsi*/ paths would be successful. I also didn't see support for writing to a /vsi*/ path in Fiona; they are largely used for reading.

For instance, it looks like it is not possible to write via the /vsizip/ handler:

write_dataframe(df, "/vsizip//tmp/test.shp.zip", driver="ESRI Shapefile")

Traceback (most recent call last):
...
pyogrio.errors.DataSourceError: Driver ESRI Shapefile does not support write functionality

Are you aware of any things that we can write to a /vsi*/ path that should succeed? I can also restore the original behavior, in case our current unknown / untested functionality is working for someone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory /vsizip should have write capabilities (https://gdal.org/user/virtual_file_systems.html#write-capabilities), but indeed I also don't get that working (maybe the way we handle that path internally isn't fit for working with such a vsi path)

But something like writing to S3 with vsis3 should in theory also be supported? (although that is indeed something we don't test)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quickly tested this with a public test bucket, and I can write to an "s3://" path (which will get converted to /vsis3 by our preprocessing)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing! That confirms we should indeed leave it in place for now.

@brendan-ward brendan-ward marked this pull request as ready for review May 1, 2024 23:29
@brendan-ward brendan-ward changed the title WIP: enable support for writing to memory ENH: enable support for writing to memory May 1, 2024
@@ -582,7 +588,7 @@ def test_write_append_unsupported(tmpdir, naturalearth_lowres, driver, ext):
def test_write_gdalclose_error(naturalearth_lowres):
meta, table = read_arrow(naturalearth_lowres)

filename = "s3://non-existing-bucket/test.geojson"
filename = "/vsis3/non-existing-bucket/test.geojson"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for this change? (we should at least also ensure writing to s3:// works as well, so maybe parametrize the test for both options?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were producing failures when I had disabled handling the vsi paths; since that has been reverted I'm also reverting these.

# Geojson files don't hava a fast way to determine total_bounds
@pytest.mark.parametrize("force_total_bounds", [True, False])
def test_read_info_force_total_bounds(tmpdir, naturalearth_lowres, force_total_bounds):
# Geojson files don't have a fast way to determine total_bounds for GDAL <3.9.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally for later we should find a new file format to test with that doesn't support fast total bounds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like GeoJSONSeq still doesn't support fast total bounds, so I reverted back to using that directly.

pyogrio/tests/test_raw_io.py Outdated Show resolved Hide resolved
pyogrio/raw.py Outdated
raise NotImplementedError("append is not supported for in-memory files")

else:
path = str(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quickly tested this with a public test bucket, and I can write to an "s3://" path (which will get converted to /vsis3 by our preprocessing)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

@brendan-ward brendan-ward merged commit 6b3d3dc into main May 2, 2024
20 checks passed
@brendan-ward brendan-ward deleted the write_memory_filelike branch May 2, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: support writing to in-memory (byte) objects
2 participants