Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

groundwork for V3 group tests #1743

Merged
merged 63 commits into from
May 15, 2024
Merged

groundwork for V3 group tests #1743

merged 63 commits into from
May 15, 2024

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Apr 4, 2024

Adds tests for the Group API in v3. In progress. I will remove the in progress label when it's ready for a final review, but I encourage people with an interest in the testing infrastructure to look at the early changes and weigh in if you see anything you don't like. It will be easier to course-correct before a lot of code is written.

To outline the strategy I'm using:

  • no classes (or at least, no inheritance)
  • parameterize / fixturize as much as possible
  • every method of the Group / AsyncGroup classes gets a test, that should parametrize over the arguments of that method

This is a deliberate departure from the strategy used in the v2 tests, where test classes are defined that inherit from a base test class, potentially overriding methods. I'm pretty sure we can use pytest fixtures to express the same functionality without needing the cognitive overhead of inheritance, but this is my personal judgment, and I'm open to other ideas here. The goal is just to have a test suite that's easy for developers to understand, which (in my experience) is not a criterion the v2 tests always meet.

I don't think I will fix failing tests here. That should probably done in separate PRs, once we agree on the shape of the testing infrastructure. edit: amendment to this: since test failures do slow down the creation of new tests, I'm going to apply some fixes to obvious bugs revealed in the process of adding tests. These fixes should be separable in the commit history, and on the basis of the files they affect.

depends on #1726

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b added the in progress Someone is currently working on this label Apr 4, 2024
@d-v-b
Copy link
Contributor Author

d-v-b commented Apr 5, 2024

Some of my group tests are running into bugs from the storage side. So I'm broadening the scope of this effort to include tests for the stores.

@pep8speaks
Copy link

pep8speaks commented Apr 5, 2024

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 62:29: E203 whitespace before ':'

Comment last updated at 2024-05-10 08:58:14 UTC

@d-v-b d-v-b mentioned this pull request Apr 6, 2024
6 tasks
src/zarr/v3/group.py Outdated Show resolved Hide resolved
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get to reviewing the tests yet but this is looking good.

Comment on lines 171 to 173
# if `key` names an object in storage, it cannot be an array or group
if await store_path.exists():
raise KeyError(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should include this check. This adds a roundtrip to the store to protect against a situation I have a hard time imaging being a real issue.

For one, most storage backends forbid such a situation but I don't think the spec really cares.

/store/a/b/c
/store/a/b/c/zarr.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have this check and we are on v3, then zarr_json_bytes = await (store_path / ZARR_JSON).get() will be None for any object, which will result in an implicit group. I don't think we want this. I'm actually thinking that implicit groups might be more trouble than they are worth.

In fact, getitem will never raise a KeyError for zarr v3, since any prefix is Group | Array, thanks to implicit groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened zarr-developers/zarr-specs#291 to discuss removing implicit groups from the spec. the PR is at zarr-developers/zarr-specs#292

Comment on lines 307 to 308
# would be nice to make these special keys accessible programmatically,
# and scoped to specific zarr versions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rework this comment to explain what the subkeys_filtered variable is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You suggestion to move the tuple below is a good one though.

subkeys = await self.store_path.store.list_dir(self.store_path.path)
# would be nice to make these special keys accessible programmatically,
# and scoped to specific zarr versions
subkeys_filtered = filter(lambda v: v not in ("zarr.json", ".zgroup", ".zattrs"), subkeys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused here. Don't we want to be looking for .zarray to find v2 arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if foo/bar/ is the prefix for a group, then we might expect to find either v2 group stuff (foo/bar/.zattrs, foo/bar/.zgroup) or v3 group stuff (foo/bar/zarr.json), but foo/bar/.zarray would be unexpected, since we know foo/bar is a group, and the same path can't correspond to both a group and an array.

)

raise ValueError(msg)
subkeys = await self.store_path.store.list_dir(self.store_path.path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ideal async way to do this would be for list_dir to return an AsyncGenerator. Then your for loop below would look something like:

async for key in subkeys:
    # filter
    if key in ("zarr.json", ".zgroup", ".zattrs"):
         ...
         yield await self.getitem(subkey)
         
     ...

…(name, AsyncArray / AsyncGroup) pairs; Group.members repackages these into a dict.
def _parse_async_node(node: AsyncGroup) -> Group: ...


def _parse_async_node(node: AsyncArray | AsyncGroup) -> Array | Group:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this exists as something that can be used with map over an iterable of AsyncArray / AsyncGroup to transform each one to its synchronous counterpart. Previously we used a tuple around a generator expression, but mypy couldn't type check this correctly.

return tuple(ret)
_members = self._sync_iter(self._async_group.members())

result = tuple(map(lambda kv: (kv[0], _parse_async_node(kv[1])), _members))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -88,4 +88,4 @@ async def list_dir(self, prefix: str) -> AsyncGenerator[str, None]:
else:
for key in self._store_dict:
if key.startswith(prefix + "/") and key != prefix:
yield key.strip(prefix + "/").split("/")[0]
yield key.removeprefix(prefix + "/").split("/")[0]
Copy link
Contributor Author

@d-v-b d-v-b May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixed a subtle bug in the list_dir method of MemoryStore. str.strip treats its argument as a set of characters to remove, not a substring / prefix. e.g.,

>>> 'a/b/a/b/b/x'.strip('a/b')
'x'

vs

>>> 'a/b/a/b/b/x'.removeprefix('a/b')
'/a/b/b/x'

@d-v-b d-v-b changed the title [WIP] V3 group tests groundwork for V3 group tests May 10, 2024
@d-v-b d-v-b requested a review from jhamman May 10, 2024 09:05
@d-v-b
Copy link
Contributor Author

d-v-b commented May 10, 2024

I think this is ready to go

@d-v-b d-v-b requested a review from normanrz May 10, 2024 09:05
src/zarr/array.py Outdated Show resolved Hide resolved
tests/v2/conftest.py Show resolved Hide resolved
@d-v-b d-v-b removed the in progress Someone is currently working on this label May 10, 2024
@d-v-b d-v-b mentioned this pull request May 10, 2024
)

async def create_array(self, path: str, **kwargs) -> AsyncArray:
async def create_array(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group.create_group and group.create_array should preserve the zarr format of the group, which means we can't accept **kwargs in these methods without checking the contents of **kwargs for the absence of zarr_format. Since I'm not a huge fan of **kwargs anyway, I just elevated the underlying function signatures to this method, sans zarr_format, which is set automatically based on the zarr format of the group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to redo that in #1857

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great @d-v-b. A few small comments for you.

@@ -0,0 +1,3 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was added in error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right, this is the zombie fixture directory that gets created by tests, see #1813

@@ -1,10 +1,802 @@
# import array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a bad merge conflict resolution here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good guess, honestly I have no idea what happened there. it should be fixed now.

src/zarr/group.py Show resolved Hide resolved
@d-v-b
Copy link
Contributor Author

d-v-b commented May 15, 2024

one large, but minor (or petty) change in the latest version of this effort: I renamed test_storage.py to test_store.py, in order to be consistent with the naming: we have zarr.store, not zarr.storage.

@d-v-b d-v-b merged commit f3305d9 into zarr-developers:v3 May 15, 2024
18 checks passed
@d-v-b d-v-b deleted the v3_group_tests branch May 15, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Related to compatibility with V3 spec
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants