New alignment option: `join='strict'` #8698

etienneschalk · 2024-02-03T17:58:43Z

Title: New alignment option: join='strict'

- Closes #8231 (xr.concat concatenates along dimensions that it wasn't asked to)
- Closes #6806 (New alignment option: "exact" without broadcasting OR Turn off automatic broadcasting)
- Tests added
- User visible changes (including notable bug fixes) are documented in whats-new.rst
  - What's new entry
  - Refer to PR ID (cannot be done before the PR has been created)
- New functions/methods are listed in api.rst
  - No new functions/methods.

Motive

This PR is motivated by the following issues:

- xr.concat concatenates along dimensions that it wasn't asked to #8231
- New alignment option: "exact" without broadcasting OR Turn off automatic broadcasting #6806

This PR does not resolve the unexpected behavior described in #8231 unless user code changes. As the tests show, to get the expected behavior the user has to pass the new join='strict' mode suggested in #6806 to the concatenation. Only in that case is the uniqueness of the indexed dimensions' names checked, reusing the logic that was already applied for join='override' in Aligner.find_matching_indexes.

This may not be enough to fix #8231. If it isn't, I can split the PR in two: a first one adding join='strict' for #6806, and a later one for #8231.

Technical Details

I'll try to detail my thought process here. Please correct me if anything is wrong. This is my first time digging into this core logic!

Here is my understanding of the terms:

An indexed dimension is attached to a coordinate variable
An unindexed dimension is not attached to a coordinate variable ("Dimensions without coordinates")
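To make the two terms concrete, here is a minimal sketch using only the public `Dataset` constructor and the `indexes` and `sizes` properties. The variable names `indexed` and `unindexed` are illustrative and do not come from the PR.

```python
import xarray as xr

# "x" is a coordinate variable on dimension "x": the dimension is indexed.
indexed = xr.Dataset(coords={"x": ("x", [1, 2, 3])})

# "a" is a data variable on dimension "x" and no coordinate named "x" exists:
# the dimension is unindexed ("Dimensions without coordinates" in the repr).
unindexed = xr.Dataset(data_vars={"a": ("x", [1, 2, 3])})

print("x" in indexed.indexes)    # the dimension has an index
print("x" in unindexed.sizes)    # the dimension exists ...
print("x" in unindexed.indexes)  # ... but carries no index
```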

Input data for Scenario 1, tested in test_concat_join_coordinate_variables_non_asked_dims

    ds1 = Dataset(
        coords={
            "x_center": ("x_center", [1, 2, 3]),
            "x_outer": ("x_outer", [0.5, 1.5, 2.5, 3.5]),
        },
    )

    ds2 = Dataset(
        coords={
            "x_center": ("x_center", [4, 5, 6]),
            "x_outer": ("x_outer", [4.5, 5.5, 6.5]),
        },
    )
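As a rough illustration of what Scenario 1 exercises, the sketch below concatenates the two datasets along `x_center` with the existing `join="outer"` and `join="exact"` modes. It only uses `xr.concat` as it exists today, not the `join='strict'` mode proposed in this PR, and the `exact_raised` flag is just a helper for the example.

```python
import xarray as xr

ds1 = xr.Dataset(
    coords={
        "x_center": ("x_center", [1, 2, 3]),
        "x_outer": ("x_outer", [0.5, 1.5, 2.5, 3.5]),
    },
)
ds2 = xr.Dataset(
    coords={
        "x_center": ("x_center", [4, 5, 6]),
        "x_outer": ("x_outer", [4.5, 5.5, 6.5]),
    },
)

# join="outer": the x_outer indexes are unioned even though only
# x_center was requested as the concatenation dimension (issue #8231).
merged = xr.concat([ds1, ds2], dim="x_center", join="outer")
print(dict(merged.sizes))

# join="exact": the differing x_outer indexes are rejected instead.
try:
    xr.concat([ds1, ds2], dim="x_center", join="exact")
    exact_raised = False
except ValueError as err:
    exact_raised = True
    print(err)
```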

Input data for Scenario 2, tested in test_concat_join_non_coordinate_variables

    ds1 = Dataset(
        data_vars={
            "a": ("x_center", [1, 2, 3]),
            "b": ("x_outer", [0.5, 1.5, 2.5, 3.5]),
        },
    )

    ds2 = Dataset(
        data_vars={
            "a": ("x_center", [4, 5, 6]),
            "b": ("x_outer", [4.5, 5.5, 6.5]),
        },
    )
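A comparable sketch for Scenario 2, again restricted to the existing join modes. Because `x_outer` has no index here, the size mismatch is expected to be rejected whichever join is used; the `errors` dictionary is only there to collect the messages for the example.

```python
import xarray as xr

ds1 = xr.Dataset(
    data_vars={
        "a": ("x_center", [1, 2, 3]),
        "b": ("x_outer", [0.5, 1.5, 2.5, 3.5]),
    },
)
ds2 = xr.Dataset(
    data_vars={
        "a": ("x_center", [4, 5, 6]),
        "b": ("x_outer", [4.5, 5.5, 6.5]),
    },
)

# x_outer is unindexed with sizes 4 and 3, so alignment cannot reconcile it.
errors = {}
for join in ("outer", "exact"):
    try:
        xr.concat([ds1, ds2], dim="x_center", join=join)
    except ValueError as err:
        errors[join] = str(err)

for join, message in errors.items():
    print(join, "->", message)
```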

The logic for non-indexed dimensions was working "as expected", as it relies on Aligner.assert_unindexed_dim_sizes_equal, which checks that unindexed dimension sizes are equal, as its name suggests. (Scenario 2)

However, the logic for indexed dimensions was surprising: no comparable check on the dimensions' sizes was performed. A check exists in Aligner.find_matching_indexes but was only applied for join='override'. This pull request proposes applying it for join='strict' too.

welcome · 2024-02-03T17:58:46Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

etienneschalk · 2024-02-04T09:42:19Z

CI failed because several tests timed out (> 180s):

https://github.com/pydata/xarray/actions/runs/7768517575/job/21186540956?pr=8698

Does it happen sometimes? If so, is it possible to re-trigger the failed pipeline?

Thanks!

max-sixty

Sorry this didn't get an earlier review...

I left one question.

Others know more about this area so they should feel free to comment; and it's a big enough change that I'll ask others to approve before merging

xarray/tests/test_concat.py

max-sixty · 2024-02-09T19:45:42Z

xarray/tests/test_dataarray.py

+    def test_align_exact_vs_strict(self) -> None:
+        xda_1 = xr.DataArray([1], dims="x1")
+        xda_2 = xr.DataArray([1], dims="x2")
+
+        # join='exact' passes
+        aligned_1, aligned_2 = xr.align(xda_1, xda_2, join="exact")
+        assert aligned_1 == xda_1
+        assert aligned_2 == xda_2


[edited from earlier incorrect response]

~~The existing behavior does seem quite surprising. Is it only an issue with 1D arrays?~~

Another option would be refining exact. It sounds like you tried this but many tests failed. It might be worth pushing that PR if you still have it.

I'd want to ask others why we don't enforce identical dimension names...

> The existing behavior does seem quite surprising. Is it only an issue with 1D arrays?

I added a test with 2D arrays; they are affected too.

As a simple test, I can systematically transform 'exact' into 'strict' when entering the alignment logic and push that as a separate draft PR (#8729) to see which tests fail.

Edit: Results on the "bare-minimum" CI job: 96 failed. It seems that in many cases the leniency is desirable (e.g. many of the failing tests contain the string broadcast in their names).

Example: test_cat_broadcast_left.

Stacktrace

    /home/me/dev/xarray/xarray/tests/test_accessor_str.py::test_cat_broadcast_left[bytes] failed: dtype = <class 'numpy.bytes_'>

        def test_cat_broadcast_left(dtype) -> None:
            values_1 = xr.DataArray(
                ["a", "bb", "cccc"],
                dims=["Y"],
            ).astype(dtype)
            values_2 = xr.DataArray(
                [["11111", "222", "33"], ["4", "5555", "66"]],
                dims=["X", "Y"],
            )
            targ_blank = (
                xr.DataArray(
                    [["a11111", "bb222", "cccc33"], ["a4", "bb5555", "cccc66"]],
                    dims=["X", "Y"],
                )
                .astype(dtype)
                .T
            )
            targ_space = (
                xr.DataArray(
                    [["a 11111", "bb 222", "cccc 33"], ["a 4", "bb 5555", "cccc 66"]],
                    dims=["X", "Y"],
                )
                .astype(dtype)
                .T
            )
            targ_bars = (
                xr.DataArray(
                    [["a||11111", "bb||222", "cccc||33"], ["a||4", "bb||5555", "cccc||66"]],
                    dims=["X", "Y"],
                )
                .astype(dtype)
                .T
            )
            targ_comma = (
                xr.DataArray(
                    [["a, 11111", "bb, 222", "cccc, 33"], ["a, 4", "bb, 5555", "cccc, 66"]],
                    dims=["X", "Y"],
                )
                .astype(dtype)
                .T
            )
    >       res_blank = values_1.str.cat(values_2)

    /home/me/dev/xarray/xarray/tests/test_accessor_str.py:3319:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    /home/me/dev/xarray/xarray/core/accessor_str.py:508: in cat
        return self._apply(
    /home/me/dev/xarray/xarray/core/accessor_str.py:232: in _apply
        return _apply_str_ufunc(
    /home/me/dev/xarray/xarray/core/accessor_str.py:130: in _apply_str_ufunc
        return apply_ufunc(
    /home/me/dev/xarray/xarray/core/computation.py:1270: in apply_ufunc
        return apply_dataarray_vfunc(
    /home/me/dev/xarray/xarray/core/computation.py:295: in apply_dataarray_vfunc
        deep_align(
    /home/me/dev/xarray/xarray/core/alignment.py:977: in deep_align
        aligned = align(
    /home/me/dev/xarray/xarray/core/alignment.py:913: in align
        aligner.align()
    /home/me/dev/xarray/xarray/core/alignment.py:594: in align
        self.assert_equal_dimension_names()
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    self = <xarray.core.alignment.Aligner object at 0x7f1e824e00a0>

        def assert_equal_dimension_names(self) -> None:
            # Strict mode only allows objects having the exact same dimensions' names.
            if not self.join == "strict":
                return

            unique_dims = set(tuple(o.sizes) for o in self.objects)
            all_objects_have_same_dims = len(unique_dims) == 1
            if not all_objects_have_same_dims:
    >           raise ValueError(
                    f"cannot align objects with join='strict' "
                    f"because given objects do not share the same dimension names ({[tuple(o.sizes) for o in self.objects]!r}); "
                    f"try using join='exact' if you only care about equal indexes"
                )
    E           ValueError: cannot align objects with join='strict' because given objects do not share the same dimension names ([('Y',), ('X', 'Y')]); try using join='exact' if you only care about equal indexes

    /home/me/dev/xarray/xarray/core/alignment.py:498: ValueError
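The failing test depends on broadcasting between objects whose dimension names differ, `('Y',)` against `('X', 'Y')`. The sketch below shows the same pattern with numeric data and the existing `join="exact"`; the values are made up for illustration and plain addition stands in for the `str.cat` call from the test.

```python
import xarray as xr

values_1 = xr.DataArray([1, 2, 3], dims=["Y"])
values_2 = xr.DataArray([[10, 20, 30], [40, 50, 60]], dims=["X", "Y"])

# join="exact" accepts these objects: neither has an index, and the only
# shared dimension, "Y", has the same size in both.
aligned_1, aligned_2 = xr.align(values_1, values_2, join="exact")

# Arithmetic then broadcasts values_1 along the missing "X" dimension.
total = values_1 + values_2
print(dict(total.sizes))
print(total.transpose("X", "Y").values)
```

A check that requires identical dimension names on every operand rejects this case before any broadcasting can happen, which is consistent with the many broadcast-related failures reported above.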

xarray/tests/test_dataarray.py

max-sixty · 2024-02-10T00:14:07Z

OK, I think I see what's going on now! Sorry I was slow. I appreciate you doing so much to make it easy to engage with the issue.

So when we say "'exact' allows shape-compatible DataArrays to be aligned with differing dimension names, whereas 'strict' forbids.", the existing behavior doesn't align dimensions with different names — it just ignores them.

My mental model has that as correct behavior — I haven't observed much demand for "only align these arrays if all the dimensions match"

Is that consistent with your view?

The change I would support is #6806, which would make strict raise an error if one object has a dimension with size 1 and another has a dimension with size n. Currently exact allows that.

Does that make sense? Lmk if I'm still missing something.

xarray/tests/test_concat.py

etienneschalk · 2024-02-10T11:42:14Z

Hello,

Thank you for taking the time to give me feedback on this PR!

⁂

> the existing behavior doesn't align dimensions with different names — it just ignores them.

Yes, they are not considered. It is like manipulating raw numpy arrays: the check is structural rather than nominal, to borrow the terminology used to describe type systems. So even if two DataArrays are structurally equivalent, the 'strict' mode would fail because of the difference in names. The Python community definitely leans towards structural typing: operations on numpy arrays rely on the structure of the arrays, PEP 544 – Protocols: Structural subtyping (static duck typing), etc.

But sometimes, having the safety of nominal typing can be useful. In a way, this is what xarray seems to aim to address by naming dimensions. Indeed, it is common to get confused with x, y, row, col, latitude, longitude conventions when working with georeferenced rasters, for instance. Structure alone is in such cases insufficient to work confidently with data, and xarray helps with that.

I found this interesting quote on the Wikipedia page of Structural type system:

> A pitfall of structural typing versus nominative typing is that two separately defined types intended for different purposes, but accidentally holding the same properties (e.g. both composed of a pair of integers), could be considered the same type by the type system, simply because they happen to have identical structure. One way this can be avoided is by creating one algebraic data type for each use.

This issue is also well-known in the TypeScript world. See How do I prevent two types from being structurally compatible?

So to summarize, my mental model is the following:

'exact' is structural
'strict' is nominal
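The distinction can be sketched with today's xarray by emulating the nominal check in user code. This is only an approximation under stated assumptions: `same_dimension_names` and `outcome` are hypothetical helpers, the name comparison mirrors the `set(tuple(o.sizes) ...)` expression quoted in the traceback earlier in this thread, and `"strict"` is handled by the helper rather than passed to `xr.align`.

```python
import xarray as xr


def same_dimension_names(*objects) -> bool:
    """Hypothetical helper: True if all objects have the same dimension names.

    Like the quoted check, this compares tuples, so it is order-sensitive.
    """
    return len({tuple(obj.sizes) for obj in objects}) == 1


def outcome(join: str, a: xr.DataArray, b: xr.DataArray) -> str:
    """Return "ok" or "error" for aligning a and b with the given join."""
    try:
        if join == "strict":  # emulated here, not a join accepted by xr.align
            if not same_dimension_names(a, b):
                raise ValueError("objects do not share the same dimension names")
            join = "exact"
        xr.align(a, b, join=join)
        return "ok"
    except ValueError:
        return "error"


cases = {
    "same_dim_same_size": (xr.DataArray([1], dims="x"), xr.DataArray([1], dims="x")),
    "same_dim_differing_sizes": (xr.DataArray([1], dims="x"), xr.DataArray([1, 2], dims="x")),
    "differing_dims_same_sizes": (xr.DataArray([1], dims="x1"), xr.DataArray([1], dims="x2")),
    "differing_dims_differing_sizes": (xr.DataArray([1], dims="x1"), xr.DataArray([1, 2], dims="x2")),
}

results = {
    name: (outcome("exact", a, b), outcome("strict", a, b))
    for name, (a, b) in cases.items()
}
for name, (exact, strict) in results.items():
    print(f"{name}: exact={exact}, strict={strict}")
```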

⁂

> The change I would support is #6806, which would make strict raise an error if one object has a dimension with size 1 and another has a dimension with size n. Currently exact allows that.

I understood that the issue described in #6806 (comment) was related to arrays having the same structure but different dimension names being successfully aligned, not related to dimension sizes.

Also, I understand that implementing an option to use 'strict' by default would solve the scenario described in #6806 (comment) (not implemented in this PR)

I added new tests to experiment. Here is the test matrix for the current implementation:

| `test_align_exact_vs_strict_*` | dim names | Nominally equivalent (same dim names) | dim sizes | Structurally equivalent (same dim sizes) | exact | strict | Error message |
|---|---|---|---|---|---|---|---|
| `*_same_dim_same_size` | x, x | Yes | 1, 1 | Yes | 🆗 | 🆗 | - |
| `*_same_dim_differing_sizes` | x, x | Yes | 1, 2 | No | ❌ | ❌ | (E1) |
| `*_differing_dims_same_sizes` | x1, x2 | No | 1, 1 | Yes | 🆗 | ❌ | (E2) |
| `*_differing_dims_differing_sizes` | x1, x2 | No | 1, 2 | No | 🆗 | ❌ | (E2) |

(E1): Structural error

"cannot reindex or align along dimension 'x' because of "
"conflicting dimension sizes: {1, 2}"

(E2): Nominal error (check happens before the structural error has a chance to happen)

"cannot align objects with join='strict' "
"because given objects do not share the same dimension names "
"([('x1',), ('x2',)])"

Sorry, this is getting very long!

max-sixty · 2024-02-10T19:19:53Z

(Great! I asked a question at #6806 to clarify the request)

doc/whats-new.rst

xarray/core/alignment.py

xarray/core/combine.py

dcherian · 2024-02-18T22:01:22Z

xarray/core/computation.py

@@ -969,6 +974,8 @@ def apply_ufunc(
        dimensions as input and vectorize it automatically with
        :py:func:`numpy.vectorize`. This option exists for convenience, but is
        almost always slower than supplying a pre-vectorized function.
+    broadcast : bool


I think we should skip apply_ufunc for now (and just ignore arithmetic_broadcast in dot)

Seems like there are some confusing implications to implementing it. See #1618 for an alternative approach.

dcherian · 2024-02-18T22:03:20Z

xarray/core/options.py

@@ -91,26 +94,35 @@ def _positive_integer(value: int) -> bool:
    return isinstance(value, int) and value > 0


+def _is_boolean(value: Any) -> bool:
+    return isinstance(value, bool)


minor comment: This is a matter of taste but this feels like a fairly tiny benefit for the added indirection. The diff would be a lot smaller too, if you just followed the existing copy-paste approach.

I tried to apply "opportunistic refactoring", since I was already updating this part of the code. Currently the approach is mixed: lambdas are used, but a function _positive_integer already exists, so I tried to align all the existing checks with _positive_integer. To me, at least, it looks a little cleaner to refer to one function several times than to define several duplicated lambdas.
I understand, however, that the main downside is the unrelated noise it adds to the diff; sorry for that. Now that the change is made, should I revert it or keep it?

xarray/tests/test_dataarray.py

dcherian · 2024-02-18T22:10:22Z

xarray/tests/test_concat.py

+        )
+
+
+@pytest.mark.parametrize("join", ("outer", "exact"))


Let's merge this with the previous test.

They are not exactly the same: the previous test, test_concat_join_coordinate_variables_non_asked_dims, covers coordinate variables, while this one, test_concat_join_non_coordinate_variables, covers non-coordinate variables. The parametrization is also not possible for the previous test, because the behaviour differs between join=outer and join=exact for coordinate variables.

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

dcherian · 2024-02-24T20:24:56Z

xarray/core/computation.py

@@ -282,6 +282,7 @@ def apply_dataarray_vfunc(
    *args,
    signature: _UFuncSignature,
    join: JoinOptions = "inner",
+    broadcast: bool = True,


Sorry for being unclear (again). I don't think we need this in align at all.

We should simply be checking OPTIONS["arithmetic_broadcast"] in Variable._binary_op. The whole business with align was a misunderstanding of mine.
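To illustrate the shape of that suggestion without touching xarray internals, here is a toy sketch in plain Python. Everything in it is an assumption made for the example: the module-level `OPTIONS` dictionary, the `result_dims` function, and the rule that operands must have the same set of dimension names are stand-ins, not the code of `Variable._binary_op`.

```python
# Hypothetical global options table, standing in for a library-wide setting.
OPTIONS = {"arithmetic_broadcast": True}


def result_dims(self_dims: tuple, other_dims: tuple) -> tuple:
    """Return the dimensions a binary operation would produce.

    When broadcasting is disabled, operands whose dimension names differ
    are rejected up front instead of being broadcast against each other.
    """
    if not OPTIONS["arithmetic_broadcast"] and set(self_dims) != set(other_dims):
        raise ValueError(
            f"broadcasting is disabled and dimensions differ: "
            f"{self_dims!r} vs {other_dims!r}"
        )
    # Unified dimensions in order of first appearance.
    return tuple(dict.fromkeys((*self_dims, *other_dims)))


print(result_dims(("Y",), ("X", "Y")))  # broadcasting allowed by default

OPTIONS["arithmetic_broadcast"] = False
try:
    result_dims(("Y",), ("X", "Y"))
except ValueError as err:
    print(err)
```

The appeal of placing the check at this level is that alignment stays untouched: only the arithmetic step consults the option.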

I will close this PR, as it has diverged too much from the originally intended behavior.

New alignment option: join='strict'

bbe7d05

etienneschalk and others added 4 commits February 4, 2024 20:24

Fix what's new newlines + retrigger CI

cddcaa1

wrong join

37a7b09

Merge branch 'main' into eschalk/issue-8231-align

a93e44b

Merge branch 'main' into eschalk/issue-8231-align

ed4873e

max-sixty reviewed Feb 9, 2024

max-sixty reviewed Feb 9, 2024

Added test align 2d three arrays

c6b1df5

etienneschalk commented Feb 9, 2024

etienneschalk mentioned this pull request Feb 9, 2024

Reinforce alignment checks when join='exact' #8729

Closed

max-sixty reviewed Feb 10, 2024

Added tests and use assert_identical

ed0414a

Merge branch 'main' into eschalk/issue-8231-align

a8bc8dc

max-sixty mentioned this pull request Feb 10, 2024

New alignment option: "exact" without broadcasting OR Turn off automatic broadcasting #6806

Closed

etienneschalk and others added 7 commits February 11, 2024 13:13

Merge branch 'main' into eschalk/issue-8231-align

451c96f

Added tests for join=exact

f46b19b

Try replacing join=strict by broadcast=False

95295d1

Merge branch 'eschalk/issue-8231-align' of github.com:etienneschalk/x…

80f5e1b

…array into eschalk/issue-8231-align

More broadcasts

d84c688

Merge branch 'main' into eschalk/issue-8231-align

06c2077

CI failed: mypy + warnings

a7148d6