xarray usage #1

Merged: 1 commit, merged on Mar 24, 2021

Conversation

@OriolAbril (Contributor) commented Mar 24, 2021

Hi, great post!

I saw the code, and the first plot is quite similar to other plots I have done, so I already more or less had a template for how to do them with xarray. It is more verbose than pandas (in general xarray is quite verbose, as everything is labeled), but I also find it clearer and more explicit.

Basically, what is happening is that we are working with qs as an xarray Dataset with three variables: var_name, "truth" and "truth_in_interval". All three variables have the skater_name dimension, and var_name has an extra quantile dimension.
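To make that structure concrete, a hypothetical minimal version of such a dataset could look like this (illustrative names and values, not the post's actual data):

import numpy as np
import xarray as xr

var_name = "ability"  # hypothetical stand-in for the actual parameter name
skaters = ["skater_a", "skater_b", "skater_c"]
quantiles = [0.1, 0.5, 0.9]
qs = xr.Dataset(
    {
        var_name: (("quantile", "skater_name"), np.random.rand(3, 3)),
        "truth": (("skater_name",), np.random.rand(3)),
    },
    coords={"quantile": quantiles, "skater_name": skaters},
)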

Disclaimer: I have not tried to run this! ⚠️

Feel free to do whatever you want with the code and the PR; I just wanted to share this possibility to avoid the xarray-pandas conversion.

(qs["truth"] > qs[var_name].sel(quantile=0.1))
)
qs = qs.sortby(qs[var_name].sel(quantile=0.9))
y = np.linspace(*ax.get_ylim(), qs.dims["skater_name"])
@OriolAbril (Contributor, Author)

This is probably the piece that is most xarray-specific and can be confusing. qs.dims["skater_name"] returns the length of the dimension as an integer. The key is the difference between a dimension and a coordinate. Here both have the same name, as in all ArviZ-generated datasets, but this is not a requirement in xarray. qs.skater_name or qs.coords["skater_name"] will return a DataArray with all the skater name values.
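To illustrate the difference on a toy example (made-up names, not the actual qs):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"truth": (("skater_name",), np.zeros(3))},
    coords={"skater_name": ["a", "b", "c"]},
)
print(ds.dims["skater_name"])    # 3 -- the dimension length, as an integer
print(ds.coords["skater_name"])  # DataArray holding the skater name values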

@teddygroves (Owner)

Thanks a lot for the contribution! Loads of nice xarray things in there that I hadn't seen before and definitely an improvement in explicitness.

I've never put the effort into properly learning how to interact with arviz InferenceData objects as xarrays, because I found I often wanted to do a multi-dimensional groupby, which I think xarray doesn't currently support. For example, here I needed to count the number of posterior draws in each outcome for each measurement, and did:

infd.posterior_predictive["yrep"]
        .to_series()
        ...
        .groupby(["measurement", "yrep"])
        .size()

Do you by any chance know an equivalent xarray way of doing this?

@teddygroves merged commit d114e7f into teddygroves:master on Mar 24, 2021
@OriolAbril deleted the arviz branch on March 24, 2021 at 11:24
@OriolAbril (Contributor, Author)

For this one I'd have to take a look and play around a bit before getting somewhere; I would actually love to do it, but I don't currently have much time. I can see two ways forward:

  • Use a single groupby over yrep with .map and a function that unstacks, converts to boolean with isnan, and sums, reducing all dims but measurement.
  • Use something like Grouping with multiple levels pydata/xarray#1569. You would then not be able to count directly but would have to explicitly loop over the groups and count (a simpler loop-based variant is sketched after this list).
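A rough sketch of the loop-and-count idea, staying in xarray and skipping the multi-index machinery entirely (assuming yrep is infd.posterior_predictive["yrep"] with dims chain, draw and measurement; the names here are illustrative):

import numpy as np
import xarray as xr

yrep = infd.posterior_predictive["yrep"]  # assumed dims: (chain, draw, measurement)
outcomes = np.unique(yrep.values)         # distinct outcome values
counts = xr.concat(
    [(yrep == k).sum(dim=("chain", "draw")) for k in outcomes],
    dim="outcome",
).assign_coords(outcome=outcomes)
# counts has dims (outcome, measurement): draws per outcome for each measurement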

One thing I don't see clearly is the conversion from counts to probability: is this done by heatmap under the hood? Is it done over rows, or over columns? I'm not sure whether doing this explicitly and then plotting with imshow or something similar could help somewhere.

Neither of the two above is ideal; xarray should allow multilevel groupbys, and afaik that's a work in progress (pydata/xarray#324) but not yet available. It could be that, even if not ideal, one of these is a bit better than converting to pandas, but I'm not sure I'd bet on that 😅

Extra comment: I noticed you have astype(int). Is the data in yrep not already stored as int? Do you know where this happens? Is it due to yrep being a vector in the Stan code, something on the cmdstanpy side, or does it happen when converting to InferenceData?

@teddygroves (Owner)

Thanks very much for the suggestions - I'll have a go.

The conversion from counts to probability is done at this line:

        .pipe(lambda df: df.div(df.sum(axis=1), axis=0))

It's not very explicit as you need to know what axes 0 and 1 represent, and what the axis arguments to sum and div do.
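For comparison, a labeled version of the same normalization, assuming the counts sit in a DataArray with measurement and outcome dimensions (illustrative names and numbers):

import numpy as np
import xarray as xr

counts = xr.DataArray(
    np.array([[3, 1, 0], [2, 1, 1]]),
    dims=("measurement", "outcome"),
)
# same operation as df.div(df.sum(axis=1), axis=0), but the reduced dimension is named
probs = counts / counts.sum(dim="outcome")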

I hadn't actually noticed that yrep was a vector in the Stan code - thanks for the tip! Interestingly, even after correcting the Stan code, infd.posterior_predictive["yrep"] still ends up with dtype float64:

In [113]: infd.posterior_predictive["yrep"]
Out[121]: 
<xarray.DataArray 'yrep' (chain: 4, draw: 2000, measurement: 2345)>
[18760000 values with dtype=float64]
Coordinates:
  * chain        (chain) int64 0 1 2 3
  * draw         (draw) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999
  * measurement  (measurement) int64 0 1 2 3 4 5 ... 2340 2341 2342 2343 2344

@OriolAbril (Contributor, Author)

We did something that could be similar to what needs to happen here back in the pymc3 examples: https://nbviewer.jupyter.org/github/pymc-devs/pymc-examples/blob/main/examples/case_studies/rugby_analytics.ipynb (see the posterior predictive section towards the end). It ends up with a loop and uses np.unique to make the "histogram"/get the rank counts of each team, but there are other alternatives.
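For reference, the loop-plus-np.unique pattern used there looks roughly like this (names are illustrative; pp["rank"] is assumed to have a team dimension as in the rugby example):

import numpy as np

rank_counts = {}
for team in pp["rank"].coords["team"].values:
    ranks = pp["rank"].sel(team=team).values.ravel()   # all posterior rank draws for this team
    values, counts = np.unique(ranks, return_counts=True)
    rank_counts[team] = dict(zip(values, counts))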

IMO, the ideal situation (for the rugby case) would be to have an implementation of np.bincount as a ufunc, so we can choose which dimensions it should reduce over and which should be batched. I see two quick ways of doing that. These two snippets should go between the computation of pp["rank"] and the plotting. They generate a DataArray sim_da instead of a dataframe; sim_da can be converted to sim_table with sim_table = sim_da.to_series().unstack(0).

Option 1: az.wrap_xarray_ufunc

xr.apply_ufunc is already amazing; az.wrap_xarray_ufunc goes even further (at the expense of not being very efficient) and allows using any kind of function, which it converts into a ufunc and then applies to the xarray object with xr.apply_ufunc.

import arviz as az
import numpy as np

sim_da = az.wrap_xarray_ufunc(
    np.bincount,
    pp["rank"].stack(sample=("chain", "draw")) - 1,
    input_core_dims=[["sample"]],
    output_core_dims=[["position"]],
    func_kwargs={"out_shape": (6,), "minlength": 6},
).assign_coords(position=np.arange(1, 7)) / 4000

Option 2: manually create the bincount ufunc with numba

We can also create a ufunc that behaves like np.bincount "properly" and then use xr.apply_ufunc like we did with rankdata. It's "easy" to do this with numba:

import numba

# this is a ufunc version of np.bincount with minlength=len(dummy)
# the dummy input is needed for numba to be able to compile the function
# otherwise it doesn't know the shape of the output
@numba.guvectorize(["void(int64[:], int64[:], int64[:])"], "(m),(n)->(n)")
def rankcount(x, dummy, res):
    m = len(dummy)
    n = len(x)
    for j in range(m):
        res[j] = 0
    for i in range(n):
        res[x[i]-1] += 1

This can be applied to arrays of any dimension, and will count along the last dimension while batching over all other dimensions. We have to use this dummy-argument trick to get guvectorize to work, which is not ideal though; another numba alternative is below. We can apply it with:

sim_da = xr.apply_ufunc(
    rankcount,
    pp["rank"].stack(sample=("chain", "draw")), 
    np.empty(6, dtype=int), 
    input_core_dims=[["sample"], []],
    output_core_dims=[["position"]]
).assign_coords(position=np.arange(1, 7)) / 4000

We can also create a function that takes a 2d input and bincounts over the last dimension while batching over the first dimension. This needs no tricks and should be faster than the guvectorized version, and may even be faster than np.bincount.

@numba.jit(["int64[:,:](int64[:,:])"])
def rankcount2d(x):
    m = x.shape[0]
    n = x.shape[1]
    res = np.zeros((m, m), dtype=np.int64)
    for j in numba.prange(m):
        for i in range(n):
            res[j, x[j, i]-1] += 1
    return res

sim_da = xr.apply_ufunc(
    rankcount2d,
    pp["rank"].stack(sample=("chain", "draw")),
    input_core_dims=[["sample"]],
    output_core_dims=[["position"]]
).assign_coords(position=np.arange(1, 7)) / 4000
