Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import weighted stats and moments from StatsBase to Statistics #31395

Closed
wants to merge 12 commits into from

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Mar 18, 2019

This includes methods for mean, quantile, median, var, std, cov and cor, plus new functions skewness and kurtosis, and weight types. Code is copied from StatsBase with some cleanup where needed, in particular for dispatch, to move from @nloops/@nrefs to cartesian indexing and to
be closer to the mapreducedim code. Weights are now passed via a keyword argument rather than by dispatching on AbstractWeights, so as to support any array where all weights types give the same result.

Still to address:

  • Do we want mean_and_var and mean_and_std? It's not too hard to do m = mean(...); s = std(..., mean=m) manually. Also the names aren't great. UPDATE: dropped.
  • Do we want moment? It feels redundant with mean, var, skewness and kurtosis. UPDATE: dropped.
  • Should we also support weighted sum? For now I've kept wsum internal (it is used by mean), it can always be implemented later in Base. UPDATE: added.
  • Check that performance didn't regress when porting from @nloop/@nref to cartesian indexing. UPDATE: performance is similar, and even sometimes faster.
  • Clean tests (notably ensuring that non-vector and non-AbstractWeights weights are supported).
  • Check that everything is clean, in particular docstrings.

Closes https://github.com/JuliaLang/julia/issues/29974. See #27152 (comment) for an outline of a broader roadmap.

@nalimilan nalimilan added stdlib Julia's standard library domain:statistics The Statistics stdlib module and removed stdlib Julia's standard library labels Mar 18, 2019
@ararslan
Copy link
Member

Do we want mean_and_var and mean_and_std? It's not too hard to do m = mean(...); s = std(..., mean=m) manually. Also the names aren't great.

👍 to dumping them.

Do we want moment? It feels redundant with mean, var, skewness and kurtosis.

Unless we're going to support arbitrary moments (or just 5th or higher), I'd say leave it out.

Should we also support weighted sum?

👍 to supporting it.

@rofinn
Copy link
Contributor

rofinn commented Mar 19, 2019

Out of curiosity, what's the advantage with having this in Statistics? I only ask because there are still some open issues with the weights type in StatsBase that we might want to address (e.g., n-dimensional weights, mutability of weights).

@ararslan
Copy link
Member

Out of curiosity, what's the advantage with having this in Statistics?

One big one is better APIs, for example mean(x, weights=w), which also ensures that cases like JuliaStats/StatsBase.jl#475 don't occur.

@rofinn
Copy link
Contributor

rofinn commented Mar 19, 2019

Oh, yeah, I guess that makes sense. Looks like github also supports transferring issues to another repo https://help.github.com/en/articles/transferring-an-issue-to-another-repository. I would prefer that we try to retain the history for the files that we're actively copying over.

@nalimilan
Copy link
Member Author

Another advantage is to eventually remove StatsBase, whose name is confusing since it's not more "base" than Statistics (quite the contrary actually).

I would prefer that we try to retain the history for the files that we're actively copying over.

I'm not sure that's really possible, as the code cannot be imported as-is and pass tests. So we would have to keep broken commits in history, which isn't great for bisecting. We can still refer to the StatsBase repo for history, though.

This includes methods for mean, quantile, median, var, std, cov and cor,
plus new functions skewness and kurtosis, and weight types.
Code is copied from StatsBase with some cleanup where needed, in particular
for dispatch, to move from `@nloops`/`@nrefs` to cartesian indexing and to
be closer to the mapreducedim code.
Weights are now passed via a keyword argument rather than by dispatching on
AbstractWeights, so as to support any array where all weights types
give the same result.
@nalimilan
Copy link
Member Author

I've implemented sum(x; weights, dims) in Base. This is tricky since StatsBase had optimized methods using BLAS, which make a large difference for performance. The solution I found is to have the generic fallbacks in reducedim.jl, and add methods for BlasReal from LinearAlgebra.

I've also cleaned a few things and checked that performance hasn't regressed, so I think the PR is now ready for a serious review. This is a large piece of code, and the reductions are particularly tricky, so double-checking would really be appreciated.

@nalimilan nalimilan marked this pull request as ready for review May 8, 2019 16:40
isempty(r) && return oftype((first(r) + last(r)) / 2, NaN)
(first(r) + last(r)) / 2
end

median(r::AbstractRange{<:Real}) = mean(r)
_mean(A::AbstractArray, dims, weights::Nothing) =
_mean!(Base.reducedim_init(t -> t/2, Base.add_sum, A, dims), A, nothing)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code used + instead of add_sum, but I figured it would be more consistent with sum. In practice I don't think it makes a difference for standard types because of the /2 which changes the type to floating point.

The Statistics module contains basic statistics functionality.
The Statistics module contains basic statistics functionality: mean, median, quantiles,
standard deviation, variance, skewness, kurtosis, correlation and covariance.
Statistics can be weighted, and several weights types are distinguished to apply appropriate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should read "several weight types" instead of "several weights types". If you're referring to the actual type then " several AbstractWeights types" maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "weight type"? The plural sounds more appropriate since weights only make sense as a series of values. Though "types of weights" might be better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just sounded weird to me. I think "types of weights" is what sounds best.


"""
var(itr; dims, corrected::Bool=true, mean=nothing)
var(itr; corrected::Bool=true, [weights::AbstractWeights], [mean], [dims])
Copy link
Contributor

@rofinn rofinn Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should support weights::AbstractArray for consistency with other methods like mean? We could always convert non-weight arrays with Weights(weights). Same applies below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we can't support arbitrary vectors since there's no varcorrection for them. Weights isn't supported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but you should still be able to do var(itr; weights=..., corrected=false).

varcorrection(w::AnalyticWeights, corrected=false)

* `corrected=true`: ``\\frac{1}{\\sum w - \\sum {w^2} / \\sum w}``
* `corrected=false`: ``\\frac{1}{\\sum w}``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to include links here as well (e.g., https://en.wikipedia.org/wiki/Weighted_arithmetic_mean)


wsum = sum(w)
wsum == 0 && throw(ArgumentError("weight vector cannot sum to zero"))
length(v) == length(w) || throw(ArgumentError("data and weight vectors must be the same size," *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need to split it over multiple lines then you might want to just use an if?

x < 0 && throw(ArgumentError("weight vector cannot contain negative entries"))
end

isa(w, FrequencyWeights) && !(eltype(w) <: Integer) && any(!isinteger, w) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, maybe just use an if block.

end

Base.isequal(x::AbstractWeights, y::AbstractWeights) = false
Base.:(==)(x::AbstractWeights, y::AbstractWeights) = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a newline at end of file.

Copy link
Contributor

@rofinn rofinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that's really possible, as the code cannot be imported as-is and pass tests. So we would have to keep broken commits in history, which isn't great for bisecting. We can still refer to the StatsBase repo for history, though.

How do you propose we refer to the StatsBase history? FWIW, I'd prefer that we retain history for this data type as a lot of changes/decisions were made while they were being developed. It is also helpful for github's suggested reviewers feature. Finally, I think this would be better to add after the stdlibs have been moved into separate packages as julia can just pin the Statistics version to a particular release, but folks can easily free/update to these new features regardless of which julia version they're using. It'll also be easier add features and bug fixes for these types without depending on the next julia release.

@nalimilan
Copy link
Member Author

How do you propose we refer to the StatsBase history? FWIW, I'd prefer that we retain history for this data type as a lot of changes/decisions were made while they were being developed. It is also helpful for github's suggested reviewers feature.

As I said, I don't know whether that's possible, and I'm not aware of a precedent. Do you have ideas?

Finally, I think this would be better to add after the stdlibs have been moved into separate packages as julia can just pin the Statistics version to a particular release, but folks can easily free/update to these new features regardless of which julia version they're using. It'll also be easier add features and bug fixes for these types without depending on the next julia release.

What would be the advantage of waiting? We can keep exporting the types from StatsBase (and just re-exporting when Statistics provides them), so that it keeps working as it does on all Julia versions. Then once stdlibs can be versioned, people will be able to start depending only on a given Statistics version, and drop the StatsBase dependency. But waiting blocks any progress on the StatsBase front (this PR is just a small part of the needed work).

cm3 = cm2 * z # empirical 3rd centered moment
n = 1
y = iterate(x, s)
while y !== nothing
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this kind of loop is slower than what can be achieved for AbstractArray using @inbounds @simd for i in eachindex(A). The only solution AFAICT is to add a special method for AbstractArray -- but better do that in another PR as this one is quite large already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason that you can't use @inbounds @simd for i in eachindex(A) because you're starting on the 2nd iterate in this loop?


!!! note
- The weight vector is a light-weight wrapper of the input vector.
The input vector is NOT copied during construction.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The input vector is NOT copied during construction.
The input vector is *not* copied during construction.

# Return the NaN of the type that we would get, had this collection
# contained any elements (this is consistent with var)
z0 = zero(T) - zero(m)
return (z0^3 + z0^3)/sqrt((z0^2+z0^2)^3)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of initialization is probably overzealous, but I always find it hard to decide what simplifications are OK.

check_reducedims(R,A)
reddims = size(R) .!= size(A)
dim = something(findfirst(reddims), ndims(R)+1)
if dim > N
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is quite ugly but I'm not sure what's the best solution. For unweighted sum, reducing over dim > N is a no-op, so that's easy, but for the weighted sum it amounts to multiplying values by their corresponding weight. Maybe this should just be an error?


"""
var(itr; dims, corrected::Bool=true, mean=nothing)
var(itr; corrected::Bool=true, [weights::AbstractWeights], [mean], [dims])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we can't support arbitrary vectors since there's no varcorrection for them. Weights isn't supported.

The Statistics module contains basic statistics functionality.
The Statistics module contains basic statistics functionality: mean, median, quantiles,
standard deviation, variance, skewness, kurtosis, correlation and covariance.
Statistics can be weighted, and several weights types are distinguished to apply appropriate
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "weight type"? The plural sounds more appropriate since weights only make sense as a series of values. Though "types of weights" might be better.

@rofinn
Copy link
Contributor

rofinn commented Jul 26, 2019

As I said, I don't know whether that's possible, and I'm not aware of a precedent. Do you have ideas?

Yes, we do this a reasonable amount at work.

  1. You'd use git filter-branch in your StatsBase.jl repo to isolate the files you want to merge into Statistics locally (don't push this).
  2. Add your local StatsBase repo as a remote to the local julia repo
  3. Use git pull --allow-unrelated-histories from you branch in the julia repo to pull in those files from the StatsBase repo. Then you'd just need to move the file to where you'd like and apply the changes for the new API that you like.

NOTE: The reason I'd prefer to wait till Statistics is a separate repo is that it'll make the history cleaner because the src/weights.jl file will be pulled into the right location by default. It should also set a nice precedent for moving features from StatsBase.jl to Statistics.jl. If someone is willing to create the repo then I can:

  1. make a pull request to port over the history from base
  2. tag a release of just that
  3. Make a separate PR that copies the history for these files from StatsBase
  4. Leave that open for you to apply your changes to that branch before merging into master

@nalimilan
Copy link
Member Author

Let's continue this at JuliaStats/Statistics.jl#2.

@iamed2
Copy link
Contributor

iamed2 commented Nov 5, 2019

The reason I'd prefer to wait till Statistics is a separate repo is that it'll make the history cleaner because the src/weights.jl file will be pulled into the right location by default.

You can pull the history in and have it match up by using git subtree with a prefix. I think moving it out is still better, but I just want everyone to know that it's possible to do this history merge without moving it out.

@pdeffebach
Copy link
Contributor

Can someone give me an update on what work needs to be done on this? Is it all waiting on https://github.com/JuliaLang/Statistics.jl/pull/2?

Or are there design decisions that need to be made still?

@nalimilan
Copy link
Member Author

Continuation is at JuliaLang/Statistics.jl#2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:statistics The Statistics stdlib module stdlib Julia's standard library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants