Import weighted stats and moments from StatsBase to Statistics #31395

nalimilan · 2019-03-18T21:51:28Z

This includes methods for mean, quantile, median, var, std, cov and cor, plus new functions skewness and kurtosis, and weight types. Code is copied from StatsBase with some cleanup where needed, in particular for dispatch, to move from @nloops/@nref to cartesian indexing and to be closer to the mapreducedim code. Weights are now passed via a keyword argument rather than by dispatching on AbstractWeights, so that any array is accepted in cases where all types of weights give the same result.

Still to address:

Do we want mean_and_var and mean_and_std? It's not too hard to do m = mean(...); s = std(..., mean=m) manually. Also the names aren't great. UPDATE: dropped.
Do we want moment? It feels redundant with mean, var, skewness and kurtosis. UPDATE: dropped.
Should we also support weighted sum? For now I've kept wsum internal (it is used by mean), it can always be implemented later in Base. UPDATE: added.
Check that performance didn't regress when porting from @nloops/@nref to cartesian indexing. UPDATE: performance is similar, and sometimes even faster.
Clean tests (notably ensuring that non-vector and non-AbstractWeights weights are supported).
Check that everything is clean, in particular docstrings.
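
The manual replacement for mean_and_std mentioned in the list above can be sketched with the `mean` keyword that `std` and `var` already accept in the Statistics stdlib. The data vector `x` is made up for illustration.

```julia
using Statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data, not from the PR

# Compute the mean once and reuse it instead of calling a combined function.
m = mean(x)
s = std(x; mean=m)   # the precomputed mean is passed in, so it is not recomputed
v = var(x; mean=m)
```

Both calls give the same result as `std(x)` and `var(x)`; the keyword only saves recomputing the mean.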

Closes https://github.com/JuliaLang/julia/issues/29974. See #27152 (comment) for an outline of a broader roadmap.

ararslan · 2019-03-18T22:04:18Z

Do we want mean_and_var and mean_and_std? It's not too hard to do m = mean(...); s = std(..., mean=m) manually. Also the names aren't great.

👍 to dumping them.

Do we want moment? It feels redundant with mean, var, skewness and kurtosis.

Unless we're going to support arbitrary moments (or just 5th or higher), I'd say leave it out.

Should we also support weighted sum?

👍 to supporting it.

rofinn · 2019-03-19T15:30:46Z

Out of curiosity, what's the advantage with having this in Statistics? I only ask because there are still some open issues with the weights type in StatsBase that we might want to address (e.g., n-dimensional weights, mutability of weights).

ararslan · 2019-03-19T15:45:40Z

Out of curiosity, what's the advantage with having this in Statistics?

One big one is better APIs, for example mean(x, weights=w), which also ensures that cases like JuliaStats/StatsBase.jl#475 don't occur.
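
As a rough illustration of what a call like mean(x, weights=w) computes, here is a minimal sketch using a hypothetical helper `weighted_mean`. It is not the PR's implementation, and it does not assume that the `weights` keyword is available in the installed Statistics.

```julia
# Hypothetical helper for illustration only; not part of Statistics or StatsBase.
function weighted_mean(x::AbstractVector, w::AbstractVector)
    length(x) == length(w) ||
        throw(DimensionMismatch("x and w must have the same length"))
    # Weighted sum divided by the sum of the weights.
    return sum(xi * wi for (xi, wi) in zip(x, w)) / sum(w)
end

x = [1.0, 2.0, 3.0]
w = [1.0, 1.0, 2.0]
weighted_mean(x, w)  # (1 + 2 + 6) / 4
```

Because the weights arrive as a separate argument (or keyword), there is no ambiguity with a two-array method such as cov(X, Y), which is the kind of dispatch clash the keyword design avoids.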

rofinn · 2019-03-19T15:53:07Z

Oh, yeah, I guess that makes sense. Looks like github also supports transferring issues to another repo https://help.github.com/en/articles/transferring-an-issue-to-another-repository. I would prefer that we try to retain the history for the files that we're actively copying over.

nalimilan · 2019-03-19T17:29:13Z

Another advantage is to eventually remove StatsBase, whose name is confusing since it's not more "base" than Statistics (quite the contrary actually).

I would prefer that we try to retain the history for the files that we're actively copying over.

I'm not sure that's really possible, as the code cannot be imported as-is and pass tests. So we would have to keep broken commits in history, which isn't great for bisecting. We can still refer to the StatsBase repo for history, though.

nalimilan · 2019-05-08T16:40:42Z

I've implemented sum(x; weights, dims) in Base. This is tricky since StatsBase had optimized methods using BLAS, which make a large difference for performance. The solution I found is to have the generic fallbacks in reducedim.jl, and add methods for BlasReal from LinearAlgebra.
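
A minimal sketch of the split described above, assuming a matrix `A` and one weight per row or per column: `wsum_generic` plays the role of a generic fallback, and `wsum_matvec` that of a method restricted to `Float64` arrays that reduces to a matrix-vector product. Both names are hypothetical and neither is the code from this PR.

```julia
using LinearAlgebra

# Hypothetical generic fallback: works for any element type via broadcasting.
function wsum_generic(A::AbstractMatrix, w::AbstractVector, dim::Integer)
    dim in (1, 2) || throw(ArgumentError("dim must be 1 or 2"))
    length(w) == size(A, dim) ||
        throw(DimensionMismatch("w must have one entry per index along dim"))
    # Weights run along rows for dim == 1 and along columns for dim == 2.
    return dim == 1 ? vec(sum(A .* w, dims=1)) : vec(sum(A .* w', dims=2))
end

# Hypothetical specialised method: for Float64 data the same reduction is a
# matrix-vector product, which LinearAlgebra can hand to BLAS.
wsum_matvec(A::Matrix{Float64}, w::Vector{Float64}, dim::Integer) =
    dim == 1 ? A' * w : A * w

A = [1.0 2.0; 3.0 4.0]
w = [0.5, 2.0]
wsum_generic(A, w, 1)
wsum_matvec(A, w, 1)
```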

I've also cleaned a few things and checked that performance hasn't regressed, so I think the PR is now ready for a serious review. This is a large piece of code, and the reductions are particularly tricky, so double-checking would really be appreciated.

nalimilan · 2019-05-08T15:56:23Z

stdlib/Statistics/src/Statistics.jl

    isempty(r) && return oftype((first(r) + last(r)) / 2, NaN)
    (first(r) + last(r)) / 2
 end

-median(r::AbstractRange{<:Real}) = mean(r)
+_mean(A::AbstractArray, dims, weights::Nothing) =
+    _mean!(Base.reducedim_init(t -> t/2, Base.add_sum, A, dims), A, nothing)


The old code used + instead of add_sum, but I figured it would be more consistent with sum. In practice I don't think it makes a difference for standard types because of the /2 which changes the type to floating point.
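
To make the difference concrete, here is a small sketch comparing `+` with `Base.add_sum` on small integers. Note that `Base.add_sum` is an internal, unexported helper, so relying on it outside Base is an assumption that may not hold across Julia versions.

```julia
a, b = Int8(100), Int8(100)

plain = a + b                    # Int8 arithmetic wraps around on overflow
widened = Base.add_sum(a, b)     # internal helper: widens small integers first
total = sum(Int8[100, 100])      # public sum relies on the same widening
```

For floating-point results, as with the `/2` in the mean initialization, both operators end up in the same type, which matches the remark above.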

Tokazama · 2019-07-25T14:39:24Z

stdlib/Statistics/docs/src/index.md

-The Statistics module contains basic statistics functionality.
+The Statistics module contains basic statistics functionality: mean, median, quantiles,
+standard deviation, variance, skewness, kurtosis, correlation and covariance.
+Statistics can be weighted, and several weights types are distinguished to apply appropriate


Perhaps this should read "several weight types" instead of "several weights types". If you're referring to the actual type, then maybe "several AbstractWeights types"?

Why "weight type"? The plural sounds more appropriate since weights only make sense as a series of values. Though "types of weights" might be better.

It just sounded weird to me. I think "types of weights" is what sounds best.

rofinn · 2019-07-26T01:41:59Z

stdlib/Statistics/src/Statistics.jl


 """
-    var(itr; dims, corrected::Bool=true, mean=nothing)
+    var(itr; corrected::Bool=true, [weights::AbstractWeights], [mean], [dims])


Maybe we should support weights::AbstractArray for consistency with other methods like mean? We could always convert non-weight arrays with Weights(weights). Same applies below.

Actually we can't support arbitrary vectors since there's no varcorrection for them. Weights isn't supported.

That's true, but you should still be able to do var(itr; weights=..., corrected=false).
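
For plain arrays of weights, the uncorrected case needs no correction factor at all. A minimal sketch, using a hypothetical helper `weighted_var_uncorrected` that is not part of the PR:

```julia
# Hypothetical helper for illustration only.
function weighted_var_uncorrected(x::AbstractVector, w::AbstractVector)
    length(x) == length(w) ||
        throw(DimensionMismatch("x and w must have the same length"))
    sw = sum(w)
    m = sum(x .* w) / sw                  # weighted mean
    return sum(w .* (x .- m) .^ 2) / sw   # weighted mean of squared deviations
end

weighted_var_uncorrected([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
```

Only the corrected variant depends on what the weights mean (frequencies, precisions, probabilities), which is why it requires a specific weights type.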

rofinn · 2019-07-26T01:50:52Z

stdlib/Statistics/src/Statistics.jl

+    varcorrection(w::AnalyticWeights, corrected=false)
+
+* `corrected=true`: ``\\frac{1}{\\sum w - \\sum {w^2} / \\sum w}``
+* `corrected=false`: ``\\frac{1}{\\sum w}``


Might be good to include links here as well (e.g., https://en.wikipedia.org/wiki/Weighted_arithmetic_mean)
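
The two formulas in the docstring above can be checked numerically with a short sketch. The function name `analytic_varcorrection` is hypothetical and takes a plain vector rather than an AnalyticWeights object.

```julia
# Hypothetical stand-in for varcorrection(w::AnalyticWeights, corrected).
function analytic_varcorrection(w::AbstractVector{<:Real}, corrected::Bool=false)
    s = sum(w)
    if corrected
        # 1 / (sum(w) - sum(w^2) / sum(w))
        return 1 / (s - sum(abs2, w) / s)
    else
        # 1 / sum(w)
        return 1 / s
    end
end

w = ones(4)
analytic_varcorrection(w, true)    # with unit weights this is 1 / (n - 1)
analytic_varcorrection(w, false)   # with unit weights this is 1 / n
```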

rofinn · 2019-07-26T01:56:54Z

stdlib/Statistics/src/Statistics.jl

+
+    wsum = sum(w)
+    wsum == 0 && throw(ArgumentError("weight vector cannot sum to zero"))
+    length(v) == length(w) || throw(ArgumentError("data and weight vectors must be the same size," *


If you need to split it over multiple lines then you might want to just use an if?
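
A sketch of what the suggested `if` form could look like. The function name `check_weights` is hypothetical, and the second half of the size error message is invented for illustration because the original line is cut off in the diff.

```julia
# Hypothetical validation helper; returns the sum of the weights when valid.
function check_weights(v::AbstractVector, w::AbstractVector)
    wsum = sum(w)
    if wsum == 0
        throw(ArgumentError("weight vector cannot sum to zero"))
    end
    if length(v) != length(w)
        # The continuation of this message is an assumption, not the PR's text.
        throw(ArgumentError("data and weight vectors must be the same size, " *
                            "got $(length(v)) and $(length(w))"))
    end
    return wsum
end

check_weights([1, 2], [0.5, 0.5])
```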

rofinn · 2019-07-26T01:57:57Z

stdlib/Statistics/src/Statistics.jl

+        x < 0 && throw(ArgumentError("weight vector cannot contain negative entries"))
+    end
+
+    isa(w, FrequencyWeights) && !(eltype(w) <: Integer) && any(!isinteger, w) &&


Again, maybe just use an if block.

rofinn · 2019-07-26T02:12:14Z

stdlib/Statistics/src/weights.jl

+end
+
+Base.isequal(x::AbstractWeights, y::AbstractWeights) = false
+Base.:(==)(x::AbstractWeights, y::AbstractWeights)   = false


Missing a newline at end of file.

rofinn

I'm not sure that's really possible, as the code cannot be imported as-is and pass tests. So we would have to keep broken commits in history, which isn't great for bisecting. We can still refer to the StatsBase repo for history, though.

How do you propose we refer to the StatsBase history? FWIW, I'd prefer that we retain history for this data type as a lot of changes/decisions were made while they were being developed. It is also helpful for GitHub's suggested reviewers feature. Finally, I think this would be better to add after the stdlibs have been moved into separate packages as julia can just pin the Statistics version to a particular release, but folks can easily free/update to these new features regardless of which julia version they're using. It'll also be easier to add features and bug fixes for these types without depending on the next julia release.

nalimilan · 2019-07-26T08:21:01Z

How do you propose we refer to the StatsBase history? FWIW, I'd prefer that we retain history for this data type as a lot of changes/decisions were made while they were being developed. It is also helpful for github's suggested reviewers feature.

As I said, I don't know whether that's possible, and I'm not aware of a precedent. Do you have ideas?

Finally, I think this would be better to add after the stdlibs have been moved into separate packages as julia can just pin the Statistics version to a particular release, but folks can easily free/update to these new features regardless of which julia version they're using. It'll also be easier to add features and bug fixes for these types without depending on the next julia release.

What would be the advantage of waiting? We can keep exporting the types from StatsBase (and just re-exporting when Statistics provides them), so that it keeps working as it does on all Julia versions. Then once stdlibs can be versioned, people will be able to start depending only on a given Statistics version, and drop the StatsBase dependency. But waiting blocks any progress on the StatsBase front (this PR is just a small part of the needed work).

nalimilan · 2019-05-09T07:26:59Z

stdlib/Statistics/src/moments.jl

+    cm3 = cm2 * z # empirical 3rd centered moment
+    n = 1
+    y = iterate(x, s)
+    while y !== nothing


Unfortunately, this kind of loop is slower than what can be achieved for AbstractArray using @inbounds @simd for i in eachindex(A). The only solution AFAICT is to add a special method for AbstractArray -- but better do that in another PR as this one is quite large already.

Is the reason that you can't use @inbounds @simd for i in eachindex(A) because you're starting on the 2nd iterate in this loop?
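
The two loop styles under discussion can be sketched side by side. Both functions are hypothetical and only accumulate cubed deviations from a given mean `m`; they are not the PR's skewness code.

```julia
# Works for any iterable, using the iteration protocol directly.
function sum_cubed_dev_iter(itr, m)
    acc = zero(m)
    y = iterate(itr)
    while y !== nothing
        x, s = y
        z = x - m
        acc += z * z * z
        y = iterate(itr, s)
    end
    return acc
end

# Specialised method for arrays: indexing lets the compiler vectorise the loop.
function sum_cubed_dev_simd(A::AbstractArray, m)
    acc = zero(m)
    @inbounds @simd for i in eachindex(A)
        z = A[i] - m
        acc += z * z * z
    end
    return acc
end

x = [1.0, 2.0, 3.0, 6.0]
sum_cubed_dev_iter(x, 3.0)
sum_cubed_dev_simd(x, 3.0)
```

The first version accepts generators and other non-indexable iterables; the second needs an AbstractArray, which is why it would have to be a separate method.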

nalimilan · 2019-05-09T07:27:19Z

stdlib/Statistics/docs/src/index.md

+
+!!! note
+    - The weight vector is a light-weight wrapper of the input vector.
+      The input vector is NOT copied during construction.


Suggested change

-      The input vector is NOT copied during construction.
+      The input vector is *not* copied during construction.

nalimilan · 2019-05-09T07:34:51Z

stdlib/Statistics/src/moments.jl

+        # Return the NaN of the type that we would get, had this collection
+        # contained any elements (this is consistent with var)
+        z0 = zero(T) - zero(m)
+        return (z0^3 + z0^3)/sqrt((z0^2+z0^2)^3)


This kind of initialization is probably overzealous, but I always find it hard to decide what simplifications are OK.
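
To show what this initialization buys, here is a sketch that wraps the same expression in a hypothetical function `empty_skewness_nan` and evaluates it for a few element types. The result is always NaN, but its type follows the arithmetic that non-empty input would have gone through.

```julia
# Hypothetical wrapper around the expression from the diff above.
function empty_skewness_nan(::Type{T}, m) where {T}
    z0 = zero(T) - zero(m)
    # 0 / sqrt(0) evaluates to NaN in the promoted floating-point type.
    return (z0^3 + z0^3) / sqrt((z0^2 + z0^2)^3)
end

r32 = empty_skewness_nan(Float32, 0.0f0)   # stays in Float32
r64 = empty_skewness_nan(Int, 0.0)         # Int data with a Float64 mean
rint = empty_skewness_nan(Int, 0)          # all-integer input still gives a float NaN
```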

nalimilan · 2019-05-09T07:39:06Z

base/reducedim.jl

+    check_reducedims(R,A)
+    reddims = size(R) .!= size(A)
+    dim = something(findfirst(reddims), ndims(R)+1)
+    if dim > N


This code is quite ugly but I'm not sure what's the best solution. For unweighted sum, reducing over dim > N is a no-op, so that's easy, but for the weighted sum it amounts to multiplying values by their corresponding weight. Maybe this should just be an error?
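
A small sketch of the asymmetry described above. The weighted half assumes one weight per element of `A`, which is an interpretation of the comment rather than something the diff confirms.

```julia
A = [1.0 2.0; 3.0 4.0]

# Unweighted: reducing over a dimension beyond ndims(A) leaves the values as they are.
unweighted = sum(A, dims=3)

# Weighted (assumption: weights shaped like A): each slice along the extra
# dimension holds a single element, so the "sum" is just value times weight.
W = [0.5 1.0; 2.0 0.0]   # illustrative weights
weighted = A .* W
```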

nalimilan · 2019-07-26T08:10:22Z

stdlib/Statistics/src/Statistics.jl


 """
-    var(itr; dims, corrected::Bool=true, mean=nothing)
+    var(itr; corrected::Bool=true, [weights::AbstractWeights], [mean], [dims])


Actually we can't support arbitrary vectors since there's no varcorrection for them. Weights isn't supported.

nalimilan · 2019-07-26T08:22:40Z

stdlib/Statistics/docs/src/index.md

-The Statistics module contains basic statistics functionality.
+The Statistics module contains basic statistics functionality: mean, median, quantiles,
+standard deviation, variance, skewness, kurtosis, correlation and covariance.
+Statistics can be weighted, and several weights types are distinguished to apply appropriate


Why "weight type"? The plural sounds more appropriate since weights only make sense as a series of values. Though "types of weights" might be better.

rofinn · 2019-07-26T15:01:19Z

As I said, I don't know whether that's possible, and I'm not aware of a precedent. Do you have ideas?

Yes, we do this a reasonable amount at work.

1. You'd use git filter-branch in your StatsBase.jl repo to isolate the files you want to merge into Statistics locally (don't push this).
2. Add your local StatsBase repo as a remote to the local julia repo.
3. Use git pull --allow-unrelated-histories from your branch in the julia repo to pull in those files from the StatsBase repo. Then you'd just need to move the file to where you'd like and apply the changes for the new API that you like.

NOTE: The reason I'd prefer to wait till Statistics is a separate repo is that it'll make the history cleaner because the src/weights.jl file will be pulled into the right location by default. It should also set a nice precedent for moving features from StatsBase.jl to Statistics.jl. If someone is willing to create the repo then I can:

1. Make a pull request to port over the history from base
2. Tag a release of just that
3. Make a separate PR that copies the history for these files from StatsBase
4. Leave that open for you to apply your changes to that branch before merging into master

nalimilan · 2019-09-28T14:16:16Z

Let's continue this at JuliaStats/Statistics.jl#2.

iamed2 · 2019-11-05T00:04:35Z

The reason I'd prefer to wait till Statistics is a separate repo is that it'll make the history cleaner because the src/weights.jl file will be pulled into the right location by default.

You can pull the history in and have it match up by using git subtree with a prefix. I think moving it out is still better, but I just want everyone to know that it's possible to do this history merge without moving it out.

pdeffebach · 2020-03-18T14:27:47Z

Can someone give me an update on what work needs to be done on this? Is it all waiting on https://github.com/JuliaLang/Statistics.jl/pull/2?

Or are there design decisions that need to be made still?

nalimilan · 2020-03-18T15:36:26Z

Continuation is at JuliaLang/Statistics.jl#2.

nalimilan added stdlib, domain:statistics and removed stdlib labels Mar 18, 2019

oxinabox reviewed Mar 19, 2019

nalimilan mentioned this pull request Apr 19, 2019

cov(x, w::AbstractWeights) dispatches on cov(X, Y) fallback JuliaStats/StatsBase.jl#409

Open

nalimilan added 11 commits May 4, 2019 14:45

Remove moment and combined stats, make other functions accept any ite…

d5e33de

…rable

Implement weighted sum

9a41048

Move optimized weighted sum methods to LinearAlgebra

022d029

More tests

24f530a

Move varcorrection to Statistics.jl

d7ae38b

Docs

4618723

Performance fix

d627393

Cleanup

7f364f4

Fix bug

f51d8db

Fix TODO

1f7b3d9

nalimilan force-pushed the nl/weightedstats branch from 655603d to 1f7b3d9 Compare May 8, 2019 15:20

Another cleanup, enable more tests

d5e0135

nalimilan mentioned this pull request May 8, 2019

ignoring elements with 0 weight JuliaStats/StatsBase.jl#492

Open

nalimilan marked this pull request as ready for review May 8, 2019 16:40

nalimilan commented May 8, 2019

nalimilan mentioned this pull request May 8, 2019

Release StatsBase.jl v1.0 JuliaStats/StatsBase.jl#493

Open

nalimilan mentioned this pull request May 26, 2019

Add exponential weights JuliaStats/StatsBase.jl#401

Merged

nickrobinson251 mentioned this pull request Jun 17, 2019

Consider exporting StatsBase.sample from Statistics stdlib #32343

Closed

Tokazama reviewed Jul 25, 2019

rofinn reviewed Jul 26, 2019

rofinn suggested changes Jul 26, 2019

nalimilan commented Jul 26, 2019

This was referenced Sep 18, 2019

RFC: Add weights argument to sum #33310

Closed

Allow weights in cut(x, ngroups) JuliaData/CategoricalArrays.jl#209

Open

This was referenced Sep 27, 2019

Move Statistics stdlib module to external repository #33399

Merged

Import StatsBase into Statistics JuliaStats/Statistics.jl#2

Draft

nalimilan mentioned this pull request Sep 29, 2019

Simplify weights JuliaStats/StatsBase.jl#526

Merged

nalimilan mentioned this pull request Oct 19, 2019

Add a WeightedResampler JuliaStats/Distributions.jl#890

Open

nalimilan mentioned this pull request Dec 19, 2020

Proposal for moving StatsBase weights framework to stdlib/Statistics JuliaStats/Statistics.jl#4

Open

This was referenced Feb 19, 2021

std not re-exported? JuliaStats/StatsBase.jl#504

Closed

Reexport LogExpFunctions JuliaStats/StatsFuns.jl#108

Merged

nalimilan mentioned this pull request Apr 26, 2021

Preserve eltype where possible for moments JuliaStats/StatsBase.jl#688

Open

nalimilan closed this Sep 25, 2021

DilumAluthge deleted the nl/weightedstats branch October 26, 2021 23:25
