Chatterjee Correlation Coefficient #770

Merged: 23 commits, May 25, 2022

mborland
Member

@mborland mborland commented Mar 1, 2022

@NAThompson
Collaborator

@mborland : This might be one of the most cool and unique features we have in the library; thanks for taking this on.

@NAThompson
Collaborator

@mborland : Do you know how to quickly recover the rank of Y_{(i)}? That's one part of the paper I didn't quite understand.

@mborland
Member Author

mborland commented Mar 2, 2022

> @mborland : Do you know how to quickly recover the rank of Y_{(i)}? That's one part of the paper I didn't quite understand.

If you look at rank.hpp, the function takes the y values and returns a std::vector<std::size_t> holding their ranks, in the same order as the y values. In the actual implementation I am going to assert that the x values are sorted before this function is called.
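
As a rough illustration of the ranking step described above, here is a minimal sketch. It is not the code in rank.hpp: the name ranks_in_input_order is hypothetical, ranks are 1-based, and ties are assumed not to occur.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not the rank.hpp implementation): returns the 1-based
// rank of each element, in the same order as the input. Assumes no ties.
template <typename Real>
std::vector<std::size_t> ranks_in_input_order(const std::vector<Real>& y)
{
    // After sorting, order[k] is the index of the k-th smallest element of y.
    std::vector<std::size_t> order(y.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&y](std::size_t a, std::size_t b) { return y[a] < y[b]; });

    // Invert the permutation so each rank lines up with its original position.
    std::vector<std::size_t> rank(y.size());
    for (std::size_t k = 0; k < order.size(); ++k)
    {
        rank[order[k]] = k + 1;
    }
    return rank;
}
```

For the input {0.3, 0.1, 0.9, 0.5} this returns {2, 1, 4, 3}. The sort dominates the cost, so the step is O(n log n).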

@NAThompson
Collaborator

> In the actual implementation I am going to have an assertion that the x values are sorted before this function is called.

Ah, that's a good idea. Might want to have two: sorted_chatterjee_coefficient, which asserts std::is_sorted(X), and chatterjee_coefficient, which performs the sort.
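
The two-entry-point idea could look roughly like the sketch below. Only the two function names come from the suggestion above; the signatures, the use of std::vector, and the no-ties assumption are illustrative and may differ from what was merged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch only. Precondition: x is sorted ascending, y[i] is paired with x[i],
// y has no ties, and there are at least two samples.
template <typename Real>
Real sorted_chatterjee_coefficient(const std::vector<Real>& x, const std::vector<Real>& y)
{
    assert(x.size() == y.size() && x.size() > 1);
    assert(std::is_sorted(x.begin(), x.end()));

    const std::size_t n = y.size();

    // Compute the 1-based rank of each y[i] by inverting an argsort.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&y](std::size_t a, std::size_t b) { return y[a] < y[b]; });
    std::vector<std::size_t> rank(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        rank[order[k]] = k + 1;
    }

    // Sum of absolute differences between consecutive ranks.
    Real sum = 0;
    for (std::size_t i = 1; i < n; ++i)
    {
        sum += rank[i] > rank[i - 1] ? static_cast<Real>(rank[i] - rank[i - 1])
                                     : static_cast<Real>(rank[i - 1] - rank[i]);
    }
    const Real nr = static_cast<Real>(n);
    return 1 - 3 * sum / (nr * nr - 1);
}

// Sketch only. Sorts the (x, y) pairs by x, then defers to the sorted version.
template <typename Real>
Real chatterjee_coefficient(const std::vector<Real>& x, const std::vector<Real>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();

    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&x](std::size_t a, std::size_t b) { return x[a] < x[b]; });

    std::vector<Real> xs(n);
    std::vector<Real> ys(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        xs[k] = x[order[k]];
        ys[k] = y[order[k]];
    }
    return sorted_chatterjee_coefficient(xs, ys);
}
```

The split lets callers who already hold sorted data skip one sort and the extra copies, while the convenience overload keeps the precondition out of sight for everyone else.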

@mborland mborland changed the title from "Chaterjee Correlation Coefficient" to "Chatterjee Correlation Coefficient" Mar 3, 2022
@mborland
Member Author

mborland commented Mar 3, 2022

@NAThompson here is the performance data:

Running ./chatterjee_correlation_performance
Run on (10 X 24.0565 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x10)
  L1 Instruction 128 KiB (x10)
  L2 Unified 4096 KiB (x5)
Load Average: 2.48, 2.44, 2.37
-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
chatterjee_correlation<float>/64/real_time              534 ns          534 ns      1268796
chatterjee_correlation<float>/128/real_time            1097 ns         1097 ns       645575
chatterjee_correlation<float>/256/real_time            2357 ns         2357 ns       290292
chatterjee_correlation<float>/512/real_time            5099 ns         5099 ns       136816
chatterjee_correlation<float>/1024/real_time          12173 ns        12173 ns        60000
chatterjee_correlation<float>/2048/real_time          72584 ns        72584 ns        11677
chatterjee_correlation<float>/4096/real_time         163897 ns       163895 ns         3731
chatterjee_correlation<float>/8192/real_time         419733 ns       419733 ns         1737
chatterjee_correlation<float>/16384/real_time        911279 ns       911278 ns          792
chatterjee_correlation<float>/32768/real_time       1998998 ns      1998997 ns          359
chatterjee_correlation<float>/65536/real_time       4250824 ns      4250776 ns          165
chatterjee_correlation<float>/131072/real_time      9087735 ns      9087731 ns           78
chatterjee_correlation<float>/262144/real_time     19238598 ns     19238622 ns           37
chatterjee_correlation<float>/524288/real_time     41215804 ns     41215824 ns           17
chatterjee_correlation<float>/1048576/real_time    84805198 ns     84805000 ns            8
chatterjee_correlation<float>/real_time_BigO           4.06 NlgN       4.06 NlgN
chatterjee_correlation<float>/real_time_RMS               2 %             2 %
chatterjee_correlation<double>/64/real_time             529 ns          529 ns      1372474
chatterjee_correlation<double>/128/real_time           1067 ns         1067 ns       649907
chatterjee_correlation<double>/256/real_time           2248 ns         2248 ns       320911
chatterjee_correlation<double>/512/real_time           4690 ns         4690 ns       145379
chatterjee_correlation<double>/1024/real_time         10196 ns        10196 ns        58376
chatterjee_correlation<double>/2048/real_time         67452 ns        67451 ns        10000
chatterjee_correlation<double>/4096/real_time        139739 ns       139739 ns         3923
chatterjee_correlation<double>/8192/real_time        396547 ns       396546 ns         1744
chatterjee_correlation<double>/16384/real_time       866781 ns       866781 ns          744
chatterjee_correlation<double>/32768/real_time      1965081 ns      1965079 ns          353
chatterjee_correlation<double>/65536/real_time      4249051 ns      4249048 ns          166
chatterjee_correlation<double>/131072/real_time     8951655 ns      8951662 ns           77
chatterjee_correlation<double>/262144/real_time    19002963 ns     19003000 ns           36
chatterjee_correlation<double>/524288/real_time    40619715 ns     40619778 ns           18
chatterjee_correlation<double>/1048576/real_time   83150833 ns     83150875 ns            8
chatterjee_correlation<double>/real_time_BigO          3.99 NlgN       3.99 NlgN
chatterjee_correlation<double>/real_time_RMS              3 %             3 %

@NAThompson
Collaborator

@mborland : Beautiful n log(n) complexity, just as expected.

BTW, looks like you accidentally committed a binary file.
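
The fitted N lg N coefficient in the benchmark output can be sanity-checked by hand from individual rows. The helper below is a hypothetical sketch that assumes the "lg" in the report is a base-2 logarithm.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: nanoseconds per unit of N * log2(N) for one benchmark
// row. Assumes the "NlgN" column in the report uses a base-2 logarithm.
inline double nlogn_coefficient(std::size_t n, double time_ns)
{
    const double nd = static_cast<double>(n);
    return time_ns / (nd * std::log2(nd));
}
```

Using the float rows above, nlogn_coefficient(1048576, 84805198.0) is about 4.04 and nlogn_coefficient(65536, 4250824.0) is about 4.05, both close to the reported fit of 4.06. The smallest sizes come out well below that (about 1.2 at N = 1024), which would be consistent with the data still fitting in cache, although the thread does not confirm the cause.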

@mborland mborland marked this pull request as ready for review May 22, 2022 23:46
@mborland
Member Author

@NAThompson This is good for review. The only failure in the previous run was a non-ASCII character in a comment, which is now fixed.


This is the problem Chatterjee's coefficient solves.
Let X and Y be random variables, where Y is not constant, and let (X_i, Y_i) be samples from this distribution.
Rearrange these samples so that X_{(0)} < X_{(1)} < ... < X_{(n-1)} and create (X_{(i)}, Y_{(i)}).
Collaborator

@mborland : Does this render properly? I wonder if we need to use (say) this to get the docs correct?

In the limit of an infinite amount of i.i.d. data, the statistic lies in [0, 1].
However, if the data is not infinite, the statistic may be negative.
If X and Y are independent, the value is zero, and if Y is a measurable function of X, then the statistic is unity.
The complexity is O(n log n).
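
For reference, when there are no ties the statistic from Chatterjee's paper can be written as below, using the zero-based indexing of the documentation excerpt. The symbol r_i is notation introduced here for the rank of Y_{(i)} among all the Y values.

```latex
\xi_n(X, Y) = 1 - \frac{3 \sum_{i=0}^{n-2} \lvert r_{i+1} - r_i \rvert}{n^2 - 1}
```

Sorting by X and ranking Y are both sort-based, and the sum is a single linear pass, which is where the O(n log n) cost comes from.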
Collaborator

Is there a way to get O(n log n) nicely rendered?

@NAThompson
Collaborator

@mborland : Just put some trivial comments in; I think this is basically good to go.

One last thing: is there another really clean unit test we could add? I wonder if figure 2 of the attached could be used "morally" to simply ensure that we haven't made any silly scaling errors, i.e., just let Y = X and attempt to show ξ ≈ 0.970, let Y = X^2 and show ξ ≈ 0.941, and let Y = sin(X) and show ξ ≈ 0.885.
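
The Y = X case has a closed form that makes such a scaling check easy to sketch. With n distinct samples the ranks are 1..n, so the sum of consecutive rank differences is n - 1 and ξ = 1 - 3/(n + 1). The helpers below are hypothetical and assume no ties; the sample count n = 100 is an assumption, since the comment above does not state it.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: evaluates the no-ties statistic from ranks that are
// already ordered by increasing X.
inline double xi_from_ranks(const std::vector<std::size_t>& r)
{
    const double n = static_cast<double>(r.size());
    double sum = 0;
    for (std::size_t i = 1; i < r.size(); ++i)
    {
        sum += r[i] > r[i - 1] ? static_cast<double>(r[i] - r[i - 1])
                               : static_cast<double>(r[i - 1] - r[i]);
    }
    return 1 - 3 * sum / (n * n - 1);
}

// Ranks of Y = X for n distinct, sorted samples: simply 1, 2, ..., n.
inline std::vector<std::size_t> identity_ranks(std::size_t n)
{
    std::vector<std::size_t> r(n);
    std::iota(r.begin(), r.end(), std::size_t{1});
    return r;
}
```

With n = 100, xi_from_ranks(identity_ranks(100)) evaluates to 1 - 3/101, about 0.9703, which agrees with the 0.970 quoted above. The Y = X^2 and Y = sin(X) values depend on how X is sampled and on tie handling, so they are not reproduced here.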

@NAThompson
Collaborator

@mborland : Looks good to me; I sign off!

@jzmaddock : Want to do a final sign off?

@mborland
Member Author

This is good to go now. Autodiff has been consistently hanging in the drone run under USAN.

@NAThompson NAThompson merged commit e5eae18 into boostorg:develop May 25, 2022
@mborland mborland deleted the chaterjee branch May 25, 2022 15:37