
Unstable results #13

Open
immaryw opened this issue Aug 18, 2021 · 3 comments · May be fixed by #21 or #23

Comments


immaryw commented Aug 18, 2021

Hi there,

I'm using the package to calculate the density ratio for multi-dimensional data. I run the program many times on the same training and test datasets, but the estimated density ratios often differ slightly between runs. Is there a way to make the results more stable?

Another question concerns how to set the sigma and lambda search ranges. These hyperparameters affect the estimates a lot!

Thanks in advance!!

Contributor

mierzejk commented Aug 18, 2021

Hi @immaryw,

I am not sure if the repo is still being maintained by the owner: over a year ago I submitted a pull request (#9) and an issue (#10), but no reaction has ensued whatsoever. Still, you might be interested in my branch, which may alleviate your concern. The branch lacks an updated README, but its only configurable feature is thoroughly described in #9, and by default it offers a significant performance improvement through proper numpy vectorization.

The instability stems from the way kernel centers are randomly picked from the x vector, namely with the numpy.random.randint method, which draws a simple sample with replacement. Your options are:

  1. Use numpy.random.seed to fix the pseudo-random number generator seed to a constant value. Granted, this legacy seeding API is discouraged in favor of the newer numpy.random.Generator interface, but it is how densratio_py is implemented.
  2. Go for my branch, where numpy.random.randint has been superseded with numpy.random.choice without replacement, so no center is picked twice.
     Furthermore, by applying numpy.percentile, the choice is stratified with respect to the (possibly multivariate) x values. Please refer to the semi_stratified_sample function. This approach should increase the stability of the results.
  3. Set the number of kernels equal to the length of x (a greater value is effectively the same as equal). With my branch, this makes every x value a kernel center; with the original repo, sampling is still done with replacement.

Obviously, you can combine any two of these options, or even all three.
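For illustration, options 1 and 2 might be sketched like this on toy data (the variable names and kernel count here are made up for the example; densratio's actual internals differ):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(200, 2))  # toy 2-D data
n_kernels = 100

# Option 1: fix the legacy global seed so numpy.random.randint picks
# the same kernel centers on every run (this is what densratio_py
# relies on, since it uses the legacy global RNG).
np.random.seed(42)
idx_with_replacement = np.random.randint(len(x), size=n_kernels)

# Option 2: sample indices without replacement, so no center is
# duplicated; this alone reduces run-to-run variance of the estimate.
idx_without_replacement = np.random.choice(len(x), size=n_kernels,
                                           replace=False)
centers = x[idx_without_replacement]
```

The stratified variant in the branch additionally bins x by numpy.percentile before sampling, so each quantile stratum contributes centers, but the without-replacement draw above is the core of the change.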

Best regards,
Chris

Contributor

mierzejk commented Aug 18, 2021

As regards lambda and sigma: right now you simply provide a list of candidate values for each, their Cartesian product is evaluated, and the pair yielding the least error is selected (a grid-search approach). Perhaps an automated machine learning (AutoML) approach could be adopted to expedite the process, but it would definitely require modifying the core source code to get it working.

Author

immaryw commented Aug 19, 2021

@mierzejk I appreciate your awesome work and very helpful reply!! I installed your branch and the results look more stable now 👍
