
Unstable results #13

Open
immaryw opened this issue Aug 18, 2021 · 3 comments · May be fixed by #21 or #23

Comments


immaryw commented Aug 18, 2021

Hi there,

I'm using the package to calculate the density ratio for multi-dimensional data. I run the program many times on the same training and test datasets, but the estimated density ratios often differ slightly between runs. Is there a way to make the results more stable?

Another question concerns how to set the sigma and lambda search ranges. These hyperparameters affect the estimates a lot!

Thanks in advance!!

Contributor

mierzejk commented Aug 18, 2021

Hi @immaryw,

I am not sure if the repo is still being maintained by the owner: over a year ago I submitted a pull request (#9) and an issue (#10), but no reaction has ensued whatsoever. Still, you might be interested in my branch, which may alleviate your concern. The branch lacks an updated README, but its only configurable feature is thoroughly described in #9, and by default it offers a significant performance improvement through proper numpy vectorization.

The instability stems from the way kernel centers are randomly picked from the x vector, namely with the numpy.random.randint method, which draws a simple sample with replacement. Your options are:

  1. Use numpy.random.seed to fix the pseudo-random number generator seed to a constant value. Granted, this legacy seeding API is discouraged in favor of the newer numpy.random.Generator interface, but it is how densratio_py is implemented.
  2. Go for my branch, where numpy.random.randint has been superseded with numpy.random.choice without replacement, so no center is picked twice.
     Furthermore, by applying numpy.percentile, the choice is stratified with respect to the (possibly multivariate) x values. Please refer to the semi_stratified_sample function. This approach should increase the stability of the results.
  3. Set the number of kernels equal to the length of x (a greater value is effectively the same as equal). With my branch, this makes every x value a kernel center; with the original repo, sampling is still done with replacement.

Obviously, you can combine any two of these options, or even all three.
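For illustration, options 1 and 2 might be sketched like this on toy data (the variable names and kernel count here are made up for the example; densratio's actual internals differ):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(200, 2))  # toy 2-D data
n_kernels = 100

# Option 1: fix the legacy global seed so numpy.random.randint picks
# the same kernel centers on every run (this is what densratio_py
# relies on, since it uses the legacy global RNG).
np.random.seed(42)
idx_with_replacement = np.random.randint(len(x), size=n_kernels)

# Option 2: sample indices without replacement, so no center is
# duplicated; this alone reduces run-to-run variance of the estimate.
idx_without_replacement = np.random.choice(len(x), size=n_kernels,
                                           replace=False)
centers = x[idx_without_replacement]
```

The stratified variant in the branch additionally bins x by numpy.percentile before sampling, so each quantile stratum contributes centers, but the without-replacement draw above is the core of the change.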

Best regards,
Chris

Contributor

mierzejk commented Aug 18, 2021

As regards lambda and sigma: right now you simply provide a list of candidate values for each, their Cartesian product is evaluated, and the pair yielding the least error is selected (a grid-search approach). Perhaps an automated machine learning (AutoML) approach could be adopted to expedite the process, but it would definitely require modifying the core source code to get it working.

Author

immaryw commented Aug 19, 2021

@mierzejk I appreciate your awesome work and very helpful reply!! I installed your branch and the results look more stable now 👍
