Ellipse should sub-sample data for random selection #38

Open
edsmall-bodc opened this issue Jul 17, 2020 · 0 comments
Labels
enhancement New feature or request ignore-for-release Ignore this for next release

Comments

@edsmall-bodc
Collaborator

When selecting historical data for comparison, 1/3 of the data is selected randomly. The rest is selected according to how well the remaining historical data match the current float profile in space and time.
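As a rough illustration of the split described above, here is a minimal sketch. The function name `select_historical` and the `score` array (higher meaning a better space/time match to the float profile) are hypothetical and are not taken from the existing code base.

```python
import numpy as np

def select_historical(score, n_select, rng):
    """Pick n_select points: one third at random, the rest by best score.

    `score` is an assumed 1-D array where a higher value means a better
    space/time match to the current float profile.
    """
    n_total = score.size
    n_select = min(n_select, n_total)
    n_random = n_select // 3

    # Random third, drawn without replacement from all candidates.
    random_idx = rng.choice(n_total, size=n_random, replace=False)

    # The rest: best-scoring points among those not already chosen.
    remaining = np.setdiff1d(np.arange(n_total), random_idx)
    order = np.argsort(score[remaining])[::-1]  # best first
    best_idx = remaining[order[: n_select - n_random]]
    return random_idx, best_idx
```

The random third is drawn first so that the score-based picks never duplicate it.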

The random portion is there specifically to avoid choosing data from only one area because of strong correlations. We want to ensure we use strongly correlated data AND enough data to cover the area generally.

However, if we select 1/3 of our data points randomly, we are still at risk of selecting a poor spatial/temporal distribution (though this probability is very small). We can increase our chances of getting a good distribution by splitting up the ellipse into N parts and randomly selecting data in each of these parts.

We need to discuss how we should go about splitting up the ellipse into chunks. We should try to do this dynamically, so that DMQC operators can decide for themselves how many areas the ellipse should be split into.
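One possible way to split the ellipse, offered only as a starting point for that discussion, is into N equal angular wedges around the current profile, with N passed in by the operator. The sketch below assumes plain longitude/latitude offsets; it ignores the temporal dimension, the ellipse's axis scaling, and longitude wrap-around. The names `sector_index` and `sample_per_sector` are hypothetical.

```python
import numpy as np

def sector_index(lon, lat, centre_lon, centre_lat, n_sectors):
    """Assign each point to one of n_sectors equal angular wedges."""
    angle = np.arctan2(lat - centre_lat, lon - centre_lon)  # [-pi, pi]
    frac = (angle + np.pi) / (2 * np.pi)                    # [0, 1]
    # Clamp so that an angle of exactly +pi falls in the last wedge.
    return np.minimum((frac * n_sectors).astype(int), n_sectors - 1)

def sample_per_sector(sectors, n_sectors, per_sector, rng):
    """Randomly pick up to per_sector point indices from each wedge."""
    chosen = []
    for s in range(n_sectors):
        members = np.flatnonzero(sectors == s)
        k = min(per_sector, members.size)
        if k:
            chosen.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(chosen) if chosen else np.empty(0, dtype=int)
```

A wedge with fewer than `per_sector` points simply contributes everything it has; redistributing the shortfall to the other wedges is a separate decision.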

We also need to decide how to allocate the number of random selections to each segment. E.g., what if we want 20 random data points from each section, but one section only contains 5 data points? How do we allocate the work to pick up the slack?
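One candidate answer, sketched here under the assumption that the shortfall should be shared as evenly as possible among the segments that still have unused points, is to cap each segment at what it holds and re-divide whatever is left. The function name `allocate_quota` is hypothetical.

```python
def allocate_quota(available, total):
    """Split `total` random picks across segments.

    `available[i]` is the number of data points in segment i. Each
    segment gets an equal share, capped by what it holds; any shortfall
    is re-divided among segments that still have spare points.
    """
    quota = [0] * len(available)
    remaining = min(total, sum(available))
    while remaining > 0:
        # Segments that can still give more points.
        open_segments = [i for i, n in enumerate(available) if quota[i] < n]
        share, extra = divmod(remaining, len(open_segments))
        for rank, i in enumerate(open_segments):
            want = share + (1 if rank < extra else 0)
            take = min(want, available[i] - quota[i])
            quota[i] += take
            remaining -= take
    return quota
```

For the case in the question, four segments holding 5, 30, 30 and 30 points with a target of 20 each (80 in total), `allocate_quota([5, 30, 30, 30], 80)` returns `[5, 25, 25, 25]`: the sparse segment gives all it has and the other three take 5 extra each.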

I've added some diagrams to visualise what we are trying to achieve.

Yellow dot is the current profile.
Green dots are selected data.
Red dots are data that fall outside our spatial/temporal parameters.

[Diagrams: Current Random Selection; N is 2; N is 3; N is 4; N is 6]

@edsmall-bodc edsmall-bodc added the enhancement New feature or request label Jul 17, 2020
@kamwal kamwal added the ignore-for-release Ignore this for next release label Nov 7, 2022