Ellipse should sub-sample data for random selection #38

edsmall-bodc · 2020-07-17T08:29:15Z

When selecting historical data for comparison, one third of the data is selected randomly. The rest is selected by how well the remaining historical data match the current float profile in space and time.

The random selection is there to avoid choosing data from only one area because of strong correlations. We want to ensure we use strongly correlated data AND enough data to cover the area generally.

However, if we select 1/3 of our data points randomly, we are still at risk of selecting a poor spatial/temporal distribution (though the probability of this is very small). We can increase our chances of getting a good distribution by splitting the ellipse into N parts and randomly selecting data within each of these parts.

We need to discuss how we should go about splitting up the ellipse into chunks. We should try and do this dynamically, so that DMQC operators can decide themselves how many areas the ellipse should be split into.
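As a starting point for that discussion, here is a minimal sketch of one possible split: equal angular sectors around the current profile, with the number of sectors passed in so an operator could choose it. The function name `assign_sector` and its parameters are hypothetical and not part of the existing code. It also treats longitude and latitude as plain Cartesian offsets, so it ignores the ellipse scaling and any wrap-around at the date line.

```python
import math


def assign_sector(lon, lat, centre_lon, centre_lat, n_sectors):
    """Return the index (0 .. n_sectors - 1) of the angular sector a point falls in.

    Hypothetical helper: sectors are equal-angle wedges centred on the
    current profile, counted anticlockwise starting from due east.
    """
    # Angle of the point relative to the centre, mapped into [0, 2*pi).
    angle = math.atan2(lat - centre_lat, lon - centre_lon) % (2 * math.pi)
    sector = int(angle / (2 * math.pi / n_sectors))
    # Guard against floating-point rounding pushing the angle up to exactly 2*pi.
    return min(sector, n_sectors - 1)


# Example: four points around a profile at (0, 0), split into 4 sectors.
points = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
sectors = [assign_sector(lon, lat, 0.0, 0.0, 4) for lon, lat in points]
```

Angular wedges are only one option. Concentric rings, a time-based split, or a combination would fit the same interface, since all the later sampling step needs is a segment index per data point.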

We also need to decide how to allocate the number of selections to each segment. E.g., what if we want 20 random data points from each segment, but one segment only contains 5 data points? How do we redistribute the shortfall among the other segments?
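One possible answer, sketched below under assumptions: take what is available from sparse segments, then share the shortfall as evenly as possible among segments that still have spare points. The names `allocate_quotas` and `sample_segments` are hypothetical, and the even-sharing rule is just one policy we could pick (weighting by segment size would be another).

```python
import random


def allocate_quotas(available, target_per_segment):
    """Decide how many points to draw from each segment.

    available: number of candidate points in each segment.
    target_per_segment: desired number of draws per segment.
    Shortfall from sparse segments is shared among segments with spare points.
    """
    quotas = [min(n, target_per_segment) for n in available]
    shortfall = target_per_segment * len(available) - sum(quotas)
    while shortfall > 0:
        spare = [i for i, n in enumerate(available) if n > quotas[i]]
        if not spare:
            break  # not enough data overall; return what we have
        share = max(1, shortfall // len(spare))
        for i in spare:
            extra = min(share, available[i] - quotas[i], shortfall)
            quotas[i] += extra
            shortfall -= extra
            if shortfall == 0:
                break
    return quotas


def sample_segments(segment_ids, quotas, rng):
    """Randomly pick quotas[s] point indices from each segment s, without replacement."""
    chosen = []
    for seg, k in enumerate(quotas):
        members = [i for i, s in enumerate(segment_ids) if s == seg]
        chosen.extend(rng.sample(members, k))
    return chosen


# The case from above: 20 wanted per segment, but one segment only has 5 points.
quotas = allocate_quotas([5, 50, 50, 50], 20)  # [5, 25, 25, 25]
```

With this rule the total number of random points stays at the requested amount whenever the ellipse contains enough data overall, and falls back to using everything available when it does not.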

I've added some diagrams to visualise what we are trying to achieve.

Yellow dot is the current profile.
Green dots are selected data.
Red dots are data that fall outside our spatial/temporal parameters.