
Trip confidence can be artificially high for single trip clusters #663

Closed · shankari opened this issue Aug 14, 2021 · 18 comments

@shankari (Contributor)

From e-mission/e-mission-eval-private-data#28 (comment)

The current confidence is essentially the same as the h-score, but that is a problem for our expectation design. If a user goes to a new location for the first time by car and the second time by bike, then on the second trip we will have a match with a confidence of 1.0, so we won't even show the trip to the user, and we will never learn that it is a trip that is sometimes taken by bike. So we need to change the confidence calculation to wait until we have k labels before we are confident.

  • I am a bit on the fence about the metrics for this change. Do we put it into the paper or not? It is a useful extension and one that I think is easy to explain without bringing in the second round. We can use the same metrics, but just for algorithms where:
    • confidence = max_p, or
    • confidence = k_max_p

@GabrielKS if I work on #662, can you take this one? It is a lot more straightforward, and it is a correctness issue, so I think we should fix it before a larger scale deployment. And I think you already had an idea of how to implement it?

@shankari changed the title from "Trip confidence can be artificially high" to "Trip confidence can be artificially high for single trip clusters" on Aug 14, 2021
@GabrielKS commented Aug 15, 2021

Discounting Inference Confidence for Low Cluster Sizes

Theory

Currently, we calculate confidence as the fraction of trips in the cluster that correspond to a given label tuple. This works well to assign relative confidence in the various labels — if a user drove a given trip twice in the past and biked it once, it's fair to guess that they're about twice as likely to drive it as bike it in the future. But it doesn't effectively capture the "uncertainty." Here, when I say uncertainty, I mean the difference between the sum of the confidences of all the label tuples and 1. If a user drove a given trip twice in the past and biked it once, it is not fair to guess that there is zero chance they will carpool it in the future.

Thus, our algorithm should work by calculating an uncertainty coefficient, where 1 is no additional uncertainty and 0 is complete uncertainty, and multiplying all of the naïve probability values by it. This uncertainty coefficient should depend in some way on the number of trips in the cluster — larger sample size, less uncertainty. For now, we'll just use the raw number of trips in the cluster, but future work may explore whether it may be more appropriate to incorporate some measure of the homogeneity of the cluster, e.g. by instead using the count of the most frequent label tuple in the cluster.

As sample size n tends towards infinity, what should the uncertainty coefficient u tend towards? Naïvely, we might say 1, but even if we drive a trip 1000 times, we can't rule out the possibility that next time we might bike it. Let's let the constant A be the lingering uncertainty (0=none, 1=all) that can never be removed no matter how large our sample is; I propose a default value of A=0.01.

If we have a sample size of n=1, what should u be? In other words, if we bike a given trip for the first time, how likely are we to bike it the second time? We can answer this question empirically, either in general or for each user, but for now I'll make up a number and call it B: B=0.75.

Finally, how do we get from u=B at n=1 to u~=1-A at n~=∞? To state it more mathily, how much of the lingering (removable) uncertainty should be removed by each additional member of the cluster? I don't know, 10%. Let's let C=0.1.

This uniquely defines the formula

u = (1 - A) - (1 - A - B) * (1 - C)^(n - 1).
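
A quick sketch of this in code (the function name is illustrative; constants as proposed above):

```python
# Uncertainty coefficient for a cluster of size n.
# A: lingering uncertainty that can never be removed
# B: coefficient at n=1 (chance the second trip matches the first)
# C: fraction of the removable uncertainty removed per additional trip
def uncertainty_coeff(n, A=0.01, B=0.75, C=0.1):
    return (1 - A) - (1 - A - B) * (1 - C) ** (n - 1)

# Rises from B = 0.75 at n = 1 toward the asymptote 1 - A = 0.99
print([round(uncertainty_coeff(n), 3) for n in (1, 2, 5, 10, 100)])
```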

Practice

The clustering-based prediction algorithm exists as a trip->labels function in inferrers.py. This makes it easy to implement a function that calls this one and returns a copy of the labels data structure with every probability value multiplied by the calculated u. For now, we'll hard-code the constants, but future work might put them in a file or compute them dynamically or something.
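
A minimal sketch of that wrapper, assuming the labels data structure is a list of dicts each carrying a probability under "p" and that the base predictor also exposes the cluster size (both assumptions; the names here are illustrative, not the actual inferrers.py API):

```python
import copy

def predict_with_discounting(trip, base_predictor, n_to_coeff):
    """Run the cluster-based predictor, then multiply every probability
    by the uncertainty coefficient for the matched cluster's size."""
    labels, cluster_size = base_predictor(trip)  # assumed to also return cluster size
    coeff = n_to_coeff(cluster_size)
    discounted = copy.deepcopy(labels)  # don't mutate the original structure
    for entry in discounted:
        entry["p"] *= coeff
    return discounted
```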

@shankari (Contributor, Author) commented Aug 15, 2021

@GabrielKS this makes sense to me, but I am concerned that it is not very explainable. The program admins do ask us questions about how the algorithm works and the current explanation is fairly simple - we look at your prior trips and use them for the probability. At our Friday meeting, Jeanne actually asked me if she could see the probabilities for the labels from the UI 😄

I think that an explanation similar to "we need 'k' trips before the cluster counts" would also be understandable; I'm afraid this formula-based version will make people's eyes glaze over.

Is there a way to explain this in simple terms for somebody like Jeanne to understand? If not, I would prefer a simpler version even if it is not sufficiently mathy.

@GabrielKS

Here's an attempt at explaining the above (I changed C to 0.25) to a lay audience:

As you know, we try to predict a new trip's labels by matching the trip data to previous trips. For instance, if you bike to work every day, after a while we can pick up on that pattern. If you sometimes bike to work and sometimes drive to work, we can learn that too and predict whichever is more frequent, but with a reduced confidence so the app prompts you to verify it. However, if you bike somewhere for the first time, we don't know whether you will bike or take some other form of transportation the second time you go there.

The solution is the same: to reduce the confidence. We say we are 75% sure that you will take the same mode of transportation next time, so next time we will predict "bike" but have you verify it. If you bike again the second time, we are more sure (81%) that you will bike again the third time. The confidence keeps increasing every trip you take: after seven trips, we are 95% sure you will bike again the eighth time. Of course, no matter how many trips you take, you could always decide to take a new mode of transportation the next time, so our confidence score maxes out at 99%.

The last sentence may be omitted for simplicity, and much of the first paragraph may be omitted if the user is already familiar with how the status quo works.
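
For what it's worth, the percentages in this explanation do fall out of the formula with A=0.01, B=0.75, C=0.25 (a quick check using the sketch function from above):

```python
def uncertainty_coeff(n, A=0.01, B=0.75, C=0.25):
    return (1 - A) - (1 - A - B) * (1 - C) ** (n - 1)

for n in (1, 2, 7):
    print(n, round(uncertainty_coeff(n), 2))
# n=1 -> 0.75 ("75% sure"), n=2 -> 0.81 ("81%"),
# n=7 -> 0.95 ("after seven trips ... 95% sure"),
# and the asymptote 1 - A = 0.99 is the "maxes out at 99%"
```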

@GabrielKS

For now, I will implement the algorithm as planned, and if we need to simplify the formula it will be a simple one-line change.

GabrielKS added a commit to GabrielKS/e-mission-server that referenced this issue Aug 16, 2021
 + Implements the algorithm described in e-mission/e-mission-docs#663 (comment)
 + Uses constant values `A=0.01`, `B=0.75`, `C=0.25`
 + Changes eacilp.primary_algorithms to use the new algorithm

No unit tests yet (working on a tight timeline), but tested as follows:
 1. Run `[eacili.n_to_confidence_coeff(n) for n in [1,2,3,4,5,7,10,15,20,30,1000]]`, check for reasonableness, compare to results from plugging the formula into a calculator
 2. Run the modeling and intake pipeline with `eacilp.primary_algorithms` set to the old algorithm
 3. Run the first few cells of the "Explore label inference confidence" notebook for a user with many inferrable trips to get a list of unique probabilities and counts for that user
 4. Set `eacilp.primary_algorithms` to the new algorithm, rerun the intake pipeline (modeling pipeline hasn't changed)
 5. Rerun the notebook as above, examine how the list of probabilities and counts has changed
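
Step 1 above, expanded into a quick monotonicity check (the import path is an assumption based on the `eacili` alias; adjust as needed):

```python
# eacili alias as used in step 1; exact module path assumed
import emission.analysis.classification.inference.labels.inferrers as eacili

ns = [1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 1000]
coeffs = [eacili.n_to_confidence_coeff(n) for n in ns]
print(list(zip(ns, [round(c, 3) for c in coeffs])))
# The coefficient should increase monotonically from B toward 1 - A
assert all(a <= b for a, b in zip(coeffs, coeffs[1:]))
assert all(c <= 1 for c in coeffs)
```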
@GabrielKS commented Aug 17, 2021

Here are some results from playing with the A,B,C constants:
[Box plot comparison]
This takes all trips from stage users who have at least 30 trips, removes those for which there is no prediction, and graphs the distribution of probabilities of the most likely inferences after they have been discounted by the algorithm described above. The numbers in parentheses denote the values of the constants labeled A, B, and C above, except that we show 1-A instead of A.

@shankari (Contributor, Author)

My reading of that result is that the metric is not very sensitive to A, but is sensitive to both B and C. Given the difference between the naive approach and the default, it seems like bumping up B and C (e.g. B = 0.9, C = 0.33) would help improve the upper quantiles. Or is there a domain-specific reason for picking those values of B and C?

@shankari (Contributor, Author) commented Aug 17, 2021

On the bright side, though, it looks like the lower quantiles are not really affected by any of the settings, which makes sense given the change. So I don't think that there will be a visible impact in the number of yellow labels, only on the trigger for inclusion in "To Label".

```json
{
    "trigger": 0.95,
    "expect": {
        "type": "all"
    },
    "notify": {
        "type": "dayEnd"
    }
}
```

We may want to drop the trigger to 0.90, or add a second, relaxed round in which we drop it to something like 0.75, or both. Let's see what the higher B and C values show...

@GabrielKS

> My reading of that result is that the metric is not very sensitive to A, but is sensitive to both B and C. Given the difference between the naive approach and the default, it seems like bumping up B and C (e.g. B = 0.9, C = 0.33) would help improve the upper quantiles. Or is there a domain-specific reason for picking those values of B and C?

The value of B represents the chance that the user will take the same mode of transportation the second time they take a trip as the first time. As I note above, this is a question we can answer empirically, but for now both 0.75 and 0.9 seem reasonable. The value of C also has a domain-specific meaning, but it's a bit more abstract.

In any case, new box plots with the suggested values, now with gridlines:
[New box plots]

@shankari (Contributor, Author)

I think I like "high_b_and_c" better. Basically, the 1.0 probability has just moved down a bit to around 0.9 and the rest of it is unchanged. I like it a lot better than default, where the quantile moves down to below 0.8.

I saw your note about empirically determining B, and I think it is a great idea, although clearly out of scope at this time. Given that we are basically guessing B at this point, I would prefer a value that changes as little as possible from the beta-tested solution. Thoughts?

@GabrielKS

That makes sense. We should just be mindful of how it interacts with the upper threshold. Under high_b_and_c with a high confidence threshold of 0.95, a common trip no longer needs to be user-verified once the labels are the same for four trips, whereas under default it's eight trips. If we drop the trigger to 0.90, it's two and five; at 0.75, it's one and two. This suggests we should be careful to avoid lowering the threshold too much if we have high B and C.
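
These counts can be reproduced directly from the formula; a sketch, with "default" as B=0.75, C=0.25 and "high_b_and_c" as B=0.9, C=0.33 per the comments above:

```python
def uncertainty_coeff(n, A, B, C):
    return (1 - A) - (1 - A - B) * (1 - C) ** (n - 1)

def trips_to_exceed(threshold, A, B, C):
    """Smallest homogeneous-cluster size whose coefficient strictly
    exceeds the threshold (assumes threshold < 1 - A)."""
    n = 1
    while uncertainty_coeff(n, A, B, C) <= threshold:
        n += 1
    return n

for threshold in (0.95, 0.90, 0.75):
    high = trips_to_exceed(threshold, 0.01, 0.90, 0.33)
    default = trips_to_exceed(threshold, 0.01, 0.75, 0.25)
    print(threshold, high, default)  # -> 4 vs 8, 2 vs 5, 1 vs 2
```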

@GabrielKS

Another way to do this would be to decide how many occurrences of a common trip we want the user to have labeled or confirmed before we take it out of their hands completely, and then work backwards.

@shankari (Contributor, Author)

I like the approach of deciding the occurrences and working backwards because, as I indicated in #663 (comment), that is easier to explain.

I think that number of trips that need to be labeled = 3 is a reasonable starting point.

@GabrielKS

I propose the values A=0.01, B=0.80, C=0.30. Gory details in the notebook I'm about to push, but the gist is that this is what works best if we reverse-engineer from how many trips we want to trigger various reasonably configured thresholds in intensive and relaxed modes.

The obligatory box plots:
[More box plots]

@GabrielKS

Here are said gory details. I also propose configuration values:

  • Relaxed mode: low confidence threshold: 0.40; high confidence threshold: 0.89
  • Intensive mode: low confidence threshold: 0.60; high confidence threshold: 0.99
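
A sketch of how these constants and the relaxed high threshold cash out as occurrence counts (assuming confidence = coefficient × naive probability, and a strict comparison against the threshold):

```python
def uncertainty_coeff(n, A=0.01, B=0.80, C=0.30):
    return (1 - A) - (1 - A - B) * (1 - C) ** (n - 1)

HIGH_RELAXED = 0.89

# Homogeneous cluster: naive probability is 1, so confidence is the coefficient.
n = 1
while uncertainty_coeff(n) <= HIGH_RELAXED:
    n += 1
print(n)  # 3 identical labeled trips; the 4th occurrence then needs no interaction

# Mixed cluster: one trip labeled X, then k labeled Y,
# so Y's naive probability is k / (k + 1).
k = 1
while uncertainty_coeff(k + 1) * k / (k + 1) <= HIGH_RELAXED:
    k += 1
print(k)  # 10 occurrences of Y before Y is predicted without interaction
```

These line up with the R2 and R4 figures quoted from the notebook in the next comment.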

@shankari (Contributor, Author) commented Aug 18, 2021

@GabrielKS all that looks good, except that I don't understand this inconsistency.

Relaxed:

> R2. 4 occurrences of a common trip before the user doesn't need to interact with it at all
> R4. If the first occurrence of a common trip has label tuple X, we need 10 occurrences with label tuple Y before we predict label tuple Y and don't require any user interaction

Why don't we need 4 occurrences in R4, similar to R2? (R3 has the same value as R1: 1 occurrence before a yellow trip.)

@GabrielKS

My thinking was that four occurrences of a common trip, all with the same labels, are enough for us to assume (in relaxed mode) that the labels will always be the same; but if we have an example of the labels being different, we need more data to convince ourselves that the labels have gone back to being all the same. (Also, I3 != I1.)

@shankari (Contributor, Author)

OK, makes sense. Let's go ahead and deploy with these settings, and maybe spend some time this afternoon going over these assumptions with @andyduvall.

GabrielKS added a commit to GabrielKS/e-mission-server that referenced this issue Aug 18, 2021
@GabrielKS

corinne-hcr/e-mission-server#4, incorporated into e-mission/e-mission-server#829, closes this issue for now.
