
Analysis of label inference confidence #656

GabrielKS opened this issue Aug 3, 2021 · 17 comments

To be able to make an informed decision on what confidence threshold we should use in the staging test of the new Label UI, I did some analysis of what label inference confidence looked like in the staging data. I first chose an individual user known to label many of their trips. Looking only at labeled trips, I found that the most likely inferences generated (i.e., the label tuple with the highest probability in the inference data structure) fell into 7 buckets. Here, trips with an empty inference data structure were counted as having a probability of 0.

Inferred p: number of trips
1.000: 154
0.667: 3
0.500: 6
0.407: 27
0.364: 12
0.286: 7
0.000: 67

I then compared these stated probability values to the fraction of trips in each bucket for which the inference actually matched the user labels:

Inferred p: fraction correct
1.000: 0.994
0.667: 0.667
0.500: 0.500
0.407: 0.407
0.364: 0.333
0.286: 0.286
0.000: 0.000

Presumably the correspondence is so close because the clustering algorithm was trained on this very data.
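
A minimal sketch of how this bucketing and accuracy comparison could be computed, assuming each confirmed trip dict carries a hypothetical `inferred_labels` list of `{'labels': ..., 'p': ...}` entries (the format shown later in this thread) and a `user_input` dict of the user's labels; this is an illustration, not the notebook's actual code:

```python
from collections import defaultdict

def most_likely_inference(trip):
    """Return (labels, p) of the most likely inference, or (None, 0) if the
    trip has an empty inference data structure."""
    inferences = trip.get("inferred_labels", [])
    if not inferences:
        return None, 0
    best = max(inferences, key=lambda e: e["p"])
    return best["labels"], best["p"]

def bucket_accuracy(labeled_trips):
    """Bucket labeled trips by the probability of their most likely inference and
    compute the fraction of each bucket whose inference matches the user labels."""
    counts, correct = defaultdict(int), defaultdict(int)
    for trip in labeled_trips:
        labels, p = most_likely_inference(trip)
        p = round(p, 3)
        counts[p] += 1
        if labels is not None and labels == trip["user_input"]:
            correct[p] += 1
    return {p: (counts[p], correct[p] / counts[p]) for p in sorted(counts, reverse=True)}
```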

I then did some analysis on all confirmed trips across all users in the staging dataset. 2046 of these confirmed trips were fully labeled by users and had the following inference probability distribution:

Inferred p: number of trips (percentage of trips)
1.000: 748  (36.56%)
0.900: 10   (0.49%)
0.875: 8    (0.39%)
0.833: 20   (0.98%)
0.714: 2    (0.10%)
0.667: 15   (0.73%)
0.600: 6    (0.29%)
0.571: 6    (0.29%)
0.556: 7    (0.34%)
0.500: 62   (3.03%)
0.429: 22   (1.08%)
0.407: 27   (1.32%)
0.400: 15   (0.73%)
0.364: 12   (0.59%)
0.333: 10   (0.49%)
0.308: 20   (0.98%)
0.286: 7    (0.34%)
0.269: 21   (1.03%)
0.235: 15   (0.73%)
0.000: 1013 (49.51%)

3155 confirmed trips were fully unlabeled by users and had the following inference probability distribution:

Inferred p: number of trips (percentage of trips)
1.000: 149  (4.72%)
0.900: 8    (0.25%)
0.875: 9    (0.29%)
0.833: 19   (0.60%)
0.714: 1    (0.03%)
0.600: 56   (1.77%)
0.556: 2    (0.06%)
0.500: 67   (2.12%)
0.429: 138  (4.37%)
0.400: 2    (0.06%)
0.333: 1    (0.03%)
0.286: 1    (0.03%)
0.000: 2702 (85.64%)

142 confirmed trips were partially labeled by users — i.e., the user filled in some of the labels for the trip but not all of them.

Here are some graphs visualizing this data:
Probability distribution of most-likely inferences, full range
Probability distribution of most-likely inferences excluding 0 and 1
Probability distribution of most-likely inferences only 0 and 1

From the graphs, we see that a significant fraction (85.6%) of the unlabeled trips have no inference at all, more so than for labeled trips (49.5%). There are also more labeled trips with 100% certainty (36.6%) than unlabeled (4.7%). However, aside from these endpoints, the trend is reversed — unlabeled trips tend to cluster towards the middle and upper end of the probability spectrum, whereas labeled trips are more evenly distributed.

shankari commented Aug 3, 2021

First, these numbers are likely wrong: although I built the model with a radius of 500
https://github.com/e-mission/e-mission-server/pull/829/files#diff-5281eebce1b462a2a39465cd785e4f36572ec618a3a79efbdfdf35fc508a9c90R64

I forgot to change the radius for the prediction, which is still 100
https://github.com/e-mission/e-mission-server/pull/829/files#diff-18a304dace1163481f6faf1cd707af237b2d6766b4508e73caeacb1f51056b48R65

We should re-analyse after doing that.
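
This kind of mismatch is easy to introduce when the build step and the predict step each hard-code their own radius. A hedged sketch of the underlying idea, with all names hypothetical (this is not the actual e-mission-server code): derive both stages from a single constant so they cannot drift apart.

```python
import math

# Hypothetical constant shared by model building and prediction, so the two
# stages cannot silently use different radii (e.g. 500 m at build time vs.
# 100 m at predict time, as in the bug described above).
SIMILARITY_RADIUS_M = 500

def within_radius(point_a, point_b, radius_m=SIMILARITY_RADIUS_M):
    """True if two (lat, lon) points are within radius_m meters (haversine distance)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*point_a, *point_b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h)) <= radius_m
```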

shankari commented Aug 3, 2021

To re-run after changing the radius locally, use
e-mission/e-mission-server#829 (comment)

shankari commented Aug 3, 2021

First, I don't think that the results on the labeled data are meaningful, because we are essentially testing the model on its training data. From the aggregate results, focusing only on the unlabeled trips, it seems like we want a threshold of something like 0.4? That would ensure that most trips with inferred labels actually show them instead of being converted to red labels, while still filtering out the very low quality inferences.
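
A hedged sketch of what applying such a confidence threshold might look like on the UI side, reusing the hypothetical `inferred_labels` format from later in this thread; the function name and fallback behaviour are assumptions, not the actual Label UI code:

```python
CONFIDENCE_THRESHOLD = 0.4  # value under discussion above

def labels_to_display(trip, threshold=CONFIDENCE_THRESHOLD):
    """Return the most likely inferred label tuple if its probability clears the
    threshold; otherwise return None so the UI falls back to the red 'to label' state."""
    inferences = trip.get("inferred_labels", [])
    if not inferences:
        return None
    best = max(inferences, key=lambda e: e["p"])
    return best["labels"] if best["p"] >= threshold else None
```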

shankari commented Aug 3, 2021

Thinking out loud: we have a lot of unlabeled trips with no inferences, and, based on e-mission/e-mission-eval-private-data#28 (comment), there is significant variability in the labeling % across users, primarily determined by how many labeled trips we already have for each of them.

So we may want to briefly plot out a per-user histogram, or focus only on users with more than 50-100 trips, because if people have too few trips we can tell them up front that we won't be able to predict.

Or maybe have two different distributions, one for people with > 20% labeled and one for < 20% labeled, etc.
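
A rough sketch of that per-user split, assuming a hypothetical `trips_by_user` mapping from user id to that user's confirmed trips; the 20% cutoff and 50-trip minimum come from the comment above:

```python
def split_users_by_label_rate(trips_by_user, min_trips=50, labeled_cutoff=0.2):
    """Partition users into high- and low-labeling groups, setting aside users
    with too few trips for prediction to be feasible."""
    high, low, too_few = [], [], []
    for user_id, trips in trips_by_user.items():
        if len(trips) < min_trips:
            too_few.append(user_id)
            continue
        labeled_fraction = sum(1 for t in trips if t.get("user_input")) / len(trips)
        (high if labeled_fraction > labeled_cutoff else low).append(user_id)
    return high, low, too_few
```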

shankari commented Aug 3, 2021

@GabrielKS one challenge with configuring a 0.4 threshold is the same_mode issue. We used to allow same_mode as an option for the replaced mode, and at least some of the time the percentage drops below 0.4 because the probability gets split across otherwise-identical label tuples - e.g. see below.

[{'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'home',
   'replaced_mode': 'drove_alone'},
  'p': 0.3333333333333333},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'home',
   'replaced_mode': 'same_mode'},
  'p': 0.08333333333333333},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'drove_alone'},
  'p': 0.16666666666666666},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'same_mode'},
  'p': 0.08333333333333333},
 {'labels': {'mode_confirm': 'shared_ride',
   'purpose_confirm': 'home',
   'replaced_mode': 'drove_alone'},
  'p': 0.25},
 {'labels': {'mode_confirm': 'shared_ride',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'drove_alone'},
  'p': 0.08333333333333333}]

Remapping same_mode to the same value as mode_confirm should solve that problem.

I removed the same_mode label from the UI on June 5th
e-mission/e-mission-phone@1806e6c#diff-8f7e0cbf2ba6bd210c65bfcac14614c6fabfd3bd95b99f6d2974c615ddcef159

So none of the actual participants would ever have selected same_mode.

But we should still handle it on the staging server before tuning.
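
A hedged sketch of that remapping over the inference list format shown above: rewrite a same_mode replaced_mode to the tuple's mode_confirm, then merge the now-duplicate tuples by summing their probabilities. In the example above, drove_alone/home/drove_alone would rise from p = 1/3 to 1/3 + 1/12 = 5/12 ≈ 0.417. This is an illustration, not the server's actual implementation:

```python
from collections import defaultdict

def collapse_same_mode(inferences):
    """Map replaced_mode == 'same_mode' to the tuple's mode_confirm and merge
    duplicate label tuples by summing their probabilities."""
    merged = defaultdict(float)
    for entry in inferences:
        labels = dict(entry["labels"])
        if labels.get("replaced_mode") == "same_mode":
            labels["replaced_mode"] = labels["mode_confirm"]
        merged[tuple(sorted(labels.items()))] += entry["p"]
    return [{"labels": dict(key), "p": p} for key, p in merged.items()]
```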

@GabrielKS

The Jupyter Notebook I used for analysis is now pushed (without the results) to https://github.com/GabrielKS/e-mission-eval-private-data/tree/inference_confidence_analysis.

GabrielKS commented Aug 3, 2021

Redoing the inference myself with the existing 100m radius threshold, I get slightly different numbers, perhaps because there were trips added to the dataset after inference had been run:

Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
  1.000: 718  (37.20%)
  0.900: 10   (0.52%)
  0.875: 8    (0.41%)
  0.833: 20   (1.04%)
  0.714: 2    (0.10%)
  0.667: 12   (0.62%)
  0.571: 6    (0.31%)
  0.556: 7    (0.36%)
  0.500: 63   (3.26%)
  0.429: 12   (0.62%)
  0.407: 27   (1.40%)
  0.400: 25   (1.30%)
  0.367: 20   (1.04%)
  0.333: 22   (1.14%)
  0.300: 21   (1.09%)
  0.286: 7    (0.36%)
  0.235: 15   (0.78%)
  0.000: 935  (48.45%)
}

Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
  1.000: 89   (3.40%)
  0.900: 8    (0.31%)
  0.875: 9    (0.34%)
  0.833: 19   (0.73%)
  0.714: 1    (0.04%)
  0.556: 2    (0.08%)
  0.500: 66   (2.52%)
  0.429: 57   (2.18%)
  0.400: 83   (3.17%)
  0.333: 1    (0.04%)
  0.286: 1    (0.04%)
  0.000: 2284 (87.18%)
}

Changing the 100m radius threshold to 500m, I get significantly different results:

Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
  1.000: 751  (39.69%)
  0.917: 12   (0.63%)
  0.900: 10   (0.53%)
  0.875: 8    (0.42%)
  0.833: 24   (1.27%)
  0.800: 10   (0.53%)
  0.786: 14   (0.74%)
  0.778: 9    (0.48%)
  0.714: 7    (0.37%)
  0.700: 10   (0.53%)
  0.692: 13   (0.69%)
  0.667: 45   (2.38%)
  0.600: 15   (0.79%)
  0.583: 23   (1.22%)
  0.571: 6    (0.32%)
  0.556: 27   (1.43%)
  0.545: 11   (0.58%)
  0.531: 23   (1.22%)
  0.500: 179  (9.46%)
  0.455: 11   (0.58%)
  0.452: 42   (2.22%)
  0.429: 21   (1.11%)
  0.419: 42   (2.22%)
  0.407: 27   (1.43%)
  0.400: 30   (1.59%)
  0.393: 159  (8.40%)
  0.375: 8    (0.42%)
  0.367: 30   (1.59%)
  0.360: 25   (1.32%)
  0.333: 78   (4.12%)
  0.312: 32   (1.69%)
  0.300: 40   (2.11%)
  0.294: 17   (0.90%)
  0.286: 7    (0.37%)
  0.250: 58   (3.07%)
  0.235: 15   (0.79%)
  0.222: 6    (0.32%)
  0.200: 18   (0.95%)
  0.000: 29   (1.53%)
}

Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
  1.000: 172  (7.06%)
  0.917: 1    (0.04%)
  0.900: 8    (0.33%)
  0.875: 9    (0.37%)
  0.833: 20   (0.82%)
  0.800: 4    (0.16%)
  0.786: 17   (0.70%)
  0.750: 3    (0.12%)
  0.714: 11   (0.45%)
  0.667: 2    (0.08%)
  0.583: 3    (0.12%)
  0.556: 12   (0.49%)
  0.531: 8    (0.33%)
  0.500: 53   (2.18%)
  0.452: 6    (0.25%)
  0.429: 66   (2.71%)
  0.419: 76   (3.12%)
  0.400: 102  (4.19%)
  0.393: 2    (0.08%)
  0.360: 5    (0.21%)
  0.333: 75   (3.08%)
  0.300: 1    (0.04%)
  0.294: 4    (0.16%)
  0.286: 1    (0.04%)
  0.250: 35   (1.44%)
  0.000: 1740 (71.43%)
}

Graphs for 500m:
Probability distribution of most-likely inferences, full range
Probability distribution of most-likely inferences excluding 0 and 1
Probability distribution of most-likely inferences only 0 and 1

544 more unlabeled trips now have some sort of inference, and the sharp cutoff at 0.4 is no longer nearly as pronounced.

shankari commented Aug 4, 2021

From this, it looks like the threshold should be 0.25, but then we will basically not exclude anything. Do you get different results on a per-user basis, or when only looking at users with lots of trips?

shankari commented Aug 4, 2021

wrt:

But we should handle it on the staging server before tuning.

the obvious fix would be to change the inputs in the database and re-run the pipeline.
An alternate solution would be to fix it in the code, but that would require special casing the handling of replaced mode instead of working with user inputs generically.

shankari commented Aug 4, 2021

There is existing code to remap a same_mode replaced mode (https://github.com/e-mission/e-mission-server/pull/829/files#diff-c7ece2e6b65a06d6fd262e2ca047f676b4050b865a1b2a1b3f91a85b72ca5460R48), but unfortunately it is only called from the second round for now. And of course, it is specific to the replaced mode. We could re-introduce that for now instead of modifying the user inputs.

@GabrielKS

The middle graph, if we only consider users who have labeled at least 10 or 20 trips (15/39 = 38% of users; there are no users who have labeled between 10 and 20 trips):
User labeled at least 10 or 20 trips
If we only consider those who have labeled at least 50 trips (9/39=23% of users):
User labeled at least 50 trips
I don't see that this suggests any obvious way forward.
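
A small sketch of the filtering behind these restricted graphs, assuming the same hypothetical `trips_by_user` mapping as in the earlier sketch:

```python
def trips_for_active_labelers(trips_by_user, min_labeled=50):
    """Keep only trips belonging to users who have labeled at least min_labeled trips."""
    kept = []
    for trips in trips_by_user.values():
        if sum(1 for t in trips if t.get("user_input")) >= min_labeled:
            kept.extend(trips)
    return kept
```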

shankari added a commit to corinne-hcr/e-mission-server that referenced this issue Aug 4, 2021
So that we can collapse the different categories and increase the percentage
e-mission/e-mission-docs#656 (comment)

+ also check to see whether the replacement mode actually exists before
performing the mapping since we are checking this into master and it may not
always exist.

And even on CanBikeCO, we may have users who have never filled in a replaced mode.

shankari commented Aug 4, 2021

@GabrielKS back-of-the-envelope estimate of the difference:
e-mission/e-mission-eval-private-data#28 (comment)

Some actual values are:

    cluster_label  before_unique_combo_len  after_unique_combo_len  before_max_p  after_max_p
25              0                        10                       9      0.423077     0.615385
26              1                         5                       4      0.545455     0.727273
27              2                         6                       4      0.4          0.5
29              4                         6                       5      0.166667     0.333333
32              7                         2                       1      0.666667     1

shankari commented Aug 4, 2021

@GabrielKS If the out-and-back errors are common, both on staging and on the real deployments, I think I might be able to come up with a way to fix at least that pattern automatically. But it will take me ~ 3-4 days with no time for anything else.

Detecting the pattern (as opposed to fixing it) is much easier - I've basically already implemented it.

I think it might be worthwhile to work this into the expectation code somehow, maybe to mark trips as "double check". Let's discuss at today's meeting.

shankari added a commit to corinne-hcr/e-mission-server that referenced this issue Aug 4, 2021
This ensures that we can match trips properly
(for fix, see e-mission/e-mission-docs#656 (comment))

@GabrielKS

The middle graph, with a threshold of 50 labeled trips, after the same_mode mapping was re-enabled:
[graph]

shankari commented Aug 4, 2021

Although this is going to make the threshold meaningless at this point, I think we should go with 25% as the threshold, i.e. show all the trips. This is because although the number of trips affected is low, the number of trips for which we have inferences at all is also low. I think it is more important to give people the sense that we're doing something than to be perfect on accuracy. We can always tune this after the first two weeks if needed, although analysing those results will be future work.

shankari commented Aug 4, 2021

Just for the record, the values in here don't seem to change much
#656 (comment)

but when I did the side-by-side comparison (with boxplots), I got a pretty significant change:
e-mission/e-mission-eval-private-data#28 (comment)

One difference between the two is that @GabrielKS is looking at the matched inferences, while I am looking at the clusters in the model. So maybe this is skewed by the fact that there aren't a lot of matches?

shankari commented Aug 4, 2021

So using the actual inferences lets us see what the impact on this particular set of users and trip history would be. Looking at the clusters directly, we get what could happen if we had better matching.

[graph]

But that also seems to argue for something between 20% and 40%, so I am happy with 25%.
