
Analysis of label inference confidence #656

GabrielKS opened this issue Aug 3, 2021 · 17 comments

To be able to make an informed decision on what confidence threshold we should use in the staging test of the new Label UI, I did some analysis of what label inference confidence looked like in the staging data. I first chose an individual user known to label many of their trips. Looking only at labeled trips, I found that the most likely inferences generated (i.e., the label tuple with the highest probability in the inference data structure) fell into 7 buckets. Here, trips with an empty inference data structure were counted as having a probability of 0.

Inferred p: number of trips
1.000: 154
0.667: 3
0.500: 6
0.407: 27
0.364: 12
0.286: 7
0.000: 67

I then compared these stated probability values to the fraction of trips in each bucket for which the inference actually matched the user labels:

Inferred p: fraction correct
1.000: 0.994
0.667: 0.667
0.500: 0.500
0.407: 0.407
0.364: 0.333
0.286: 0.286
0.000: 0.000

Presumably the correspondence is so close because the clustering algorithm was trained on this very data.
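
A minimal sketch of how this bucketing and accuracy comparison could be computed, assuming each confirmed trip dict carries a hypothetical `inferred_labels` list of `{'labels': ..., 'p': ...}` entries (the format shown later in this thread) and a `user_input` dict of the user's labels; this is an illustration, not the notebook's actual code:

```python
from collections import defaultdict

def most_likely_inference(trip):
    """Return (labels, p) of the most likely inference, or (None, 0) if the
    trip has an empty inference data structure."""
    inferences = trip.get("inferred_labels", [])
    if not inferences:
        return None, 0
    best = max(inferences, key=lambda e: e["p"])
    return best["labels"], best["p"]

def bucket_accuracy(labeled_trips):
    """Bucket labeled trips by the probability of their most likely inference and
    compute the fraction of each bucket whose inference matches the user labels."""
    counts, correct = defaultdict(int), defaultdict(int)
    for trip in labeled_trips:
        labels, p = most_likely_inference(trip)
        p = round(p, 3)
        counts[p] += 1
        if labels is not None and labels == trip["user_input"]:
            correct[p] += 1
    return {p: (counts[p], correct[p] / counts[p]) for p in sorted(counts, reverse=True)}
```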

I then did some analysis on all confirmed trips across all users in the staging dataset. 2046 of these confirmed trips were fully labeled by users and had the following inference probability distribution:

Inferred p: number of trips (percentage of trips)
1.000: 748  (36.56%)
0.900: 10   (0.49%)
0.875: 8    (0.39%)
0.833: 20   (0.98%)
0.714: 2    (0.10%)
0.667: 15   (0.73%)
0.600: 6    (0.29%)
0.571: 6    (0.29%)
0.556: 7    (0.34%)
0.500: 62   (3.03%)
0.429: 22   (1.08%)
0.407: 27   (1.32%)
0.400: 15   (0.73%)
0.364: 12   (0.59%)
0.333: 10   (0.49%)
0.308: 20   (0.98%)
0.286: 7    (0.34%)
0.269: 21   (1.03%)
0.235: 15   (0.73%)
0.000: 1013 (49.51%)

3155 confirmed trips were fully unlabeled by users and had the following inference probability distribution:

Inferred p: number of trips (percentage of trips)
1.000: 149  (4.72%)
0.900: 8    (0.25%)
0.875: 9    (0.29%)
0.833: 19   (0.60%)
0.714: 1    (0.03%)
0.600: 56   (1.77%)
0.556: 2    (0.06%)
0.500: 67   (2.12%)
0.429: 138  (4.37%)
0.400: 2    (0.06%)
0.333: 1    (0.03%)
0.286: 1    (0.03%)
0.000: 2702 (85.64%)

142 confirmed trips were partially labeled by users — i.e., the user filled in some of the labels for the trip but not all of them.

Here are some graphs visualizing this data:
Probability distribution of most-likely inferences, full range
Probability distribution of most-likely inferences excluding 0 and 1
Probability distribution of most-likely inferences only 0 and 1

From the graphs, we see that a significant fraction (85.6%) of the unlabeled trips have no inference at all, more so than for labeled trips (49.5%). There are also more labeled trips with 100% certainty (36.6%) than unlabeled (4.7%). However, aside from these endpoints, the trend is reversed — unlabeled trips tend to cluster towards the middle and upper end of the probability spectrum, whereas labeled trips are more evenly distributed.

shankari commented Aug 3, 2021

First, these numbers are likely wrong: although I built the model with a radius of 500
https://github.com/e-mission/e-mission-server/pull/829/files#diff-5281eebce1b462a2a39465cd785e4f36572ec618a3a79efbdfdf35fc508a9c90R64

I forgot to change the radius for the prediction, which is still 100
https://github.com/e-mission/e-mission-server/pull/829/files#diff-18a304dace1163481f6faf1cd707af237b2d6766b4508e73caeacb1f51056b48R65

We should re-analyse after doing that.
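
This kind of mismatch is easy to introduce when the build step and the predict step each hard-code their own radius. A hedged sketch of the underlying idea, with all names hypothetical (this is not the actual e-mission-server code): derive both stages from a single constant so they cannot drift apart.

```python
import math

# Hypothetical constant shared by model building and prediction, so the two
# stages cannot silently use different radii (e.g. 500 m at build time vs.
# 100 m at predict time, as in the bug described above).
SIMILARITY_RADIUS_M = 500

def within_radius(point_a, point_b, radius_m=SIMILARITY_RADIUS_M):
    """True if two (lat, lon) points are within radius_m meters (haversine distance)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*point_a, *point_b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h)) <= radius_m
```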

shankari commented Aug 3, 2021

To re-run after changing the radius locally, use
e-mission/e-mission-server#829 (comment)

shankari commented Aug 3, 2021

First, I don't think that the results on the labeled data are meaningful, because we are essentially testing the model on its training data. From the aggregate results, focusing only on the unlabeled trips, it seems like we want a threshold of something like 0.4? That would ensure that most trips with inferred labels actually show them instead of being converted to red labels, while still filtering out the very low quality inferences.
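
A hedged sketch of what applying such a confidence threshold might look like on the UI side, reusing the hypothetical `inferred_labels` format from later in this thread; the function name and fallback behaviour are assumptions, not the actual Label UI code:

```python
CONFIDENCE_THRESHOLD = 0.4  # value under discussion above

def labels_to_display(trip, threshold=CONFIDENCE_THRESHOLD):
    """Return the most likely inferred label tuple if its probability clears the
    threshold; otherwise return None so the UI falls back to the red 'to label' state."""
    inferences = trip.get("inferred_labels", [])
    if not inferences:
        return None
    best = max(inferences, key=lambda e: e["p"])
    return best["labels"] if best["p"] >= threshold else None
```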

shankari commented Aug 3, 2021

Thinking out loud: we have a lot of unlabeled trips with no inferences, and, based on e-mission/e-mission-eval-private-data#28 (comment), there is significant variability in the labeling % across users, primarily determined by how many labeled trips we already have for each of them.

So we may want to briefly plot out a per-user histogram, or focus only on users with more than 50-100 trips, because if people have too few trips we can tell them up front that we won't be able to predict.

Or maybe have two different distributions, one for people with > 20% labeled and one for < 20% labeled, etc.
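
A rough sketch of that per-user split, assuming a hypothetical `trips_by_user` mapping from user id to that user's confirmed trips; the 20% cutoff and 50-trip minimum come from the comment above:

```python
def split_users_by_label_rate(trips_by_user, min_trips=50, labeled_cutoff=0.2):
    """Partition users into high- and low-labeling groups, setting aside users
    with too few trips for prediction to be feasible."""
    high, low, too_few = [], [], []
    for user_id, trips in trips_by_user.items():
        if len(trips) < min_trips:
            too_few.append(user_id)
            continue
        labeled_fraction = sum(1 for t in trips if t.get("user_input")) / len(trips)
        (high if labeled_fraction > labeled_cutoff else low).append(user_id)
    return high, low, too_few
```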

shankari commented Aug 3, 2021

@GabrielKS one challenge with configuring a 0.4 threshold is the same_mode issue. We used to allow same_mode as an option for the replaced mode, and at least some of the time the percentage drops below 0.4 because the probability gets split across otherwise-identical label tuples - e.g. see below.

[{'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'home',
   'replaced_mode': 'drove_alone'},
  'p': 0.3333333333333333},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'home',
   'replaced_mode': 'same_mode'},
  'p': 0.08333333333333333},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'drove_alone'},
  'p': 0.16666666666666666},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'same_mode'},
  'p': 0.08333333333333333},
 {'labels': {'mode_confirm': 'shared_ride',
   'purpose_confirm': 'home',
   'replaced_mode': 'drove_alone'},
  'p': 0.25},
 {'labels': {'mode_confirm': 'shared_ride',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'drove_alone'},
  'p': 0.08333333333333333}]

Remapping same_mode to the same value as mode_confirm should solve that problem.

I removed the same_mode label from the UI on June 5th
e-mission/e-mission-phone@1806e6c#diff-8f7e0cbf2ba6bd210c65bfcac14614c6fabfd3bd95b99f6d2974c615ddcef159

So none of the actual participants would ever have selected same_mode.

But we should still handle it on the staging server before tuning.
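
A hedged sketch of that remapping over the inference list format shown above: rewrite a same_mode replaced_mode to the tuple's mode_confirm, then merge the now-duplicate tuples by summing their probabilities. In the example above, drove_alone/home/drove_alone would rise from p = 1/3 to 1/3 + 1/12 = 5/12 ≈ 0.417. This is an illustration, not the server's actual implementation:

```python
from collections import defaultdict

def collapse_same_mode(inferences):
    """Map replaced_mode == 'same_mode' to the tuple's mode_confirm and merge
    duplicate label tuples by summing their probabilities."""
    merged = defaultdict(float)
    for entry in inferences:
        labels = dict(entry["labels"])
        if labels.get("replaced_mode") == "same_mode":
            labels["replaced_mode"] = labels["mode_confirm"]
        merged[tuple(sorted(labels.items()))] += entry["p"]
    return [{"labels": dict(key), "p": p} for key, p in merged.items()]
```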

@GabrielKS

The Jupyter Notebook I used for analysis is now pushed (without the results) to https://github.com/GabrielKS/e-mission-eval-private-data/tree/inference_confidence_analysis.

GabrielKS commented Aug 3, 2021

Redoing the inference myself with the existing 100m radius threshold, I get slightly different numbers, perhaps because there were trips added to the dataset after inference had been run:

Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
  1.000: 718  (37.20%)
  0.900: 10   (0.52%)
  0.875: 8    (0.41%)
  0.833: 20   (1.04%)
  0.714: 2    (0.10%)
  0.667: 12   (0.62%)
  0.571: 6    (0.31%)
  0.556: 7    (0.36%)
  0.500: 63   (3.26%)
  0.429: 12   (0.62%)
  0.407: 27   (1.40%)
  0.400: 25   (1.30%)
  0.367: 20   (1.04%)
  0.333: 22   (1.14%)
  0.300: 21   (1.09%)
  0.286: 7    (0.36%)
  0.235: 15   (0.78%)
  0.000: 935  (48.45%)
}

Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
  1.000: 89   (3.40%)
  0.900: 8    (0.31%)
  0.875: 9    (0.34%)
  0.833: 19   (0.73%)
  0.714: 1    (0.04%)
  0.556: 2    (0.08%)
  0.500: 66   (2.52%)
  0.429: 57   (2.18%)
  0.400: 83   (3.17%)
  0.333: 1    (0.04%)
  0.286: 1    (0.04%)
  0.000: 2284 (87.18%)
}

Changing the 100m radius threshold to 500m, I get significantly different results:

Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
  1.000: 751  (39.69%)
  0.917: 12   (0.63%)
  0.900: 10   (0.53%)
  0.875: 8    (0.42%)
  0.833: 24   (1.27%)
  0.800: 10   (0.53%)
  0.786: 14   (0.74%)
  0.778: 9    (0.48%)
  0.714: 7    (0.37%)
  0.700: 10   (0.53%)
  0.692: 13   (0.69%)
  0.667: 45   (2.38%)
  0.600: 15   (0.79%)
  0.583: 23   (1.22%)
  0.571: 6    (0.32%)
  0.556: 27   (1.43%)
  0.545: 11   (0.58%)
  0.531: 23   (1.22%)
  0.500: 179  (9.46%)
  0.455: 11   (0.58%)
  0.452: 42   (2.22%)
  0.429: 21   (1.11%)
  0.419: 42   (2.22%)
  0.407: 27   (1.43%)
  0.400: 30   (1.59%)
  0.393: 159  (8.40%)
  0.375: 8    (0.42%)
  0.367: 30   (1.59%)
  0.360: 25   (1.32%)
  0.333: 78   (4.12%)
  0.312: 32   (1.69%)
  0.300: 40   (2.11%)
  0.294: 17   (0.90%)
  0.286: 7    (0.37%)
  0.250: 58   (3.07%)
  0.235: 15   (0.79%)
  0.222: 6    (0.32%)
  0.200: 18   (0.95%)
  0.000: 29   (1.53%)
}

Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
  1.000: 172  (7.06%)
  0.917: 1    (0.04%)
  0.900: 8    (0.33%)
  0.875: 9    (0.37%)
  0.833: 20   (0.82%)
  0.800: 4    (0.16%)
  0.786: 17   (0.70%)
  0.750: 3    (0.12%)
  0.714: 11   (0.45%)
  0.667: 2    (0.08%)
  0.583: 3    (0.12%)
  0.556: 12   (0.49%)
  0.531: 8    (0.33%)
  0.500: 53   (2.18%)
  0.452: 6    (0.25%)
  0.429: 66   (2.71%)
  0.419: 76   (3.12%)
  0.400: 102  (4.19%)
  0.393: 2    (0.08%)
  0.360: 5    (0.21%)
  0.333: 75   (3.08%)
  0.300: 1    (0.04%)
  0.294: 4    (0.16%)
  0.286: 1    (0.04%)
  0.250: 35   (1.44%)
  0.000: 1740 (71.43%)
}

Graphs for 500m:
Probability distribution of most-likely inferences, full range
Probability distribution of most-likely inferences excluding 0 and 1
Probability distribution of most-likely inferences only 0 and 1

544 more unlabeled trips now have some sort of inference, and the sharp cutoff at 0.4 is no longer nearly as pronounced.

shankari commented Aug 4, 2021

From this, it looks like the threshold should be 0.25, but then we will basically not exclude anything. Do you get different results on a per-user basis, or when only looking at users with lots of trips?

shankari commented Aug 4, 2021

wrt:

But we should handle it on the staging server before tuning.

the obvious fix would be to change the inputs in the database and re-run the pipeline.
An alternate solution would be to fix it in the code, but that would require special casing the handling of replaced mode instead of working with user inputs generically.

shankari commented Aug 4, 2021

There is existing code to remap a same_mode replaced mode (https://github.com/e-mission/e-mission-server/pull/829/files#diff-c7ece2e6b65a06d6fd262e2ca047f676b4050b865a1b2a1b3f91a85b72ca5460R48), but unfortunately it is only called from the second round for now. And of course, it is specific to the replaced mode. We could re-introduce that for now instead of modifying the user inputs.

@GabrielKS

The middle graph, if we only consider users who have labeled at least 10 or 20 trips (15/39 = 38% of users; there are no users who have labeled between 10 and 20 trips):
User labeled at least 10 or 20 trips
If we only consider those who have labeled at least 50 trips (9/39=23% of users):
User labeled at least 50 trips
I don't see that this suggests any obvious way forward.
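
A small sketch of the filtering behind these restricted graphs, assuming the same hypothetical `trips_by_user` mapping as in the earlier sketch:

```python
def trips_for_active_labelers(trips_by_user, min_labeled=50):
    """Keep only trips belonging to users who have labeled at least min_labeled trips."""
    kept = []
    for trips in trips_by_user.values():
        if sum(1 for t in trips if t.get("user_input")) >= min_labeled:
            kept.extend(trips)
    return kept
```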

shankari added a commit to corinne-hcr/e-mission-server that referenced this issue Aug 4, 2021
So that we can collapse the different categories and increase the percentage
e-mission/e-mission-docs#656 (comment)

+ also check to see whether the replacement mode actually exists before
performing the mapping since we are checking this into master and it may not
always exist.

And even on CanBikeCO, we may have users who have never filled in a replaced mode.

shankari commented Aug 4, 2021

@GabrielKS back-of-the-envelope estimate of the difference:
e-mission/e-mission-eval-private-data#28 (comment)

Some actual values are:

    cluster_label  before_unique_combo_len  after_unique_combo_len  before_max_p  after_max_p
25              0                        10                       9      0.423077     0.615385
26              1                         5                       4      0.545455     0.727273
27              2                         6                       4      0.4          0.5
29              4                         6                       5      0.166667     0.333333
32              7                         2                       1      0.666667     1

shankari commented Aug 4, 2021

@GabrielKS If the out-and-back errors are common, both on staging and on the real deployments, I think I might be able to come up with a way to fix at least that pattern automatically. But it will take me ~ 3-4 days with no time for anything else.

Detecting the pattern (as opposed to fixing it) is much easier - I've basically already implemented it.

I think it might be worthwhile to work this into the expectation code somehow, maybe to mark trips as "double check". Let's discuss at today's meeting.

shankari added a commit to corinne-hcr/e-mission-server that referenced this issue Aug 4, 2021
This ensures that we can match trips properly
(for fix, see e-mission/e-mission-docs#656 (comment))

@GabrielKS

The middle graph, with a threshold of 50 labeled trips, after the same_mode mapping was re-enabled:
[graph]

shankari commented Aug 4, 2021

Although this is going to make the threshold meaningless at this point, I think we should go with 25% as the threshold, i.e. show all the trips. This is because although the number of trips affected is low, the number of trips for which we have inferences at all is also low. I think it is more important to give people the sense that we're doing something than to be perfect on accuracy. We can always tune this after the first two weeks if needed, although analysing those results will be future work.

shankari commented Aug 4, 2021

Just for the record, the values in here don't seem to change much
#656 (comment)

but when I did the side-by-side comparison (with boxplots), I got a pretty significant change:
e-mission/e-mission-eval-private-data#28 (comment)

One difference between the two is that @GabrielKS is looking at the matched inferences, while I am looking at the clusters in the model. So maybe this is skewed by the fact that there aren't a lot of matches?

shankari commented Aug 4, 2021

So using the actual inferences lets us see what the impact on this particular set of users and trip history would be. Looking at the clusters directly, we get what could happen if we had better matching.

[graph]

But that also seems to argue for something between 20% and 40%, so I am happy with 25%.
