Analysis of label inference confidence #656
Comments
First, these numbers are likely wrong, because although I built the model with a radius of 500, I forgot to change the radius for the prediction, which is still 100. We should re-analyse after doing that.
To re-run after changing the radius locally, use:
First, I don't think that the results on the labeled data are meaningful, because we are essentially testing the model on the training data. From the aggregate results, focusing only on unlabeled trips, it seems like we want to have a threshold of something like 0.4.
Thinking out loud: we have a lot of unlabeled trips with no inferences, and, based on e-mission/e-mission-eval-private-data#28 (comment), there is significant variability in the labeling % between users, primarily determined by how many labeled trips we already have. So we may want to briefly plot a per-user histogram, or focus only on users with more than 50-100 trips, because if people have too few trips, we can tell them that we won't be able to predict. Or maybe have two different distributions for people with > 20% labeled vs. < 20% labeled, etc.
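A minimal sketch of that per-user breakdown, assuming the confirmed trips are in a pandas DataFrame with one row per trip; the `user_id` and `is_labeled` columns are illustrative, not the actual e-mission schema:

```python
import pandas as pd

# Illustrative input: one row per confirmed trip, with the user id and whether
# the trip is fully labeled. Replace with the real confirmed-trip data.
trips = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "is_labeled": [True, True, False, False, False, True],
})

# Per-user trip count and labeled percentage.
stats = trips.groupby("user_id").agg(
    n_trips=("is_labeled", "size"),
    labeled_pct=("is_labeled", "mean"),
)
stats["labeled_pct"] *= 100

# Restrict to users with enough trips, then split by labeled fraction.
active = stats[stats["n_trips"] >= 50]
high = active[active["labeled_pct"] > 20]
low = active[active["labeled_pct"] <= 20]

# Per-user histogram of labeled percentage (requires matplotlib).
stats["labeled_pct"].hist(bins=20)
```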
@GabrielKS one challenge with configuring with 0.4 is the remapping.
I removed the option in question, so none of the actual participants would have ever selected it. But we should handle it on the staging server before tuning.
The Jupyter Notebook I used for analysis is now pushed (without the results) to https://github.com/GabrielKS/e-mission-eval-private-data/tree/inference_confidence_analysis.
Redoing the inference myself with the existing 100m radius threshold, I get slightly different numbers, perhaps because there were trips added to the dataset after inference had been run:
Changing the 100m radius threshold to 500m, I get significantly different results:
Many (544) more unlabeled trips have some sort of inference now. There is no longer nearly so strong a cutoff at 0.4.
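For context, the radius here is the distance threshold the similarity-based model uses to decide whether two trip endpoints count as the same place. A minimal sketch of that kind of check, assuming trips are represented by their start/end coordinates; the function names and trip format are illustrative, not the actual e-mission code:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    r = 6371000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def trips_similar(trip_a, trip_b, radius_m=500):
    """Two trips are 'similar' if both their start points and their end points
    are within radius_m of each other. Trip format is illustrative:
    (start_lon, start_lat, end_lon, end_lat)."""
    return (haversine_m(trip_a[0], trip_a[1], trip_b[0], trip_b[1]) <= radius_m and
            haversine_m(trip_a[2], trip_a[3], trip_b[2], trip_b[3]) <= radius_m)
```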
From this, it looks like the threshold should be 0.25, but then we will basically not exclude anything. Do you get different results on a per-user basis, or only looking at users with lots of trips?
wrt:
the obvious fix would be to change the inputs in the database and re-run the pipeline. |
There is existing code to map the replaced mode with:
So that we can collapse the different categories and increase the percentage (e-mission/e-mission-docs#656 (comment)), we should also check whether the replacement mode actually exists before performing the mapping, since we are checking this into master and it may not always exist. Even on CanBikeCO, we may have users who have never filled in a replaced mode.
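A minimal sketch of that guard, assuming the user input is a plain dict of labels; `REPLACED_MODE_MAP` and the field names are hypothetical stand-ins for the actual server code and deployment config:

```python
# Hypothetical remapping table; the real entries live in the server/deployment config.
REPLACED_MODE_MAP = {
    "old_label": "new_label",
}

def remap_replaced_mode(user_input: dict) -> dict:
    """Remap the replaced mode only if the user actually provided one.

    Even on CanBikeCO, some users may never have filled in a replaced mode,
    so the field can be missing entirely.
    """
    replaced = user_input.get("replaced_mode")
    if replaced is not None and replaced in REPLACED_MODE_MAP:
        user_input = dict(user_input)  # avoid mutating the stored document
        user_input["replaced_mode"] = REPLACED_MODE_MAP[replaced]
    return user_input
```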
@GabrielKS back-of-the-envelope estimate of the difference. Some actual values are:
@GabrielKS If the out-and-back errors are common, both on staging and on the real deployments, I think I might be able to come up with a way to fix at least that pattern automatically. But it will take me ~3-4 days with no time for anything else. Detecting the pattern (as opposed to fixing it) is much easier - I've basically already implemented it. I think it might be worthwhile to work this into the expectation code somehow, maybe to mark trips as "double check". Let's discuss at today's meeting.
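For reference, detecting a trip that ends roughly where it started (one reading of the out-and-back pattern) is indeed straightforward; a minimal sketch, with an illustrative trip format and threshold, using geopy for the distance:

```python
from geopy.distance import geodesic

def is_out_and_back(start_latlon, end_latlon, threshold_m=100):
    """Return True if the trip ends within threshold_m of where it started.

    start_latlon and end_latlon are (lat, lon) tuples; the threshold is an
    assumption, not a value from the actual detection code.
    """
    return geodesic(start_latlon, end_latlon).meters <= threshold_m

# Example: a trip that returns to (almost) the same point.
print(is_out_and_back((39.7392, -104.9903), (39.7393, -104.9904)))  # True
```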
This ensures that we can match trips properly (for the fix, see e-mission/e-mission-docs#656 (comment)).
Although this is going to make the threshold meaningless at this point, I think we should go with 25% as the threshold, aka show all the trips. This is because although the number of trips affected is low, the number of trips for which we have inferences at all is also low. I think it is more important to give people the sense that we're doing something than it is to be perfect with the accuracy. We can always tune this after the first two weeks if needed, although analysing those results will be future work.
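A minimal sketch of how that threshold would gate what the Label UI shows; the `labels`/`p` field names follow the inferred-label structure discussed above but are assumptions, not the exact schema:

```python
CONFIDENCE_THRESHOLD = 0.25  # the value chosen above; effectively shows all inferred trips

def label_tuple_to_show(inferred_labels):
    """Return the most likely label tuple if it clears the threshold, else None.

    inferred_labels is assumed to be a list of {"labels": {...}, "p": float}
    entries attached to a trip; an empty list means no inference at all.
    """
    if not inferred_labels:
        return None
    best = max(inferred_labels, key=lambda entry: entry["p"])
    return best["labels"] if best["p"] >= CONFIDENCE_THRESHOLD else None
```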
Just for the record, the values here don't seem to change much, but when I did the side-by-side comparison (with boxplots), I got a pretty significant change. One difference between the two is that @GabrielKS is looking at the matched inferences, while I am looking at the clusters in the model. So maybe this is skewed by the fact that there aren't a lot of matches?
To be able to make an informed decision on what confidence threshold we should use in the staging test of the new Label UI, I did some analysis of what label inference confidence looked like in the staging data. I first chose an individual user known to label many of their trips. Looking only at labeled trips, I found that the most likely inferences generated (i.e., the label tuple with the highest probability in the inference data structure) fell into 7 buckets. Here, trips with an empty inference data structure were counted as having a probability of 0.
I then compared these stated probability values to the fraction of trips in each bucket for which the inference actually matched the user labels:
Presumably the close correspondence here is due to the clustering algorithm having been trained on this same data.
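A minimal sketch of that bucket-level comparison, assuming a DataFrame with the probability of the most likely inference per labeled trip (`top_p`, 0 for an empty inference) and whether that inference matched the user labels (`matches`); the column names and sample values are illustrative only:

```python
import pandas as pd

# Illustrative input; replace with the real per-trip values.
trips = pd.DataFrame({
    "top_p":   [0.0, 0.0, 0.6, 0.6, 0.9, 1.0, 1.0],
    "matches": [False, False, True, False, True, True, True],
})

# For each stated probability bucket, how often the inference actually matched.
by_bucket = trips.groupby("top_p").agg(
    n_trips=("matches", "size"),
    match_fraction=("matches", "mean"),
)
print(by_bucket)
```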
I then did some analysis on all confirmed trips across all users in the staging dataset. 2046 of these confirmed trips were fully labeled by users and had the following inference probability distribution:
3155 confirmed trips were fully unlabeled by users and had the following inference probability distribution:
142 confirmed trips were partially labeled by users — i.e., the user filled in some of the labels for the trip but not all of them.
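A minimal sketch of how the labeled/unlabeled distributions above could be tabulated, again assuming illustrative `top_p` and `labeled` columns rather than the actual schema:

```python
import pandas as pd

# Illustrative input; replace with the real confirmed-trip values.
trips = pd.DataFrame({
    "top_p":   [0.0, 0.0, 0.3, 0.7, 1.0, 0.0, 0.5, 1.0],
    "labeled": [False, False, False, False, False, True, True, True],
})

# Percentage of trips in each top-probability bucket, separately for
# labeled and unlabeled trips; each row sums to 100%.
counts = trips.groupby(["labeled", "top_p"]).size().unstack(fill_value=0)
pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(pct)
```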
Here are some graphs visualizing this data:
From the graphs, we see that a significant fraction (85.6%) of the unlabeled trips have no inference at all, more so than for labeled trips (49.5%). There are also more labeled trips with 100% certainty (36.6%) than unlabeled (4.7%). However, aside from these endpoints, the trend is reversed — unlabeled trips tend to cluster towards the middle and upper end of the probability spectrum, whereas labeled trips are more evenly distributed.