
Confidence discounting for small clusters #4

Merged

Conversation

GabrielKS

See e-mission/e-mission-docs#663 for an explanation of the problem and the solution. This should be ready to merge; I'll run some final tests to make sure everything is working smoothly in the next few minutes.

 + Implements the algorithm described in e-mission/e-mission-docs#663 (comment)
 + Uses constant values `A=0.01`, `B=0.75`, `C=0.25`
 + Changes `eacilp.primary_algorithms` to use the new algorithm
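
For reference, the discounting formula these constants plug into, as reconstructed from the comments in the code under review below (there, `max_confidence` = 1-A, `first_confidence` = B, and `confidence_multiplier` = C):

    u(n) = (1 - A) - ((1 - A) - B) * (1 - C)^(n - 1)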

No unit tests yet (working on a tight timeline), but tested as follows:
 1. Run `[eacili.n_to_confidence_coeff(n) for n in [1,2,3,4,5,7,10,15,20,30,1000]]`, check the results for reasonableness, and compare them to plugging the formula into a calculator (see the standalone sketch after this list)
 2. Run the modeling and intake pipeline with `eacilp.primary_algorithms` set to the old algorithm
 3. Run the first few cells of the "Explore label inference confidence" notebook for a user with many inferrable trips to get a list of unique probabilities and counts for that user
 4. Set `eacilp.primary_algorithms` to the new algorithm, rerun the intake pipeline (modeling pipeline hasn't changed)
 5. Rerun the notebook as above, examine how the list of probabilities and counts has changed
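
A standalone sketch of the step-1 check (this re-implements the formula rather than importing the real `eacili` module, and uses the default constants from the code under review below -- 0.99, 0.80, 0.30 -- which differ slightly from the A/B/C values listed above):

    def n_to_confidence_coeff(n, max_confidence=0.99, first_confidence=0.80, confidence_multiplier=0.30):
        # u = (1-A) - ((1-A) - B) * (1-C)^(n-1) in the issue's notation
        return max_confidence - (max_confidence - first_confidence) * (1 - confidence_multiplier)**(n - 1)

    print([round(n_to_confidence_coeff(n), 4) for n in [1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 1000]])
    # Starts at 0.80 for n = 1 and climbs toward the 0.99 asymptote as n grows
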
GabrielKS marked this pull request as draft August 18, 2021 21:25
 + Useful for playing around with other constants in notebooks
Comment on lines +136 to +140
if max_confidence is None: max_confidence = 0.99 # Confidence coefficient for n approaching infinity -- in the GitHub issue, this is 1-A
if first_confidence is None: first_confidence = 0.80 # Confidence coefficient for n = 1 -- in the issue, this is B
if confidence_multiplier is None: confidence_multiplier = 0.30 # How much of the remaining removable confidence to remove between n = k and n = k+1 -- in the issue, this is C
return max_confidence-(max_confidence-first_confidence)*(1-confidence_multiplier)**(n-1) # This is the u = ... formula in the issue

Later, I would like to put these into a config file, but that's definitely not needed now.
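
(Purely as an illustration of that future direction, a minimal sketch of loading the three constants from a JSON config, with the current hard-coded values as fallbacks; the path and key names here are hypothetical, nothing like this exists in the repo yet:)

    import json
    import os

    CONFIG_PATH = "conf/label_inference.json"  # hypothetical filename
    defaults = {"max_confidence": 0.99, "first_confidence": 0.80, "confidence_multiplier": 0.30}

    conf = dict(defaults)
    if os.path.exists(CONFIG_PATH):
        with open(CONFIG_PATH) as f:
            conf.update(json.load(f))  # file values override the hard-coded defaults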

shankari merged commit 93f507c into corinne-hcr:modeling_and_functions Aug 19, 2021
@shankari

Ran into this error:

Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/pipeline.py", line 34, in infer_labels
    lip.run_prediction_pipeline(user_id, time_query)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/pipeline.py", line 64, in run_prediction_pipeline
    results = self.compute_and_save_algorithms(inferred_trip)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/pipeline.py", line 81, in compute_and_save_algorithms
    prediction = algorithm_fn(trip)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/inferrers.py", line 143, in predict_cluster_confidence_discounting
    labels, n = lp.predict_labels_with_n(trip)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/modelling/tour_model_first_only/load_predict.py", line 90, in predict_labels_with_n
    return [], -1

I think this will fix it.

diff --git a/emission/analysis/modelling/tour_model_first_only/load_predict.py b/emission/analysis/modelling/tour_model_first_only/load_predict.py
index f3039f04..a20e51ea 100644
--- a/emission/analysis/modelling/tour_model_first_only/load_predict.py
+++ b/emission/analysis/modelling/tour_model_first_only/load_predict.py
@@ -85,14 +85,13 @@ def predict_labels_with_n(trip):
         # e.g. {'0': [{'1': [{'labels': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'home', 'replaced_mode': 'drove_alone'}}]}]}
         user_labels = loadModelStage('user_labels_first_round_' + str(user))

-        # Get the number of trips in each cluster from the number of locations in each bin
-        # This is a bit hacky; in the future, we might want the model stage to save a metadata file with this and potentially other information
-        cluster_sizes = {k: len(bin_locations[k]) for k in bin_locations}
-
     except IOError as e:
         logging.info(f"No models found for {user}, no prediction")
         return [], -1

+    # Get the number of trips in each cluster from the number of locations in each bin
+    # This is a bit hacky; in the future, we might want the model stage to save a metadata file with this and potentially other information
+    cluster_sizes = {k: len(bin_locations[k]) for k in bin_locations}

Will test and fix as part of my pending changes.

@shankari

Actually, that was because of some uncommitted changes to the build stage that were saving "null" values. If the file is not found, we go directly to the except block and never run the cluster size calculation.
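
(To make that control flow concrete, a runnable sketch, with a stub loader standing in for the real loadModelStage and user simplified to a parameter: whether the cluster-size dict is built inside or after the try, a missing model file raises IOError at the first loadModelStage call, so the dict comprehension never executes:)

    import logging

    def loadModelStage(name):
        # Stub for the real loader: pretend the model file is missing
        raise IOError(f"{name} not found")

    def predict_labels_with_n(trip, user="test_user"):
        try:
            bin_locations = loadModelStage('locations_first_round_' + str(user))
            user_labels = loadModelStage('user_labels_first_round_' + str(user))
        except IOError:
            logging.info(f"No models found for {user}, no prediction")
            return [], -1  # early return: nothing below this line runs
        # Reached only when both loads succeeded, so bin_locations is valid here
        cluster_sizes = {k: len(bin_locations[k]) for k in bin_locations}
        return user_labels, cluster_sizes

    print(predict_labels_with_n(None))  # -> ([], -1)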
