
Confidence discounting for small clusters #4

Merged

Conversation

GabrielKS

See e-mission/e-mission-docs#663 for an explanation of the problem and the solution. This should be ready to merge; I'll run some final tests to make sure everything is working smoothly in the next few minutes.

 + Implements the algorithm described in e-mission/e-mission-docs#663 (comment)
 + Uses constant values `A=0.01`, `B=0.75`, `C=0.25`
 + Changes `eacilp.primary_algorithms` to use the new algorithm
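
For reference, the discounting formula these constants plug into, as reconstructed from the comments in the code under review below (there, `max_confidence` = 1-A, `first_confidence` = B, and `confidence_multiplier` = C):

    u(n) = (1 - A) - ((1 - A) - B) * (1 - C)^(n - 1)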

No unit tests yet (working on a tight timeline), but tested as follows:
 1. Run `[eacili.n_to_confidence_coeff(n) for n in [1,2,3,4,5,7,10,15,20,30,1000]]`, check the results for reasonableness, and compare them to plugging the formula into a calculator (see the standalone sketch after this list)
 2. Run the modeling and intake pipeline with `eacilp.primary_algorithms` set to the old algorithm
 3. Run the first few cells of the "Explore label inference confidence" notebook for a user with many inferrable trips to get a list of unique probabilities and counts for that user
 4. Set `eacilp.primary_algorithms` to the new algorithm, rerun the intake pipeline (modeling pipeline hasn't changed)
 5. Rerun the notebook as above, examine how the list of probabilities and counts has changed
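
A standalone sketch of the step-1 check (this re-implements the formula rather than importing the real `eacili` module, and uses the default constants from the code under review below -- 0.99, 0.80, 0.30 -- which differ slightly from the A/B/C values listed above):

    def n_to_confidence_coeff(n, max_confidence=0.99, first_confidence=0.80, confidence_multiplier=0.30):
        # u = (1-A) - ((1-A) - B) * (1-C)^(n-1) in the issue's notation
        return max_confidence - (max_confidence - first_confidence) * (1 - confidence_multiplier)**(n - 1)

    print([round(n_to_confidence_coeff(n), 4) for n in [1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 1000]])
    # Starts at 0.80 for n = 1 and climbs toward the 0.99 asymptote as n grows
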
GabrielKS marked this pull request as draft August 18, 2021 21:25
 + Useful for playing around with other constants in notebooks
Comment on lines +136 to +140
if max_confidence is None: max_confidence = 0.99 # Confidence coefficient for n approaching infinity -- in the GitHub issue, this is 1-A
if first_confidence is None: first_confidence = 0.80 # Confidence coefficient for n = 1 -- in the issue, this is B
if confidence_multiplier is None: confidence_multiplier = 0.30 # How much of the remaining removable confidence to remove between n = k and n = k+1 -- in the issue, this is C
return max_confidence-(max_confidence-first_confidence)*(1-confidence_multiplier)**(n-1) # This is the u = ... formula in the issue

Later, I would like to put these into a config file, but that's definitely not needed now.
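
(Purely as an illustration of that future direction, a minimal sketch of loading the three constants from a JSON config, with the current hard-coded values as fallbacks; the path and key names here are hypothetical, nothing like this exists in the repo yet:)

    import json
    import os

    CONFIG_PATH = "conf/label_inference.json"  # hypothetical filename
    defaults = {"max_confidence": 0.99, "first_confidence": 0.80, "confidence_multiplier": 0.30}

    conf = dict(defaults)
    if os.path.exists(CONFIG_PATH):
        with open(CONFIG_PATH) as f:
            conf.update(json.load(f))  # file values override the hard-coded defaults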

shankari merged commit 93f507c into corinne-hcr:modeling_and_functions Aug 19, 2021
@shankari

Ran into this error:

Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/pipeline.py", line 34, in infer_labels
    lip.run_prediction_pipeline(user_id, time_query)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/pipeline.py", line 64, in run_prediction_pipeline
    results = self.compute_and_save_algorithms(inferred_trip)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/pipeline.py", line 81, in compute_and_save_algorithms
    prediction = algorithm_fn(trip)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/classification/inference/labels/inferrers.py", line 143, in predict_cluster_confidence_discounting
    labels, n = lp.predict_labels_with_n(trip)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/modelling/tour_model_first_only/load_predict.py", line 90, in predict_labels_with_n
    return [], -1

I think this will fix it.

diff --git a/emission/analysis/modelling/tour_model_first_only/load_predict.py b/emission/analysis/modelling/tour_model_first_only/load_predict.py
index f3039f04..a20e51ea 100644
--- a/emission/analysis/modelling/tour_model_first_only/load_predict.py
+++ b/emission/analysis/modelling/tour_model_first_only/load_predict.py
@@ -85,14 +85,13 @@ def predict_labels_with_n(trip):
         # e.g. {'0': [{'1': [{'labels': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'home', 'replaced_mode': 'drove_alone'}}]}]}
         user_labels = loadModelStage('user_labels_first_round_' + str(user))

-        # Get the number of trips in each cluster from the number of locations in each bin
-        # This is a bit hacky; in the future, we might want the model stage to save a metadata file with this and potentially other information
-        cluster_sizes = {k: len(bin_locations[k]) for k in bin_locations}
-
     except IOError as e:
         logging.info(f"No models found for {user}, no prediction")
         return [], -1

+    # Get the number of trips in each cluster from the number of locations in each bin
+    # This is a bit hacky; in the future, we might want the model stage to save a metadata file with this and potentially other information
+    cluster_sizes = {k: len(bin_locations[k]) for k in bin_locations}

Will test and fix as part of my pending changes.

@shankari

Actually, that was because of some uncommitted changes to the build stage that were saving "null" values. If the file is not found, we go directly to the except block and never run the cluster size calculation.
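
(To make that control flow concrete, a runnable sketch, with a stub loader standing in for the real loadModelStage and user simplified to a parameter: whether the cluster-size dict is built inside or after the try, a missing model file raises IOError at the first loadModelStage call, so the dict comprehension never executes:)

    import logging

    def loadModelStage(name):
        # Stub for the real loader: pretend the model file is missing
        raise IOError(f"{name} not found")

    def predict_labels_with_n(trip, user="test_user"):
        try:
            bin_locations = loadModelStage('locations_first_round_' + str(user))
            user_labels = loadModelStage('user_labels_first_round_' + str(user))
        except IOError:
            logging.info(f"No models found for {user}, no prediction")
            return [], -1  # early return: nothing below this line runs
        # Reached only when both loads succeeded, so bin_locations is valid here
        cluster_sizes = {k: len(bin_locations[k]) for k in bin_locations}
        return user_labels, cluster_sizes

    print(predict_labels_with_n(None))  # -> ([], -1)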
