
Pipeline stuck for user because TRIP_SEGMENTATION did not mark 'completed' nor 'failed' #1075

Open
JGreenlee opened this issue May 22, 2024 · 4 comments


@JGreenlee

The user has a 2-week backlog of draft trips, going back to May 7. I searched CloudWatch for error messages for this user with the filter expression %ERROR.*uuidOfTheUser%

I found this error, which occurs repeatedly (presumably on every pipeline run):

ERROR:139822526199616:Found error curr_state.curr_run_ts = 1715124566.4760673 while processing pipeline for user █████, skipping

The error occurs shortly after INFO:139822526199616:**********UUID █████: segmenting into trips**********.
The first time the error occurred was 2024-05-08T01:49:41.377Z.

My understanding of this is that there's a curr_run_ts left over from the previous time the TRIP_SEGMENTATION stage ran, which means it exited improperly without being marked as 'completed' or 'failed'.
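
In pseudocode, my mental model of the guard that produces this error is something like the following (illustrative names only, not the actual e-mission-server code): before a stage runs, the pipeline checks whether a previous run left curr_run_ts set, and skips the user if so.

import time

def start_stage_if_clean(curr_state):
    if curr_state.get("curr_run_ts") is not None:
        # A previous run exited without being marked 'completed' or
        # 'failed', so curr_run_ts was never cleared; skip this user
        # rather than risk processing the same data twice.
        raise RuntimeError("Found error curr_state.curr_run_ts = %s" %
                           curr_state["curr_run_ts"])
    curr_state["curr_run_ts"] = time.time()  # mark the stage as in progress

def finish_stage(curr_state, status):
    # On a clean exit ('completed' or 'failed') the marker is cleared;
    # that is exactly the step that never happened for this user.
    curr_state["status"] = status
    curr_state["curr_run_ts"] = None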

I checked logs of the previous run of the pipeline, but I couldn't find anything out of the ordinary.

@shankari
Contributor

This can also sometimes happen if the pipeline takes a long time to run and a new instance is launched before the first one completes. At least on AWS, new scheduled tasks should be launched only when the previous one completes, but apparently this doesn't always happen.

I typically use bin/monitor/find_invalid_pipeline_state.py and bin/monitor/reset_invalid_pipeline_states.py. If you let me know which program the user is a part of, I can run those scripts and fix the issue.

@shankari
Contributor

I reset the pipeline for this user, and it looked like it failed again during TRIP_SEGMENTATION.

@shankari
Contributor

shankari commented May 22, 2024

We checked, and the pipeline for this user is still running, stuck in trip segmentation.
It is currently processing data from the 18th.

2024-05-22 15:33:51,351:DEBUG:140473223255872:sample activities are [{'fmt_time': '2024-05-18T14:15:09.670000-04:00'}]

In parallel, even if another run is launched, perhaps because AWS thought that this one was "stuck", that run will fail in the trip segmentation stage; the curr_run_ts will be cleared when the current run ends, so it should be OK.

Some potential fixes:

  • improve the scalability of trip_segmentation by reducing the number of database calls. There is a PR out for this that is close to being done: Improving Trip Segmentation by reducing DB calls (e-mission-server#958)
  • there is also a script that detects "badly stuck" pipelines and resets them (bin/monitor/reset_invalid_pipeline_states.py). I currently run it manually from time to time, but we could run it automatically as a scheduled task, maybe once a day. I have thought about that, but have been concerned about getting into an infinite loop with long-running tasks that take more than a day: at the end of the day, we would think the task is stuck and reset it, so it would never get a chance to complete. We may be able to fix this by improving the script and adding additional fail-safes.
    • This basically checks whether a curr_run_ts was set more than X time ago; if it was, it assumes the pipeline is stuck and rolls back to before that point, deleting any entries written after it, since they are expected to be partially done (see the sketch after this list).
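
For concreteness, a minimal sketch of that check-and-reset logic, assuming a pymongo connection and illustrative collection/field names (pipeline_states, analysis_timeseries, metadata.write_ts, a 12-hour threshold); the real bin/monitor/reset_invalid_pipeline_states.py may differ:

import time
import pymongo

# "X time ago": must exceed the longest legitimate run, or we get the
# infinite-loop problem described above. 12 hours is an assumption.
STUCK_THRESHOLD_SECS = 12 * 60 * 60

client = pymongo.MongoClient("localhost", 27017)
db = client["Stage_database"]

def find_stuck_states(now=None):
    # A state is considered stuck if curr_run_ts was set more than the
    # threshold ago and was never cleared by a clean exit.
    now = now or time.time()
    return list(db.pipeline_states.find(
        {"curr_run_ts": {"$ne": None, "$lt": now - STUCK_THRESHOLD_SECS}}))

def reset_stuck_state(state):
    # Entries written after curr_run_ts are expected to be partial, so
    # delete them, then clear the marker so the next run starts cleanly.
    db.analysis_timeseries.delete_many({
        "user_id": state["user_id"],
        "metadata.write_ts": {"$gte": state["curr_run_ts"]}})
    db.pipeline_states.update_one(
        {"_id": state["_id"]}, {"$set": {"curr_run_ts": None}})

for state in find_stuck_states():
    reset_stuck_state(state)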

@shankari
Contributor

Looking more carefully, the trip segmentation just stopped at

2024-05-07T23:35:25.757Z	2024-05-07 23:35:25,757:DEBUG:139645930166080:finished querying values for ['background/motion_activity'], count = 1
2024-05-07T23:35:25.758Z	2024-05-07 23:35:25,758:DEBUG:139645930166080:finished querying values for [], count = 0
2024-05-07T23:35:25.759Z	2024-05-07 23:35:25,758:DEBUG:139645930166080:orig_ts_db_matches = 1, analysis_ts_db_matches = 0

There are no additional logs (from any user) in this log file.

And the next log file starts at May 8th 00:25.

2024-05-08T00:25:40.737Z	Found configuration, overriding...
2024-05-08T00:25:40.740Z	
2024-05-08T00:25:40.775Z	{
2024-05-08T00:25:40.775Z	"timeseries": {
2024-05-08T00:25:40.775Z	"url": 
2024-05-08T00:25:40.775Z	"result_limit": 250000
2024-05-08T00:25:40.775Z	}
2024-05-08T00:25:40.775Z	}
2024-05-08T00:25:40.775Z	0.0.0.0
2024-05-08T00:25:40.779Z	{

My best guess is that the task was killed because it was taking up too much memory and then just re-launched at its usual time, which appears to be around :25 after the hour.
[Screenshot 2024-05-22 at 9:07:08 AM]

We can mitigate some of this by reading only a limited number of entries at a time. It may be that the pipeline for one of the other users running in parallel read a huge amount of data.

So to mitigate:

  • make trip segmentation faster
  • read data for processing in batches (see the sketch after this list)
  • auto-recover from bad situations (self-healing systems are the best)
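
As a sketch of the batching idea, paging by timestamp so each query is bounded regardless of backlog size (pymongo, with illustrative collection/field names like timeseries and data.ts; not the actual server query code):

import pymongo

BATCH_SIZE = 10000  # tune to available memory

client = pymongo.MongoClient("localhost", 27017)
db = client["Stage_database"]

def iter_entries_in_batches(user_id, start_ts, end_ts, batch_size=BATCH_SIZE):
    # Yield successive bounded batches; peak memory stays around
    # batch_size entries even if the user has weeks of unprocessed data.
    cursor_ts = start_ts
    while True:
        batch = list(
            db.timeseries.find(
                {"user_id": user_id,
                 "data.ts": {"$gt": cursor_ts, "$lte": end_ts}})
            .sort("data.ts", pymongo.ASCENDING)
            .limit(batch_size))
        if not batch:
            return
        yield batch
        cursor_ts = batch[-1]["data"]["ts"]  # resume after the last entry

# usage: for batch in iter_entries_in_batches(uuid, last_processed_ts, now_ts):
#            process_batch(batch)  # hypothetical per-batch trip segmentation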
