
Schedulers other than depth-first unlikely to set TaskRun created time correctly #35

Open
normangilmore opened this issue Sep 7, 2018 · 7 comments


@normangilmore

The random scheduler will not set TaskRun created time correctly; with the breadth-first scheduler, created time is subject to race conditions when other users are active; and the default depth-first scheduler sets created time correctly only by luck.

Before Pybossa.js presents a task, it pre-loads the next task. At the time of pre-load, the back-end sets a Redis key for the pre-loaded task with a time stamp, even though the user hasn't started the task yet.

You can easily verify this by going to your Pybossa back-end environment:

import redis
r = redis.StrictRedis()
r.keys("pybossa:task_requested:*")
# load task
r.keys("pybossa:task_requested:*")

You will see that keys for both the current task and the next task are set, as expected.
But they have almost identical timestamps, so we must examine when (or whether) the timestamp for the pre-loaded task is updated, to make sure it reflects when the user sees the task rather than when it was pre-loaded.
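To make the pre-load effect concrete, here is a minimal, hypothetical simulation of the stamping behaviour: a JavaScript Map stands in for Redis, and stamp() plays the role of the back-end's guard.stamp(). The names stampStore and stamp, and the exact key layout, are illustrative only, not part of the PYBOSSA API.

```javascript
// In-memory stand-in for Redis; key names mirror "pybossa:task_requested:*".
// stamp() simulates the back-end recording when a task was handed out.
const stampStore = new Map();

function stamp(taskId, userId, now) {
  stampStore.set(
    "pybossa:task_requested:user:" + userId + ":task:" + taskId,
    now
  );
}

// Presenting task 1 also pre-loads task 2, so BOTH keys get stamped
// within milliseconds of each other:
stamp(1, 42, 1000); // current task, stamped on load
stamp(2, 42, 1005); // next task, stamped at pre-load time
```

With timestamps this close together, the pre-loaded task's stamp records when it was fetched, not when the user actually saw it.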

When you save the first task, if you are using a depth first scheduler (default), the call to _fetchNewTask here will get the task that is currently being presented, because, after all, the user has not completed it.

https://github.com/Scifabric/pybossa.js/blob/master/pybossa.js#L147

var xhr = (taskId && (previousTask === undefined)) ? _fetchTask(taskId) : _fetchNewTask(project.id, offset);
xhr.done(function(task) {
    if (previousTask && task.id === previousTask.id) {
        var secondTry = _fetchNewTask(project.id, offset+1)
        .done(function(secondTask){
            _resolveNextTaskLoaded(secondTask, def);
        });
    }
    else {
        _resolveNextTaskLoaded(task, def);
    }
});

Luckily, /project/<project_id>/newtask calls guard.stamp(task, get_user_id_or_ip()) which updates the Redis key that will be used as the current task's timestamp when it is saved.

Then there is an if (previousTask && task.id === previousTask.id) check in the pre-load that says: oops, same task, let's get the one after it.

So with a depth first scheduler, the current task just being presented will be loaded (a second time), and secondTry will always be run to get the actual next task.

This is good because it has the effect of setting the time correctly for the current task right when it is presented. But it relies on the scheduler returning the same task again - a RANDOM scheduler will not luckily refresh the created time of the current task.

BREADTH FIRST may return a different task than the first time if in the meantime, another user has completed tasks, shuffling what is next in line.

In either case, we would expect the TaskRun created time stamp to incorrectly be saved with the time as of the pre-load, thus incorrectly including all the time of the prior task with the task now being saved.
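As a hypothetical worked example of the size of the error: suppose task B is pre-loaded while the user is still working on task A, the user actually starts B five minutes later, and saves it after one more minute. The numbers below are illustrative only.

```javascript
// All times in seconds since some epoch; the values are made up.
const preloadStamp = 0;    // "created" stamp written when task B is pre-loaded
const actualStart  = 300;  // user finishes task A and actually starts task B
const finishTime   = 360;  // task_run for B is saved

// What would be recorded vs. what really happened:
const reportedElapsed = finishTime - preloadStamp; // 360 s, includes task A's time
const trueElapsed     = finishTime - actualStart;  //  60 s, time B was on screen
```

The reported elapsed time is six times the true time on task, because it silently absorbs the whole of task A.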

Possible Solution
Some API call has to be made at the time the task is presented to update the Redis key with the created time stamp. This is the only other routine on the backend that also sets the time stamp, so it could be called just to refresh the timestamp, and discard the result.

https://github.com/Scifabric/pybossa/blob/master/pybossa/view/projects.py#L852

@blueprint.route('/<short_name>/task/<int:task_id>')
def task_presenter(short_name, task_id):
    …
    guard.stamp(task, get_user_id_or_ip())

Or a new lightweight API endpoint could be added that returns nothing and just refreshes the timestamp. It would be a bad idea to make /api/task/<task_id> set the timestamp: that would break existing expectations and would not be backward compatible with older Pybossa.js. For example, it is called in pybossa.saveTask, and you would not want to set the created stamp at the moment you are saving the taskrun.

@teleyinex
Member

Hi,

Thanks a lot for the feedback. Actually, you have been really thoughtful about it.

Regarding the random scheduler: we don't support it in the current PYBOSSA setup (since October 2017). It is included as a demo plugin, so people can build their own PYBOSSA schedulers.

However, you are right about the issue with the random scheduler. For almost 8 months now we have not been using PYBOSSA.JS much, as the API gives us more flexibility to build schedulers on the frontend. Basically, the API now allows you to request tasks for the user from /api/task with a flag that mimics the newtask endpoint: with ?participated=1 you can request all the tasks in a given project that a given user has not yet participated in. You can then order the tasks by date or by attribute, get 100 tasks at a time, use last_id for keyset pagination, etc. While this new flag gives us more flexibility, you are right that we still have to validate a task before submitting it. So how do we do that? Exactly as you said: by calling the endpoint /<short_name>/task/<task_id>.

Thus, if you want to implement a random scheduler in the frontend you can do the following:

  1. Request 100 tasks where the user has not participated using /api/task?participated=1
  2. Save the IDs (or even the full tasks) into localStorage
  3. Pick one of them at random and remove it from localStorage
  4. Allow the user to work on the task: classify it, analyze it, ...
  5. When the user clicks the button to save the task_run, get the date set properly by going to /<short_name>/task/<task_id>
  6. We should get a 200, so we're ready to POST to /api/taskrun
  7. Go back to point 3 until we run out of the 100 tasks in localStorage; when we do, go back to point 1

Point 3 will save you from requesting tasks all the time ;-) However, if you have a lot of users you can end up with tasks that have already been completed by the crowd, so you should add some logic behind the scenes to remove completed tasks from localStorage.
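The steps above could be sketched roughly like this (a plain in-memory object stands in for localStorage, and fetchUnparticipatedTasks is a hypothetical wrapper around GET /api/task?participated=1 — neither name is part of PYBOSSA.JS):

```javascript
// localStorage stand-in holding the cached task list.
const cache = { tasks: [] };

// Steps 1-2: fetch up to `limit` unparticipated tasks and cache them.
function refillCache(fetchUnparticipatedTasks, limit) {
  cache.tasks = fetchUnparticipatedTasks(limit);
}

// Step 3: pick a random task and remove it from the cache.
function nextRandomTask() {
  if (cache.tasks.length === 0) return null; // step 7: time to refill
  const i = Math.floor(Math.random() * cache.tasks.length);
  return cache.tasks.splice(i, 1)[0];
}

// Steps 5-6 would then stamp the chosen task via /<short_name>/task/<task_id>
// before POSTing the task_run to /api/taskrun.
refillCache(function () { return [{ id: 1 }, { id: 2 }, { id: 3 }]; }, 100);
const current = nextRandomTask();
```

The fetch callback here just returns canned tasks; in a real project it would be an AJAX call, and completed tasks would also be pruned from the cache as discussed above.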

Thus, I think we don't need to implement anything new. You can pass the application/json header and you will get the response in JSON format, so you know how the request went. We could, however, add a new flag to this endpoint to return only a confirmation and just refresh the timestamp, but do we need that? I'm not sure. What do you think?

PS: This is the way we're building our projects now, using this method.

@normangilmore
Author

normangilmore commented Sep 7, 2018

Hi Daniel,

Thanks for the details. I didn't realize the random scheduler was removed, but I believe this issue could arise with breadth first as well, as a race condition while other tasks are being completed.

To restate my concern more succinctly, there is a code path that results in incorrect created timestamps. I have added a sample log statement in that code path below:

var xhr = (taskId && (previousTask === undefined)) ? _fetchTask(taskId) : _fetchNewTask(project.id, offset);
xhr.done(function(task) {
    if (previousTask && task.id === previousTask.id) {
        var secondTry = _fetchNewTask(project.id, offset+1)
        .done(function(secondTask){
            _resolveNextTaskLoaded(secondTask, def);
        });
    }
    else {
        if (previousTask) {
            console.log("The time stamp was not updated for task id " + previousTask.id);
        }
        _resolveNextTaskLoaded(task, def);
    }
});

If pybossa.js is deprecated, this line could be added to the code so people using it will be aware that correct timing depends on the scheduler's behavior.

However, the simple fix would be to add a function that calls the project task presenter endpoint at the time the user starts working on the task. Its sole purpose is to update the created stamp in Redis when the user starts the task; the result is not needed. (I emphasize that /api/task does not set the timestamp, as noted above. You must call the /project/<projectname>/task/<taskId> endpoint to set the timestamp.):

function _fetchTaskIsStarting(projectname, taskId) {
    // Hit the task presenter endpoint purely to refresh the Redis
    // timestamp; the response body is discarded.
    return $.ajax({
        url: url + 'project/' + projectname + '/task/' + taskId,
        cache: false,
        dataType: 'json'
    });
}

and call it here:

var xhr = (taskId && (previousTask === undefined)) ? _fetchTask(taskId) : _fetchNewTask(project.id, offset);
xhr.done(function(task) {
    if (previousTask && task.id === previousTask.id) {
        var secondTry = _fetchNewTask(project.id, offset+1)
        .done(function(secondTask){
            _resolveNextTaskLoaded(secondTask, def);
        });
    }
    else {
        if (previousTask) {
            var discard_result = _fetchTaskIsStarting(project.name, previousTask.id);
        }
        _resolveNextTaskLoaded(task, def);
    }
});

This will almost never run if depth first is the scheduler, but if it does run, it will set the time stamp correctly.

The code flow in Pybossa.js is very complex, but semantically, the goal is this:

    var _presentTask = function(task, deferred) {
        var discard_result = _fetchTaskIsStarting(project.name, task.id);
        deferred.resolve(task);
    };
Regarding step 5 from the list above ("When the user clicks the button to save the task_run, get the date properly by going to /<short_name>/task/<task_id>"):

If the created date in the Redis key is set when the user clicks the "Save" button, then the created time will be wrong. The created time must be set when presentTask is called and the user begins working on the task.

@normangilmore
Author

normangilmore commented Sep 7, 2018

Overall, my suggestion is that when people cache tasks on the front-end, there should be a clearly documented recommendation about the best way to tell the back-end to set the created stamp. The two current choices are the non-deterministic newtask endpoint or the heavyweight /project/<projectname>/task/<taskId> endpoint.

@normangilmore
Author

To clarify for others reading this who are not familiar with the Pybossa schema, task_run.finish_time is set when the taskrun is saved.

elapsed_time_working_on_task = task_run.finish_time - task_run.created should always measure the time the task is on screen presented to the volunteer.

@teleyinex
Member

Totally agree. Can you send a PR with your code? We will merge it. Loving your solution for PYBOSSA.JS.

@teleyinex
Member

Regarding task_run.finish_time: we decided to always take the measurements on the server, as it is a bit more "precise". When we started developing PYBOSSA, timers in the browser were not very good; now we can use https://developer.mozilla.org/en-US/docs/Web/API/Performance/now#Syntax. Thus, it could be something you control yourself in task_run.info to measure it. What do you think?
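A minimal sketch of that idea, assuming the frontend is free to put arbitrary fields into task_run.info (the field name elapsed_ms and both function names are made up here):

```javascript
// Record a monotonic start time when the task is presented...
function startTimer() {
  return performance.now(); // high-resolution ms, unaffected by clock changes
}

// ...and fold the measured duration into task_run.info when saving.
function buildTaskRunInfo(answer, startedAt) {
  return {
    answer: answer,
    elapsed_ms: performance.now() - startedAt,
  };
}
```

Because performance.now() is monotonic, the measurement survives system clock adjustments, though it still trusts the client, as discussed below.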

@normangilmore
Author

Hmm. So you are suggesting that the server API allow posting TaskRun.created and TaskRun.finish_time as measured by the browser?

Basically by removing those two fields from reserved_keys = set(['id', 'created', 'finish_time']) in pybossa/api/task_run.py?

My first instinct is that it makes it easier to fake the timing, but I guess it's not that hard to fake out the server side timers anyway.

It would have the advantage that the Redis database would only be used for caching tasks and queues, and it would be less of a problem to reset the Redis db. (Right now I lose the Redis db when I deploy to new machines in the cluster, so I lose the timing keys discussed above, which caused me a lot of confusion before I figured it out. With the cluster solution I am using, it would be some work for me to put the Redis DB on persistent storage.)

My initial impression is that it would be good to move responsibility for timing to the front end, especially as it would be simpler for front-end task caches to be responsible for tracking the start time. Also, we wouldn't have to worry about keys expiring: people could leave a task open in a browser all week if they wanted, and come back and keep going without getting a timer-key-expired error.

Seems like a good idea.
