
feat: Custom Balancer #590

Draft: wants to merge 11 commits into base: next

Conversation


@metcoder95 metcoder95 commented Jun 12, 2024

@metcoder95 metcoder95 marked this pull request as ready for review June 19, 2024 10:39
@metcoder95 metcoder95 requested a review from ronag June 19, 2024 10:39
@metcoder95
Member Author

Still need to figure out a few things, but it should be in good enough shape to bring up the idea of what the balancer can look like. Let me know what you think, and I'll adjust over the week 👍

Comment on lines +74 to +90
* [Symbol.iterator] () {
  yield * this.pendingItems;
  yield * this.readyItems;
}

get size () {
  return this.pendingItems.size + this.readyItems.size;
}

maybeAvailable (item : T) {
  /* istanbul ignore else */
  if (item.currentUsage() < this.maximumUsage) {
    for (const listener of this.onAvailableListeners) {
      listener(item);
    }
  }
}
Collaborator

What is the purpose of:

  • iterator
  • maybeAvailable

Member Author

  • maybeAvailable -- it is meant to hint that a worker is available, either because it has been bootstrapped recently or because it has finished executing a task
  • iterator -- to let the pool iterate over the workers, e.g. to process pending messages (on teardown); a small usage sketch follows
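
For illustration, a rough pseudo-code sketch of how the pool could consume both hooks. workerList, onAvailable, taskQueue and dispatchTo are assumptions used only to show the idea, not code from this PR:

// Pseudo-code sketch only: onAvailable, taskQueue and dispatchTo are
// hypothetical pool internals used to illustrate the two hooks above.
workerList.onAvailable((worker) => {
  // maybeAvailable fired: the worker is under its usage cap, hand it queued work.
  const next = taskQueue.shift();
  if (next !== undefined) dispatchTo(worker, next);
});

// On teardown, the iterator walks pending and ready workers alike.
for (const worker of workerList) {
  worker.terminate();
}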

};
export type PiscinaLoadBalancer = (
  task: PiscinaTask,
  workers: PiscinaWorker[]
@jerome-benoit Jul 16, 2024

Are the PiscinaWorker[] exposing their usage statistics? Without them, the load balancer code is not able to take any decision beyond simple algorithms like round robin. And since the usage statistics are currently spread across different thread pool private properties, adding them to the API without refactoring will end up with a quite ugly LB API.

Member Author

I was assessing whether or not to provide that detailed information; currently my goal was to provide the overall usage (max capacity, tasks running, etc.) to help the LB make decisions.

IMO it provides enough for simple algorithms like round robin, but more complex ones based on usage over time will require further data.
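
For reference, a minimal sketch of what both kinds of algorithm could look like against the PiscinaLoadBalancer signature from the diff. It assumes the balancer returns the chosen PiscinaWorker (or null to decline) and that currentUsage reflects the number of in-flight tasks; both are assumptions on my side:

// Round robin: no usage data needed.
let rrIndex = 0;
const roundRobin: PiscinaLoadBalancer = (_task, workers) => {
  if (workers.length === 0) return null;
  rrIndex = (rrIndex + 1) % workers.length;
  return workers[rrIndex];
};

// Least busy: relies on the overall usage exposed per worker.
const leastBusy: PiscinaLoadBalancer = (_task, workers) => {
  let candidate: PiscinaWorker | null = null;
  for (const w of workers) {
    if (candidate === null || w.currentUsage < candidate.currentUsage) {
      candidate = w;
    }
  }
  return candidate;
};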

@jerome-benoit

@ronag is contacting worker_threads pool maintainers one after the other to ask for load balancing features :)

Since I've already spent some time on them in poolifier: without per-worker usage stats, designing a useful LB will be impossible.

@jerome-benoit

While thinking of it: non-ready workers must be removed from the array supplied to the LB so that each LB implementation does not have to test for it. Is this.workers safe regarding worker readiness?

Plenty of work for the months to come :)

Member Author

While thinking of it: non-ready workers must be removed from the array supplied to the LB so that each LB implementation does not have to test for it. Is this.workers safe regarding worker readiness?

Partially: the Pool categorizes the workers as ready and non-ready, but when iterating it goes through all of them instead of just the ready ones.

Room for improvement, for sure.

Plenty of work for the months to come :)

Definitely, but plenty of good ideas and optimizations are being made here; thanks for the insightful discussion!

currentUsage: number;
isRunningAbortableTask: boolean;
[kWorkerData]: WorkerInfo;
// TODO: maybe add histogram data here?

@jerome-benoit

Not a maybe, it's a requirement to implement load balancing algorithms.


return false;
}
case 1: { // add
@jerome-benoit Jul 16, 2024

It's not the LB code that must take such a decision. Its only responsibility is to pick a worker, given custom criteria, among the ones available, if any. Worker creation is orthogonal to load balancing (and is expected to be done before invoking the LB).
Furthermore, it would make writing custom LBs more complex and error-prone than it should be.
Untangling the worker creation code from the LB code is a prerequisite.

Member Author

Hmm, this is an interesting point; maybe I saw it another way.
My initial idea was to let the balancer hint to the pool whether to instantiate a new worker or not; but overall, if the balancer decides not to pick one, the pool can make the further decision of spawning and enqueuing.

That can make it cleaner.

@jerome-benoit Jul 17, 2024

Since you're targeting a public API, I really think it should be as simple as possible, like the custom queue one.

Regarding worker creation, you can add optional tunables with sensible defaults to the LB API to give hints about the pool behavior, such as the direct use of any newly created worker instead of running the LB algo with it added.

Something like:

{ 
  useCreatedWorker: boolean,
  ...
}

Member Author

That would definitely be something interesting to explore, but maybe we can start simple and state that if the balancer returns null, Piscina will attempt to enqueue the task or start a new worker.

Later we can explore how a balancer and a scheduler can collaborate; but for simplicity, maybe keep it as is for now.
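
For illustration, under that convention a balancer could simply decline (a sketch only; it assumes currentUsage counts in-flight tasks, which is my reading of the diff):

// Sketch: hand the task to an idle worker, otherwise decline and let the
// pool enqueue the task or spawn a new worker.
const idleFirst: PiscinaLoadBalancer = (_task, workers) => {
  for (const w of workers) {
    if (w.currentUsage === 0) return w;
  }
  return null;
};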

@jerome-benoit Jul 18, 2024

I do not think it's a good idea to expect the worker creation decision logic to be part of the LB code. It will end up with the same code in all LBs:

Is there a worker available? -> Yes -> Run the LB algo | -> No -> (Is pool full -> Yes -> End | -> No -> Ask for worker creation)
The LB code will need to be called twice.

That logic can be factored out of the LB code, which makes LB implementations easier.
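
In pseudo-code, the factored-out pool-side flow described above could look roughly like this; every helper name here (postTask, poolIsFull, createWorker, enqueue) is an assumption used only to illustrate the split, not existing Piscina code:

// Pseudo-code sketch of dispatch with worker creation kept out of the LB.
function dispatch (task: PiscinaTask, readyWorkers: PiscinaWorker[], lb: PiscinaLoadBalancer): void {
  if (readyWorkers.length > 0) {
    const worker = lb(task, readyWorkers);
    if (worker !== null) {
      worker.postTask(task);   // hypothetical pool-internal dispatch
      return;
    }
  }
  if (!poolIsFull()) {
    createWorker();            // the new worker picks up queued work once ready
  }
  enqueue(task);               // wait for capacity either way
}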

@jerome-benoit

Later we can explore how a balancer and a scheduler can collaborate; but for simplicity, maybe keep it as is for now.

Another point: please version the public API from the beginning. The extra work is worth it; it will evolve since it's not like a queue API, it's more like a plugin API.

Member Author

The LB code will need to be called twice.

Hmm, this is interesting. It will require splitting the handling of a new worker from the handling of a worker finishing a task, which are currently coupled. That will need to be handled differently.

That way, upon creation we deliver the first task in the queue; but I imagined the LB, although called twice, might have some benefits if the algorithm is based on some affinity.

The extra work is worth it; it will evolve since it's not like a queue API, it's more like a plugin API.

Yeah, that's the intention. Thanks for the callout.

@jerome-benoit Jul 19, 2024

The LB code will need to be called twice.

Hmm, this is interesting. It will require splitting the handling of a new worker from the handling of a worker finishing a task, which are currently coupled. That will need to be handled differently.

I guess untangling the worker creation logic from the load balancing logic can be part of another PR. I would do it prior to the LB API integration. If I remember correctly, in piscina, at task end (message channel response):

  • a lock is taken on the shared task queue
  • the LB is run to find an available worker with a minimum of executing tasks; if none is available, a worker is created if the pool is not full
  • the lock is released

I'm not sure about the locking; it's probably taken only if an eligible worker is found. Whatever the case, the logic of testing for worker availability and readiness, and creating one if needed, can be factored out and run once in the current code as an optimization. And it will help for this PR.

Member Author

Yeah, it is not really locking (mostly a side effect). Piscina infers that if you already have a task queue that hasn't been emptied yet, new tasks should be enqueued, so this ensures tasks are resolved in the same order they entered (unless a custom task queue has been set).

Nevertheless, the thread is stopped until a further task is picked.

That sounds good; I think that's going to make the implementation easier and help move things forward.

}
}

return command;
@jerome-benoit Jul 16, 2024

The array index is usually enough. There's a race condition left if the worker crashes: the LB API should include a method to handle worker removal.
The API also needs to include a method to update the LB internals, called at the end of each task execution. LB algorithms might maintain internal statistics to make decisions. The method handling worker removal must then also remove the internal stats relative to that worker.
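
One possible shape for such an API, purely as a sketch (the hook names are illustrative, not proposed in this PR):

// Illustrative: a stateful balancer with explicit lifecycle hooks.
interface CustomLoadBalancer {
  pick (task: PiscinaTask, workers: PiscinaWorker[]): PiscinaWorker | null;
  // Called at the end of each task so the LB can update its internal stats.
  onTaskEnd (workerId: number, runtimeMs: number): void;
  // Called on worker exit/crash; internal stats for that worker are dropped here.
  onWorkerRemoved (workerId: number): void;
}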

Member Author

So, reduce the communication semantics to just the id of the worker, so the pool and the LB can react to it.

This could be interesting: an onClose or onDestroy hook could be called from the pool whenever a worker goes down so the LB can update its internal statistics.

Do you have something specific in mind?

@jerome-benoit Jul 17, 2024

An interesting stat is the average task runtime per worker. A custom LB might implement an algo that picks the worker with the minimum.
But it will need:

  • a per-worker histogram for runtime stats
  • sensible default runtime stats at worker creation to avoid biased decisions in the LB algo

The median runtime is also interesting, and the LB code will need to compute it since the histogram does not.
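
A sketch of what that computation could look like on the LB side, from raw per-worker runtime samples kept by the balancer itself (an assumption, not part of this PR):

// Sketch: median derived on demand from runtime samples kept by the LB.
function median (samples: number[]): number {
  if (samples.length === 0) return 0; // sensible default before the first task
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}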

Using the worker_threads worker id is better than using the array id, but at the cost of doing a lookup to find the worker instance. Using the array index also has drawbacks: the code must be very careful with worker removal and async task handling, since the index can stay valid while no longer pointing to the correct worker.

Piscina's runtime calculation is also lacking some accuracy: it's not exactly the task execution time on the worker; it also includes worker implementation overhead plus message channel latency.

Member Author

That last point was one of the reasons I've had some concerns about exposing histograms. The measured time includes the message channel overhead plus some extra here and there; it will require some tuning to make it fully valuable for something as precision-sensitive as an LB, though for overall metric gathering it can be enough.

Using the worker_threads worker id is better than using the array id, but at the cost of doing a lookup to find the worker instance.

Overall a Map might not impose too much overhead, but we can explore; we can start with that and maybe tune for performance afterwards.

@jerome-benoit

The measured time includes the message channel overhead plus some extra here and there; it will require some tuning to make it fully valuable for something as precision-sensitive as an LB, though for overall metric gathering it can be enough.

Doing accurate runtime measurement will require changing the current worker API: the measurement needs to be done on the worker side. I've not been able to come up with a better solution than extending an existing worker class.

Using the worker_threads worker id is better than using the array id, but at the cost of doing a lookup to find the worker instance.

Overall a Map might not impose too much overhead, but we can explore; we can start with that and maybe tune for performance afterwards.

Been there. And I finally reverted to using an array:

  • .reduce() was faster than iterating a Map to consolidate information from the workers, such as finding a min (see the sketch below)
  • the number of records in the array will stay low, so a Map will probably remain overkill
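
As a concrete example of the first point (a sketch only; it assumes each entry exposes a currentUsage-style number and that the array is non-empty):

// Sketch: consolidate worker info with reduce, e.g. pick the least-used entry.
const leastUsed = workers.reduce(
  (min, w) => (w.currentUsage < min.currentUsage ? w : min),
  workers[0]
);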

Member Author

Doing accurate runtime measurement will require changing the current worker API: the measurement needs to be done on the worker side. I've not been able to come up with a better solution than extending an existing worker class.

That should be fine; the result comes packed with some extra data from the Worker, so we can extend that schema to add that measurement.
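
A sketch of the worker-side measurement that could feed such an extra field (the wrapper function and the taskRuntime field name are assumptions, not part of this PR):

import { parentPort } from 'node:worker_threads';
import { performance } from 'node:perf_hooks';

// Sketch: measure the runtime on the worker side and ship it back with the
// result, so pool-side stats can exclude message channel latency.
async function runAndReport (task: unknown, handler: (t: unknown) => Promise<unknown>) {
  const start = performance.now();
  const result = await handler(task);
  parentPort?.postMessage({ result, taskRuntime: performance.now() - start });
}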

Been there. And I finally reverted to using an array:

Good catch.
Overall, sticking to an array and leaving the LB to decide should be fine; any optimization can come afterwards.

@metcoder95
Member Author

I already have the list of things to add to this PR.
To avoid making it too big, and to keep it from being closed, I'll start creating PRs against this branch with smaller enhancements here and there.

I'll expedite a new major version of Piscina next week to remove support for older Node.js versions and deprecate some APIs.

Will work on this over the upcoming months.

@metcoder95
Member Author

If someone is willing to open a PR to tackle this work, I'm happy to have a look 👍

@metcoder95 metcoder95 marked this pull request as draft July 21, 2024 09:39
@metcoder95 metcoder95 self-assigned this Jul 21, 2024