Design: micro-wait primitives for efficient locks #87

lars-t-hansen · 2016-03-24T19:47:56Z

With the demise of the synchronic proposal (Issue #12) we're back to futexes as the main blocking mechanism. From that comes some further issues we should discuss:

Efficient locks will likely require some amount of benign spinning and/or micro-waiting before going into the futex.
The concrete mechanisms for spinning and micro-waiting tend to be platform-dependent; for example, some architectures prefer spin locks to use a relaxation instruction in the loop body; the mechanisms for waiting briefly or simply yielding the scheduler are OS-dependent. If exposed to JS it must be in some abstracted form.

(Further examples of platform variations are here.)

For v1 we can perhaps get away with not providing more than futexes and the ability to spin in a regular loop before going into futexWait, ie, what we already have, but if we know we're going to need more, this would be a good time to have the discussion.

(A related issue is #64, an idea that it is good to allow micro-waiting on the main thread even if we don't allow blocking.)

The first order of business should be to determine whether such a mechanism is required and what its (minimal + nice-to-have) aspects are.

ogiroux · 2016-03-24T20:43:32Z

One option is to take the stateless part of the library, just the spin wait part. Make it clear it's not a viable long-term way to wait for something -- maybe call it try_wait( ... ) or similar.

lars-t-hansen · 2016-03-24T20:51:23Z

Good, that corresponds with my intuition (but I didn't trust it).

pizlonator · 2016-03-24T21:30:41Z

Here's an idea. I just thought of it, so I don't yet know what its performance would be.

The spec doesn't export any spinning helpers, but it does provide non-normative encouragement for the user to never spin and instead to just call futexWait on the slow path, sort of like glibc's lowlevellock does. futexWait is already allowed to return prematurely (NOTEQUAL and TIMEDOUT). We could get rid of those two return codes and simply return false from futexWait if it is returning due to something other than some thread doing futexWake. When false is returned the client has no choice but to loop around again.

Then implementations do as follows. Each thread gets an internal thread-local variable called spinCount. This will be initialized to whatever the implementation thinks is the optimal number of times that you should spin. For example, WebKit would pick 40.

Every call to futexWait will inspect the spinCount. If it is greater than zero, futexWait will do whatever spin magic the implementation believes in (so WebKit will call sched_yield) and then it will decrement the spinCount and return false. If it is zero, futexWait will reset it to 40 (or whatever the implementation's optimal spin setting is) and then it will actually do the futex wait operation, which means that it may return false (on timeout or notequal) or true (some futexWake happened).

If you sort of imagine how this will work, you will see that this causes many locking algorithms that do no spinning of their own to become like WebKit's adaptive 40-spins-with-sched_yield lock, except for an occasional hiccup where a lock will do the real futexWait even though it would have been better to spin. I think that this will not impact performance much because for real-world programs that use locks, there is a huge plateau of different spinning strategies that are roughly optimal. Some locking algorithms will become suboptimal adaptive locks because they think that futexWait has a high probability of returning true, but even such algorithms will still perform a bit better this way than if they had no spinning.

I looked at the following futex-based algorithms and all of them would magically turn into adaptive locks if futexWait had this behavior:

The glibc lowlevellock that serves as the basis for pthread_mutex.
The usersem (one of the original futex-based locks, it's a counting semaphore). This one turns into a suboptimal adaptive lock because it requires that any thread that wants to futexWait essentially reserves the semaphore for this purpose, and then unreserves it if futexWait returns false. But I don't care too much about this problem because usersem is suboptimal to begin with.
WebKit's WTF::Lock.

lars-t-hansen · 2016-03-25T17:16:24Z

@pizlonator,

I see the appeal of that API, the spinning is an implementation detail and does not show up in the spec at all.

The API is a little poorer than the current one because it loses the information in the return value from futexWait, adding complexity to user code that wants to use timeouts notably. Not sure how much that matters, since timeouts must be recomputed if we're looping around futexWait anyway. On the third hand, futexWait is currently specified so that such looping is generally not necessary (no spurious wakeups allowed), so the need to spin adds complexity there. Unavoidable though, if we're going to spin.

I'm probably more inclined to decouple the spinning from futexWait and instead introduce something like an Atomics.pause() API that takes an argument that is an iteration value (>= 0), this method would do what it needs to do for the platform appropriately for that value - relax/yield/whatever. (There's some hand-waving here about the program deciding that at a high enough speed, but maybe, as you say, always yielding is about good enough.) Suppose the method returns true if it can go again for a higher count and false otherwise. Using this in combination with the futexWait we have, programs can trade complexity (recomputing timeouts, managing the spin count, etc) for performance - just using futexWait without spinning is easy, but potentially slower; but existing code like the libc we're using for emscripten is already managing its own spin loops and will benefit from the decoupling I expect.

(It might also make sense to expose Atomics.pause to asm.js, and call-out to futexWait only when we can't spin more.)

pizlonator · 2016-03-25T18:08:18Z

@lars-t-hansen That mostly makes sense to me.

Can you be clear on what you think pause() will do with the iteration value? I'd like to understand if we can get away with a pause() implementation that doesn't take any arguments. I'd also like to know what data anyone has on the profitability of different implementations of pause(). I've heard you and others make the claim that there is a portability issue with how you spin. I'm aware of this, but I'm optimistic that we can find something simpler than requiring the client to pass the iteration value.

Here's what I know:

glibc's lowlevellock, which uses futexes on Linux and serves as the core of pthread_mutex_lock, does not spin at all by default. It relies on futexWait being fast enough that it's fine to call into it if you fail the fast path lock acquisition. I've benchmarked these GNU/Linux locks against a bunch of other implementations including ones that spin, and found that this no-spin approach actually works fine in practice. It's not my preferred way of implementing locks, but as I said, I lack data that proves that the lack of spinning hurts that lock. It would be a shame if we added spinning stuff to the spec even though there exists a way of implementing locks on top of futexes that doesn't require the client to think about spinning.
On Linux, Darwin, and AIX, it appears to be most profitable to spin 10-100 times, calling sched_yield() each time. Doing anything else slows down real-world benchmarks (in Jikes RVM we tested this on the multi-threaded subset of the Dacapo benchmarks as well as some older multi-threaded benchmarks like portBOB and SPECjbb; and in WebKit we tested this on our concurrent JIT and parallel GC). Note that I've consistently found that you should never sleep during a spin loop. I've also consistently found that using x86's pause makes no difference on macrobenchmarks and that failing to yield in the spin loop is at best no better than yielding, and at worst it's a regression.
On Windows, it appears to be most profitable to spin about 1000 times without making any syscalls. All of the yielding/sleeping syscalls are too expensive, so you want to just hint to the CPU that you're spinning. So, you can call use x86 pause in the spin loop.

So, the highest performance implementation of pause on POSIX platforms will be sched_yield and the highest performance implementation on Win32 will be the x86 pause instruction. This gets pretty weird, since if you do this then you have to communicate to the client that they must spin a different amount of times on different platforms. I think we should avoid such a situation because it means that it will be too easy to write code that is not perf-portable.

This is why I was looking for an easy fix that doesn't require exposing too much to the client.

Another option is to say that Atomics.pause takes no arguments and must do the moral equivalent of sched_yield, which on Win32 would mean busy-waiting about 100 times while calling 'pause' - the reason why sched_yield is so good is that it's a hint to run another thread that the kernel is free to ignore, and in cases where there aren't other threads to run, it just introduces a tiny delay in execution. That delay is smaller than what you'd get from any of the sleep syscalls but larger than if you just used the x86 'pause' instruction. Win32 has very expensive yield/sleep operations (like SwitchToThread), so the best you can do is a tiny busy-wait loop to emulate the small delay of sched_yield.

To summarize, I'm fine with Atomics.pause() working the way you propose, but I'd like for us to have a good justification for it. It should be more than just that someone ran a microbenchmark. While I appreciate that @ogiroux does crazy spinning in synchronic, that's just not how most production locks work. They either do no spinning (Linux) or they have a dead-simple spin loop (RVM, WebKit, etc) that doesn't try to do anything super smart.

lars-t-hansen · 2016-03-27T16:24:32Z

@pizlonator, re the system-specific behavior I've been leaning on @ogiroux's testimony since he has more experience than me. What you're reporting backs it up, to some extent, but I understand your desire to hide it as much as possible / look for platform-independent solutions.

My intent was for Atomics.pause() to use the iteration value in some explicitly system-specific way to determine what to do and to return false if calling pause() again with a higher value would not do anything useful. The contract would be simple and would ideally require pause() to have no state. On Darwin one would imagine that for an iteration value i < 10 (say), pause() would call sched_yield and return true; for i == 10 it would call sched_yield and return false; and for i > 10 it would just return false.

The idea is that the only "communication" you need to give to the client is that it should increment the loop counter and check the return value; it doesn't need to be told what the loop limit is, it only needs to start at 0.

I agree that Atomics.pause() should have good justification behind it before we introduce it. At this point it might be possible to introduce it in Firefox, do a review on our Futex implementation to check that it is as good as it should be (it isn't), and then try to write some programs with the API to see if we get some benefit out of it and if the ergonomics are adequate.

pizlonator · 2016-03-28T00:28:03Z

I'm happy with this. I'd like to see data on what the perf impact is over the obvious alternatives (not spinning at all or spinning without calling pause()), because that would make for useful informative text.

-Filip

On Mar 27, 2016, at 9:24 AM, Lars T Hansen notifications@github.com wrote:

@pizlonator, re the system-specific behavior I've been leaning on @ogiroux's testimony since he has more experience than me. What you're reporting backs it up, to some extent, but I understand your desire to hide it as much as possible / look for platform-independent solutions.

My intent was for Atomics.pause() to use the iteration value in some explicitly system-specific way to determine what to do and to return false if calling pause() again with a higher value would not do anything useful. The contract would be simple and would ideally require pause() to have no state. On Darwin one would imagine that for an iteration value i < 10 (say), pause() would call sched_yield and return true; for i == 10 it would call sched_yield and return false; and for i > 10 it would just return false.

The idea is that the only "communication" you need to give to the client is that it should increment the loop counter and check the return value; it doesn't need to be told what the loop limit is, it only needs to start at 0.

I agree that Atomics.pause() should have good justification behind it before we introduce it. At this point it might be possible to introduce it in Firefox, do a review on our Futex implementation to check that it is as good as it should be (it isn't), and then try to write some programs with the API to see if we get some benefit out of it and if the ergonomics are adequate.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

taisel · 2016-03-28T03:31:14Z

Any reason for discrete time values over abstract hint constants if it were to take an argument?

pizlonator · 2016-03-28T13:43:22Z

What abstract unit of time would you propose to use?

Usually the challenge in a spin loop is that you want to sleep for less time than the amount of time that it takes to check the current time.

-Filip

On Mar 27, 2016, at 8:31 PM, Grant Galitz notifications@github.com wrote:

Any reason for discrete time values over abstract hint constants if it were to take an argument?

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

taisel · 2016-03-28T15:19:51Z

I'm afraid to even specify that. I'm personally against giving "time" to the pauses, in order to make the pauses only defined in helping circumvent wake/wait syscall latencies in usage. If people want time then add a function with the name sleep in it.

lars-t-hansen · 2016-03-29T02:37:05Z

@taisel, the typical usage pattern here would be that the thread wants to sleep until it receives a signal, but normally that signal is also accompanied by a change in a shared cell (which futexWait checks so that it does not go to sleep if the signal has already been sent). The spinloop before the futexWait is thus used to check the value change very often for a time if the signal can be expected "soon". (If the signal is expected to be slow there's no reason to spin.) A "sleep" primitive does not serve that purpose, since we don't know how long to sleep for. (I guess you chould have three values MICRO, MINI, MAXI but then you're back to handling your own backoffs and that defeats the purpose to a fair extent I would say.) The iteration count is so abstract that the user code does not have to worry. And again, there's no requirement to use pause(). What is now just a call to futexWait(a, n, v) would instead be:

for (let i=0;;i++) {
  if (Atomics.load(a,n) != v) return;
  if (!Atomics.pause(i)) break;
}
Atomics.futexWait(a,n,v);

Of course, this would have to be significantly better performing (and generally no worse) in a range of scenarios than raw futexWait() to be worth the bother.

pizlonator · 2016-03-29T02:54:50Z

I'm starting to really like this primitive. This example is very convincing.

-Filip

On Mar 28, 2016, at 7:37 PM, Lars T Hansen notifications@github.com wrote:

@taisel, the typical usage pattern here would be that the thread wants to sleep until it receives a signal, but normally that signal is also accompanied by a change in a shared cell (which futexWait checks so that it does not go to sleep if the signal has already been sent). The spinloop before the futexWait is thus used to check the value change very often for a time if the signal can be expected "soon". (If the signal is expected to be slow there's no reason to spin.) A "sleep" primitive does not serve that purpose, since we don't know how long to sleep for. (I guess you chould have three values MICRO, MINI, MAXI but then you're back to handling your own backoffs and that defeats the purpose to a fair extent I would say.) The iteration count is so abstract that the user code does not have to worry. And again, there's no requirement to use pause(). What is now just a call to futexWait(a, n, v) would inst ead be:< /p>

for (let i=0;;i++) {
if (Atomics.load(a,n) != v) return;
if (!Atomics.pause(i)) break;
}
Atomics.futexWait(a,n,v);
Of course, this would have to be significantly better performing (and generally no worse) in a range of scenarios than raw futexWait() to be worth the bother.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

taisel · 2016-03-29T07:46:33Z

@lars-t-hansen
Pardon my trainwreck of thoughts here, but what I think I'm getting at is Atomics.pause could be expanded in scope for covering more use cases than a tight spin with exit conditon.

Edit: Merging the comments from above... below:

What if someone could query for the proper starting constant?
let i = Atomics.getBackOffTime(n)
What if we want unconditional thread sleep, where a memory location for futexWait is unnecessary? I really want to avoid having victim arrays around for this case, which I've already run into. Should we have some dedicated function call for this or roll it into Atomics.pause?
Could Atomics.pause or some sleep variant have different limitations on main thread than that of Atomics.futexWait?

The desires listed above might be better suited for a separate function call rather than Atomics.pause, to reduce the overhead of an Atomics.pause call itself.

pizlonator · 2016-03-29T08:15:06Z

Actually, there is nothing wrong with using futexWait for this purpose.

-Filip

On Mar 29, 2016, at 12:34 AM, Grant Galitz notifications@github.com wrote:

There's also the awkwardness that futexWait can be used as a sleep call anyhow, by never having a wake signal passed to the worker that calls such, relying on the timeout value instead... I'm hoping (ab)use of such can be avoided and shoved over to a proper sleep.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

taisel · 2016-03-29T08:42:35Z

@pizlonator It feels awkward to specify a memory location when the use case doesn't call for a conditional.

taisel · 2016-03-29T09:04:42Z

I just remembered window.setImmediate exists, so I'm wondering if Atomics.pause exists on the main thread, how will the two differ in implementation? I understand Atomics.pause is blocking based, while setImmediate returns to the message pump, but I'm wondering if there's fine print throttling that would be different between the two. After applying main thread restrictions to Atomics.pause, of course.

lars-t-hansen · 2016-03-29T13:26:23Z

I just remembered window.setImmediate exists, so I'm wondering if Atomics.pause exists on the main thread, how will the two differ in implementation

Atomics.pause would be synchronous, setImmediate is asynchronous.

I understand Atomics.pause is blocking based

I actually think it should not be thought of as "blocking", but as an efficient spin-wait.

After applying main thread restrictions to Atomics.pause, of course.

And I also don't think the main-thread restriction on Atomics.futexWait should apply to Atomics.pause. One of the intriguing prospects about pause is that it can be used to implement efficient optimistic waiting on the main thread without risking blocking the main thread due to a lost signal bug. I'm winging the syntax here, but consider:

async function expectUpdate(ia, index, current) {
   for (let i=0;;i++ ) {
     if (Atomics.load(ia, index) != current) return;
     if (!Atomics.pause(i)) break;
  }
  await slowExpectUpdate(ia, index, current);
}

where slowExpectUpdate probably needs some additional working storage for coordinating with whoever will signal it, but basically it registers its interest in a change of the value away from "current" and turns into a promise that will be resolved when that change is observed, I'm virtually certain that can be implemented on top of regular message events (since I have code that does that now, without the promises or async/await).

This may or may not be an improvement over other alternatives and we'll have to experiment, but in situations with high message frequency it could make communicating with the main thread more affordable.

lars-t-hansen · 2016-03-29T14:30:41Z

POC for Firefox: https://bugzil.la/1260432. This does not inline Atomics.pause() and I'm not sure why it would, either, but it could.

I adapted a test program I had lying around to use Atomics.pause() before futexWait and the performance improved adequately (more than 20x). But (a) my futex implementation is still not realistic enough, it has a single global lock, and (b) the test program is probably not representative. More data needed, for sure. The test program in question is the futex.html case here, see futex-worker.js.

pizlonator · 2016-03-29T15:57:26Z

I understand that concurrency primitives can be tricky to understand and futexWait is no exception. I suggest reading: https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf

-Filip

On Mar 29, 2016, at 1:42 AM, Grant Galitz notifications@github.com wrote:

@pizlonator It feels awkward to specify a memory location when the use case doesn't call for a conditional.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

taisel · 2016-03-29T16:37:30Z

@pizlonator In most cases I have utilization for the conditional in futexWait, and I understand futexWait has to get passed a shared array to manage the signals. My comment pertains to the rarer case where a general mutex non-conditional wait is all that's needed with no user-space check. Although passing a function with a shared array with no target indice seems awkward as well to use. I also completely understand reducing the API surface to keep things simple. I'm asking whether or not adding a slightly more specific wait is worth the extra function in the API, and if the overhead difference validates having such.

pizlonator · 2016-03-29T16:39:09Z

You are wrong about mutex non-conditional wait. It will use the conditional in the futexWait. If you aren't using it then you aren't using futexes correctly.

-Filip

On Mar 29, 2016, at 9:37 AM, Grant Galitz notifications@github.com wrote:

@pizlonator In most cases I have utilization for the conditional in futexWait, and I understand futexWait has to get passed a shared array to manage the signals. My comment pertains to the rarer case where a general mutex non-conditional wait is all that's needed with no user-space check. Although passing a function with a shared array with no target indice seems awkward as well to use. I also completely understand reducing the API surface to keep things simple. I'm asking whether or not adding a slightly more specific wait is worth the extra function in the API, and if the overhead difference validates having such.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

taisel · 2016-03-29T16:40:57Z

I've weighted against using such in an extremely rare case where it's known the worker will always end up ending before the next signal, and it's allowed to skip one signal (by design). I think the misunderstanding is whether or not a signal skip is required or allowed for in the batching used. I went against setting/clearing additional variables to track whether the signal was received or not to avoid that minuscule overhead.

I did have a hard time justifying the elision of the conditional, and almost everywhere else I do indeed use the conditional in futexWait, as I agree it's very necessary.

pizlonator · 2016-03-29T16:52:44Z

That's not the common case. The common case is that workers that share an array buffer will want to have many mutexes or condvars.

-Filip

On Mar 29, 2016, at 9:40 AM, Grant Galitz notifications@github.com wrote:

I've weighted against using such in an extremely rare case where it's known the worker will always end up ending before the next signal, and it's allowed to skip one signal (by design).

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

taisel · 2016-03-29T16:57:05Z

@pizlonator I'm just golfing with suggestions here, in the hopes that the published API is polished out the door. I'm hoping that the overhead of the rarer cases is taken into account, and that power saving is promoted, as well as being developer friendly. Don't worry, I approve of the API as it is, I'm just trying to find corner cases or possible weirdness/overhead.

taisel · 2016-03-29T16:59:04Z

I do want to be refuted. The comment golf is to verify we are on the right path.

lars-t-hansen mentioned this issue Mar 24, 2016

Why not synchronic #12

Closed

jfbastien mentioned this issue Mar 24, 2016

Formal memory model #88

Open

lars-t-hansen added the Futex support label Mar 25, 2016

lars-t-hansen mentioned this issue Mar 25, 2016

Why use numbers for enum values? #69

Closed

lars-t-hansen mentioned this issue May 18, 2016

Reconsider restrictions on Atomics.wait in the browser thread #100

Closed

lars-t-hansen added the Future work label Jul 21, 2016

lars-t-hansen mentioned this issue Jan 13, 2023

Spinloop relaxation instruction WebAssembly/threads#15

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design: micro-wait primitives for efficient locks #87

Design: micro-wait primitives for efficient locks #87

lars-t-hansen commented Mar 24, 2016

ogiroux commented Mar 24, 2016

lars-t-hansen commented Mar 24, 2016

pizlonator commented Mar 24, 2016

lars-t-hansen commented Mar 25, 2016

pizlonator commented Mar 25, 2016

lars-t-hansen commented Mar 27, 2016

pizlonator commented Mar 28, 2016

taisel commented Mar 28, 2016

pizlonator commented Mar 28, 2016

taisel commented Mar 28, 2016

lars-t-hansen commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

taisel commented Mar 29, 2016

lars-t-hansen commented Mar 29, 2016

lars-t-hansen commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

taisel commented Mar 29, 2016

Design: micro-wait primitives for efficient locks #87

Design: micro-wait primitives for efficient locks #87

Comments

lars-t-hansen commented Mar 24, 2016

ogiroux commented Mar 24, 2016

lars-t-hansen commented Mar 24, 2016

pizlonator commented Mar 24, 2016

lars-t-hansen commented Mar 25, 2016

pizlonator commented Mar 25, 2016

lars-t-hansen commented Mar 27, 2016

pizlonator commented Mar 28, 2016

taisel commented Mar 28, 2016

pizlonator commented Mar 28, 2016

taisel commented Mar 28, 2016

lars-t-hansen commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

taisel commented Mar 29, 2016

lars-t-hansen commented Mar 29, 2016

lars-t-hansen commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

pizlonator commented Mar 29, 2016

taisel commented Mar 29, 2016

taisel commented Mar 29, 2016