
V Concurrency Considerations #3814

Closed
ylluminate opened this issue Feb 23, 2020 · 90 comments
Labels
Feature Request This issue is made to request a feature.

Comments

@ylluminate
Contributor

Transferred from a document posted by @cristian-ilies-vasile in these documents:

V concurrency high level design

After careful consideration of the proposed concurrency models and the suggestions expressed on the Discord v-chat channel, the high-level design is based on the Go language model (message passing via channels), which in turn is a variation of communicating sequential processes.

I did read papers on the actor model, but it seems that coders do not use all the primitives provided by the language and resort to threads and queues (see the paper Why Do Scala Developers Mix the Actor Model).

Almost all high level requirements are taken from Proper support for distributed computing, parallelism and concurrency published on github.

Because there are many words used more or less interchangeably - coroutine, goroutine, fiber, green thread, etc. - I will use the term green thread.

Key points

  • the language will provide primitives able to deal with green threads – the green threads monitor/scheduler will be part of the language
  • there will be no yield() points in code; the scheduler can suspend / resume / block / cancel any green thread in a preemptive mode
  • based on the memory model the green thread will be stackless or stackful
  • one green thread could spawn n other green threads
  • a green thread must be reentrant
  • the scheduler could start a new instance of a green thread in order to accommodate load bursts (elastic computing)
  • Green threads communicate using message passing (channels) instead of shared variables
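
A rough illustration of the channel-based model behind these key points is sketched below. It assumes Go-style channel syntax (buffered chan types, <- for send/receive, or {} on receive from a closed channel) and the go keyword for spawning, none of which was finalized in V at the time of this summary:

    fn producer(ch chan int) {
        for i in 0 .. 5 {
            ch <- i * i // send a value to the consumer
        }
        ch.close()
    }

    fn main() {
        ch := chan int{cap: 2} // buffered channel shared by both green threads
        go producer(ch)        // spawn a lightweight thread
        for {
            x := <-ch or { break } // receive until the channel is closed
            println(x)
        }
    }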

Open points

@ylluminate added the Feature Request label on Feb 23, 2020
@elimisteve
Member

I very very much want to see channels in V! They're one of the best ideas in Go; they're both simple and powerful, which is rare.

@elimisteve
Member

@ylluminate See related discussion: #1868

@cristian-ilies-vasile

cristian-ilies-vasile commented Feb 23, 2020

@elimisteve
It's there.
Almost all high level requirements are taken from Proper support for distributed computing, parallelism and concurrency published on github.

@dumblob
Contributor

dumblob commented Feb 25, 2020

@ylluminate @cristian-ilies-vasile @elimisteve thank you for summarizing the high level goals/requirements - they're pretty much what I envision 😉. I'll keep an eye on this to see how everything evolves.

@dumblob
Contributor

dumblob commented Feb 25, 2020

An amendment:

the green threads monitor/scheduler will be part of the language

should read:

the green threads monitor/scheduler will be part of the language, but will be extensible by the programmer to allow changing the default scheduling behavior

@cristian-ilies-vasile

@dumblob
I updated the document shared on google drive with your comment.

@cristian-ilies-vasile

https://golang.org/doc/go1.14#runtime
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection.

@dumblob
Contributor

dumblob commented Feb 28, 2020

@cristian-ilies-vasile that's definitely interesting (see the specification), but it's still by far not fully preemptible (hence it's called asynchronously preemptible): it actually still just inserts safe points, but starting from Go 1.14 they'll be inserted in many more places - still carefully enough not to increase the overhead much. Actually they tuned it so that the overhead is in the majority of cases not measurable, though in some cases the overhead seems to be enormous, as seen from this benchmark and its equivalent in plain C which doesn't suffer from any such bottlenecks.

It's also good to note here, that the implementation of "asynchronously preemptible" in go is really tricky and is definitely not universally applicable (hehe) to other languages (e.g. because of the problem on Windows which they kind of "solved" for go semantics and internals).

Though I generally like the Go design with "safe points" - I deliberately didn't go into details like this when describing the "interleaving between concurrency and parallelism" as outlined in #1868 (comment), because it's a lot of work with "little benefit" (as you can see, even Go tackles preemptiveness - and to date still just partially - only now, after many years of development by hundreds of engineers, many of them full-time Go devs).

@wanton7

wanton7 commented Feb 28, 2020

Not sure if this adds any value to this conversation, but ScyllaDB is built on top of this async framework: https://github.com/scylladb/seastar
There is also this Redis clone that uses it: https://github.com/fastio/1store

Maybe instead of fibers etc. this kind of library solution could be used and added to vlib?

@pquentin

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

@dumblob
Contributor

dumblob commented Mar 13, 2020

@pquentin thanks for the pointer - didn't know about the Trio concurrency library (and the "nursery concept"). It's a very good approach to concurrency programming indeed. In the context of V I have these observations:

  1. I'm curious how long-running "background" tasks would be implemented in V, as the authors of the nursery concept recommend using yield to make a generator, but there are no generators/yield in V

  2. Assuming my proposal gets implemented in V, nursery objects might be "simulated" using the pluggable scheduler from my proposal. The scheduler state can be tinkered with, i.e. read from and written to, at any time, pretty much as nursery objects have to be passed around, though it'll be less convenient to traverse the one global graph in the scheduler state than to have an explicit nursery object from each "split" where spawning happened. That said, my proposal allows for a cheap abstraction in the form of an additional API modeled after nurseries, but doesn't enforce it.

    I think the only thing missing is to specify that the pluggable scheduler has to handle "exceptions" (i.e. optionals "raised" by return error() and panics "raised" by panic(), if panic() will stay in V). @cristian-ilies-vasile could you please add this to your overview?

    Care would have to be taken also when implementing the automagical (de)spawning of reentrant go routines together with nurseries, as that actually means changing the nursery objects dynamically at arbitrary times (maybe making nursery objects just "live views" of the pluggable scheduler state?).

  3. One thing to contemplate is enhancing go routines to immediately return something. E.g. a handle which could then provide methods like value<t>() t, which would be a blocking method returning exactly what the routine would return if it wasn't executed with prepended go (including optionals, of course), running() bool, which would check in a non-blocking manner whether the routine already finished, wait( s float ) to wait for completion with a given timeout, etc. (see the sketch after this list). Such a handle would need to support reentrancy though, so the methods above would need to distinguish which instance we're talking about 😉.

    This might ease an implementation of the nursery concept on top of the proposed primitives (with the chance of them getting included into V standard library and maybe even as built-in).
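
To make point 3 concrete, here is a purely hypothetical usage sketch - the handle and its value()/running()/wait() methods are just the names proposed above; nothing like this exists in V:

    h := go compute(42)      // hypothetical: `go` returns a handle instead of nothing
    if h.running() {         // non-blocking: has the routine finished yet?
        println('still working')
    }
    h.wait(2.5)              // block for at most 2.5 s waiting for completion
    result := h.value() or { // blocking; propagates the routine's optional/error
        panic(err)
    }
    println(result)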

@elimisteve
Member

@pquentin Interesting concept, but preventing goroutines from being spawned without permitting execution of the main thread/goroutine is extremely limiting and would prevent a huge percentage of the value gained from having them in the first place.

The argument made against goroutines is honestly a severe straw man, as the normal and common way to achieve a nursery-like pattern in Go is to use a WaitGroup, or to pass a "done channel" (done := make(chan struct{})) to each function or method spawned in a goroutine, enabling each to signal when it's done doing its work.
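
For comparison, a minimal sketch of the same "done channel" idea transposed to V, assuming the Go-like channel syntax discussed in this issue (channels were not yet part of V when this was written):

    fn worker(id int, done chan int) {
        // ... do the actual work here ...
        done <- id // signal completion by sending our id
    }

    fn main() {
        n := 4
        done := chan int{cap: n}
        for i in 0 .. n {
            go worker(i, done)
        }
        for _ in 0 .. n {
            finished := <-done // wait until every worker has signalled
            println(finished)
        }
    }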

The fact that general solutions exist in the ecosystem is eventually admitted in the middle of the 4th footnote:

[Edit: I've also been pointed to the highly relevant golang.org/x/sync/errgroup and github.com/oklog/run in Golang.]

That said, the point about stack traces only going back to the point where the goroutine was launched, rather than going all the way back to main, is an interesting one I hadn't realized 👍.

@smurfix

smurfix commented Mar 13, 2020

I disagree. The argument against unstructured go is not a straw man, just like the arguments against "goto" a generation ago weren't straw men. Enforced structure allows new ways of reasoning about your code, thus enabling you to code the Happy Eyeballs algorithm in 40 lines of code instead of 400. This directly translates to fewer bugs.

Yes, cool, the ecosystem has solutions, but if it's way easier to not use them (e.g. because they require a heap of boilerplate in every function call, can't handle cancellation, and whatnot) they're ultimately not helpful. In Trio, "waiting for your child tasks" and "leaving the scope of a nursery" is exactly the same thing and requires zero lines of code (just the end of a block, in Python that's a de-indent), which again helps reduce the bug count and reduces the cognitive load on the programmer.

@dumblob The yield thing is just a convenient wrapper to package "run some code that returns a value, SOME_CODE, run some more code to clean up no matter how SOME_CODE exited" in a single function. The caller says with wrapper() as value: SOME_CODE, and Python guarantees that the cleanup code always runs when you leave the scope of that with block.

In current V (or Go) you'd use a defer block for cleanup, but that's potentially buggy – you must never forget the defer line, otherwise you have a resource leak, an unreleased lock, or whatever. Clean-up should be encapsulated in the object – the caller (i.e. the person writing the calling code) shouldn't have to think about whether, let alone how, to do it. Contrast Python's

    with open("/some/path") as file:
        write_to(file)

with

    file := os.open('/some/path')
    defer { file.close() }
    write_to(file)

IMHO that second line shouldn't exist. Not for files, not for locks, and not for nurseries.

In Go, starting a goroutine is as easy as it gets – but cleaning it up is not and needs to be done manually, esp when error handling is involved. Making coroutine creation slightly more difficult (by requiring a scoped nursery object to attach them to) is a no-brainer when the dividend you get from this is trivial cleanup.

@elimisteve

preventing goroutines from being spawned without permitting execution of the main thread/goroutine

I don't understand that sentence.

@dumblob
Contributor

dumblob commented Mar 14, 2020

In current V (or Go) you'd use a defer block for cleanup, but that's potentially buggy – you must never forget the defer line, otherwise you have a resource leak, an unreleased lock, or whatever.

Exactly - that's the reason why I'm curious how we would implement the long-running "background" tasks using the nursery concept in V, as there are currently no means in V guaranteeing the existence of a proper defer. And having no guarantees totally defeats the purpose of the whole nursery concept 😉.

@smurfix

smurfix commented Mar 14, 2020

Having no guarantees is detrimental to a whole lot of other constructs too. Locks for instance, or not leaving a file handle open when I'm done with using it.

So that's not an argument against nurseries, that's an argument for statically-bounded-visibility of objects – with a destructor that is guaranteed to be called exactly once. As V doesn't seem to have that concept, perhaps the first step should be to add it.

@dumblob
Contributor

dumblob commented Mar 14, 2020

Having no guarantees is detrimental to a whole lot of other constructs too. Locks for instance, or not leaving a file handle open when I'm done with using it.

I don't think it's comparable. I argue that everything from unsafe {} in V (e.g. locks) can be easily avoided by other safe constructs V offers, and thus I won't discuss nor assume that in this discussion (that's the whole point of the existence of unsafe{} in V). What's left? Not much. Actually I think only files, and then a group of an infinite number of semantically high-level cases which we can safely ignore in this discussion, because they're not solvable by any technology, but only by the programmer.

Regarding files, yes, that's undoubtedly a hole in the safety. On the other hand I argue that open files (be it virtual unix files or any other files) are by far not as detrimental as concurrency issues.

@medvednikov what about putting file handling also into unsafe{}? It's not as ridiculous as it sounds, as the file system tree hierarchy is slowly getting "replaced" by other DB-like interfaces - especially in the context of the Web (WebAssembly as well as WASI is agnostic to the concept of a filesystem hierarchy, and support for a filesystem hierarchy is just a plain optional module, pretty much like any other DB API module etc.).

In summary, implementing a destructor-like concept in V wouldn't IMHO solve anything (there have been several discussions in this issue tracker and elsewhere - feel free to join them as this thread is about something else).

Back to the topic. Because I like the nursery concept, I'm asking for ideas on how to implement the "background" long-running tasks with guarantees, but without introducing yield in V, and in a way a bit more syntactically readable than https://godoc.org/github.com/oklog/run#example-Group-Add-Context (assuming a larger graph of such "actors"). Anyone?

@cristian-ilies-vasile

@dumblob
Initial requirement:
the scheduler could start a new instance of a green thread in order to accommodate load bursts (elastic computing)
Comments on https://discordapp.com/channels/592103645835821068/592842126304477184:
  • the green thread monitor will spawn a 2nd consumer if the first cannot cope with the load.
  • how exactly will the scheduler/monitor know what to spawn?
    If it needs to do that in a general way, does that mean that you would have to register a custom spawning function to each channel? Also what would happen, if the system does not have enough resources for the new consumer?

@dumblob
Contributor

dumblob commented Mar 17, 2020

  • how exactly will the scheduler/monitor know what to spawn?

The scheduler is pluggable, so it's up to the programmer. If you're asking for defaults, then I'd follow the principle "only good defaults determine success". So I'd say we really want to provide basic scaling ("elasticity") by default, so we should provide a scheduler working moderately well for common scenarios (e.g. average desktop PCs, average laptops, average smartphones, average small/medium servers).

If it needs to do that in a general way, does that mean that you would have to register a custom spawning function to each channel?

I think the default behavior (as mentioned above) should be without any custom spawning function - the scheduler will first check whether the go routine is "eligible" for elasticity (I envision this through the reentrancy flag) and then will look for every channel (the ring buffer) used in the go routine which is also used in another go routine (thus "connecting" them while forming a dataflow programming pattern - imagine e.g. NoFlo or Node-RED) and then mux (multiplex) the channels (inputs & outputs) accordingly.

Also what would happen, if the system does not have enough resources for the new consumer?

By default I wouldn't do anything (except in debug mode I'd log this information if it wasn't logged before in the last minute or so).

@spytheman
Member

@dumblob consider a situation, where due to a temporary peak in load, it cannot keep up, so a channel/buffer/queue is filled to the max. The scheduler decides to launch a new worker, which increases the load even further, and so on and so forth => Things come to a grinding halt.

Adding a positive feedback loop should not be done without understanding its limits and the behavior of the system in the degraded cases. I personally think that it can not be done in the general case. I agree that this kind of elastic scaling may be useful for some cases, where you know that the feedback will be limited by external factors, but I think the default should be to do nothing, except maybe log that the buffer was found to be full, and not try to add new consumers automatically.

@dumblob
Contributor

dumblob commented Mar 17, 2020

consider a situation, where due to a temporary peak in load, it cannot keep up, so a channel/buffer/queue is filled to the max. The scheduler decides to launch a new worker, which increases the load even further, and so on and so forth => Things come to a grinding halt.

This is just one of many situations. The default scheduler should be pretty conservative. There are also many other measures the default (i.e. a simple one) scheduler can (shall) take into account - but in this case it's quite easy, as the scheduler knows everything about the whole graph (unlike in many other technologies where one doesn't know the whole state) and the connections between nodes. Imagine you know the load of all buffers; thus the min/max flow between any two nodes (preferably a go routine instance acting as pure source(s) and a go routine instance being a pure sink(s)) is easily observable, and thus the whole system is perfectly tunable, as the underlying graph theory and approximation algorithms are out there.

My hint to use full buffers as an important information for the scheduler was meant primarily as a trigger for reevaluating the situation, not as the only measure 😉.

Adding a positive feedback loop should not be done without understanding its limits and the behavior of the system in the degraded cases. I personally think that it can not be done in the general case. I agree that this kind of elastic scaling may be useful for some cases, where you know that the feedback will be limited by external factors, but I think the default should be to do nothing, except maybe log that the buffer was found to be full, and not try to add new consumers automatically.

That's the sole reason why reentrant go routines exist. Only reentrant go routines will take part in elasticity - any other instance of a go routine (i.e. a non-reentrant instance) will not get any additionally spawned/stopped instances from the scheduler. So it's again 100% in the hands of the programmer, and V in that case doesn't handle the general case you seemed to be afraid of 😉.

To clarify, I'm not proposing having a cool conservative scheduler supporting elasticity already in V 1.0. The reentrant go routines can be implemented the very same way as non-reentrant ones (i.e. without elasticity) for half a decade and it'll be fine. What I'm proposing though is the semantics, which I find important for now and the upcoming decades and which thus must be part of V 1.0.

@cristian-ilies-vasile

discord quote
how exactly will the scheduler/monitor know what to spawn?
If it needs to do that in a general way, does that mean that you would have to register a custom spawning function to each channel? Also what would happen, if the system does not have enough resources for the new consumer?

@dumblob
I think that could be done in a way that is transparent to the coder. The green thread monitor could know which GT (green thread) is fed by which channel. At every preemptive allocation/reallocation the monitor could compute a few smart statistics and check whether the FIFO grows at a higher rate than the GT can process (a possible heuristic is sketched below).
The GT should have an option indicating whether the coder allows the GTM (Green Thread Monitor) to scale the load automatically or not.
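
A minimal sketch of such a heuristic, purely illustrative - the names (QueueStats, should_spawn_instance) and the threshold are invented here and are not part of any existing API:

    struct QueueStats {
        enqueued_per_s f64 // measured producer rate
        dequeued_per_s f64 // measured consumer rate
        fill_ratio     f64 // current buffer occupancy, 0.0 .. 1.0
    }

    // spawn another consumer instance only for reentrant green threads whose
    // input FIFO keeps growing and is already nearly full
    fn should_spawn_instance(s QueueStats, reentrant bool) bool {
        if !reentrant {
            return false
        }
        return s.enqueued_per_s > s.dequeued_per_s && s.fill_ratio > 0.8
    }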

@dumblob
Contributor

dumblob commented Mar 17, 2020

@cristian-ilies-vasile that's basically how I envision it (though not that simplified 😉), with the only difference that you've split my concept of a general scheduler into two parts - a scheduler and a separate monitor (I'd rather leave them technically together though, as both work on the very same data and splitting them would undoubtedly lead to a performance decrease).

@cristian-ilies-vasile

cristian-ilies-vasile commented Mar 17, 2020

@dumblob
No, the GTM (green thread monitor) contains the scheduler. In fact it is/will be the same piece of code with 2 definitions attached.
I placed a sketch of the concurrency testing document in the shared folder:
https://drive.google.com/drive/folders/1LwsfHBlEdOEIf2IT7SjijKoJic7mN6Zj

@cristian-ilies-vasile

Understanding Real-World Concurrency Bugs in Go
https://songlh.github.io/paper/go-study.pdf

@dumblob
Contributor

dumblob commented Mar 18, 2020

No, the GTM (green thread monitor) contains the scheduler. In fact is/will be the same piece of code with 2 definitions attached.

Ok, I'm fine with that 😉. One minor thing though is the naming - I'm deliberately trying not to use any terminology resembling cooperative multitasking (e.g. "green thread"), because it's highly misleading (for the future as well as for the current state of things), as it implies many guarantees which the full preemption I advocate for in V can not offer.

@cristian-ilies-vasile could you rename the Green Thread Monitor to something way more generic? Maybe V Thread Monitor (and V Thread Scheduler for the scheduler API) or so?

Understanding Real-World Concurrency Bugs in Go
https://songlh.github.io/paper/go-study.pdf

Yeah, that's a good source of issues with the Go semantics (especially the behavior of Go channels, to be precise). I think V can learn from that (though many of their decisions have been made for performance reasons) and improve on it - the full preemptiveness I'm strongly suggesting for V should help with some of them too.

Btw. for Go this paper is not fully valid any more after the introduction of non-cooperative goroutine preemption in Go 1.14.

@cristian-ilies-vasile

OK, here are new descriptions:

V Light Thread Monitor
V Light Thread Scheduler - for the scheduler API

@cristian-ilies-vasile

cristian-ilies-vasile commented Mar 21, 2020

Seems that AI could help with subtle deadlocks errors.
https://medium.com/deepcode-ai/deepcodes-top-findings-9-deadlocks-836f240ad455

@cristian-ilies-vasile

Bounding data races in space and time – part I
https://blog.acolyer.org/2018/08/09/bounding-data-races-in-space-and-time-part-i/

Bounding data races in space and time – part II
https://blog.acolyer.org/2018/08/10/bounding-data-races-in-space-and-time-part-ii/

@cristian-ilies-vasile

I did some research regarding Nim's multithreading engine named Weave:
https://github.com/mratsim/weave
Embracing Explicit Communication in Work-Stealing Runtime Systems
Designing an ultra low overhead multithreading runtime for Nim

Weave is not designed for keeping latency extremely low; it is designed for throughput, which is not automatically a bad thing.
Quote: Picasso is not optimized for real-time or latency: It can be used in latency critical applications like games, however there is no fairness or no system like a priority queue to ensure that a job finishes in priority. The goal is to process the total work as fast as possible, fast processing of individual pieces of work is a side-effect.

I know that V fans are working in the gaming industry - so the main coders should decide what type of concurrency is best suited for the language and its applicability domains.
We can pursue Weave's goals, or, if low latency is paramount, then a full preemptive solution should be designed.

However, Weave is freaky fast:
On matrix multiplication, the traditional benchmark used to classify the top 500 supercomputers of the world, Weave's speedup on an 18-core CPU is 17.5x, while the state-of-the-art Intel implementation using OpenMP achieves a 15.5x-16x speedup.

@dumblob
Contributor

dumblob commented Aug 12, 2020

Posting in hurry.

Weave is not designed for keeping latency extremely low; it is designed for throughput, which is not automatically a bad thing.
Quote: Picasso is not optimized for real-time or latency: It can be used in latency critical applications like games, however there is no fairness or no system like a priority queue to ensure that a job finishes in priority. The goal is to process the total work as fast as possible, fast processing of individual pieces of work is a side-effect.

This is not up to date - Weave actually has an optional centralized queue to bring in a certain degree of fairness, more than acceptable for games and similar use cases. @mratsim, could this be rephrased on the Picasso/Weave sites/docs/wikis?

full preemptive solution should be designed.

You know me already 😉 - I'm definitely a proponent of a fully preemptive solution to make V flawlessly interoperable with anything in the world (all those C/C++/FFI/... libraries).

There are "tags" (like [inline] etc.) in V which could later be used as programmer covenant with the compiler, that the given routine doesn't need to be preempted and if all parallel routines will be tagged like that, then a non-preemptive scheduler can be used instead of the default preemptive one. But that's a very specific case and I'm confident it shouldn't be part of V 1.0 (even the tiniest embedded chips have counter with interrupt and thus allow full preemptiveness which should be cheap with the proposal in the last paragraph of #1868 (comment) ).

@cristian-ilies-vasile

Maybe we can borrow some of Weave's design decisions (channel "plumbing", work stealing) in the preemptive run time.
Also, splitting the potential go fn() workload into 2 main partitions:
a) cpu bound / aka sensitive to throughput
b) io bound + 3 levels of priority (Low, Medium & High) / aka sensitive to latency
is not a weird idea anymore. We can address the specific issues of each in the best manner.

@cristian-ilies-vasile

For Those About To Rock
The concurrency discussion topic on Discord has been launched!
https://discord.com/channels/592103645835821068/743149495818387476

@shelby3

shelby3 commented Sep 15, 2020

I don’t do Discord so I have no access to what is being discussed there.

  • a green thread must be reentrant

I presume this is some attempt to achieve data race safety for data that is referenced by more than one green thread? [I found out it is some idea for duplication of a thread presented by @dumblob.]

This is quite onerous in terms of what a type system will need to prove to fulfill this guarantee. Why can’t each green thread behave as if it is a sequential green thread and then the data race safety issue can be handled separately for shared data, such as by making it immutable or implementing some reference capabilities with ownership for the shared data such as for the Pony language?

  • Green threads communicate using message passing (channels) instead of shared variables

No shared memory access between threads? If yes, then that will be very limiting.

With massively multicore arriving now, and the inevitable cost of sharing memory accesses across a communication bridge between clusters of cores, there will be a need to limit the sharing of memory to a cluster of cores. There may still be a use case for sharing memory within a cluster?

If not sharing memory then what is the reason for the reentrant stipulation?

mux/demux green threads on real threads how to?

Golang presumably has a huge investment in this tuning. There are issues such as starvation pressure versus latency. This is very tricky to get right. Should you not be trying to lift much of the implementation from Go instead of reinventing the wheel with your presumably much more meager resources? Google is presumably running Go on thousands of servers, so it has a significant incentive to invest in the runtime.

Here are some links to a collection of the research, thoughts and notes I collected on M:N concurrency:

keean/zenscript#17 (comment)

keean/zenscript#17 (comment)

https://www.reddit.com/r/scala/comments/ir8ygb/what_is_missing_in_scala_ecosystem/g57axaz/?context=6

  • how to catch programs bugs (Heisenbugs) before the code is shipped to prod systems?

Are you asking what the data race safety model should be?

If you are referring to bugs due to multithreaded contention, don't go there. It's an insoluble hornet's nest. Opt for a lock-free model. There will always be lock contention at some higher-level semantic, but then that is not your concern.

Almost all high level requirements are taken from Proper support for distributed computing, parallelism and concurrency published on github.

I need to read this next. I will post what I have for this comment first.

It also explains why lock-free was not chosen.

I will need to watch that, thanks.

Edit: I found the reference at the ~11:30 juncture in the video. Okay, so it's apples-to-oranges in terms of what I am thinking should be lock-free. Dmitry is referring to the scheduler for the M:N user-threads (aka green threads). I am referring to the API that the Vlang programmer interfaces with. I am saying that Vlang programmers should be prevented or strongly discouraged from using low-level thread synchronization primitives, and instead should employ other lock-free means for (avoiding) contention over shared resources. I elaborated on this topic in the other thread.

I also found this:

https://stackoverflow.com/questions/13102874/is-gos-buffered-channel-lockless

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

I will need to read that, thanks. Just pointing out what I have not yet factored into my reasoning above.

@dumblob
Contributor

dumblob commented Sep 15, 2020

@shelby3 thanks for reaching out to us!

No rush here, V parallelism is work in progress so read slowly and carefully both threads (this one and #1868 ) and all referenced resources & links recursively. Then think of it for a few days, come back, clarify or adjust your concerns expressed above and we'll be more than happy to discuss open questions (if there are any 😉) and accept PRs.

There should be enough time (look at the frequency with which both threads progress and correlate with commits in master), so don't hurry, we'll wait. Thanks.

@shelby3

shelby3 commented Sep 15, 2020

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

I will need to read that, thanks. Just pointing out what I have not yet factored into my reasoning above.

Now I remember I already incorporated that into my analysis in 2018:

https://gitlab.com/shelby/Zer0/-/issues/11#spawning-goroutines-posited-to-be-as-harmful-as-goto

keean/zenscript#17 (comment)

Structured concurrency is applicable when you want to explicitly express short-lived, concurrent operations which are considered to be children that should complete before the algorithm in the parent thread continues, because in this case there are data race issues to contend with, as the forked paths may return in any timing order.

Whereas for goroutines which are to be long-lived entities, no longer owned by or subservient to the thread which spawned them, structured concurrency does not apply and only appropriate data race safety restrictions on shared data apply.

IOW, I suggest we need both go and the nursery concept.

@cristian-ilies-vasile

I read an interesting, down-to-earth paper from 2020 on how to identify rare and hard-to-detect errors & bugs in multi-threading code (locks, unlocks, mutexes, semaphores, atomics etc.).
It is very fast and effective at catching bugs; see page 4 of the paper.
Sthread: In-Vivo Model Checking of Multithreaded Programs / https://hal.inria.fr/hal-02449080/document
The point, not so obviously stated in the paper, is that it uses the SimGrid software under Linux (pthreads and Windows... not a nice match).
SimGrid / https://simgrid.org/
Another paper related to the first one is a presentation; see slide 47 and the following ones / http://www.ccs.neu.edu/home/gene/dmtcp-msr-19.pdf

@cristian-ilies-vasile

On V's concurrency front we started working on a Proof of Concept based on the Weave project.

@shelby3

shelby3 commented Sep 16, 2020

@cristian-ilies-vasile finding and fixing needle-in-a-haystack thread synchronization heisenbugs is not equivalent to a paradigm which can never create those bugs. The latter would be provably 100% free from these bugs at compile-time. I urge you please do not repeat the design mistake of encouraging the use of thread synchronization primitives. I urge you to only allow access to these in unsafe{} code. Safe code should be provably safe at compile-time. As I explained herein and in the companion thread #1868 there are shared data paradigms which are provably safe at compile-time such as immutable shared data. The Vlang compiler could enforce it by for example only allowing threads to share references to immutable shared data.

Just as you would not allow pointer arithmetic in safe code, you should not allow in “safe code” a paradigm famous for creating insidious heisenbugs because it is well known to not be safe at all and to always proliferate insidious bugs. I believe Eric Raymond (re-)coined or first used (in 2014) the term “defect attractor” to describe intentional means to create defects in software.

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits. It’s impossible to thoroughly reason about thread synchronization because it is a dynamic synchronization problem and Turing-complete programs can’t be proven to halt (the Halting problem). Thus these unseeable heisenbugs are not equivalent in genre to other kinds of bugs which can be in many cases reasoned about and prevented (or at least discovered) by human brains and eyeballs or some formal methods, without even running the program.

Additionally, as you presumably know, thread synchronization defeats massively multicore (which has already arrived for servers) due to Amdahl's law (especially in conjunction with amplified contention due to increasing numbers of contending threads). Really you want to embrace immutability; it is the only viable direction going into the future. And Golang does not have any capability for expressing immutability at this time, so this can be a significant advantage for Vlang.

Another reason to favor “future-proof” immutability is what I wrote in the companion thread:

I had posited in 2018 that immutable shared data will scale better to the inevitable future massively multicore shift away from universal hardware cache coherency, because the cache invalidation can be unified with the message between threads to update their copy.

I have contemplated other benefits to immutability.

P.S. I have not read about Weave yet.

EDIT: Eric Raymond wrote the following, which supports my warning:

While the ability to manage lots of concurrency fairly elegantly does pull Go in the direction of what goes on on a modern multi-core processor, CSP is a quite abstract and simplified interface to the reality of buzzing threads and lots of test-and-set instructions that has to be going on underneath. Naked, that reality is unmanageable by mere human brains – at least that’s what the defect statistics from the conventional mutex-and-mailbox approach to doing so say to me.

@elimisteve
Member

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits.

@shelby3 How, specifically?

@krolaw
Contributor

krolaw commented Oct 15, 2020

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits.

@shelby3 How, specifically?

Consider that a variable maintaining a value for security is left in an indeterminate state by a data race. Targeted parallel calls cause races, causing the variable to go undetermined, causing access to be granted when it shouldn't be. The attack may need to be sustained depending on the likelihood of success. Memory is often reused in strange ways; the variable may not even be part of the race directly. Undefined behaviour is undefined, including leaving the vault unlocked.
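
A contrived sketch of that scenario (hypothetical types and names; in safe V this kind of sharing would have to go through shared/lock, which is exactly the point of forbidding it):

    struct Session {
    mut:
        user_id       int
        authenticated bool
    }

    // imagine two threads calling these on the *same* Session without synchronization
    fn login(mut s Session, id int) {
        s.user_id = id         // (1) written first
        s.authenticated = true // (2) written second
    }

    fn check(s Session) {
        if s.authenticated {
            // a racing reader may observe (2) without (1),
            // i.e. grant access before the correct user_id is in place
            println(s.user_id)
        }
    }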

@cristian-ilies-vasile

The first 250 simple tasks successfully ran on the R U S H proof of concept (inspired by Andreas Prell's PhD thesis).

import strconv

fn x_sum(a int, b int, output_ch chan string) { // original function
	x := a + b
	output_ch <- ''
	output_ch <- strconv.v_sprintf('x_sum= %04d', x)
	output_ch <- ''
}

@dumblob
Contributor

dumblob commented Nov 2, 2020

@cristian-ilies-vasile could you put all the sources to one gist to make it easier to comment on and discuss it (with the nice side effect of not polluting this thread)? Thanks 😉.

@dumblob
Contributor

dumblob commented Mar 1, 2021

A recent note on the above-discussed topic of the difference between IO-bound computation and CPU-bound workloads by the author of Weave: https://nim-lang.org/blog/2021/02/26/multithreading-flavors.html

@cristian-ilies-vasile

@dumblob
Thank you for the link.
I had parked the Weave implementation due to some personal reasons.
However, in 2021 the bottleneck is the memory bandwidth, not storage/disk/SSD - see these fresh posts below. So I still think that a Weave adjusted for IO could be a better approach to concurrency - or no concurrency at all at the language level, and use new platforms like Microsoft dapr (dapr.io) or lunatic (https://lunatic.solutions/)

o Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation

The new wave of thinking: One thread-per-core architecture
o Seastar is the first framework to bring together a set of extreme architectural innovations / http://seastar.io/
o The Impact of Thread-Per-Core Architecture on Application Tail Latency / https://helda.helsinki.fi//bitstream/handle/10138/313642/tpc_ancs19.pdf?sequence=1
o Glommio - a Thread-per-Core Crate for Rust & Linux / https://www.datadoghq.com/blog/engineering/introducing-glommio/
o What is io_uring? / https://unixism.net/loti/what_is_io_uring.html

@dumblob
Contributor

dumblob commented Mar 1, 2021

or no concurrency at all at the language level, and use new platforms like Microsoft dapr (dapr.io) or lunatic (https://lunatic.solutions/)

These platforms (and many others) still need to be programmed in some language, though, which needs to do all the heavy lifting 😉.

The new wave of thinking: One thread-per-core architecture

Hm, how is this different from what we've discussed so far (i.e. one os-thread per physical core and then millions of fibers/tasks per such os-thread, using work stealing with some hacks providing preemptiveness to avoid starvation and some level of fairness)?

@cristian-ilies-vasile

cristian-ilies-vasile commented Mar 1, 2021

It is a little bit different.

  • there are no user level threads/fibers -> the abstraction is called a task, which at a low level is just a function, a structure, a few pointers and one or two channels (a rough sketch follows below).
  • there is no preemption; all tasks are kept in stacks, one per logical CPU.
  • the Weave design is a good match for scientific / CPU bound applications.
  • the scheduler is smart and is able to move tasks from busy CPUs to the free ones.

That means once a task is put on a CPU, you cannot suspend it, just wait for it to finish. For a lot of tasks the load on each resource is fairly distributed due to the work stealing algo.
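
A rough sketch of what such a task abstraction might look like in V - the struct and field names are invented here purely to restate the bullet points above in code:

    struct Task {
        run     fn (voidptr)  // the function to execute
        args    voidptr       // pointer to its arguments
        results chan string   // one or two channels for results / notifications
    }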

@dumblob
Contributor

dumblob commented Mar 1, 2021

That means once a task is put on a CPU, you cannot suspend it, just wait for it to finish. For a lot of tasks the load on each resource is fairly distributed due to the work stealing algo.

If it's really like this, then this has a serious issue with starvation - imagine writing a few tasks, each with an (accidental) infinite loop. It's also "not fair" and thus no real-time guarantees (required by games, audio/video calls, etc.) can be made.

Weave has some countermeasures (it has some "good enough" fairness and if that's not enough, it provides a shared input queue; regarding starvation I think Weave doesn't do anything, but there is an easy solution with longjmp to provide real-time guarantees on top), so it's not an issue.

But if you're talking about something other than Weave, then this starvation and unfairness is a show stopper for a default concurrency backend in V.

@elimisteve
Member

@dumblob Is it clear how much overhead Weave requires in order to do more sophisticated scheduling?

@dumblob
Contributor

dumblob commented Mar 2, 2021

Is it clear how much overhead Weave requires in order to do more sophisticated scheduling?

Yes - just see the benchmarks in the Weave repo. It's actually basically fully amortized (not kidding!). So the only question is always: which Weave parameters should I choose for this particular set of tasks in order to trade (tail) latency for throughput and vice versa. Weave's defaults are pretty much usable even for games though, so this parameter choosing shall happen only in extreme scenarios (e.g. someone writes her own kernel or an audio/video signal processing server like JACK, etc.).

@cristian-ilies-vasile

cristian-ilies-vasile commented Mar 2, 2021

@dumblob
My RUSH project is 80% like the original design and implementation done by Andreas Prell in his PhD thesis, so it is a cousin of Weave.

I put here some non audited figures, just to have a perspective. Weave uses the word worker and I use the word robot for the same piece of code.
The figures are taken from different runs on different machines, do not try to correlate them.

A - load on each robot

robot   num_tasks_exec   robot_run_function_duration (micro sec)   delta tasks %   delta timing %
1       39,024           687,698                                   -2.44%          -0.61%
2       42,060           750,044                                    5.15%           8.40%
3       37,339           635,525                                   -6.65%          -8.15%
4       41,577           694,485                                    3.94%           0.37%
total   160,000          2,767,752
avg     40,000           691,938

B - tasks distribution spread by robot's state
If you do not count the last 3 entries, because the robot was sleeping, the overall overhead is less than 5%.

task                                         duration (micro secs)
robot_fsm_crunch_tasks_duration              302944
robot_fsm_move_tasks_on_deque_duration         8768
robot_fsm_1st_read_reqs_duration                946
send_notif_to_chief_robot_long_duration         635
send_req_to_peer_robot_duration                  98
robot_fsm_1st_reqs_and_tasks_deq0_duration       64
robot_fsm_1st_fwd_reqs_duration                  11

robot_fsm_ask_for_task_lr_duration              744
robot_fsm_wait_and_sleep_duration_reqs        14293
robot_fsm_wait_and_sleep_duration_tasks       16668


@dumblob
Contributor

dumblob commented May 7, 2021

On the way to an ultra low overhead equal-priority hard real time preemptive scheduler

Motivated (and surprised) by https://twitter.com/v_language/status/1388154640768851969 I've decided to finally research and write down my ideas I have in my head for about a year.

TL;DR It turns out V could have zero overhead full preemption on Linux and Windows Subsystem for Linux and bare metal (and maybe also some BSD) platforms.

As described in #1868 (comment), V could (should!) leverage combining cooperative scheduling of "go routines" among a dynamic pool (1:1 to the number of currently powered-on CPU cores) of system-level threads ("os threads") with os-triggered preemption (bare metal has a counter/clock interrupt). This os-triggered preemption (an "interrupt" - e.g. a signal) is costly though if not necessary (with cooperative scheduling the majority is not necessary). It's all the costlier the more real-time responsiveness is needed, which is typical for low-latency scenarios like audio, video, games, etc. - all of which are coincidentally domains which also utterly seek high-performance languages to leverage the HW up to the last bit and watt. And that's the domain where V shines or wants to shine.

Now, when WSL supports seamless Linux interaction (incl. games with high fps) through WSLg and Microsoft announced support for eBPF directly in Windows (https://github.com/Microsoft/ebpf-for-windows ), it becomes viable to leverage eBPF in the Linux/Windows/BSD kernel (I'd recommend first reading some BCC eBPF programs - e.g. https://github.com/iovisor/bcc/blob/master/examples/tracing/sync_timing.py ).

The idea is that user space can write to kernel space (thanks to eBPF) without any interrupt, nor any context switch, nor any locking, nor any other restriction - just write at the pointer address, basically. The eBPF app (to whose kernel space memory the user space app can arbitrarily write) can then be run each time a kernel timer fires (it doesn't matter who has set up this timer - whether some kernel driver or some other user space app or whatever - as we just care about being woken up "frequently enough"; in case of NO_HZ there might be the need to set some in-kernel periodic timer according to the user space app's needs though).

If there is a free-running counter in the eBPF app's kernel space, incremented only by a given user space app thread, and the eBPF app maintains a "cached" copy of this counter from the last time the eBPF app ran, together with a monotonic timestamp of the cached value, then by comparing this timestamp it can decide whether it's already been too long since the user space app thread incremented the counter, and can arrange an interrupt of the given user space app thread. The user space app thread increments the counter each time cooperative scheduling on that particular CPU (virtual) core happens. Note that no slow atomic operations (reads, increments, etc.) are needed here, as we don't care about the edge cases - in the worst case it'll just fire the interrupt negligibly more often.

Of course, for thousands of CPU cores there might arise the need to somehow increase efficiency of the eBPF app - maybe by grouping the timer interrupts (though hrtimers in Linux should support tens and hundreds of thousands of timers without performance degradation), but that's just an implementation detail.

Why should this be possible? Because eBPF nowadays supports:

  1. maps (which is the term designating the shared memory address space between kernel space and user space)
  2. attaching e.g. to perf:perf_hrtimer/timer:hrtimer_expire_entry (for periodic running of the eBPF program; for multiple timers per PID struct perf_event_attr could be used to tell them apart)
  3. sending a signal
  4. running in non-root user space (eBPF was originally designed to be run only as root in user space) - this seems to be nowadays the default
  5. running on FreeBSD (though probably with some restrictions potentially hindering this usage)
  6. running on Windows natively (originally: running on WSL and thus on Windows only indirectly - but that doesn't matter anyway to those wanting to squeeze the maximum performance out of their HW)

Implementation note: on *nix systems sending an interrupt signal to the given thread seems to be a bit more work than expected (first the signal is sent to the process, and the signal handler will then most of the time need to "distribute" the signal to the appropriate thread: signal handlers are per-process but signal masks are per-thread).

Note, this is a generic solution and I'm certain readers from other projects (incl. Go lang, D lang, Nim, Python, ... and other languages and frameworks) might jump on this and try to implement it as low hanging fruit (hey, an attribution would be nice guys 😉).

@mintsuki @UweKrueger @cristian-ilies-vasile thoughts?

Additional notes

  1. The described scheduler is the simplest naive form and thus doesn't support e.g. priority groups which might (or might not - depending on many other factors) be somewhat important instead of "blindly using it as os-level scheduler in in-V-written operating systems (i.e. big bare metal apps)".
  2. This brings the burden of eBPF compiler, but no burden on distribution as eBPF apps shall be platform-agnostic if not using unstable trace points (trace points are basically hooks into various parts of Linux/BSD/Windows kernel).
  3. There'll be some work needed for the scheduler to instruct the eBPF app to accommodate at run time after e.g. some CPU cores initially unavailable appear (either not powered on, or simply physically added, or virtually added by the underlying hypervisor). This shouldn't be needed for CPU core removal/power_down (incl. re-addition of the same CPU core), as there won't be much overhead in checking these "superfluous" stale counters in the eBPF app.

@dumblob
Contributor

dumblob commented May 12, 2021

Interesting - just 3 days after I published the above sketch, Microsoft announced support for eBPF directly in Windows - see https://github.com/Microsoft/ebpf-for-windows . And if they'll support some timer hooks (and I'm 100% sure they will), then I think there is no excuse not to write a scheduler leveraging that sooner or later 😉.

@vlang vlang locked and limited conversation to collaborators Sep 22, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
