rfcs: Task groups in Go: per-session resource usage tracking #60589

Open · wants to merge 1 commit into master
Conversation

@knz (Contributor) commented Feb 15, 2021

@knz knz requested a review from a team as a code owner February 15, 2021 17:03

@knz knz force-pushed the 20210215-task-group-rfc branch 4 times, most recently from e07d169 to 3cf68cf on February 15, 2021 17:24
@petermattis (Collaborator) left a comment


docs/RFCS/20210215_per_session_cpu_usage.md, line 155 at r1 (raw file):

  who / what is responsible for resource usage.

  The reason why we prefer an ID-to-state mapping at the application

We could provide a key-value storage facility on the task group similar to what exists on context.Context.
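
For illustration, a minimal sketch of what such a facility could look like, mirroring context.Context (all names here are hypothetical, not part of the prototype):

```go
// Sketch only; neither TaskGroup nor these methods exist in the prototype.
package taskgroupsketch

// TaskGroup carries values shared by every goroutine in the group,
// in the style of context.Context.
type TaskGroup interface {
	// Value returns the value associated with key, or nil if none is set.
	Value(key interface{}) interface{}
	// SetValue associates key with value for the whole group.
	SetValue(key, value interface{})
}
```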


docs/RFCS/20210215_per_session_cpu_usage.md, line 183 at r1 (raw file):

Context: The Go scheduler is preemptive and runs goroutines in segments of time
called "ticks". The specific time length of a tick is variable but is

The phrasing "a tick is variable" confused me. The unit of a tick seems to depend on the architecture, but is not variable within the lifetime of a process. I was interpreting "a tick is variable" to indicate that the unit of a tick was variable within the lifetime of a process.

Hmm, or maybe ticks are variable length within a process because execute is called at arbitrary times. This all makes me quite anxious about using ticks as the unit for reporting CPU usage.


docs/RFCS/20210215_per_session_cpu_usage.md, line 187 at r1 (raw file):

second, TBD how much precisely). For CPU-heavy loads every tick tends
to use its fully allowed length before control is switched over to a
different goroutine, to ensure fair scheduling.

What happens when control switches due to a blocking operation such as a lock acquisition?


docs/RFCS/20210215_per_session_cpu_usage.md, line 202 at r1 (raw file):

The RFC proposes to extend the internal task group struct with a new
`schedticks` field, incremented upon every scheduler tick for the

I think this metric should expose seconds or nanoseconds. Otherwise the value feels difficult to use. I thought initially that tickspersecond() could be used to do a conversion, but I don't think the ticks in tickspersecond() are the same as scheduler ticks. Confusing!

The GC records stats about how much time is spent in assist vs dedicated-mark vs idle-mark by recording entry and exit to these states with calls to nanotime(). See runtime/mgc.go.
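
For illustration, a rough sketch of that nanotime()-style accounting idea, written here as ordinary Go (inside the runtime it would hang off the scheduler's execute/park paths; every name in this sketch is invented):

```go
// Sketch only: accumulate per-task-group CPU time by timestamping entry and
// exit of the "running" state, the way the GC accounts assist/mark time.
package cpuaccountsketch

import (
	"sync/atomic"
	"time"
)

type taskGroup struct {
	cpuNanos int64 // total CPU time consumed by the group's goroutines
}

// onExecute would be called when a goroutine of the group starts running on a P.
func onExecute() (startNanos int64) {
	return time.Now().UnixNano() // the runtime would use nanotime()
}

// onDeschedule would be called when the goroutine is preempted, blocks, or exits.
func (tg *taskGroup) onDeschedule(startNanos int64) {
	atomic.AddInt64(&tg.cpuNanos, time.Now().UnixNano()-startNanos)
}
```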


docs/RFCS/20210215_per_session_cpu_usage.md, line 221 at r1 (raw file):

// GetInternalTaskGroupSchedTicks retrieves the number of scheduler ticks for
// all goroutines in the given task group.
func GetInternalTaskGroupSchedTicks(taskGroup TaskGroup) uint64 {

Have you seen https://tip.golang.org/pkg/runtime/metrics/ which is new in go1.16? I point this out because the Go runtime itself is moving towards having metrics accessible by string keys. The schedticks seems like a metric associated with a task group, but otherwise not significantly different than this new runtime/metrics package.
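
For context, reading one of the existing go1.16 metrics looks like the snippet below; a per-task-group variant would presumably expose similar string keys (no task-group keys are defined anywhere yet):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// These two keys exist in go1.16; a task-group-scoped metric would
	// presumably follow the same naming scheme.
	samples := []metrics.Sample{
		{Name: "/sched/goroutines:goroutines"},
		{Name: "/memory/classes/heap/objects:bytes"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
	}
}
```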


docs/RFCS/20210215_per_session_cpu_usage.md, line 377 at r1 (raw file):

instrument every single entry point in Go where "CPU time" or "memory
usage" can be incurred. This has been attempted in the past and was
largely unsuccessful.

Could we use a sampling-based approach for determining CPU usage? The CPU profiler is arriving at some estimate of CPU usage, but we can't easily map that back to goroutines. Seems like that distance could be closed with extensions or enhancements to the Go runtime.

@tbg (Member) commented Feb 16, 2021

There is another approach that should probably be listed as an alternative, which is to "make do" with what Go gives us out of the box: we use profiler labels and the CPU profile endpoint to get a breakdown of CPU usage mapped to queries. A prototype of that is here:

#60508

The main motivation in this RFC,

Today there is an issue when CPU usage is high, it is very difficult to track down which SQL session or query is responsible for that load.

could be satisfactorily handled with the profiling approach alone. But beyond that, if we're looking towards something like active session history, it gets more sketchy and we're possibly looking at periodically running a CPU profile, or "background tracing" which in itself requires changes to the Go runtime (suggested in golang/go#41554 (comment)). So I'm not suggesting that the idea presented in this RFC does not have merit.
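
For readers who haven't looked at the prototype: the label-based approach boils down to wrapping session work in pprof.Do, roughly as below (the label key is illustrative, not necessarily what #60508 uses):

```go
package sessionlabels

import (
	"context"
	"runtime/pprof"
)

// runWithSessionLabel executes the session's work with a profiler label
// attached, so CPU samples can later be grouped by session.
func runWithSessionLabel(ctx context.Context, sessionID string, work func(context.Context)) {
	pprof.Do(ctx, pprof.Labels("session", sessionID), work)
}
```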

@andreimatei (Contributor) left a comment

docs/RFCS/20210215_per_session_cpu_usage.md, line 307 at r1 (raw file):

a task group to "pages" (blocks) of system memory that are used
to carve heap allocations from.

When a heap allocation is performed, the RAM usage counter on the task

consider trying to introduce an indirection here by tying allocations to "allocators" (like in other languages), not directly to task groups. A task group, or individual goroutines, could then specify what their current allocator is. So for example userspace could write allocWidget(allocator) *widget constructors that allocate with a particular allocator.
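
A sketch of the indirection being suggested here (every name is invented for illustration):

```go
// Sketch only: allocations are charged to an Allocator value chosen by the
// caller (typically supplied by the task group), rather than implicitly to
// the task group itself.
package allocsketch

// Allocator is whatever accounting sink the runtime ultimately charges.
type Allocator interface {
	Account(bytes int64)
}

type widget struct {
	buf []byte
}

// allocWidget is the kind of constructor userspace could write: it charges
// the supplied allocator explicitly.
func allocWidget(a Allocator, n int) *widget {
	a.Account(int64(n))
	return &widget{buf: make([]byte, n)}
}
```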

@tbg (Member) commented Feb 18, 2021

I just learned that tailscale also maintains a fork of the runtime: https://twitter.com/bradfitz/status/1360293997038637058
Now they obviously have some intense Go wizardry on their team and it's not surprising that they could pull it off. I replied on the tweet to see if Brad has anything he'd like to share.

@nvanbenschoten (Member) left a comment


docs/RFCS/20210215_per_session_cpu_usage.md, line 50 at r1 (raw file):

  that can be inspected separately (e.g. via `crdb_internal.jobs`? TBD)

A preliminary version of the proposed Go runtime changes is available

I really like that you did this prototyping and linked to it from the RFC. It makes everything else that follows significantly easier to understand. We should continue to do this.


docs/RFCS/20210215_per_session_cpu_usage.md, line 187 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What happens when control switches due to a blocking operation such as a lock acquisition?

I'm also left wondering this. Does this result in more frequent ticks or work that is not accounted for? Either would pose challenges for meaningful reporting. The first would make ticks incomparable as a measurement of CPU utilization. The second would lead to inaccuracy, especially with the coarse-grained 100ms time quanta (TBD from above) - though I have to imagine it's much smaller than this. EDIT: it looks like it's 10ms: const forcePreemptNS = 10 * 1000 * 1000 // 10ms.


docs/RFCS/20210215_per_session_cpu_usage.md, line 305 at r1 (raw file):

To achieve this we change the heap allocator structures to attach
a task group to "pages" (blocks) of system memory that are used
to carve heap allocations from.

Have you taken a look into what kinds of changes will be needed in the runtime to support this, as you did with CPU utilization above? Without looking myself, I fear that we may not have access to the taskGroupCtx structures in all of the places we need them to perform the proper accounting. We certainly will have access to the task group structure when performing a heap allocation, but what about when freeing it? This would imply associating an owner goroutine/task group to each heap allocation and keeping a record of that somewhere, wouldn't it? Is there such a notion of heap memory ownership in the runtime today?

If not, new bookkeeping here sounds expensive, which may be why you're discussing some loss of precision for small objects down below. Is that how I should be understanding that suggestion?

@knz (Contributor, Author) left a comment

Thanks for your reviews all!

Here's a summary of my changes since the last time you read the RFC:

  • included the pprof label approach as an alternative, with discussion of cons
  • detailed the manual instrumentation alternative
  • spelled out that CPU ticks were merely a PoC, and proposed how to use actual time instead
  • researched and detailed how to get precise RAM usage from Go's current heap alloc
  • added mention of Andrei's idea for custom allocators.

Answers to Tobias' top level comments above:

There is another approach that should probably be listed as an alternative

Done

could be satisfactorily handled with the profiling approach alone

This is incorrect. As now stated in the text of the RFC, the pprof based approach is unsuitable because:

  • The profiling causes frequent interruptions of the goroutines, which
    impairs performance. It also trashes CPU caches and generally slows
    down processing for memory-heavy workloads.

  • It only measures CPU usage during the goroutine interrupts. To get a
    more precise measurement requires increasing the profiling
    frequency, which impairs performance more.

  • The CPU usage output is formatted for consumption by human eyes. It
    is difficult and expensive to re-format so that it can be inspected
    using an API.

I just learned that tailscale also maintains a fork of the runtime

Added a note of this in the RFC.



docs/RFCS/20210215_per_session_cpu_usage.md, line 155 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We could provide a key-value storage facility on the task group similar to what exists on context.Context.

Added a note about that. Thanks.


docs/RFCS/20210215_per_session_cpu_usage.md, line 183 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The phrasing "a tick is variable" confused me. The unit of a tick seems to depend on the architecture, but is not variable within the lifetime of a process. I was interpreting "a tick is variable" to indicate that the unit of a tick was variable within the lifetime of a process.

Hmm, or maybe ticks are variable length within a process because execute is called at arbitrary times. This all makes me quite anxious about using ticks as the unit for reporting CPU usage.

Reworded to clarify. And yes, I agree ticks do not report CPU usage accurately. I clarified in the section title that this is merely a proof of concept, and added a new section with actual CPU time below.


docs/RFCS/20210215_per_session_cpu_usage.md, line 187 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm also left wondering this. Does this result in more frequent ticks or work that is not accounted for? Either would pose challenges for meaningful reporting. The first would make ticks incomparable as a measurement of CPU utilization. The second would lead to inaccuracy, especially with the coarse-grained 100ms time quanta (TBD from above) - though I have to imagine it's much smaller than this. EDIT: it looks like it's 10ms: const forcePreemptNS = 10 * 1000 * 1000 // 10ms.

Yes agreed, CPU yields cause ticks to be shorter. Added a clarification in the text as well as another section for actual CPU utilization.


docs/RFCS/20210215_per_session_cpu_usage.md, line 202 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this metric should expose seconds or nanoseconds. Otherwise the value feels difficult to use. I thought initially that tickspersecond() could be used to do a conversion, but I don't think the ticks in tickspersecond() are the same as scheduler ticks. Confusing!

The GC records stats about how much time is spent in assist vs dedicated-mark vs idle-mark by recording entry and exit to these states with calls to nanotime(). See runtime/mgc.go.

Done.


docs/RFCS/20210215_per_session_cpu_usage.md, line 221 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Have you seen https://tip.golang.org/pkg/runtime/metrics/ which is new in go1.16? I point this out because the Go runtime itself is moving towards having metrics accessible by string keys. The schedticks seems like a metric associated with a task group, but otherwise not significantly different than this new runtime/metrics package.

Very good point. Added a reference to metrics here with a note that the API needs to be refined further.


docs/RFCS/20210215_per_session_cpu_usage.md, line 305 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have you taken a look into what kinds of changes will be needed in the runtime to support this, as you did with CPU utilization above? Without looking myself, I fear that we may not have access to the taskGroupCtx structures in all of the places we need them to perform the proper accounting. We certainly will have access to the task group structure when performing a heap allocation, but what about when freeing it? This would imply associating an owner goroutine/task group to each heap allocation and keeping a record of that somewhere, wouldn't it? Is there such a notion of heap memory ownership in the runtime today?

If not, new bookkeeping here sounds expensive, which may be why you're discussing some loss of precision for small objects down below. Is that how I should be understanding that suggestion?

I did some more research and the situation is actually simpler than I had feared. Now explained in the text. PTAL.


docs/RFCS/20210215_per_session_cpu_usage.md, line 307 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider trying to introduce an indirection here by tying allocations to "allocators" (like in other languages), not directly to task groups. A task group, or individual goroutines, could then specify what their current allocator is. So for example userspace could write allocWidget(allocator) *widget constructors that allocate with a particular allocator.

Good point! Added a section to explain your idea, although this would be a separate enhancement that solves a different problem from the one that motivated this RFC.


docs/RFCS/20210215_per_session_cpu_usage.md, line 377 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Could we use a sampling-based approach for determining CPU usage? The CPU profiler is arriving at some estimate of CPU usage, but we can't easily map that back to goroutines. Seems like that distance could be closed with extensions or enhancements to the Go runtime.

CPU usage in profiles can be tracked back to task groups via the profiler labels. Added a section to explain, as per Tobias' suggestion.

@knz knz force-pushed the 20210215-task-group-rfc branch 3 times, most recently from 19decc4 to 9d246e4 on February 20, 2021 15:31
@petermattis (Collaborator) left a comment


docs/RFCS/20210215_per_session_cpu_usage.md, line 419 at r2 (raw file):

allocations were made.

#### Tweaking the Go heap allocator for precise measurement

Goroutine allocations are performed both for temporary state needed during execution, and for persistent state stored in caches. For example, the SQL execution goroutines will allocate memory to populate the replica cache (or whatever that structure is called in DistSender nowadays). I believe there are also metadata caches present in SQL-land itself. Is there an intention to distinguish between this cache memory allocation and temporary state needed during execution?


docs/RFCS/20210215_per_session_cpu_usage.md, line 434 at r2 (raw file):

  - we change the mapping of P to `mcache`.  Instead of one
    `mcache` per P, we use one `mcache` per P per `taskGroupCtx` --
    i.e. we implement a separate small heap per task group.

How heavyweight are mcache structures? This will significantly increase the number of them in a process, which is worrisome.

@tbg (Member) left a comment

This is incorrect. As now stated in the text of the RFC, the pprof based approach is unsuitable because:

I left inline comments but as a meta comment, in the discussions of alternatives I perceive you as biased. For example, off-hand dismissing a profiling-based approach as "too expensive" doesn't seem constructive. Someone could similarly dismiss your proposal as "too invasive". I agree that there will be judgment calls (for example, what is the price of a subtle bug introduced by switching to custom heap arenas that we ship to customers?) but the profiler overhead is very simple to measure. I'm not even suggesting you do this, just please try to root both against and for the alternatives when you consider them and reflect that in the verbiage.
By the way, if it seems like I'm rooting only for the profiling approach, that is incorrect. I think that if we're forking the runtime it has to be for very well-substantiated reasons, and the possible alternatives factor heavily into that.



docs/RFCS/20210215_per_session_cpu_usage.md, line 526 at r2 (raw file):

manage a custom fork of Go.

For reference, Tailscale was founded in 2019 and has a smaller

You may want to reference the details from the thread though:

  1. they are upstreaming everything, so they wouldn't make any of the changes proposed here. Most of their fork is "this will be in the next Go version but I wanted it now", so
  2. their fork is correspondingly small

It is unlikely that what we are building here will ever be upstreamed. Our fork will possibly be much larger than theirs.

Tailscale has a smaller engineering team, but they have more Go core expertise.


docs/RFCS/20210215_per_session_cpu_usage.md, line 656 at r2 (raw file):

2. to start a CPU usage report, the program requests `StartCPUProfile`
   on every server process simultaneously. An output file / buffer

No, this is not necessary. What you want (and this is the same with your approach) is that when a SQL gateway delegates work to another node, it submits metadata that allows identifying the source of the work. For this approach, this means that labels have to be propagated across RPC boundaries at all times (OR we profile all nodes at the same time, but this is brittle).
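
A sketch of what that propagation could look like with gRPC and profiler labels (the metadata key and helpers are assumptions, not an existing implementation):

```go
package labelpropsketch

import (
	"context"
	"runtime/pprof"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

const sessionKey = "crdb-session" // illustrative metadata key

// Client side: a unary interceptor that copies the gateway's profiler label
// into outgoing gRPC metadata.
func propagateSession(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	if sess, ok := pprof.Label(ctx, sessionKey); ok {
		ctx = metadata.AppendToOutgoingContext(ctx, sessionKey, sess)
	}
	return invoker(ctx, method, req, reply, cc, opts...)
}

// Server side: re-apply the label so remote work is attributed to the
// originating session in that node's CPU profile.
func withSessionLabel(ctx context.Context, handler func(context.Context)) {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if vals := md.Get(sessionKey); len(vals) > 0 {
			pprof.Do(ctx, pprof.Labels(sessionKey, vals[0]), handler)
			return
		}
	}
	handler(ctx)
}
```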


docs/RFCS/20210215_per_session_cpu_usage.md, line 678 at r2 (raw file):

- The profiling causes frequent interruptions of the goroutines, which
  impairs performance. It also trashes CPU caches and generally slows
  down processing for memory-heavy workloads.

Please give nuance to this argument. It is fair to be concerned about the overhead of the approach, but it is unfair to dismiss it off-hand as "too ineffective". The sampling frequency is configurable and a certain loss of fidelity might be acceptable.

Please also mention an upside of this approach: it is the mainstream Go way. Profiler labels will eventually be supported by the heap profiles as well. There is no need to fork, and tools will become better without requiring resources on our end.


docs/RFCS/20210215_per_session_cpu_usage.md, line 680 at r2 (raw file):

  down processing for memory-heavy workloads.

- It is only able to *observe* CPU usage. It cannot stop tasks from

It could be the basis of a system that can. The alternative you are suggesting can also not stop tasks from consuming more CPU. I don't think you're suggesting that the runtime changes will block allocations or "fail" goroutines (or are you?)


docs/RFCS/20210215_per_session_cpu_usage.md, line 687 at r2 (raw file):

  frequency, which impairs performance more.

- The CPU usage output is formatted for consumption by human eyes. It

This is not correct. Profiles are protocol buffers and you can write any analysis on top of them that you would like. Most of it can actually be cribbed from pprof itself, if you can't outright use it as a library.
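
For example, summing CPU time per label from a profile is a short program on top of the pprof library (the "session" label key is an assumption):

```go
package main

import (
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("cpu.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		panic(err)
	}

	// In a CPU profile the sample values are [samples/count, cpu/nanoseconds].
	bySession := map[string]int64{}
	for _, s := range p.Sample {
		for _, sess := range s.Label["session"] {
			bySession[sess] += s.Value[1]
		}
	}
	for sess, ns := range bySession {
		fmt.Printf("%s: %dms of CPU\n", sess, ns/1_000_000)
	}
}
```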

@cucaroach (Contributor)

This has probably been discussed, but what about forking a separate OS process for each session? Not necessarily a real fork (can't imagine the Go runtime could be made to handle that) but a subprocess that handles just SQL execution. Like a detached distsql processor that is just handed a full query execution spec and routes kv requests to its parent/gateway process and/or makes direct distsql connections to other nodes. Then managing the memory/CPU usage of the SQL session would get really simple, and we'd get rid of the complexity of thinking about in-process/same-goroutine kvclient/kvserver situations (if that even happens).

It could be an opt-in thing too, where only heavy aggregation queries got booted to their own subprocess. One could even imagine implementing these little SQL workers in multiple languages (rust, js, .net, java) for extra efficiency, or to open the door for UDFs (maybe even stored procedures) in the developer's language of choice (sorry, runaway brain!). This is essentially what browsers do with plugins; security was the main driver there, but the security implications might appeal to some customers/use cases. We could also conceivably take advantage of OS features like nice, cgroups et al.

Not a small lift, but I imagine a Go-based prototype could be slapped together fairly quickly. One could also imagine something like this enabling interesting CC free tier implementation options where each tenant gets its own process or something.

@knz (Contributor, Author) commented Apr 15, 2021

That's an intriguing suggestion. Would be curious to hear thoughts from @RaduBerinde and @jordanlewis about that idea.

@nvanbenschoten (Member)

+1, it is an intriguing suggestion.

One could also imagine something like this enabling interesting CC free tier implementation options where each tenant gets its own process or something.

In CC free tier, each tenant does get its own process. In fact, each tenant SQL process is in its own k8s pod, with various resource limits applied to it.

@petermattis (Collaborator)

The process-per-session approach would simplify the resource tracking problem, but then we'd have the problem of deciding what memory limits to place on each of these processes. There are also various other practical problems. Would the sub-process talk to the parent process for caches, or would we replicate the caches? I'm thinking of caches like the replica cache used by distsender to locate where replicas are in the cluster.

Those concerns aside, it is certainly worth thinking about this more.

@nvanbenschoten (Member)

There are also various other practical problems.

Another being the increased CPU and memory cost per session which necessitates aggressive connection pooling in systems like Postgres.

@jordanlewis (Member)

Agreed this is an interesting idea. It could be challenging to get the IPC right. I'd imagine the child process would send all of its requests back to the parent process to avoid any challenges with cache replication. I'd hope the child processes would be stateless.

@petermattis (Collaborator)

Agreed this is an interesting idea. It could be challenging to get the IPC right. I'd imagine the child process would send all of its requests back to the parent process to avoid any challenges with cache replication. I'd hope the child processes would be stateless.

Sending all RPCs through the parent process would be quite costly. Reading @cucaroach's original proposal again, I see he was imagining doing this at the DistSQL processor level. I think that could eliminate some of the cache concerns I have.

@jordanlewis (Member)

But the TableReaders need to figure out where to send their KV requests. That's the part I don't quite see how we'd accomplish without something trickier.

@cucaroach (Contributor)

we'd have the problem of deciding what memory limits to place on each of these processes

Sure, but the OOM story gets a lot better, since the whole node doesn't go down when the inevitable OOMs occur. It's been a while since I've looked into how the oomkiller works, but we'd need to be able to influence the scoring to make the SQL workers sacrificial and make the parent process less so. But you can't beat the simplicity of calling getrusage to access resource usage.

I'm not sure which caches are important; we can probably get away with IPC back to the parent for most things. I wonder if anyone has taken gRPC to the limit with a shared-memory implementation; cutting out the serialization/deserialization might be necessary to make it practical.

Unfortunately I don't really have enough of the architecture in my brain yet to think clearly about how feasible this is, but it seems like a lot of pieces are already in place with distsql and the kvclient/kvserver split.
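
For reference, the getrusage call mentioned above is directly reachable from Go's syscall package; a minimal sketch (Linux assumed, where Maxrss is reported in kilobytes):

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		panic(err)
	}
	// User and system CPU time consumed by this process, plus peak RSS.
	fmt.Printf("user=%v sys=%v maxrss=%dKB\n",
		time.Duration(ru.Utime.Nano()),
		time.Duration(ru.Stime.Nano()),
		ru.Maxrss)
}
```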

@sumeerbhola (Collaborator)

but then we'd have the problem of deciding what memory limits to place on each of these processes.

Sure, the OOM story gets a lot better though since the whole node doesn't go down when inevitable OOMs occur.

Will there be hard memory limits for each process? I am trying to understand if this will result in fragmentation, in that ample unused memory in one process will be unavailable for use in another process.

@cucaroach (Contributor)

Will there be hard memory limits for each process? I am trying to understand if this will result in fragmentation, in that ample unused memory in one process will be unavailable for use in another process.

Isn't that what happens today with free tier CC (pods have hard limits that are a fixed fraction of available space)? It's tempting to avoid hard limits and let the workers rip, on the theory that a) oomkiller scoring can be made to preserve the parent process and b) spill-to-disk limits keep most workflows in line. RAM is a terrible thing to waste.

@rafiss (Collaborator) commented Apr 16, 2021

Another being the increased CPU and memory cost per session which necessitates aggressive connection pooling in systems like Postgres.

Here's another experiment on that topic. IMO if we go this route, this sort of connection pooling should be built into CockroachDB. In particular, pgbouncer only helps with this problem if it's in "transaction pooling" mode, but that breaks prepared statements / the extended protocol since pgbouncer is not smart enough to keep track of that state. Beyond issues like that, I think it's too much overhead to set up and configure a separate service.

@knz (Contributor, Author) commented Apr 17, 2021

OK, so to synthesize the discussion so far:

  • Tommy proposes to explore the idea of separate processes.

  • We already do this for multi-tenant, with one process per SQL tenant.
    One known drawback is that a process crash kills all that tenant's connections to the server.

  • We can further refine by creating more processes. There are two approaches:

    • Create one sub-process per session. That is more or less what pg does.

      One already-identified drawback is that we must be very aggressive with connection pooling as Rafi pointed out, to avoid keeping/starting processes for idle sessions.

    • Create sub-processes for the distSQL execution. That would be a new idea.

      One identified drawback is that we would need cross-process IPCs to flow data back and forth, as well as table descriptors and other metadata used for exec processors. This may have a performance overhead.

      Another identified drawback is that we have many queries that operate on the gateway node directly. These can also cause OOM problems, and would not be protected by distsql-specific forks.

Meanwhile, regardless of the approach considered, we have a discussion about memory budgets. When using multiple processes, do we set limits and how?

  • fixed limits will cause fragmentation.
  • fixed limits raise the question: what is a good limit? We do not have a model for that.
  • we could also operate without limits. In that case we rely on the system OOM killer to only select "offending" processes, and we will need to be careful to "protect" the gateway/master processes.

Finally, something not discussed so far but must be mentioned for any process-based solutions: we still also need to think about CPU accounting and budgeting:

  • separate processes give us CPU accounting "for free" thanks to OS metrics.
  • CPU budgeting would then come in one of three variants:
    • no specific budgets: the OS scheduler is responsible for fairness. This might be OK but we'd need a lot of benchmarking to know for sure.
    • fixed CPU limits: like above, we'd need to find out what is a good limit.
    • relative CPU limits: we can group processes on Linux, and assign scheduling priorities to groups. Within a group processes are scheduled fairly. This would enable us to give priority to certain SQL sessions, e.g. admin or internal sessions.

Did I miss anything?

I can mention this alternative in the RFC, although at this point it's becoming pretty clear that we need two separate RFCs: one for CPU accounting/budgets, and one for memory accounting/budget.
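
On the relative-CPU-limits bullet above: with cgroup v2 this amounts to writing a weight and a pid into the cgroup filesystem, e.g. (paths and values are illustrative, and the cpu controller must already be enabled for the parent cgroup):

```go
package cgroupsketch

import (
	"fmt"
	"os"
)

// demoteSQLWorker puts a worker process into a cgroup with a low relative
// CPU weight (the cgroup v2 default weight is 100). Sketch only.
func demoteSQLWorker(pid int) error {
	const cg = "/sys/fs/cgroup/crdb-sql-workers"
	if err := os.MkdirAll(cg, 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(cg+"/cpu.weight", []byte("25"), 0o644); err != nil {
		return err
	}
	return os.WriteFile(cg+"/cgroup.procs", []byte(fmt.Sprint(pid)), 0o644)
}
```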

@cucaroach (Contributor)

In addition to connection pooling, another technique we could leverage would be to not do strictly one process per session, i.e. to allow multiple client connections per process. For instance, would it make sense to allow one process per user (role?) or per database (ignoring the fact that multiple databases can be queried in one session)? In addition to the efficiency gains of having multiple connections served by one process, grouping things this way seems like it could match how operators might want to control/monitor resource usage.

@sumeerbhola (Collaborator)

We can further refine by creating more processes. There are two approaches:

There are other variants on this idea, like the one mentioned 2 weeks ago on an internal thread in the context of admission control https://cockroachlabs.slack.com/archives/C01SRKWGHG8/p1617571577065900?thread_ts=1617224758.005400&cid=C01SRKWGHG8
"I wonder how much in-memory state is shared above (which includes distsql and parts of kv) and below that layer. If it is limited/none, we could consider slicing a cockroachdb node in regular cluster deployments into 2 processes along that boundary. These processes would be in different cgroups with static memory limits, but with cpu shares weighted heavily towards the lower layer."

no specific budgets: the OS scheduler is responsible for fairness. This might be OK but we'd need a lot of benchmarking to know for sure.

Different cpu shares in the scheduler tend to work well enough from what I've observed in the past, and have the advantage of being work-conserving compared to a fixed limit.

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@knz knz force-pushed the 20210215-task-group-rfc branch from bc5bde2 to 55be05b on May 17, 2021 15:50
@knz (Contributor, Author) commented May 17, 2021

I was able to add precise tracking for "large allocations" (above 32KB); see the new explanatory section here.

If we look at cockroachdb with this patch in place, we see the large_bytes column in crdb_internal.node_sessions periodically increase and decrease, in proportion with the activities of the SQL session (as expected).

@tbg tbg removed request for tbg and asubiotto June 21, 2021 11:30
@knz knz force-pushed the 20210215-task-group-rfc branch from 55be05b to 943ee5a on June 28, 2021 15:32
@knz (Contributor, Author) commented Jun 28, 2021

Following up on the above:

  • I have split up the discussion between CPU and RAM resource tracking into two separate texts, because the trade-offs, analysis and study of related work are completely different
  • however, both RFCs refer to the same Go runtime extensions as a candidate approach, and so I extracted the common sub-part of the RFCs into a 3rd mini-RFC that describes just the Go runtime extensions.

@knz (Contributor, Author) commented Jun 28, 2021

(the links to the 3 docs are now available in the PR description at the top)

@data-matt

Hi @knz, this PR is of particular interest to me, I believe that resource controls are still important.

I see the PR is open; is this going to move forward in the future?

@knz (Contributor, Author) commented Nov 9, 2021

we want to, yes

@irfansharif (Contributor) commented Jun 3, 2022

Connecting some dots: see #82356 which proposes similar mechanisms (an instrumented runtime) to get fine-grained CPU attribution, a subset of what this RFC helped define. We're hoping to do something here for 22.2.

@nvanbenschoten nvanbenschoten removed their request for review September 27, 2022 19:39