
Add CPU manager proposal. #654

Merged Aug 1, 2017 · 15 commits

Conversation

ConnorDoyle
Contributor

Notes for reviewers

First proposal submitted to the community repo, please advise if something's not right with the format or procedure, etc.

cc @flyingcougar @sjenning @derekwaynecarr

@k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) label May 23, 2017
@ConnorDoyle
Contributor Author

Attn @kubernetes/sig-node-proposals

@k8s-ci-robot
Contributor

@ConnorDoyle: These labels do not exist in this repository: sig/node.

In response to this:

Attn @kubernetes/sig-node-proposals

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@derekwaynecarr
Member

fyi @jeremyeder

Guaranteed pod with 1 or more cores of cpu, the system will try to make
sure that the pod gets its cpu quota primarily from reserved core(s),
resulting in fewer context switches and higher cache affinity".
1. Support the case where in a given pod, one container is latency-critical
Member

i assume this is the case where a pod has a container that uses integral cores and another container that uses fractional?

Contributor Author

Yes, that's how the static policy as written satisfies this requirement. Does that need clarification here?

Member

its probably ok, i have too much history in the discussion that as phrased here, it signals something deeper.

@flyingcougar

flyingcougar commented May 23, 2017

CRI update needed by CPU Manager - kubernetes/kubernetes#46105

@chino

chino commented May 23, 2017

Will it be possible to reserve a core for a pod but not actually assign a "container" to it? We actually dedicate cores to very specific application "threads". We currently do this by setting aside cores at boot time using isolcpus, disabling interrupts on them, and then using taskset from within the container at runtime. This way a single thread is never context switched.

Not disturbing the reactor thread is critical for our low latency / low jitter workloads.

Instead of using taskset, might it make more sense to communicate with the manager at runtime from within the container?
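The runtime pinning step described above can also be done from inside the process itself, without shelling out to taskset. A minimal sketch, assuming a Linux host and Python's stdlib `os.sched_setaffinity` (illustrative only, not part of the proposal):

```python
import os

def pin_current_process(cpu):
    """Pin the calling thread to one CPU (threads created afterwards
    inherit the mask), similar in effect to `taskset -pc <cpu>`.
    Linux-only."""
    os.sched_setaffinity(0, {cpu})      # pid 0 means "the caller"
    return os.sched_getaffinity(0)      # read back the effective mask

if __name__ == "__main__":
    # Pick a CPU we are already allowed to run on and pin to it.
    target = min(os.sched_getaffinity(0))
    print(pin_current_process(target))
```

A wrapper like this could pin a specific reactor thread once its thread-id is known at startup.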

@flyingcougar

@chino Making the CPU Manager aware of isolcpus and using only those CPUs as exclusive makes sense as part of the static policy or a new policy.
In any case, IMHO removing the need for the user to manually manage CPUs and prefix the container entrypoint (or pod manifest) with taskset is an advantage.
Tracking which cores on which nodes are in isolcpus, and which are unused at a particular point in time, can be troublesome in a cluster of, say, more than 10 nodes.
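For illustration, discovering which cores a node has in isolcpus boils down to parsing the kernel command line. A hedged sketch (the helper name and the handling of newer-kernel flags like `nohz`/`domain` are assumptions, not CPU Manager code):

```python
def parse_isolcpus(cmdline):
    """Parse the CPU set from an `isolcpus=` kernel argument.

    Accepts comma-separated entries and ranges, e.g. "isolcpus=2-5,8".
    Non-numeric flag tokens in the value are skipped. Returns an empty
    set when isolcpus is absent.
    """
    cpus = set()
    for token in cmdline.split():
        if not token.startswith("isolcpus="):
            continue
        for part in token[len("isolcpus="):].split(","):
            if part and part[0].isdigit():
                if "-" in part:
                    lo, hi = part.split("-")
                    cpus.update(range(int(lo), int(hi) + 1))
                else:
                    cpus.add(int(part))
    return cpus

# On a real node one would feed in open("/proc/cmdline").read():
print(sorted(parse_isolcpus("root=/dev/sda1 isolcpus=nohz,domain,2-5,8 quiet")))
# → [2, 3, 4, 5, 8]
```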

@ConnorDoyle
Contributor Author

@chino there has been some debate about similar use cases so far. This proposal represents a simplified compromise between the most sophisticated scenarios and a level of support that will benefit many users. We at Intel have worked with users with exactly this requirement, and the strategy you described is pretty common among the low-latency crowd.

Process scheduler load balancing is off for those CPUs, so applications need awareness to call sched_setaffinity() (or wrap with taskset/numactl) in order to use more than one. You could argue that the additional complexity breaks fungibility, so one way to look at isolcpus could be as a separate resource. In that case, one of the policies could be expanded to discover, advertise and make node-level allocation decisions for an isolated-cpu resource. If the use case isn't a natural fit with the upstream CPU Manager policies, then it may motivate a plugin interface so more specialized policies can be added by users. Before we retreat to that alternative, we should try to see how it could be satisfied in this proposal.

@chino

chino commented May 23, 2017

Yes, I would like to replace the isolcpus method by working with the manager to achieve the same result. I was only describing what we currently do as an example.

The main point is that we want to lock a specific app "thread" to a core and not have anything else but that thread on the core.

We don't know what the thread-id is until the process starts so we have a wrapper script that handles that in the container.

@ConnorDoyle
Contributor Author

@chino by coincidence, a similar point just came up in the resource management workgroup weekly while discussing constraining cache partitions for garbage collector threads in a JVM.

How about exploring the other direction of communication first? That is, Kubelet/CPU Manager could communicate down to the container the intersection of its CPU mask and isolcpus? That way the subprocess or thread could continue to set its own affinity without needing to call back into the CRI or Kubelet. The container would gain the advantage of allowing the CPU Manager to track the assigned cpusets, therefore guaranteeing that it won't accidentally collide with another user of an isolated CPU.

My gut reaction is that implementing a bidirectional API for Kubelet/CRI to collaborate with the user application for these settings would get pretty complicated, but maybe we just haven't seen the right abstractions for it yet.

@derekwaynecarr
Member

@ConnorDoyle @chino -- the cpu assigned by the kubelet to a container needs to be discoverable via downward API (should be in this spec, if not there yet).

@chino

chino commented May 23, 2017

@ConnorDoyle

Yes, exactly. I figured the simplest solution which wouldn't introduce too much complexity would be to reserve a core BUT not assign it. Meaning the Manager will pull a core out from scheduling (hopefully disabling interrupts can become an option too), and then our container will use taskset itself to pin the specific thread-id, which is only known at runtime.

We generally have 1 core on each socket set to handle interrupts and OS/background threads. Sometimes we need to leave a few extra free to handle the interrupts on them fast enough for disk and whatnot. Obviously, in a true multi-tenant cluster it would be nice to leave cores for the OS/k8s, and then also group cores into "shared pools" between certain pods and, as noted, "exclusive cores" for specific containers/threads.

It could be interesting if there were some way to specify how to find the right thread to pin, but it seems complicated; in our case that could be Java (hence the need to use jstack), or it could be a native application where you need another method to identify the right thread.

Another issue we currently have is we'd like to share ipc namespace and mmap files across sets of pods in the same tenant. Right now the only way to do that is to share it to the entire host which isn't very multi-tenant friendly/secure.

@chino

chino commented May 23, 2017

@ConnorDoyle I like that idea, as long as we can easily specify shared vs dedicated. We previously used a very simple set of cgroups pre-Docker, but container runtimes create a whole set of different cgroup hierarchies, and sometimes the container might not have write access. If something like cgroup namespaces or something formal can make this process easy to manage inside the private region, then that's great. Again, to be clear, I think multiple pods will want to share the same cores at times for background threads, because there aren't enough cores to go around, and controlling interrupt settings will be important.

for “fast” virtual network functions (want to approach line rate on
modern server NICs.)

_Solution requirements:_
Member

add a requirement that a container must be able to discover the set of cpus it has been assigned.

Contributor

I'm not sure I agree that this should be a requirement. I see the value where people want to do thread affinity inside the pod, but it comes with baggage.

It pins us in when we go to add the dynamic cpuset manager later.

It also creates a problem in the kubelet restart case. If our contract is "I'll put you on an exclusive core", we have the freedom to "reschedule" on kubelet restart rather than writing a nasty cgroup parser to reconstitute the cpuset manager state.


Reschedule as in possibly shift a container to a new core after it's already been running? That would be rather scary for many workloads/users if that happened.

The cpuset is just a bit flag which is easy to parse?

Could the assigned cores be saved to etcd or a local cache so it's recoverable perhaps?

Contributor Author

@chino the CPU mask would only be changed after process start if the dynamic CPU Manager policy is enabled. The static policy guarantees never to do that.

@derekwaynecarr and @sjenning: Inside the container, workloads that care about their CPU mask can easily get it by reading /proc/self/status (see below). Is that sufficient?

connor:~/ $ grep -i cpus /proc/self/status
Cpus_allowed:   77
Cpus_allowed_list:      0-2,4-6
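For readers who want that mask as a CPU list programmatically, a small sketch that expands the hex `Cpus_allowed` value shown above (the helper is illustrative, not part of the proposal):

```python
def cpus_from_mask(hex_mask):
    """Expand a Cpus_allowed hex mask from /proc/self/status into CPU ids.

    Long masks are comma-grouped (e.g. "ff,ffffffff"); commas are ignored.
    """
    mask = int(hex_mask.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

print(cpus_from_mask("77"))  # → [0, 1, 2, 4, 5, 6]
```

The output matches the `Cpus_allowed_list: 0-2,4-6` line, which the kernel derives from the same bitmask.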

Member

For a dynamic policy, we should send a signal. For static, I agree it's not needed.

Contributor Author

Backfilling from off-list discussion: we talked today about providing a signal in the following way. We could project (a subset of) the CPU manager state into a volume visible to selected containers. User workloads could subscribe to update events in a normal Linux manner (e.g. inotify.) We also decided, more or less, that this won't be necessary until the dynamic policy is implemented. We can document that user containers in the shared pool must avoid setting their own CPU affinity, since it will be overwritten without notice any time the membership of the shared pool changes.
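As a rough illustration of the consumer side of the projected-state idea, here is a portable polling stand-in (the state-file path and format are hypothetical, and a real Linux consumer would subscribe via inotify as described above rather than poll):

```python
import os
import time

def poll_once(path, last_mtime):
    """Return (changed, mtime) for the projected state file at `path`.

    `changed` is True only when we held a previous timestamp and the
    file has been updated since. The file itself is hypothetical.
    """
    try:
        mtime = os.stat(path).st_mtime_ns
    except FileNotFoundError:
        mtime = None
    return (last_mtime is not None and mtime != last_mtime, mtime)

def watch(path, on_change, interval=1.0):
    """Poll the state file forever, firing on_change with new contents."""
    last = None
    while True:
        changed, last = poll_once(path, last)
        if changed and last is not None:
            with open(path) as f:
                on_change(f.read())
        time.sleep(interval)
```

A workload's wrapper script could invoke `watch` in a sidecar thread and re-run its affinity setup whenever the shared pool membership changes.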

Member

catching up on doc, so this may be covered elsewhere, but we should say something about signalling in a future iteration of this proposal.

Contributor Author

Will add it to the dynamic policy section.

Contributor Author

added

@chino

chino commented May 23, 2017

@derekparker yes, that would be great

The main thing that worries me is that we literally turn off interrupts and even high-performance tick timers. We cannot have the top-level container pid touching the core at all. Hence I asked if we could dedicate a core but not assign anything to it, so that we can taskset ourselves and know that it's completely void of anything running on it.

When you work with low latency apps in the nano/micros time scale then any interrupt will cause measurable blips and jitter in the otherwise steady message rates. Especially in distributed applications this can cause race conditions and outliers.

@vishh
Contributor

vishh commented Jun 2, 2017

Apologies for this delay. I'll review this proposal next week.

@fabiand

fabiand commented Jun 7, 2017

I do not think that it's required in a first step, but: some applications need a way to specify how threads are assigned to cores, to optimize performance. qemu/libvirt is one such example. Thus it might be valuable if, at some point, a workload can express how it expects its threads to be pinned to cores.

| ------------------------------------------ | ------------------------------ |
| Pod [Guaranteed]:<br />&emsp;A:<br />&emsp;&emsp;cpu: 0.5 | Container **A** is assigned to the shared cpuset. |
| Pod [Guaranteed]:<br />&emsp;A:<br />&emsp;&emsp;cpu: 2.0 | Container **A** is assigned two sibling threads on the same physical core (HT) or two physical cores on the same socket (no HT.)<br /><br /> The shared cpuset is shrunk to make room for the exclusively allocated CPUs. |
| Pod [Guaranteed]:<br />&emsp;A:<br />&emsp;&emsp;cpu: 1.0<br />&emsp;A:<br />&emsp;&emsp;cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. |

nit: here and next line Pod column mentions 2 A containers. easy to spot on rendered page, but not here

Contributor Author

Thank you, will fix.

CPUs topologically. For example, allocating two CPUs on a system that has
hyper-threading turned on yields both sibling threads on the same
physical core. Likewise, allocating two CPUs on a non-hyper-threaded
system yields two cores on the same socket.
@teferi Jun 7, 2017

What should happen if there are enough CPUs on a node, but they're far away from each other (I'm requesting 2 CPUs, but the ones left are on different cores). For some workloads admitting such Pod may be acceptable, for some not. Would it be a good idea to make this behaviour configurable?

Contributor Author

For now, our plan is to have the first policy (static) provide the "best" allocation when assigning dedicated cores. So, at higher Guaranteed-pod density, it would be possible to observe the topology fragmentation you described.

If it were configurable in the policy, do you have ideas about how you would like that to look (name for the knob, values, defaults etc.)?


The more I think about it, the more I realize how complicated things actually are =/
Generally my perception is that by default the CPUs of a single container inside a Pod should be as close to each other as possible, while the CPUs of different containers inside a Pod should be as far from each other as possible. The only case I can imagine where I would want to pack as many dedicated containers on a single socket as possible is when I want to launch multiple small (1-CPU) containers and then launch a single giant one that asks for half the CPUs of a core.

For a container, if we want to allow control over how close the CPUs are, I imagine it might be a double parameter in the pod spec. Something like:

corePolicy:
  placement: dense|separate
  strict: true|false   

with dense and false as defaults. separate would mean attempting to spread CPUs as much as possible, and dense would mean packing them as close as possible. strict: false would mean the workload can tolerate the requirements not being met, while strict: true would mean the container has to be rescheduled.

And then there is a whole new level with HT enabled. I can imagine something like:

coreThreadPolicy:
  placement: dense|separate|separate_restrict
  strict: true|false

with dense and false as defaults. dense here would mean allowing a container to be given multiple sibling CPUs. separate would mean attempting to spread CPUs across non-siblings. separate_restrict would mean spreading CPUs across non-siblings and reserving the siblings, so that they are not assigned to anyone.
I'm afraid that I might be over-thinking the whole problem =)

As a small disclaimer, I'd like to mention that I come from OpenStack, so my perception might be tainted by how OpenStack does it =) Here is a link to a doc that describes how OpenStack approaches similar problems.
(cc @ashish-billore)
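To make the dense/separate intuition above concrete, here is a toy allocator over an assumed `{socket: free CPU ids}` map. The function name and semantics are illustrative only (mirroring the sketched corePolicy values, which are not part of the proposal):

```python
def pick_cpus(topology, want, placement="dense"):
    """Choose `want` CPU ids from {socket: [free cpu ids]}.

    dense:    fill one socket before spilling onto the next.
    separate: round-robin across sockets to spread CPUs out.
    Raises ValueError when not enough free CPUs exist (the "strict" case).
    """
    pools = [list(cpus) for cpus in topology.values()]
    picked = []
    if placement == "dense":
        for pool in pools:
            while pool and len(picked) < want:
                picked.append(pool.pop(0))
    else:  # separate
        while len(picked) < want and any(pools):
            for pool in pools:
                if pool and len(picked) < want:
                    picked.append(pool.pop(0))
    if len(picked) < want:
        raise ValueError("not enough free CPUs")
    return picked

topo = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
print(pick_cpus(topo, 3, "dense"))     # → [0, 1, 2]
print(pick_cpus(topo, 3, "separate"))  # → [0, 4, 1]
```

A strict: false variant would return the partial pick instead of raising.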


@teferi I like the idea of corePolicy.
As for coreThreadPolicy - isn't it a little bit too much ? i.e. not all environments are homogeneous, some nodes may not have HT enabled (probably not much). Besides separate_restrict can be accomplished by requesting i.e. 2 cores. If node has HT and

corePolicy:
  placement: dense
  strict: true

then it would result in 2 CPUs on 1 physical core.


@flyingcougar separate_restrict is more about not allowing others to use core's sibling. It probably would be very difficult to implement, since k8s would still think that we have those cores available for scheduling. So anyway it was probably a weird idea.

corePolicy is about packing the CPUs as tightly as possible on cores in the same socket, while coreThreadPolicy is about packing CPUs as tightly as possible within one core. So imo both of those have their uses. I think I need to draw a couple of diagrams to illustrate these ideas...


cc @fabiand I guess you should participate in ^^ part of the discussion


The more I think about it — the more I realize how complicated things actually are =/

Yes :)

I'm not an expert on this domain, but …
It's pretty clear that we - kubevirt - need the 'strict dense placement' option as a first step.
I actually wonder if additional differentiations are needed: Either you care, and you require an optimal placement, or you don't and get best effort.

Some workloads, like kubevirt (again), might want to manage parts of NUMA nodes themselves, in a more fine-grained fashion. In those cases it might make sense to be able to tell the kubelet that some nodes, sockets, or cores should be reserved for, and exclusive to, a specific container. A process can then take those reserved parts and manage them as needed (i.e. in our case, libvirt would take care of managing those nodes).
Long story short, we might want a mechanism to lend nodes to a process.

Another thing is that in future we will need to combine the placement with ResourceClasses #782 (and thus should already be thinking about it), because in reality you want a process close to the device it uses, to avoid remote NUMA-node memory access.
Closely related to this is the relationship to huge pages.

I do not think that we need to solve all of this now, but we should understand the picture, to not create something which will already contradict with what we see coming up.

@berrange @mpolednik thoughts?

Contributor Author

@ConnorDoyle Jul 11, 2017

@fabiand there's exactly that topic on the resource management workgroup agenda for this morning. Besides the device binding logic and the hugetlb controller settings, CNI plugins could also take topology into account on a multi-socket system with multiple network interfaces. I hope we can achieve agreement that a solution is needed to unify affinity somehow in the v1.9 or v1.10 releases.

On the topic of the CPU manager itself, policies that operators want can get really complicated. Hopefully we can get a good default scoped and implemented first and then extend with more advanced policies. In the resource management face-to-face where this component was conceived, we had a pretty long discussion about implementing a more flexible policy driven by explicit pool configuration. Maybe we could think of ways to represent the behaviors needed for the advanced cases in the config domain. In my mind, a rich configurable policy may buy enough flexibility so that we don't need another plugin system for external policies. At the same time, for v1.8 at least, changes to the pod spec API are probably out of scope.

@fabiand

fabiand commented Jul 5, 2017

Let me just raise our use-case: when spawning VMs, it is important to map the virtual topology to the real one, to achieve optimal performance.
I wonder if this proposal should cover a mechanism which allows a container to tell the manager how to place processes/threads on the real hardware.

/cc @berrange


1. _A container that was assigned exclusive cores terminates._
1. Kuberuntime unregisters the container with the CPU manager.
1. CPU manager unregisters the contaner with the static policy.
Member

nit: container

1. Asynchronously, the CPU manager's reconcile loop updates the
cpuset for all containers running in the shared pool.

1. _The shared pool becomes empty._
Member

we previously discussed a node condition for cpu pressure. is the taint the replacement?

Member

for pre-emption, will the taint be understood?

Contributor Author

Apologies, lost track of the progress to convert node conditions to taints. Will update.

Contributor Author

updated


#### Policy 3: "dynamic" cpuset control

_TODO: Describe the policy._
Member

summarize what was discussed at f2f, and just state that this is planned for prototyping post kube 1.8.

* Checkpointing assignments
* The CPU Manager must be able to pick up where it left off in case the
Kubelet restarts for any reason.
* Read effective CPU assinments at runtime for alerting. This could be
Member

assignments

CRI. Runc/libcontainer allows container cgroup settings to be updated
after creation, but neither the Kubelet docker shim nor the CRI
implement a similar interface.
1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)
Member

this is needed independent of cpu policy.

Contributor Author

Any update required in this text? (what are the other things that need it, VPA?)


## Implementation roadmap

### Phase 1: No-op policy
Member

make clear kube 1.8 targets phase 1 and 2.

@derekwaynecarr
Member

@ConnorDoyle @sjenning -- we need to capture what happens when all cpus are used, and make clear that we need to evict a BE pod if already resident on the node.

@ConnorDoyle
Contributor Author

@derekwaynecarr updated

@ConnorDoyle
Contributor Author

@derekwaynecarr and @vishh (approvers per the rm working group planning spreadsheet),

Aside from the node condition details that @sjenning is currently working to add, are there any other blockers to accept this proposal?

log forwarding, metrics collection and the like.)
1. Do not cap CPU quota for guaranteed containers that are granted
exclusive cores, since that would be antithetical to (1) above.
1. Take physical processor topology into account in the CPU affinity policy.
Contributor

nit: incorrect numbering.

Contributor

This is github markdown. When viewed as markdown, not source, the numbering is correct.

document][node-allocatable] for details. The CPU manager will claim
`ceiling(node.status.allocatable.cpu)` as the number of CPUs available to
assign to pods, starting from the highest-numbered physical core and
descending topologically. It is recommended to configure an integer value for
Contributor

@ConnorDoyle Can you please give an example here illustrating with sample topology, kube-reserved, system-reserved and node.status.allocatable.cpu?
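As a sketch of the kind of worked example being requested (all numbers are assumptions, and the arithmetic simply follows the node-allocatable design referenced in the quoted text, not values from the proposal):

```python
import math

def exclusive_claim(capacity_cpus, kube_reserved, system_reserved):
    """CPUs the CPU manager may hand out, per the quoted text:
    ceiling(node.status.allocatable.cpu), where allocatable is
    capacity minus the reserved slices. Illustrative arithmetic only.
    """
    allocatable = capacity_cpus - kube_reserved - system_reserved
    return math.ceil(allocatable)

# Assumed example: a 16-CPU node with kube-reserved=500m and
# system-reserved=500m. Allocatable is 15.0 CPUs, so the manager
# claims 15 CPUs, starting from the highest-numbered physical core,
# leaving the low-numbered cores' time for the reserved slices.
print(exclusive_claim(16, 0.5, 0.5))  # → 15
```

This also illustrates why integer reserved values are recommended: with fractional reserves summing to a non-integer, the ceiling would claim part of a CPU that the reserved slices also depend on.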

Operator documentation will be updated to explain how to configure the
system to use the low-numbered physical cores for kube-reserved and
system-reserved slices.

Contributor

@ConnorDoyle Will it make sense to add minimal details in this document as well for example "user-story"?

that prevents BestEffort and Burstable QoS class pods from
running on the node.

1. _The shared pool becomes nonempty._
Contributor

nit: numbering.

Contributor

same as before re: github markdown

1. _The shared pool becomes empty._
1. The CPU manager adds a taint with effect NoSchedule, NoExecute
that prevents BestEffort and Burstable QoS class pods from
running on the node.
Contributor

Is reading the state and updating the container resources via CRI synchronous with "start the container"? If yes, how does kuberuntime get to know when to invoke "start"?

@derekwaynecarr
Member

@ConnorDoyle -- i will take one final pass by eod tomorrow. right now, this generally lgtm.

@derekwaynecarr
Member

@ConnorDoyle @sjenning -- please make sure to keep updated with implementation progress, but I think this proposal is fine to proceed implementing. Thanks!

/lgtm
/approve

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Aug 1, 2017
@sjenning
Contributor

sjenning commented Aug 1, 2017

@derekwaynecarr looks like merging is manual in this repo. Should I just git the Big Green Button?

@resouer
Contributor

resouer commented Aug 1, 2017

@sjenning Yes, feel free to manually merge this :)

@cliffburdick

@sjenning @derekwaynecarr @vishh I believe your PR covers this, but the EPYC processor from AMD has some new (somewhat odd) differences from the Xeons when it comes to core affinity and NUMA. You can see in the diagram here that multiple latency paths are possible when you cross NUMA boundaries, depending on which core originated the request:

http://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/2

@sjenning
Contributor

@cliffburdick it is true that LLC latency is not uniform among cores on the same socket for the new AMD CCX-based processors. If a CCX must access a remote CCX's LLC, there is additional latency. It is basically a topology level between cores and sockets.

Note this is only an issue if you have two threads sharing data. If the working sets are independent, this is not an issue.

Right now we can't make the distinction, as cadvisor does not make it. But this is a last-5% optimization in my mind. The existing cpuset allocator should work well, even without explicit knowledge of the CCXs.

@jeremyeder
Contributor

jeremyeder commented Aug 31, 2017

@cliffburdick there's some historical context here, too. We agreed to defer a bunch of what this feature was originally spec'd out to be to a later release. That includes some of the more advanced topology detection and affinity (incl. PCI locality). It is a roadmap item for the working group. From the RH perspective, and I would say others' too, it is a very important item to tackle to make sure this cpumanager delivers on the intended use-cases.

The compromise we made in order to even get this far was that the complexity of topology discovery and pinning would be hidden behind QoS tiers. So users request a QoS tier and a core count in whole integers (rather than millicores) to get the performance they want.

I would love it if you have more concrete requirements that you write them up, send them to our mailing list and hopefully participate directly in the next resource management working group meeting.

Please review this: kubernetes/kubernetes#49964

You can find all of the mailing list, meeting and agenda details here:
https://github.com/kubernetes/community/tree/master/wg-resource-management#contact-us

@cliffburdick

cliffburdick commented Aug 31, 2017

Hi @jeremyeder, I'd be glad to list a couple here, and some of it is based on your work. We're running DPDK in Docker and have IO cards that should be affined to the same NUMA zone/CPU. For DPDK, isolcpus is the first obvious thing to turn on for all the packet-processing and worker threads; we do not want any Linux housekeeping or other applications floating there. I've seen some recent threads on other message boards related to using isolcpus with containers, and it appears that it either works, or there are workarounds to make it work.

Taking it a step further, Intel recommends disabling the RCU threads on any isolcpus cores. I don't believe this changes anything, but it's another thing enabled on the kernel boot line. Lastly, I wanted to experiment with the tickless kernel, since many applications would see latency benefits from a single thread per core. I don't know how well this will work in a container since I haven't tried it, but I believe if two threads try to run on the same core, the kernel will revert to the normal scheduler for that core. I think it should work based on looking at the processes started by Docker, though.

In my case, I almost never will be crossing the NUMA boundary. All of the work will come in a NIC attached to a single processor, and all of the work will be done on that processor or accelerator cards attached to that processor before going back out the same NIC.

Of course, DPDK has other implications as well, which I'm currently working through.

I will send to the mailing list as well.

@ConnorDoyle
Contributor Author

@cliffburdick this is a fantastic discussion, but the comment thread of a closed PR is the wrong place for it. I'm afraid all this good info will be lost. Could we please move to the ML?

@cliffburdick

@ConnorDoyle sure. Thanks.

@ConnorDoyle deleted the cpu-manager-proposal branch September 8, 2017 17:19
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023