
HugePages proposal #837

Merged (1 commit) Aug 16, 2017

Conversation
Conversation

@derekwaynecarr (Member) commented Jul 24, 2017

A pod may request a number of huge pages. The scheduler is able to place the pod on a node that can satisfy that request. The kubelet advertises an allocatable number of huge pages to support scheduling decisions. A pod may consume hugepages via hugetlbfs or shmget.

Planned as Alpha for Kubernetes 1.8 release.

See feature: kubernetes/enhancements#275

@kubernetes/sig-scheduling-feature-requests
@kubernetes/sig-scheduling-misc
@kubernetes/sig-node-proposals
@kubernetes/api-approvers
@kubernetes/api-reviewers

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 24, 2017
@derekwaynecarr derekwaynecarr mentioned this pull request Jul 24, 2017
[Allocatale] = [Node Capacity] -
[Kube-Reserved] -
[System-Reserved] -
[Pre-Allocated-HugePages * HugePageSize]
Member

I think there's a minus missing at the end of this line

@derekwaynecarr (Member Author)

I may choose to use explicit huge page sizes after more reflection. I want to experiment with a prototype today. The difference between 2MB and 1Gi pages is significant enough that I would want to quota them separately. The emptyDir topic would then become a separate volume type with a pageSize option.

@jeremyeder (Contributor) commented Jul 24, 2017

Few questions:

  • eviction-hard only looks at the non-hugepages allocatable remaining, right? IOW if there are plenty of free hugepages but the non-hugepages amount is getting low...eviction would be triggered despite free hugepages. Assume that's the case, but just checking.
  • The kubelet will set the node label automatically according to what it discovers?
  • Will any hugepage info show up in cadvisor?

@derekwaynecarr (Member Author)

@jeremyeder

  • eviction-hard - if there are plenty of free hugepages, but normal memory is low, eviction is triggered.
  • node label - yes, the kubelet would set it, but the more i think on this, the more i think we should just have the huge page size explicit in the resource name
  • cadvisor will report hugepages allocated per size; i will add a section for monitoring, i suspect we would want to see hugepage usage per container/pod cgroup.

@jeremyeder (Contributor)

  • node label - explicit name...makes sense since there are so few possibilities and even if the list grows it will still be manageable.
  • cadvisor -- in addition to per-container/cgroup, perhaps it can also report the hugepages allocatable/capacity for each node.

@sjenning (Contributor) left a comment

mostly nits, nothing to hold up the proposal over. lgtm.

virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software based
memory management. This results in performance issues. Since the hardware is
Contributor

nit: s/memory management/address translation/

Contributor

also nit: s/the hardware is fixed/the size of the TLB is fixed/

instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software based
memory management. This results in performance issues. Since the hardware is
fixed, the only way to alleviate the performance impact of a TLB miss is to
Contributor

s/alleviate the performance impact/reduce the chance/

A huge page is a memory page that is larger than 4KB. On x86_64 architectures,
there are two huge page sizes: 2MB and 1GB. Sizes vary on other architectures,
but the idea is the same. In order to use huge pages, application must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
Contributor

nit: s/attempt/attempts/

but the idea is the same. In order to use huge pages, application must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
the management of huge pages without application knowledge, but they have
limitations. In particular, they are limited to 2MB page sizes.
Contributor

Also, dynamic splitting of THPs into standard pages during certain conditions can result in undesirable application stalls.

A system may support multiple huge page sizes. It is assumed that most
nodes will be configured to primarily use the default huge page size as
returned via `grep Hugepagesize /proc/meminfo`. This defaults to 2MB on
most Linux systems unless overriden by `default_hugepagesz=1g` in kernel
Contributor

nit s/overriden/overridden/

most Linux systems unless overriden by `default_hugepagesz=1g` in kernel
boot parameters. This design does not limit the ability to use multiple
huge page sizes at the same time, but we expect this to not be the normal
setup. For each supported huge page size, the node will advertise a
Contributor

nit: not sure if we need this sentence (starting with "This design..") i found it confusing i.e. you can, but maybe you shouldn't because... reasons?

Member Author

I have been told that mixed use of 2MB and 1GB pages is possible on a node.

Contributor

Don't think I've seen it. Mostly because the use-cases for 1GB pages make co-locating other apps that use 2MB pages on the same hardware architecturally undesirable (from a deployment standpoint).

@sjenning (Contributor) Jul 28, 2017

For the record, yes mixing is possible for the three way to explicitly use huge pages.
https://gist.github.com/sjenning/b6bed5bf029c9fd6f078f76b37f0a73f#consuming-hugepages

hugetlbfs - pagesize mount option
mmap() - MAP_HUGE_2MB, MAP_HUGE_1GB
shmget() - SHM_HUGE_2MB, SHM_HUGE_1GB
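As a sketch of the flag encoding the mmap()/shmget() variants above share (this is an illustration, not proposed code): on Linux the explicit huge-page flags encode log2(page size) shifted left by MAP_HUGE_SHIFT (26), so each size variant derives mechanically from the page size.

```python
# Sketch: MAP_HUGE_* / SHM_HUGE_* flags listed above encode
# log2(page size) << MAP_HUGE_SHIFT (26 on Linux).
MAP_HUGE_SHIFT = 26

def huge_page_flag(page_size_bytes: int) -> int:
    """Return the mmap()/shmget() flag value for an explicit huge page size."""
    assert page_size_bytes & (page_size_bytes - 1) == 0, "must be a power of two"
    return (page_size_bytes.bit_length() - 1) << MAP_HUGE_SHIFT

MAP_HUGE_2MB = huge_page_flag(2 * 1024 * 1024)   # 21 << 26
MAP_HUGE_1GB = huge_page_flag(1024 ** 3)         # 30 << 26

print(hex(MAP_HUGE_2MB))  # 0x54000000
print(hex(MAP_HUGE_1GB))  # 0x78000000
```

These values match the kernel's definitions (MAP_HUGE_2MB = 21 << 26, MAP_HUGE_1GB = 30 << 26), which is why mixing sizes is possible: each mapping carries its own size in the flags.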

Member Author

removed to avoid confusion.

## Pod Specification

A pod must make a request to consume pre-allocated huge pages using the resource
`hugepages.<hugepagesize>` whose quantity is a non-negative number of pages. The
Contributor

s/non-negative/positive integer/

request and limit for `hugepages.<hugepagesize>` must match until such time the
node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
the `BestEffort` QoS class.
Contributor

not sure if we need this sentence. If we do need it, i'd go with

An application that consumes hugepages.<hugepagesize> is, by definition, not in the BestEffort QoS class as it specifies a request for one or more resources.

Contributor

+1 for rewording

Member Author

The idea at the time was that not all resources impact QoS (i.e. OIR) and placement in the cgroup hierarchy, but this would. Will reword if it was confusing.

Member Author

fixed

`pageSize` option that must be provided by users. The `pageSize` option aligns
with the `pagesize` argument in the above mount. It is used to specify the huge
page size and pool that is used for accounting. To write into this volume, the
pod must make a request for huge pages that match this `pagesize`, otherwise its
Contributor

s/pod/container/ ?

Contributor

volumes are per-pod

Contributor

i guess i was thinking requests were per container. i can read it both ways now.
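The rule being quoted - the volume's pageSize must match a huge page request in the pod - can be sketched as a small validation helper (function and field names are hypothetical, not kubelet code):

```python
# Hypothetical validation sketch: a hugePages volume's pageSize must match
# some container's huge page request, per the rule quoted above.
def volume_page_size_matches(volume_page_size: str, containers: list) -> bool:
    """containers: list of dicts, each with a 'requests' resource map."""
    wanted = f"hugepages.{volume_page_size}"
    return any(wanted in c.get("requests", {}) for c in containers)

pod_containers = [
    {"name": "app", "requests": {"hugepages.2MB": "512", "memory": "1Gi"}},
]
print(volume_page_size_matches("2MB", pod_containers))  # True
print(volume_page_size_matches("1GB", pod_containers))  # False
```

Since volumes are per-pod but requests are per-container, the check scans all containers, which is why the comment thread reads both ways.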

To use this feature, the `--cgroups-per-qos` must be enabled. In addition, the
`hugetlb` cgroup must be mounted.

The QoS level cgroups are left unbounded across all huge page pool sizes.
Contributor

do we want to bound at the kubepods (node) level in case system services are using hugepages?

Member Author

added.

@ConnorDoyle (Contributor)

cc @PiotrProkop


setup. For each supported huge page size, the node will advertise a
resource of the form `hugepages.<hugepagesize>`. For example, if a node
supports 2MB huge pages, a resource `hugepages.2MB` will be shown in node
capacity and allocatable values.
Contributor

Does allocatable == capacity in all cases for hugepages resources, or are node-allocatable enhancements also in scope? The text describes how allocatable memory is affected by hugepages, but not how the operator could set aside pre-allocated hugepages that are not available for user pods.

Member Author

Good point. I will add text. I see no reason why a reservation would be a problem at first thought.

and reduce the amount of `memory` it reports using the following formula:

```
[Allocatale] = [Node Capacity] -
Contributor

s/Allocatale/Allocatable



@ConnorDoyle (Contributor)

Do proposals usually link to the feature (here)?


### NUMA

NUMA is complicated. To support NUMA, the node must support cpu pinning,
Contributor

Agree with waiting to solve NUMA. Could we add a plan for how this will work as a pre-requisite for beta? I have some concerns related to non-fungibility of pages from separate NUMA nodes.

Member Author

how will the plan differ from normal memory?

@ConnorDoyle (Contributor) Jul 27, 2017

¯\_(ツ)_/¯

Comment

IIUIC then the smallest granularity level of the scheduler so far is the node level - Do I recall this correctly?
For NUMA however, the granularity level would be intra-node locality - thus the locality of NUMA nodes and memory.

Member Author

@fabiand - we anticipate the scheduler to only maintain visibility at the node level. we anticipate the kubelet to take responsibility for intra-node locality, and that will be a factor of QoS.

@ConnorDoyle (Contributor) left a comment

Aside from small comments above, LGTM.

By the way, huge +1 for this proposal and feature.

volumes whose usage is ultimately constrained by the pod cgroup sandbox memory
settings. The `min_size` option is omitted as its not necessary. It's possible
we could have extended `emptyDir` to support an additional option for
`medium=Memory` but extend it to support a `pageSize` paramter, but this


s/paramter/parameter

@derekwaynecarr (Member Author)

Updated based on feedback.

and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
@vishh (Contributor) Aug 1, 2017

Why is this necessary? Doesn't node capacity already get reduced whenever static huge pages are created? I thought /proc/meminfo reduces total memory available in the presence of static huge pages.

Edit: Can Capacity be MemTotal - (HugePages_Total * Hugepagesize)?

Member Author

it does not. MemTotal is fixed in /proc/meminfo even after allocating hugepages.

Member Author

@vishh - i think adjusting [Node Capacity] as described in your edited comment introduces far more confusion when you layer in other monitoring tools. The machine's memory capacity is [Node Capacity]; the operator has reserved a portion of that capacity for use in huge pages.
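A worked example of the formula under discussion, with hypothetical numbers (no kube-reserved or system-reserved set), showing why capacity stays fixed while allocatable memory shrinks by the pre-allocated huge page pool:

```python
# Worked example of the Allocatable formula under discussion.
# All numbers are hypothetical.
GI = 1024 ** 3
MI = 1024 ** 2

node_capacity_memory = 10 * GI
kube_reserved = 0                 # hypothetical: none reserved
system_reserved = 0               # hypothetical: none reserved
pre_allocated_huge_pages = 512    # 512 pages pre-allocated at boot
huge_page_size = 2 * MI           # 2MB pages

allocatable = (node_capacity_memory
               - kube_reserved
               - system_reserved
               - pre_allocated_huge_pages * huge_page_size)

print(allocatable // GI)  # 9 -> capacity stays 10Gi, allocatable memory is 9Gi
```

This matches the example below: 512 x 2MB = 1Gi set aside for huge pages, leaving 9Gi of allocatable `memory` against a reported 10Gi capacity.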

status:
  capacity:
    memory: 10Gi
    hugepages.2MB: 512
Contributor

Can we avoid exposing size in the API for now? In the future, we do plan on supporting attributes for resources. Size can be an attribute then. Until attributes are available, would node labels suffice?

Contributor

What would be the semantics of the request if size were not in the name? It seems to me there are two ways to do that:

  1. Use bytes as the resource quantity unit instead of pages. This may require requests to be quantized to a multiple of available page size. What if the page sizes are heterogeneous across the set of nodes?
  2. Continue to use pages as the resource quantity unit, and you just get the node's default page size whatever that may be (if you care, influence the scheduler with node label affinity.)

(2) above seems better than (1) to me. However neither option significantly reduces changes that pod authors would have to make once resource attributes are available. Caveat, this is without having reviewed a concrete proposal for resource attributes.

Member Author

(2) doesn't satisfy the quota concerns. It makes quota on pages useless. This is why I settled on the name proposed.

Contributor

A better compute resource API can handle quota as well. Can the quota requirement be deferred?

@thockin (Member) left a comment

Vish asked me to take a look. Nothing terribly upsetting here.


There are a variety of huge page sizes supported across different hardware
architectures. It is preferred to have a resource per size in order to better
support quota. For example, 1 huge page with size 2MB is an order of magnitude
Member

3 orders of magnitude, actually

`hugepages.<hugepagesize>` whose quantity is a positive number of pages. The
request and limit for `hugepages.<hugepagesize>` must match until such time the
node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
Member

WRT class, are you saying that using hugepages gets me an automatic priority boost? Why?

Contributor

I think this is more: by definition, an application that consumes hugepages is not in the BestEffort QoS class because it must have a resource request for hugepages, making it at least Burstable (Guaranteed if all resources are specified and requests == limits)

Member Author

basically what seth said.
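The classification being discussed can be sketched as follows. This is a deliberately simplified, hypothetical helper - the real kubelet logic additionally requires cpu and memory limits on every container for Guaranteed - but it shows why any hugepages request rules out BestEffort:

```python
# Simplified, hypothetical QoS classification sketch (the real kubelet
# rules for Guaranteed are stricter: cpu/memory limits on all containers).
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

# A container requesting only huge pages is at least Burstable:
print(qos_class({"hugepages.2MB": "512"}, {}))                        # Burstable
print(qos_class({"hugepages.2MB": "512"}, {"hugepages.2MB": "512"}))  # Guaranteed
print(qos_class({}, {}))                                              # BestEffort
```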

min_size=<value>,nr_inodes=<value> none /mnt/huge
```

The proposal recommends a new volume type `hugePages`. It will support a single
Member

better as a new volume or as a twist on emptyDir, medium=HugePages, hugePagesConfig: { pageSize: 2Mi } ?

Contributor

+1 for a twist of EmptyDir. We already deal with local storage that is split between EmptyDir (volumes abstraction), container logs and overlay filesystems.

Contributor

From a user perspective, probably no material difference. If EmptyDir is easier for the project to manage then 👍

Member Author

I am fine with emptyDir. The rationale at time was code complexity, but it's probably simpler API to keep with emptyDir

Comment

+1 for EmptyDir as the only memory backed volume in k8s.

```

The proposal recommends a new volume type `hugePages`. It will support a single
`pageSize` option that must be provided by users. The `pageSize` option aligns
Member

does it matter that resource.Quantity is spelled 2Mi but the resource name above is 2MB ?

Member Author

@thockin - at time of writing, i had mapped pageSize to align with the OCI runtime-spec pageSize field for huge page limits, so it didn't need to be a literal resource.Quantity.

see: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits

it seems more common to use decimal rather than binary units when describing the pageSize, but i think it should be fine if we want to support binary units only.

Member Author

looking at this closer in runc, supported hugepage sizes are determined here:
https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/utils.go#L408
https://github.com/opencontainers/runc/blob/master/vendor/github.com/docker/go-units/size.go#L73

it appears to ultimately be parsing the representation returned from the sys/kernel/mm/hugepages/hugepages-{size}kB in kibibytes, so binary notation in kubernetes representation should be fine...

Member Author

actually, this is useful reference:
https://en.wikipedia.org/wiki/Page_%28computer_memory%29#Page_size_trade-off

the actual page sizes are binary notation. even reading 2MByte on intel chipset was 2^21 and 4MByte was 2^22 as expected. binary format is actually preferred. (see pg 109 https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf for sample reference.)

settings. The `min_size` option is omitted as its not necessary. It's possible
we could have extended `emptyDir` to support an additional option for
`medium=Memory` but extend it to support a `pageSize` parameter, but this
proposal prefers to create an additional volume type instead.
Member

Care to explain why? I am open to discussion, but I want to know the rationale here.

@saad-ali

Member Author

If no objection, I am happy to extend emptyDir as described in earlier comment. The life cycle requirements were same as memory backed emptyDir

@derekwaynecarr (Member Author)

@thockin @saad-ali @vishh @dchen1107 -- updated to respond to feedback.


The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertient


pertinent?

- name: hugepage
  emptyDir:
    medium: HugePages
    pageSize: "2Mi"
Contributor

If we treat each hugepage size as a separate resource, that's one thing. But parsing the resource name seems hacky.
Can we instead have emptyDir volumes use whatever hugepage resource was specified as part of the container specification, and also require all containers to specify the same type of hugepage resource?
For the latter it may help having a different resource name like hugepages.kubernetes.io/2Mi.
Or the other option is to have medium be the hugepage resource type itself (hugepages.2Mi). The memory medium has a corresponding memory compute resource.

Comment

Yes - It's probably good to tie the EmptyDir to the requested hugepages. Otherwise there is also the option of abusing the mechanism by requesting one size, but then using a different pageSize.

Quite unlikely, but still possible, is that different hugepage sizes are requested, thus the volume needs a reference to the hugepage size. Just saying that a volume can not infer the page size automatically if there is more than one given. This in turn would mean that the volume should reference the resource (thus `hugepages.2Mi` in the example above)

Member Author

@fabiand - for the initial alpha version of this feature, I am going to update the proposal to restrict the pod from consuming more than one huge page pool size. The volume will then infer the pagesize to use based on the requests.


The `ResourceQuota` resource will be extended to support accounting for
`hugepages.<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
Contributor

Burst ratio wouldn't apply to integer resources like hugepages right?

Member Author

correct.

`vm.huge_tlb_shm_group` sysctl
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.huge_tlb_shm_group` is not namespaced.
Contributor

How are these issues going to be simplified for users?

Member Author

i want to evaluate the associated burden based on usage of the feature rather than pre-design.

- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
- Java applications can back the heap with huge pages using the
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options.
- packet processing systems (DPDK)

Virtual Machines (qemu/kvm/libvirt) can leverage huge pages

`grep Hugepagesize /proc/meminfo`. This defaults to 2MB on most Linux systems
unless overriden by `default_hugepagesz=1g` in kernel boot parameters. For each
supported huge page size, the node will advertise a resource of the form
`hugepages.<hugepagesize>`.

Just to be clear - Resources as in #782 ? Might be worth a link then.

Member Author

@fabiand - this is unrelated to resource class.

Comment

Right - so will it then be advertised as part of the node info?

Supported huge page sizes are determined by parsing the
`sys/kernel/mm/hugepages/hugepages-{size}kB` directory on the host. Kubernetes
will expose the `hugepages.<hugepagesize>` resource using binary notation form.
For example, if a node supports `hugepages-2048kB`, a resource `hugepages.2Mi`

Previously you said the pages will be advertised using hugepages.<hugepagesize>, but you did not define the unit of the size - in this line however you implicitly convert kB to Mi - which is okay, but it's probably worth noting the default unit, to be able to specify the right resources when creating pods.

IOW the proposal should define the unit to use, otherwise it's unclear if you have to request 2048kB or 2Mi units.

Member Author

i will add text describing the size.
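The conversion the quoted text describes - from the kernel's sysfs directory name to a binary-notation resource name - can be sketched as follows (function name hypothetical, not the actual kubelet code):

```python
# Sketch: convert the kernel's hugepages-{size}kB sysfs directory name
# into the binary-notation resource name the proposal advertises.
def resource_name(sysfs_dir: str) -> str:
    kb = int(sysfs_dir.removeprefix("hugepages-").removesuffix("kB"))
    size = kb * 1024  # sysfs reports kibibytes
    for unit, factor in (("Gi", 1024 ** 3), ("Mi", 1024 ** 2), ("Ki", 1024)):
        if size % factor == 0:
            return f"hugepages.{size // factor}{unit}"
    return f"hugepages.{size}"

print(resource_name("hugepages-2048kB"))     # hugepages.2Mi
print(resource_name("hugepages-1048576kB"))  # hugepages.1Gi
```

Because the sysfs representation is already in kibibytes, the binary-notation names fall out exactly, which is the point derekwaynecarr makes about runc's parsing above.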



cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. It will be used to determine capacity and
calculate allocatable values on the node.

This reporting is just informative, and not for scheduling, right?

As I'd expect that the resource class mechanism is used to advertise available hugepages.

Member Author:

the reporting is for scheduling. the resource class mechanism is later.

@derekwaynecarr (Member Author):

@vishh @thockin @dchen1107 @saad-ali - updated based on last round of feedback.

notable changes:

  1. a pod is limited to a single huge page size pool for the initial iteration of this feature. this limits the exposure of changes to emptyDir to only require an alternate medium rather than an API schema change. feedback at this point is that a pod consuming multiple page size pools is rare.
  2. a pod requests huge pages like memory rather than integral values. for example, hugepages-2Mi: 1Gi rather than hugepages-2Mi: 512. This aligns more naturally to how users think about the memory, and aligns with how i would anticipate monitoring tools to report usage back via hugetlb cgroup (i.e. in actual bytes used instead of pages).
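The byte-quantity convention in point 2 can be sketched as follows. The helper name is hypothetical, and the page-alignment check is an assumption about how non-integral requests would be handled, not something the comment above states.

```python
# Hypothetical helper illustrating point 2: a request expressed in bytes
# (e.g. hugepages-2Mi: 1Gi) maps to an integral number of 2Mi pages; a
# request that is not a whole multiple of the page size is rejected here
# (an assumption for this sketch).

PAGE_SIZE_2MI = 2 * 1024 * 1024

def pages_for_request(request_bytes: int, page_size: int = PAGE_SIZE_2MI) -> int:
    if request_bytes % page_size != 0:
        raise ValueError("hugepages request must be a multiple of the page size")
    return request_bytes // page_size

print(pages_for_request(1 * 1024**3))  # 1Gi of 2Mi pages -> 512
```

So `hugepages-2Mi: 1Gi` and `hugepages-2Mi: 512` pages describe the same reservation; the byte form is what users and monitoring tools see.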

@kubernetes kubernetes deleted a comment from vishh Aug 15, 2017
@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 15, 2017
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Pre-Allocated-HugePages * HugePageSize]
```
Member:


Based on the formula and the example you listed below, it looks like you change the semantics of NodeCapacity and NodeAllocatable, which are both declared as ResourceList. Taking your example given below, it is confusing to the user whether the node has total memory 10Gi = 9Gi (memory allocatable) + 1Gi (hugepages-2Mi allocatable), or 11Gi = 10Gi (memory capacity) + 1Gi (hugepages-2Mi).

Contributor:


Isn't this the first instance of a cross-resource data dependency in node status? @derekwaynecarr could you expand on how the monitoring tools are affected by this decision?

Member Author:


this was my thought process at the time:

  1. most users associate capacity[memory] with /proc/meminfo MemTotal (i.e. total physical RAM)
  2. pre-allocating huge pages does not change MemTotal, it just reduces MemAvailable
  3. pre-allocating huge pages is an implicit reservation of memory for use in a different mode against memory capacity
  4. i suspect users would be confused if the capacity value for memory as reported back by the kubelet differed from what is reported in cAdvisor (or other monitoring tools) and could bubble up through Prometheus, etc. i really feel people will be confused as most monitoring tools will read MemTotal.

i don't think users should expect to add numbers with two different units when evaluating capacity.

if we wanted the operator to explicitly account for pre-allocated hugepages in either system-reserved or kube-reserved, i suspect that causes more trouble when we actually enforce-node-allocatable for system and kube cgroups as that memory is not actually available (its been reserved for hugepage usage only).
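The accounting described in this thread can be made concrete with a small sketch (mirroring the proposal's Allocatable formula; the function name and the all-zero reservations are illustrative):

```python
# Sketch of the proposed accounting: memory capacity stays equal to
# /proc/meminfo MemTotal, and pre-allocated huge pages are subtracted
# when computing memory Allocatable.

def allocatable_memory(capacity, kube_reserved, system_reserved,
                       hugepage_count, hugepage_size):
    return (capacity - kube_reserved - system_reserved
            - hugepage_count * hugepage_size)

GI = 1024**3
MI = 1024**2

# 10Gi node, 512 pre-allocated 2Mi pages (= 1Gi), no other reservations:
print(allocatable_memory(10 * GI, 0, 0, 512, 2 * MI) // GI)  # -> 9
```

Capacity still reports 10Gi (matching MemTotal and what monitoring tools show), while memory Allocatable drops to 9Gi, with the remaining 1Gi advertised separately as `hugepages-2Mi`.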

Member:


Ok, I agreed with you on this.

Member:


Unfortunately, not all kernels have MemAvailable in /proc/meminfo.


The request and limit for `hugepages-<hugepagesize>` must match. Similar to
memory, an application that requests `hugepages-<hugepagesize>` resource is at
minimum in the `Burstable` QoS class.
Member:


I thought we don't want to overcommit hugepages, at least for the 1.8 release. If that is true, shouldn't we only support the Guaranteed QoS class for hugepages? I understand that we changed QoS to span all resources and all containers, which makes any pod requiring hugepages at least Burstable class, but I want to make sure that is what you are referring to here.

Member Author:


we do not want to overcommit hugepages. the request must always match the limit for hugepages. i should be able to have request != limit for cpu and memory resource requirements on any pod that consumes hugepages. i was trying to assert that hugepage requests impact qos (similar to cpu and memory).

Member:


sgtm
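Pulling this thread together, a hedged sketch of a pod spec under these rules (the image name and sizes are illustrative, and the exact field names are whatever the proposal finally lands):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    image: example/app   # placeholder image
    resources:
      requests:
        hugepages-2Mi: 1Gi   # must equal the limit: no overcommit
        memory: 512Mi        # cpu/memory may still have request != limit
      limits:
        hugepages-2Mi: 1Gi
        memory: 1Gi
    volumeMounts:
    - name: hugepage
      mountPath: /hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
      pageSize: "2Mi"
```

Because memory request != limit, this pod is Burstable, yet its hugepages are still strictly reserved since that resource's request and limit match.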


If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
Member:


Agreed that how to configure such a supplemental group is outside the scope, but can you add an example of how a user pod requests huge pages via shmget?

Member Author:


i will add an example of a pod that runs a jvm backed by huge pages.
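As a hedged sketch of what such a JVM pod might look like (the gid and image are placeholders; `-XX:+UseLargePages` is the HotSpot flag that enables huge page backing, and the shmget path needs the supplemental group rather than a hugetlbfs volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jvm-hugepages
spec:
  securityContext:
    # must match the gid in /proc/sys/vm/hugetlb_shm_group on the node
    supplementalGroups: [1000]
  containers:
  - name: jvm
    image: example/jvm-app   # placeholder image
    command: ["java", "-XX:+UseLargePages", "-Xmx512m", "-jar", "/app.jar"]
    resources:
      requests:
        hugepages-2Mi: 1Gi
      limits:
        hugepages-2Mi: 1Gi
```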

fail validation. We believe it is rare for applications to attempt to use
multiple huge page sizes. This restriction may be lifted in the future with
community presented use cases. Introducing the feature with this restriction
limits the exposure of API changes needed when consuming huge pages via volumes.
Member:


Can we make the node specification with multiple huge page sizes invalid at the initial node capacity discovery level?

Member Author:


i guess that is fine. the kubelet could fail startup if it detects multiple huge page sizes have been pre-allocated.

@dchen1107 (Member):

/lgtm

We can address several comments in separate prs. Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 16, 2017
@k8s-github-robot:

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit f5520e2 into kubernetes:master Aug 16, 2017
luxas pushed a commit to luxas/kubernetes that referenced this pull request Aug 22, 2017
Automatic merge from submit-queue (batch tested with PRs 50893, 50913, 50963, 50629, 50640)

bump(github.com/google/cadvisor): 27e1acbb4ef0fe1889208b21f8f4a6d0863e02f6

Bump cadvisor dep to include google/cadvisor@556c7b1.

This is required for supporting huge pages for 1.8

kubernetes/community#837
kubernetes/enhancements#275

@dashpole @derekwaynecarr @eparis @jeremyeder
nicksardo pushed a commit to nicksardo/kubernetes that referenced this pull request Sep 5, 2017
…ture

Automatic merge from submit-queue

HugePages feature

**What this PR does / why we need it**:
Implements HugePages support per kubernetes/community#837

Feature track issue: kubernetes/enhancements#275

**Special notes for your reviewer**:
A follow-on PR is opened to add the EmptyDir support.

**Release note**:
```release-note
Alpha support for pre-allocated hugepages
```
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Automatic merge from submit-queue

HugePages proposal

A pod may request a number of huge pages.  The `scheduler` is able to place the pod on a node that can satisfy that request.  The `kubelet` advertises an allocatable number of huge pages to support scheduling decisions. A pod may consume hugepages via `hugetlbfs` or `shmget`.  

Planned as Alpha for Kubernetes 1.8 release.

See feature: kubernetes/enhancements#275

@kubernetes/sig-scheduling-feature-requests
@kubernetes/sig-scheduling-misc
@kubernetes/sig-node-proposals
@kubernetes/api-approvers
@kubernetes/api-reviewers
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023