
HugePages proposal #837

Merged (1 commit) Aug 16, 2017

Conversation
Conversation

@derekwaynecarr (Member) commented Jul 24, 2017

A pod may request a number of huge pages. The scheduler is able to place the pod on a node that can satisfy that request. The kubelet advertises an allocatable number of huge pages to support scheduling decisions. A pod may consume hugepages via hugetlbfs or shmget.

Planned as Alpha for Kubernetes 1.8 release.

See feature: kubernetes/enhancements#275

@kubernetes/sig-scheduling-feature-requests
@kubernetes/sig-scheduling-misc
@kubernetes/sig-node-proposals
@kubernetes/api-approvers
@kubernetes/api-reviewers

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 24, 2017
@derekwaynecarr derekwaynecarr mentioned this pull request Jul 24, 2017
[Allocatale] = [Node Capacity] -
[Kube-Reserved] -
[System-Reserved] -
[Pre-Allocated-HugePages * HugePageSize]
Member

I think there's a minus missing at the end of this line

@derekwaynecarr (Member Author)

I may choose to use explicit huge page sizes after more reflection. I want to experiment with a prototype today. The difference between 2MB and 1Gi pages is significant enough that I would want to quota them separately. The emptyDir topic would then become a separate volume type with a pageSize option.

@jeremyeder (Contributor) commented Jul 24, 2017

Few questions:

  • eviction-hard only looks at the non-hugepages allocatable remaining, right? IOW if there are plenty of free hugepages but the non-hugepages amount is getting low...eviction would be triggered despite free hugepages. Assume that's the case, but just checking.
  • The kubelet will set the node label automatically according to what it discovers?
  • Will any hugepage info show up in cadvisor?

@derekwaynecarr (Member Author)

@jeremyeder

  • eviction-hard - if there are plenty of free hugepages, but normal memory is low, eviction is triggered.
  • node label - yes, the kubelet would set it, but the more i think on this, the more i think we should just have the huge page size explicit in the resource name
  • cadvisor will report hugepages allocated per size; i will add a section for monitoring, i suspect we would want to see hugepage usage per container/pod cgroup.

@jeremyeder (Contributor)

  • node label - explicit name...makes sense since there are so few possibilities and even if the list grows it will still be manageable.
  • cadvisor -- in addition to per-container/cgroup, perhaps it can also report the hugepages allocatable/capacity for each node.

@sjenning (Contributor) left a comment

mostly nits, nothing to hold up the proposal over. lgtm.

virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software based
memory management. This results in performance issues. Since the hardware is
Contributor

nit: s/memory management/address translation/

Contributor

also nit: s/the hardware is fixed/the size of the TLB is fixed/

instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software based
memory management. This results in performance issues. Since the hardware is
fixed, the only way to alleviate the performance impact of a TLB miss is to
Contributor

s/alleviate the performance impact/reduce the chance/

A huge page is a memory page that is larger than 4KB. On x86_64 architectures,
there are two huge page sizes: 2MB and 1GB. Sizes vary on other architectures,
but the idea is the same. In order to use huge pages, application must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
Contributor

nit: s/attempt/attempts/

but the idea is the same. In order to use huge pages, application must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
the management of huge pages without application knowledge, but they have
limitations. In particular, they are limited to 2MB page sizes.
Contributor

Also, dynamic splitting of THPs into standard pages during certain conditions can result in undesirable application stalls.

A system may support multiple huge page sizes. It is assumed that most
nodes will be configured to primarily use the default huge page size as
returned via `grep Hugepagesize /proc/meminfo`. This defaults to 2MB on
most Linux systems unless overriden by `default_hugepagesz=1g` in kernel
Contributor

nit s/overriden/overridden/

most Linux systems unless overriden by `default_hugepagesz=1g` in kernel
boot parameters. This design does not limit the ability to use multiple
huge page sizes at the same time, but we expect this to not be the normal
setup. For each supported huge page size, the node will advertise a
Contributor

nit: not sure if we need this sentence (starting with "This design..") i found it confusing i.e. you can, but maybe you shouldn't because... reasons?

Member Author

I have been told that mixed use of 2MB and 1GB pages is possible on a node.

Contributor

Don't think I've seen it. Mostly because the use-cases for 1GB pages make co-locating other apps that use 2MB pages on the same hardware architecturally undesirable (from a deployment standpoint).

@sjenning (Contributor) Jul 28, 2017

For the record, yes mixing is possible for the three way to explicitly use huge pages.
https://gist.github.com/sjenning/b6bed5bf029c9fd6f078f76b37f0a73f#consuming-hugepages

hugetlbfs - pagesize mount option
mmap() - MAP_HUGE_2MB, MAP_HUGE_1GB
shmget() - SHM_HUGE_2MB, SHM_HUGE_1GB
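As a sketch of the flag encoding the mmap()/shmget() variants above share (this is an illustration, not proposed code): on Linux the explicit huge-page flags encode log2(page size) shifted left by MAP_HUGE_SHIFT (26), so each size variant derives mechanically from the page size.

```python
# Sketch: MAP_HUGE_* / SHM_HUGE_* flags listed above encode
# log2(page size) << MAP_HUGE_SHIFT (26 on Linux).
MAP_HUGE_SHIFT = 26

def huge_page_flag(page_size_bytes: int) -> int:
    """Return the mmap()/shmget() flag value for an explicit huge page size."""
    assert page_size_bytes & (page_size_bytes - 1) == 0, "must be a power of two"
    return (page_size_bytes.bit_length() - 1) << MAP_HUGE_SHIFT

MAP_HUGE_2MB = huge_page_flag(2 * 1024 * 1024)   # 21 << 26
MAP_HUGE_1GB = huge_page_flag(1024 ** 3)         # 30 << 26

print(hex(MAP_HUGE_2MB))  # 0x54000000
print(hex(MAP_HUGE_1GB))  # 0x78000000
```

These values match the kernel's definitions (MAP_HUGE_2MB = 21 << 26, MAP_HUGE_1GB = 30 << 26), which is why mixing sizes is possible: each mapping carries its own size in the flags.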

Member Author

removed to avoid confusion.

## Pod Specification

A pod must make a request to consume pre-allocated huge pages using the resource
`hugepages.<hugepagesize>` whose quantity is a non-negative number of pages. The
Contributor

s/non-negative/positive integer/

request and limit for `hugepages.<hugepagesize>` must match until such time the
node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
the `BestEffort` QoS class.
Contributor

not sure if we need this sentence. If we do need it, i'd go with

An application that consumes hugepages.<hugepagesize> is, by definition, not in the BestEffort QoS class as it specifies a request for one or more resources.

Contributor

+1 for rewording

Member Author

The idea at the time was that not all resources impact QoS (i.e. OIR) and placement in the cgroup hierarchy, but this would. Will reword if it was confusing.

Member Author

fixed

`pageSize` option that must be provided by users. The `pageSize` option aligns
with the `pagesize` argument in the above mount. It is used to specify the huge
page size and pool that is used for accounting. To write into this volume, the
pod must make a request for huge pages that match this `pagesize`, otherwise its
Contributor

s/pod/container/ ?

Contributor

volumes are per-pod

Contributor

i guess i was thinking requests were per container. i can read it both ways now.
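The rule being quoted - the volume's pageSize must match a huge page request in the pod - can be sketched as a small validation helper (function and field names are hypothetical, not kubelet code):

```python
# Hypothetical validation sketch: a hugePages volume's pageSize must match
# some container's huge page request, per the rule quoted above.
def volume_page_size_matches(volume_page_size: str, containers: list) -> bool:
    """containers: list of dicts, each with a 'requests' resource map."""
    wanted = f"hugepages.{volume_page_size}"
    return any(wanted in c.get("requests", {}) for c in containers)

pod_containers = [
    {"name": "app", "requests": {"hugepages.2MB": "512", "memory": "1Gi"}},
]
print(volume_page_size_matches("2MB", pod_containers))  # True
print(volume_page_size_matches("1GB", pod_containers))  # False
```

Since volumes are per-pod but requests are per-container, the check scans all containers, which is why the comment thread reads both ways.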

To use this feature, the `--cgroups-per-qos` must be enabled. In addition, the
`hugetlb` cgroup must be mounted.

The QoS level cgroups are left unbounded across all huge page pool sizes.
Contributor

do we want to bound at the kubepods (node) level in case system services are using hugepages?

Member Author

added.

@ConnorDoyle (Contributor)

cc @PiotrProkop


setup. For each supported huge page size, the node will advertise a
resource of the form `hugepages.<hugepagesize>`. For example, if a node
supports 2MB huge pages, a resource `hugepages.2MB` will be shown in node
capacity and allocatable values.
Contributor

Does allocatable == capacity in all cases for hugepages resources, or are node-allocatable enhancements also in scope? The text describes how allocatable memory is affected by hugepages, but not how the operator could set aside pre-allocated hugepages that are not available for user pods.

Member Author

Good point. I will add text. I see no reason why a reservation would be a problem at first thought.

and reduce the amount of `memory` it reports using the following formula:

```
[Allocatale] = [Node Capacity] -
Contributor

s/Allocatale/Allocatable



@ConnorDoyle (Contributor)

Do proposals usually link to the feature (here)?


### NUMA

NUMA is complicated. To support NUMA, the node must support cpu pinning,
Contributor

Agree with waiting to solve NUMA. Could we add a plan for how this will work as a pre-requisite for beta? I have some concerns related to non-fungibility of pages from separate NUMA nodes.

Member Author

how will the plan differ from normal memory?

@ConnorDoyle (Contributor) Jul 27, 2017

¯\_(ツ)_/¯

Comment

IIUIC then the smallest granularity level of the scheduler so far is the node level - Do I recall this correctly?
For NUMA however, the granularity level would be intra-node locality - thus the locality of NUMA nodes and memory.

Member Author

@fabiand - we anticipate the scheduler to only maintain visibility at the node level. we anticipate the kubelet to take responsibility for intra-node locality, and that will be a factor of QoS.

@ConnorDoyle (Contributor) left a comment

Aside from small comments above, LGTM.

By the way, huge +1 for this proposal and feature.

volumes whose usage is ultimately constrained by the pod cgroup sandbox memory
settings. The `min_size` option is omitted as its not necessary. It's possible
we could have extended `emptyDir` to support an additional option for
`medium=Memory` but extend it to support a `pageSize` paramter, but this


s/paramter/parameter

@derekwaynecarr (Member Author)

Updated based on feedback.

and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
@vishh (Contributor) Aug 1, 2017

Why is this necessary? Doesn't node capacity already get reduced whenever static huge pages are created? I thought /proc/meminfo reduces total memory available in the presence of static huge pages.

Edit: Can Capacity be MemTotal - (HugePages_Total * Hugepagesize)?

Member Author

it does not. MemTotal is fixed in /proc/meminfo even after allocating hugepages.

Member Author

@vishh - i think adjusting [Node Capacity] as described in your edited comment introduces far more confusion when you layer in other monitoring tools. The machine's memory capacity is [Node Capacity]; the operator has reserved a portion of that capacity for use in huge pages.
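A worked example of the formula under discussion, with hypothetical numbers (no kube-reserved or system-reserved set), showing why capacity stays fixed while allocatable memory shrinks by the pre-allocated huge page pool:

```python
# Worked example of the Allocatable formula under discussion.
# All numbers are hypothetical.
GI = 1024 ** 3
MI = 1024 ** 2

node_capacity_memory = 10 * GI
kube_reserved = 0                 # hypothetical: none reserved
system_reserved = 0               # hypothetical: none reserved
pre_allocated_huge_pages = 512    # 512 pages pre-allocated at boot
huge_page_size = 2 * MI           # 2MB pages

allocatable = (node_capacity_memory
               - kube_reserved
               - system_reserved
               - pre_allocated_huge_pages * huge_page_size)

print(allocatable // GI)  # 9 -> capacity stays 10Gi, allocatable memory is 9Gi
```

This matches the example below: 512 x 2MB = 1Gi set aside for huge pages, leaving 9Gi of allocatable `memory` against a reported 10Gi capacity.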

status:
  capacity:
    memory: 10Gi
    hugepages.2MB: 512
Contributor

Can we avoid exposing size in the API for now? In the future, we do plan on supporting attributes for resources. Size can be an attribute then. Until attributes are available, would node labels suffice?

Contributor

What would be the semantics of the request if size were not in the name? It seems to me there are two ways to do that:

  1. Use bytes as the resource quantity unit instead of pages. This may require requests to be quantized to a multiple of available page size. What if the page sizes are heterogeneous across the set of nodes?
  2. Continue to use pages as the resource quantity unit, and you just get the node's default page size whatever that may be (if you care, influence the scheduler with node label affinity.)

(2) above seems better than (1) to me. However neither option significantly reduces changes that pod authors would have to make once resource attributes are available. Caveat, this is without having reviewed a concrete proposal for resource attributes.

Member Author

(2) doesn't satisfy the quota concerns. It makes quota on pages useless. This is why I settled on the name proposed.

Contributor

A better compute resource API can handle quota as well. Can the quota requirement be deferred?

@thockin (Member) left a comment

Vish asked me to take a look. Nothing terribly upsetting here.


There are a variety of huge page sizes supported across different hardware
architectures. It is preferred to have a resource per size in order to better
support quota. For example, 1 huge page with size 2MB is an order of magnitude
Member

3 orders of magnitude, actually

`hugepages.<hugepagesize>` whose quantity is a positive number of pages. The
request and limit for `hugepages.<hugepagesize>` must match until such time the
node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
Member

WRT class, are you saying that using hugepages gets me an automatic priority boost? Why?

Contributor

I think this is more: by definition, an application that consumes hugepages is not in the BestEffort QoS class because it must have a resource request for hugepages, making it at least Burstable (Guaranteed if all resources are specified and requests == limits)

Member Author

basically what seth said.
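The classification being discussed can be sketched as follows. This is a deliberately simplified, hypothetical helper - the real kubelet logic additionally requires cpu and memory limits on every container for Guaranteed - but it shows why any hugepages request rules out BestEffort:

```python
# Simplified, hypothetical QoS classification sketch (the real kubelet
# rules for Guaranteed are stricter: cpu/memory limits on all containers).
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

# A container requesting only huge pages is at least Burstable:
print(qos_class({"hugepages.2MB": "512"}, {}))                        # Burstable
print(qos_class({"hugepages.2MB": "512"}, {"hugepages.2MB": "512"}))  # Guaranteed
print(qos_class({}, {}))                                              # BestEffort
```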

min_size=<value>,nr_inodes=<value> none /mnt/huge
```

The proposal recommends a new volume type `hugePages`. It will support a single
Member

better as a new volume or as a twist on emptyDir, medium=HugePages, hugePagesConfig: { pageSize: 2Mi } ?

Contributor

+1 for a twist of EmptyDir. We already deal with local storage that is split between EmptyDir (volumes abstraction), container logs and overlay filesystems.

Contributor

From a user perspective, probably no material difference. If EmptyDir is easier for the project to manage then 👍

Member Author

I am fine with emptyDir. The rationale at time was code complexity, but it's probably simpler API to keep with emptyDir

Comment

+1 for EmptyDir as the only memory backed volume in k8s.

```

The proposal recommends a new volume type `hugePages`. It will support a single
`pageSize` option that must be provided by users. The `pageSize` option aligns
Member

does it matter that resource.Quantity is spelled 2Mi but the resource name above is 2MB ?

Member Author

@thockin - at time of writing, i had mapped pageSize to align with the OCI runtime-spec pageSize field for huge page limits, so it didn't need to be a literal resource.Quantity.

see: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits

it seems more common to use decimal rather than binary units when describing the pageSize, but i think it should be fine if we want to support binary units only.

Member Author

looking at this closer in runc, supported hugepage sizes are determined here:
https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/utils.go#L408
https://github.com/opencontainers/runc/blob/master/vendor/github.com/docker/go-units/size.go#L73

it appears to ultimately be parsing the representation returned from the sys/kernel/mm/hugepages/hugepages-{size}kB in kibibytes, so binary notation in kubernetes representation should be fine...

Member Author

actually, this is useful reference:
https://en.wikipedia.org/wiki/Page_%28computer_memory%29#Page_size_trade-off

the actual page sizes are binary notation. even reading 2MByte on intel chipset was 2^21 and 4MByte was 2^22 as expected. binary format is actually preferred. (see pg 109 https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf for sample reference.)

settings. The `min_size` option is omitted as its not necessary. It's possible
we could have extended `emptyDir` to support an additional option for
`medium=Memory` but extend it to support a `pageSize` parameter, but this
proposal prefers to create an additional volume type instead.
Member

Care to explain why? I am open to discussion, but I want to know the rationale here.

@saad-ali

Member Author

If no objection, I am happy to extend emptyDir as described in earlier comment. The life cycle requirements were same as memory backed emptyDir

@derekwaynecarr (Member Author)

@thockin @saad-ali @vishh @dchen1107 -- updated to respond to feedback.


The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertient


pertinent?

- name: hugepage
  emptyDir:
    medium: HugePages
    pageSize: "2Mi"
Contributor

If we treat each hugepage size as a separate resource, that's one thing. But parsing the resource name seems hacky.
Can we instead have emptyDir volumes use whatever hugepage resource was specified as part of the container specification, and also require all containers to specify the same type of hugepage resource?
For the latter it may help having a different resource name like hugepages.kubernetes.io/2Mi.
Or the other option is to have medium be the hugepage resource type itself (hugepages.2Mi). The memory medium has a corresponding memory compute resource.

Comment

Yes - It's probably good to tie the EmptyDir to the requested hugepages. Otherwise there is also the option of abusing the mechanism by requesting one size, but then using a different pageSize.

Quite unlikely, but still possible, is that different hugepage sizes are requested, thus the volume needs a reference to the hugepage size. Just saying that a volume can not infer the page size automatically if there is more than one given. This in turn would mean that the volume should reference the resource (thus `hugepages.2Mi` in the example above)

Member Author

@fabiand - for the initial alpha version of this feature, I am going to update the proposal to restrict the pod from consuming more than one huge page pool size. The volume will then infer the pagesize to use based on the requests.


The `ResourceQuota` resource will be extended to support accounting for
`hugepages.<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
Contributor

Burst ratio wouldn't apply to integer resources like hugepages right?

Member Author

correct.

`vm.huge_tlb_shm_group` sysctl
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.huge_tlb_shm_group` is not namespaced.
Contributor

How are these issues going to be simplified for users?

Member Author

i want to evaluate the associated burden based on usage of the feature rather than pre-design.

- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
- Java applications can back the heap with huge pages using the
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options.
- packet processing systems (DPDK)

Virtual Machines (qemu/kvm/libvirt) can leverage huge pages

`grep Hugepagesize /proc/meminfo`. This defaults to 2MB on most Linux systems
unless overriden by `default_hugepagesz=1g` in kernel boot parameters. For each
supported huge page size, the node will advertise a resource of the form
`hugepages.<hugepagesize>`.

Just to be clear - Resources as in #782 ? Might be worth a link then.

Member Author

@fabiand - this is unrelated to resource class.

Comment

Right - so will it then be advertised as part of the node info?

Supported huge page sizes are determined by parsing the
`sys/kernel/mm/hugepages/hugepages-{size}kB` directory on the host. Kubernetes
will expose the `hugepages.<hugepagesize>` resource using binary notation form.
For example, if a node supports `hugepages-2048kB`, a resource `hugepages.2Mi`

Previously you said the pages will be advertised using hugepages.<hugepagesize>, but you did not define the unit of the size - in this line however you implicitly convert kB to Mi - which is okay, but it's probably worth noting the default unit, to be able to specify the right resources when creating pods.

IOW the proposal should define the unit to use, otherwise it's unclear if you have to request 2048kB or 2Mi units.

Member Author

i will add text describing the size.
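The conversion the quoted text describes - from the kernel's sysfs directory name to a binary-notation resource name - can be sketched as follows (function name hypothetical, not the actual kubelet code):

```python
# Sketch: convert the kernel's hugepages-{size}kB sysfs directory name
# into the binary-notation resource name the proposal advertises.
def resource_name(sysfs_dir: str) -> str:
    kb = int(sysfs_dir.removeprefix("hugepages-").removesuffix("kB"))
    size = kb * 1024  # sysfs reports kibibytes
    for unit, factor in (("Gi", 1024 ** 3), ("Mi", 1024 ** 2), ("Ki", 1024)):
        if size % factor == 0:
            return f"hugepages.{size // factor}{unit}"
    return f"hugepages.{size}"

print(resource_name("hugepages-2048kB"))     # hugepages.2Mi
print(resource_name("hugepages-1048576kB"))  # hugepages.1Gi
```

Because the sysfs representation is already in kibibytes, the binary-notation names fall out exactly, which is the point derekwaynecarr makes about runc's parsing above.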



cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. It will be used to determine capacity and
calculate allocatable values on the node.

This reporting is just informative, and not for scheduling, right?

As I'd expect that the resource class mechanism is used to advertise available hugepages.

Member Author:

the reporting is for scheduling. the resource class mechanism is later.

@derekwaynecarr (Member Author):

@vishh @thockin @dchen1107 @saad-ali - updated based on last round of feedback.

notable changes:

  1. a pod is limited to a single huge page size pool for the initial iteration of this feature. this limits the exposure of changes to emptyDir to only require an alternate medium rather than an API schema change. feedback at this point is that a pod consuming multiple page size pools is rare.
  2. a pod requests huge pages like memory rather than integral values. for example, hugepages-2Mi: 1Gi rather than hugepages-2Mi: 512. This aligns more naturally to how users think about the memory, and aligns with how i would anticipate monitoring tools to report usage back via hugetlb cgroup (i.e. in actual bytes used instead of pages).
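The byte-quantity convention in point 2 can be sketched as follows. The helper name is hypothetical, and the page-alignment check is an assumption about how non-integral requests would be handled, not something the comment above states.

```python
# Hypothetical helper illustrating point 2: a request expressed in bytes
# (e.g. hugepages-2Mi: 1Gi) maps to an integral number of 2Mi pages; a
# request that is not a whole multiple of the page size is rejected here
# (an assumption for this sketch).

PAGE_SIZE_2MI = 2 * 1024 * 1024

def pages_for_request(request_bytes: int, page_size: int = PAGE_SIZE_2MI) -> int:
    if request_bytes % page_size != 0:
        raise ValueError("hugepages request must be a multiple of the page size")
    return request_bytes // page_size

print(pages_for_request(1 * 1024**3))  # 1Gi of 2Mi pages -> 512
```

So `hugepages-2Mi: 1Gi` and `hugepages-2Mi: 512` pages describe the same reservation; the byte form is what users and monitoring tools see.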

@kubernetes kubernetes deleted a comment from vishh Aug 15, 2017
@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 15, 2017
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Pre-Allocated-HugePages * HugePageSize]
```
Member:


Based on the formula and the example you listed below, it looks like you change the semantics of NodeCapacity and NodeAllocatable, which are both declared as ResourceList. Taking your example given below, it is confusing to the user whether the node has total memory 10Gi = 9Gi (memory allocatable) + 1Gi (hugepages-2Mi allocatable), or 11Gi = 10Gi (memory capacity) + 1Gi (hugepages-2Mi).

Contributor:


Isn't this the first instance of a cross-resource data dependency in node status? @derekwaynecarr could you expand on how the monitoring tools are affected by this decision?

Member Author:


this was my thought process at the time:

  1. most users associate capacity[memory] with /proc/meminfo MemTotal (i.e. total physical RAM)
  2. pre-allocating huge pages does not change MemTotal, it just reduces MemAvailable
  3. pre-allocating huge pages is an implicit reservation of memory for use in a different mode against memory capacity
  4. i suspect users would be confused if the capacity value for memory as reported back by the kubelet differed from what is reported in cAdvisor (or other monitoring tools) and could bubble up through Prometheus, etc. i really feel people will be confused as most monitoring tools will read MemTotal.

i don't think users should expect to add numbers with two different units when evaluating capacity.

if we wanted the operator to explicitly account for pre-allocated hugepages in either system-reserved or kube-reserved, i suspect that causes more trouble when we actually enforce-node-allocatable for system and kube cgroups as that memory is not actually available (its been reserved for hugepage usage only).
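The accounting described in this thread can be made concrete with a small sketch (mirroring the proposal's Allocatable formula; the function name and the all-zero reservations are illustrative):

```python
# Sketch of the proposed accounting: memory capacity stays equal to
# /proc/meminfo MemTotal, and pre-allocated huge pages are subtracted
# when computing memory Allocatable.

def allocatable_memory(capacity, kube_reserved, system_reserved,
                       hugepage_count, hugepage_size):
    return (capacity - kube_reserved - system_reserved
            - hugepage_count * hugepage_size)

GI = 1024**3
MI = 1024**2

# 10Gi node, 512 pre-allocated 2Mi pages (= 1Gi), no other reservations:
print(allocatable_memory(10 * GI, 0, 0, 512, 2 * MI) // GI)  # -> 9
```

Capacity still reports 10Gi (matching MemTotal and what monitoring tools show), while memory Allocatable drops to 9Gi, with the remaining 1Gi advertised separately as `hugepages-2Mi`.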

Member:


Ok, I agreed with you on this.

Member:


Unfortunately, not all kernels have MemAvailable in /proc/meminfo.


The request and limit for `hugepages-<hugepagesize>` must match. Similar to
memory, an application that requests `hugepages-<hugepagesize>` resource is at
minimum in the `Burstable` QoS class.
Member:


I thought we don't want to overcommit hugepages, at least for the 1.8 release. If that is true, shouldn't we only support the Guaranteed QoS class for hugepages? I understand that we changed QoS to span all resources and all containers, which makes any pod requiring hugepages at least Burstable class, but I want to make sure that is what you are referring to here.

Member Author:


we do not want to overcommit hugepages. the request must always match the limit for hugepages. i should be able to have request != limit for cpu and memory resource requirements on any pod that consumes hugepages. i was trying to assert that hugepage requests impact qos (similar to cpu and memory).

Member:


sgtm
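Pulling this thread together, a hedged sketch of a pod spec under these rules (the image name and sizes are illustrative, and the exact field names are whatever the proposal finally lands):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    image: example/app   # placeholder image
    resources:
      requests:
        hugepages-2Mi: 1Gi   # must equal the limit: no overcommit
        memory: 512Mi        # cpu/memory may still have request != limit
      limits:
        hugepages-2Mi: 1Gi
        memory: 1Gi
    volumeMounts:
    - name: hugepage
      mountPath: /hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
      pageSize: "2Mi"
```

Because memory request != limit, this pod is Burstable, yet its hugepages are still strictly reserved since that resource's request and limit match.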


If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
Member:


Agreed that how to configure such a supplemental group is outside the scope, but can you add an example of how a user pod requests huge pages via shmget?

Member Author:


i will add an example of a pod that runs a jvm backed by huge pages.
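As a hedged sketch of what such a JVM pod might look like (the gid and image are placeholders; `-XX:+UseLargePages` is the HotSpot flag that enables huge page backing, and the shmget path needs the supplemental group rather than a hugetlbfs volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jvm-hugepages
spec:
  securityContext:
    # must match the gid in /proc/sys/vm/hugetlb_shm_group on the node
    supplementalGroups: [1000]
  containers:
  - name: jvm
    image: example/jvm-app   # placeholder image
    command: ["java", "-XX:+UseLargePages", "-Xmx512m", "-jar", "/app.jar"]
    resources:
      requests:
        hugepages-2Mi: 1Gi
      limits:
        hugepages-2Mi: 1Gi
```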

fail validation. We believe it is rare for applications to attempt to use
multiple huge page sizes. This restriction may be lifted in the future with
community presented use cases. Introducing the feature with this restriction
limits the exposure of API changes needed when consuming huge pages via volumes.
Member:


Can we make the node specification with multiple huge page sizes invalid at the initial node capacity discovery level?

Member Author:


i guess that is fine. the kubelet could fail startup if it detects multiple huge page sizes have been pre-allocated.

@dchen1107 (Member):

/lgtm

We can address several comments in separate prs. Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 16, 2017
@k8s-github-robot:

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit f5520e2 into kubernetes:master Aug 16, 2017
luxas pushed a commit to luxas/kubernetes that referenced this pull request Aug 22, 2017
Automatic merge from submit-queue (batch tested with PRs 50893, 50913, 50963, 50629, 50640)

bump(github.com/google/cadvisor): 27e1acbb4ef0fe1889208b21f8f4a6d0863e02f6

Bump cadvisor dep to include google/cadvisor@556c7b1.

This is required for supporting huge pages for 1.8

kubernetes/community#837
kubernetes/enhancements#275

@dashpole @derekwaynecarr @eparis @jeremyeder
nicksardo pushed a commit to nicksardo/kubernetes that referenced this pull request Sep 5, 2017
…ture

Automatic merge from submit-queue

HugePages feature

**What this PR does / why we need it**:
Implements HugePages support per kubernetes/community#837

Feature track issue: kubernetes/enhancements#275

**Special notes for your reviewer**:
A follow-on PR is opened to add the EmptyDir support.

**Release note**:
```release-note
Alpha support for pre-allocated hugepages
```
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Automatic merge from submit-queue

HugePages proposal

A pod may request a number of huge pages.  The `scheduler` is able to place the pod on a node that can satisfy that request.  The `kubelet` advertises an allocatable number of huge pages to support scheduling decisions. A pod may consume hugepages via `hugetlbfs` or `shmget`.  

Planned as Alpha for Kubernetes 1.8 release.

See feature: kubernetes/enhancements#275

@kubernetes/sig-scheduling-feature-requests
@kubernetes/sig-scheduling-misc
@kubernetes/sig-node-proposals
@kubernetes/api-approvers
@kubernetes/api-reviewers
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023