
Add local storage (scratch space) allocatable support #46456

Merged
2 commits merged on Jun 3, 2017

Conversation

@jingxu97 (Contributor) commented May 25, 2017

This PR adds support for allocatable local storage (scratch space).
This feature only applies to the root file system, which is shared by Kubernetes
components, users' containers, and/or images. Users can use the
--kube-reserved flag to reserve storage for kube system components.
If the allocatable storage for users' pods is used up, some pods will be
evicted to free the storage resource.

This feature is part of local storage capacity isolation and described in the proposal kubernetes/community#306
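As a rough illustration of the arithmetic described above, here is a minimal, self-contained Go sketch; the values and names are made up for illustration and are not the PR's code:

package main

import "fmt"

func main() {
	// Hypothetical sizes, in bytes.
	capacity := int64(100 << 30)    // 100Gi primary-partition capacity
	kubeReserved := int64(10 << 30) // storage reserved via --kube-reserved
	evictionHard := int64(5 << 30)  // hard-eviction threshold headroom

	// Storage allocatable to user pods; exhausting it triggers evictions.
	allocatable := capacity - kubeReserved - evictionHard
	fmt.Printf("allocatable scratch space: %dGi\n", allocatable>>30)
}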

Release note:

This feature exposes local storage capacity for the primary partitions, and supports & enforces storage reservation in Node Allocatable 

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on May 25, 2017
@jingxu97 jingxu97 requested review from vishh and dashpole May 25, 2017 19:42
@k8s-github-robot k8s-github-robot added the following labels on May 25, 2017: needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD), kind/api-change (categorizes issue or PR as related to adding, removing, or otherwise changing an API), size/XL (denotes a PR that changes 500-999 lines, ignoring generated files), and release-note-label-needed
@jingxu97 jingxu97 requested review from dashpole and derekwaynecarr and removed request for dashpole May 25, 2017 19:42
@dashpole (Contributor) left a comment:

First pass. I'll look into what changes to machineInfo it would take to be able to determine rootfs capacity.

func (cm *containerManagerImpl) GetNodeAllocatableReservation() v1.ResourceList {
evictionReservation := hardEvictionReservation(cm.HardEvictionThresholds, cm.capacity)
if !hasStorageInfoInCapacity(cm.capacity) {
Contributor:

You don't need a helper function here, as it is just one line:
if _, ok := cm.capacity[v1.ResourceStorage]; !ok {
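For clarity, here is a compilable stand-alone version of the suggested inline check; ResourceList and ResourceName are stand-in types so the snippet runs on its own (v1.ResourceStorage is the real constant):

package main

import "fmt"

type ResourceName string
type ResourceList map[ResourceName]string // stand-in for v1.ResourceList

const ResourceStorage ResourceName = "storage"

func main() {
	capacity := ResourceList{"cpu": "4", "memory": "16Gi"} // no storage entry yet
	// The reviewer's one-liner: a plain map lookup needs no helper function.
	if _, ok := capacity[ResourceStorage]; !ok {
		fmt.Println("no storage capacity reported; skip storage reservation")
	}
}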

Contributor Author (jingxu97):

fixed

@@ -38,6 +38,8 @@ const (
SignalImageFsInodesFree Signal = "imagefs.inodesFree"
// SignalAllocatableMemoryAvailable is the amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods)), in bytes.
SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available"
// SignalAllocatableStorageScratchAvailable is the amount of storage available for pod allocation
SignalAllocatableStorageScratchAvailable Signal = "allocatableStorageScratch.available"
Contributor:

While we have begun calling rootfs "Scratch" space, it would probably be easier to read the code if it were named similarly to the node-level signal. Something like SignalAllocatableNodeFsAvailable.

// build the ranking functions (if not yet known)
// TODO: have a function in cadvisor that lets us know if global housekeeping has completed
if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
// this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
if err != nil {
if errImage != nil {
Contributor:

Should we move the error check next to the call? Why only check for errors inside the if statement?

Contributor Author (jingxu97):

From the existing code, I think the logic is: if resourceToRankFunc and resourceToNodeReclaimFuncs are already defined, it does not matter whether errImage is nil or not.
I modified the code here a little more. Instead of calling HasDedicatedImageFs on every sync call, I cache the result in a map. PTAL
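A minimal sketch of that caching pattern, with stand-in names (the merged code differs in detail):

package main

import "fmt"

type probeCache struct {
	results map[string]bool // loosely mirrors "put it in a map"
}

// hasDedicatedImageFs memoizes an expensive probe so later sync passes reuse
// the first successful answer instead of re-querying cadvisor every time.
func (p *probeCache) hasDedicatedImageFs(query func() (bool, error)) (bool, error) {
	if v, ok := p.results["imagefs"]; ok {
		return v, nil
	}
	v, err := query()
	if err != nil {
		// cadvisor housekeeping may not have completed; try again next pass.
		return false, err
	}
	p.results["imagefs"] = v
	return v, nil
}

func main() {
	p := &probeCache{results: map[string]bool{}}
	probe := func() (bool, error) { return true, nil } // stand-in for diskInfoProvider
	v, _ := p.hasDedicatedImageFs(probe)
	fmt.Println("dedicated imagefs:", v)
}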

@@ -391,10 +391,13 @@ func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceNam
glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
continue
}
value.available.Add(*reclaimed)
if reclaimed != nil {
Contributor:

We discussed this offline. reclaimed cannot be nil here.

Contributor Author (jingxu97):

fixed


// evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
glog.V(3).Infof("eviction manager: reclaim goal reached for resource %v: ", resourceToReclaim)
Contributor:

We already log something in synchronize on line 311 if we are able to reclaim enough resources. This is not needed.

Contributor Author (jingxu97):

fixed

@@ -83,6 +86,7 @@ func init() {
signalToResource[evictionapi.SignalImageFsInodesFree] = resourceImageFsInodes
signalToResource[evictionapi.SignalNodeFsAvailable] = resourceNodeFs
signalToResource[evictionapi.SignalNodeFsInodesFree] = resourceNodeFsInodes
signalToResource[evictionapi.SignalAllocatableStorageScratchAvailable] = resourceNodeFs
resourceToSignal = map[v1.ResourceName]evictionapi.Signal{}
Contributor:

Are you still planning to remove resourceToSignal?

Contributor Author (jingxu97):

Yes, I will do it in a separate PR. Basically, instead of a 1-to-1 mapping from resource to signal, we will use a 1-to-many map with a list of signals.
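In other words, a sketch of the planned follow-up, with illustrative types and entries:

package main

import "fmt"

type Signal string
type ResourceName string

func main() {
	// Today: each resource maps to exactly one signal.
	resourceToSignal := map[ResourceName]Signal{
		"nodefs": "nodefs.available",
	}
	// Planned: each resource maps to a list of signals.
	resourceToSignals := map[ResourceName][]Signal{
		"nodefs": {"nodefs.available", "allocatableStorageScratch.available"},
	}
	fmt.Println(resourceToSignal, resourceToSignals)
}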

}

// Pass podTestSpecsP as references so that it could be set up in the first BeforeEach clause
func runLocalStorageEvictionTest(f *framework.Framework, conditionType v1.NodeConditionType, testCondition string, podTestSpecsP *[]podTestSpec, evictionTestTimeout time.Duration,
Contributor:

Can you reuse the version in the inode eviction test suite? If not, can we refactor them to avoid duplication?

Contributor Author (jingxu97):

I tried a few things; only passing the reference works for me. I plan to refactor this code in a separate PR, is that OK?

Contributor:

Sure, that can happen after code freeze, if necessary.

@@ -100,3 +100,15 @@ func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapi.MachineInfo, error) {
}
return kl.machineInfo, nil
}

// GetCachedRootFsInfo assumes that the rootfs info can't change without a reboot
func (kl *Kubelet) GetCachedRootFsInfo() (cadvisorapiv2.FsInfo, error) {
Contributor:

My thinking right now is that it would make more sense to change cAdvisor to include the information we need in MachineInfo. IIRC, the only piece we are missing is the mountpoint for the disks, which would let us determine which filesystem is the RootFs. I am happy to look into this in the next couple days to see if it is feasible. That would greatly simplify this PR.

Contributor Author (jingxu97):

Thanks! I will check that too, but we might also need to add this after code freeze.

Contributor:

Yep, that is fine.
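For context, the caching pattern GetCachedRootFsInfo implies can be sketched like this, with stand-in types and names; rootfs info is assumed stable until reboot, per the comment in the diff:

package main

import (
	"fmt"
	"sync"
)

type FsInfo struct{ Capacity uint64 } // stand-in for cadvisorapiv2.FsInfo

type cachedFs struct {
	once sync.Once
	info FsInfo
	err  error
}

// get queries the filesystem once and serves the cached result afterwards.
// Note: with sync.Once even a transient error is cached; the real code may
// want to retry failures instead.
func (c *cachedFs) get(query func() (FsInfo, error)) (FsInfo, error) {
	c.once.Do(func() { c.info, c.err = query() })
	return c.info, c.err
}

func main() {
	c := &cachedFs{}
	info, err := c.get(func() (FsInfo, error) {
		return FsInfo{Capacity: 100 << 30}, nil // stand-in for a cadvisor query
	})
	fmt.Println(info.Capacity, err)
}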

@k8s-github-robot k8s-github-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on May 31, 2017
@derekwaynecarr (Member) left a comment:

Requested a few changes.

//Account for storage requested by emptydir volumes
for _, vol := range pod.Spec.Volumes {
if vol.EmptyDir != nil {
result.StorageScratch += vol.EmptyDir.SizeLimit.Value()
Member:

This needs to exclude emptyDir volumes that are backed by memory.
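Concretely, the check this asks for might look like the following sketch, using stand-in types (the real code compares against v1.StorageMediumMemory):

package main

import "fmt"

// Stand-in for v1.EmptyDirVolumeSource.
type EmptyDirVolumeSource struct {
	Medium    string // "" means default (disk-backed); "Memory" means tmpfs
	SizeLimit int64
}

// countsAsScratch reports whether an emptyDir should be charged against
// scratch space: memory-backed emptyDirs consume RAM, not disk.
func countsAsScratch(src *EmptyDirVolumeSource) bool {
	return src != nil && src.Medium != "Memory"
}

func main() {
	fmt.Println(countsAsScratch(&EmptyDirVolumeSource{Medium: "Memory"})) // false
	fmt.Println(countsAsScratch(&EmptyDirVolumeSource{}))                 // true
}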

Contributor Author (jingxu97):

fixed

Name: "emptyDirVolumeName",
VolumeSource: v1.VolumeSource{
EmptyDir: &v1.EmptyDirVolumeSource{
SizeLimit: *resource.NewQuantity(scratch, resource.BinarySI),
Member:

Set the medium to not be memory.

Contributor Author (jingxu97):

fixed

reasons []algorithm.PredicateFailureReason
}{
{
pod: newResourcePod(schedulercache.Resource{MilliCPU: 1, Memory: 1, StorageOverlay: 1}),
Member:

Add a test for a memory-backed emptyDir.

Contributor Author (jingxu97):

Added a test for the memory medium.

//Account for storage requested by emptydir volumes
for _, vol := range pod.Spec.Volumes {
if vol.EmptyDir != nil {
res.StorageScratch += vol.EmptyDir.SizeLimit.Value()
Member:

It would be good if we had a utility function for this, as this loop is repeated in multiple places.

Also need to ignore memory-medium emptyDirs.
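A sketch of the utility function being requested, which both predicates.go and node_info.go could share (stand-in types, illustrative name):

package main

import "fmt"

type EmptyDir struct {
	Medium    string
	SizeLimit int64
}

type Volume struct{ EmptyDir *EmptyDir }

// emptyDirScratchRequest sums a pod's emptyDir scratch-space requests,
// skipping memory-medium volumes as noted above.
func emptyDirScratchRequest(volumes []Volume) int64 {
	var total int64
	for _, vol := range volumes {
		if vol.EmptyDir != nil && vol.EmptyDir.Medium != "Memory" {
			total += vol.EmptyDir.SizeLimit
		}
	}
	return total
}

func main() {
	vols := []Volume{
		{EmptyDir: &EmptyDir{SizeLimit: 1 << 30}},                   // disk-backed: counted
		{EmptyDir: &EmptyDir{Medium: "Memory", SizeLimit: 1 << 30}}, // tmpfs: ignored
	}
	fmt.Println(emptyDirScratchRequest(vols)) // prints 1073741824 (1Gi)
}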

Contributor Author (jingxu97):

Fixed. This loop only appeared in predicates.go and node_info.go; could we consider adding a utility if more places need it?

Contributor:

I'd imagine tests also using this utility function.

@jingxu97 jingxu97 force-pushed the May/allocatable branch 5 times, most recently from f0e5ded to 0235873 on May 31, 2017 21:51
@jingxu97 (Contributor Author) commented May 31, 2017

Addressed the comments. @derekwaynecarr @dashpole PTAL

@dashpole (Contributor):

eviction manager changes lgtm

@@ -0,0 +1,322 @@
/*
Copyright 2016 The Kubernetes Authors.
Contributor:

This is an odd file name. I think Go recommends using _ instead of camelCase.

Contributor Author (jingxu97):

I changed the file name

if hasDedicatedImageFs, err := kl.HasDedicatedImageFs(); err != nil && hasDedicatedImageFs {
imagesfs, err := kl.ImagesFsInfo()
if err != nil {
node.Status.Capacity[v1.ResourceStorageOverlay] = resource.MustParse("0Gi")
Contributor:

Shouldn't we fail kubelet in this case? Why would kubelet lie that imagefs doesn't exist when it fails to collect stats about it?

Contributor Author (jingxu97):

I followed the logic in GetCachedMachineInfo. Should we use different behavior here?
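For comparison, the fail-fast alternative raised above could be sketched as follows (stand-in names; not the PR's code):

package main

import (
	"errors"
	"fmt"
)

type FsInfo struct{ Capacity uint64 }

// overlayCapacity propagates the stats error instead of reporting a fake
// "0Gi" capacity when imagefs stats cannot be collected.
func overlayCapacity(imagesFsInfo func() (FsInfo, error)) (uint64, error) {
	info, err := imagesFsInfo()
	if err != nil {
		return 0, fmt.Errorf("imagefs stats unavailable: %w", err)
	}
	return info.Capacity, nil
}

func main() {
	_, err := overlayCapacity(func() (FsInfo, error) {
		return FsInfo{}, errors.New("cadvisor not ready")
	})
	fmt.Println(err)
}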

@k8s-github-robot k8s-github-robot added the kind/api-change label (categorizes issue or PR as related to adding, removing, or otherwise changing an API) on Jun 1, 2017
@jingxu97 jingxu97 added the lgtm label ("Looks good to me"; indicates that a PR is ready to be merged) on Jun 1, 2017
@k8s-github-robot k8s-github-robot removed the lgtm label on Jun 1, 2017
@jingxu97 jingxu97 force-pushed the May/allocatable branch 2 times, most recently from 491253e to 19c5b3c on June 1, 2017 19:26
@jingxu97 jingxu97 added the lgtm label on Jun 1, 2017
@k8s-github-robot k8s-github-robot removed the lgtm label on Jun 1, 2017
@jingxu97 jingxu97 added the lgtm label on Jun 1, 2017
@davidopp (Member) commented Jun 1, 2017

/approve

@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidopp, jingxu97, thockin, vishh

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@fejta (Contributor) commented Jun 1, 2017

@k8s-bot pull-kubernetes-federation-e2e-gce test this
ref kubernetes/test-infra#2931

@k8s-github-robot k8s-github-robot removed the lgtm label on Jun 1, 2017
@jingxu97 (Contributor Author) commented Jun 1, 2017

@k8s-bot pull-kubernetes-unit test this

This PR adds support for allocatable local storage (scratch space).
This feature only applies to the root file system, which is shared by Kubernetes
components, users' containers, and/or images. Users can use the
--kube-reserved flag to reserve storage for kube system components.
If the allocatable storage for users' pods is used up, some pods will be
evicted to free the storage resource.
This PR adds a check for local storage requests when admitting pods. If
the local storage request exceeds the available resource, the pod will be
rejected.
@jingxu97 (Contributor Author) commented Jun 1, 2017

@k8s-bot pull-kubernetes-unit test this

@jingxu97 (Contributor Author) commented Jun 1, 2017

@k8s-bot pull-kubernetes-e2e-kops-aws test this

@jingxu97 jingxu97 added the lgtm label on Jun 2, 2017
@k8s-github-robot

Automatic merge from submit-queue
