
[Proposal] Improve Local Storage Management #306

Merged: 19 commits merged into kubernetes:master on May 7, 2017

Conversation

vishh (Contributor) commented Jan 30, 2017

A note to reviewers: A detailed design proposal will be posted once the overall design in this proposal has been accepted by the community. So kindly hold off on specific design questions or suggestions that are not relevant to the overall high-level design for local storage.

cc @kubernetes/sig-storage-proposals @kubernetes/sig-node-proposals @kubernetes/sig-apps-proposals @kubernetes/sig-scheduling-proposals

cc @msau42

For kubernetes/enhancements#121

TODO:

  • Handle scheduling of runtime primary partition capacity
  • Propose design choices for mitigating or avoiding Disk IO issues for system daemons, container logs and writable layer. EmptyDir IO isolation is out of scope because local PV can be used instead.

Signed-off-by: Vishnu Kannan <vishnuk@google.com>
k8s-ci-robot added the label cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) on Jan 30, 2017
6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy.
9. Once Alice decides to delete the database, the PVCs are expected to get deleted by the StatefulSet. PVs will then get recycled and deleted, and the addon adds it back to the cluster.
Member

Will this be global for all PVCs for StatefulSet going forward? Also, we will be depending on reasonable collection timeouts to ensure that users have time to collect data from Volumes after deletion (assuming they have a need to do so)?

Member

By default, the PVC will need to be deleted by the user to retain similar behavior as today. We are looking into an "inline" PVC feature that can automatically delete the PVCs when the StatefulSet gets destroyed. I'll update this to clarify that.

Regarding the retention policy, the PV can be changed to use the "Retain" policy if users need to collect data after deletion.
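For reference, a minimal sketch of that suggestion, assuming the hostPath-based local PV layout used elsewhere in this proposal (the name and mount path are illustrative):

```yaml
# Sketch: a local PV whose data survives PVC deletion for manual recovery.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-1                         # illustrative name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain    # released volume is kept, not scrubbed
  hostPath:
    path: /mnt/disks/ssd1                  # assumed well-known mount point
```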

local-pv-2 10Gi Bound log-local-pvc-3 node-3
```

6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
kow3ns (Member) commented Jan 30, 2017

So we are depending on priority and preemption to be implemented prior to this?

davidopp (Member) commented Jan 31, 2017

It seems like what's described in this section also relies on some variant of #7562 / #30044 being implemented, as today there is no notion of a local PV (beyond the experimental HostPath volume type which doesn't do what's needed here).

Member

Yes, this proposal is also covering this new local PV type.

Member

@kow3ns regarding priority and preemption, it is not a strict requirement for this feature, but will make the workflow smoother. There are also plans to implement this soon.

* Provide flexibility for users/vendors to utilize various types of storage devices
* Define a standard partitioning scheme for storage drives for all Kubernetes nodes
* Provide storage usage isolation for shared partitions
* Support random access storage devices only

Can you elaborate on what support for "random access storage devices only" means? Does this mean using RAM as storage?

Member

Good question. I took this to mean DASD (i.e. this will not work for Tape drives or other sequential access storage media). Is this not the case?

Member

Yes, it means not supporting tape.

* Support random access storage devices only

# Non Goals
* Provide isolation for all partitions. Isolation will not be of concern for most partitions since they are not expected to be shared.
Member

This could be written more concisely as "Provide usage isolation for non-shared partitions" which would also make it more parallel with the Goal "Provide storage usage isolation for shared partitions"

* Pods do not know how much local storage is available to them.
* Pods cannot request “guaranteed” local storage.
* Local storage is a “best-effort” resource
* Pods can get evicted due to other pods filling up the local storage during which time no new pods will be admitted, until sufficient storage has been reclaimed
Member

s/during which/after which/

Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are:

### Root
This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IO for example) from this partition.
Member

s/IO/IOPs/

3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi.
4. Alice’s pod is not provided any IO guarantees
5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi
6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation.
Member

s/foo/fooc/

Member

foo is correct because it's referring to the pod, and not the container.


Quota feature assumes an appropriate supporting file system is being used. A large part of the distributed storage systems require raw (no file system) storage. How would that be managed? Would a raw partition be created by a logical manager?

Member

We don't plan to support raw partitions as a primary partition. Secondary partitions can have block level support though.

emptyDir:
```

2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify bust ranges and a host of other features supported by LimitRange for local storage.
Member

Add a link to the LimitRange user guide that explains this.
(BTW did you mean "burst" rather than "bust"?)

capacity: 2Gi
```

6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intent to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods.
Member

s/intent/intend/
BTW I'm not clear about the connection between prohibiting a minimum storage requirement in LimitRange and overcommit. Won't the scheduler prohibit overcommit regardless of what storage requirement you give for the EmptyDir (regardless of whether it's set manually or via LimitRange)?

Member

I assume this means we will just not allow a request, only a limit, for volumes.

local-pv-1 100Gi RWO Delete Available node-3
local-pv-2 10Gi RWO Delete Available node-3
```
3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices becomes unhealthy.
Member

There is currently no notion of tainting PVs, only nodes. Can you say more about what semantics you are expecting for tainting a PV?

Member

We would like to evict pods that are using tainted PVs, unbind the PVC, and reschedule the pod so that it can bind to a different PV. I think everything after the eviction could be handled by a separate controller.


### Alice manages a Database which needs access to “durable” and fast scratch space

1. Cluster administrator provisions machines with local SSDs and brings up the cluster
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium.
Member

So there is always a 1:1 correspondence between PV and partition on a secondary device?

Member

Yes, this will let the cluster administrator decide how they want to provision local storage. They can have one partition per disk for IOPS isolation, or if sharing is ok, then create multiple partitions on a device.


I'm guessing this is based on the technology of the underlying filesystem. If not, then I think this depends a lot on some type of logical volume manager; otherwise only two things can happen: 1. a secondary partition is the entire disk, or 2. a lot of disk fragmentation. I think more information on how number 1 is done may shed more light on this model.

Member

Yes, it is up to the administrator to partition and create the filesystem first. And how that is done will depend on the partitioning tools (parted, mdadm, lvm, etc) available and which filesystems the administrator decides to use. From Kubernetes point of view, we will not handle creating partitions or filesystems.

Member

it is up to the administrator to partition and create the filesystem first

That's very inconvenient for an admin. Also, when such PV gets Released, who/how removes the data there and puts it back to Available? We'd like to deprecate recycler as soon as possible.

IMO, some sort of simple dynamic provisioning would be very helpful and it's super simple with LVM. It should be pluggable though to work on all other platforms.

Member

The current thought for the PV deletion workflow is to set the PV to the Released phase, delete all the files (similar to how EmptyDir cleans up), delete the PV object, and then the addon daemonset will detect the partition and create the PV for it again.

So from an admin's point of view, the partitioning and fs setup is just a one time step whenever new disks are added. And for the use case that we are targeting, which requires whole disks for IOPs guarantees, the setup is simple: one partition across the whole disk, and create the filesystem on that partition.

As for LVM, I agree it is a simpler user model, but we cannot get IOPs guarantees from it, which is what customers we've talked to want the most. I don't think this design will prevent supporting an LVM-based model in the future though. I can imagine there can be a "storageType = lvm" option as part of the PV spec, and a dynamic provisioner can be written to use that to carve out LVs from a single VG. The scheduling changes that we have to make to support local PVs can still apply to a lvm-based volume. We're just not prioritizing it right now based on user requirements.
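To make the idea concrete, a purely speculative sketch of such a PV (the `storageType: lvm` field does not exist in any API; it only illustrates the knob a future dynamic provisioner might act on):

```yaml
# Speculative sketch only: "storageType" is a hypothetical field.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-lvm-pv-1       # illustrative name
spec:
  capacity:
    storage: 50Gi
  storageType: lvm           # hypothetical: a provisioner could carve an LV from a VG
  hostPath:
    path: /dev/vg0/lv1       # assumed logical volume path
```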


I agree with @jsafrane that we should have some default out-of-the-box local disk PV provisioner, and for default cases we don't have to do an addon or some such thing. 90% of use cases might be just simple use of local disks.

Member

Based on feedback we have gotten from customers and workloads team, it's the opposite. Most of the use cases require dedicated disks. We have not seen many requests for dynamic provisioning of shared disks. If you see some real use cases where an app wants to use persistent local storage (and all its semantics), but doesn't need performance guarantees, then I would be interested in hearing about them as well.

I do want to make sure that nothing in this proposal would prevent LVM and dynamic provisioning from being supported in the future. And that it will be able to take advantage of the scheduling and failure handling features we will be adding.

In terms of admin work, my hope is that the default addon will require a similar amount of admin intervention as the LVM model (configure the disk once in the beginning, the system takes care of the rest).

```

6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
kow3ns (Member) commented Jan 31, 2017

This should occur only when the PV backing the PVC is permanently unavailable. If a controller creates a new PVC and relaunches the Pod with that PVC, it will never be able to reuse the data on the old PV anyway. To simplify this for controller developers, when some policy is applied to indicate that K8s should "give up" on recovering a PV, can we just delete the PV, and set the status of the PVC to pending? This would reduce the complexity of the interaction with DaemonSet, StatefulSet, and any other controllers and local persistent storage.

Member

This situation could also occur if the node has failed or can no longer fulfill other requested resources, for example, if other pods got scheduled and took up the cpu or memory needed.

The main concern with deleting the PV and keeping the PVC, is that it may not follow the retention policy. The user may want to recover data from the PV, but won't have the pod->PVC->PV binding anymore. As another alternative, we could remove the PVC->PV binding, and if the PV policy is retain, also add an annotation with the old pod, PVC information so the user can figure out which PV had their data.

Member

I like the idea of keeping the PVC and just removing the PVC->PV binding. If we expect the StatefulSet controller to modify the Pod to use a new PVC, that essentially means only the StatefulSet controller can perform the task of unblocking its unschedulable Pods. That in turn means that every controller needs to separately implement this behavior. For example, what if I have "stateless" Deployment Pods that want this behavior for their large caches on local PV?

If unblocking can be done without modifying the Pods to use a different PVC, then it leaves the door open to write a generic "local PV unbinding" controller that implements this behavior once for everyone who requests it via some annotation or field.

Member

The generic PVC unbinding controller can monitor for this error condition, unbind the PVC, clean up the PV according to the reclaim policy, and then evict and reschedule the pods to force them to obtain a new PV.


3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi.
4. Alice’s pod is not provided any IO guarantees
5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi
Member

How does this interact with kubectl logs? Right now we are aggregating and rolling stdout and stderr? Are you proposing that we use local storage instead of, or in addition to, the current K8s logging infra?

Member

It should have no impact on kubectl logs. It's only changing the log rotation mechanism to be on a per-container basis instead of on a node basis.


6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy.
kow3ns (Member) commented Jan 31, 2017

If you are targeting DBs and DFSs, and if a "taint" is really pertaining to a problem with the underlying storage media, I don't think anything in your target set will tolerate a taint. @davidopp shouldn't this be expressed by the controller in terms of declarative tolerations against node taints in any case. That is, don't I have to explicitly declare the taint as tolerated?

Member

Instead of having every controller/operator watch for the appearance of taints on a node and delete Pods should we consider the following approach?

  1. DBs and DFS should include a health check that causes the Pod to fail when the contained application can't write to storage media (most, if not all, storage applications will fail on errors returned from fsync/sync; see the sketch after this comment).
  2. When the application monitoring the storage device decides that the mounted PVs are unrecoverable, it should delete the PVs and mark the Bound PVCs as pending. The policy deciding when to do this can be applied here. Note that this is no scarier than having the controller make the decision to delete the PVC. In either case, once the Pod is disassociated from its local volume and launched with another, it can never be safely re-associated with the prior volume. Both cases also need a good story around snapshots and backup. I think that, as the device monitoring application is a node local agent, it can make a better decision about when to "give up" trying to bind a Pod to a local mount.
  3. As the volumes are deleted, we need not be concerned with the PVCs being fulfilled by this node unless it has volumes mounted on another, functional device.
  4. When controllers/operators recreate the Pods, their existing PVCs must be Bound to volumes provided by another node.

If we take an approach that is closer to this we don't have to duplicate the watch logic in every controller/operator.
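A minimal sketch of point 1, assuming the application image can run a shell and the local volume is mounted at /data (both assumptions; the check command is illustrative):

```yaml
# Sketch of a container-level probe that fails the Pod when the local
# volume can no longer be written (e.g. the device is returning IO errors).
livenessProbe:
  exec:
    command: ["sh", "-c", "touch /data/.probe && sync"]
  periodSeconds: 30
  failureThreshold: 3
```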

Member

If I understand correctly, are you suggesting to leave it up to the application to handle local storage failures, since each application may have its own unique requirements and policies?

Member

Sorry if I was not clear. I am saying the opposite. The "application monitoring the storage device" referred to above is based on the design statement that " Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints."
Rather than have every controller/operator attempt to heuristically guess when it should delete a PVC, might it not be better to have the "Node Problem Detector", kubelet, or another local agent make a decision that the volume is no longer usable due to device failure, and to set the associated PVC back to pending? Perhaps using your suggestion above to retain the volume for data recovery purposes. I can't think of a distributed storage application that will want to re-balance or re-replicate its data due to a temporary network partition or intermittent node failure. The only time, IMO, that we'd want a controller/operator to move a Pod with local PVs to a new node is if the storage device failed, or if the MTTR is so high that it might as well have. In the former case it might be best if a node local agent made the decision that the storage device is failed. In the latter case, we should at least consider having a global controller with a policy that can unbind local PVs from PVCs, rather than having every controller/operator implement its own policy.

Member

I don't think the controller (e.g. StatefulSet) should be responsible for deleting Pods. I think @kow3ns is also saying that if I'm reading him correctly.

My understanding is that regular Node taints are noticed and enforced by kubelet, which may evict the Pod if it doesn't tolerate the taint. Wouldn't it make sense for kubelet to also evict the Pod if it does not tolerate a taint on one of its local PVs?

If recreated with the same PVC, the Pod would remain unschedulable due to the taint on the PV. At this point, the problem is reduced to being the same as (7) above. In this way, both (7) and (8) can be handled without necessarily requiring any changes to StatefulSet or other controllers (if a generic controller can be implemented as suggested above).

Member

Currently taints are only at the node level, but I think it could be worth looking into expanding them, as taints already have a flexible interface for specifying, per pod, the tolerations and forgiveness for each taint. This workflow could also work for the case when the node fails or becomes unavailable. @davidopp

Then, when the pod gets evicted due to the taint, it reduces the problem to (7), as mentioned above.

Member

It is also possible to implement this without taints, and instead add an error state to the PV, and have a controller monitor for the error state and evict pods that way. But using taints may be nice as a future enhancement to unify the API.

capacity: 2Gi
```

6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intent to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods.
Member

intent=intend

Storage-overlay: 200Mi
type: Container
- default:
storage: 1Gi
Member

To clarify, each emptyDir-backed volume will pick up this default capacity? So if a user had multiple emptyDirs for some reason, each would get 1Gi?

Member

Yes, it's the limit per emptydir. You bring up a good point though, that the "type: Pod" implies that it's for the whole pod. We can change it to "type: EmptyDir"
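Putting that together, the LimitRange might read roughly as follows (a sketch using this proposal's flat resource names; `type: EmptyDir` is the rename suggested here, not an existing API constant):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: storage-limits
  namespace: myns
spec:
  limits:
  - type: Container
    default:
      storage-logs: 200Mi       # flat keys, since LimitRange cannot nest limits
      storage-overlay: 200Mi
  - type: EmptyDir              # suggested rename of "type: Pod" from this thread
    default:
      storage: 1Gi              # default capacity applied to each emptyDir volume
```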

volumes:
name: myEmptyDir
emptyDir:
capacity: 1Gi
Member

Do you worry users will get confused with this field only being meaningful when the medium is disk and not memory?

Member

It can be used for memory-backed emptydir too.
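For example, under this proposal's syntax the same `capacity` field would cap a memory-backed emptyDir as well (proposal syntax, not the final API):

```yaml
volumes:
- name: myEmptyDir
  emptyDir:
    medium: Memory    # tmpfs-backed
    capacity: 1Gi     # proposed per-volume cap; applies regardless of medium
```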

name: foo
spec:
containers:
name: fooc
Member

I think there's a missing - here.

storage-logs: 500Mi
storage-overlay: 1Gi
volumes:
name: myEmptyDir
Member

Missing -.

name: fooc
resources:
limits:
storage-logs: 500Mi
Member

Is there a reason for doing it this way rather than:

limits:
  storage:
    logs: 500Mi
    overlay: 1Gi

?

Member

It's a limitation in the LimitRange design. It doesn't support nesting of limits.

labels:
storage.kubernetes.io/medium: ssd
spec:
volume-type: local
Member

Wouldn't the convention be volumeType?

storage.kubernetes.io/medium: ssd
spec:
volume-type: local
storage-type: block
Member

Should this be storageType? Also, having both volumeType and storageType seems confusing. Not sure what else these could be called though.

Member

Would storageLevel be better?


# Design Overview

A node’s local storage can be broken into primary and secondary partitions.
Member

Were other options considered? Especially LVM would allow us to add / remove devices on hosts and RAID 0/1/5 per volume with very little overhead. With partitions, you must do all this manually.

Member

In the case of persistent local storage, most of the use cases we have heard about prioritize performance and being able to use dedicated disks.

In addition, LVM is only available on Linux, so it could be difficult to use as a generic solution.


I'm assuming Primary and Secondary partitions are logical objects which can be implemented in multiple ways. Do you mind elaborating on possible implementations?

Member

Is this the kind of information you were looking for?

  • Using an entire disk (this is the primary use case for persistent local storage)
  • Adding multiple disks into a RAID volume
  • Using LVM to carve out multiple logical partitions (if you don't need IOPs guarantees)

## Persistent Local Storage
Distributed filesystems and databases are the primary use cases for persistent local storage due to the following factors:

* Performance: On cloud providers, local SSDs give better performance than remote disks.

Will performance include a QoS IOPS requirement for distributed storage systems?

Member

The PVs will have to be created by the admin/addon that utilizes the entire disk to guarantee IOPs for performance use cases.


Not sure I understand correctly, but why do PVs have to be full-disk? Why not a properly aligned partition?

Member

We're not requiring that the PV has to use the whole disk (the volume is created on a partition), but if you need IOPS guarantees, then it should be a dedicated disk. Especially for rotational disks, the IO will still end up being on a shared path at the device layer.

SSDs may offer high enough IOPS that you can share them.

vishh (Contributor, Author)

As @msau42 mentioned, the API that Kubernetes would consume is a logical partition. It can map to any storage configuration (RAID, JBOD, etc.). We recommend not sharing spinning disks unless either the storage configuration or the IOPS requirements permit sharing them.

This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition.

## Secondary Partitions
All other partitions are exposed as persistent volumes. The PV interface allows varying storage configurations to be supported, while hiding specific configuration details from the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage.

When they are exposed as PVs, are they created and available as a pool? Do you mind elaborating a little more on Secondary Partitions? Who creates them, how are they managed, what sizes, etc.

Member

Yes, the workflow mentions this a little bit. I will also add it here.

There will be an addon daemonset that can discover all the secondary partitions on the local node and create PV objects for them. The capacity of the PV will be the size of the entire partition. We can provide a default daemonset that will look for the partitions under a known directory, and it is also possible to write your own addons for your own environment.

So the PVs will be statically created and not dynamically provisioned, but the addon can be long running and create new PVs as disks are added to the node.
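Assembled from the fragments quoted in this thread, a PV created by such an addon might look roughly like this (the node annotation key and mount path are assumptions for illustration):

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-1
  labels:
    storage.kubernetes.io/medium: ssd     # medium label from this proposal
  annotations:
    storage.kubernetes.io/node: node-3    # assumed annotation tying the PV to a node
spec:
  capacity:
    storage: 100Gi                        # size of the entire partition
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  hostPath:
    path: /mnt/disks/ssd1                 # discovered under the well-known directory
```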


capacity: 20Gi
```

3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi.

How are the guarantees provided by the system? The FS? A logical volume?

Member

For primary partitions, the node's local storage capacity will be exposed so that the scheduler can take into account a pod's storage limits and what nodes can satisfy that limit.

Then kubelet will monitor the storage usage of the emptydir volumes and containers so that they stay within their limits. If quota is supported, then it will use that.
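Assembled from the spec fragments quoted in this review, the accounting described here looks roughly like the following (this proposal's syntax, not the final API; the three caps sum to the 21.5Gi total above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
  - name: fooc
    resources:
      limits:
        storage-logs: 500Mi     # kubelet rotates logs to stay under this
        storage-overlay: 1Gi    # writable-layer cap, enforced via quota if available
  volumes:
  - name: myEmptyDir
    emptyDir:
      capacity: 20Gi            # counts toward the pod's 21.5Gi total
```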

4. Alice’s pod is not provided any IO guarantees
5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi
6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation.
7. With hard limits, containers will receive a ENOSPACE error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet.

Assuming the FS supports such a feature.

Member

Yes, I will mention otherwise kubelet can only enforce soft limits.

6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation.
7. With hard limits, containers will receive a ENOSPACE error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet.
8. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints.
9. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes.

I know number 8 showed there will be a Health monitor, but how would it detect that the primary partition is unhealthy on number 9? What does it mean to be unhealthy?

Member

Now that I think about it more, health monitoring is dependent on the environment and configuration, so an external monitor may be needed for both primary and secondary.

It can monitor at various layers depending on how the partitions are configured:

  • disk layer: look at SMART data
  • raid layer: look for complete raid failure (non-recoverable)


How to do disk health monitoring if the Node is a VM and disk is a virtual disk? The smartctl or raid tools may not return correct data.

Member

That's a good point. Because the partition configuration is very dependent on the environment, I think we cannot do any monitoring ourselves. Instead, we can define a method for external monitors to report errors, and also define how kubernetes will react to those errors.


Does our proposal/design require this health monitor? Let's say in the default configuration, when there is no external health monitor, what is the behavior?

Member

The health monitor is not required. In that case, it will behave the same way that it does today, which is undefined.

emptyDir:
```

2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify bust ranges and a host of other features supported by LimitRange for local storage.

This part is a little bit confusing "His cluster administrator being aware....". Does that mean that this solution would require the administrator to take action or things may be incorrectly allocated?

Member

In order to solve today's local storage isolation problem, pods should specify limits for their local storage usage. In the absence of that, the administrator has the option to specify defaults for the namespace. If neither of those two occur, then you just have the same issue today.

```

4. Bob’s “foo” pod can use up to “200Mi” for its containers’ logs and writable layer each, and “1Gi” for its “myEmptyDir” volume.
5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher “capacity” for his EmptyDir volumes.

The concern I have here is that it requires a lot of interaction with an administrator and the user. If I am "Bob", I'm just going to keep asking for more storage (1, then 2, then .. ). That would move the Pod from node to node satisfying the storage size request. I'm guessing... How different is this from the current model?

Member

Yes there is a little bit of a trial and error going on here for Bob. But as an application developer, you will have to do this in order to size your apps appropriately. One goal that we're trying to achieve here is provide pods better isolation from other pods running on that node through storage isolation.

2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium.

```yaml
kind: PersistentVolume

Just to clear my confusion, these all are created by hand?

Member

They could be created by hand, or if you put the partitions in a known directory, then the addon daemonset can discover the partitions and automatically create the PVs.

lpabon commented Feb 2, 2017

@msau42 FYI, I appreciate the quick turnaround! 👏

rootfs (Contributor) commented Feb 2, 2017

@msau42 here is a related use case for local storage handling from @fabiand
kubernetes/kubernetes#38606

4. Bob creates a specialized controller (Operator) for his distributed filesystem and deploys it.
5. The operator will identify all the nodes that it can schedule pods onto and discovers the PVs available on each of those nodes. The operator has a label selector that identifies the specific PVs that it can use (this helps preserve fast PVs for Databases for example).
6. The operator will then create PVCs and manually bind to individual local PVs across all its nodes.
7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority and have them use all the PVCs create by the Operator on those nodes.
Member

*created
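A sketch of the label-selector idea in step 5, with made-up label values (the `selector` field is part of the existing PVC API):

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dfs-data-node-3                     # hypothetical claim created by the operator
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      storage.kubernetes.io/medium: hdd     # steer the DFS to slower disks, keeping ssd PVs for databases
```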

* Local Persistent Volume bindings happening in the scheduler vs in PV controller
* Should the PV controller fold into the scheduler
* Supporting dedicated partitions for logs and volumes in Kubelet in addition to runtime overlay filesystem
* This complicates kubelet. Not sure what value it adds to end users.
Member

Logs should usually not accumulate as they should be collected to a central location.
--> no need for separation

Overlay FS data can be used, but for heavy use or increased storage needs we do recommend and provide emptyDirs and the new local PVs.
--> no need for separation

As emptyDirs might be used for caches and heavy IO, it might make sense to let this be separated from the planned root PV.

Complicating the Kubelet for logs and overlay doesn't seem to make sense. We should definitely think about the usage pattern of emptyDirs after local PVs are available.
Would we recommend local PV usage for heavy IO caches instead of emptyDirs? If yes, then we might leave emptyDirs inside the root PV and let the user know that for anything serious he might need to migrate away from emptyDirs.

Definitely needs to be clearly documented what use-cases each one solves.

Member

Yes, since we don't plan to provide IOPS isolation for emptydir, local PV should be used instead for those use cases. One question we have: are there use cases that need ephemeral IOPS guarantees that cannot be adapted to use local PVs? Do we need to look into an "inline" local PV feature where the PV gets created and destroyed with the pod?

Member

Depends on the automation of creating and using local PVs.
EmptyDirs work great without having to involve the cluster admin.
Local PVs most likely need cluster admin intervention. Maybe not always, but it's not 100% automated.

The path I see as reasonable would be:

  • Leave emptyDir as "best-effort" scratch pad.
  • Recommend local PVs for guaranteed IOPS.
  • First iteration having to use manual cluster admin action
  • Iterate on automating local PVs to bring them closer to emptyDir and PDs aka provide local PVs via dynamic provisioning

This would lead to no huge complexity additions in the kubelet as root, emptyDir, log and overlay FS are kept on the primary partition in the first iteration.

As additional note:
Persistent Volume as a name seems confusing, especially when we recommend it as an IOPS-guaranteed scratch pad. (Maybe: Local Disk?)

Member

Good plan! LocalDisk as the actual volume plugin name sounds good.

Member

For clarification. We would have:

  • PersistentDisk (networked "unfailable" disk)
  • emptyDir (shared temporary volume without guarantees)
  • LocalDisk (local volume with guarantees, which might have some persistence)
  • hostPath (local volume for testing)
  • all the provider specific stuff, flexVolume, gitRepo and k8s API backed volumes.

Member

We are planning to recommend using LocalDisk only through the PV/PVC interfaces for the following reasons:

  • In failure scenarios, like the node failing, you may want to give up on the local disk and find a new one to use. You can do that by unbinding the PVC from the PV, instead of having to change the volume in the pod spec
  • If you use local disk directly, it would be very similar to HostPath volumes, and have all its problems: you have to specify the path, understand the storage layout of the node, and understand whether that particular volume can satisfy the pod's capacity needs. The PV interface hides those details.
  • The PV interface gives a way to pool all the local volumes across the entire cluster and easily query for them, and find ones that will fit a pod's requirements.

Member

Thanks for the clarification. Always great to have that documented.

So recommendation:
PD + LD using PVC
emptyDir + hostPath used directly

Small addition:
The notion of persistence that using PV/PVCs projects onto LD could create some confusion.

Member

I will update this doc to clarify that, thanks! I agree the PV name could be misleading, since the local disk can only offer semi-persistence and has different semantics than normal PVs. I can add a section about the different semantics. Also, because of the different behavior and its very targeted use cases, I want to make sure that in the API layer the user explicitly selects this option, and that they cannot use a local PV "accidentally".

thockin (Member) commented May 5, 2017

/lgtm
/approve

k8s-ci-robot added the label lgtm ("Looks good to me", indicates that a PR is ready to be merged) on May 5, 2017
vishh (Contributor, Author) commented May 7, 2017

This proposal has gotten amazing feedback. However it has reached a point where it's too large to continue discussing further. I'm merging this PR based on @thockin's approval. As we implement the features in this proposal, some aspects of this proposal can (and possibly will) change. Expect this proposal to evolve over the next year as local storage features get added to kubernetes.
If you have any further feedback for this proposal, kindly open up a separate patch updating the proposal directly.

If someone feels strongly about any pending comments, I'm happy to revert the merge and continue discussing if necessary.

vishh merged commit 41e99f0 into kubernetes:master on May 7, 2017
msau42 pushed a commit to msau42/kubernetes that referenced this pull request May 22, 2017
Automatic merge from submit-queue

LocalStorage api

**What this PR does / why we need it**:
API changes to support persistent local volumes, as described [here](kubernetes/community#306)

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
Part of kubernetes#43640

**Special notes for your reviewer**:
There were a few items I was concerned about.  Will add review comments in those places.

**Release note**:

NONE

Note will be added in subsequent PR with the volume plugin changes
jingxu97 added a commit to jingxu97/kubernetes that referenced this pull request May 31, 2017
This PR adds the new APIs to support storage capacity isolation as described in the proposal
kubernetes/community#306

1. Add SizeLimit for emptyDir volume
2. Add scratch and overlay storage type used by container level or
node level
gnufied pushed a commit to gnufied/kubernetes that referenced this pull request Jun 1, 2017
Automatic merge from submit-queue

Add Local Storage Capacity Isolation API

This PR adds the new APIs to support storage capacity isolation as
described in the proposal [kubernetes/community#306

1. Add SizeLimit for emptyDir volume
2. Add scratch and overlay storage type used by container level or
node level


**Release note**:

```release-note
Alpha feature: Local volume Storage Capacity Isolation allows users to set storage limit to isolate EmptyDir volumes, container storage overlay, and also supports allocatable storage for shared root file system. 
```
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Jun 3, 2017
Automatic merge from submit-queue

Add local storage (scratch space) allocatable support

This PR adds the support for allocatable local storage (scratch space).
This feature is only for root file system which is shared by kubernetes
components, users' containers and/or images. Users could use
--kube-reserved flag to reserve the storage for kube system components.
If the allocatable storage for user's pods is used up, some pods will be
evicted to free the storage resource.

This feature is part of local storage capacity isolation and described in the proposal kubernetes/community#306

**Release note**:

```release-note
This feature exposes local storage capacity for the primary partitions, and supports & enforces storage reservation in Node Allocatable 
```
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Jun 6, 2017
Automatic merge from submit-queue

Add EmptyDir volume capacity isolation

This PR adds the support for isolating the emptyDir volume use. If user
sets a size limit for emptyDir volume, kubelet's eviction manager monitors its usage
and evict the pod if the usage exceeds the limit.

This feature is part of local storage capacity isolation and described in the proposal kubernetes/community#306

**Release note**:

```release-note
Alpha feature: allows users to set storage limit to isolate EmptyDir volumes. It enforces the limit by evicting pods that exceed their storage limits  
```
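As this feature ultimately landed, the per-volume cap surfaced as a `sizeLimit` field on emptyDir rather than the `capacity` spelling used in this proposal's examples; a minimal use looks like:

```yaml
volumes:
- name: scratch
  emptyDir:
    sizeLimit: 1Gi    # kubelet evicts the pod if usage exceeds this limit
```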
# Open Questions & Discussion points
* Single vs split “limit” for storage across writable layer and logs
* Local Persistent Volume bindings happening in the scheduler vs in PV controller
* Should the PV controller fold into the scheduler
sky-big commented Jun 27, 2017

At present, local PVs have the problem described above: the PV and PVC are bound before scheduling, and once bound the scheduler selects the node using the PV's node affinity. But if that node no longer has enough CPU, memory, and so on, the pod fails to schedule indefinitely. Is there a plan to solve this?

Member

Yes, that is the limitation in the first phase. We hope to solve it in the next release, but no concrete ideas yet, we're just prototyping at this point. At a high level, the PVC binding needs to be delayed until a pod is scheduled, so that it can take into account all the other scheduling requirements of the pod.


It's cool if solve local volume PV PVC delay bound,now my project team worry about the question,so not dare use local volume plugin because pod schedule fail all the time easily.

Member

Yes, it will not work well right now in general purpose situations. But if you use the critical pod feature, or you run your workload that needs local storage first, then that may work better. Still, the PVs may not get spread very well because the PV controller doesn't know that all these PVCs are replicas in the same workload. You may be able to work around the issue by labeling the PVs per workload, as sketched below.
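A sketch of that workaround with hypothetical label values: give each workload's PVs a distinguishing label and have its PVCs select on it, so claims from different workloads cannot bind to the same volumes.

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-7
  labels:
    workload: mysql-a              # hypothetical per-workload label
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/disks/ssd2          # assumed mount point
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data-mysql-a-0
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      workload: mysql-a            # binds only to this workload's PVs
```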


OK, thanks. I'll pay close attention to v1.8 (Scheduler predicate for prebound local PVCs, #43640).

perotinus pushed a commit to kubernetes-retired/cluster-registry that referenced this pull request Sep 2, 2017
Automatic merge from submit-queue

Add Local Storage Capacity Isolation API

This PR adds the new APIs to support storage capacity isolation as
described in the proposal [kubernetes/community#306

1. Add SizeLimit for emptyDir volume
2. Add scratch and overlay storage type used by container level or
node level


**Release note**:

```release-note
Alpha feature: Local volume Storage Capacity Isolation allows users to set storage limit to isolate EmptyDir volumes, container storage overlay, and also supports allocatable storage for shared root file system. 
```
shyamjvs pushed a commit to shyamjvs/community that referenced this pull request Sep 22, 2017
Initial incomplete release notes draft for 1.7
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
[Proposal] Improve Local Storage Management