Remedy Azure disk ConflictingUserInput errors #1

Closed · 1 task

stoyanr opened this issue Jul 1, 2020 · 2 comments
Labels
kind/epic (Large multi-story topic), platform/azure (Microsoft Azure), platform/infrastructure, priority/3 (lower number equals higher priority)

Comments


stoyanr commented Jul 1, 2020

Issue

Detect Azure disks that can't be attached because they are still attached to another VM on the infrastructure.

Proposed solution

  1. The controller needs to watch for pod events (see the sketch below)
  2. Parse the message of the event
    2.1. If the field reason equals FailedAttachVolume and the field Code equals ConflictingUserInput
    2.2. Then read the field Message and extract the disk URI
  3. Get the VM object that has the disk attached
  4. Update the VM by removing the disk from its ARM spec (a sketch of steps 2.2 to 4 follows the event example below)
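
A minimal sketch of steps 1 and 2.1 using client-go; the kubeconfig wiring, resync period, and string-based filtering are illustrative and not the actual remedy controller implementation:

```go
// Sketch: watch events for pods and filter for FailedAttachVolume
// events whose message carries the ConflictingUserInput error code.
package main

import (
	"fmt"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	factory.Core().V1().Events().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			event := obj.(*corev1.Event)
			// Step 2.1: only FailedAttachVolume events on pods whose
			// message contains the ConflictingUserInput error code.
			if event.InvolvedObject.Kind != "Pod" || event.Reason != "FailedAttachVolume" {
				return
			}
			if !strings.Contains(event.Message, `Code="ConflictingUserInput"`) {
				return
			}
			fmt.Printf("candidate event for pod %s/%s: %s\n",
				event.InvolvedObject.Namespace, event.InvolvedObject.Name, event.Message)
			// Steps 2.2 to 4 (extract the disk URI, find and update the VM)
			// are sketched after the event example below.
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```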

Event example

Event(
v1.ObjectReference{
	Kind:"Pod", 
	Namespace:"test-namespace", 
	Name:"test-pod-with-volume-0", 
	UID:"xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", 
	APIVersion:"v1", 
	ResourceVersion:"999999999", 
	FieldPath:""
}): type: 'Warning' 
reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx" : Attach volume "shoot-garden-az-test-dynamic-pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx" to instance "shoot-garden-az-test-worker-1111111111-new" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. 
Status=<nil> 
Code="ConflictingUserInput" 
Message="Disk '/subscriptions/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy/resourceGroups/shoot-garden-az-test/providers/Microsoft.Compute/disks/shoot-garden-az-test-dynamic-pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx' cannot be attached as the disk is already owned by VM '/subscriptions/7532faf4-1027-48a1-91b0-40cb0d2d1496/resourceGroups/shoot-garden-az-test/providers/Microsoft.Compute/virtualMachines/shoot-garden-az-test-worker-2222222222-old'."

Tasks

  • Reliably reproduce or simulate the issue, if needed via a special script / binary

stoyanr commented Jul 1, 2020

@stoyanr, Jun 15

I tried to reproduce the issue and found that it's already fixed when using the in-tree volume attacher:

  1. Create a PVC; this also creates a PV backed by a "dynamic" volume. Remove the finalizers manually and delete the PVC and the PV. This deletes the PVC and the PV, but leaves the volume behind. Alternatively, you could simply create the volume manually in Azure.
  2. Create another PV using an explicit azureDisk configuration pointing to the dynamic volume. Then create a PVC matching this PV and a pod using the PVC with affinity to a specific node. See the attached example pod-test.yaml.txt (a sketch of such a PV is included after the error below).
  3. Delete the pod, and manually detach the volume from the node in Azure and attach it to a different node in the same availability zone. Make sure that no pods on that node are actually using the volume.
  4. Create the pod again. Starting the pod should fail with the following error:
AttachVolume.Attach failed for volume "test" : disk(/subscriptions/00d2caa5-cd29-46f7-845a-2f8ee0360ef5/resourceGroups/shoot--remedy-ck3b6eulyy/providers/Microsoft.Compute/disks/shoot--remedy--ck3b6eulyy-dynamic-pvc-79d503bb-f787-4259-8da7-0c44b6f61256) already attached to node(/subscriptions/00d2caa5-cd29-46f7-845a-2f8ee0360ef5/resourceGroups/shoot--remedy--ck3b6eulyy/providers/Microsoft.Compute/virtualMachines/test), could not be attached to node(shoot--remedy--ck3b6eulyy-worker-qr1lw-z3-69589b7988-2g2hw)

After about 2 minutes, the disk should automatically be detached from the other node and attached to the correct one, and the pod should start. Manually detaching the disk from the other node would also work, but is not necessary.
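
Since pod-test.yaml.txt isn't reproduced here, the following is a hypothetical sketch of the PV from step 2 using client-go types; the disk URI is taken from the error above, while the PV name, capacity, and access mode are assumptions:

```go
package repro

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// testPV points an explicit azureDisk source at the pre-existing dynamic
// volume (step 2). Name, capacity, and access mode are assumed; the disk
// URI is copied from the error message above.
var testPV = &corev1.PersistentVolume{
	ObjectMeta: metav1.ObjectMeta{Name: "test"},
	Spec: corev1.PersistentVolumeSpec{
		Capacity: corev1.ResourceList{
			corev1.ResourceStorage: resource.MustParse("1Gi"),
		},
		AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
		PersistentVolumeSource: corev1.PersistentVolumeSource{
			AzureDisk: &corev1.AzureDiskVolumeSource{
				DiskName:    "shoot--remedy--ck3b6eulyy-dynamic-pvc-79d503bb-f787-4259-8da7-0c44b6f61256",
				DataDiskURI: "/subscriptions/00d2caa5-cd29-46f7-845a-2f8ee0360ef5/resourceGroups/shoot--remedy-ck3b6eulyy/providers/Microsoft.Compute/disks/shoot--remedy--ck3b6eulyy-dynamic-pvc-79d503bb-f787-4259-8da7-0c44b6f61256",
			},
		},
	},
}
```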

The fix came with kubernetes/kubernetes#81266 and should be available in all releases from 1.16 onwards, as well as in patches for older releases. It was introduced as a fix for kubernetes/kubernetes#81079.

It seems, however, that the fix is still not part of the Azure CSI driver, so if we build an Azure cluster using the CSI driver (disabled by default, @d062553 do you know how to enable it?), the original issue should still be reproducible. The relevant code can be found in https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/pkg/azuredisk/controllerserver.go. There is also kubernetes-sigs/azuredisk-csi-driver#288, which can't be implemented because of kubernetes/kubernetes#80488.

In any case, I would suggest not trying to proactively solve this issue in the remedy controller, but instead pushing for a resolution in the CSI driver.

@stoyanr, Jun 16

The issue is definitely reproducible when the Azure CSI driver is enabled. The error we are getting is different from the description in the ticket, due to the change introduced with kubernetes/kubernetes#81266:

AttachVolume.Attach failed for volume "pv-shoot--dev--i500152-az-00142a77-ba60-489f-87df-275e4c66f8fd" : rpc error: code = Unknown desc = Attach volume "/subscriptions/00d2caa5-cd29-46f7-845a-2f8ee0360ef5/resourceGroups/shoot--dev--i500152-az/providers/Microsoft.Compute/disks/pv-shoot--dev--i500152-az-00142a77-ba60-489f-87df-275e4c66f8fd" to instance "shoot--dev--i500152-az-worker-test-z1-7f596dbcf4-bbmtv" failed with disk(/subscriptions/00d2caa5-cd29-46f7-845a-2f8ee0360ef5/resourceGroups/shoot--dev--i500152-az/providers/Microsoft.Compute/disks/pv-shoot--dev--i500152-az-00142a77-ba60-489f-87df-275e4c66f8fd) already attached to node(/subscriptions/00d2caa5-cd29-46f7-845a-2f8ee0360ef5/resourceGroups/shoot--dev--i500152-az/providers/Microsoft.Compute/virtualMachines/shoot--dev--i500152-az-worker-test-z1-7f596dbcf4-wndwk), could not be attached to node(shoot--dev--i500152-az-worker-test-z1-7f596dbcf4-bbmtv)

However, with the CSI driver the automatic detachment of the disk from the other node doesn't happen. Instead, the pod remains in the ContainerCreating state with the above error indefinitely, or until the disk is manually detached from the other node. The ultimate root cause is the issue described in kubernetes/kubernetes#80488, so it's a Kubernetes issue, not an Azure issue.

@stoyanr, Jun 16

The same issue is also reproducible on GCP when the CSI driver is enabled (the default with 1.18). The error is the following:

AttachVolume.Attach failed for volume "pv--72cd19bd-6ddd-422d-9844-f0b18b47f788" : rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-1592316439014-5a8340fb7ae51-9c621c05-ac20ceae failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/sap-se-gcp-scp-k8s-dev/zones/europe-west1-c/disks/pv--72cd19bd-6ddd-422d-9844-f0b18b47f788' is already being used by 'projects/sap-se-gcp-scp-k8s-dev/zones/europe-west1c/instances/shoot--dev--i500152-gcp-worker-test-z1-8697b946cb-872tn'

@stoyanr, Jun 19

Short summary from the meeting on 19.06:

  • We will not implement a remedy for this issue yet, as we aren't currently hit by it.
  • We will postpone making the CSI driver the default on Azure to at least 1.20. The other platforms are assumed to be fine until we actually observe similar issues there (not the case so far).
  • We will try to engage with the community by further commenting / asking in "Dangling volume mechanism for CSI does not exist" (kubernetes/kubernetes#80488) to understand how we could potentially contribute an upstream fix, and then perhaps do it.

stoyanr added the kind/epic (Large multi-story topic), platform/azure (Microsoft Azure), platform/infrastructure, and priority/normal labels on Jul 1, 2020

stoyanr commented Jul 28, 2020

As described above, we will not implement a remedy for this issue. Closing this epic.
