
Store Consul cluster/snapshot state to fix failed cluster behavior #326

Merged
12 commits merged into main from jjtimmons/fix-consul-cluster-failed-state on Jun 28, 2022

Conversation

@jjti (Contributor) commented Jun 13, 2022

🛠️ Description

Background: we have an issue where resources are leaked in our E2E tests. This is because we purge all failed clusters from state with d.SetId("").

Changes:

  • add cluster_state and snapshot_state to Consul clusters and snapshots, respectively
  • remove DELETED Consul clusters from state (not FAILED); a sketch of this check follows the list
  • set Consul client config resource properties after setting the rest of the resource properties. The GetConsulClusterByID call can succeed even when the subsequent GetConsulClientConfigFiles call fails; in that case we should still set the resource properties we already have from GetConsulClusterByID
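
A minimal sketch of the read-side check described in the second bullet. This is illustrative only: getClusterByID and the consulCluster struct below are stand-ins for the provider's clients.GetConsulClusterByID call and the HCP SDK cluster model.

package provider

import (
	"context"
	"log"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// consulCluster is a stand-in for the HCP SDK cluster model.
type consulCluster struct {
	ID    string
	State string // e.g. "CREATING", "RUNNING", "FAILED", "DELETED"
}

// getClusterByID is a stand-in for clients.GetConsulClusterByID.
func getClusterByID(ctx context.Context, id string) (*consulCluster, error) {
	return &consulCluster{ID: id, State: "FAILED"}, nil
}

func resourceConsulClusterRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	cluster, err := getClusterByID(ctx, d.Id())
	if err != nil {
		return diag.FromErr(err)
	}

	// Only clusters the platform has actually deleted are dropped from state;
	// FAILED clusters stay (tainted) so the next apply/destroy replaces them.
	if cluster.State == "DELETED" {
		log.Printf("[WARN] Consul cluster (%s) was deleted, removing from state", d.Id())
		d.SetId("")
		return nil
	}

	if err := d.Set("cluster_state", cluster.State); err != nil {
		return diag.FromErr(err)
	}
	return nil
}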

UX change:

  • FAILED clusters/snapshots are kept in state so Terraform deletes them on the next terraform apply or terraform destroy. This differs from the current behavior, where Terraform forgets about them and users have to find and delete the resources manually in the UI

Example of a failed E2E test here. We mark the cluster as deleted and remove it from state:

2022-06-11T16:11:36Z   # module.hcp_e2e_consul_service.hcp_consul_cluster.default has been deleted
2022-06-11T16:11:36Z   - resource "hcp_consul_cluster" "default" {
2022-06-11T16:11:36Z       - cluster_id         = "cloud-e2e-a541b8d-112230" -> null
2022-06-11T16:11:36Z       - connect_enabled    = true -> null

That's the opposite of what we want, which is to delete the failed cluster on the next run. As a result, our E2E org fills up with leaked Consul clusters and HVNs: https://admin.hcp.to/organizations/orgs/11eb1545-60c8-4b02-9121-0242ac110008/resources

From docs:

When you create something in Terraform but delete it manually, Terraform should gracefully handle it. If the API returns an error when the resource doesn't exist, the read function should check to see if the resource is available first. If the resource isn't available, the function should set the ID to an empty string so Terraform "destroys" the resource in state. The following code snippet is an example of how this can be implemented; you do not need to add this to your configuration for this tutorial.
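
The snippet the docs refer to isn't reproduced in the quote above. Roughly, the pattern they describe looks like the following hedged, generic sketch, where resourceExists stands in for whatever existence check the remote API offers:

package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceExists is a hypothetical existence check against the remote API.
func resourceExists(ctx context.Context, id string) (bool, error) { return true, nil }

func resourceExampleRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	ok, err := resourceExists(ctx, d.Id())
	if err != nil {
		return diag.FromErr(err)
	}
	if !ok {
		// Deleted out of band: clear the ID so Terraform "destroys" the
		// resource in state instead of returning an API error.
		d.SetId("")
	}
	return nil
}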

As far as I can tell, the behavior in this PR will match the desired behavior of the original PR:

Set the ID of resources that need to poll for an operation before polling the operation. This prevents the case where a resource is created in the DB but the async resource operation fails and Terraform has no state entry so it thinks the resource does not exist (even though it does and is in a failed state), so the user can't run terraform destroy to clean up the failed resource.
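
That earlier pattern, sketched below. The createCluster/waitForOperation helpers and the payload shape are illustrative stand-ins, not the provider's exact code:

package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// createPayload, createCluster, and waitForOperation are stand-ins for the HCP SDK calls.
type createPayload struct{ ClusterID, OperationID string }

func createCluster(ctx context.Context) (*createPayload, error) { return &createPayload{}, nil }

func waitForOperation(ctx context.Context, opID string) error { return nil }

func resourceConsulClusterCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	payload, err := createCluster(ctx)
	if err != nil {
		return diag.FromErr(err)
	}

	// Store the ID *before* polling: if the async provision fails, Terraform
	// still has a state entry, so terraform destroy can clean the cluster up.
	d.SetId(payload.ClusterID)

	if err := waitForOperation(ctx, payload.OperationID); err != nil {
		return diag.Errorf("unable to provision Consul cluster (%s): %v", payload.ClusterID, err)
	}
	return nil
}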

closes: #335

🚢 Release Note

Release note for CHANGELOG:

* resource/consul: Store cluster+snapshot state [GH-326]

🏗️ Acceptance tests

  • Are there any feature flags that are required to use this functionality?
  • Have you added an acceptance test for the functionality being added?
  • Have you run the acceptance tests on this branch?

Output from acceptance testing:

$ → make testacc TESTARGS='-run=TestAccConsulCluster'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./internal/... -v -run=TestAccConsulCluster -timeout 210m
?       github.com/hashicorp/terraform-provider-hcp/internal/clients    [no test files]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/consul     (cached) [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/input      (cached) [no tests to run]
=== RUN   TestAccConsulCluster
    resource_consul_cluster_test.go:62: Step 6/6 error: Error running apply: exit status 1
        
        Error: unable to create Consul cluster (test-consul-cluster): [POST /consul/2021-02-04/organizations/{cluster.location.organization_id}/projects/{cluster.location.project_id}/clusters][403] Create default  &{Code:7 Details:[] Error:billing authorization check failed Message:billing authorization check failed}
...

→ make testacc TESTARGS='-run=TestAccConsulSnapshot'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./internal/... -v -run=TestAccConsulSnapshot -timeout 210m
?       github.com/hashicorp/terraform-provider-hcp/internal/clients    [no test files]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/consul     0.140s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/input      0.131s [no tests to run]
=== RUN   TestAccConsulSnapshot
--- PASS: TestAccConsulSnapshot (909.93s)
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/provider   910.251s

Everything passes except the update step, where I hit a payments/no-credit-card error.

E2E Testing

I created a cluster, terminated the dataplane deployment, and confirmed that cluster_state was stored after tf refresh:

      "type": "hcp_consul_cluster",
      "instances": [{
          "status": "tainted",
...
            "cluster_id": "consul-quickstart-1655065377418",
            "cluster_state": "FAILED",

On tf apply it was replaced:

-/+ resource "hcp_consul_cluster" "main" {
      ~ auto_hvn_to_hvn_peering       = false -> (known after apply)
      ~ cloud_provider                = "aws" -> (known after apply)
      ~ cluster_state                 = "FAILED" -> (known after apply)

- there has been a longstanding issue where resources are leaked in our e2e tests. This is because we purge all failed clusters from state with d.SetId(""). This doesn't make sense for failed clusters/snapshots. Instead, we should store their state... in state
@jjti jjti changed the title Fix storing of Failed Consul cluster/snapshot state Store Consul cluster/snapshot state, fix FAILED cluster behavior Jun 13, 2022
@jjti (Contributor, Author) commented Jun 13, 2022

This handling of state applies to many other resources as well. Example of a similar issue, but for HVNs:
#222

We should store all of these resources' state so they're deleted on the next terraform apply/destroy: https://github.com/hashicorp/terraform-provider-hcp/search?q=failed+to+provision. I believe we want to change all of these resources to store the resource's state, as sketched after the snippet below. Current example from the HVN read:

// HVN found, update resource data
if err := setHvnResourceData(d, hvn); err != nil {
	return diag.FromErr(err)
}
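
A hypothetical version of that read with the proposed check added. The string comparison against "DELETED" is illustrative; the real code would compare against the HCP SDK's network state enum:

// Only purge HVNs the platform has actually deleted; FAILED HVNs stay in state.
if hvn.State == "DELETED" {
	d.SetId("")
	return nil
}

// HVN found (possibly FAILED), update resource data so the next apply can replace it
if err := setHvnResourceData(d, hvn); err != nil {
	return diag.FromErr(err)
}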

jjtimmons added 2 commits June 12, 2022 19:47
- and undo property sorting - it makes the PR bigger
@jjti jjti changed the title Store Consul cluster/snapshot state, fix FAILED cluster behavior Store Consul cluster/snapshot state and fix FAILED cluster behavior Jun 13, 2022
@jjti jjti changed the title Store Consul cluster/snapshot state and fix FAILED cluster behavior Store Consul cluster/snapshot state, fix FAILED cluster behavior Jun 13, 2022
@jjti jjti requested a review from bcmdarroch June 13, 2022 00:51
@jjti jjti marked this pull request as ready for review June 13, 2022 00:51
@jjti jjti requested a review from a team as a code owner June 13, 2022 00:51
@jjti jjti requested review from a team and mkeeler June 13, 2022 00:51
@jjti jjti changed the title Store Consul cluster/snapshot state, fix FAILED cluster behavior Store Consul cluster/snapshot state to fix failed cluster behavior Jun 13, 2022
jjtimmons added 2 commits June 12, 2022 20:11
- otherwise folks need to upload a credit card to help pr fixes
- reduced the size, need to reduce the scale
@@ -382,12 +387,20 @@ func resourceConsulClusterCreate(ctx context.Context, d *schema.ResourceData, me
return diag.Errorf("unable to retrieve Consul cluster (%s): %v", payload.Cluster.ID, err)
}

if err := setConsulClusterResourceData(d, cluster); err != nil {
Member:

When the create fails, where would the tf have failed previously? Was it in attempting to get the client config files?

I had assumed it was the WaitForOperation call back on line 379, still prior to this call.

@jjti (Contributor, Author):

You're right, and yes, this was failing up above in the Wait. Where I saw this become an issue was in resourceConsulClusterRead (the GetConsulClusterByID call succeeds while GetConsulClientConfigFiles fails). I then applied the same pattern everywhere we make these two API calls, storing the results of the first call before making the second.
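
In the read path, that ordering looks roughly like this. The helper names are the ones mentioned in this PR, but the call signatures and the setConsulClientConfigResourceData name are simplified stand-ins:

// Fetch the cluster and persist what we know before the second call...
cluster, err := clients.GetConsulClusterByID(ctx, client, loc, clusterID) // signature simplified
if err != nil {
	return diag.FromErr(err)
}
if err := setConsulClusterResourceData(d, cluster); err != nil {
	return diag.FromErr(err)
}

// ...so a failure here no longer discards the cluster properties set above.
clientConfig, err := clients.GetConsulClientConfigFiles(ctx, client, loc, clusterID) // signature simplified
if err != nil {
	return diag.FromErr(err)
}
if err := setConsulClientConfigResourceData(d, clientConfig); err != nil { // hypothetical helper name
	return diag.FromErr(err)
}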

Member:

If it fails during the Wait, will we still clean it up?

@jjti (Contributor, Author), Jun 13, 2022:

Yes, we'll still clean it up on the next terraform destroy or terraform apply. A PR from last year addressed this by storing the ID of the resource before polling: #59

Current behavior:

  • tf create fails during the poll: bail, return a diag.Error; we store nothing but the ID
  • tf plan, read: GET with the ID, see it's FAILED, purge it from state
  • tf apply: create a new cluster. The prior cluster isn't deleted (it was purged from state), so unless the user knows to manually delete the old resources, they probably run into a duplicate cluster/HVN ID issue

Behavior after this change:

  • tf create fails during the poll: bail, return a diag.Error; we store nothing but the ID
  • tf plan, read: GET with the ID, set the state
  • tf apply: the old cluster is tainted, so it's deleted and a new cluster is created

@bcmdarroch (Contributor), Jun 14, 2022:

Thanks for summing up the behavior change here 📝 💡

One thing I'm considering is how useful this computed output state might be in other circumstances, since it seems more of an internal detail. Would a user ever need to use the output of state or react to it? I think that was the main reason for keeping it out of the resource schema until now.

We could also resolve this issue without state by instead deleting any cluster that returns in a failed state, in addition to setting the ID to nil in the TF state. We could put this operation in a retry function.

Which operation takes more time on the Consul service side? Deleting a failed cluster or re-attempting the creation of a failed cluster? Or any other advantages?

@jjti (Contributor, Author):

I realized I messed up and replied out of thread: #326 (comment)

(Resolved review thread on internal/provider/resource_consul_cluster_test.go)
@jjti jjti requested a review from mkeeler June 13, 2022 15:59
(Resolved review threads on docs/resources/consul_snapshot.md and docs/resources/consul_cluster.md)
@bcmdarroch (Contributor) left a comment:

Thanks for this fix! I think this is a solid approach, but I wanted to discuss an alternative that doesn't involve adding a state field.

Type: schema.TypeString,
Computed: true,
},
"project_id": {
Description: "The ID of the project this HCP Consul cluster is located in.",
"cluster_state": {
Contributor:

To confirm: the only key change here, besides the alphabetization, is adding cluster_state as a computed output.

@jjti (Contributor, Author):

That's the main change, but there are two other behavior changes. Highlights from the description:

  • add state (the computed cluster_state attribute; a sketch of the schema entry follows this list)
  • remove DELETED Consul clusters from state (not FAILED)
  • set Consul client config resource properties after setting the rest of the cluster resource properties (i.e., set state after each API call, rather than at the end of calling all of them)
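
The attribute itself is small; roughly, it's a plugin-SDK schema entry like the following sketch (the description string is assumed here, not taken from the exact diff):

"cluster_state": {
	Description: "The state of the Consul cluster.",
	Type:        schema.TypeString,
	Computed:    true,
},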


@jjti (Contributor, Author) commented Jun 14, 2022

Not sure why I can't reply inline, but with regard to the comment here: #326 (comment)

I think there are a few reasons to include state:

  • it matches the experience in the UI and the API. In the UI, we already show the user when a cluster/HVN/etc. is creating, running, updating, deleted, or failed

(Screenshots: HCP UI pages showing resource states)

Regarding "We could also resolve this issue without state by instead deleting any cluster that returns in a failed state, in addition to setting the ID to nil in the TF state":

We also talked about this on our side (just deleting the failed resource after the wait). Again, I think that's different from the UI case, and that we should try to match the UI case where it stays failed, and then we delete it for them on their next tf apply/destroy. I would be kind of surprised if, for example, I tried to make an EKS cluster, it failed, and it was then deleted (even with a warning). I think I'd want to be the one to issue the delete (even if via tf apply).

Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>
@bcmdarroch (Contributor) left a comment:

Thanks for the response! I think your point about consistency with the UI and user expectations makes sense. Definitely worth rolling out this pattern to the rest of our resources.

@jjti jjti merged commit 1c970d4 into main Jun 28, 2022
@jjti jjti deleted the jjtimmons/fix-consul-cluster-failed-state branch June 28, 2022 17:42
aidan-mundy pushed a commit that referenced this pull request Sep 8, 2023
Store Consul cluster/snapshot state to fix failed cluster behavior (#326)

* Fix storing of failed consul cluster/snapshot state

- there has been a longstanding issue where resources are leaked in our e2e tests. This is because we purge all failed clusters from state with d.SetId(""). This doesn't make sense for failed clusters/snapshots. Instead, we should store their state... in state

* Add acc testing

- and undo property sorting - it makes the PR bigger

* Make docs

* Change acc testing tier to development

- otherwise folks need to upload a credit card to help pr fixes

* Fix scale attr

- reduced the size, need to reduce the scale

* Fix size if acc testing of consul cluster

* Update Consul cluster data source to also store state

* Fix brace in cluster def

* go generate docs

* Use state vs cluster/snapshot_state

* Update internal/provider/resource_consul_cluster_test.go

Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>

Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>
Linked issue that may be closed by this pull request: HCP Consul Cluster failed to launch