
Store Consul cluster/snapshot state to fix failed cluster behavior #326

Merged
12 commits merged into main from jjtimmons/fix-consul-cluster-failed-state on Jun 28, 2022

Conversation

@jjti (Contributor) commented Jun 13, 2022

🛠️ Description

Background: we have an issue where resources are leaked in our E2E tests. This is because we purge all failed clusters from state with d.SetId("").

Changes:

  • add cluster_state and snapshot_state to Consul clusters and snapshots, respectively
  • remove DELETED Consul clusters from state (not FAILED); a sketch of this check follows the list
  • set Consul client config resource properties after setting the rest of the resource properties. The GetConsulClusterByID call can succeed even when the subsequent GetConsulClientConfigFiles call fails; in that case we should still set the resource properties we already have from GetConsulClusterByID
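
A minimal sketch of the read-side check described in the second bullet. This is illustrative only: getClusterByID and the consulCluster struct below are stand-ins for the provider's clients.GetConsulClusterByID call and the HCP SDK cluster model.

package provider

import (
	"context"
	"log"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// consulCluster is a stand-in for the HCP SDK cluster model.
type consulCluster struct {
	ID    string
	State string // e.g. "CREATING", "RUNNING", "FAILED", "DELETED"
}

// getClusterByID is a stand-in for clients.GetConsulClusterByID.
func getClusterByID(ctx context.Context, id string) (*consulCluster, error) {
	return &consulCluster{ID: id, State: "FAILED"}, nil
}

func resourceConsulClusterRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	cluster, err := getClusterByID(ctx, d.Id())
	if err != nil {
		return diag.FromErr(err)
	}

	// Only clusters the platform has actually deleted are dropped from state;
	// FAILED clusters stay (tainted) so the next apply/destroy replaces them.
	if cluster.State == "DELETED" {
		log.Printf("[WARN] Consul cluster (%s) was deleted, removing from state", d.Id())
		d.SetId("")
		return nil
	}

	if err := d.Set("cluster_state", cluster.State); err != nil {
		return diag.FromErr(err)
	}
	return nil
}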

UX change:

  • FAILED clusters/snapshots are kept in state so Terraform deletes them on the next terraform apply or terraform destroy. This differs from the current behavior, where Terraform forgets about them and users have to find and delete the resources manually in the UI

Example of a failed E2E test here. We mark the cluster as deleted and remove it from state:

2022-06-11T16:11:36Z   # module.hcp_e2e_consul_service.hcp_consul_cluster.default has been deleted
2022-06-11T16:11:36Z   - resource "hcp_consul_cluster" "default" {
2022-06-11T16:11:36Z       - cluster_id         = "cloud-e2e-a541b8d-112230" -> null
2022-06-11T16:11:36Z       - connect_enabled    = true -> null

That's the opposite of what we want, which is to delete the failed cluster on the next run. As a result, our E2E org fills up with leaked Consul clusters and HVNs: https://admin.hcp.to/organizations/orgs/11eb1545-60c8-4b02-9121-0242ac110008/resources

From docs:

When you create something in Terraform but delete it manually, Terraform should gracefully handle it. If the API returns an error when the resource doesn't exist, the read function should check to see if the resource is available first. If the resource isn't available, the function should set the ID to an empty string so Terraform "destroys" the resource in state. The following code snippet is an example of how this can be implemented; you do not need to add this to your configuration for this tutorial.
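
The snippet the docs refer to isn't reproduced in the quote above. Roughly, the pattern they describe looks like the following hedged, generic sketch, where resourceExists stands in for whatever existence check the remote API offers:

package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceExists is a hypothetical existence check against the remote API.
func resourceExists(ctx context.Context, id string) (bool, error) { return true, nil }

func resourceExampleRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	ok, err := resourceExists(ctx, d.Id())
	if err != nil {
		return diag.FromErr(err)
	}
	if !ok {
		// Deleted out of band: clear the ID so Terraform "destroys" the
		// resource in state instead of returning an API error.
		d.SetId("")
	}
	return nil
}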

As far as I can tell, the behavior in this PR will match the desired behavior of the original PR:

Set the ID of resources that need to poll for an operation before polling the operation. This prevents the case where a resource is created in the DB but the async resource operation fails and Terraform has no state entry so it thinks the resource does not exist (even though it does and is in a failed state), so the user can't run terraform destroy to clean up the failed resource.
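
That earlier pattern, sketched below. The createCluster/waitForOperation helpers and the payload shape are illustrative stand-ins, not the provider's exact code:

package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// createPayload, createCluster, and waitForOperation are stand-ins for the HCP SDK calls.
type createPayload struct{ ClusterID, OperationID string }

func createCluster(ctx context.Context) (*createPayload, error) { return &createPayload{}, nil }

func waitForOperation(ctx context.Context, opID string) error { return nil }

func resourceConsulClusterCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	payload, err := createCluster(ctx)
	if err != nil {
		return diag.FromErr(err)
	}

	// Store the ID *before* polling: if the async provision fails, Terraform
	// still has a state entry, so terraform destroy can clean the cluster up.
	d.SetId(payload.ClusterID)

	if err := waitForOperation(ctx, payload.OperationID); err != nil {
		return diag.Errorf("unable to provision Consul cluster (%s): %v", payload.ClusterID, err)
	}
	return nil
}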

closes: #335

🚢 Release Note

Release note for CHANGELOG:

* resource/consul: Store cluster+snapshot state [GH-326]

🏗️ Acceptance tests

  • Are there any feature flags that are required to use this functionality?
  • Have you added an acceptance test for the functionality being added?
  • Have you run the acceptance tests on this branch?

Output from acceptance testing:

$ → make testacc TESTARGS='-run=TestAccConsulCluster'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./internal/... -v -run=TestAccConsulCluster -timeout 210m
?       github.com/hashicorp/terraform-provider-hcp/internal/clients    [no test files]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/consul     (cached) [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/input      (cached) [no tests to run]
=== RUN   TestAccConsulCluster
    resource_consul_cluster_test.go:62: Step 6/6 error: Error running apply: exit status 1
        
        Error: unable to create Consul cluster (test-consul-cluster): [POST /consul/2021-02-04/organizations/{cluster.location.organization_id}/projects/{cluster.location.project_id}/clusters][403] Create default  &{Code:7 Details:[] Error:billing authorization check failed Message:billing authorization check failed}
...

→ make testacc TESTARGS='-run=TestAccConsulSnapshot'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./internal/... -v -run=TestAccConsulSnapshot -timeout 210m
?       github.com/hashicorp/terraform-provider-hcp/internal/clients    [no test files]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/consul     0.140s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/input      0.131s [no tests to run]
=== RUN   TestAccConsulSnapshot
--- PASS: TestAccConsulSnapshot (909.93s)
PASS
ok      github.com/hashicorp/terraform-provider-hcp/internal/provider   910.251s

Everything passes except the update step, where I hit a payments/no-credit-card error.

E2E Testing

I created a cluster, terminated the dataplane deployment, and confirmed that cluster_state was stored after tf refresh:

      "type": "hcp_consul_cluster",
      "instances": [{
          "status": "tainted",
...
            "cluster_id": "consul-quickstart-1655065377418",
            "cluster_state": "FAILED",

On tf apply it was replaced:

-/+ resource "hcp_consul_cluster" "main" {
      ~ auto_hvn_to_hvn_peering       = false -> (known after apply)
      ~ cloud_provider                = "aws" -> (known after apply)
      ~ cluster_state                 = "FAILED" -> (known after apply)

- there has been a longstanding issue where resources are leaked in our e2e tests. This is because we purge all failed clusters from state with d.SetId(""). This doesn't make sense for failed clusters/snapshots. Instead, we should store their state... in state
@jjti jjti changed the title Fix storing of Failed Consul cluster/snapshot state Store Consul cluster/snapshot state, fix FAILED cluster behavior Jun 13, 2022
@jjti (Contributor, Author) commented Jun 13, 2022

This handling of state applies to many other resources as well. Example of a similar issue, but for HVNs:
#222

We should store all of these resources' state so they're deleted on the next terraform apply/destroy: https://github.com/hashicorp/terraform-provider-hcp/search?q=failed+to+provision. I believe we want to change all of these resources to store the resource's state, as sketched after the snippet below. Current example from the HVN read:

// HVN found, update resource data
if err := setHvnResourceData(d, hvn); err != nil {
	return diag.FromErr(err)
}
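
A hypothetical version of that read with the proposed check added. The string comparison against "DELETED" is illustrative; the real code would compare against the HCP SDK's network state enum:

// Only purge HVNs the platform has actually deleted; FAILED HVNs stay in state.
if hvn.State == "DELETED" {
	d.SetId("")
	return nil
}

// HVN found (possibly FAILED), update resource data so the next apply can replace it
if err := setHvnResourceData(d, hvn); err != nil {
	return diag.FromErr(err)
}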

jjtimmons added 2 commits June 12, 2022 19:47
- and undo property sorting - it makes the PR bigger
@jjti jjti changed the title Store Consul cluster/snapshot state, fix FAILED cluster behavior Store Consul cluster/snapshot state and fix FAILED cluster behavior Jun 13, 2022
@jjti jjti changed the title Store Consul cluster/snapshot state and fix FAILED cluster behavior Store Consul cluster/snapshot state, fix FAILED cluster behavior Jun 13, 2022
@jjti jjti requested a review from bcmdarroch June 13, 2022 00:51
@jjti jjti marked this pull request as ready for review June 13, 2022 00:51
@jjti jjti requested a review from a team as a code owner June 13, 2022 00:51
@jjti jjti requested review from a team and mkeeler June 13, 2022 00:51
@jjti jjti changed the title Store Consul cluster/snapshot state, fix FAILED cluster behavior Store Consul cluster/snapshot state to fix failed cluster behavior Jun 13, 2022
jjtimmons added 2 commits June 12, 2022 20:11
- otherwise folks need to upload a credit card to help pr fixes
- reduced the size, need to reduce the scale
@@ -382,12 +387,20 @@ func resourceConsulClusterCreate(ctx context.Context, d *schema.ResourceData, me
return diag.Errorf("unable to retrieve Consul cluster (%s): %v", payload.Cluster.ID, err)
}

if err := setConsulClusterResourceData(d, cluster); err != nil {
Member:

When the create fails, where would the tf have failed previously? Was it in attempting to get the client config files?

I had assumed it was the WaitForOperation call back on line 379, still prior to this call.

@jjti (Contributor, Author):

You're right, and yes, this was failing up above in the Wait. Where I saw this become an issue was in resourceConsulClusterRead (the GetConsulClusterByID call succeeds while GetConsulClientConfigFiles fails). I then applied the same pattern everywhere we make these two API calls, storing the results of the first call before making the second.
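
In the read path, that ordering looks roughly like this. The helper names are the ones mentioned in this PR, but the call signatures and the setConsulClientConfigResourceData name are simplified stand-ins:

// Fetch the cluster and persist what we know before the second call...
cluster, err := clients.GetConsulClusterByID(ctx, client, loc, clusterID) // signature simplified
if err != nil {
	return diag.FromErr(err)
}
if err := setConsulClusterResourceData(d, cluster); err != nil {
	return diag.FromErr(err)
}

// ...so a failure here no longer discards the cluster properties set above.
clientConfig, err := clients.GetConsulClientConfigFiles(ctx, client, loc, clusterID) // signature simplified
if err != nil {
	return diag.FromErr(err)
}
if err := setConsulClientConfigResourceData(d, clientConfig); err != nil { // hypothetical helper name
	return diag.FromErr(err)
}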

Member:

If it fails during the Wait, will we still clean it up?

@jjti (Contributor, Author), Jun 13, 2022:

Yes, we'll still clean it up on the next terraform destroy or terraform apply. A PR from last year addressed this by storing the ID of the resource before polling: #59

Current behavior:

  • tf create fails during the poll: bail, return a diag.Error; we store nothing but the ID
  • tf plan, read: GET with the ID, see it's FAILED, purge it from state
  • tf apply: create a new cluster. The prior cluster isn't deleted (it was purged from state), so unless the user knows to manually delete the old resources, they probably run into a duplicate cluster/HVN ID issue

Behavior after this change:

  • tf create fails during the poll: bail, return a diag.Error; we store nothing but the ID
  • tf plan, read: GET with the ID, set the state
  • tf apply: the old cluster is tainted, so it's deleted and a new cluster is created

@bcmdarroch (Contributor), Jun 14, 2022:

Thanks for summing up the behavior change here 📝 💡

One thing I'm considering is how useful this computed output state might be in other circumstances, since it seems more of an internal detail. Would a user ever need to use the output of state or react to it? I think that was the main reason for keeping it out of the resource schema until now.

We could also resolve this issue without state by instead deleting any cluster that returns in a failed state, in addition to setting the ID to nil in the TF state. We could put this operation in a retry function.

Which operation takes more time on the Consul service side? Deleting a failed cluster or re-attempting the creation of a failed cluster? Or any other advantages?

@jjti (Contributor, Author):

I realized I messed up and replied out of thread: #326 (comment)

(Resolved review thread on internal/provider/resource_consul_cluster_test.go)
@jjti jjti requested a review from mkeeler June 13, 2022 15:59
(Resolved review threads on docs/resources/consul_snapshot.md and docs/resources/consul_cluster.md)
@bcmdarroch (Contributor) left a comment:

Thanks for this fix! I think this is a solid approach, but I wanted to discuss an alternative that doesn't involve adding a state field.

Type: schema.TypeString,
Computed: true,
},
"project_id": {
Description: "The ID of the project this HCP Consul cluster is located in.",
"cluster_state": {
Contributor:

To confirm: the only key change here, besides the alphabetization, is adding cluster_state as a computed output.

@jjti (Contributor, Author):

That's the main change, but there are two other behavior changes. Highlights from the description:

  • add state (the computed cluster_state attribute; a sketch of the schema entry follows this list)
  • remove DELETED Consul clusters from state (not FAILED)
  • set Consul client config resource properties after setting the rest of the cluster resource properties (i.e., set state after each API call, rather than at the end of calling all of them)
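
The attribute itself is small; roughly, it's a plugin-SDK schema entry like the following sketch (the description string is assumed here, not taken from the exact diff):

"cluster_state": {
	Description: "The state of the Consul cluster.",
	Type:        schema.TypeString,
	Computed:    true,
},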


@jjti (Contributor, Author) commented Jun 14, 2022

Not sure why I can't reply inline, but with regard to the comment here: #326 (comment)

I think there are a few reasons to include state:

  • it matches the experience in the UI and the API. In the UI, we already show the user when a cluster/HVN/etc. is creating, running, updating, deleted, or failed

(Screenshots: HCP UI pages showing resource states)

Regarding "We could also resolve this issue without state by instead deleting any cluster that returns in a failed state, in addition to setting the ID to nil in the TF state":

We also talked about this on our side (just deleting the failed resource after the wait). Again, I think that's different from the UI case, and that we should try to match the UI case where it stays failed, and then we delete it for them on their next tf apply/destroy. I would be kind of surprised if, for example, I tried to make an EKS cluster, it failed, and it was then deleted (even with a warning). I think I'd want to be the one to issue the delete (even if via tf apply).

Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>
@bcmdarroch (Contributor) left a comment:

Thanks for the response! I think your point about consistency with the UI and user expectations makes sense. Definitely worth rolling out this pattern to the rest of our resources.

@jjti jjti merged commit 1c970d4 into main Jun 28, 2022
@jjti jjti deleted the jjtimmons/fix-consul-cluster-failed-state branch June 28, 2022 17:42
aidan-mundy pushed a commit that referenced this pull request Sep 8, 2023
Store Consul cluster/snapshot state to fix failed cluster behavior (#326)

* Fix storing of failed consul cluster/snapshot state

- there has been a longstanding issue where resources are leaked in our e2e tests. This is because we purge all failed clusters from state with d.SetId(""). This doesn't make sense for failed clusters/snapshots. Instead, we should store their state... in state

* Add acc testing

- and undo property sorting - it makes the PR bigger

* Make docs

* Change acc testing tier to development

- otherwise folks need to upload a credit card to help pr fixes

* Fix scale attr

- reduced the size, need to reduce the scale

* Fix size if acc testing of consul cluster

* Update Consul cluster data source to also store state

* Fix brace in cluster def

* go generate docs

* Use state vs cluster/snapshot_state

* Update internal/provider/resource_consul_cluster_test.go

Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>

Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>
Linked issue that may be closed by this pull request: HCP Consul Cluster failed to launch