Store Consul cluster/snapshot state to fix failed cluster behavior #326
Conversation
- there has been a longstanding issue where resources are leaked in our e2e tests. This is because we purge all failed clusters from state with d.SetId(""). This doesn't make sense for failed clusters/snapshots. Instead, we should store their state... in state
This handling of state behavior applies to many other resources as well. Example of another issue, but for HVNs: we should store all of these resources' state so they're deleted on the next run (terraform-provider-hcp/internal/provider/resource_hvn.go, lines 214 to 217 at f678fea).
- Add acc testing - and undo property sorting; it makes the PR bigger
- Change acc testing tier to development - otherwise folks need to upload a credit card to help with PR fixes
- Fix scale attr - reduced the size, need to reduce the scale
```
@@ -382,12 +387,20 @@ func resourceConsulClusterCreate(ctx context.Context, d *schema.ResourceData, me
		return diag.Errorf("unable to retrieve Consul cluster (%s): %v", payload.Cluster.ID, err)
	}

	if err := setConsulClusterResourceData(d, cluster); err != nil {
```
When the create fails, where would the tf have failed previously? Was it in attempting to get the client config files? I had assumed it was the WaitForOperation call back on line 379, still prior to this call.
You're right, and yes, this was failing up above in the Wait. Where I saw this become an issue was in `resourceConsulClusterRead` (the `GetConsulClusterByID` call succeeds while `GetConsulClientConfigFiles` fails). I then applied the same pattern everywhere we make these two API calls: store the results of the first call before making the second.
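Roughly, the pattern looks like this (sketch only; the `clients` call signatures are approximate and `setConsulClientConfigResourceData` is a stand-in name, not necessarily what the PR uses):

```go
// Sketch: store what the first call returned before making the second call.
cluster, err := clients.GetConsulClusterByID(ctx, client, loc, clusterID) // argument order assumed
if err != nil {
	return diag.Errorf("unable to retrieve Consul cluster (%s): %v", clusterID, err)
}

// Persist the cluster properties we already have, so a failure in the next
// call doesn't leave Terraform with nothing but an ID in state.
if err := setConsulClusterResourceData(d, cluster); err != nil {
	return diag.FromErr(err)
}

// Only now fetch the client config files; if this fails, the cluster data
// above has already been written to the resource data.
configFiles, err := clients.GetConsulClientConfigFiles(ctx, client, loc, clusterID) // argument order assumed
if err != nil {
	return diag.Errorf("unable to retrieve Consul client config files (%s): %v", clusterID, err)
}

// Hypothetical helper for the config-file attributes.
if err := setConsulClientConfigResourceData(d, configFiles); err != nil {
	return diag.FromErr(err)
}
```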
If it fails during the Wait, will we still clean it up?
Yes, we'll still clean it up on the next `terraform destroy` or `terraform apply`. A PR from last year addressed this by storing the ID of the resource before polling: #59

Current behavior:

1. `tf create`: fail during poll: bail, return diag.Error, we store nothing but the ID
2. `tf plan`, read: GET with ID, see it's FAILED, purge from state
3. `tf apply`: create a new cluster. Don't delete the prior cluster (it was purged). Probably run into a duplicate cluster/HVN ID issue (unless the user knows to go manually delete the old resources)

Behavior after this change:

1. `tf create`: fail during poll: bail, return diag.Error, we store nothing but the ID
2. `tf plan`, read: GET with ID, set state
3. `tf apply`: old cluster is tainted, delete it, create a new cluster
Thanks for summing up the behavior change here 📝 💡

One thing I'm considering is how useful this computed output `state` might be in other circumstances, since it seems more of an internal detail. Would a user ever need to use the output of `state` or react to it? I think that was the main reason for keeping it out of the resource schema until now.

We could also resolve this issue without `state` by instead deleting any cluster that returns in a failed state, in addition to setting the ID to nil in the TF state. We could put this operation in a retry function.

Which operation takes more time on the Consul service side: deleting a failed cluster or re-attempting the creation of a failed cluster? Or any other advantages?
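Something roughly like this is what I have in mind (just a sketch of the alternative, not what this PR does; `deleteFailedCluster` is a hypothetical stand-in for whatever delete call our clients package exposes):

```go
// Hypothetical alternative: if the cluster comes back FAILED, delete it on the
// service side inside a retry loop, then drop it from Terraform state.
err := resource.RetryContext(ctx, d.Timeout(schema.TimeoutDelete), func() *resource.RetryError {
	if err := deleteFailedCluster(ctx, client, loc, clusterID); err != nil {
		return resource.RetryableError(err) // keep retrying until the timeout elapses
	}
	return nil
})
if err != nil {
	return diag.FromErr(err)
}
d.SetId("") // the failed cluster is gone, so forget it in TF state too
return nil
```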
I realized I messed up and replied out of thread: #326 (comment)
Thanks for this fix! I think this is a solid approach, but I wanted to discuss an alternative that doesn't involve adding a `state` field.
```go
	Type:     schema.TypeString,
	Computed: true,
},
"project_id": {
	Description: "The ID of the project this HCP Consul cluster is located in.",
"cluster_state": {
```
To confirm: the only key change here, besides the alphabetization, is adding `cluster_state` as a computed output.
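i.e. something shaped roughly like this (the description string here is a guess, not the exact wording in the diff):

```go
// Illustrative shape of a computed-only attribute in the SDK v2 schema.
"cluster_state": {
	Description: "The state of the HCP Consul cluster.", // wording is a guess
	Type:        schema.TypeString,
	Computed:    true,
},
```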
That's the main change, but there are two other behavior changes. Highlights from the description (a rough sketch of the read-side handling follows this list):

- add `state`
- remove `DELETED` Consul clusters from state (not `FAILED`)
- set the Consul client config resource properties after setting the rest of the cluster resource properties (i.e. set state after each API call, rather than at the end of calling all of them)
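A rough sketch of that read-side handling (`isClusterDeleted` is a stand-in for comparing against the generated DELETED state enum, and the argument order on the clients call is approximate):

```go
// Sketch of the read-side behavior change: only DELETED clusters are purged
// from Terraform state; FAILED clusters stay so the next plan/apply can taint
// and replace them.
cluster, err := clients.GetConsulClusterByID(ctx, client, loc, clusterID) // argument order assumed
if err != nil {
	return diag.Errorf("unable to fetch Consul cluster (%s): %v", clusterID, err)
}

if isClusterDeleted(cluster) { // stand-in for checking the generated DELETED enum
	// The cluster no longer exists on the service side: forget it.
	d.SetId("")
	return nil
}

// FAILED (and every other state) is kept, including the new computed
// state attribute, so Terraform can delete/replace it on the next run.
if err := setConsulClusterResourceData(d, cluster); err != nil {
	return diag.FromErr(err)
}
return nil
```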
Not sure why I can't reply inline, but with regard to the comment here (#326 (comment)), I think there are a few reasons to include `state`:

We also talked about this on our side (just deleting the failed resource after the wait). Again, I think that's different from the UI case, and that we should try to match the UI case, where the cluster stays failed -- and then we delete it for them on their next run.
Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>
Thanks for the response! I think your points about consistency with the UI and user expectations make sense. Definitely worth rolling out this pattern to the rest of our resources.
Store Consul cluster/snapshot state to fix failed cluster behavior (#326)

* Fix storing of failed consul cluster/snapshot state - there has been a longstanding issue where resources are leaked in our e2e tests. This is because we purge all failed clusters from state with d.SetId(""). This doesn't make sense for failed clusters/snapshots. Instead, we should store their state... in state
* Add acc testing - and undo property sorting - it makes the PR bigger
* Make docs
* Change acc testing tier to development - otherwise folks need to upload a credit card to help pr fixes
* Fix scale attr - reduced the size, need to reduce the scale
* Fix size if acc testing of consul cluster
* Update Consul cluster data source to also store state
* Fix brace in cluster def
* go generate docs
* Use state vs cluster/snapshot_state
* Update internal/provider/resource_consul_cluster_test.go

Co-authored-by: Brenna Hewer-Darroch <21015366+bcmdarroch@users.noreply.github.com>
🛠️ Description
Background: we have an issue where resources are leaked in our E2E tests. This is because we purge all failed clusters from state with `d.SetId("")`.

Changes:

- add computed `cluster_state` and `snapshot_state` to Consul clusters and snapshots, respectively
- remove `DELETED` Consul clusters from state (not `FAILED`)
- the `GetConsulClusterByID` call can succeed before the `GetConsulClientConfigFiles` API call fails -- we should still set the resource properties we have from the `GetConsulClusterByID` call

UX change: `FAILED` clusters/snapshots are stored in state so they're deleted by Terraform on the next `terraform apply` or `terraform destroy`. This differs from the behavior right now, where users have to find and delete the resources in the UI manually because Terraform forgets about them.

Example of a failed E2E test here. We mark the cluster as deleted and remove it from state:
That's the opposite of what we want, which is to delete the failed cluster on the next run. And our E2E org becomes filled with Consul clusters and HVNs: https://admin.hcp.to/organizations/orgs/11eb1545-60c8-4b02-9121-0242ac110008/resources
From docs:
As far as I can tell, the behavior in this PR will match the desired behavior of the original PR:
closes: #335
🚢 Release Note
Release note for CHANGELOG:
🏗️ Acceptance tests
Output from acceptance testing:
Everything passes except the update, where I hit a payments/no-credit-card error.
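For reference, a check along these lines could verify the new computed attribute in an acceptance test (illustrative only; the resource address and config variable are placeholders, not this PR's actual test code):

```go
// Illustrative test step: after apply, the computed cluster_state attribute
// should be populated on the resource.
resource.TestStep{
	Config: testAccConsulClusterConfig, // placeholder for the test's HCL config
	Check: resource.ComposeTestCheckFunc(
		resource.TestCheckResourceAttrSet("hcp_consul_cluster.test", "cluster_state"),
	),
},
```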
E2E Testing
I created a cluster, terminated the dataplane deployment, and confirmed that `cluster_state` was stored after `tf refresh`:

On `tf apply` it was replaced: