Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Updating Webhook with removed headers doesn't work as expected #71995

Closed
YulNaumenko opened this issue Jul 16, 2020 · 18 comments · Fixed by #73688
Closed

[Alerting] Updating Webhook with removed headers doesn't work as expected #71995

YulNaumenko opened this issue Jul 16, 2020 · 18 comments · Fixed by #73688
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@YulNaumenko
Copy link
Contributor

Currently, deleting existing header in the Webhook connector doesn't work. At the same time you can update Webhook headers with the new header.

Steps to reproduce:

  1. Create new Webhook connector in the Alerting and Actions page
  2. Add a few headers and Save
  3. Open created Webhook connector for Edit
  4. Open Browser dev tool -> Network
  5. Remove one of the existing headers and click Save
  6. Find the PUT request which was sent for the current connector.
  7. Check the response - it contains correct data for updated Webhook connector - header was removed.
  8. Refresh the page and open the same Webhook connector.
  9. Observe that removed header appears again.

Expected behavior:
Deleting Webhook headers should works.

Seems like some there is some caching issue on the server side for savedObjectClient.find

@YulNaumenko YulNaumenko added bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 16, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris
Copy link
Contributor

gmmorris commented Jul 29, 2020

I've investigated this issue and it seems fixing this is trickier than you'd expect.

The reason this happens is that Saved Objects are updated using partial updates and so, removing a field requires explicitly setting the field with a null.

This is straight forward enough with a field you want to remove, but in this case - it's a sub field. Specifically, we have a dictionary on the config to hold the various headers and setting a specific header to null would result in a dictionary with null values but real keys - which would get sent with the webhook.
We could address this by filtering them out in the webhook - but this seems like a broken implementation and is seriously unmaintainable. We're also very likely to encounter this problem with future Connectors and Actions.... in fact, we're probably going to have the exact same problem with Alerts in the future.

Looking into how we're supposed to do a full update, @joshdover suggested doing the following:

client.create(type, newAttributes, { id, overwrite: true })

This would essentially create a new object, but the overwrite:true will tell it to replace the existing document.
We could replace our updates with creates in both Actions and Alerts, to address this problem... except this leads us to a new problem.
The EncryptedSavedObjects have a built in limitation: if you try to create a document with an id (which we need in order to overwrite the existing document) then you get this error: Error: Predefined IDs are not allowed for saved objects with encrypted attributes.

It turns out that for security reason we don't allow this:

// Saved objects with encrypted attributes should have IDs that are hard to guess especially
// since IDs are part of the AAD used during encryption, that's why we control them within this
// wrapper and don't allow consumers to specify their own IDs directly.
if (options.id) {
throw new Error(
'Predefined IDs are not allowed for saved objects with encrypted attributes.'
);
}


My suggestion would be to do the following:

  1. Add a allowExplicitIDs field to the EncryptedSavedObjectTypeRegistration which will allow us to override this limitation specifically for alert and action Saved Object types.

/**
* Describes the registration entry for the saved object type that contain attributes that need to
* be encrypted.
*/
export interface EncryptedSavedObjectTypeRegistration {
readonly type: string;
readonly attributesToEncrypt: ReadonlySet<string | AttributeToEncrypt>;
readonly attributesToExcludeFromAAD?: ReadonlySet<string>;
}
/**
* Uniquely identifies saved object.

  1. Use client.create(type, newAttributes, { id, overwrite: true }) in place of update for our updates.

But obviously we'd need @elastic/kibana-security and @elastic/kibana-platform to weigh in before we make such a change. :)

@legrego
Copy link
Member

legrego commented Jul 29, 2020

I wonder if we could instead address this by solving #50256. I need to give this some more thought (so I'm sorry for the half-baked idea), but could we instead update the ESO client's update function to perform the client.create(type, newAttributes, { id, overwrite: true }) on behalf of consumers? This would only impact saved objects that register themselves as encrypted, so all other types will still get the current partial-update behavior.

Since partial-updates are problematic for encrypted saved objects, I think it makes sense to prevent that from happening in the first place. We could try to be smart about this and only use client.create when fields that are part of the AAD are modified, but it might not be worth the effort and added complexity.

@gmmorris
Copy link
Contributor

Yeah sure, I can take that approach, I assumed you wouldn't want that, but from our perspective that's even better (I think 🤔 )

If that approach works for you then I'll revisit my PR

@legrego
Copy link
Member

legrego commented Jul 29, 2020

I assumed you wouldn't want that, but from our perspective that's even better

Ah that makes me think I'm overlooking something nebulous 😄. Were you thinking I wouldn't want this approach because we'd be forcing full updates for encrypted types even if it wasn't strictly necessary?

(also cc @azasypkin, I'd love your input here as well)

@gmmorris
Copy link
Contributor

Were you thinking I wouldn't want this approach because we'd be forcing full updates for encrypted types even if it wasn't strictly necessary?

Exactly 👍
That and the fact that it would mean that developers who don't know the internals of ESO might be surprised by ESO update differing from SO client's update.

@azasypkin
Copy link
Member

azasypkin commented Jul 30, 2020

Specifically, we have a dictionary on the config to hold the various headers and setting a specific header to null would result in a dictionary with null values but real keys - which would get sent with the webhook.

Sorry, I'm not sure I completely understand what is going here. Would you mind giving a bit more details, maybe with real example? What exactly we pass to update, what we get back, what gets stored/encrypted and where things break etc.?

I'm not opposed to the idea of using create, but I'm concerned that create and update operations may diverge more over time (new operation specific parameters, behaviors etc.) that may result into other subtle bugs in the future as SO APIs evolve.

@gmmorris
Copy link
Contributor

gmmorris commented Jul 30, 2020

@azasypkin basically, this is the case:

Say you have a SO with these attributes:

{
    config: {
        headers: {
            asd: 123
        }
    }
}

If you use update with these attributes with the intention of deleting the “asd” header and adding the "zxc" header:

{
    config: {
        headers: {
            zxc: 456
        }
    }
}

Your SO ends up as:

{
    config: {
        headers: {
            asd: 13,
            zxc: 456
        }
    }
}

If you want to unset asd then you need to update the SO with:

{
    config: {
        headers: {
            asd: null,
            zxc: 456
        }
    }
}

But that doesn't remove the asd field, it just sets it to null.
Also, we can't control the objects in Solutions that add Alert and Action types of their own, so this isn't a maintainable solution long term.

Platform team express that this is by design for the SavedObjects and that if we want to wipe the inner fields then we need to overwrite the whole doc.

@gmmorris
Copy link
Contributor

I'm not opposed to the idea of using create, but I'm concerned that create and update operations may diverge more over time (new operation specific parameters, behaviours etc.) that may result into other subtle bugs in the future as SO APIs evolve.

I agree completely, but it doesn't sound like we have another option here, unless @elastic/kibana-platform have another suggestion.

@gmmorris
Copy link
Contributor

I made a PR that changes EncryptedSavedObjects so that it uses create with overwrite instead of update.
This fixes most of the issue but it unearthed another problem - using this approach causes us to lose optimistic concurrency control as the SavedObjectsClient doesn't take a version in it's create api.

We'll talk to @elastic/kibana-platform to see about adding this to create

@pmuellr
Copy link
Member

pmuellr commented Aug 3, 2020

I'm quite happy to see an overwriting update capability for SO's to solve our endemic problem of having to nullify our action config / secrets.

But also wondering if changing the headers from a Record<string,string> to Array<{key: string, value: string}> would work in this case as well. It's a migration change, so nasty in that regard, and am more curious than anything. What does the update do for arrays? Does it try to do the "partial update" thing, somehow, and so would still have issues deleting them, or does it consider the array to be an atomic value itself and allow it to be replaced without merging previous contents?

@legrego
Copy link
Member

legrego commented Aug 3, 2020

Does it try to do the "partial update" thing, somehow, and so would still have issues deleting them, or does it consider the array to be an atomic value itself and allow it to be replaced without merging previous contents?

It's my understanding (based on how we implemented deleteFromNamespaces) that the array is an atomic value itself, so replacing ['foo', 'bar'] with ['baz'] would result in just ['baz']

@gmmorris
Copy link
Contributor

gmmorris commented Aug 5, 2020

That's an interesting thought @pmuellr but as we're not less confident in our ability to migrate ESOs (and have reverted our usage of it until 7.10) I don't think that'll actually give us an incremental step in the right direction and as we also want to address #50256 it might be best to address them both in the same manner.

@gmmorris
Copy link
Contributor

gmmorris commented Aug 11, 2020

I wanted to summarize where we're at on this, as it turned out to be far more complicated than originally thought and now that we've decided that it is too large a change to fit into 7.9, I think we can slow down on the implementation and make sure we're going in the right direction.

In order to address this issue, we need to ensure that whenever we update an Alert's SavedObejct, we replace the whole document instead of updating the existing one. This is to avoid Elasticsearch's default behaviour of merging documents, making it impossible to delete keys in objects (not without explicitly nulling them out, which is unmaintainable in the long run).

We could do this by changing every update call in the AlertsClient to be a create with overwrite:true in the EncryptedSavedObjects client, which means changing about a half dozen call sites in the client.
When discussing this option, @legrego pointed me to the #50256 issue, which we feel is a preferable approach, as it means semantically out code will easier to follow, as it'll be clearer to new developers coming int o the code that it is semantically an update of an existing Alert, rather than creating a new one. Changing the way ESO updated documents has the added value of addressing the issue across the board for all users of ESO, which helps us as well, as we are likely to continue using ESO in the future and it means we're not likely to accidentally reintroduce this failure case in the future.
The key complication here is in bulkUpdate as it now needs to treat ESOs and SOs differently - passing the objects to different underlying APIs (bulkCreate for ESOs and bulkUpdate for SO). This is far from ideal as it means we need to partition the objects by their type, call different underlying APIs and then merge them back in the correct order. Other than the Alerting specific code, this is probably the most substantial change.

One problem this change has revealed, is that the underlying SavedObjects client does not support version on the create api, meaning we lose our optimistic concurrency control. I spoke to the Platform team, and they're happy with us adding this support into the SavedObjectsClient, which has proven to be a relatively small change (797781f).

Changing how we update documents revealed a bigger issue - which is what has made this change set far larger.
It seems we heavily relied on the partial updates of documents in the Alerts & Actions code. We use partial document updates throughout out code base (a fact that is hidden away by the typing of the SavedObjectsClient which forces a Partial<T> type on update no matter what we do [A fact I'm trying to address as part of this PR as well within our clients]). This reliance isn't just in omitting attributes of the SO (which is hard to identify in some places), but we also omit the references in most updates, which means our updates now drop the references in random places.
Addressing this requires auditing all the places we use AlertsClient.update and EncryptedSavedObjects.update to ensure we're passing in all attributes, references, version etc. - which is the lions share of the code changes in this PR.

Now that we've decided to hold off on merging this work until 7.9.1 / 7.10 (it's too close to release to fix this in 7.9) we could ask Platform/Security to address the changes in their own code and we would focus on the changes in Alerting, but I'm not sure the resulting context switching will actually speed things up. My feelings are that it would be more a question of ownership than capacity, but I can't speak for Platform / Security on this.

@pmuellr
Copy link
Member

pmuellr commented Aug 13, 2020

I've started to think of these two types of update as update and merge, with merge being the current partial update behavior. I was actually wondering yesterday about something and thinking how doing a partial update would be useful, for a particular case.

I think they are both useful. Ideally, the current SO/ESO update method should probably really be named merge, and update could then be the name of the overwrite method, but that's obviously a bit confusing. That, and I think in general ES "updates" are considered partial updates only, so mixing the meanings of update doesn't seem good.

Could we leave SO/ESO update as is, and then add this new overwriting capability as overwrite (or probably something better)?

As a separate issue, I wonder if there would be some TS pixie dust we could sprinkle over ESOs to help prevent AAD issues, for cases of partial updates. Assuming we get an "overwrite" capability and still have a partial update "update" capability for ESO's, I think I'd be very leery of using a partial update with an ESO without some typing help, deferring to just always using an "overwrite" instead.

@gmmorris
Copy link
Contributor

Sadly, this has unearthed a deeper hole and to side step digging down the rabbit hole, I've spoke to @legrego and @azasypkin and we've decided Security will take ownership of changing the ESO client instead of us as part of this PR.
It seems the usage of ESO is wider than we realised and the impact of this change is broader - and should be handled by them.

Instead, we'll make changes in Alerting to use create with overwrite:true in our code.
It's harder to maintain, but I'll try to keep that centralized in our code in a maintainable way (perhaps a wrapper in our code that we use instead of using SO directly).

As a separate issue, I wonder if there would be some TS pixie dust we could sprinkle over ESOs to help prevent AAD issues, for cases of partial updates. Assuming we get an "overwrite" capability and still have a partial update "update" capability for ESO's, I think I'd be very leery of using a partial update with an ESO without some typing help, deferring to just always using an "overwrite" instead.

Sadly, that's not doable without changing the SavedObjectClient itself as ESO is hidden by the SO client with which we interact directly.

@gmmorris
Copy link
Contributor

I've started to think of these two types of update as update and merge, with merge being the current partial update behavior. I was actually wondering yesterday about something and thinking how doing a partial update would be useful, for a particular case.

I think they are both useful. Ideally, the current SO/ESO update method should probably really be named merge, and update could then be the name of the overwrite method, but that's obviously a bit confusing. That, and I think in general ES "updates" are considered partial updates only, so mixing the meanings of update doesn't seem good.

Could we leave SO/ESO update as is, and then add this new overwriting capability as overwrite (or probably something better)?

That would definitely be out of scope for this issue :)
I'd leave that to @elastic/kibana-platform and @elastic/kibana-security to hash out on their end, but I definitely feel update by default being a merge operation is only intuitive if you're deep into the stack and understand the internals of Elasticsearch. Considering many of the people working on Kibana and the solutions don't come with a rich ES background, this is definitely something we should consider mitigating somehow. 🤔

@gmmorris
Copy link
Contributor

gmmorris commented Sep 7, 2020

After syncing with @pmuellr regarding this work and the work he's been doing around the Alert Status issue, we've realised that the approach taken my initial PR is going to clash with his.
Our approach here was to optimistically remove the ability to perform partial updates all together, but it turns out this won't work - there are cases where we can't rely on the ability to fetch an entire object to fully update it- such as when you need to update the Alert to an error status.

After reviewing Patrick's work (over here #75553) I've decided to hold off on this issue until his PR is merged. This does mean that much of the work done in my initial PR (#73688) will no longer be mergable, but it will allow us to build on top of his work to address this by providing two methods for updating an Alert - a partial one and a full one (which uses create & overwrite to address the core issue where we can't update deep objects).

I'll mark this as blocked for the time being

@gmmorris gmmorris removed the blocked label Sep 16, 2020
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants