Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to throttle alert instances until action group changes #50077

Closed
mikecote opened this issue Nov 7, 2019 · 29 comments · Fixed by #82969
Closed

Ability to throttle alert instances until action group changes #50077

mikecote opened this issue Nov 7, 2019 · 29 comments · Fixed by #82969
Assignees
Labels
enhancement New value added to drive a business result Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Copy link
Contributor

mikecote commented Nov 7, 2019

This feature request would be to stop being reminded of an alert after it fires. Currently users can disable throttling (throttle: null) which causes alerts to fire actions every time or they can throttle alerts for a certain period of time (ex: throttle: 5m).

When throttling for a time window, it is considered as "remind me every X minutes" style feature. This request would be to "stop reminding me about this until the action group changes".

For example an alert that constantly fires default action group would stop firing actions until the group changes to something other than default. This would be the same as only firing actions when the action group changes. This would be useful for alerts that execute actions that are non-idempotent.

This feature could instead be configured at the alert type level instead? This would mean when defining an alert type, you could configure the throttling behaviour.

cc @mdefazio to work on mock ups for this.

Some early thought; adding a checkbox next to "Notify every" in the UI. When unchecked, alert instances are throttled until the state changes.

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-stack-services (Team:Stack Services)

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@mikecote
Copy link
Contributor Author

mikecote commented Jun 8, 2020

Instead of using "throttle" as a terminology, this could be a hook for firing actions only when entering an action group. This could also be used by maps for "entering" and "leaving" a geofenced area instead of notifying (w/ a throttle) when someone is "within" or "outside" a geofenced area.

@mikecote mikecote added the enhancement New value added to drive a business result label Aug 19, 2020
@thomasneirynck
Copy link
Contributor

mentioned here (#81631) as well, but this would be very useful for a planned "containment"-option of the geo-threshold alert (#80749).

cc @kmartastic @aaronjcaldwell

@ymao1
Copy link
Contributor

ymao1 commented Nov 6, 2020

@mdefazio Are there mockups for this?

@ymao1 ymao1 self-assigned this Nov 6, 2020
@ymao1
Copy link
Contributor

ymao1 commented Nov 6, 2020

This feature could instead be configured at the alert type level instead? This would mean when defining an alert type, you could configure the throttling behaviour.

Some early thought; adding a checkbox next to "Notify every" in the UI. When unchecked, alert instances are throttled until the state changes.

If I'm reading this correctly, there seems to be two different suggestions:

  1. Configuring at the alert type level, so each alert type would determine whether to notify on state change
  2. Adding a checkbox next to "Notify every" implies to me this is a configuration across all alerts (since this configuration exists outside of the alert type specific settings).

I think #2 makes the most sense to me, but I wanted to make sure I've read this correctly.

@mikecote
Copy link
Contributor Author

mikecote commented Nov 6, 2020

I'm thinking option 2 as well. At least we can add option 1 on top of the solution in the future if we prefer both. I would be curious of what @arisonl thinks as well.

@pmuellr
Copy link
Member

pmuellr commented Nov 6, 2020

For some reason, I thought we were just going to change to always do this. Throttle would "break" once the action group changes. If you want "throttle even if alert group changes", I think we were leaving that to the not-yet-implemented "snooze" setting (a timed mute).

Probably worth putting a doc together on all the things we're thinking about here, which would include current behavior, things like "snooze", and scheduled mutes. I'm a little worried about the combinatorial explosion of all these potential throttle/mute settings ...

@ymao1
Copy link
Contributor

ymao1 commented Nov 6, 2020

Good idea @pmuellr

This is how I understand the current behavior (please correct me if I'm wrong).

  • if the alert instance is muted, no actions will fire
  • if the alert instance is not muted:
    • if throttle=null, action(s) will fire each time an alert instance meets the alert criteria
    • if throttle is defined, action(s) will fire when the alert instance meets the alert criteria and the throttle time has elapsed since the last fired action

With #82274, we can now assign actions to different action groups, but that doesn't change any of the above behavior.

With #82412, we will be able to specify conditions per action group, but it's TBD whether changes to the above behavior will be included with that issue.

  • if we assume no changes, then if we transition from one action group to another for an alert type but the transition happens during the throttle period, then we wouldn't get a notification of that state change and it's possible that the state could change back before the throttle period expires. this is probably not ideal.
  • if the action execution logic changes, what would that look like? possibly a notification of the state change and resetting the throttle period? Would that then make this issue obsolete?

@mikecote
Copy link
Contributor Author

mikecote commented Nov 6, 2020

With #82274, we can now assign actions to different action groups, but that doesn't change any of the above behavior.

I believe changing action groups will reset the throttle window and fire actions as soon as the action group changes.

@ymao1
Copy link
Contributor

ymao1 commented Nov 6, 2020

Gotcha. So if we add a switch or checkbox with this issue, the behavior would be:

  • if the alert instance is muted, no actions will fire
  • if the alert instance is not muted:
    • if notify once flag is true, action(s) will fire when the action group changes only
    • if notify once flag is false:
      • if throttle=null, action(s) will fire each time an alert instance meets the alert criteria
      • if throttle is defined, action(s) will fire when the alert instance meets the alert criteria, action group has changed and the throttle time has elapsed since the last fired action

@mikecote
Copy link
Contributor Author

mikecote commented Nov 6, 2020

That's correct based on how I recall / envisioned it.

@pmuellr

Probably worth putting a doc together on all the things we're thinking about here, which would include current behavior, things like "snooze", and scheduled mutes. I'm a little worried about the combinatorial explosion of all these potential throttle/mute settings ...

I wonder if we can use this meta issue: #67597. You are right, there will be lots of combinations in the future.

@mdefazio
Copy link
Contributor

mdefazio commented Nov 6, 2020

I do not have mocks for this yet. But the behavior you describe seems to make sense to me.

@arisonl
Copy link
Contributor

arisonl commented Nov 12, 2020

@ymao1 The following two comments capture the user requirements on when to trigger (assuming unmuted) and the perceived state changes from a user perspective:

Your latest comment (#50077 (comment)) is consistent with these requirements, assuming we have the required set of transitions (action groups) in place.

In addition:

if throttle is defined, action(s) will fire when the alert instance meets the alert criteria, action group has changed and the throttle time has elapsed since the last fired action

If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling? Why is the "action group has changed" condition required here? Shouldn't state-change-based throttling and schedule-based throttling be mutually exclusive? Isn't the combination making the mental model very complicated? What value is the combination adding? @pmuellr @mikecote @mdefazio

@ymao1
Copy link
Contributor

ymao1 commented Nov 12, 2020

If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling? Why is the "action group has changed" condition required here? Shouldn't state-change-based throttling and schedule-based throttling be mutually exclusive? Isn't the combination making the mental model very complicated? What value is the combination adding?

@arisonl You are correct, if alert-on-state-change is false, it should default to the existing schedule-based throttling. I think I muddled my boolean logic there. It should read

if throttle is defined, action(s) will fire when (the alert instance meets the alert criteria AND the throttle time has elapsed since the last fired action) OR (the action group has changed)

@mikecote
Copy link
Contributor Author

mikecote commented Nov 12, 2020

I'm wondering if the terminology should be alert-only-on-state-change otherwise it can seem like an opt-in thing to also get notifications when the state changes which already happens with throttled alert instances.

I also wanted to clarify, "state" === "action group" so when mentioning state change, it means changing action groups. I'm ok if the technical terms change to "action group" to avoid further confusion.

@ymao1
Copy link
Contributor

ymao1 commented Nov 12, 2020

I also wanted to clarify, "state" === "action group" so when mentioning state change, it means changing action groups. I'm ok if the technical terms change to "action group" to avoid further confusion.

@mikecote Thank you. I think that clears up a point of possible confusion that @pmuellr brought up: what is the state that change that we're notifying on. Answer action group

@pmuellr
Copy link
Member

pmuellr commented Nov 12, 2020

@arisonl: If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling?

Just a note that my thoughts on the schedule-based (ie, cron) silencing of alerts is that it's all muting, not throttling. Feels like schedule-based throttling is going to become a bit of a mind-bender to customers, probably confusing if they do or don't see notifications they weren't expecting, then have to reason backwards as to why. Muting is more of a boolean, seems like a less complex thing to reason about. Select an "calendar entry" of some kind, mute it, now you know - you will NOT get notifications during that period.

It's a simplification anyway :-)

@ymao1 ymao1 changed the title Ability to throttle alert instances until state changes Ability to throttle alert instances until action group changes Nov 12, 2020
@ymao1
Copy link
Contributor

ymao1 commented Nov 12, 2020

A diagram to illustrate notification behavior with the different throttling conditions:

Screen Shot 2020-11-12 at 1 23 39 PM

@mikecote
Copy link
Contributor Author

Just wanted to drop some mock ups of how I saw the UI for this work:

Maybe by default is we no longer "notify every" and we could display the form like this:
Screen Shot 2020-11-23 at 4 42 24 PM

If ever the user wishes to be notified every x interval, they check "notify every" and enter a value:
Screen Shot 2020-11-23 at 4 41 45 PM

@mikecote
Copy link
Contributor Author

mikecote commented Nov 23, 2020

cc @arisonl it would be worth figuring out the default behaviour. Do users want to be "notified every" by default or just when the action group changes? I can see where the throttle is opt-in but there's also that weird use case of not throttling and always notify..

@pmuellr
Copy link
Member

pmuellr commented Nov 23, 2020

So the unchecked case in the mockup ^^^ is to throttle only on action group changes, I guess? And there's no way to get the old behavior of throttle: null where the actions run for every alert execution? That's what "that weird use case of not throttling and always notify.." is? That feels to me like a pretty important case, since it's the default case for every alert today.

It seems like this is going to have to end up being a three-state selector, somehow:

  1. no throttle at all (throttle: null, notifyOnlyOnActionGroupChange: false)
  2. no throttle on interval (throttle: interval, notifyOnlyOnActionGroupChange: false)
  3. throttle till action group changes (throttle: ignored, notifyOnlyOnActionGroupChange: true)

notifyOnlyOnActionGroupChange is from PR #82969

Interesting that if we do it like that, you could "toggle" notifyOnlyOnActionGroupChange without having to destroy the existing throttle value. Seems like this would be useful, if a customer wants to temporarily switch from case 2 to case 3, and then back to case 2 later.

Alternatively, throttle could become a more complex value - I guess it's an "interval/duration string" now? So we could special case a string like onlyOnActionGroupChange. But then the previous throttle value WOULD be destroyed. Feels kinda messy as well, "magic string values".

@mikecote
Copy link
Contributor Author

So the unchecked case in the mockup ^^^ is to throttle only on action group changes, I guess?

Correct, if it was labeled "Re-notify every", I think it would be more clear. Unchecked = don't re-notify.

And there's no way to get the old behavior of throttle: null where the actions run for every alert execution?

That's what "that weird use case of not throttling and always notify.." is?

Yeah, it inherits the previous problem where it's not clear null means notify all the time (not sure if others share the same). In theory null is the same as having the same value as the interval, maybe more clear to change it to that or something?

That feels to me like a pretty important case, since it's the default case for every alert today.

This is where I wished we had telemetry or maybe we could compare default behaviour of other alerting systems to see if they re-notify by default or just the first time. It's mostly to make sure we have the proper default between these 3 options and the user understand / expects the default behaviour.

Alternatively, throttle could become a more complex value - I guess it's an "interval/duration string" now? So we could special case a string like onlyOnActionGroupChange. But then the previous throttle value WOULD be destroyed. Feels kinda messy as well, "magic string values".

Another option could be to present 3 radio buttons and come up with clear labels of the 3 options the user has for how to want to be notified.

@ymao1
Copy link
Contributor

ymao1 commented Nov 24, 2020

I like the idea of explicitly presenting the 3 options somehow (with some additional descriptive words) instead of relying on the user to figure out what combination of checked/unchecked state and throttle input gets the behavior they want.

The 3 radio buttons with clear labels would work. Or maybe a dropdown with 3 values:

  • Notify only when action group changes
  • Notify every time the alert is active
  • Notify when the alert is active at the re-notification interval.

and when the third option is selected, the throttle inputs show up.

@mdefazio
Copy link
Contributor

I have started some designs for this, but I will try these new approaches as well. Knowing the right defaults is definitely an important piece of this. And how often users will want to modify it (Is the check interval more important and throttling can remain at our defaults most of the time? In which case, is interval exposed and throttling is behind an additional click?)

@pmuellr
Copy link
Member

pmuellr commented Nov 24, 2020

I'm happy changing the default, probably to "notify on action group changes". Both no throttle and throttle on a different interval always seemed kinda clunky to me, probably not what you want most of the time.

@mdefazio
Copy link
Contributor

mdefazio commented Dec 7, 2020

I've updated the mockups after some additional help from @ymao1 :
The custom interval view I think still needs work–—it looks like it's possible to set it so it conflicts with the alert schedule.
I also think the copy would greatly benefit from @gchaps . Curious to get the team's thoughts on it at this state though.

I think we should also move this to the bottom of the create flyout since this now takes up quite a bit of space. (Hopefully the typical defaults would be good enough for most and they won't have to mess with this).

image

@pmuellr
Copy link
Member

pmuellr commented Dec 7, 2020

Those views look good to me.

The custom interval view I think still needs work–—it looks like it's possible to set it so it conflicts with the alert schedule.
I also think the copy would greatly benefit from @gchaps . Curious to get the team's thoughts on it at this state though.

That's not new :-). It's always been possible to set illogical throttle values. It's not clear how much validation we should be doing on these, I sort of feel like we should have some visualization or better "explainer" for all the time-related params/config things. That would be the alert interval, throttle, the mythical "snooze", and any alert-specific values like index threshold's window parameter. Like showing some kind of time-line with the various intervals on tic lines with relative time values on the axis. Kinda thing. I was thinking at one point of trying to overlay some of that on the "preview" graphs, but I think they'll get too busy, especially for non-trivial alerts.

Perhaps what we should do in lieu of "validation" is present a warning/hint for known illogical values, like if the throttle is less than the alert interval.

In any case, worthy of another issue, we don't need to "solve" this problem here.

I think we should also move this to the bottom of the create flyout since this now takes up quite a bit of space. (Hopefully the typical defaults would be good enough for most and they won't have to mess with this).

I'm leery of separating the alert interval from the throttle - so if the suggestion is we move both, then +1. Otherwise, if interval is at the top, and threshold is at the bottom, doesn't feel right.

@mdefazio
Copy link
Contributor

mdefazio commented Dec 8, 2020

I'm leery of separating the alert interval from the throttle - so if the suggestion is we move both, then +1. Otherwise, if interval is at the top, and threshold is at the bottom, doesn't feel right.

Yes, apologies for the confusion. My thought was that the whole block shown in the screenshots would move to the bottom

@pmuellr
Copy link
Member

pmuellr commented Dec 8, 2020

My thought was that the whole block shown in the screenshots would move to the bottom

Perfect - I think we discussed that in a previous design meeting, and seemed good to me.

FrankHassanabad added a commit that referenced this issue Aug 26, 2021
…t and side car notifications (Part 1) (#109722)

## Summary

Removes the "side car" actions object and side car notification (Part 1). Part 1 makes it so that newly created rules and editing existing rules will update them to using the new side car notifications. Part 2 in a follow up PR will be the migrations to move the existing data. 

The saved object side we are removing usages of is:
```
siem-detection-engine-rule-actions
```

The alerting side car notification system we are removing is:
```
siem.notifications
```

* Removes the notification files and types
* Adds transform to and from alerting concepts of `notityWhen` and our `throttle`
* Adds unit tests for utilities and pure functions created 
* Updates unit tests to have more needed jest mock
* Adds business rules and logic for the different states of `notifyWhen`, and `throttle` on each of the REST routes to determine when we should `muteAll` vs. not muting using secondary API call from client alerting
* Adds e2e tests for the throttle conditions and how they are to interact with the kibana-alerting `throttle` and `notifyWhen`

A behavioral change under the hood is that we now support the state changes of `muteAll` from the UI/UX of [stack management](https://www.elastic.co/guide/en/kibana/master/create-and-manage-rules.html#controlling-rules). Whenever the `security_solution` ["Perform no actions"](https://www.elastic.co/guide/en/security/current/rules-api-create.html
) is selected we do a `muteAll`. However, we do not change the state if all individual actions are muted within the rule. Instead we only maintain the state of `muteAll`:

<img width="2299" alt="ui_state_change" src="https://user-images.githubusercontent.com/1151048/130823045-48a9f34b-db23-44e3-b9ed-cbbb57edc3d6.png">

<img width="1163" alt="no_actions_state_change" src="https://user-images.githubusercontent.com/1151048/130823056-3f8953fa-9433-4973-a2d3-6e11263b9619.png">

Ref:
* Issue and PR where notifyWhen was added to kibna-alerting
  * #82969
  * #50077  

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Aug 26, 2021
…t and side car notifications (Part 1) (elastic#109722)

## Summary

Removes the "side car" actions object and side car notification (Part 1). Part 1 makes it so that newly created rules and editing existing rules will update them to using the new side car notifications. Part 2 in a follow up PR will be the migrations to move the existing data. 

The saved object side we are removing usages of is:
```
siem-detection-engine-rule-actions
```

The alerting side car notification system we are removing is:
```
siem.notifications
```

* Removes the notification files and types
* Adds transform to and from alerting concepts of `notityWhen` and our `throttle`
* Adds unit tests for utilities and pure functions created 
* Updates unit tests to have more needed jest mock
* Adds business rules and logic for the different states of `notifyWhen`, and `throttle` on each of the REST routes to determine when we should `muteAll` vs. not muting using secondary API call from client alerting
* Adds e2e tests for the throttle conditions and how they are to interact with the kibana-alerting `throttle` and `notifyWhen`

A behavioral change under the hood is that we now support the state changes of `muteAll` from the UI/UX of [stack management](https://www.elastic.co/guide/en/kibana/master/create-and-manage-rules.html#controlling-rules). Whenever the `security_solution` ["Perform no actions"](https://www.elastic.co/guide/en/security/current/rules-api-create.html
) is selected we do a `muteAll`. However, we do not change the state if all individual actions are muted within the rule. Instead we only maintain the state of `muteAll`:

<img width="2299" alt="ui_state_change" src="https://user-images.githubusercontent.com/1151048/130823045-48a9f34b-db23-44e3-b9ed-cbbb57edc3d6.png">

<img width="1163" alt="no_actions_state_change" src="https://user-images.githubusercontent.com/1151048/130823056-3f8953fa-9433-4973-a2d3-6e11263b9619.png">

Ref:
* Issue and PR where notifyWhen was added to kibna-alerting
  * elastic#82969
  * elastic#50077  

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
kibanamachine added a commit that referenced this issue Aug 26, 2021
…t and side car notifications (Part 1) (#109722) (#110305)

## Summary

Removes the "side car" actions object and side car notification (Part 1). Part 1 makes it so that newly created rules and editing existing rules will update them to using the new side car notifications. Part 2 in a follow up PR will be the migrations to move the existing data. 

The saved object side we are removing usages of is:
```
siem-detection-engine-rule-actions
```

The alerting side car notification system we are removing is:
```
siem.notifications
```

* Removes the notification files and types
* Adds transform to and from alerting concepts of `notityWhen` and our `throttle`
* Adds unit tests for utilities and pure functions created 
* Updates unit tests to have more needed jest mock
* Adds business rules and logic for the different states of `notifyWhen`, and `throttle` on each of the REST routes to determine when we should `muteAll` vs. not muting using secondary API call from client alerting
* Adds e2e tests for the throttle conditions and how they are to interact with the kibana-alerting `throttle` and `notifyWhen`

A behavioral change under the hood is that we now support the state changes of `muteAll` from the UI/UX of [stack management](https://www.elastic.co/guide/en/kibana/master/create-and-manage-rules.html#controlling-rules). Whenever the `security_solution` ["Perform no actions"](https://www.elastic.co/guide/en/security/current/rules-api-create.html
) is selected we do a `muteAll`. However, we do not change the state if all individual actions are muted within the rule. Instead we only maintain the state of `muteAll`:

<img width="2299" alt="ui_state_change" src="https://user-images.githubusercontent.com/1151048/130823045-48a9f34b-db23-44e3-b9ed-cbbb57edc3d6.png">

<img width="1163" alt="no_actions_state_change" src="https://user-images.githubusercontent.com/1151048/130823056-3f8953fa-9433-4973-a2d3-6e11263b9619.png">

Ref:
* Issue and PR where notifyWhen was added to kibna-alerting
  * #82969
  * #50077  

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Frank Hassanabad <frank.hassanabad@elastic.co>
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants