
RuleTypes can't provide an AlertContext on recovery #87048

Closed · 3 tasks done
gmmorris opened this issue Dec 30, 2020 · 25 comments
Labels: bug, discuss, estimate:needs-research, Feature:Alerting, Feature:Alerting/RuleActions, Feature:Alerting/RulesFramework, impact:high, Team:ResponseOps

Comments

@gmmorris
Contributor

gmmorris commented Dec 30, 2020

This is a follow-up to #49405 (comment), identified as a problem as part of the work on #86761, and resulting in a bug identified by @arisonl.

The Problem

When an AlertType wishes to activate an AlertInstance it schedules actions under an actionGroup and can optionally specify an AlertInstanceContext which is provided to the actions in that actionGroup.
The Alerting Framework automatically schedules actions in the recovered actionGroup when an active AlertInstance is omitted on the next execution.

As it is the omission of an AlertInstance that causes the scheduling of actions in the recovered actionGroup, we have no access to a context in these actions.
This is by design, as we're trying to encourage detection of certain active states as part of a search, rather than querying all data and then manually identifying this state in memory. For that reason we do not want AlertType implementors to explicitly schedule actions in the recovered actionGroup, and to that effect we've made it impossible to do so without bypassing the type system in this PR.
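A simplified sketch of this flow, using the alertInstanceFactory service available to rule executors (the instance id, action group, and context fields below are purely illustrative):

    // Active path: the executor reports the instance and schedules actions under
    // an action group, attaching a context for those actions to use.
    services.alertInstanceFactory('server-1')
      .scheduleActions('default', { cpuUsage: 0.95 });

    // Recovery path: there is no call at all. If 'server-1' is simply not reported
    // on the next execution, the framework itself schedules the 'recovered' action
    // group for it, and there is no context it can attach.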

In the meantime, as surfaced by this bug, we have found that the InventoryMetricThreshold rule type has been scheduling actions under the recovered actionGroup. This causes the actions for recovery to fire twice: once explicitly as called by the AlertType, and a second time when the AlertInstance is omitted.

Addressing the 7.11 bug

I don't know if the bug identified in 7.11 is a blocker or not (that is up to @elastic/logs-metrics-ui [cc @Zacqary ]).

If this is a blocker, then I doubt we can find a solution at the Alerting Framework level that is safe to include in 7.11 post FF.
As we do intend to support this use case in the near future (hence this issue), I'd recommend that Infra remove the explicit scheduling of the recovered actionGroup in their AlertType. Firing actions on recovery will still be supported - it just won't include a context.
If this approach is unacceptable to @elastic/logs-metrics-ui, then we'll have to discuss options in @elastic/kibana-alerting-services (keeping in mind this is a behaviour we explicitly chose not to support, but hadn't sufficiently prevented at the time).

Proposed Next Steps

We want to add support for an AlertInstanceContext on recovery, but we need to reconcile the fact that there is no way to know what the context is when an instance recovers automatically.

Balancing these two states in a sustainable manner will require a solution that can support:

  1. The ability for an AlertType to recover an instance with an AlertInstanceContext in the manner that would address the need in InventoryMetricThreshold
  2. The ability for an AlertType to automatically recover an instance when it is omitted (as it works today)
  3. A safe manner for users to specify parameters for an action which can handle both cases (with or without a context) as we cannot rely on AlertType implementors to always provide the context (there are likely always edge cases where an instance auto recovers, even on AlertTypes that intend on explicitly recovering instances).
  4. A design that discourages a "query all instances and assess each one" approach. For Alerting to scale we need to leverage query-based detection, which is why we want to discourage explicit recovery.

The next steps we've decided on are thus:

  • Research the feasibility of a static context approach as proposed by @mikecote here. The basic idea is that when an alert is first created, the rule would provide some static context that doesn't change throughout the lifecycle of the alert.
  • Research the feasibility of an explicit recovery API which would allow rules to tell the framework that an alert should recover, while still providing an up-to-date context as detected by the rule.
  • Propose follow-up technical work as revealed by the two research items above
@pmuellr
Member

pmuellr commented Jan 4, 2021

Couple of thoughts.

  • one of the requests seems to be the ability to provide better default action parameters (eg, email/slack message bodies), which has an issue of its own, here: [Alerting][Discuss] allow alerts to create default action parameters #66095 (comment) - I'd like to see all of our alerts, with all of their common fields and "states" (active, recovered, etc), have useful default messages. It's possible that implementing that would be good enough for the use case of o11y triggering their own recovered state

  • one thing we can't currently get for a "recovered" instance is the current value(s) that caused it to be no longer active, since some alerts indicate a "recovered" instance via their lack of data (from the ES query run during the alert execution). This is a tough one, as it would require either alert executors to get ALL of the data for all of their instances (not just those that exceeded a threshold), or some additional ES call that could be made to fetch that current data for recovered instances.

  • I actually kind of like recovered instances having NO context data, because if we DO give them context data, it's going to be different from the context data provided by active instances. Two different shapes, or perhaps the same shape but with some properties undefined (eg, the current value checked against the threshold - defined in an "active" context, undefined in a "recovered" context). It would be nice to make these context variables ... less "variable".

  • one possibility for allowing alert types to "customize" recovery would be some kind of explicit flag indicating that the default recovery processing should not be done, and that the alert type will provide the recovery info itself, probably supplying its own context variables. That way the "complexity" of dealing with this kind of alert type is pretty explicit in the code, and for users it would just be additional complexity for those alert types. A rough sketch of what that might look like is below.
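Something along these lines, purely as a sketch (the flag name and overall shape here are made up, not an existing API):

    // Hypothetical sketch: a rule type opts out of the framework's automatic
    // recovery handling and schedules its own recovery actions with a context.
    alerting.registerType({
      id: 'example.custom-recovery',
      name: 'Example with custom recovery',
      actionGroups: [{ id: 'default', name: 'Default' }],
      defaultActionGroupId: 'default',
      handlesOwnRecovery: true, // made-up flag, not an existing option
      async executor({ services }) {
        // the rule decides an instance has recovered and supplies the context itself
        services.alertInstanceFactory('server-1')
          .scheduleActions('recovered', { reason: 'value back below threshold' });
      },
    });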

@pmuellr
Member

pmuellr commented Jan 5, 2021

I wonder if we need to be more general - maybe every action group could have different context variables?

@christophercutajar

One of the use cases where the state in the recovered action would be very helpful for us is the following:

We're monitoring about 64 different monitors with one Uptime monitor status alert. During our maintenance, we expected that a number of services would temporarily go down. Having just a message saying "recovered" isn't very helpful. A sample is the one below:

[screenshot: a sample recovery notification that only says the alert recovered]

@tamland

tamland commented May 5, 2021

We use {{alertInstanceId}} as a workaround in the recovered message. Not the prettiest, but it at least gives some information about which monitor is back up.
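For example, a recovered-message template along these lines (the wording is illustrative):

    Monitor {{alertInstanceId}} has recovered and is reachable again.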

@christophercutajar

We use {{alertInstanceId}} as a workaround in the recovered message. Not the prettiest, but it at least gives some information about which monitor is back up.

@tamland this is a good hack! Thank you for the suggestion

@shahzad31
Contributor

@gmmorris can we please prioritise this? Without it, the recovery state message is pretty much useless.

We are constantly receiving feedback on the Discuss forum from users and also from internal users.

cc @paulb-elastic

@pmuellr
Member

pmuellr commented May 26, 2021

After seeing another user request for this, specifically mentioning state context variables, I'm wondering if we DO still have the previous context.state values around - seems like we might. We maintain that state until we know we can delete it, which is when the alert recovers. Not sure if I had mentioned this before, but I think even just making the context.state values available would be a quick but incomplete win.

@shahzad31
Contributor

I think we should at least solve this confusion from the UI point of view: until we can resolve the issue, we shouldn't display the list of state variables for the recovered action in the alert flyout.

@mikecote
Contributor

@pmuellr We do have access to them when firing a recovered action, but it's the state/context from the last execution. So some of the information captured (e.g. current CPU usage) could be inaccurate.

I think we should at least solve this confusion from the UI point of view: until we can resolve the issue, we shouldn't display the list of state variables for the recovered action in the alert flyout.

This should be implemented already 🤔

@shahzad31
Contributor

@mikecote it will work for most Uptime alert use cases though, since it's mostly metadata the user is interested in: monitorId, url, etc.

@shahzad31
Contributor

Taking a deeper look at the list:

    // Variables populated by the Uptime monitor status alert (field names quoted
    // from the rule's code; monitorInfo holds the monitor's latest status information):
    const monitorStatusVariables = {
      monitorUrl: monitorInfo.url?.full,
      monitorId: monitorInfo.monitor?.id,
      monitorName: monitorInfo.monitor?.name ?? monitorInfo.monitor?.id,
      monitorType: monitorInfo.monitor?.type,
      latestErrorMessage: monitorInfo.error?.message,
      observerLocation: monitorInfo.observer?.geo?.name ?? UNNAMED_LOCATION,
      observerHostname: monitorInfo.agent?.name,
    };

For a monitor status alert instance, only the error message can change; the rest will remain the same.

@mikecote
Contributor

mikecote commented May 26, 2021

Random thought: I'm wondering if the solution to the problem should split this in two. Provide access to some metadata when the alert is created, and assume that metadata is static for the life of the alert. This would be separate from the state/context, which would remain inaccessible for recovered actions.

This may overlap with parts of the alerts-as-data story, where we could provide the alert data object for all actions. 🤔 hmm

@gmmorris
Contributor Author

gmmorris commented Jun 1, 2021

@gmmorris can we please prioritise this? Without it, the recovery state message is pretty much useless.

We are constantly receiving feedback on the Discuss forum from users and also from internal users.

cc @paulb-elastic

Sorry for the delay @shahzad31, as you're likely aware our team's entire capacity is currently diverted to the RAC initiative and unified alerting. As a consequence we've had to delay every other priority.
That said, we're not gatekeeping the framework, and if your team has capacity, we might be able to support and guide you through implementing this yourself (assuming the @elastic/kibana-alerting-services team agrees with the proposed implementation).

Let @mikecote and myself know if that's an option from your perspective.

@gmmorris
Contributor Author

This issue came up in Slack today and I provided some historical context which, I figure, is worth copy-pasting here as well.


To add some historical context around this:

The actions invoked on the recovery of an alert don't have access, via the context, to the values of the violation that recovered

As recovery is caused by the omission of an alert by a rule, there is no context available to us which we can pass into actions here (as the context is provided by the rule type implementer).
An idea had been proposed to pass in the previous context, but that raised concerns around old (and incorrect) data being included in an action on recovery, and we decided that in the interest of "the principle of least surprise" we would not release that change.

To get around this we’ve proposed a change where Rules could explicitly cause an alert to recover by providing the context.
This raised another problem though, which was that sometimes the context would be available (when recovered manually) and sometimes it would be missing (when recovered automatically).

The product concern around this possible randomness caused us to hold off on the proposed solution and in the meantime other priorities put this on ice.

@mikecote recently proposed another idea which could address this: Rules can provide a static context that remains true for the lifetime of the alert. We could pass this static data into every action, including the Recovery action.

This would mean that some context fields are still unavailable on recovery, but at least you would have some details that are not specific to an evaluation and they can be used at any point in the alert’s lifecycle.

@gmmorris gmmorris added Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Alerting/RuleActions Issues related to the Actions attached to Rules on the Alerting Framework labels Jul 1, 2021
@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jul 14, 2021
@pmuellr
Member

pmuellr commented Jul 19, 2021

Rules can provide a static context that remains true for the lifetime of the alert. We could pass this static data into every action, including the Recovery action.

This would mean that some context fields are still unavailable on recovery, but at least you would have some details that are not specific to an evaluation and they can be used at any point in the alert’s lifecycle.

How "static" would this be? Taking the index threshold rule type as an example, I'd think that there are cases where you'd like to see the last calculated value in the recovered message. Or maybe it would make more sense to see the first one, or the maximal one, or ... In any case, this isn't really "static" data in my thinking, but more "persistent" data.

And presumably this data will be in "alerts as data" documents, so perhaps there is some way of making use of that data for this specific case. Perhaps this is even a case where alerting shouldn't be dealing with this at all, and it should be more of a "RAC" capability / responsibility.

@gmmorris gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@gmmorris gmmorris added the impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. label Sep 24, 2021
@gmmorris
Contributor Author

gmmorris commented Oct 1, 2021

I have updated the description above with the next steps we've discussed synchronously: to explore the two complementary options discussed and propose concrete follow-up issues.

@mholttech

I'm also running into an issue with this. When creating an Uptime alert with integration into ServiceNow (SNOW) we would like to set the Message Key going into the SNOW Event Management Platform to {{state.url}} so that all alerts for a specific URL would roll up to a single alert. Using {{alert.id}} is not a good solution because the AlertID also contains the name of the checkpoint that performed the check. This would result in an alert per URL per checkpoint which creates confusion and alert fatigue.

We also see this in other areas such as Metric Thresholds where we have similar requirements to key off a value in Context.

@gmmorris
Contributor Author

gmmorris commented Oct 4, 2021

I'm also running into an issue with this. When creating an Uptime alert with integration into ServiceNow (SNOW) we would like to set the Message Key going into the SNOW Event Management Platform to {{state.url}} so that all alerts for a specific URL would roll up to a single alert. Using {{alert.id}} is not a good solution because the AlertID also contains the name of the checkpoint that performed the check. This would result in an alert per URL per checkpoint which creates confusion and alert fatigue.

We also see this in other areas such as Metric Thresholds where we have similar requirements to key off a value in Context.

Thanks for the feedback @mholttech, we'll explore ways in which this can be remediated (cc @elastic/uptime @elastic/logs-metrics-ui)

@sgelastic sgelastic added the bug Fixes for quality problems that affect the customer experience label Nov 5, 2021
@ymao1 ymao1 self-assigned this Nov 9, 2021
@pmuellr
Member

pmuellr commented Nov 11, 2021

To get around this we’ve proposed a change where Rules could explicitly cause an alert to recover by providing the context.
This raised another problem though, which was that sometimes the context would be available (when recovered manually) and sometimes it would be missing (when recovered automatically).

This still sounds like the best idea to me, but the second comment above is an interesting twist. Sorta feels like a rule could opt in to providing its own recovery processing, in which case alerting would NOT provide the automatic recovery processing.

I think the scariest thing about this is thinking about all the edge cases. For instance, with our current rule disable processing, we "recover" alerts since we drop the state, and have no hope of recovering them later since we did in fact lose the state. Seems like this is going to force us to come up with a different actionGroup or such for these - alert-abandoned-because-rule-disabled or something? It'll have to be something different from recover if we'd be expecting all the recover contexts to be available, since they wouldn't be in this case.

Hmmm ... or is this a case for subgroups? I can't remember how those work ... but something like:

scheduleActionsWithSubGroup('recover', 'rule-disabled', {})

@ymao1
Contributor

ymao1 commented Nov 11, 2021

Research the feasibility of an explicit recovery API which would allow rules to tell the framework that an alert should recover, while still providing an up-to-date context as detected by the rule.

POC for explicit recovery API

With this POC, we are attempting to maintain the availability of consistent context variables for a rule type across all of its action groups. We do this by allowing rule executors to retrieve the list of recovered alert IDs during an execution and to explicitly specify context variables for the recovery action. With the idea of providing a consistent user experience, we want the recovery variables to be typed the same as existing action variables.
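Roughly, an executor using these service functions might look like the following (the function names here are only illustrative of the POC's shape, not a final API):

    // Illustrative only - not the final API surface proposed by the POC.
    async function executor({ services }) {
      // Active path: schedule actions with an up-to-date context, as today.
      services.alertInstanceFactory('monitor-1')
        .scheduleActions('default', { monitorUrl: 'https://example.com' });

      // Recovery path: ask the framework which alerts recovered during this
      // execution and attach a current context to their recovery actions.
      for (const recoveredId of services.getRecoveredAlertIds()) {
        services.setRecoveryContext(recoveredId, { monitorUrl: 'https://example.com' });
      }
    }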

Just providing these service functions to the rule executor is insufficient, as rule executors will need to be updated to use them, so we have two options for providing default context for recovered alerts:

  • Default to empty. This is not great for the user since they will get meaningless recovery messages. OTOH, that is what is currently happening, so we wouldn't be making the user experience worse while rule executors do the work to specify recovery context.
  • Maintain context across executions the same way we currently maintain state. During rule execution, we would have access to the previous context of a recovered alert, so if an executor does not specify recovery context, the framework would fill in the gaps with the last seen values. (This is similar to what the lifecycle executor in RAC does, which is update the alert document with status:recovered while keeping the information populated during the previous execution of the rule when the alert was active). This is most likely expensive as it could possibly balloon the size of the task manager document that is keeping the stringified version of the context and state, especially in the case of high cardinality rules with a lot of alerts, which we know exist.

While the POC focuses on context variables, we could have a quick win by focusing on the state variables first, which we already maintain between rule executions. state is currently not propagated to recovered alert actions, but it is available. It would be a small fix to make the state available to recovery actions (this would be the state of the alert during the previous execution), and if we wanted to go further, we could allow rules to explicitly set recovered alert state. There are rule types that make heavy use of state (Uptime monitor status is one of them, and the subject of many of these SDHs).
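If we did that, a recovery action message could reference the previous execution's state directly, along these lines (the field names are illustrative and rule-type specific):

    Monitor {{state.monitorName}} ({{state.monitorUrl}}) is back up.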

@ymao1
Contributor

ymao1 commented Nov 15, 2021

Research the feasibility of a static context approach as proposed by @mikecote here. The basic idea is that when an alert is first created, the rule would provide some static context that doesn't change throughout the lifecycle of the alert.

POC for static context approach

With this POC, we are identifying a third set of action variables (staticContext) for each alert. These are identified when an alert is first created and live for the lifetime of the alert. They are serialized via the alerting task manager document, so they will potentially increase its size if a rule type identifies many static context variables. When an alert recovers, it has access to these static context variables for use in actions.
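As a rough sketch of the shape (the method name here is illustrative, not a shipped API), the executor would supply the static context once, when the alert is first created:

    // Hypothetical shape - setStaticContext is not a shipped API.
    services.alertInstanceFactory('monitor-1')
      // persisted once, for the lifetime of the alert, via the task manager document
      .setStaticContext({ monitorUrl: 'https://example.com', monitorName: 'Checkout API' })
      // per-execution context is still scheduled as today
      .scheduleActions('default', { latestErrorMessage: 'connection refused' });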

After working through the POC, it seems to me that having state, context and static context variables is kind of redundant. If static context is being persisted in task manager the same way that state is, is it really that different from putting these variables inside state, other than organizationally? And having a third prefix in the action variable dropdown might cause even more confusion.

@ymao1
Contributor

ymao1 commented Nov 15, 2021

I think we should at least solve this confusion from the UI point of view: until we can resolve the issue, we shouldn't display the list of state variables for the recovered action in the alert flyout.

This should be implemented already 🤔

Verified that this is actually not implemented. We are hiding context variables on recovery but state variables are all listed in the action variable dropdown.

@ymao1
Contributor

ymao1 commented Nov 15, 2021

After discussion with @mikecote, I will write up RFCs for the following to get feedback from rule authors:

  • Passing through state variables to the recovery context. The UI already exposes these variables for the recovery action group. Should we follow through and pass them to the recovery context even though they are from the previous execution and may be outdated?
  • Explicit recovery API.

@gmmorris
Contributor Author

While the RFC is in review, the engineering work is still to be done.
Moved back to In Progress, as this has been flagged as confusing.

@gmmorris gmmorris changed the title AlertTypes can't provide an AlertInstanceContext on recovery RuleTypes can't provide an AlertContext on recovery Jan 6, 2022
@ymao1
Contributor

ymao1 commented Jan 13, 2022

An RFC has been submitted and approved for this. We've created two follow-up issues to handle implementation:

Closing this research issue as done. Please track the implementation issues for the current status.

@ymao1 ymao1 closed this as completed Jan 13, 2022
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022