
[Alerts] Performance benchmarks #40264

Closed · 9 tasks done
pmuellr opened this issue Jul 3, 2019 · 5 comments
Labels: Feature:Alerting, performance, Team:ResponseOps


pmuellr commented Jul 3, 2019

Alerting system load is affected by a number of factors:

  • number of alerts
  • frequency of alert checks
  • the load the alert check places on Elasticsearch (number of queries and query duration)
  • concurrent alert checks (spiky vs level load)
  • number of actions fired per alert

These can create a variety of load patterns. Under the hood, both alert checks and actions are handled by Task Manager, which is backed by Elasticsearch, and each of these layers has its own throughput limits. As the system evolves we need a way to reproduce different types and sizes of load and observe the performance characteristics in different environments.
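
For context, Task Manager throughput in a Kibana instance is bounded by kibana.yml settings along these lines (the values shown are illustrative, not recommendations):

```yaml
# kibana.yml -- Task Manager settings that bound alerting/action throughput
# (illustrative values; check the Task Manager docs for the defaults in your version)
xpack.task_manager.poll_interval: 3000   # how often Task Manager polls for work, in ms
xpack.task_manager.max_workers: 20       # max concurrent tasks per Kibana instance
```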

The objective of this issue is to build out such a tool. There are existing command-line utilities, such as @pmuellr's kbn-actions and kbn-alerts repositories, as well as the alerting samples, that we could build upon to make this easy to set up, run, and tear down.

I think ideally we'd have some way to control the variables above, along with a generic alert type that could take one or more Elasticsearch queries (in SQL or ES DSL) to control the load of the alert check.
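
As a sketch of what that could look like, here is a hypothetical params shape for such a generic alert type (this alert type does not exist; all names below are made up for illustration):

```typescript
// Hypothetical params for a generic "load generator" alert type: each check runs the
// listed queries, so the number and cost of the queries control the load on Elasticsearch.
interface LoadAlertParams {
  queries: Array<
    | { kind: 'dsl'; index: string; body: Record<string, unknown> } // raw ES query DSL
    | { kind: 'sql'; query: string }                                // Elasticsearch SQL
  >;
}

const exampleParams: LoadAlertParams = {
  queries: [
    {
      kind: 'dsl',
      index: 'load-test-*',
      body: { query: { range: { '@timestamp': { gte: 'now-5m' } } } },
    },
    { kind: 'sql', query: 'SELECT COUNT(*) FROM "load-test-*"' },
  ],
};
```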

Steps

To-Do:

  • Increase the max worker limit for cloud users (to something like 50, currently 20)

To-Do (kbn-alert-load):

Performance study:

  • Alerts benchmarking
  • Alerts vs actions benchmarking
  • Alerts vs ingestion benchmarking
Original description

Ran a stress test yesterday with an alert that always triggers an action. Created 1000 of them, interval 1s, action .server_log.

Never crashed or anything, but ES CPU was steady at >100% the entire time, while Kibana stayed under 10%. No noticeable memory growth. Ran for ~12 hours.

Need to look into the ES perf ...

@elasticmachine

Pinging @elastic/kibana-stack-services


pmuellr commented Jul 3, 2019

see PR #40291

@peterschretlen changed the title from "[Alerts] ES perf issue stress testing with 1000's of firing alerts" to "[Alerts] Performance benchmarks" on Nov 19, 2019
@bmcconaghy added the Team:ResponseOps label and removed the Team:Stack Services label on Dec 12, 2019

pmuellr commented Oct 7, 2020

Since this issue was last updated, the Kibana team has started doing some perf/load testing of their own. We should probably build on what they've done.

For more info, see issue #73189 (comment)


pmuellr commented Oct 21, 2020

Some additional thoughts.

We should aim to be able to run a manually launched but otherwise automated set of tests on cloud that can:

  • either spin up a new cloud instance, or point to an existing one
  • change task manager poll interval / max workers (when they become configurable)
  • change # of Kibana instances and ES instances, and the RAM associated with them
  • change the number of alerts, and how many instances are generated from them

There are a ton of knobs and dials, but given the combinatorial explosion, we should start small :-)
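
To make those combinations concrete, a test scenario could be described with something like the following. This is purely a hypothetical shape for illustration, not the actual kbn-alert-load configuration format:

```typescript
// Hypothetical scenario descriptor covering the knobs listed above.
interface BenchmarkScenario {
  deployment: {
    useExisting?: string;   // cloud deployment id to reuse; omit to spin up a new one
    kibanaInstances: number;
    esInstances: number;
    kibanaRamGb: number;
    esRamGb: number;
  };
  taskManager: { pollIntervalMs: number; maxWorkers: number };
  load: { alerts: number; instancesPerAlert: number; alertIntervalSeconds: number };
}

const smallScenario: BenchmarkScenario = {
  deployment: { kibanaInstances: 1, esInstances: 1, kibanaRamGb: 1, esRamGb: 4 },
  taskManager: { pollIntervalMs: 3000, maxWorkers: 10 },
  load: { alerts: 100, instancesPerAlert: 1, alertIntervalSeconds: 10 },
};
```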

I've lately been measuring the "throughput" of the alerting/actions tasks by looking at the actions:execute and alerting:execute event documents and counting them in a date histogram. This gives a rough number of how many alerts/actions are running per unit of time, and it seems to produce reasonable results based on experiments adding and removing Kibana instances on cloud.
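
A rough sketch of that measurement, assuming the default Kibana event log index pattern and its ECS field names (adjust for your version), counting alerting executions per minute over the last hour:

```
POST .kibana-event-log*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "executions_per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" }
    }
  }
}
```

Swapping "alerting" for "actions" in the event.provider filter gives the equivalent actions:execute rate.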

We should also figure out some stats to gauge the general "health" of ES and Kibana. Probably CPU and memory usage would be a decent start, and adding some more ES stats later would be good.

In the end, it would be nice to have a report comparing how these combinations of settings change these metrics.

I've been using the index threshold alert and feeding the index it queries with live data, to control whether actions run or not. It seems like a decent alert to test with. I've been using the server log action, which might actually have about the same latency as a "real" action (since most actions are HTTP calls to other cloud services), but working in a webhook call to some interesting and non-spammy system would be more realistic.
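
For reference, a present-day sketch of creating such a rule via the HTTP API. The endpoint and field names reflect current Kibana, not the API as it existed when this comment was written, and the connector id is a placeholder:

```sh
# Create an index threshold rule that runs every 10s and fires a server log action.
# Replace <server-log-connector-id> with the id of an existing server log connector.
curl -X POST "http://localhost:5601/api/alerting/rule" \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -u elastic:changeme \
  -d '{
    "name": "load-test threshold",
    "rule_type_id": ".index-threshold",
    "consumer": "alerts",
    "schedule": { "interval": "10s" },
    "params": {
      "index": ["load-test"],
      "timeField": "@timestamp",
      "aggType": "count",
      "groupBy": "all",
      "timeWindowSize": 5,
      "timeWindowUnit": "m",
      "thresholdComparator": ">",
      "threshold": [0]
    },
    "actions": [
      { "group": "threshold met", "id": "<server-log-connector-id>", "params": { "message": "alert fired" } }
    ]
  }'
```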

@mikecote

I'm closing this issue now that we have the kbn-alert-load tool built to measure performance benchmarks.

There are two follow-up issues created that will be prioritized separately:
