
[Alerts] Performance benchmarks #40264

Closed · 9 tasks done
pmuellr opened this issue Jul 3, 2019 · 5 comments
Labels: Feature:Alerting, performance, Team:ResponseOps


pmuellr commented Jul 3, 2019

Alerting system load is affected by a number of factors:

  • number of alerts
  • frequency of alert checks
  • the load the alert check places on Elasticsearch (number of queries and query duration)
  • concurrent alert checks (spiky vs level load)
  • number of actions fired per alert

These can create a variety of load patterns. Under the hood, both alert checks and actions are handled by Task Manager, which is backed by Elasticsearch, and each of these layers has its own throughput limits. As the system evolves we need a way to reproduce different types and sizes of load and observe the performance characteristics in different environments.
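
For context, Task Manager throughput in a Kibana instance is bounded by kibana.yml settings along these lines (the values shown are illustrative, not recommendations):

```yaml
# kibana.yml -- Task Manager settings that bound alerting/action throughput
# (illustrative values; check the Task Manager docs for the defaults in your version)
xpack.task_manager.poll_interval: 3000   # how often Task Manager polls for work, in ms
xpack.task_manager.max_workers: 20       # max concurrent tasks per Kibana instance
```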

The objective of this issue is to build out such a tool. There are existing command-line utilities, such as @pmuellr's kbn-actions and kbn-alerts repositories, as well as the alerting samples, that we could build upon to make this easy to set up, run, and tear down.

I think ideally we'd have some way to control the variables above, along with a generic alert type that could take one or more Elasticsearch queries (in SQL or ES DSL) to control the load of the alert check.
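
As a sketch of what that could look like, here is a hypothetical params shape for such a generic alert type (this alert type does not exist; all names below are made up for illustration):

```typescript
// Hypothetical params for a generic "load generator" alert type: each check runs the
// listed queries, so the number and cost of the queries control the load on Elasticsearch.
interface LoadAlertParams {
  queries: Array<
    | { kind: 'dsl'; index: string; body: Record<string, unknown> } // raw ES query DSL
    | { kind: 'sql'; query: string }                                // Elasticsearch SQL
  >;
}

const exampleParams: LoadAlertParams = {
  queries: [
    {
      kind: 'dsl',
      index: 'load-test-*',
      body: { query: { range: { '@timestamp': { gte: 'now-5m' } } } },
    },
    { kind: 'sql', query: 'SELECT COUNT(*) FROM "load-test-*"' },
  ],
};
```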

Steps

To-Do:

  • Increase the max worker limit for cloud users (to something like 50, currently 20)

To-Do (kbn-alert-load):

Performance study:

  • Alerts benchmarking
  • Alerts vs actions benchmarking
  • Alerts vs ingestion benchmarking
Original description

Ran a stress test yesterday with an alert that always triggers an action. Created 1000 of them, interval 1s, action .server_log.

Never crashed or anything, but ES CPU was steady at >100% the entire time, while Kibana stayed under 10%. No noticeable memory growth. Ran for ~12 hours.

Need to look into the ES perf ...

@elasticmachine

Pinging @elastic/kibana-stack-services


pmuellr commented Jul 3, 2019

see PR #40291

@peterschretlen changed the title from "[Alerts] ES perf issue stress testing with 1000's of firing alerts" to "[Alerts] Performance benchmarks" on Nov 19, 2019
@bmcconaghy added the Team:ResponseOps label and removed the Team:Stack Services label on Dec 12, 2019

pmuellr commented Oct 7, 2020

Since this issue was last updated, the Kibana team has started doing some perf/load testing of their own. We should probably build on what they've done.

For more info, see issue #73189 (comment)


pmuellr commented Oct 21, 2020

Some additional thoughts.

We should aim to be able to run a manually launched but otherwise automated set of tests on cloud that can:

  • either spin up a new cloud instance, or point to an existing one
  • change task manager poll interval / max workers (when they become configurable)
  • change # of Kibana instances and ES instances, and the RAM associated with them
  • change the number of alerts, and how many instances are generated from them

There are a ton of knobs and dials, but given the combinatorial explosion, we should start small :-)
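
To make those combinations concrete, a test scenario could be described with something like the following. This is purely a hypothetical shape for illustration, not the actual kbn-alert-load configuration format:

```typescript
// Hypothetical scenario descriptor covering the knobs listed above.
interface BenchmarkScenario {
  deployment: {
    useExisting?: string;   // cloud deployment id to reuse; omit to spin up a new one
    kibanaInstances: number;
    esInstances: number;
    kibanaRamGb: number;
    esRamGb: number;
  };
  taskManager: { pollIntervalMs: number; maxWorkers: number };
  load: { alerts: number; instancesPerAlert: number; alertIntervalSeconds: number };
}

const smallScenario: BenchmarkScenario = {
  deployment: { kibanaInstances: 1, esInstances: 1, kibanaRamGb: 1, esRamGb: 4 },
  taskManager: { pollIntervalMs: 3000, maxWorkers: 10 },
  load: { alerts: 100, instancesPerAlert: 1, alertIntervalSeconds: 10 },
};
```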

I've lately been measuring the "throughput" of the alerting/actions tasks by looking at the actions:execute and alerting:execute event documents and counting them in a date histogram. This gives a rough number of how many alerts/actions are running per unit of time, and it seems to produce reasonable results based on experiments adding and removing Kibana instances on cloud.
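
A rough sketch of that measurement, assuming the default Kibana event log index pattern and its ECS field names (adjust for your version), counting alerting executions per minute over the last hour:

```
POST .kibana-event-log*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "executions_per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" }
    }
  }
}
```

Swapping "alerting" for "actions" in the event.provider filter gives the equivalent actions:execute rate.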

We should also figure out some stats to gauge the general "health" of ES and Kibana. Probably CPU and memory usage would be a decent start, and adding some more ES stats later would be good.

In the end, it would be nice to have a report comparing how these combinations of settings change these metrics.

I've been using the index threshold alert and feeding the index it queries with live data, to control whether actions run or not. It seems like a decent alert to test with. I've been using the server log action, which might actually have about the same latency as a "real" action (since most actions are HTTP calls to other cloud services), but working in a webhook call to some interesting and non-spammy system would be more realistic.
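
For reference, a present-day sketch of creating such a rule via the HTTP API. The endpoint and field names reflect current Kibana, not the API as it existed when this comment was written, and the connector id is a placeholder:

```sh
# Create an index threshold rule that runs every 10s and fires a server log action.
# Replace <server-log-connector-id> with the id of an existing server log connector.
curl -X POST "http://localhost:5601/api/alerting/rule" \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -u elastic:changeme \
  -d '{
    "name": "load-test threshold",
    "rule_type_id": ".index-threshold",
    "consumer": "alerts",
    "schedule": { "interval": "10s" },
    "params": {
      "index": ["load-test"],
      "timeField": "@timestamp",
      "aggType": "count",
      "groupBy": "all",
      "timeWindowSize": 5,
      "timeWindowUnit": "m",
      "thresholdComparator": ">",
      "threshold": [0]
    },
    "actions": [
      { "group": "threshold met", "id": "<server-log-connector-id>", "params": { "message": "alert fired" } }
    ]
  }'
```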

@mikecote

I'm closing this issue now that we have the kbn-alert-load tool built to measure performance benchmarks.

There are two follow-up issues created that will be prioritized separately:
