
Measure APM agent impact on the platform performance #78792

Closed
mshustov opened this issue Sep 29, 2020 · 60 comments
Labels
enhancement New value added to drive a business result impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@mshustov
Contributor

We are working on enabling the APM agent in the prod build (#70497). Before making this happen, we want to understand what performance overhead it adds to the Kibana server. We might be able to re-use the setup introduced in #73189 to measure the average response time & the number of requests Kibana can handle with and without the APM agent enabled.

@mshustov mshustov added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc enhancement New value added to drive a business result labels Sep 29, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

Setup

API performance testing is based on the setup from https://github.com/dmlemeshko/kibana-load-testing. I adjusted the number of requests so as not to overwhelm the APM server.

  setUp(
    scn.inject(
      constantConcurrentUsers(15) during (2 minute),
      rampConcurrentUsers(15) to (20) during (2 minute)
    ).protocols(httpProtocol)
  ).maxDuration(15 minutes)

Tests are run against 7.10.0-SNAPSHOT.

Results

The APM agent seems to add significant overhead (see the 95th percentile).

Without APM agent:

2020-10-08_11-11-25
Download the results in HTML: 7.10.0-without-apm.zip

With APM agent:

2020-10-08_11-12-07
Download the results in HTML: 7.10.0-with-apm.zip

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

The tested Kibana image doesn't contain the changes introduced in #78697,
so I added breakdownMetrics: false to the dist APM config manually. It slightly improves the situation:
2020-10-08_11-33-55

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html provides some details on how to squeeze out a bit more performance.
The simplest option is to reduce the sample rate: https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html#performance-sampling
It seems we can use 0.2-0.3 as a default value and adjust it via the config file.
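For reference, the sample rate corresponds to the agent's transactionSampleRate setting; a minimal kibana.yml sketch, with an illustrative value in the range suggested above:

elastic.apm.transactionSampleRate: 0.3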
Numbers:

  • sample ratio 0.1
    2020-10-08_12-38-47
  • sample ratio 0.2
    2020-10-08_12-09-58
  • sample ratio 0.3
    2020-10-08_12-16-32
  • sample ratio 0.5
    2020-10-08_12-37-31

Other config values don't seem to affect CPU as much as the sample rate does, so I decided not to use them. @vigneshshanmugam do you have anything to add?

@vigneshshanmugam
Member

As you have already figured out, transactionSampleRate is the go-to setting we recommend for tuning both the Node.js and RUM agents for performance, since transactions are dropped based on it.

Perf tuning RUM agent - https://www.elastic.co/guide/en/apm/agent/rum-js/current/performance-tuning.html

  • breakdownMetrics - Disabling it certainly helps a lot in the RUM agent for custom transactions vs page-load transactions. I don't know how the above test schedules the load and which browsers it runs, so I can't say for sure whether it's going to have a huge impact, but my recommendation would be to keep it at false if it helps.

  • centralConfig - Disable this one, as it introduces one additional request to the APM server. Defaults to true in Node.js and false in RUM.

  • metricsInterval - Can you try increasing this interval or disabling metrics reporting and check whether it helps? This controls metrics capturing in the Node.js agent. https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html#metrics-interval

I can't seem to find any other config that would help.

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

I don't know how the above test schedules the load and which browsers it runs, so I can't say for sure whether it's going to have a huge impact, but my recommendation would be to keep it at false if it helps.

It tests the server-side API performance only.

centralConfig - Disable this one, as it introduces one additional request to the APM server. Defaults to true in Node.js and false in RUM.

Already disabled in my tests.

metricsInterval - Can you try increasing this interval or disabling metrics reporting and check whether it helps? This controls metrics capturing in the Node.js agent.

A test with metricsInterval: '120s' and transactionSampleRate: 0.3 slightly improves the situation (compared to transactionSampleRate: 0.3 alone):
2020-10-08_14-46-02

@pgayvallet
Contributor

So in summary, even with the 'best' compromise configuration, the 95th percentile is doubled and the 50th percentile tripled, right? This is... significant.

@mshustov
Contributor Author

mshustov commented Oct 21, 2020

So in summary, even with the 'best' compromise configuration, the 95th percentile is doubled and the 50th percentile tripled, right? This is... significant

The best configuration is transactionSampleRate: 0.1, breakdownMetrics: false, centralConfig: false, metricsInterval: '120s'. The 50th percentile is doubled from 118ms to 225ms; the 95th percentile is almost doubled from 574ms to 950ms.
It mostly affects query functionality when requesting the /api/saved_objects/* & /api/metrics/vis/data endpoints. The "query timeseries data" test case is almost tripled.

with APM enabled:

@mshustov
Contributor Author

mshustov commented Nov 6, 2020

@TinaHeiligers you asked how to perform testing:

how to run Kibana with APM agent locally:

  • clone https://github.com/elastic/apm-integration-testing
  • cd apm-integration-testing
  • Run ES & APM servers with ./scripts/compose.py start master --no-kibana
  • cd ../kibana
  • you might need to change elasticsearch credentials (I used admin/changeme)
  • make sure APM agent is active and points to the local APM server - set in kibana.yml:
elastic.apm.active: true
elastic.apm.serverUrl: 'http://127.0.0.1:8200'
# elastic.apm.secretToken: ... <-- might be required in prod/cloud
# optional settings to adjust performance
# see https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html
elastic.apm.centralConfig: false
elastic.apm.breakdownMetrics: false
elastic.apm.transactionSampleRate: 0.1
elastic.apm.metricsInterval: '120s'
  • run Kibana: ELASTIC_APM_ACTIVE=true yarn start
  • you can see transactions in the APM app
  • stop Kibana
  • stop ES & APM servers: cd apm-integration-testing; ./scripts/compose.py stop

how to run load testing against Kibana:
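(The thread doesn't fill this step in here; as a sketch, the commands used elsewhere in this thread to run the DemoJourney simulation from a https://github.com/elastic/kibana-load-testing checkout are:)

mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney
# pointing the simulation at a specific Kibana instance goes through the repo's config files,
# e.g. export env=config/<your-env>.conf as in the Cloud runs below (the exact file name here is an assumption)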

how to test Kibana on Cloud

  • spin up Kibana v7.10 server on Cloud
  • adjust https://github.com/elastic/kibana-load-testing config to point to Cloud instance
  • run kibana-load-testing against v7.10 on Cloud to get numbers without APM agent enabled.
  • spin up APM server on Cloud
  • adjust kibana.yml file to enable APM agent (ask Cloud team for assistance - elastic.apm.* settings aren't listed in allow list) and point to APM server in Cloud
  • make sure APM agent works and Kibana communicates with APM server (see APM app in Kibana)
  • perform load testing and compare numbers with the previous results
  • feel free to adjust APM settings to compare how config values affect the results

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 10, 2020

@restrry I've followed your instructions above and with a little tweaking, was able to run the load tests against a local Kibana instance with and without APM running (through Docker).

My setup thus far:

  • Kibana local on master
  • ES and APM server through Docker
  • Load testing against the default DemoJourney with version=8.0.0

I left the DemoJourney simulation as is regarding requests:

  setUp(
    scn
      .inject(
        constantConcurrentUsers(20) during (3 minute), // 1
        rampConcurrentUsers(20) to (50) during (3 minute) // 2
      )
      .protocols(httpProtocol)
  ).maxDuration(15 minutes)

In the screen shots below, I've highlighted the same queries in both cases, for ease of comparison.
Without APM:
local_without_APM

Full Results:
local_Kibana_without_APM.zip

With APM, using the Kibana apm settings suggested in the instructions:
local_with_APM

Full Results:
local_Kibana_with_APM.zip

Summary:
We are indeed seeing an impact of APM on Kibana performance, with an increase in the 95th percentile response times.
I'll redo everything from v7.10-SNAPSHOT, after which I'll move on to Cloud unless I hear otherwise 😉 .

@mshustov
Contributor Author

mshustov commented Nov 11, 2020

Looks good overall. The only outlier is the query dashboard list case, which is faster in the 95th percentile with the APM agent enabled.

I'll redo everything from v7.10-SNAPSHOT, after which I'll move on to Cloud unless I hear otherwise 😉 .

🚀

@TinaHeiligers
Contributor

Progress was slow today; I really struggled to get Kibana 7.10 running and resorted to running Kibana off the distributable.

Load tests without APM:
Elasticsearch: snapshot v7.10
Kibana: 7.10 (distributable)

kibana-7_10-localDistributable-no-APM
Note: Nothing really useful from this setup as roughly half of the queries threw errors.

Full results:
demojourney-20201111231041086.zip

Load tests with APM:
Elasticsearch and APM run from Docker (v7.10)
Kibana: 7.10 (distributable) with apm configured

Kibana-7_10-localDistributable-with-APM

Full results:
demojourney-20201111234910265.zip

Summary:
There's a huge discrepancy in the results from the queries that were successful. I don't trust these results and am moving on to Cloud testing instead. Hopefully that will be more reliable 😉

@mshustov
Contributor Author

mshustov commented Nov 12, 2020

Note: Nothing really useful from this setup as roughly half of the queries threw errors.

@dmlemeshko I experienced a similar problem when only the login scenario succeeded. What could be a reason for this?

@TinaHeiligers What Cloud settings did you use? There are recommended ones in https://github.com/elastic/kibana-load-testing

elasticsearch {
    deployment_template = "gcp-io-optimized"
    memory = 8192
}
kibana {
    memory = 1024
}

@dmlemeshko
Member

dmlemeshko commented Nov 12, 2020

I fixed a login issue for 7.10 when running load testing with a new deployment; the canvas endpoints also needed to be updated.
Here is my test run:

export API_KEY=<Key generated on Staging Cloud>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Gatling Stats - Global Information 2020-11-12 12-26-54

demojourney-20201112111618663.zip

The 7.10.0.conf deploy config has the same memory values @restrry posted above.

Another run with rampConcurrentUsers changed to 20..150
Gatling Stats - Global Information 2020-11-12 12-52-25

demojourney-20201112113251442.zip

@TinaHeiligers
Contributor

@restrry

What Cloud settings did you use?

I haven't tested on Cloud yet, I'll do that today with the recommended settings.

@TinaHeiligers
Contributor

@dmlemeshko Thanks for fixing that issue! I reran the load test on a local Kibana 7.10 distributable and am not getting the errors seen previously.

Test setup for both runs:

  setUp(
    scn
      .inject(
        constantConcurrentUsers(20) during (3 minute), // 1
        rampConcurrentUsers(20) to (50) during (3 minute) // 2
      )
      .protocols(httpProtocol)
  ).maxDuration(15 minutes)

Load tests without APM:
Elasticsearch: snapshot v7.10
Kibana: 7.10 (distributable)
Kibana-7_10-distributable-no-APM-test-2

Full result
demojourney-20201112153808121.zip

Load tests with APM:
Elasticsearch and APM run from Docker (v7.10)
Kibana: 7.10 (distributable) with apm configured
Kibana-7_10-distributable-with-APM-test2

Full result
demojourney-20201112161648302.zip

Summary:
With the exception of the request to discover and discover query 2, all the response times increase when APM is enabled.
Of the response times already starting at over 500ms, the increase ranged between 12% and 40%, taking the login response time to over 1000ms, with "query gauge data" approaching the 1000ms mark.

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 12, 2020

On Cloud staging, using an existing deployment without APM:

Load test results without APM
Cloud_Staging_no_APM

Full Result
demojourney-20201114173519337.zip

I'm reaching out to the Cloud folks to add the apm* config to the Cloud deployment and will post the results when I have them.

On cloud staging, using an existing deployment with APM:

Test run:

mvn install
export env=config/cloud-tina-7.10.0.conf # contains the details of the cloud staging env
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Load test results with APM
cloud_staging_with_APM

Full results
demojourney-20201117164239422.zip

On Cloud staging, creating a deployment as part of the test run:

export API_KEY=<Key generated on Staging Cloud>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Script-created deployment
deploy-config:

version = 7.10.0

elasticsearch {
    deployment_template = "gcp-io-optimized"
    memory = 8192
}

kibana {
    memory = 1024
}

Load test results
Cloud_Kibana7-10-auto-deploymentcreation

Full Result
demojourney-20201112213550120.zip

On Cloud staging, creating a deployment as part of the test run: Not done

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 17, 2020

@restrry I've added the results from the Kibana load testing on the cloud (staging) test run where APM is enabled in Kibana.
The results are similar to what we've seen on local instances of Kibana with and without APM: an overall increase in the 95th percentile response times of ~16%.
For both test runs, the number of concurrent users was set to 20 for 3 minutes and then ramped up from 20 to 50 over a 3-minute interval.

Please let me know if I should repeat the tests with fewer/more concurrent users and/or change any of the APM settings.
I will document the steps to take to add configurations not exposed by default on Cloud. Please let me know where the best place is to add these (I don't think making it public in this issue is appropriate 😉 )

cc @joshdover

@dmlemeshko
Member

dmlemeshko commented Nov 17, 2020

@TinaHeiligers @restrry
If you want to have more "clean" test results, I suggest spinning up a VM in the same region where you create the stack deployment.

I can help with it, but if you are familiar with how to add a VM, the follow-up steps are:

# e.g. I run tests and create the VM in Frankfurt (europe-west3-a)
# zip the project and upload it to the VM
zip -r KibanaLoadTesting.zip .
gcloud compute scp ~/github/KibanaLoadTesting.zip root@<vm-name>:/home/<user-name>/test --zone=europe-west3-a
# start a docker image with JDK/maven in another terminal
sudo docker run -it -v "$(pwd)"/test:/local/git --name java-maven --rm jamesdbloom/docker-java8-maven
# run tests with the same command you did locally
# download the test results
sudo tar -czvf my_results.tar.gz /home/<user-name>/test/KibanaLoadTesting/target/gatling/demojourney-<report-folder>
gcloud compute scp root@<vm-name>:/home/<user-name>/test/KibanaLoadTesting/target/gatling/my_results.tar.gz </local-machine-path-to-save-at> --zone=europe-west3-a

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 17, 2020

@dmlemeshko I'm not familiar with adding VM and would greatly appreciate your help! I'm happy to watch you go through the process on Zoom. In the mean time, I'll work through the guide.

@mshustov
Contributor Author

Why do we have such a significant difference between "On cloud staging, using an existing deployment with APM" and "On Cloud staging, creating a deployment as part of the test run"?
I think it makes sense to spin up a new deployment for both Kibana & kibana-load-testing, as @dmlemeshko suggested in #78792 (comment).
I scheduled a call to discuss the testing strategy.

@dmlemeshko
Member

dmlemeshko commented Nov 18, 2020

Here are the steps to spin up a Google Cloud VM and run tests on it:

Log in to https://console.cloud.google.com/ with your corp account
Create a CPU-optimized VM (4 CPUs, 16 GB memory is enough) with any Container Optimized OS as the boot disk, e.g. load-testing-vm
Note: use the us-central1 region, same as for the stack deployment
Zip https://github.com/elastic/kibana-load-testing and copy it to the VM

Connect to VM, create test folder

gcloud beta compute ssh --zone "us-central1-a" "load-testing-vm" --project "elastic-kibana-184716"
mkdir test
chmod 777 test 

In another terminal, upload the archive to the VM

sudo gcloud compute scp KibanaLoadTesting.tar.gz <user>@load-testing-vm:/home/<user>/test  --zone "us-central1-a" --project "elastic-kibana-184716"

In the first terminal (on the VM), unzip the project and start the docker container with a local/container path mapping, so you can later exit the container and keep the results on the VM

cd test
tar -xzf KibanaLoadTesting.tar.gz
sudo docker run -it -v "$(pwd)":/local/git --name java-maven --rm jamesdbloom/docker-java8-maven

Now you are in the container and should be able to see the test folder that contains the unzipped project. Run the tests as you would locally:

export API_KEY=<Your API Key>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

When the tests are done, type exit. Check target/gatling for your test results. Zip them and download to your local machine:

sudo tar -czvf results.tar.gz demojourney-20201118160915491/

From your local machine run:

sudo gcloud compute scp  <user>@load-testing-vm:/home/<user>/test/target/gatling/results.tar.gz . --zone=us-central1-a

Results should be available in the current path

@joshdover
Contributor

joshdover commented Nov 18, 2020

I think it'd also be worth understanding the difference between 7.11 w/ APM vs 7.10 and 7.9 w/o APM. Due to the many performance tweaks that were made to support Fleet, there may not be a large regression in 7.11 w/ APM enabled. If the difference is smaller, enabling this in 7.11 clusters may be an easier pill to swallow.

Next, I'd also like to experiment with tweaking some other settings to see if we get any performance improvements:

  • elastic.apm.asyncHooks: false
  • elastic.apm.disableInstrumentations
    • Modules to try disabling: bluebird, graphql
    • Full list of instrumented modules here
    • We could try disabling hapi, elasticsearch, and http, but I suspect those are the most useful ones. If disabling any of these improves the numbers, we may need to ask the APM agent team to help us optimize those.
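A rough kibana.yml sketch of the two settings suggested above, assuming they are passed through like the other elastic.apm.* options earlier in this thread (the instrumentation list is only illustrative):

elastic.apm.asyncHooks: false
# disableInstrumentations takes a comma-separated list of module names
elastic.apm.disableInstrumentations: 'bluebird,graphql'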

If none of these result in improved performance, we may need to work directly with the APM team to look at some flamegraphs / profiles and see where most of the time is being spent in the APM agent code.

@sorenlouv
Member

AFAIK (not authoritative), the feature loss is an empty "Stack Trace" section in the "Span details" page. E.g. this:

That sounds right. Additionally we display stack traces for errors. Not sure if they are disabled too or if that's a different setting 🤔

@trentm
Member

trentm commented Feb 10, 2021

Additionally we display stack traces for errors. Not sure if they are disabled too

That is a separate feature. The Node.js APM agent always captures a stack trace for a captured Error instance. (Pedantic aside: If there is a call to apm.captureError(<a *string*, not an Error instance>) then a stack trace may be captured depending on the (somewhat confusing) captureErrorLogStackTraces config var.)
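A small sketch of that distinction, using the elastic-apm-node API (the error messages are made up):

const apm = require('elastic-apm-node')

// Passing an Error instance: the agent always captures its stack trace.
apm.captureError(new Error('request to Elasticsearch failed'))

// Passing a plain string: whether a stack trace is captured depends on
// the captureErrorLogStackTraces config value.
apm.captureError('request to Elasticsearch failed')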

@mshustov
Contributor Author

AFAIK (not authoritative), the feature loss is an empty "Stack Trace" section in the "Span details" page. E.g. this:

So we can identify slow operations, but we can't tell why they are so slow? It might be acceptable as long as we can re-configure APM settings and run an instance with captureSpanStackTraces enabled to debug the performance.
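For reference, assuming the same elastic.apm.* passthrough used in the kibana.yml examples above, that switch would look like:

# off by default to avoid the span stack trace overhead discussed here
elastic.apm.captureSpanStackTraces: false
# set to true on a debugging instance to get the "Stack Trace" section back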

@dgieselaar
Member

dgieselaar commented Feb 11, 2021

@joshdover

Anything else others would like to see?

I'd be interested in A) a higher sample rate and B) enabling breakdown metrics (now that #90955 has been merged). I don't expect breakdown metrics to be useful for Kibana as ~all async operations are Elasticsearch, but just curious about the difference now.

@dgieselaar
Member

@restrry might be useful for you folks as well: #90403

@joshdover
Contributor

Sorry for the delay here, folks; I had written a lengthy reply but it appears it didn't get posted. So let's try this again 😄

Great news! I'm curious about what "fairly stable" means though, in terms of the distribution of the results. I suspect we'd need to document the expected scatter somewhere so that we can possibly run fewer tests.

So I'm not the most statistically knowledgeable, but the way I was able to get reproducible results was by running the DemoJourney scenario 30 times and then taking the percentiles of the entire distribution of all request timings from all tests. This resulted in about 42k total requests (across about 20 different endpoints) for each configuration. I came up with the number 30 pretty arbitrarily and I suspect we could lower it in order to speed up the time it takes to get answers back.

I also wasn't sure if those p50 values were a concern.

Taking another look at the tests I ran before, here is the same table, including the 50p numbers:

APM config | 50p (+/- baseline) | 75p (+/- baseline) | 95p (+/- baseline)
no apm - baseline | 4209ms | 11118ms | 34186ms
no apm | 4780ms | 10448ms | 32108ms
apm-default | 7294ms (+73%) | 20385ms (+83%) | 41166ms (+20%)
apm-no-metrics | 7790ms (+85%) | 19206ms (+72%) | 52288ms (+35%)
apm-no-span-stacktrace | 5295ms (+26%) | 11223ms (+1%) | 34203ms (even)
apm-disable-instrumentation | 7362ms (+75%) | 18369ms (+65%) | 39883ms (+17%)
apm-no-async | 7294ms (+73%) | 14916ms (+34%) | 38963ms (+14%)

We do see an increase in 50p of 26% even with captureSpanStackTraces: false. While it's much reduced compared to having this option on, it's probably still worth investigating the root cause here since this will have an impact on the 'typical' case.

So we can identify slow operations, but we can't tell why they are so slow? It might be acceptable as long as we can re-configure APM settings and run an instance with captureSpanStackTraces enabled to debug the performance.

Yep, we'll need to coordinate with the Cloud folks on how much access we can get in order to flip that switch on to grab some samples when needed. Ideally this would be self-service for Kibana developers (or at least for a handful of teams). If we are able to find a way to optimize this in the agent, then maybe we'd be able to do away with this overhead. @trentm is it possible to offload any of the CPU cycles here to another Node worker thread?

I'd be interested in A) a higher sample rate and B) enabling breakdown metrics (now that #90955 has been merged). I don't expect breakdown metrics to be useful for Kibana as ~all async operations are Elasticsearch, but just curious about the difference now.

Yep, I don't think that PR was included in the snapshot I ran these tests under. I'll run some more this week to see if it makes an impact. Breakdown metrics would be helpful in some endpoints, but I'm not sure how a higher sample rate would be helpful to us for our use case?

@dgieselaar
Member

Yep, I don't think that PR was included in the snapshot I ran these tests under. I'll run some more this week to see if it makes an impact. Breakdown metrics would be helpful in some endpoints, but I'm not sure how a higher sample rate would be helpful to us for our use case?

Mostly just interested in the performance impact of increasing or decreasing the sample rate, compared to the baseline.

@trentm
Member

trentm commented Feb 25, 2021

is it possible to offload any of the CPU cycles here to another Node worker thread?

No, the APM agent doesn't currently support using worker threads for any of its work. Worth considering, but not something that would be available anytime soon.

@trentm
Member

trentm commented Apr 29, 2021

Taking another look at the tests I ran before, here is the same table, including the 50p numbers:
...
APM config | 50p (+/- baseline) | 75p (+/- baseline) | 95p (+/- baseline)
no apm - baseline | 4209ms | 11118ms | 34186ms

@joshdover Can I get a quick sanity check, please? When I was running DemoJourney against a local Kibana on my laptop, I was getting values in the rough range of min=10ms to max=1300ms for the "Global Information" values in the gatling summary, e.g.:

Screen Shot 2021-04-29 at 1 14 52 PM

Doing a DemoJourney run against a newly deployed 7.12.0-SNAPSHOT I see values in the rough range of min=100ms, 50p=2500ms, max=24000ms, e.g.:

Screen Shot 2021-04-29 at 1 19 34 PM

Your values are quite a bit higher. I want to make sure we are quoting the same thing.

  1. Are you quoting the average of your ~30 runs of the "Global Information" stats values from the gatling run summaries?
  2. Were you running mvn gatling:test ... from a local computer? or from a VM in Google Cloud to try to be closer to your deployment?

@joshdover joshdover removed their assignment Aug 10, 2021
@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort labels Nov 4, 2021
@watson
Contributor

watson commented Nov 9, 2021

FYI: I'm in the process of removing bluebird #118097 (not sure if that's what's causing these issues, but just in case, I thought I'd let you know).

@lizozom
Contributor

lizozom commented Nov 21, 2021

I see that the last benchmarking results on this issue are from Nov 2020.
Do we plan to re-evaluate performance after all the changes that were made?
Do we have plans to automate this process?

@trentm
Member

trentm commented Nov 22, 2021

Do we plan to re-evaluate performance after all the changes that were made?
Do we have plans to automate this process?

@dmlemeshko Does something from https://github.com/elastic/kibana-load-testing provide any automatic data here? For example, with the recent #112973 merge, Kibana master has the Node.js APM agent on by default (in its reduced-functionality contextPropagationOnly mode). Are there any regular runs of kibana-load-testing scenarios against Kibana master that we could look at to see if there was a change in load/performance?
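For context, that reduced-functionality mode corresponds to the agent's contextPropagationOnly option; expressed in the elastic.apm.* style used earlier in this thread it would look like this (a sketch, not necessarily how Kibana sets it internally):

elastic.apm.contextPropagationOnly: true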

@mshustov
Copy link
Contributor Author

Do we plan to re-evaluate performance after all the changes that were made?

It might be useful to conduct testing against Node.js v16. AFAIK it contains changes to async_hooks that should have reduced the APM agent overhead. With these numbers at hand, we can close the issue.

Are there any regular runs of kibana-load-testing scenarios against Kibana master that we could look at to see if there was a change in load/performance?

Yes, you can find it here

@lizozom
Contributor

lizozom commented Nov 23, 2021

Should we maybe track the automation in an issue (or update this one)?
Are there multiple configs we want to benchmark, or should we just benchmark no APM vs. the default config?

I think this would be helpful to reduce any concerns about performance implications when enabling APM on a cluster.

@mshustov
Contributor Author

Should we maybe track the automation in an issue (or update this one)?

I don't know the APM agent testing infrastructure well enough, but I'd be surprised if there is no such performance testing sandbox. cc @trentm and @vigneshshanmugam, who know better.

Let's just keep this effort out of the scope of the current task. Kibana is not the best place for this kind of testing due to its high level of internal complexity.

@trentm
Member

trentm commented Nov 24, 2021

I don't know the APM agent testing infrastructure well enough, but I'd be surprised if there is no such performance testing sandbox.

The Node.js APM agent does have regular benchmark runs, with the data shown here: https://observability-benchmarks.elastic.dev/goto/ec051bde1fc50f0239710a3b5c08867a
However, those are micro-benchmarks that don't provide a useful measure of the overall agent impact on an app.

There was some (timeboxed) work done on closer-to-real-world performance analysis of the Node.js APM agent earlier in elastic/apm-agent-nodejs#2028. That work did not include a regular testing framework.

I was somewhat hoping that Kibana usage of the APM agents and https://github.com/elastic/kibana-load-testing might provide a path to getting a feel for APM impact on a large real-world app. However, I might be misunderstanding the goals of kibana-load-testing.git, so my hope is unfair.

@vigneshshanmugam
Member

vigneshshanmugam commented Nov 24, 2021

RUM agent benchmarks are also on the same cluster, you can check the RUM dashboard - https://observability-benchmarks.elastic.dev/goto/27dac144459a24fc7e49a461cd81fca9

We have both micro and macro benchmarks for the hot paths of the code. However, the macro benchmark does not cover a general application; instead it simulates a blank page and a heavy page. You can read more details about the RUM benchmarking in this document

@lizozom
Contributor

lizozom commented Nov 30, 2021

@trentm @vigneshshanmugam @mshustov
So do we think that the benchmarking APM has internally plus a fresh one-off benchmark from Kibana is enough?

@mshustov
Contributor Author

mshustov commented Nov 30, 2021

I was somewhat hoping that Kibana usage of the APM agents and https://github.com/elastic/kibana-load-testing might provide a path to getting a feel for APM impact on a large real-world app

I can see the benefits of using Kibana as a real-world scenario for performance testing. But I can see a few problems:

  1. https://github.com/elastic/kibana-load-testing doesn't run the whole APM pipeline. So the test will be limited to the APM nodejs agent in contextPropagationOnly mode.
  2. https://github.com/elastic/kibana-load-testing cannot be used to test the APM RUM agent performance since it doesn't execute any client-side code.
  3. in all the load tests conducted for the DemoJourney scenario of https://github.com/elastic/kibana-load-testing, I saw fluctuation in the results. Maybe we should introduce a small scenario dedicated to performance testing? It would reduce the impact of changes to the Kibana codebase.

Maybe APM performance testing should belong to https://github.com/elastic/apm-integration-testing?

  • a fresh one-off benchmark from Kibana is enough

IMO it's the quickest solution for now.

@dmlemeshko
Member

Hi everyone. Since we started using a bare-metal machine for scalability testing, I decided to double-check the impact of APM on the Kibana server. The results are available in elastic/kibana-load-testing/issues/221.
Happy to do some tweaks and more test runs in order to improve latency.

@lizozom
Contributor

lizozom commented Jul 20, 2022

Addressed with #129585.
Closing as a duplicate.

@lizozom lizozom closed this as completed Jul 20, 2022