
Measure APM agent impact on the platform performance #78792

Closed
mshustov opened this issue Sep 29, 2020 · 60 comments
Labels
enhancement New value added to drive a business result impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@mshustov
Contributor

We are working on enabling the APM agent in the prod build (#70497). Before making this happen, we want to understand what performance overhead it adds to the Kibana server. We might be able to re-use the setup introduced in #73189 to measure the average response time & the number of requests Kibana can handle with and without the APM agent enabled.

@mshustov mshustov added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc enhancement New value added to drive a business result labels Sep 29, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

Setup

API performance testing is based on the setup from https://github.com/dmlemeshko/kibana-load-testing. I adjusted the number of requests so as not to overwhelm the APM server.

  setUp(
    scn.inject(
      constantConcurrentUsers(15) during (2 minute),
      rampConcurrentUsers(15) to (20) during (2 minute)
    ).protocols(httpProtocol)
  ).maxDuration(15 minutes)

Tests are run against 7.10.0-SNAPSHOT.

Results

The APM agent seems to add significant overhead (see the 95th percentile).

Without APM agent:

2020-10-08_11-11-25
Download the results in HTML: 7.10.0-without-apm.zip

With APM agent:

2020-10-08_11-12-07
Download the results in HTML: 7.10.0-with-apm.zip

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

The tested Kibana image doesn't contain the changes introduced in #78697,
so I added breakdownMetrics: false to the dist APM config manually. It slightly improves the situation:
2020-10-08_11-33-55

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html provides some details on how to squeeze out a bit more performance.
The simplest option is to reduce the sample rate: https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html#performance-sampling
It seems we can use 0.2-0.3 as a default value and adjust it via the config file.
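For reference, the sample rate corresponds to the agent's transactionSampleRate setting; a minimal kibana.yml sketch, with an illustrative value in the range suggested above:

elastic.apm.transactionSampleRate: 0.3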
Numbers:

  • sample ratio 0.1
    2020-10-08_12-38-47
  • sample ratio 0.2
    2020-10-08_12-09-58
  • sample ratio 0.3
    2020-10-08_12-16-32
  • sample ratio 0.5
    2020-10-08_12-37-31

Other config values don't seem to affect CPU as much as the sample rate does, so I decided not to use them. @vigneshshanmugam do you have anything to add?

@vigneshshanmugam
Member

As you have already figured out, transactionSampleRate is the go-to setting we recommend for tuning both the Node.js and RUM agents for performance, since transactions are dropped based on it.

Perf tuning RUM agent - https://www.elastic.co/guide/en/apm/agent/rum-js/current/performance-tuning.html

  • breakdownMetrics - Disabling it certainly helps a lot in the RUM agent for custom transactions vs page-load transactions. I don't know how the above test schedules the load and which browsers it runs, so I can't say for sure whether it's going to have a huge impact, but my recommendation would be to keep it at false if it helps.

  • centralConfig - Disable this one, as it introduces one additional request to the APM server. Defaults to true in Node.js and false in RUM.

  • metricsInterval - Can you try increasing this interval or disabling metrics reporting and check whether it helps? This controls metrics capturing in the Node.js agent. https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html#metrics-interval

I can't seem to find any other config that would help.

@mshustov
Contributor Author

mshustov commented Oct 8, 2020

I don't know how the above test schedules the load and which browsers it runs, so I can't say for sure whether it's going to have a huge impact, but my recommendation would be to keep it at false if it helps.

It tests the server-side API performance only.

centralConfig - Disable this one, as it introduces one additional request to the APM server. Defaults to true in Node.js and false in RUM.

Already disabled in my tests.

metricsInterval - Can you try increasing this interval or disabling metrics reporting and check whether it helps? This controls metrics capturing in the Node.js agent.

A test with metricsInterval: '120s' and transactionSampleRate: 0.3 slightly improves the situation (compared to transactionSampleRate: 0.3 alone):
2020-10-08_14-46-02

@pgayvallet
Contributor

So in summary, even with the 'best' compromise configuration, the 95th percentile is doubled and the 50th percentile tripled, right? This is... significant.

@mshustov
Contributor Author

mshustov commented Oct 21, 2020

So in summary, even with the 'best' compromise configuration, the 95th percentile is doubled and the 50th percentile tripled, right? This is... significant

The best configuration is transactionSampleRate: 0.1, breakdownMetrics: false, centralConfig: false, metricsInterval: '120s'. The 50th percentile is doubled from 118ms to 225ms; the 95th percentile is almost doubled from 574ms to 950ms.
It mostly affects query functionality when requesting the /api/saved_objects/* & /api/metrics/vis/data endpoints. The "query timeseries data" test case is almost tripled.

with APM enabled:

@mshustov
Contributor Author

mshustov commented Nov 6, 2020

@TinaHeiligers you asked how to perform testing:

how to run Kibana with APM agent locally:

  • clone https://github.com/elastic/apm-integration-testing
  • cd apm-integration-testing
  • Run ES & APM servers with ./scripts/compose.py start master --no-kibana
  • cd ../kibana
  • you might need to change elasticsearch credentials (I used admin/changeme)
  • make sure APM agent is active and points to the local APM server - set in kibana.yml:
elastic.apm.active: true
elastic.apm.serverUrl: 'http://127.0.0.1:8200'
# elastic.apm.secretToken: ... <-- might be required in prod/cloud
# optional settings to adjust performance
# see https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html
elastic.apm.centralConfig: false
elastic.apm.breakdownMetrics: false
elastic.apm.transactionSampleRate: 0.1
elastic.apm.metricsInterval: '120s'
  • run Kibana: ELASTIC_APM_ACTIVE=true yarn start
  • you can see transactions in the APM app
  • stop Kibana
  • stop ES & APM servers: cd apm-integration-testing; ./scripts/compose.py stop

how to run load testing against Kibana:
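(The thread doesn't fill this step in here; as a sketch, the commands used elsewhere in this thread to run the DemoJourney simulation from a https://github.com/elastic/kibana-load-testing checkout are:)

mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney
# pointing the simulation at a specific Kibana instance goes through the repo's config files,
# e.g. export env=config/<your-env>.conf as in the Cloud runs below (the exact file name here is an assumption)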

how to test Kibana on Cloud

  • spin up Kibana v7.10 server on Cloud
  • adjust https://github.com/elastic/kibana-load-testing config to point to Cloud instance
  • run kibana-load-testing against v7.10 on Cloud to get numbers without APM agent enabled.
  • spin up APM server on Cloud
  • adjust kibana.yml file to enable APM agent (ask Cloud team for assistance - elastic.apm.* settings aren't listed in allow list) and point to APM server in Cloud
  • make sure APM agent works and Kibana communicates with APM server (see APM app in Kibana)
  • perform load testing and compare numbers with the previous results
  • feel free to adjust APM settings to compare how config values affect the results

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 10, 2020

@restrry I've followed your instructions above and with a little tweaking, was able to run the load tests against a local Kibana instance with and without APM running (through Docker).

My setup thus far:

  • Kibana local on master
  • ES and APM server through Docker
  • Load testing against the default DemoJourney with version=8.0.0

I left the DemoJourney simulation as is regarding requests:

  setUp(
    scn
      .inject(
        constantConcurrentUsers(20) during (3 minute), // 1
        rampConcurrentUsers(20) to (50) during (3 minute) // 2
      )
      .protocols(httpProtocol)
  ).maxDuration(15 minutes)

In the screen shots below, I've highlighted the same queries in both cases, for ease of comparison.
Without APM:
local_without_APM

Full Results:
local_Kibana_without_APM.zip

With APM, using the Kibana apm settings suggested in the instructions:
local_with_APM

Full Results:
local_Kibana_with_APM.zip

Summary:
We are indeed seeing an impact of APM on Kibana performance, with an increase in the 95th percentile response times.
I'll redo everything from v7.10-SNAPSHOT, after which I'll move on to Cloud unless I hear otherwise 😉 .

@mshustov
Contributor Author

mshustov commented Nov 11, 2020

Looks good overall. The only outlier is the query dashboard list case, which is faster in the 95th percentile with the APM agent enabled.

I'll redo everything from v7.10-SNAPSHOT, after which I'll move on to Cloud unless I hear otherwise 😉 .

🚀

@TinaHeiligers
Contributor

Progress was slow today; I really struggled to get Kibana 7.10 running and resorted to running Kibana off the distributable.

Load tests without APM:
Elasticsearch: snapshot v7.10
Kibana: 7.10 (distributable)

kibana-7_10-localDistributable-no-APM
Note: Nothing really useful from this setup as roughly half of the queries threw errors.

Full results:
demojourney-20201111231041086.zip

Load tests with APM:
Elasticsearch and APM run from Docker (v7.10)
Kibana: 7.10 (distributable) with apm configured

Kibana-7_10-localDistributable-with-APM

Full results:
demojourney-20201111234910265.zip

Summary:
There's a huge discrepancy in the results from the queries that were successful. I don't trust these results and am moving on to Cloud testing instead. Hopefully that will be more reliable 😉

@mshustov
Contributor Author

mshustov commented Nov 12, 2020

Note: Nothing really useful from this setup as roughly half of the queries threw errors.

@dmlemeshko I experienced a similar problem when only the login scenario succeeded. What could be a reason for this?

@TinaHeiligers What Cloud settings did you use? There are recommended ones in https://github.com/elastic/kibana-load-testing

elasticsearch {
    deployment_template = "gcp-io-optimized"
    memory = 8192
}
kibana {
    memory = 1024
}

@dmlemeshko
Member

dmlemeshko commented Nov 12, 2020

I fixed a login issue for 7.10 when running load testing with a new deployment; the canvas endpoints also needed to be updated.
Here is my test run:

export API_KEY=<Key generated on Staging Cloud>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Gatling Stats - Global Information 2020-11-12 12-26-54

demojourney-20201112111618663.zip

The 7.10.0.conf deploy config has the same memory values @restrry posted above.

Another run with rampConcurrentUsers changed to 20..150
Gatling Stats - Global Information 2020-11-12 12-52-25

demojourney-20201112113251442.zip

@TinaHeiligers
Contributor

@restrry

What Cloud settings did you use?

I haven't tested on Cloud yet, I'll do that today with the recommended settings.

@TinaHeiligers
Contributor

@dmlemeshko Thanks for fixing that issue! I reran the load test on a local Kibana 7.10 distributable and am not getting the errors seen previously.

Test setup for both runs:

  setUp(
    scn
      .inject(
        constantConcurrentUsers(20) during (3 minute), // 1
        rampConcurrentUsers(20) to (50) during (3 minute) // 2
      )
      .protocols(httpProtocol)
  ).maxDuration(15 minutes)

Load tests without APM:
Elasticsearch: snapshot v7.10
Kibana: 7.10 (distributable)
Kibana-7_10-distributable-no-APM-test-2

Full result
demojourney-20201112153808121.zip

Load tests with APM:
Elasticsearch and APM run from Docker (v7.10)
Kibana: 7.10 (distributable) with apm configured
Kibana-7_10-distributable-with-APM-test2

Full result
demojourney-20201112161648302.zip

Summary:
With the exception of the request to discover and discover query 2, all the response times increase when APM is enabled.
Of the response times already starting at over 500ms, the increase ranged between 12% and 40%, taking the login response time to over 1000ms, with "query gauge data" approaching the 1000ms mark.

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 12, 2020

On Cloud staging, using an existing deployment without APM:

Load test results without APM
Cloud_Staging_no_APM

Full Result
demojourney-20201114173519337.zip

I'm reaching out to the Cloud folks to add the apm* config to the Cloud deployment and will post the results when I have them.

On cloud staging, using an existing deployment with APM:

Test run:

mvn install
export env=config/cloud-tina-7.10.0.conf # contains the details of the cloud staging env
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Load test results with APM
cloud_staging_with_APM

Full results
demojourney-20201117164239422.zip

On Cloud staging, creating a deployment as part of the test run:

export API_KEY=<Key generated on Staging Cloud>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Script-created deployment
deploy-config:

version = 7.10.0

elasticsearch {
    deployment_template = "gcp-io-optimized"
    memory = 8192
}

kibana {
    memory = 1024
}

Load test results
Cloud_Kibana7-10-auto-deploymentcreation

Full Result
demojourney-20201112213550120.zip

On Cloud staging, creating a deployment as part of the test run: Not done

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 17, 2020

@restrry I've added the results from the Kibana load testing on the cloud (staging) test run where APM is enabled in Kibana.
The results are similar to what we've seen on local instances of Kibana with and without APM: an overall increase in the 95th percentile response times of ~16%.
For both test runs, the number of concurrent users was set to 20 for 3 minutes and then ramped up from 20 to 50 over a 3-minute interval.

Please let me know if I should repeat the tests with fewer/more concurrent users and/or change any of the APM settings.
I will document the steps to take to add configurations not exposed by default on Cloud. Please let me know where the best place is to add these (I don't think making it public in this issue is appropriate 😉 )

cc @joshdover

@dmlemeshko
Member

dmlemeshko commented Nov 17, 2020

@TinaHeiligers @restrry
If you want to have more "clean" test results, I suggest spinning up a VM in the same region where you create the stack deployment.

I can help with it, but if you are familiar with how to add a VM, the follow-up steps are:

# e.g. I run tests and create the VM in Frankfurt (europe-west3-a)
# zip the project and upload it to the VM
zip -r KibanaLoadTesting.zip .
gcloud compute scp ~/github/KibanaLoadTesting.zip root@<vm-name>:/home/<user-name>/test --zone=europe-west3-a
# start a docker image with JDK/maven in another terminal
sudo docker run -it -v "$(pwd)"/test:/local/git --name java-maven --rm jamesdbloom/docker-java8-maven
# run tests with the same command you did locally
# download the test results
sudo tar -czvf my_results.tar.gz /home/<user-name>/test/KibanaLoadTesting/target/gatling/demojourney-<report-folder>
gcloud compute scp root@<vm-name>:/home/<user-name>/test/KibanaLoadTesting/target/gatling/my_results.tar.gz </local-machine-path-to-save-at> --zone=europe-west3-a

@TinaHeiligers
Contributor

TinaHeiligers commented Nov 17, 2020

@dmlemeshko I'm not familiar with adding VM and would greatly appreciate your help! I'm happy to watch you go through the process on Zoom. In the mean time, I'll work through the guide.

@mshustov
Contributor Author

Why do we have such a significant difference between "On cloud staging, using an existing deployment with APM" and "On Cloud staging, creating a deployment as part of the test run"?
I think it makes sense to spin up a new deployment for both Kibana & kibana-load-testing, as @dmlemeshko suggested in #78792 (comment).
I scheduled a call to discuss the testing strategy.

@dmlemeshko
Member

dmlemeshko commented Nov 18, 2020

Here are the steps to spin up a Google Cloud VM and run tests on it:

Log in to https://console.cloud.google.com/ with your corp account
Create a CPU-optimized VM (4 CPUs, 16 GB memory is enough) with any Container Optimized OS as the boot disk, e.g. load-testing-vm
Note: use the us-central1 region, same as for the stack deployment
Zip https://github.com/elastic/kibana-load-testing and copy it to the VM

Connect to VM, create test folder

gcloud beta compute ssh --zone "us-central1-a" "load-testing-vm" --project "elastic-kibana-184716"
mkdir test
chmod 777 test 

In another terminal, upload the archive to the VM

sudo gcloud compute scp KibanaLoadTesting.tar.gz <user>@load-testing-vm:/home/<user>/test  --zone "us-central1-a" --project "elastic-kibana-184716"

In the first terminal (on the VM), unzip the project and start the docker container with a local/container path mapping, so you can later exit the container and keep the results on the VM

cd test
tar -xzf KibanaLoadTesting.tar.gz
sudo docker run -it -v "$(pwd)":/local/git --name java-maven --rm jamesdbloom/docker-java8-maven

Now you are in the container and should be able to see the test folder that contains the unzipped project. Run the tests as you would locally:

export API_KEY=<Your API Key>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

When the tests are done, type exit. Check target/gatling for your test results. Zip them and download to your local machine:

sudo tar -czvf results.tar.gz demojourney-20201118160915491/

From your local machine run:

sudo gcloud compute scp  <user>@load-testing-vm:/home/<user>/test/target/gatling/results.tar.gz . --zone=us-central1-a

Results should be available in the current path

@joshdover
Contributor

joshdover commented Nov 18, 2020

I think it'd also be worth understanding the difference between 7.11 w/ APM vs 7.10 and 7.9 w/o APM. Due to the many performance tweaks that were made to support Fleet, there may not be a large regression in 7.11 w/ APM enabled. If the difference is smaller, enabling this in 7.11 clusters may be an easier pill to swallow.

Next, I'd also like to experiment with tweaking some other settings to see if we get any performance improvements:

  • elastic.apm.asyncHooks: false
  • elastic.apm.disableInstrumentations
    • Modules to try disabling: bluebird, graphql
    • Full list of instrumented modules here
    • We could try disabling hapi, elasticsearch, and http, but I suspect those are the most useful ones. If disabling any of these improves the numbers, we may need to ask the APM agent team to help us optimize those.
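A rough kibana.yml sketch of the two settings suggested above, assuming they are passed through like the other elastic.apm.* options earlier in this thread (the instrumentation list is only illustrative):

elastic.apm.asyncHooks: false
# disableInstrumentations takes a comma-separated list of module names
elastic.apm.disableInstrumentations: 'bluebird,graphql'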

If none of these result in improved performance, we may need to work directly with the APM team to look at some flamegraphs / profiles and see where most of the time is being spent in the APM agent code.

@sorenlouv
Member

AFAIK (not authoritative), the feature loss is an empty "Stack Trace" section in the "Span details" page. E.g. this:

That sounds right. Additionally we display stack traces for errors. Not sure if they are disabled too or if that's a different setting 🤔

@trentm
Member

trentm commented Feb 10, 2021

Additionally we display stack traces for errors. Not sure if they are disabled too

That is a separate feature. The Node.js APM agent always captures a stack trace for a captured Error instance. (Pedantic aside: If there is a call to apm.captureError(<a *string*, not an Error instance>) then a stack trace may be captured depending on the (somewhat confusing) captureErrorLogStackTraces config var.)
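A small sketch of that distinction, using the elastic-apm-node API (the error messages are made up):

const apm = require('elastic-apm-node')

// Passing an Error instance: the agent always captures its stack trace.
apm.captureError(new Error('request to Elasticsearch failed'))

// Passing a plain string: whether a stack trace is captured depends on
// the captureErrorLogStackTraces config value.
apm.captureError('request to Elasticsearch failed')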

@mshustov
Contributor Author

AFAIK (not authoritative), the feature loss is an empty "Stack Trace" section in the "Span details" page. E.g. this:

So we can identify slow operations, but we can't tell why they are so slow? It might be acceptable as long as we can re-configure APM settings and run an instance with captureSpanStackTraces enabled to debug the performance.
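For reference, assuming the same elastic.apm.* passthrough used in the kibana.yml examples above, that switch would look like:

# off by default to avoid the span stack trace overhead discussed here
elastic.apm.captureSpanStackTraces: false
# set to true on a debugging instance to get the "Stack Trace" section back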

@dgieselaar
Member

dgieselaar commented Feb 11, 2021

@joshdover

Anything else others would like to see?

I'd be interested in A) a higher sample rate and B) enabling breakdown metrics (now that #90955 has been merged). I don't expect breakdown metrics to be useful for Kibana as ~all async operations are Elasticsearch, but just curious about the difference now.

@dgieselaar
Member

@restrry might be useful for you folks as well: #90403

@joshdover
Contributor

Sorry for the delay here, folks; I had written a lengthy reply but it appears it didn't get posted. So let's try this again 😄

Great news! I'm curious about what "fairly stable" means though, in terms of the distribution of the results. I suspect we'd need to document the expected scatter somewhere so that we can possibly run fewer tests.

So I'm not the most statistically knowledgeable, but the way I was able to get reproducible results was by running the DemoJourney scenario 30 times and then taking the percentiles of the entire distribution of all request timings from all tests. This resulted in about 42k total requests (across about 20 different endpoints) for each configuration. I came up with the number 30 pretty arbitrarily and I suspect we could lower it in order to speed up the time it takes to get answers back.

I also wasn't sure if those p50 values were a concern.

Taking another look at the tests I ran before, here is the same table, including the 50p numbers:

APM config | 50p (+/- baseline) | 75p (+/- baseline) | 95p (+/- baseline)
no apm - baseline | 4209ms | 11118ms | 34186ms
no apm | 4780ms | 10448ms | 32108ms
apm-default | 7294ms (+73%) | 20385ms (+83%) | 41166ms (+20%)
apm-no-metrics | 7790ms (+85%) | 19206ms (+72%) | 52288ms (+35%)
apm-no-span-stacktrace | 5295ms (+26%) | 11223ms (+1%) | 34203ms (even)
apm-disable-instrumentation | 7362ms (+75%) | 18369ms (+65%) | 39883ms (+17%)
apm-no-async | 7294ms (+73%) | 14916ms (+34%) | 38963ms (+14%)

We do see an increase in 50p of 26% even with captureSpanStackTraces: false. While it's much reduced compared to having this option on, it's probably still worth investigating the root cause here since this will have an impact on the 'typical' case.

So we can identify slow operations, but we can't tell why they are so slow? It might be acceptable as long as we can re-configure APM settings and run an instance with captureSpanStackTraces enabled to debug the performance.

Yep, we'll need to coordinate with the Cloud folks on how much access we can get in order to flip that switch on to grab some samples when needed. Ideally this would be self-service for Kibana developers (or at least for a handful of teams). If we are able to find a way to optimize this in the agent, then maybe we'd be able to do away with this overhead. @trentm is it possible to offload any of the CPU cycles here to another Node worker thread?

I'd be interested in A) a higher sample rate and B) enabling breakdown metrics (now that #90955 has been merged). I don't expect breakdown metrics to be useful for Kibana as ~all async operations are Elasticsearch, but just curious about the difference now.

Yep, I don't think that PR was included in the snapshot I ran these tests under. I'll run some more this week to see if it makes an impact. Breakdown metrics would be helpful in some endpoints, but I'm not sure how a higher sample rate would be helpful to us for our use case?

@dgieselaar
Member

Yep, I don't think that PR was included in the snapshot I ran these tests under. I'll run some more this week to see if it makes an impact. Breakdown metrics would be helpful in some endpoints, but I'm not sure how a higher sample rate would be helpful to us for our use case?

Mostly just interested in the performance impact of increasing or decreasing the sample rate, compared to the baseline.

@trentm
Member

trentm commented Feb 25, 2021

is it possible to offload any of the CPU cycles here to another Node worker thread?

No, the APM agent doesn't currently support using worker threads for any of its work. Worth considering, but not something that would be available anytime soon.

@trentm
Member

trentm commented Apr 29, 2021

Taking another look at the tests I ran before, here is the same table, including the 50p numbers:
...
APM config | 50p (+/- baseline) | 75p (+/- baseline) | 95p (+/- baseline)
no apm - baseline | 4209ms | 11118ms | 34186ms

@joshdover Can I get a quick sanity check, please? When I was running DemoJourney against a local Kibana on my laptop, I was getting values in the rough range of min=10ms to max=1300ms for the "Global Information" values in the gatling summary, e.g.:

Screen Shot 2021-04-29 at 1 14 52 PM

Doing a DemoJourney run against a newly deployed 7.12.0-SNAPSHOT I see values in the rough range of min=100ms, 50p=2500ms, max=24000ms, e.g.:

Screen Shot 2021-04-29 at 1 19 34 PM

Your values are quite a bit higher. I want to make sure we are quoting the same thing.

  1. Are you quoting the average of your ~30 runs of the "Global Information" stats values from the gatling run summaries?
  2. Were you running mvn gatling:test ... from a local computer? or from a VM in Google Cloud to try to be closer to your deployment?

@joshdover joshdover removed their assignment Aug 10, 2021
@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort labels Nov 4, 2021
@watson
Contributor

watson commented Nov 9, 2021

FYI: I'm in the process of removing bluebird #118097 (not sure if that's what's causing these issues, but just in case, I thought I'd let you know).

@lizozom
Contributor

lizozom commented Nov 21, 2021

I see that the last benchmarking results on this issue are from Nov 2020.
Do we plan to re-evaluate performance after all the changes that were made?
Do we have plans to automate this process?

@trentm
Member

trentm commented Nov 22, 2021

Do we plan to re-evaluate performance after all the changes that were made?
Do we have plans to automate this process?

@dmlemeshko Does something from https://github.com/elastic/kibana-load-testing provide any automatic data here? For example, with the recent #112973 merge, Kibana master has the Node.js APM agent on by default (in its reduced-functionality contextPropagationOnly mode). Are there any regular runs of kibana-load-testing scenarios against Kibana master that we could look at to see if there was a change in load/performance?
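For context, that reduced-functionality mode corresponds to the agent's contextPropagationOnly option; expressed in the elastic.apm.* style used earlier in this thread it would look like this (a sketch, not necessarily how Kibana sets it internally):

elastic.apm.contextPropagationOnly: true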

@mshustov
Copy link
Contributor Author

Do we plan to re-evaluate performance after all the changes that were made?

It might be useful to conduct testing against Node.js v16. AFAIK it contains changes to async_hooks that should have reduced the APM agent overhead. With these numbers at hand, we can close the issue.

Are there any regular runs of kibana-load-testing scenarios against Kibana master that we could look at to see if there was a change in load/performance?

Yes, you can find it here

@lizozom
Contributor

lizozom commented Nov 23, 2021

Should we maybe track the automation in an issue (or update this one)?
Are there multiple configs we want to benchmark, or should we just benchmark no APM vs. the default config?

I think this would be helpful to reduce any concerns about performance implications when enabling APM on a cluster.

@mshustov
Contributor Author

Should we maybe track the automation in an issue (or update this one)?

I don't know the APM agent testing infrastructure well enough, but I'd be surprised if there is no such performance testing sandbox. cc @trentm and @vigneshshanmugam, who know better.

Let's just keep this effort out of the scope of the current task. Kibana is not the best place for this kind of testing due to its high level of internal complexity.

@trentm
Member

trentm commented Nov 24, 2021

I don't know the APM agent testing infrastructure well enough, but I'd be surprised if there is no such performance testing sandbox.

The Node.js APM agent does have regular benchmark runs, with the data shown here: https://observability-benchmarks.elastic.dev/goto/ec051bde1fc50f0239710a3b5c08867a
However, those are micro-benchmarks that don't provide a useful measure of the overall agent impact on an app.

There was some (timeboxed) work done on closer-to-real-world performance analysis of the Node.js APM agent earlier in elastic/apm-agent-nodejs#2028. That work did not include a regular testing framework.

I was somewhat hoping that Kibana usage of the APM agents and https://github.com/elastic/kibana-load-testing might provide a path to getting a feel for APM impact on a large real-world app. However, I might be misunderstanding the goals of kibana-load-testing.git, so my hope is unfair.

@vigneshshanmugam
Member

vigneshshanmugam commented Nov 24, 2021

RUM agent benchmarks are also on the same cluster, you can check the RUM dashboard - https://observability-benchmarks.elastic.dev/goto/27dac144459a24fc7e49a461cd81fca9

We have both micro and macro benchmarks for the hot paths of the code. However, the macro benchmark does not cover a general application; instead it simulates a blank page and a heavy page. You can read more details about the RUM benchmarking in this document

@lizozom
Contributor

lizozom commented Nov 30, 2021

@trentm @vigneshshanmugam @mshustov
So do we think that the benchmarking APM has internally plus a fresh one-off benchmark from Kibana is enough?

@mshustov
Contributor Author

mshustov commented Nov 30, 2021

I was somewhat hoping that Kibana usage of the APM agents and https://github.com/elastic/kibana-load-testing might provide a path to getting a feel for APM impact on a large real-world app

I can see the benefits of using Kibana as a real-world scenario for performance testing. But I can see a few problems:

  1. https://github.com/elastic/kibana-load-testing doesn't run the whole APM pipeline. So the test will be limited to the APM nodejs agent in contextPropagationOnly mode.
  2. https://github.com/elastic/kibana-load-testing cannot be used to test the APM RUM agent performance since it doesn't execute any client-side code.
  3. in all the load tests conducted for the DemoJourney scenario of https://github.com/elastic/kibana-load-testing, I saw fluctuation in the results. Maybe we should introduce a small scenario dedicated to performance testing? It would reduce the impact of changes to the Kibana codebase.

Maybe APM performance testing should belong to https://github.com/elastic/apm-integration-testing?

  • a fresh one-off benchmark from Kibana is enough

IMO it's the quickest solution for now.

@dmlemeshko
Member

Hi everyone. Since we started using a bare-metal machine for scalability testing, I decided to double-check the impact of APM on the Kibana server. The results are available in elastic/kibana-load-testing/issues/221.
Happy to do some tweaks and more test runs in order to improve latency.

@lizozom
Contributor

lizozom commented Jul 20, 2022

Addressed with #129585.
Closing as a duplicate.

@lizozom lizozom closed this as completed Jul 20, 2022