
Does APM contextPropagationOnly limit server capacity? #129585

Closed
lizozom opened this issue Apr 6, 2022 · 21 comments
Labels
- bug (Fixes for quality problems that affect the customer experience)
- performance
- Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)
- wg:performance (Work tracked by the performance workgroup)

Comments

@lizozom
Contributor

lizozom commented Apr 6, 2022

While benchmarking our server capacity and running a Node.js profiler, we saw that our server-side APM Node.js integration (defined in src/cli/apm.js and toggled on by src/cli/dist.js) is active even when no APM configs are provided (it falls back to contextPropagationOnly mode).

This enabled integration reduces server capacity (how many concurrent requests we can handle) by a factor of roughly 3-4x for static routes and by 25-50% for other, dynamic routes.

According to the @elastic/apm-agent-node-js team, this is expected behavior, especially for simpler routes such as static routes, where the fixed overhead of APM is large relative to the work the route itself does.
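
To make the described default concrete, here is a minimal sketch of starting the agent in this mode (option names are from the elastic-apm-node docs; the serviceName is a placeholder, not Kibana's actual setup):

```js
// Sketch: the agent stays active even without an APM Server configured.
// In contextPropagationOnly mode it sends no data, but its instrumentation
// (including async_hooks-based context tracking) remains installed.
require('elastic-apm-node').start({
  serviceName: 'kibana', // placeholder
  contextPropagationOnly: true,
});
```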


Attached is a FlameGraph from a session where we serve static files from a Kibana server; you can search it for APM execution (APM wasn't explicitly enabled on that server). To search within the file, download it, open it in a new tab, and hit Ctrl+F.

[FlameGraph SVG: test1]

@lizozom lizozom added bug Fixes for quality problems that affect the customer experience performance labels Apr 6, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Apr 6, 2022
@lizozom lizozom changed the title from the original description text to "[Bug] APM contextPropagationOnly limits server capacity" Apr 6, 2022
@mshustov
Contributor

mshustov commented Apr 6, 2022

there was a suggestion to disable APM instrumentation for the static asset routes. See the old issue:

> If tracing, log correlation, and context propagation aren't relevant for static files, I would suggest using the transactionIgnoreUrls config option to skip certain routes, such as /translations/*.
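
For reference, that suggestion would amount to something like this (a sketch; the exact route patterns are placeholders):

```js
// Sketch: skip transaction creation for static-asset routes.
// transactionIgnoreUrls accepts wildcard patterns; these paths are examples.
require('elastic-apm-node').start({
  contextPropagationOnly: true,
  transactionIgnoreUrls: ['/translations/*', '/*/bundles/*'],
});
```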

This suggestion was revoked on today's call with the APM client team. @elastic/kibana-core, there are no other actionable items until the APM team finishes their investigation.

@lizozom
Contributor Author

lizozom commented Apr 6, 2022

I think that there's still one actionable item on the @elastic/kibana-core side: make sure that we can completely turn off APM via configs if we need to (i.e. not call init at all).
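
A minimal sketch of that idea (hypothetical, not the actual src/cli/apm.js; config key names are assumed for illustration):

```js
// If APM is explicitly and fully disabled, never require/start the agent,
// so none of its instrumentation or async_hooks overhead gets installed.
function maybeInitApm(apmConfig = {}) {
  const { active = true, contextPropagationOnly = true } = apmConfig;
  if (!active && !contextPropagationOnly) {
    return null; // fully off: init is never called
  }
  return require('elastic-apm-node').start({ active, contextPropagationOnly });
}
```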

On the @elastic/apm-agent-node-js team's side, they mentioned they will look at the performance of the simple-search route with and without APM enabled.

@trentm
Member

trentm commented Apr 6, 2022

> This suggestion was revoked on today's call with the APM client team.

I think it depends. If it turns out the Node.js APM agent overhead is mostly due to its async_hooks usage, then there isn't a way to turn that off for particular routes. However, if there is significant APM agent overhead from capturing transactions/spans/errors and sending those for static routes, then there might be something to gain from using transactionIgnoreUrls.
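
To illustrate why the async_hooks part can't be scoped to particular routes: hooks are process-wide and fire for every async resource, no matter which route created it. A tiny runnable demo in plain Node.js (unrelated to the agent's internals):

```js
const asyncHooks = require('async_hooks');

let created = 0;
const hook = asyncHooks.createHook({
  init() { created++; }, // runs on every async resource created in the process
});
hook.enable();

setTimeout(() => {
  hook.disable();
  // Even this tiny program creates several async resources; under a busy
  // HTTP server the init callback fires constantly, for every route.
  console.log(`async resources created: ${created}`);
}, 10);
```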

@lizozom
Contributor Author

lizozom commented Apr 6, 2022

Let's see what comes up and then consider our alternatives?

@pgayvallet
Contributor

FWIW, in the scope of #123737, we also have a performance check comparing no APM vs contextPropagationOnly planned: elastic/kibana-load-testing#245

@pgayvallet
Contributor

I performed some load testing for this issue:

Test env

- Kibana instances: GCP n2d-standard-8
- ES cluster: Cloud's default configuration, same version, same region
- Load tester: GCP n2d-standard-8, same network, using internal IP/connection
- Testing scenario: DemoJourney, 300 users, ran 3 times on each of the reference/test versions

reference instance

vanilla main (8.3.0 - c2a13af54beb4b9f0b2ce1530f9e329863f92e5a) branch with default configuration

test instance

vanilla main branch (8.3.0 - c2a13af54beb4b9f0b2ce1530f9e329863f92e5a) with the following configuration:

```yaml
elastic.apm.contextPropagationOnly: false
elastic.apm.active: false
```

Summary

My observations confirm @lizozom's initial benchmarking: there is a significant difference in performance under heavy load when APM is enabled in contextPropagationOnly mode compared to when it's fully disabled.

  • Min is not affected
  • Mean is ~50% higher (1400 vs 2200)
  • 50th percentile is 70% higher (1000 vs 1700)
  • all other percentiles (75th, 95th and 99th) follow the same pattern, 40% to 70% higher

Raw results

reference branch (apm in contextPropagationOnly)

[screenshot of raw results]

test branch (apm fully disabled)

[screenshot of raw results]

@felixbarny
Member

@pgayvallet is there a way for you to capture a CPU profile with and without the agent for a scenario that sees a particularly high impact, such as Discover...by id) 2?

> we have a significant difference in performance under heavy load

Have you conducted testing under lighter load, too? If so, is the impact of the agent still as significant?
I'm wondering if the load test pushed Kibana close to the point at which response times degrade sharply, and the agent is the straw that breaks the camel's back.

@felixbarny
Member

Btw, which version of the Node.js agent was used in the tests? Version 3.31.0 comes with performance improvements.

@felixbarny
Member

How is the run-to-run variance on these tests? In other words, how reproducible are the results? I noticed that the standard deviation is about as high as the mean.

@lizozom
Contributor Author

lizozom commented May 4, 2022

Great questions, thank you!

@pgayvallet is out for a few days, so we'll follow up next week.
As for the results themselves, they correspond with what I saw in my own benchmarks: it seems that as load gets higher, the agent's impact becomes a larger share of the total execution time.

Anyway, let's wait for Pierre to get back for the rest of the questions.

@pgayvallet
Contributor

> which is the version of the Node.js agent used in the tests? Version 3.31.0 comes with performance improvements.

Sorry, I should definitely have mentioned the version used.

Tests were run with the version of the agent currently used by Kibana's main branch: 3.32.0

> Have you conducted testing under lighter load, too? If so, is the impact of the agent still as significant? I'm wondering if the load test pushed Kibana close to the point at which response times degrade sharply, and the agent is the straw that breaks the camel's back.

I did not, but I sure can.

I agree with your assumption btw, and my guess is that the impact should be far less significant under light load, given that the Min metric in the previous results is much less affected than the high percentiles.

> How is the run-to-run variance on these tests? I noticed that the standard deviation is about as high as the mean.

Pretty low. I ran the suites 3 times for each scenario, with variance under 10% in both cases.

FWIW, the standard deviation being so high seems 'normal' to me given the large gap between the low and high percentile metrics.

> is there a way for you to capture a CPU profile with and without the agent for a scenario that sees a particularly high impact, such as Discover...by id) 2?

I've actually never performed CPU profiling on a Kibana instance, but I see @lizozom generated a flamegraph in the issue's initial description, so she can probably provide me with some insight here.

FWIW, we can't use kibana-load-testing to target a specific request. We could either run the whole Discover suite or use kibana-capacity-test instead.
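
For anyone picking this up, one way to grab such a profile is with Node's built-in inspector module (a sketch, not an established Kibana workflow; the output filename and duration are placeholders):

```js
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

// Profile the process for 30 seconds while the load scenario runs, then
// write a .cpuprofile file that Chrome DevTools or Speedscope can open.
session.post('Profiler.enable', () => {
  session.post('Profiler.start', () => {
    setTimeout(() => {
      session.post('Profiler.stop', (err, { profile }) => {
        if (!err) fs.writeFileSync('kibana.cpuprofile', JSON.stringify(profile));
        session.disconnect();
      });
    }, 30_000);
  });
});
```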

@lizozom
Contributor Author

lizozom commented May 9, 2022

@pgayvallet and I synced up today.

Wanted to add a few more highlights:

  • We need to define the current capacity on the default server configuration. This would consist of:
    • Concurrent API requests: we can capture this value per API by gradually reducing the load in the tests @pgayvallet ran until we reach stable performance (see the sketch after this list for one way to automate such a probe).
    • Static asset requests: @dmlemeshko mentioned he had a capacity test that used to concurrently load all static bundles, but it was turned off. @danielmitterdorfer and I talked and agreed that it would be useful to re-introduce that test.
    • Tasks: TBD.
  • Once we have the current capacity, we need to share it and discuss what an acceptable capacity is: how many concurrent users we need and want to support.
  • Once we have those numbers, we should be able to focus our efforts on understanding where the most important bottlenecks are, be it our architecture (e.g. running tasks, static routes, and APIs on the same machine; Node.js limitations), our implementation (e.g. registering too many hooks, loading too much code by default), and/or APM.
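
As referenced in the first item above, a capacity probe could be automated along these lines (hypothetical sketch, not kibana-capacity-test; the endpoint, step sizes, and latency threshold are placeholders, and it assumes the autocannon load-testing package):

```js
const autocannon = require('autocannon'); // assumed dependency

// Step up concurrent connections until p99 latency crosses a threshold,
// and report the last level the server handled acceptably.
async function findCapacity(url) {
  for (let connections = 50; connections <= 500; connections += 50) {
    const result = await autocannon({ url, connections, duration: 30 });
    console.log(`${connections} connections -> p99 ${result.latency.p99} ms`);
    if (result.latency.p99 > 2000) return connections - 50;
  }
  return 500; // capacity not exhausted within the tested range
}

findCapacity('http://localhost:5601/api/status').then((c) =>
  console.log(`approximate capacity: ${c} concurrent connections`)
);
```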

@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@botelastic botelastic bot removed the needs-team Issues missing a team label label May 23, 2022
@dgieselaar
Member

@stratoula I don't think this is an issue for team:apm. The implementation of contextPropagationOnly is handled by Core. I'll update the labels (feel free to change them again if my assumptions are wrong).

@dgieselaar dgieselaar added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc and removed Team:APM All issues that need APM UI Team support labels May 23, 2022
@lizozom
Contributor Author

lizozom commented Jun 2, 2022

Chatted with @pgayvallet
Issue moved to backlog until further updates.

@pgayvallet pgayvallet removed their assignment Jun 9, 2022
@pgayvallet
Contributor

Removing self-assignment as I'm not actively working on it atm

@lizozom lizozom changed the title [Bug] APM contextPropagationOnly limits server capacity Does APM contextPropagationOnly limit server capacity? Jun 14, 2022
@afharo afharo added the wg:performance Work tracked by the performance workgroup label Jul 18, 2022
@trentm
Member

trentm commented Jul 19, 2022

There is now an elastic-apm-node@3.37.0 release that includes elastic/apm-agent-nodejs#2786, which should significantly reduce the APM agent's overhead for Kibana.

@lizozom or others: What do you think are best next steps for this issue? How about:

  1. Open a PR to update the APM agent in Kibana.
  2. (Optional) Someone on the Kibana performance team runs some sanity DemoJourney tests with the new elastic-apm-node@3.37.0 for comparison. Is that easy to do with some perf CI setup?
  3. Resolve this issue as "yes, there is overhead from the APM agent, which will limit server capacity to a degree. We understand the overhead to be NN% for this scenario" (where I can describe that scenario).

@TinaHeiligers
Contributor

@trentm your proposed next steps seem like the logical way to progress here.

@pgayvallet
Contributor

> Open a PR to update the APM agent in Kibana.

Renovate already took care of that: #136657

@trentm
Member

trentm commented Jul 19, 2022

> Renovate already took care of that

Ha, thanks. I hadn't yet managed to bootstrap a Kibana clone. :)

@lizozom
Contributor Author

lizozom commented Jul 20, 2022

@trentm @danielmitterdorfer

I'm so glad to see that this issue ended up yielding some performance improvements. Thank you so much for investing the time in this. I'm OOO for the next couple of weeks, but maybe @pgayvallet could run the benchmarks to verify?

More broadly speaking, we want to start running some capacity benchmarks in the foreseeable future, so we will obviously track this as well.
