
Does APM contextPropagationOnly limit server capacity? #129585

Closed
lizozom opened this issue Apr 6, 2022 · 21 comments
Labels
- bug (Fixes for quality problems that affect the customer experience)
- performance
- Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)
- wg:performance (Work tracked by the performance workgroup)

Comments

@lizozom
Contributor

lizozom commented Apr 6, 2022

While benchmarking our server capacity and running a Node.js profiler, we saw that our server-side APM Node.js integration (defined in src/cli/apm.js and toggled on by src/cli/dist.js) is active even when no APM configs are provided (it falls back to contextPropagationOnly mode).

This enabled integration reduces server capacity (how many concurrent requests we can handle) by a factor of roughly 3-4x for static routes and by 25-50% for other, dynamic routes.

According to the @elastic/apm-agent-node-js team, this is expected behavior, especially for simpler routes such as static routes, where the fixed overhead of APM is large relative to the work the route itself does.
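
To make the described default concrete, here is a minimal sketch of starting the agent in this mode (option names are from the elastic-apm-node docs; the serviceName is a placeholder, not Kibana's actual setup):

```js
// Sketch: the agent stays active even without an APM Server configured.
// In contextPropagationOnly mode it sends no data, but its instrumentation
// (including async_hooks-based context tracking) remains installed.
require('elastic-apm-node').start({
  serviceName: 'kibana', // placeholder
  contextPropagationOnly: true,
});
```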


Attached is a FlameGraph from a session where we serve static files from a Kibana server; you can search it for APM execution (APM wasn't explicitly enabled on that server). To search within the file, download it, open it in a new tab, and hit Ctrl+F.

[FlameGraph SVG: test1]

@lizozom lizozom added bug Fixes for quality problems that affect the customer experience performance labels Apr 6, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Apr 6, 2022
@lizozom lizozom changed the title from the original description text to "[Bug] APM contextPropagationOnly limits server capacity" Apr 6, 2022
@mshustov
Contributor

mshustov commented Apr 6, 2022

there was a suggestion to disable APM instrumentation for the static asset routes. See the old issue:

> If tracing, log correlation, and context propagation aren't relevant for static files, I would suggest using the transactionIgnoreUrls config option to skip certain routes, such as /translations/*.
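
For reference, that suggestion would amount to something like this (a sketch; the exact route patterns are placeholders):

```js
// Sketch: skip transaction creation for static-asset routes.
// transactionIgnoreUrls accepts wildcard patterns; these paths are examples.
require('elastic-apm-node').start({
  contextPropagationOnly: true,
  transactionIgnoreUrls: ['/translations/*', '/*/bundles/*'],
});
```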

This suggestion was revoked on today's call with the APM client team. @elastic/kibana-core, there are no other actionable items until the APM team finishes their investigation.

@lizozom
Contributor Author

lizozom commented Apr 6, 2022

I think that there's still one actionable item on the @elastic/kibana-core side: make sure that we can completely turn off APM via configs if we need to (i.e. not call init at all).
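
A minimal sketch of that idea (hypothetical, not the actual src/cli/apm.js; config key names are assumed for illustration):

```js
// If APM is explicitly and fully disabled, never require/start the agent,
// so none of its instrumentation or async_hooks overhead gets installed.
function maybeInitApm(apmConfig = {}) {
  const { active = true, contextPropagationOnly = true } = apmConfig;
  if (!active && !contextPropagationOnly) {
    return null; // fully off: init is never called
  }
  return require('elastic-apm-node').start({ active, contextPropagationOnly });
}
```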

On the @elastic/apm-agent-node-js team's side, they mentioned they will look at the performance of the simple-search route with and without APM enabled.

@trentm
Member

trentm commented Apr 6, 2022

> This suggestion was revoked on today's call with the APM client team.

I think it depends. If it turns out the Node.js APM agent overhead is mostly due to its async_hooks usage, then there isn't a way to turn that off for particular routes. However, if there is significant APM agent overhead from capturing transactions/spans/errors and sending those for static routes, then there might be something to gain from using transactionIgnoreUrls.
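
To illustrate why the async_hooks part can't be scoped to particular routes: hooks are process-wide and fire for every async resource, no matter which route created it. A tiny runnable demo in plain Node.js (unrelated to the agent's internals):

```js
const asyncHooks = require('async_hooks');

let created = 0;
const hook = asyncHooks.createHook({
  init() { created++; }, // runs on every async resource created in the process
});
hook.enable();

setTimeout(() => {
  hook.disable();
  // Even this tiny program creates several async resources; under a busy
  // HTTP server the init callback fires constantly, for every route.
  console.log(`async resources created: ${created}`);
}, 10);
```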

@lizozom
Contributor Author

lizozom commented Apr 6, 2022

Let's see what comes up and then consider our alternatives?

@pgayvallet
Contributor

FWIW, in the scope of #123737, we also have a performance check comparing no APM vs contextPropagationOnly planned: elastic/kibana-load-testing#245

@pgayvallet
Contributor

I performed some load testing for this issue:

Test env

- Kibana instances: GCP n2d-standard-8
- ES cluster: Cloud's default configuration, same version, same region
- Load tester: GCP n2d-standard-8, same network, using internal IP/connection
- Testing scenario: DemoJourney, 300 users, ran 3 times on each of the reference/test versions

reference instance

vanilla main (8.3.0 - c2a13af54beb4b9f0b2ce1530f9e329863f92e5a) branch with default configuration

test instance

vanilla main branch (8.3.0 - c2a13af54beb4b9f0b2ce1530f9e329863f92e5a) with the following configuration:

```yaml
elastic.apm.contextPropagationOnly: false
elastic.apm.active: false
```

Summary

My observations confirm @lizozom's initial benchmarking: there is a significant difference in performance under heavy load when APM is enabled in contextPropagationOnly mode compared to when it's fully disabled.

  • Min is not affected
  • Mean is ~50% higher (1400 vs 2200)
  • 50th percentile is 70% higher (1000 vs 1700)
  • all other percentiles (75th, 95th and 99th) follow the same pattern, 40% to 70% higher

Raw results

reference branch (apm in contextPropagationOnly)

[screenshot of raw results]

test branch (apm fully disabled)

[screenshot of raw results]

@felixbarny
Member

@pgayvallet is there a way for you to capture a CPU profile with and without the agent for a scenario that sees a particularly high impact, such as Discover...by id) 2?

> we have a significant difference in performance under heavy load

Have you conducted testing under lighter load, too? If so, is the impact of the agent still as significant?
I'm wondering if the load test pushed Kibana close to the point at which response times degrade sharply, and the agent is the straw that breaks the camel's back.

@felixbarny
Member

Btw, which version of the Node.js agent was used in the tests? Version 3.31.0 comes with performance improvements.

@felixbarny
Member

How is the run-to-run variance on these tests? In other words, how reproducible are the results? I noticed that the standard deviation is about as high as the mean.

@lizozom
Contributor Author

lizozom commented May 4, 2022

Great questions, thank you!

@pgayvallet is out for a few days, so we'll follow up next week.
As for the results themselves, they correspond with what I saw in my own benchmarks: it seems that as load gets higher, the agent's impact becomes a larger share of the total execution time.

Anyway, let's wait for Pierre to get back for the rest of the questions.

@pgayvallet
Contributor

> which is the version of the Node.js agent used in the tests? Version 3.31.0 comes with performance improvements.

Sorry, I should definitely have mentioned the version used.

Tests were run with the version of the agent currently used by Kibana's main branch: 3.32.0

> Have you conducted testing under lighter load, too? If so, is the impact of the agent still as significant? I'm wondering if the load test pushed Kibana close to the point at which response times degrade sharply, and the agent is the straw that breaks the camel's back.

I did not, but I sure can.

I agree with your assumption btw, and my guess is that the impact should be far less significant under light load, given that the Min metric in the previous results is much less affected than the high percentiles.

> How is the run-to-run variance on these tests? I noticed that the standard deviation is about as high as the mean.

Pretty low. I ran the suites 3 times for each scenario, with variance under 10% in both cases.

FWIW, the standard deviation being so high seems 'normal' to me given the large gap between the low and high percentile metrics.

> is there a way for you to capture a CPU profile with and without the agent for a scenario that sees a particularly high impact, such as Discover...by id) 2?

I've actually never performed CPU profiling on a Kibana instance, but I see @lizozom generated a flamegraph in the issue's initial description, so she can probably provide me with some insight here.

FWIW, we can't use kibana-load-testing to target a specific request. We could either run the whole Discover suite or use kibana-capacity-test instead.
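
For anyone picking this up, one way to grab such a profile is with Node's built-in inspector module (a sketch, not an established Kibana workflow; the output filename and duration are placeholders):

```js
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

// Profile the process for 30 seconds while the load scenario runs, then
// write a .cpuprofile file that Chrome DevTools or Speedscope can open.
session.post('Profiler.enable', () => {
  session.post('Profiler.start', () => {
    setTimeout(() => {
      session.post('Profiler.stop', (err, { profile }) => {
        if (!err) fs.writeFileSync('kibana.cpuprofile', JSON.stringify(profile));
        session.disconnect();
      });
    }, 30_000);
  });
});
```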

@lizozom
Contributor Author

lizozom commented May 9, 2022

@pgayvallet and I synced up today.

Wanted to add a few more highlights:

  • We need to define the current capacity on the default server configuration. This would consist of:
    • Concurrent API requests: we can capture this value per API by gradually reducing the load in the tests @pgayvallet ran until we reach stable performance (see the sketch after this list for one way to automate such a probe).
    • Static asset requests: @dmlemeshko mentioned he had a capacity test that used to concurrently load all static bundles, but it was turned off. @danielmitterdorfer and I talked and agreed that it would be useful to re-introduce that test.
    • Tasks: TBD.
  • Once we have the current capacity, we need to share it and discuss what an acceptable capacity is: how many concurrent users we need and want to support.
  • Once we have those numbers, we should be able to focus our efforts on understanding where the most important bottlenecks are, be it our architecture (e.g. running tasks, static routes, and APIs on the same machine; Node.js limitations), our implementation (e.g. registering too many hooks, loading too much code by default), and/or APM.
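
As referenced in the first item above, a capacity probe could be automated along these lines (hypothetical sketch, not kibana-capacity-test; the endpoint, step sizes, and latency threshold are placeholders, and it assumes the autocannon load-testing package):

```js
const autocannon = require('autocannon'); // assumed dependency

// Step up concurrent connections until p99 latency crosses a threshold,
// and report the last level the server handled acceptably.
async function findCapacity(url) {
  for (let connections = 50; connections <= 500; connections += 50) {
    const result = await autocannon({ url, connections, duration: 30 });
    console.log(`${connections} connections -> p99 ${result.latency.p99} ms`);
    if (result.latency.p99 > 2000) return connections - 50;
  }
  return 500; // capacity not exhausted within the tested range
}

findCapacity('http://localhost:5601/api/status').then((c) =>
  console.log(`approximate capacity: ${c} concurrent connections`)
);
```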

@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@botelastic botelastic bot removed the needs-team Issues missing a team label label May 23, 2022
@dgieselaar
Member

@stratoula I don't think this is an issue for team:apm. The implementation of contextPropagationOnly is handled by Core. I'll update the labels (feel free to change them again if my assumptions are wrong).

@dgieselaar dgieselaar added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc and removed Team:APM All issues that need APM UI Team support labels May 23, 2022
@lizozom
Contributor Author

lizozom commented Jun 2, 2022

Chatted with @pgayvallet
Issue moved to backlog until further updates.

@pgayvallet pgayvallet removed their assignment Jun 9, 2022
@pgayvallet
Contributor

Removing self-assignment as I'm not actively working on it atm

@lizozom lizozom changed the title [Bug] APM contextPropagationOnly limits server capacity Does APM contextPropagationOnly limit server capacity? Jun 14, 2022
@afharo afharo added the wg:performance Work tracked by the performance workgroup label Jul 18, 2022
@trentm
Member

trentm commented Jul 19, 2022

There is now an elastic-apm-node@3.37.0 release that includes elastic/apm-agent-nodejs#2786, which should significantly reduce the APM agent's overhead for Kibana.

@lizozom or others: What do you think are best next steps for this issue? How about:

  1. Open a PR to update the APM agent in Kibana.
  2. (Optional) Someone on the Kibana performance team runs some sanity DemoJourney tests with the new elastic-apm-node@3.37.0 for comparison. Is that easy to do with some perf CI setup?
  3. Resolve this issue as "yes, there is overhead from the APM agent, which will limit server capacity to a degree. We understand the overhead to be NN% for this scenario" (where I can describe that scenario).

@TinaHeiligers
Contributor

@trentm your proposed next steps seem like the logical way to progress here.

@pgayvallet
Contributor

> Open a PR to update the APM agent in Kibana.

Renovate already took care of that: #136657

@trentm
Member

trentm commented Jul 19, 2022

> Renovate already took care of that

Ha, thanks. I hadn't yet managed to bootstrap a Kibana clone. :)

@lizozom
Contributor Author

lizozom commented Jul 20, 2022

@trentm @danielmitterdorfer

I'm so glad to see that this issue ended up yielding some performance improvements. Thank you so much for investing the time in this. I'm OOO for the next couple of weeks, but maybe @pgayvallet could run the benchmarks to verify?

More broadly speaking, we want to start running some capacity benchmarks in the foreseeable future, so we will obviously track this as well.
