
How to speed up consumer action in Jaeger ingester #2927

Closed
judyQD opened this issue Apr 7, 2021 · 21 comments

judyQD commented Apr 7, 2021

Describe the bug
I am using tracegen --> collector --> kafka --> ingester --> ES, and I found that the ingester is quite slow at pulling spans from Kafka.
(screenshot attached)

Ingester configuration is as below:

      es.bulk.actions: 1000,
      es.bulk.flush-interval: 200ms
      es.bulk.size: 5e+06
      es.bulk.workers: 50
      es.create-index-templates: true
      es.index-prefix: "zts_dnt_jaeger"
      es.max-doc-count: 10000
      es.max-span-age: 72h0m0s
      es.num-replicas: 1
      es.num-shards: 5
      es.password: "qE3E&FE#2QslS9sdf&q"
      es.server-urls: "http://127.0.0.1:9200"
      es.sniffer: false
      es.sniffer-tls-enabled: false
      es.tags-as-fields.all: false
      es.tags-as-fields.config-file: ""
      es.tags-as-fields.dot-replacement: "@"
      es.tags-as-fields.include: ""
      es.timeout: 0s
      es.tls.ca: ""
      es.tls.cert: ""
      es.tls.enabled: false
      es.tls.key: ""
      es.tls.server-name: ""
      es.tls.skip-host-verify: false
      es.token-file:
      es.use-aliases: false
      es.username: "admin"
      es.version:
      log-level: info
      metrics-backend: "prometheus"
      metrics-http-route: "/metrics"
      ingester.deadlockInterval: 0s
      ingester.parallelism: 1000
      kafka.consumer.authentication: none
      kafka.consumer.brokers: "127.0.0.1:9092"
      kafka.consumer.client-id: "jaeger-ingester"
      kafka.consumer.encoding: "protobuf"
      kafka.consumer.group-id: "jaeger-ingester"
      kafka.consumer.kerberos.config-file: "/etc/krb5.conf"
      kafka.consumer.kerberos.keytab-file: "/etc/security/kafka.keytab"
      kafka.consumer.kerberos.password:
      kafka.consumer.kerberos.realm:
      kafka.consumer.kerberos.service-name: "kafka"
      kafka.consumer.kerberos.use-keytab: false
      kafka.consumer.kerberos.username:
      kafka.consumer.plaintext.password:
      kafka.consumer.plaintext.username:
      kafka.consumer.protocol-version:
      kafka.consumer.tls.ca: ""
      kafka.consumer.tls.cert: ""
      kafka.consumer.tls.enabled: false
      kafka.consumer.tls.key: ""
      kafka.consumer.tls.server-name: ""
      kafka.consumer.tls.skip-host-verify: ""

Screenshots
(screenshot attached)

judyQD added the bug label Apr 7, 2021

judyQD commented Apr 7, 2021

help wanted

jpkrohling commented:

Are you able to provide the metrics from both the ingester and from the collector components?

jpkrohling added the needs-info label and removed the bug label Apr 7, 2021

judyQD commented Apr 8, 2021

metricsdata.zip


judyQD commented Apr 8, 2021

> Are you able to provide the metrics from both the ingester and from the collector components?

Not sure which metrics data you would like to check, so I have shared all the data in the zip file. Thanks.


judyQD commented Apr 13, 2021

@jpkrohling Any findings or suggestions? Thanks.

jpkrohling commented:

No, sorry, not yet. @albertteoh, you also have some experience with the ingester, right? Are you able to take a look?


albertteoh commented Apr 15, 2021

Unfortunately, I have very little practical prod experience with the ingester so I can't promise quality help here.

Having said that, taking a cursory look at the metrics provided shows some fairly large offset lags:

$ grep jaeger_ingester_sarama_consumer_offset_lag i*metrics.log
i1metrics.log:# HELP jaeger_ingester_sarama_consumer_offset_lag offset-lag
i1metrics.log:# TYPE jaeger_ingester_sarama_consumer_offset_lag gauge
i1metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="0"} 0
i1metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="1"} 1.187711e+06
i1metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="2"} 0
i2metrics.log:# HELP jaeger_ingester_sarama_consumer_offset_lag offset-lag
i2metrics.log:# TYPE jaeger_ingester_sarama_consumer_offset_lag gauge
i2metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="0"} 0
i2metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="1"} 1.222594e+06
i2metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="2"} 1.161799e+06
i3metrics.log:# HELP jaeger_ingester_sarama_consumer_offset_lag offset-lag
i3metrics.log:# TYPE jaeger_ingester_sarama_consumer_offset_lag gauge
i3metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="0"} 1.066824e+06
i3metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="1"} 0
i3metrics.log:jaeger_ingester_sarama_consumer_offset_lag{partition="2"} 1.206965e+06

I'm not entirely sure why some lags are 0 and some are so large; maybe the ingesters are each only consuming from one partition? If that's the case, then perhaps these metrics look okay, since each partition has at least one ingester with 0 lag. But if each ingester is trying to consume from every partition, then this could be a problem.

In my past experience, one possible cause of ingesters struggling to keep up with consuming from Kafka is that the backing store (Elasticsearch in your case) is struggling to keep up with the write load.

Do you have any numbers on your Elasticsearch cluster load? This may help: https://www.elastic.co/blog/advanced-tuning-finding-and-fixing-slow-elasticsearch-queries.


judyQD commented Apr 26, 2021

After our tuning, we found that the bottleneck is currently the ingester, not Elasticsearch. Even when we add more partitions and more ingesters, we still cannot improve the ingesters' capacity to fetch data from the Kafka partitions.
When we use pstack to check the ingester, we find the ingester is waiting all the time. Could you help check why?

pstack 25895
Thread 1 (process 25895):
#0 runtime.futex () at runtime/sys_linux_amd64.s:588
#1 0x00000000004338c6 in runtime.futexsleep (addr=0x1a5c6a8 <runtime.m0+328>, val=0, ns=-1) at runtime/os_linux.go:45
#2 0x000000000040c15f in runtime.notesleep (n=0x1a5c6a8 <runtime.m0+328>) at runtime/lock_futex.go:159
#3 0x000000000043ebcd in runtime.stoplockedm () at runtime/proc.go:2079
#4 0x0000000000440a2c in runtime.schedule () at runtime/proc.go:2611
#5 0x0000000000440e1d in runtime.park_m (gp=0xc000102a80) at runtime/proc.go:2845
#6 0x000000000046e41b in runtime.mcall () at runtime/asm_amd64.s:318
#7 0x000000000046e314 in runtime.rt0_go () at runtime/asm_amd64.s:220
#8 0x0000000000000000 in ?? ()

albertteoh commented:

> we found that the bottleneck is currently the ingester, not Elasticsearch.

How did you come to this conclusion?

> When we use pstack to check the ingester, we find the ingester is waiting all the time.

Can we answer the question of what the ingester is waiting on?

The ingester pulls span messages from Kafka and synchronously writes each span to Elasticsearch before pulling the next span message from its partition.
The only other thing I see it doing is reporting metrics, which should be quite cheap.

Could the ingester be waiting on the "OK, the span write was successful" response from ES?

> Even when we add more partitions and more ingesters, we still cannot improve the ingesters' capacity to fetch data from the Kafka partitions.

I'm assuming you mean the offset lag is still present; is it still increasing at the same rate (or getting worse)?

If so, this seems to more strongly suggest a bottleneck in Elasticsearch since there are more ingesters to pull messages from kafka yet the lag is not improving, indicating the ingesters are blocked on writing out spans (before they can consume more kafka messages).

Do you have numbers around ES performance (e.g. CPU, memory pressure)? This seems to be a useful resource.


judyQD commented Apr 30, 2021

> we found that the bottleneck is currently the ingester, not Elasticsearch.
>
> How did you come to this conclusion?

Because we also use fluentd to send traffic to ES, and ES can index at a rate of 16k logs/s there. When we use the ingester to send spans to ES, the ES index rate is only about 2,000 spans/s, with the same ES configuration and the same environment. So we think the bottleneck is the ingester.



judyQD commented Apr 30, 2021

@albertteoh I am also thinking about your comment:

> The ingester pulls span messages from Kafka and synchronously writes each span to Elasticsearch before pulling the next span message from its partition.

I think the ingester uses the ES bulk API to write spans, which means every bulk request contains more than one span. If your comment were right, then every bulk request would contain only a single span. Please correct me if I have something wrong.

albertteoh commented:

@judyQD thanks for correcting me :) my assumption was incorrect and indeed jaeger does use the ES bulk API.

What I suggest as next steps:

  1. Measure - Jaeger emits some useful metrics on bulk writes:
    1. Write counters: jaeger_bulk_index_inserts_total, jaeger_bulk_index_errors_total and jaeger_bulk_index_attempts_total.
    2. Latency histograms: jaeger_bulk_index_latency_ok_bucket and jaeger_bulk_index_latency_err_bucket.

I threw together a quick dashboard of a locally running ingester with write success rates and latency histograms; happy to share the dashboard if you like (the +Inf error write was probably because I killed Elasticsearch):
(screenshot: Grafana dashboard, May 1, 2021)

  2. Tune - Based on the above measurements, there are a few knobs you can adjust that may help optimize the ingester (or their env var equivalents like ES_BULK_ACTIONS, etc.):
$ SPAN_STORAGE_TYPE=elasticsearch ./cmd/ingester/ingester-darwin-amd64 --help | grep bulk
...
      --es.bulk.actions int                                The number of requests that can be enqueued before the bulk processor decides to commit (default 1000)
      --es.bulk.flush-interval duration                    A time.Duration after which bulk requests are committed, regardless of other thresholds. Set to zero to disable. By default, this is disabled. (default 200ms)
      --es.bulk.size int                                   The number of bytes that the bulk requests can take up before the bulk processor decides to commit (default 5000000)
      --es.bulk.workers int                                The number of workers that are able to receive bulk requests and eventually commit them to Elasticsearch (default 1)
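For example, a purely illustrative combination of these flags (not a recommendation; the right values depend on the measurements above and on how your ES cluster behaves):

      --es.bulk.workers=4
      --es.bulk.actions=2000
      --es.bulk.size=10000000
      --es.bulk.flush-interval=200ms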

albertteoh commented:

@judyQD did you manage to improve the ingester's consumption rate?


judyQD commented Jun 7, 2021

@albertteoh Thanks for your kind help. Actually, no. I found that the Jaeger ingester uses the package "github.com/Shopify/sarama" as its Kafka consumer. I changed the max fetch size in this package and it improved the consumer performance, but the consumption rate is not stable and I have no idea how to increase it further.
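For reference, the change amounts to tuning sarama's consumer fetch settings. A minimal sketch of the sarama config fields involved (illustrative values, not the actual patch):

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	// Fetch sizes are in bytes; larger values let each fetch request
	// return bigger batches from the broker.
	config.Consumer.Fetch.Default = 10 * 1024 * 1024 // sarama's default is 1 MB
	config.Consumer.Fetch.Max = 32 * 1024 * 1024
	// How many messages sarama buffers internally per channel.
	config.ChannelBufferSize = 1024

	consumer, err := sarama.NewConsumer([]string{"127.0.0.1:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()
	// ... create partition consumers and read messages as usual ...
}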

albertteoh commented:

@judyQD Did you manage to get some measurements from the ingester and try modifying some of its parameters #2927 (comment)?

For example, perhaps adding more workers or influencing the flush rate by count, duration or size, may help increase the consumption rate?


judyQD commented Jun 7, 2021

@albertteoh Thanks for your comments. The parameters you list focus on storing data into ES.
But based on my observation, the bottleneck is fetching data from Kafka, so I want to improve the performance of that part.
Do you have any suggestions for improving the performance of fetching data from Kafka? Thanks.


albertteoh commented Jun 7, 2021

@judyQD If I'm not mistaken, the "Kafka consumer goroutine" (after consuming the span from the topic) attempts to add the span to a buffer that is eventually bulk-written to ES.

I suspect the bottleneck is at the point where the Kafka consumer goroutine offloads the message from Kafka via a write to an unbuffered channel.

The reason:

  • Another goroutine (we'll call it the "bulk writer goroutine") is continuously consuming from this channel and writing to the buffer and is also responsible for flushing/committing the buffer of spans to ES.
  • What happens if the bulk writer goroutine can't keep up with the messages being written to the channel by the Kafka consumer goroutine?
    • It will cause the Kafka consumer goroutine to block until the bulk writer goroutine is able to read from the channel again.

To illustrate the problem of writing to an unbuffered channel while there is no "reading" goroutine, I've put together this short go program: https://play.golang.org/p/K88GTBJJwHL.
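Roughly, the idea looks like this (a simplified sketch; the playground version may differ in details):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered channel: every send blocks until a receiver is ready.
	messages := make(chan string)

	// "Kafka consumer" goroutine: hands off messages as fast as it gets them.
	go func() {
		for i := 0; i < 5; i++ {
			fmt.Println("consumer: sending message", i)
			messages <- fmt.Sprintf("span-%d", i)
			fmt.Println("consumer: message", i, "handed off")
		}
	}()

	// The "bulk writer" side is slow to start reading; while it sleeps,
	// the consumer goroutine above is blocked on the channel send.
	time.Sleep(3 * time.Second)

	for i := 0; i < 5; i++ {
		fmt.Println("writer: received", <-messages)
	}
}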

If you comment out the sleep, you'll see the consumer goroutine is no longer blocked writing to the channel.

This is why I think we should explore the parameters for tweaking the ES writes, as this may be causing back-pressure on the Kafka consumer. Increasing the number of workers or adjusting other flush-related parameters may increase write throughput to ES.

I think it would also be good to measure the bulk index insert counts, etc. above, to provide empirical evidence that an increase in workers correlates with an increase in inserts and, in turn, with an improvement (or not) in the offset lag. I believe the ingester also exposes these Kafka consumer metrics.

mehta-ankit commented:

I know this is an old issue, but I recently tried to improve ingester performance and landed on this issue, so I wanted to share what I did and how it helped us.

Thanks to @albertteoh's suggestions, these are the steps I took.

  1. Doubled my Kafka partitions (keeping the number of brokers the same, but increased broker CPU).
  2. This allowed me to run more ingesters (up to the total number of partitions) to parallelize the consumption of spans.

These two steps alone did not increase performance, so, taking suggestions/ideas from above:

  3. Added more CPU/heap to my Elasticsearch nodes, because I saw elasticsearch_thread_pool_queue_count for type write jump up drastically after making changes (1) and (2). More CPU allowed the thread pool queue count to come down.
  4. Increased bulk action workers on the ingester (es.bulk.workers) from the default of 1. I played with 2 and 3; I think the right number depends on how your storage backend behaves.

Using these changes I was able to increase the document index rate of my Elasticsearch nodes by 50%, and the consumer lag on the ingesters went down immensely. Looking at jaeger_bulk_index_inserts_total, it doubled in value.

zdyj3170101136 commented:

Maybe the reason is that the jsonpb package is so slow.


sbulic commented Nov 21, 2022

@albertteoh would you mind sharing that Grafana dashboard for monitoring the Jaeger ingester? Is it only for the Jaeger ingester, or does it maybe also cover the Collector / Query services too?

yurishkuro commented:

jaeger-v2 will have a different implementation, closing this.

yurishkuro closed this as not planned Aug 4, 2024