Support tail-based sampling from OTEL Collector #5867

Closed
4 tasks done
yurishkuro opened this issue Aug 20, 2024 · 34 comments · Fixed by #5878
Labels
area/sampling changelog:new-feature Change that should be called out as new feature in CHANGELOG good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement v2

Comments

@yurishkuro
Member

yurishkuro commented Aug 20, 2024

Jaeger v2 can now support the tail-based sampling that exists in the OTEL Collector as an extension.

  • include loadbalancing exporter and tail_sampling processor in components.go
  • create sample configuration and docker-compose file utilizing it
  • create e2e integration test (script and GH actions workflow)
  • create README documenting the setup

As a single task it's too large for a good-first-issue, but it can be done incrementally.

@dosubot dosubot bot added area/sampling changelog:new-feature Change that should be called out as new feature in CHANGELOG v2 labels Aug 20, 2024
@yurishkuro yurishkuro added help wanted Features that maintainers are willing to accept but do not have cycles to implement good first issue Good for beginners labels Aug 20, 2024
@mahadzaryab1
Contributor

@yurishkuro I'm interested in working on this but have never contributed to this repository before. Would you be able to guide me on how to tackle this issue?

@yurishkuro
Member Author

@mahadzaryab1 I would start with

  • reading the blog post I linked
  • reproducing it in a local setup via docker compose
  • then swapping the final stage collectors with jaeger-v2 collector (it will require importing the tail sampling processor into components.go)

@mahadzaryab1
Contributor

sounds good! i'll give this a shot

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 22, 2024

@yurishkuro I got the first item done in #5878 and read the blog post. Are there any instructions/examples on how I can create a sample configuration and docker-compose file? Thank you so much for your time and help!

@yurishkuro
Member Author

Sample configurations are provided in the blog post. An example of docker compose using Jaeger is in docker-compose/monitor/docker-compose-v2.yml (but you will need to build the Docker image locally so that the binary recognizes the tail sampling processor, because the officially published image will not).

@mahadzaryab1
Contributor

@yurishkuro thank you! as a follow-up, i'm trying to build the image locally by running make build from jaeger/docker-compose/monitor but am running into the following error. am i missing a step in between?

+ cd packages/jaeger-ui
./scripts/rebuild-ui.sh: line 35: cd: packages/jaeger-ui: No such file or directory
make[2]: *** [rebuild-ui] Error 1
make[1]: *** [jaeger-ui/packages/jaeger-ui/build/index.html] Error 2
make: *** [build] Error 2

@mahadzaryab1
Contributor

@yurishkuro this is how i've modified the docker setup to reproduce tail based sampling - let me know if you have any feedback

@yurishkuro
Member Author

you need to initialize and update the git submodules in order to access the UI dir

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 23, 2024

@yurishkuro awesome! thank you so much. I was able to get the setup going based on the configuration I linked above. Here's some sample output I'm seeing from the different policies that are being evaluated:

jaeger-1          | 2024-08-23T05:14:01.932Z	debug	sampling/string_tag_filter.go:95	Evaluting spans in string-tag filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "string_attribute"}
jaeger-1          | 2024-08-23T05:14:01.932Z	debug	sampling/latency.go:34	Evaluating spans in latency filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "latency"}
jaeger-1          | 2024-08-23T05:14:01.932Z	debug	sampling/probabilistic.go:46	Evaluating spans in probabilistic filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "probabilistic"}
jaeger-1          | 2024-08-23T05:14:01.932Z	debug	sampling/status_code.go:54	Evaluating spans in status code filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "status_code"}

@yurishkuro
Member Author

I suggest now thinking about what an e2e integration test could look like for this

@mahadzaryab1
Contributor

@yurishkuro will do - thanks for all your help and guidance so far! should the sample configuration/docker compose/readme go in https://github.com/jaegertracing/jaeger/tree/main/examples?

@yurishkuro
Member Author

docker-compose/tail-sampling

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 24, 2024

@yurishkuro I completed the second task of creating a sample configuration. I'm looking for a bit of guidance on the e2e integration tests. I see that we have some integration tests but those mostly seem to be for the storage collectors. In this case, do we want to maybe test that the traces are being sampled according to the policies in our config? For example, we could test the 'filter-by-attribute' or the 'all-errors' policies by ensuring that those traces are getting captured and the rest are getting filtered out.

@yurishkuro
Member Author

You can use the tests we have as inspiration, but don't try to fit your test to them. How would you e2e test tail sampling if starting from scratch? Try explaining the whole setup and test process.

@mahadzaryab1
Contributor

@yurishkuro I am envisioning something like this:

  1. Set up the jaeger-v2 collector with a tail-sampling processor with a string_attribute policy that matches on a particular tag
  2. Start the load balancing collector and jaeger-v2 collector from the docker-compose setup
  3. Send spans with various attributes to the otel-collector with load balancing
  4. Test that only the traces whose tags match the ones listed in the policy are sampled/stored by the jaeger-v2 collector

Let me know what you think and if you have any feedback!

@yurishkuro
Member Author

SGTM. A couple thoughts:

  • Such a test would depend on your ability to generate very specific traces. How are you planning to achieve that? E.g. both the microsim and tracegen utils generate nearly identical traces, although you have more control with tracegen where, for instance, you could control how many service names it generates
  • How will you verify that the right traces are sampled?
  • Most importantly, you need to make sure that the test is robust, e.g. that the reason some traces are not sampled is due to tail sampler, not due to some other condition. One way to ensure this is to run the A/B test where the only change between A and B is the configuration of the tail sampler. E.g. in A you only configure it to sample service-a, but not service-b, and you verify that you can observe that. Then in B you flip that condition and again verify that you get expected results.

BTW, in order to perform this test you do not need load generator running continuously, it's better if you just generate a fixed number of traces. You also need to make sure the storage is purged between A and B.
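
The A/B idea above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions: `Trace`, `sampleByService`, and `abCheck` are hypothetical names, and the sampler is a toy in-memory filter standing in for the real tail sampling processor, not its implementation.

```go
package main

import "fmt"

// Trace is a simplified, hypothetical stand-in for a stored trace.
type Trace struct {
	ID      string
	Service string
}

// sampleByService mimics a tail sampler with a single policy that keeps
// only traces emitted by the given service (a toy model, not the real processor).
func sampleByService(traces []Trace, keep string) []Trace {
	var out []Trace
	for _, t := range traces {
		if t.Service == keep {
			out = append(out, t)
		}
	}
	return out
}

// services returns the set of service names present in the stored traces.
func services(traces []Trace) map[string]bool {
	set := make(map[string]bool)
	for _, t := range traces {
		set[t.Service] = true
	}
	return set
}

// abCheck feeds the same fixed input through configuration A and then B.
// Each run starts from empty storage, and the check passes only if each run
// stored exactly the service it was configured to sample.
func abCheck(input []Trace, a, b string) bool {
	runA := services(sampleByService(input, a))
	runB := services(sampleByService(input, b))
	return len(runA) == 1 && runA[a] && len(runB) == 1 && runB[b]
}

func main() {
	input := []Trace{{"1", "service-a"}, {"2", "service-b"}, {"3", "service-a"}}
	fmt.Println(abCheck(input, "service-a", "service-b"))
}
```

Because both runs see the same fixed input, a failure in either run points at the sampler configuration rather than at trace generation; if one of the services was never generated, the corresponding run stores nothing and the check fails.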

@mahadzaryab1
Contributor

@yurishkuro Thanks for the follow-up. I'm going to try playing around with tracegen to begin with to see what kind of traces I can generate. I'm currently trying to replace microsim with tracegen in the docker-compose.yml with the following configuration.

  tracegen:
    image: jaegertracing/jaeger-tracegen:latest
    environment:
      - OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces
    command: ["-duration", "10s", "-workers", "3", "-pause", "250ms"]
    depends_on:
      - jaeger

This doesn't seem to work and gives me the following error:

tracegen-1        | 2024/08/24 20:42:32 traces export: Post "http://jaeger:4318/v1/traces": dial tcp: lookup jaeger on 127.0.0.11:53: no such host
tracegen-1 exited with code 0

Do you know what I'm missing? Here is the full docker-compose set up if that helps.

@yurishkuro
Member Author

You're missing the network setting in the tracegen compose config

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 24, 2024

@yurishkuro thank you so much! As a follow up, I was wondering if you had any guidance (or an example if we're already doing this somewhere) on how I can query the spans that are stored? If we set up a policy on the service name, then we can do an A/B test with and without sampling as you suggested by querying the spans from store.

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 24, 2024

@yurishkuro I've been reading through the storage integration tests. If we add a filter on the service name in our tail sampling processor config, then can we do something similar to https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/integration/integration.go#L136 to get all the services from the traces we have collected (in memory) and ensure that we only have the ones we listed in our policy?

@mahadzaryab1
Contributor

As well, do you have any thoughts on where the integration tests should live? It seems like the storage ones are in cmd/jaeger/internal/integration.

@yurishkuro
Member Author

to get all the services from the traces we have collected (in memory) and ensure that we only have the ones we listed in our policy?

Conceptually yes, but sampling works at the trace level (all spans or nothing), so it still needs a bit more thought on how you'd test them
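
To illustrate the trace-level point, here is a minimal Go sketch. It assumes a simplified model in which a trace is kept when any one of its spans comes from the wanted service; `Span` and `sampleTraces` are hypothetical names, and the real processor evaluates richer policies.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span: its trace ID and service name.
type Span struct {
	TraceID string
	Service string
}

// sampleTraces keeps every span of a trace if at least one of its spans
// comes from the wanted service, and drops the whole trace otherwise.
func sampleTraces(spans []Span, wanted string) []Span {
	keep := make(map[string]bool)
	for _, s := range spans {
		if s.Service == wanted {
			keep[s.TraceID] = true
		}
	}
	var out []Span
	for _, s := range spans {
		if keep[s.TraceID] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	spans := []Span{
		{"t1", "frontend"}, {"t1", "tracegen-02"},
		{"t2", "frontend"},
	}
	for _, s := range sampleTraces(spans, "tracegen-02") {
		fmt.Println(s.TraceID, s.Service)
	}
}
```

In this model the `frontend` span of trace `t1` is stored even though the policy names only `tracegen-02`, so a check that simply lists the service names in storage can see services outside the policy. That is why the assertion has to be made per trace rather than per service.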

@yurishkuro
Member Author

The test needs a new workflow yaml file and a new shell script to orchestrate. Then depending on how you want to do the analysis you might need more - at this point I don't know what you have in mind. If you need go code then cmd/jaeger/internal/integration would be a good place.

@mahadzaryab1
Contributor

@yurishkuro Would it not be enough to have a policy in the tail sampling processor that selects a single service using a string attribute policy, say tracegen-02? We can then use tracegen to generate traces for some number of services, say 5. Then, if our tail sampling processor is working as expected, we would only store tracegen-02 and discard the rest.

@yurishkuro
Member Author

That's fine as a first version, but it's not quite a robust test because it doesn't prove that service02 was actually generated in the first place. An A/B test wouldn't have that problem.

@mahadzaryab1
Contributor

@yurishkuro I see. How about something like this?

  1. Start jaeger backend with batch processor
  2. Use tracegen to generate traces for 5 services
  3. Query the data store to see that traces are present for all 5 services
  4. Flush the data store
  5. Start jaeger backend with tail sampling processor that has a policy to only sample service.name=tracegen-02
  6. Query the data store to see that traces are present only for tracegen-02
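
The two verification steps (3 and 6) could share one comparison helper. The Go sketch below is an assumption-laden illustration: `diffServices` is a hypothetical helper, the service names other than tracegen-02 are made up, and the `found` slices stand in for whatever the query against the data store returns.

```go
package main

import (
	"fmt"
	"sort"
)

// diffServices compares the service names found in storage with the expected
// ones and returns the names that are missing and the names that are unexpected.
func diffServices(found, want []string) (missing, unexpected []string) {
	foundSet := make(map[string]bool, len(found))
	for _, s := range found {
		foundSet[s] = true
	}
	wantSet := make(map[string]bool, len(want))
	for _, s := range want {
		wantSet[s] = true
		if !foundSet[s] {
			missing = append(missing, s)
		}
	}
	for s := range foundSet {
		if !wantSet[s] {
			unexpected = append(unexpected, s)
		}
	}
	sort.Strings(unexpected) // map iteration order is random; sort for stable output
	return missing, unexpected
}

func main() {
	// Hypothetical service names; only tracegen-02 appears in the thread.
	all := []string{"tracegen-00", "tracegen-01", "tracegen-02", "tracegen-03", "tracegen-04"}

	// Steps 1-3: no tail sampling, so all five services are expected.
	missing, unexpected := diffServices(all, all)
	fmt.Println("baseline:", missing, unexpected)

	// Steps 5-6: only tracegen-02 is expected after sampling.
	missing, unexpected = diffServices([]string{"tracegen-02"}, []string{"tracegen-02"})
	fmt.Println("sampled:", missing, unexpected)
}
```

Reporting missing and unexpected names separately makes a failure easier to read: a missing name in the baseline run means the generator did not produce it, while an unexpected name in the sampled run means the policy let something through.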

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 25, 2024

@yurishkuro would you be able to help me with the setup of the test? i'm stuck on the following:

  • how do I query the datastore?
  • can all of this be done through the shell script or should I be writing go code?
  • If Go code needs to be written, does it make sense for it to live in cmd/jaeger/internal/integration? It seems like this directory is being used to run all the storage-related integration tests from the Makefile (https://github.com/jaegertracing/jaeger/blob/main/Makefile#L7)
  • Is the load balancer required for the integration tests? I'm not sure if there's something wrong with my setup but if I comment out the load balancing otel collector from the docker-compose file, the behaviour looks to be the same. Is this expected? (https://gist.github.com/mahadzaryab1/0b8ccc194421e00cd2e6dce6f450c424)

@yurishkuro
Member Author

It is easiest to query data store if you write Go code in cmd/jaeger/internal/integration, because that's exactly what the tests located there are doing - they are using an RPC implementation of SpanReader which proxies the requests to the query service running in the jaeger-v2 collector.

Load balancer is not necessary since you will only be running a single instance of the collector. The objective of the load balancer is to ensure that all spans for the same trace ID end up in the same instance of the collector that runs tail sampling logic.
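
The routing property described here can be sketched in Go as follows. This uses plain modulo hashing as a stand-in, which is an assumption for illustration and not necessarily how the loadbalancing exporter picks a backend; `pickBackend` and the backend names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend maps a trace ID to one of the backends so that every span
// with the same trace ID is routed to the same collector instance.
// It returns an empty string when no backends are configured.
func pickBackend(traceID string, backends []string) string {
	if len(backends) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(traceID)) // Write on a hash.Hash never returns an error
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"collector-1", "collector-2", "collector-3"}
	// Two spans of the same trace always land on the same backend.
	fmt.Println(pickBackend("trace-abc", backends) == pickBackend("trace-abc", backends))
	// With a single instance every trace goes to it, so no balancer is needed.
	fmt.Println(pickBackend("trace-abc", []string{"collector-1"}))
}
```

With a single collector instance the mapping is trivially constant, which matches the observation that removing the load balancer from the compose file does not change the behaviour.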

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 26, 2024

@yurishkuro Got it! Thank you so much. A couple of follow-ups I had:

@yurishkuro
Member Author

Yes, if you have to write code to read, you might as well save whichever traces you need via Span Writer.

The test won't run automatically, because those integration tests all require an environment variable to activate; otherwise they would all run at once and most of them would fail because the storage backends won't be available.
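
A minimal Go sketch of this kind of gating follows. The variable name `STORAGE`, the values `memory` and `cassandra`, and the helper `shouldRun` are assumptions for illustration; the actual variable and helper used by the Jaeger integration tests may differ.

```go
package main

import (
	"fmt"
	"os"
)

// shouldRun reports whether the suite for the given backend is enabled.
// The getenv function is injected so the logic can be checked without
// touching the real process environment.
// NOTE: the variable name "STORAGE" is an assumption for this sketch.
func shouldRun(getenv func(string) string, backend string) bool {
	return getenv("STORAGE") == backend
}

func main() {
	// Hypothetical backend names; each suite runs only when explicitly selected.
	for _, b := range []string{"memory", "cassandra"} {
		fmt.Printf("%s: run=%v\n", b, shouldRun(os.Getenv, b))
	}
}
```

In a real test the result would typically feed a `t.Skip` call at the top of the test function, so an unset variable skips every suite instead of failing it.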

@mahadzaryab1
Contributor

@yurishkuro I've completed the integration test and pushed it to #5878. Working on the final README task - should this go in jaeger/docker-compose/tail-sampling?

@yurishkuro
Member Author

yes please

@mahadzaryab1
Contributor

mahadzaryab1 commented Aug 31, 2024

@yurishkuro Thanks for all your help and guidance in helping me complete this issue. I learnt a lot about OpenTelemetry, Jaeger, and Distributed Tracing. I'm very excited about this project and would like to keep contributing to it. Do you have any recommendations for what I can pick up next?

@yurishkuro
Member Author

@mahadzaryab1 appreciate the help. The top priority is completing the work on Jaeger-v2, which you can see in the project board https://github.com/orgs/jaegertracing/projects/3/views/2
