Incorrect Behavior in OpenTelemetry Collector Spanmetrics #27472

lucasoares · 2023-10-06T21:16:27Z

Component(s)

connector/spanmetrics, exporter/prometheusremotewrite

What happened?

Subject: Issue Report: Incorrect Behavior in OpenTelemetry Collector Spanmetrics

Issue Description:

We're facing a peculiar issue with the OpenTelemetry Collector's Spanmetrics connector and could use some help sorting it out.

Here's a quick rundown:

Problem:

We've set up an architecture using Grafana Stack LGTM, with Grafana Loki, Tempo, and Mimir for logs, tracing, and metrics, respectively.
The goal is to sample traces efficiently but capture 100% of spanmetrics for a comprehensive APM dashboard.
Our setup uses otel/opentelemetry-collector-contrib as a load balancer, generating trace metrics with the 'spanmetrics' connector and routing traces/metrics based on an attribute_source to apply our internal tenant distribution across Grafana's services.
Traces are correctly routed and stored in Grafana Tempo, but the spanmetrics exhibit strange behavior on Grafana Mimir.

Spanmetrics Configuration:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [1ms, 2ms, ... , 10000s]
    namespace: traces.spanmetrics
    dimensions:
      - name: http.status_code
      - name: http.method
      - name: rpc.grpc.status_code
      - name: db.system
      - name: external.service
      - name: k8s.cluster.name

Issue Details:

Executing code that generates a specific span 10 times accumulates the counter timeseries correctly.
However, querying the metric using PromQL functions like increase or rate yields inaccurate results.
For example, increase(traces_spanmetrics_calls_total{service_name="my-service"}[5m]) shows a continuously increasing line, reaching 600 executions, and never returning to 0, even after a trace-free period.

Observations:

The discrepancy is causing inflated values in application metrics, with rate showing over 100,000,000 spans/minute for an app generating 40,000 spans/minute.
We sought help on the Grafana Mimir Slack channel (link) without success, but since we haven't found issues with metrics generated by our own applications, it suggests the problem lies within the OpenTelemetry Collector.

Screenshots:

In this last example, the metric only stopped because we restarted the opentelemetry-collector that was serving these spanmetrics

Another example of the metric being incorrect after the application no longer generates new spans:

If you need more details or logs, just let us know!

Collector version

0.83.0

Environment information

Environment

Kubernetes using official helm-chart:

image:
  # If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "0.83.0"
  # When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
  digest: ""

OpenTelemetry Collector configuration

There are 2 yaml helm configurations in this section.

The loadbalancer:

# Default values for opentelemetry-collector.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

nameOverride: ""
fullnameOverride: ""

# Valid values are "daemonset", "deployment", and "statefulset".
mode: "deployment"

configMap:
  # Specifies whether a configMap should be created (true by default)
  create: true

# Base collector configuration.
# Supports templating. To escape existing instances of {{ }}, use {{` <original content> `}}.
# For example, {{ REDACTED_EMAIL }} becomes {{` {{ REDACTED_EMAIL }} `}}.
config:
  receivers:
    jaeger: null
    zipkin: null
    prometheus: null
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 500
        http:
          endpoint: ${env:MY_POD_IP}:4318
  processors:
    batch:
      send_batch_max_size: 8192
    routing:
      from_attribute: k8s.cluster.name
      attribute_source: resource
      table:
      - value: a
        exporters:
          - prometheusremotewrite/mimir-a
      - value: b
        exporters:
          - prometheusremotewrite/mimir-b
      - value: c
        exporters:
          - prometheusremotewrite/mimir-c
      - value: d
        exporters:
          - prometheusremotewrite/mimir-d
      - value: e
        exporters:
          - prometheusremotewrite/mimir-e
      - value: e
        exporters:
          - prometheusremotewrite/mimir-f
      - value: f
        exporters:
          - prometheusremotewrite/mimir-g
      - value: g
        exporters:
          - prometheusremotewrite/mimir-h
      - value: h
        exporters:
          - prometheusremotewrite/mimir-i
      - value: J
        exporters:
          - prometheusremotewrite/mimir-j
    # If set to null, will be overridden with values based on k8s resource limits
    memory_limiter: null
  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [1ms, 2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s, 20s, 40s, 100s, 500s, 1000s, 10000s]
      namespace: traces.spanmetrics
      dimensions:
        - name: http.status_code
        - name: http.method
        - name: rpc.grpc.status_code
        - name: db.system
        - name: external.service
        - name: k8s.cluster.name
  exporters:
    logging: null
    prometheusremotewrite/mimir-a:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanaaMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-b:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanabMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-c:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanacMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-d:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanadMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-e:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanaFirehoseMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-f:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanaeMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-g:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanafMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-h:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanagMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-i:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanahMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-j:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanajMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    loadbalancing:
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          hostname: opentelemetry-collector-tail.tempo-system.svc.cluster.local
          port: 4317

  extensions:
    # The health_check extension is mandatory for this chart.
    # Without the health_check extension the collector will fail the readiness and liveliness probes.
    # The health_check extension can be modified, but should never be removed.
    health_check: {}
    memory_ballast:
      size_in_percentage: 33
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
      logs:
        encoding: json
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs: null
      metrics:
        receivers:
          - spanmetrics
        processors:
          - memory_limiter
          - batch
          - routing
        exporters:
          - prometheusremotewrite/mimir-a
          - prometheusremotewrite/mimir-b
          - prometheusremotewrite/mimir-c
          - prometheusremotewrite/mimir-d
          - prometheusremotewrite/mimir-e
          - prometheusremotewrite/mimir-f
          - prometheusremotewrite/mimir-g
          - prometheusremotewrite/mimir-h
          - prometheusremotewrite/mimir-i
          - prometheusremotewrite/mimir-j
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
        exporters:
          - loadbalancing
          - spanmetrics

image:
  # If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "0.83.0"
  # When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
  digest: ""
imagePullSecrets: []

# OpenTelemetry Collector executable
command:
  name: otelcol-contrib
  extraArgs:
    - --feature-gates=pkg.translator.prometheus.NormalizeName

nodeSelector:
  role: lgtm
tolerations:
- effect: NoSchedule
  key: grafana-stack
  operator: Exists

# Configuration for ports
# nodePort is also allowed
ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    # nodePort: 30317
    appProtocol: grpc
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: false
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: false
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: false
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: false
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    # The metrics port is disabled by default. However you need to enable the port
    # in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

# Resource limits & requests. Update according to your own use case as these values might be too low for a typical deployment.
resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 100Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"

# only used with deployment mode
replicaCount: 4

# only used with deployment mode
revisionHistoryLimit: 10

service:
  type: ClusterIP
  # type: LoadBalancer
  # loadBalancerIP: 1.2.3.4
  # loadBalancerSourceRanges: []
  annotations: {}

# PodDisruptionBudget is used only if deployment enabled
podDisruptionBudget:
  enabled: true
#   minAvailable: 2
  maxUnavailable: 1

rollout:
  rollingUpdate: {}
  # When 'mode: daemonset', maxSurge cannot be used when hostPort is set for any of the ports
  # maxSurge: 25%
  # maxUnavailable: 0
  strategy: RollingUpdate

clusterRole:
  # Specifies whether a clusterRole should be created
  # Some presets also trigger the creation of a cluster role and cluster role binding.
  # If using one of those presets, this field is no-op.
  create: false
  # Annotations to add to the clusterRole
  # Can be used in combination with presets that create a cluster role.
  annotations: {}
  # The name of the clusterRole to use.
  # If not set a name is generated using the fullname template
  # Can be used in combination with presets that create a cluster role.
  name: ""
  # A set of rules as documented here : https://kubernetes.io/docs/reference/access-authn-authz/rbac/
  # Can be used in combination with presets that create a cluster role to add additional rules.
  rules:
  - apiGroups:
    - ''
    resources:
    - 'endpoints'
    verbs:
    - 'get'
    - 'list'
    - 'watch'

  clusterRoleBinding:
    # Annotations to add to the clusterRoleBinding
    # Can be used in combination with presets that create a cluster role binding.
    annotations: {}
    # The name of the clusterRoleBinding to use.
    # If not set a name is generated using the fullname template
    # Can be used in combination with presets that create a cluster role binding.
    name: ""

The tail sampler:

# Default values for opentelemetry-collector.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

nameOverride: ""
fullnameOverride: ""

# Valid values are "daemonset", "deployment", and "statefulset".
mode: "deployment"

configMap:
  # Specifies whether a configMap should be created (true by default)
  create: true

# Base collector configuration.
# Supports templating. To escape existing instances of {{ }}, use {{` <original content> `}}.
# For example, {{ REDACTED_EMAIL }} becomes {{` {{ REDACTED_EMAIL }} `}}.
config:
  receivers:
    jaeger: null
    zipkin: null
    prometheus: null
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 500
        http: null
  processors:
    batch:
      send_batch_max_size: 8192
    # If set to null, will be overridden with values based on k8s resource limits
    memory_limiter: null
    tail_sampling:
      decision_wait: 60s
      policies:
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
    routing:
      from_attribute: k8s.cluster.name
      attribute_source: resource
      # default_exporters:
      # - otlp/default
      table:
      - value: a
        exporters:
          - otlp/tempo-a
      - value: b
        exporters:
          - otlp/tempo-b
      - value: c
        exporters:
          - otlp/tempo-c
      - value: d
        exporters:
          - otlp/tempo-d
      - value: e
        exporters:
          - otlp/tempo-e
      - value: f
        exporters:
          - otlp/tempo-f
      - value: g
        exporters:
          - otlp/tempo-g
      - value: h
        exporters:
          - otlp/tempo-h
      - value: i
        exporters:
          - otlp/tempo-i
      - value: j
        exporters:
          - otlp/tempo-j
  exporters:
    logging: null
    # otlp/default:
    #   endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
    #   tls:
    #     insecure: true
    #   headers:
    #     x-scope-orgid: aMimir
    otlp/tempo-a:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanaaTempo
    otlp/tempo-b:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanabTempo
    otlp/tempo-c:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanacTempo
    otlp/tempo-d:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanadTempo
    otlp/tempo-e:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanaeTempo
    otlp/tempo-f:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanafTempo
    otlp/tempo-g:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanagTempo
    otlp/tempo-h:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanahTempo
    otlp/tempo-i:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanaiTempo
    otlp/tempo-j:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanajTempo
  extensions:
    # The health_check extension is mandatory for this chart.
    # Without the health_check extension the collector will fail the readiness and liveliness probes.
    # The health_check extension can be modified, but should never be removed.
    health_check: {}
    memory_ballast:
      size_in_percentage: 33
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
      logs:
        encoding: json
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs: null
      metrics: null
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - tail_sampling
          - batch
          - routing
        exporters:
          - otlp/tempo-a
          - otlp/tempo-b
          - otlp/tempo-c
          - otlp/tempo-d
          - otlp/tempo-e
          - otlp/tempo-f
          - otlp/tempo-g
          - otlp/tempo-h
          - otlp/tempo-i
          - otlp/tempo-j

image:
  # If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "0.83.0"
  # When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
  digest: ""
imagePullSecrets: []

# OpenTelemetry Collector executable
command:
  name: otelcol-contrib
  extraArgs: []

nodeSelector:
  role: lgtm
tolerations:
- effect: NoSchedule
  key: grafana-stack
  operator: Exists

# Configuration for ports
# nodePort is also allowed
ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    # nodePort: 30317
    appProtocol: grpc
  otlp-http:
    enabled: false
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: false
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: false
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: false
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: false
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    # The metrics port is disabled by default. However you need to enable the port
    # in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

# Resource limits & requests. Update according to your own use case as these values might be too low for a typical deployment.
resources:
  limits:
    cpu: 1
    memory: 2Gi
  requests:
    cpu: 100m
    memory: 500Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"

# only used with deployment mode
replicaCount: 4

# only used with deployment mode
revisionHistoryLimit: 10

service:
  type: ClusterIP
  # type: LoadBalancer
  # loadBalancerIP: 1.2.3.4
  # loadBalancerSourceRanges: []
  clusterIP: None
  annotations: {}

# PodDisruptionBudget is used only if deployment enabled
podDisruptionBudget:
  enabled: true
#   minAvailable: 2
  maxUnavailable: 1

rollout:
  rollingUpdate: {}
  # When 'mode: daemonset', maxSurge cannot be used when hostPort is set for any of the ports
  # maxSurge: 25%
  # maxUnavailable: 0
  strategy: RollingUpdate

Ignore the exporters' names; I replaced them.



### Log output

_No response_

### Additional context

_No response_


github-actions · 2023-10-06T21:16:52Z

Pinging code owners:

connector/spanmetrics: @albertteoh
exporter/prometheusremotewrite: @Aneurysm9 @rapphil

See Adding Labels via Comments if you do not have permissions to add labels yourself.

albertteoh · 2023-10-07T11:05:58Z

Thanks for those details @lucasoares. I agree, that increase of 600 doesn't seem right to me.

Would it be possible to reproduce this locally with just the spanmetrics connector and a Prometheus server? That seems relatively straightforward, and it would take Mimir out of the equation and confirm (or rule out) whether the problem relates to the spanmetrics connector.

You could use this working docker-compose setup with the spanmetrics connector + prometheus (+ jaeger) as a template: https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor.
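
As a rough illustration of the kind of reduced setup suggested above, here is a minimal collector configuration sketch. It assumes a collector-contrib build; the ports and the choice of the `prometheus` exporter (scraped by a local Prometheus server) are illustrative assumptions and are not copied from the linked Jaeger example.

```yaml
# Hypothetical minimal config for a local reproduction (not from the thread).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # assumed local listen address

connectors:
  spanmetrics:
    namespace: traces.spanmetrics

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # assumed port; point a local Prometheus scrape job here

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]   # the connector acts as the exporter of the traces pipeline
    metrics:
      receivers: [spanmetrics]   # and as the receiver of the metrics pipeline
      exporters: [prometheus]
```

With a single collector and a single Prometheus server, any remaining inflation of increase() or rate() would point at the connector rather than at the remote-write path or Mimir.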

diogenesblip · 2023-10-19T17:20:47Z

Thank you for the suggestion to set up a local test environment with the spanmetrics connector and Prometheus. We followed your instructions and the configuration worked perfectly in our local test environment.

This helped us confirm that the spanmetrics connector and Prometheus configuration appear to be working as expected, and the initial issue we were facing may not be directly related to these components.

We also have a homologation (HMG) environment that is identical to the production environment, but we have not been able to observe the same erroneous behavior in it.

Below are the configuration files for the HMG environment:

The loadbalancer:

nameOverride: ""
fullnameOverride: ""

mode: "deployment"

configMap:
  create: true

config:
  receivers:
    jaeger: null
    zipkin: null
    prometheus: null
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 500
        http:
          endpoint: ${env:MY_POD_IP}:4318
  processors:
    batch:
      send_batch_max_size: 8192
    routing:
      from_attribute: k8s.cluster.name
      attribute_source: resource
      table:
      - value: a
        exporters:
          - prometheusremotewrite/mimir-a
      - value: b
        exporters:
          - prometheusremotewrite/mimir-b
      - value: c
        exporters:
          - prometheusremotewrite/mimir-c
      - value: d
        exporters:
          - prometheusremotewrite/mimir-d

    memory_limiter: null

  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [1ms, 2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s, 20s, 40s, 100s, 500s, 1000s, 10000s]
      namespace: traces.spanmetrics
      dimensions:
        - name: http.status_code
        - name: http.method
        - name: rpc.grpc.status_code
        - name: db.system
        - name: external.service
        - name: k8s.cluster.name
  exporters:
    logging: null
    prometheusremotewrite/mimir-a:
      endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanaaMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-b:
      endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanabMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-c:
      endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanacMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-d:
      endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanadMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    loadbalancing:
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          hostname: opentelemetry-collector-tail.tempo.svc.cluster.local
          port: 4317

  extensions:
    health_check: {}
    memory_ballast:
      size_in_percentage: 33
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
      logs:
        encoding: json
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs: null
      metrics:
        receivers:
          - spanmetrics
        processors:
          - memory_limiter
          - batch
          - routing
        exporters:
          - prometheusremotewrite/mimir-a
          - prometheusremotewrite/mimir-b
          - prometheusremotewrite/mimir-c
          - prometheusremotewrite/mimir-d
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
        exporters:
          - loadbalancing
          - spanmetrics

image:
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  tag: "0.83.0"
  digest: ""
imagePullSecrets: []

command:
  name: otelcol-contrib
  extraArgs:
    - --feature-gates=pkg.translator.prometheus.NormalizeName

nodeSelector:
  component: prometheus
tolerations:
- effect: NoSchedule
  key: kind
  operator: Equal
  value: prometheus
- effect: NoSchedule
  key: "kubernetes.azure.com/scalesetpriority"
  operator: Equal
  value: spot

ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    appProtocol: grpc
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: false
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: false
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: false
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: false
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 100Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"

replicaCount: 2

revisionHistoryLimit: 10

service:
  type: ClusterIP
  annotations: {}

podDisruptionBudget:
  enabled: true
  maxUnavailable: 1

rollout:
  rollingUpdate: {}
  strategy: RollingUpdate

clusterRole:
  create: false
  annotations: {}
  name: ""
  rules:
  - apiGroups:
    - ''
    resources:
    - 'endpoints'
    verbs:
    - 'get'
    - 'list'
    - 'watch'

  clusterRoleBinding:
    annotations: {}
    name: ""

The tail sampler:

nameOverride: ""
fullnameOverride: ""

mode: "deployment"

configMap:
  create: true

config:
  receivers:
    jaeger: null
    zipkin: null
    prometheus: null
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 500
        http: null
  processors:
    batch:
      send_batch_max_size: 8192
    memory_limiter: null
    tail_sampling:
      decision_wait: 60s
      policies:
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
    routing:
      from_attribute: k8s.cluster.name
      attribute_source: resource
      table:
      - value: a
        exporters:
          - otlp/tempo-a
      - value: b
        exporters:
          - otlp/tempo-b
      - value: c
        exporters:
          - otlp/tempo-c
      - value: d
        exporters:
          - otlp/tempo-d
  exporters:
    logging: null
    otlp/tempo-a:
      endpoint: tempo-distributor.tempo.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanaaTempo
    otlp/tempo-b:
      endpoint: tempo-distributor.tempo.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanabTempo
    otlp/tempo-c:
      endpoint: tempo-distributor.tempo.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanacTempo
    otlp/tempo-d:
      endpoint: tempo-distributor.tempo.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanadTempo
  extensions:
    health_check: {}
    memory_ballast:
      size_in_percentage: 33
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
      logs:
        encoding: json
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs: null
      metrics: null
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - tail_sampling
          - batch
          - routing
        exporters:
          - otlp/tempo-a
          - otlp/tempo-b
          - otlp/tempo-c
          - otlp/tempo-d

image:
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  tag: "0.83.0"
  digest: ""
imagePullSecrets: []

# OpenTelemetry Collector executable
command:
  name: otelcol-contrib
  extraArgs: []

nodeSelector:
  component: prometheus
tolerations:
- effect: NoSchedule
  key: kind
  operator: Equal
  value: prometheus
- effect: NoSchedule
  key: "kubernetes.azure.com/scalesetpriority"
  operator: Equal
  value: spot

ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    appProtocol: grpc
  otlp-http:
    enabled: false
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: false
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: false
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: false
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: false
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 100Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"

replicaCount: 2

revisionHistoryLimit: 10

service:
  type: ClusterIP
  clusterIP: None
  annotations: {}

podDisruptionBudget:
  enabled: true
  maxUnavailable: 1

rollout:
  rollingUpdate: {}
  strategy: RollingUpdate

luistilingue · 2023-10-27T18:05:32Z

Could this issue be related to the following problem?

#27080

I'm using version 0.83.0 and I've upgraded to 0.88.0 to check whether it was fixed. I'll come back here to share the results :D

crobert-1 · 2023-11-01T17:22:08Z

@luistilingue Have you been able to test this yet? (Or @lucasoares)

luistilingue · 2023-12-14T01:46:23Z

@crobert-1 The issue still persists, even after updating to 0.91.0.

Could it be related to cache pruning, as described in grafana/agent#5271 and #17306?

That behavior is impacting our usage of otel-collectors :(

luistilingue · 2023-12-14T10:20:03Z

@nijave Could you tell us whether you were able to fix that behavior?

luistilingue · 2023-12-28T17:15:14Z

I got the problem solved. It's related to Mimir HA dedup. After adding external_labels to the prometheusremotewrite exporter, the metric values returned to normal. I think we can close this issue.

crobert-1 · 2024-01-02T16:42:28Z

@lucasoares Can you confirm what @luistilingue has suggested resolves your issue?

lucasoares · 2024-01-04T16:12:31Z

> @lucasoares Can you confirm what @luistilingue has suggested resolves your issue?

Yes

crobert-1 · 2024-01-04T16:33:53Z

I'm going to close the issue for now as it appears to be resolved, but let me know if there's anything else required here.

chewrocca · 2024-01-11T21:13:26Z

> I got the problem solved. It's related to Mimir HA deduplication. After adding external_labels to the prometheusremotewrite exporter, the metric values returned to normal. I think we can close this issue.

Can you elaborate a bit more? We experience similar issues.

nijave · 2024-01-11T21:22:56Z

> > I got the problem solved. It's related to Mimir HA deduplication. After adding external_labels to the prometheusremotewrite exporter, the metric values returned to normal. I think we can close this issue.
>
> Can you elaborate a bit more? We experience similar issues.

https://grafana.com/docs/mimir/latest/configure/configure-high-availability-deduplication/
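
For anyone landing here with the same symptom, the following is a minimal sketch of what the fix described above might look like. The thread does not state which labels were actually added, so everything here is an assumption: the label names `cluster` and `__replica__` are the defaults Mimir's HA tracker looks for, while the endpoint URL, the `cluster` value, and the `POD_NAME` environment variable are hypothetical placeholders.

```yaml
exporters:
  prometheusremotewrite:
    # Hypothetical Mimir push endpoint; replace with your own.
    endpoint: https://mimir.example.com/api/v1/push
    external_labels:
      # Assumed label names: Mimir's HA tracker defaults to
      # `cluster` and `__replica__`. The values are placeholders.
      cluster: otel-collector
      # POD_NAME is a hypothetical env var (e.g. injected via the
      # Kubernetes downward API); it must differ per collector replica.
      __replica__: ${env:POD_NAME}
```

The general idea is that several collector replicas writing the same series with different counter values make the stored value jump up and down, which `rate` and `increase` interpret as counter resets. Labels that distinguish the replicas avoid that. It is worth checking that your collector version passes `__replica__` through unchanged, and whether deduplication is really what you want: if each replica sees different spans, dropping one replica's samples would undercount.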

lucasoares added the bug and needs triage labels Oct 6, 2023

github-actions bot added connector/spanmetrics exporter/prometheusremotewrite labels Oct 6, 2023

crobert-1 added the waiting for author label Dec 7, 2023

crobert-1 removed the waiting for author label Dec 14, 2023

crobert-1 closed this as completed Jan 4, 2024
