spanmetrics grow indefinitely #5271

Closed
nijave opened this issue Sep 21, 2023 · 8 comments · Fixed by #5529
Assignees: ptodev
Labels: bug (Something isn't working), frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.)

Comments

nijave commented Sep 21, 2023

What's wrong?

The number of spanmetrics series appears to grow indefinitely.
[screenshots: spanmetrics series count increasing steadily over time]

I think this might be related to #4614--maybe flushing is only set up if you use flow mode?

Steps to reproduce

Leave the Grafana Agent running with spanmetrics enabled for a few days.

System information

No response

Software version

0.36.1

Configuration

traces:
  configs:
  - name: default
    attributes:
      actions:
        - key: traces
          action: upsert
          value: root
    remote_write:
      - endpoint: tempo-prod-09-us-central2.grafana.net:443
        basic_auth:
          username: 123
          password_file: /etc/tempo/tempo-api-token
        sending_queue:
          queue_size: 50000
    receivers:
      otlp:
        protocols:
          grpc:
            keepalive:
              server_parameters:
                max_connection_idle: 2m
                max_connection_age: 10m
          http: null
    service_graphs:
      enabled: true
    spanmetrics:
      handler_endpoint: "0.0.0.0:8889"
      # https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/2fc0da01638047b471765ba7b13910e32d7abdf0/processor/servicegraphprocessor/processor.go#L47
      # default 2, 4, 6, 8, 10, 50, 100, 200, 400, 800, 1000, 1400, 2000, 5000, 10_000, 15_000
      latency_histogram_buckets: [5ms, 15ms, 35ms, 150ms, 250ms, 500ms, 1s, 5s, 30s]
      dimensions_cache_size: 1000
      dimensions:
      - name: http.status_code
      - name: net.peer.name

    scrape_configs:
      - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - action: replace
            source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
        tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
    load_balancing:
      receiver_port: 8080
      exporter:
        insecure: true
      resolver:
        dns:
          hostname: grafana-agent-traces-headless
          port: 8080
    # see https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
    tail_sampling:
      policies:
        - type: probabilistic
          probabilistic:
            sampling_percentage: 10

logs:
  configs:
  - name: default
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: https://logs-prod-us-central2.grafana.net/loki/api/v1/push
        basic_auth:
          username: 123
          password_file: /etc/loki/loki-api-token
        external_labels:
          cluster: dev
server:
  log_level: info

Logs

No response

nijave added the bug label on Sep 21, 2023
ptodev self-assigned this on Sep 22, 2023
ptodev (Contributor) commented Sep 22, 2023

Hello, thank you for reporting this. I will take a look next week.

nijave (Author) commented Sep 22, 2023

It seems like the metrics_flush_interval default value of 15s is already being applied, so it's possible something isn't working in upstream OTel, or the functionality doesn't work the way it suggests.

https://github.com/grafana/agent/compare/main...nijave:grafana-agent:d59717e8-metrics-flush?expand=1 (the HEAD~1 commit adds the fields)

nijave (Author) commented Sep 22, 2023

I see what the issue is now. spanmetricsprocessor caches and clears dimensions in an LRU map, but it never clears the histograms, which get converted into the metric series.

The dimensions cache gets reset here:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L260

For metric series to go away, ideally spanmetricsprocessor.histogram would store a lastUpdatedTime with a configurable prune setting, or spanmetricsprocessor.processorImp.histograms would get pruned at the same time as metricKeyToDimensionsCache.
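
This isn't the actual Collector code, but a rough Go sketch of the second option--dropping a key's histogram when the dimensions LRU evicts it--using simplified stand-in types and hashicorp/golang-lru/v2 purely for illustration:

```go
package spanmetricsprune

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

type metricKey string

// histogram is a stand-in for the per-series state the processor accumulates.
type histogram struct {
	count        uint64
	sum          float64
	bucketCounts []uint64
}

type processorImp struct {
	histograms map[metricKey]*histogram
	dimensions *lru.Cache[metricKey, map[string]string]
}

// newProcessorImp wires the dimensions LRU so that evicting a key also deletes
// its histogram. That keeps the two maps in sync and bounds the number of
// exported series at the cache size instead of letting them grow forever.
func newProcessorImp(cacheSize int) (*processorImp, error) {
	p := &processorImp{histograms: make(map[metricKey]*histogram)}

	cache, err := lru.NewWithEvict[metricKey, map[string]string](cacheSize,
		func(key metricKey, _ map[string]string) {
			// Prune the histogram together with its cached dimensions.
			delete(p.histograms, key)
		})
	if err != nil {
		return nil, err
	}

	p.dimensions = cache
	return p, nil
}
```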

nijave (Author) commented Oct 6, 2023

Here's some data from our dev environment with agent v0.36.2 built with open-telemetry/opentelemetry-collector-contrib#27083 cherry-picked.

The leftmost grouping is a "long-running" agent (up for a few days). The middle grouping is v0.36.1 after a restart. The right grouping is agents running the patched v0.36.2.

Metric series: [screenshot]

CPU: [screenshot]

Memory: [screenshot]

Metric series over a much longer window (showing indefinite growth): [screenshot]. Each drop-off is an agent restart. Not sure what the giant spike is--I assume it's related to some testing we were doing in dev.

tpaschalis modified the milestones: v0.37.0, v0.38.0 on Oct 6, 2023
rfratto (Member) commented Oct 11, 2023

@ptodev Is this still on your radar? Should we clear the assignment?

ptodev (Contributor) commented Oct 12, 2023

Yes, I unassigned myself. @nijave recently fixed this issue in the Collector, but unfortunately upgrading OpenTelemetry is such a major effort that I am not able to do it right away. I'd rather unassign myself so that we can prioritise this separately later. Maybe someone else can pick it up as well.

I just raised #5467 to update the Collector dependencies to a version recent enough to pick up this bugfix. When #5467 is done, we can close this issue.

ptodev (Contributor) commented Oct 13, 2023

It is worth mentioning that this should not be an issue for Flow's otelcol.connector.spanmetrics.

As per a comment on the OTel issue, the spanmetrics connector doesn't have this problem: the Agent Flow component is based on the connector, whereas the Agent Static mode component is based on the processor.
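
For reference, a Flow pipeline built around the connector looks roughly like this. This is only a sketch: the component labels and wiring are illustrative rather than taken from this issue, and attribute names may differ slightly between Agent versions.

```river
// Traces are wired into otelcol.connector.spanmetrics.default.input by an
// otelcol receiver or processor (not shown here).
otelcol.connector.spanmetrics "default" {
  histogram {
    explicit {
      buckets = ["5ms", "15ms", "35ms", "150ms", "250ms", "500ms", "1s", "5s", "30s"]
    }
  }

  dimension {
    name = "http.status_code"
  }

  dimension {
    name = "net.peer.name"
  }

  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

// Converts the generated metrics and forwards them to a prometheus.remote_write
// component, assumed to be defined elsewhere in the configuration.
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.default.receiver]
}
```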

ptodev self-assigned this on Oct 19, 2023
ptodev (Contributor) commented Oct 19, 2023

Hi, @nijave! We have to upgrade the Agent to version 0.86 of the OTel Collector in order to pick up a few security patches. We decided to upgrade to the latest version, 0.87, and pick up your patch as well. It will go in via #5529 and will ship in the next Agent release on 2023-11-21. I hope you don't mind that I listed you in our changelog in #5529. Thank you for making the Agent and the Collector better!

github-actions bot added the frozen-due-to-age label on Feb 21, 2024
github-actions bot locked this issue as resolved and limited conversation to collaborators on Feb 21, 2024