spanmetrics grow indefinitely #5271

Closed
nijave opened this issue Sep 21, 2023 · 8 comments · Fixed by #5529
Assignees: ptodev
Labels: bug (Something isn't working), frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.)

Comments

nijave commented Sep 21, 2023

What's wrong?

The number of spanmetrics series appears to grow indefinitely.
[screenshots: spanmetrics series count increasing steadily over time]

I think this might be related to #4614--maybe flushing is only set up if you use flow mode?

Steps to reproduce

Leave the Grafana Agent running with spanmetrics enabled for a few days.

System information

No response

Software version

0.36.1

Configuration

traces:
  configs:
  - name: default
    attributes:
      actions:
        - key: traces
          action: upsert
          value: root
    remote_write:
      - endpoint: tempo-prod-09-us-central2.grafana.net:443
        basic_auth:
          username: 123
          password_file: /etc/tempo/tempo-api-token
        sending_queue:
          queue_size: 50000
    receivers:
      otlp:
        protocols:
          grpc:
            keepalive:
              server_parameters:
                max_connection_idle: 2m
                max_connection_age: 10m
          http: null
    service_graphs:
      enabled: true
    spanmetrics:
      handler_endpoint: "0.0.0.0:8889"
      # https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/2fc0da01638047b471765ba7b13910e32d7abdf0/processor/servicegraphprocessor/processor.go#L47
      # default 2, 4, 6, 8, 10, 50, 100, 200, 400, 800, 1000, 1400, 2000, 5000, 10_000, 15_000
      latency_histogram_buckets: [5ms, 15ms, 35ms, 150ms, 250ms, 500ms, 1s, 5s, 30s]
      dimensions_cache_size: 1000
      dimensions:
      - name: http.status_code
      - name: net.peer.name

    scrape_configs:
      - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - action: replace
            source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
        tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
    load_balancing:
      receiver_port: 8080
      exporter:
        insecure: true
      resolver:
        dns:
          hostname: grafana-agent-traces-headless
          port: 8080
    # see https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
    tail_sampling:
      policies:
        - type: probabilistic
          probabilistic:
            sampling_percentage: 10

logs:
  configs:
  - name: default
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: https://logs-prod-us-central2.grafana.net/loki/api/v1/push
        basic_auth:
          username: 123
          password_file: /etc/loki/loki-api-token
        external_labels:
          cluster: dev
server:
  log_level: info

Logs

No response

nijave added the bug label on Sep 21, 2023
ptodev self-assigned this on Sep 22, 2023
ptodev (Contributor) commented Sep 22, 2023

Hello, thank you for reporting this. I will take a look next week.

nijave (Author) commented Sep 22, 2023

It seems like the metrics_flush_interval default value of 15s is already being applied, so it's possible something isn't working in upstream OTel, or the functionality doesn't work the way it suggests.

https://github.com/grafana/agent/compare/main...nijave:grafana-agent:d59717e8-metrics-flush?expand=1 (the HEAD~1 commit adds the fields)

nijave (Author) commented Sep 22, 2023

I see what the issue is now. spanmetricsprocessor caches and clears dimensions in an LRU map, but it never clears the histograms, which get converted into the metric series.

The dimensions cache gets reset here:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L260

For metric series to go away, ideally spanmetricsprocessor.histogram would store a lastUpdatedTime with a configurable prune setting, or spanmetricsprocessor.processorImp.histograms would get pruned at the same time as metricKeyToDimensionsCache.
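
This isn't the actual Collector code, but a rough Go sketch of the second option--dropping a key's histogram when the dimensions LRU evicts it--using simplified stand-in types and hashicorp/golang-lru/v2 purely for illustration:

```go
package spanmetricsprune

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

type metricKey string

// histogram is a stand-in for the per-series state the processor accumulates.
type histogram struct {
	count        uint64
	sum          float64
	bucketCounts []uint64
}

type processorImp struct {
	histograms map[metricKey]*histogram
	dimensions *lru.Cache[metricKey, map[string]string]
}

// newProcessorImp wires the dimensions LRU so that evicting a key also deletes
// its histogram. That keeps the two maps in sync and bounds the number of
// exported series at the cache size instead of letting them grow forever.
func newProcessorImp(cacheSize int) (*processorImp, error) {
	p := &processorImp{histograms: make(map[metricKey]*histogram)}

	cache, err := lru.NewWithEvict[metricKey, map[string]string](cacheSize,
		func(key metricKey, _ map[string]string) {
			// Prune the histogram together with its cached dimensions.
			delete(p.histograms, key)
		})
	if err != nil {
		return nil, err
	}

	p.dimensions = cache
	return p, nil
}
```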

nijave (Author) commented Oct 6, 2023

Here's some data from our dev environment with agent v0.36.2 built with open-telemetry/opentelemetry-collector-contrib#27083 cherry-picked.

The leftmost grouping is a "long-running" agent (up for a few days). The middle grouping is v0.36.1 after a restart. The right grouping is agents running the patched v0.36.2.

Metric series: [screenshot]

CPU: [screenshot]

Memory: [screenshot]

Metric series over a much longer window (showing indefinite growth): [screenshot]. Each drop-off is an agent restart. Not sure what the giant spike is--I assume it's related to some testing we were doing in dev.

tpaschalis modified the milestones: v0.37.0, v0.38.0 on Oct 6, 2023
rfratto (Member) commented Oct 11, 2023

@ptodev Is this still on your radar? Should we clear the assignment?

ptodev (Contributor) commented Oct 12, 2023

Yes, I unassigned myself. @nijave recently fixed this issue in the Collector, but unfortunately upgrading OpenTelemetry is such a major effort that I am not able to do it right away. I'd rather unassign myself so that we can prioritise this separately later. Maybe someone else can pick it up as well.

I just raised #5467 to update the Collector dependencies to a version recent enough to pick up this bugfix. When #5467 is done, we can close this issue.

ptodev (Contributor) commented Oct 13, 2023

It is worth mentioning that this should not be an issue for Flow's otelcol.connector.spanmetrics.

As per a comment on the OTel issue, the spanmetrics connector doesn't have this problem: the Agent Flow component is based on the connector, whereas the Agent Static mode component is based on the processor.
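
For reference, a Flow pipeline built around the connector looks roughly like this. This is only a sketch: the component labels and wiring are illustrative rather than taken from this issue, and attribute names may differ slightly between Agent versions.

```river
// Traces are wired into otelcol.connector.spanmetrics.default.input by an
// otelcol receiver or processor (not shown here).
otelcol.connector.spanmetrics "default" {
  histogram {
    explicit {
      buckets = ["5ms", "15ms", "35ms", "150ms", "250ms", "500ms", "1s", "5s", "30s"]
    }
  }

  dimension {
    name = "http.status_code"
  }

  dimension {
    name = "net.peer.name"
  }

  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

// Converts the generated metrics and forwards them to a prometheus.remote_write
// component, assumed to be defined elsewhere in the configuration.
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.default.receiver]
}
```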

ptodev self-assigned this on Oct 19, 2023
ptodev (Contributor) commented Oct 19, 2023

Hi, @nijave! We have to upgrade the Agent to version 0.86 of the OTel Collector in order to pick up a few security patches. We decided to upgrade to the latest version, 0.87, and pick up your patch as well. It will go in via #5529 and will ship in the next Agent release on 2023-11-21. I hope you don't mind that I listed you in our changelog in #5529. Thank you for making the Agent and the Collector better!

github-actions bot added the frozen-due-to-age label on Feb 21, 2024
github-actions bot locked this issue as resolved and limited conversation to collaborators on Feb 21, 2024