Queue piles up and metrics are delayed if targets are unreachable #34229

Open

gbxavier opened this issue Jul 23, 2024 · 8 comments

Component(s)

exporter/prometheusremotewrite, receiver/prometheus

What happened?

Description

I have a dedicated instance that scrapes metrics from pods exposing Prometheus endpoints. The instance has more than enough resources to process a large volume of metrics (resource utilization never reaches 60%). In this setup, most namespaces have network policies that prevent this scraper from reaching the pods found through Kubernetes service discovery.

At first, the metrics reach the target (Grafana Cloud) as expected, but memory consumption immediately starts to grow and the queue size slowly increases until it reaches capacity and enqueueing starts failing.

The number of metrics received and sent remains constant, but over time the delay between a metric being "seen" by the collector and being sent to the backend grows until the last observed data point is hours late (though still received by the backend). This affects every receiver configured on the instance, including the prometheus/self receiver, which has no problem scraping its target.

This behavior only happens when the workload_prometheus receiver is enabled; no other instance suffers from this problem or from any performance or limit issues.

Steps to Reproduce

  • Create an AKS cluster with several namespaces protected by network policies (a minimal example policy is sketched after this list);
  • Install the OpenTelemetry Collector configured to scrape Prometheus endpoints.
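
For reference, a namespace can be isolated from the scraper with a default-deny ingress NetworkPolicy along these lines. The policy and namespace names below are illustrative and not taken from the report:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress       # illustrative name
  namespace: protected-namespace   # illustrative namespace
spec:
  podSelector: {}    # selects every pod in the namespace
  policyTypes:
  - Ingress          # no ingress rules are listed, so all ingress (including scrapes) is denied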

Expected Result

The receiver scrapes metrics from the endpoints it can reach, and those metrics are sent through the Prometheus Remote Write exporter reasonably quickly.

Actual Result

Memory consumption increases over time, and the delay between a metric being "seen" by the collector and being sent to the backend slowly grows until the last observed data point is hours late.

Collector version

0.105.0

Environment information

Environment

OS: AKSUbuntu-2204gen2containerd-202407.03.0
Kubernetes: Azure AKS v1.29.2

OpenTelemetry Collector configuration

apiVersion: v1
data:
  relay: |
    exporters:
      debug:
        sampling_initial: 100
        sampling_thereafter: 10
        verbosity: detailed
      prometheusremotewrite/grafana_cloud:
        auth:
          authenticator: basicauth/grafana_cloud_prometheus
        endpoint: ${env:GRAFANA_CLOUD_PROMETHEUS_URL}
        external_labels:
          cluster: ${env:CLUSTER_NAME}
          deployment.environment: ${env:DEPLOYMENT_ENVIRONMENT}
          k8s.cluster.name: ${env:CLUSTER_NAME}
        max_batch_size_bytes: 3000000
        remote_write_queue:
          num_consumers: 10
          queue_size: 20000
    extensions:
      basicauth/grafana_cloud_prometheus:
        client_auth:
          password: ${env:GRAFANA_COM_API_TOKEN}
          username: "1277373"
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    processors:
      batch: {}
      k8sattributes/extract_app_kubernetes_io:
        extract:
          labels:
          - from: pod
            key: app.kubernetes.io/name
            tag_name: service.name
          - from: pod
            key: app.kubernetes.io/version
            tag_name: service.version
          metadata:
          - k8s.pod.uid
        filter:
          node_from_env_var: KUBE_NODE_NAME
        passthrough: false
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.ip
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
        - sources:
          - from: connection
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
      resourcedetection/aks:
        detectors:
        - env
        - azure
        - aks
        override: false
        timeout: 2s
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            endpoint: ${env:MY_POD_IP}:4318
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${env:MY_POD_IP}:8888
      prometheus/self:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - labels:
                instance: ${env:MY_POD_NAME}
              targets:
              - ${env:MY_POD_IP}:8888
      prometheus/workload_prometheus:
        config:
          global:
            scrape_protocols:
            - PrometheusProto
            - OpenMetricsText1.0.0
            - OpenMetricsText0.0.1
            - PrometheusText0.0.4
            scrape_timeout: 200ms
          scrape_configs:
          - job_name: integrations/kubernetes/workload-prometheus
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - action: keep
              regex: "true"
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $${1}:$${2}
              source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: k8s_namespace_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: k8s_pod_name
    service:
      extensions:
      - health_check
      - basicauth/grafana_cloud_prometheus
      pipelines:
        metrics:
          exporters:
          - prometheusremotewrite/grafana_cloud
          processors:
          - memory_limiter
          - batch
          - k8sattributes/extract_app_kubernetes_io
          - resourcedetection/aks
          receivers:
          - prometheus/self
          - prometheus/workload_prometheus
      telemetry:
        logs:
          level: debug
        metrics:
          address: ${env:MY_POD_IP}:8888
kind: ConfigMap
metadata:
  creationTimestamp: "2024-07-22T13:19:32Z"
  labels:
    app.kubernetes.io/instance: otel-scraper-otel-collectors
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: opentelemetry-collector
    app.kubernetes.io/version: 0.104.0
    argocd.argoproj.io/instance: otel-scraper-otel-collectors
    helm.sh/chart: opentelemetry-collector-0.97.2
  name: otel-scraper-otel-collectors-opentelemetry-collector
  namespace: otel-collectors
  resourceVersion: "1333804215"
  uid: dfb34eec-b6da-4d8d-acf1-f08ee066a6b6

Log output

2024-07-23T16:48:25.892Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.51.54:15020/stats/prometheus", "error": "Get \"http://172.24.51.54:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:25.951Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.32.221:15020/stats/prometheus", "error": "Get \"http://172.24.32.221:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:26.462Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.23.186:15020/stats/prometheus", "error": "Get \"http://172.24.23.186:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:26.490Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.25.53:15020/stats/prometheus", "error": "Get \"http://172.24.25.53:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:26.504Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.50.246:15020/stats/prometheus", "error": "Get \"http://172.24.50.246:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:26.654Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.30.59:15020/stats/prometheus", "error": "Get \"http://172.24.30.59:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:26.701Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.50.164:15020/stats/prometheus", "error": "Get \"http://172.24.50.164:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:30.049Z    debug    memorylimiter/memorylimiter.go:200    Currently used memory.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 419}
2024-07-23T16:48:34.147Z    debug    k8sattributesprocessor@v0.105.0/processor.go:122    evaluating pod identifier    {"kind": "processor", "name": "k8sattributes/extract_app_kubernetes_io", "pipeline": "metrics", "value": [{"Source":{"From":"resource_attribute","Name":"k8s.pod.uid"},"Value":"c3192582-53a8-46e0-bc2b-caaf6e394ab8"},{"Source":{"From":"","Name":""},"Value":""},{"Source":{"From":"","Name":""},"Value":""},{"Source":{"From":"","Name":""},"Value":""}]}
2024-07-23T16:48:34.147Z    debug    k8sattributesprocessor@v0.105.0/processor.go:140    getting the pod    {"kind": "processor", "name": "k8sattributes/extract_app_kubernetes_io", "pipeline": "metrics", "pod": {"Name":"namespace-dotnet-test-6-76965ffc96-2wmt2","Address":"172.24.51.3","PodUID":"c3192582-53a8-46e0-bc2b-caaf6e394ab8","Attributes":{"k8s.pod.uid":"c3192582-53a8-46e0-bc2b-caaf6e394ab8"},"StartTime":"2024-07-17T12:31:13Z","Ignore":false,"Namespace":"namespace-dotnet-test-6","NodeName":"","HostNetwork":false,"Containers":{"ByID":null,"ByName":null},"DeletedAt":"0001-01-01T00:00:00Z"}}
2024-07-23T16:48:34.147Z    debug    k8sattributesprocessor@v0.105.0/processor.go:122    evaluating pod identifier    {"kind": "processor", "name": "k8sattributes/extract_app_kubernetes_io", "pipeline": "metrics", "value": [{"Source":{"From":"resource_attribute","Name":"k8s.pod.uid"},"Value":"1b444434-4035-4102-8835-79ff20c2c9bc"},{"Source":{"From":"","Name":""},"Value":""},{"Source":{"From":"","Name":""},"Value":""},{"Source":{"From":"","Name":""},"Value":""}]}
2024-07-23T16:48:34.147Z    debug    k8sattributesprocessor@v0.105.0/processor.go:140    getting the pod    {"kind": "processor", "name": "k8sattributes/extract_app_kubernetes_io", "pipeline": "metrics", "pod": {"Name":"namespace-2-684876b69c-6ksq5","Address":"172.24.52.18","PodUID":"1b444434-4035-4102-8835-79ff20c2c9bc","Attributes":{"k8s.pod.uid":"1b444434-4035-4102-8835-79ff20c2c9bc"},"StartTime":"2024-07-19T11:03:36Z","Ignore":false,"Namespace":"namespace-test-2","NodeName":"","HostNetwork":false,"Containers":{"ByID":null,"ByName":null},"DeletedAt":"0001-01-01T00:00:00Z"}}
2024-07-23T16:48:34.603Z    debug    scrape/scrape.go:1352    Append failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.52.99:15020/stats/prometheus", "error": "expected a valid start token, got \"\\xb0\" (\"INVALID\") while parsing: \"\\xb0\""}
2024-07-23T16:48:34.604Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314573, "target_labels": "{__name__=\"up\", instance=\"172.24.52.99:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-1\", k8s_pod_name=\"pod-1\"}"}
2024-07-23T16:48:34.666Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314465, "target_labels": "{__name__=\"up\", instance=\"172.24.51.51:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-12\", k8s_pod_name=\"pod-12\"}"}
2024-07-23T16:48:34.673Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314473, "target_labels": "{__name__=\"up\", instance=\"172.24.52.188:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-13\", k8s_pod_name=\"pod-13\"}"}
2024-07-23T16:48:34.689Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314484, "target_labels": "{__name__=\"up\", instance=\"172.24.52.193:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-4\", k8s_pod_name=\"pod-4\"}"}
2024-07-23T16:48:34.858Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314656, "target_labels": "{__name__=\"up\", instance=\"172.24.15.122:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-14\", k8s_pod_name=\"pod-14\"}"}
2024-07-23T16:48:34.874Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314673, "target_labels": "{__name__=\"up\", instance=\"172.24.27.189:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-15\", k8s_pod_name=\"pod-15\"}"}
2024-07-23T16:48:34.893Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314688, "target_labels": "{__name__=\"up\", instance=\"172.24.50.81:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-17\", k8s_pod_name=\"pod-17\"}"}
2024-07-23T16:48:35.048Z    debug    memorylimiter/memorylimiter.go:200    Currently used memory.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 465}
2024-07-23T16:48:35.049Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314847, "target_labels": "{__name__=\"up\", instance=\"172.24.30.98:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-20\", k8s_pod_name=\"pod-20\"}"}
2024-07-23T16:48:35.109Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753314908, "target_labels": "{__name__=\"up\", instance=\"172.24.62.137:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-16\", k8s_pod_name=\"pod-16\"}"}
2024-07-23T16:48:35.221Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.30.87:15020/stats/prometheus", "error": "Get \"http://172.24.30.87:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:35.221Z    warn    internal/transaction.go:125    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_timestamp": 1721753315019, "target_labels": "{__name__=\"up\", instance=\"172.24.30.87:15020\", job=\"integrations/kubernetes/workload-prometheus\", k8s_namespace_name=\"namespace-14\", k8s_pod_name=\"pod-14\"}"}
2024-07-23T16:48:35.229Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.10.220:15020/stats/prometheus", "error": "Get \"http://172.24.10.220:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:35.322Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.62.204:15020/stats/prometheus", "error": "Get \"http://172.24.62.204:15020/stats/prometheus\": context deadline exceeded"}
2024-07-23T16:48:35.354Z    debug    scrape/scrape.go:1337    Scrape failed    {"kind": "receiver", "name": "prometheus/workload_prometheus", "data_type": "metrics", "scrape_pool": "integrations/kubernetes/workload-prometheus", "target": "http://172.24.52.80:15020/stats/prometheus", "error": "Get \"http://172.24.52.80:15020/stats/prometheus\": context deadline exceeded"}

Additional context

The resource utilization is low, but memory grows over time up to 50% of the configured limit, specified below.

resources:
  limits:
    cpu: 2
    memory: 8Gi
  requests:
    cpu: 500m
    memory: 1024Mi

Screenshot with metrics from this scraper instance (captured 2024-07-23 19:07).

gbxavier added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jul 23, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dashpole removed the receiver/prometheus (Prometheus receiver) label on Jul 29, 2024
@dashpole (Contributor)

This doesn't look like an issue with the receiver. Does increasing the number of consumers help?

@gbxavier (Author) commented Aug 2, 2024

No, it does not help. I think it's a performance issue with the exporter. After further experimentation, increasing the batch size to 50000 stopped the queue problems.
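
The comment doesn't include the exact configuration change, but assuming the batch processor was the setting being tuned (rather than the exporter's max_batch_size_bytes), a sketch of it would look roughly like this; the values are illustrative:

processors:
  batch:
    send_batch_size: 50000      # number of items that triggers a flush
    send_batch_max_size: 50000  # upper bound on a single batch (0 means unlimited)
    timeout: 5s                 # flush after this interval even if the batch is not full

Larger batches mean fewer remote-write requests per interval, which matters here because the exporter drains its queue with a single consumer, as discussed below.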

@dashpole (Contributor)

Looks like the number of consumers is hard-coded to 1, which would explain why bumping that up doesn't help...

// Don't allow users to configure the queue.
// See https://github.com/open-telemetry/opentelemetry-collector/issues/2949.
// Prometheus remote write samples needs to be in chronological
// order for each timeseries. If we shard the incoming metrics
// without considering this limitation, we experience
// "out of order samples" errors.
exporter, err := exporterhelper.NewMetricsExporter(
    ctx,
    set,
    cfg,
    prwe.PushMetrics,
    exporterhelper.WithTimeout(prwCfg.TimeoutSettings),
    exporterhelper.WithQueue(exporterhelper.QueueSettings{
        Enabled:      prwCfg.RemoteWriteQueue.Enabled,
        NumConsumers: 1,
        QueueSize:    prwCfg.RemoteWriteQueue.QueueSize,
    }),
    exporterhelper.WithStart(prwe.Start),
    exporterhelper.WithShutdown(prwe.Shutdown),
)

We should probably emit a warning for people who try to configure this.

@dashpole (Contributor)

So increasing the batch size is probably the only resolution we can offer, which you found.

Action items:

  • Comment in the docs
  • Warning in the logs when people try to set num_consumers

@dashpole (Contributor)

@ArthurSens

@ArthurSens (Contributor)

I'm struggling a bit to understand how to log a warning. Which object has a logger available during exporter creation? I can't find any 🤔

@dashpole (Contributor) commented Sep 3, 2024

You can use exporter.Settings:

func createMetricsExporter(ctx context.Context, set exporter.Settings,

That is https://pkg.go.dev/go.opentelemetry.io/collector/exporter@v0.108.1/internal#Settings

It includes a Logger: https://pkg.go.dev/go.opentelemetry.io/collector/component#TelemetrySettings

So you can do set.Logger.Info(...)
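
Building on that, a minimal sketch of such a warning, assuming a hypothetical helper called from createMetricsExporter and the exporter's existing Config/RemoteWriteQueue fields; the exact code merged later in #34993 may differ:

// Hedged sketch, not necessarily the code merged in #34993: warn when
// remote_write_queue.num_consumers is set, because the exporter always runs a
// single queue consumer to keep remote-write samples in chronological order.
// warnOnIgnoredQueueSettings is a hypothetical helper name.
func warnOnIgnoredQueueSettings(set exporter.Settings, prwCfg *Config) {
    if prwCfg.RemoteWriteQueue.NumConsumers != 0 {
        set.Logger.Warn(
            "remote_write_queue.num_consumers is a no-op: the exporter uses " +
                "exactly one consumer so samples stay in chronological order per series",
        )
    }
}

set.Logger works because exporter.Settings embeds component.TelemetrySettings, which carries the *zap.Logger.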

codeboten pushed a commit that referenced this issue Sep 5, 2024
…te_write_queue.num_consumers being a no-op (#34993)

**Description:** This PR documents and adds a warning log if
remote_write_queue.num_consumers is set in the
prometheusremotewriteexporter's config.

Current behavior already doesn't use the configuration for anything,
more information can be found in
open-telemetry/opentelemetry-collector#2949

**Link to tracking Issue:** Related to #34229 (not a fix)

Should we skip changelog here?

Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
f7o pushed a commit to f7o/opentelemetry-collector-contrib that referenced this issue Sep 12, 2024
…te_write_queue.num_consumers being a no-op (open-telemetry#34993)