[Prometheus Receiver] Prometheus Receiver configuration for etcd, kube-scheduler and kube-controller in standard K8s cluster not working #34211

Closed
developer1622 opened this issue Jul 23, 2024 · 8 comments
Labels
bug Something isn't working question Further information is requested receiver/prometheus Prometheus receiver

Comments

@developer1622

Component(s)

receiver/prometheus

What happened?

Description

Please bear with the long description; the actual problem is short.

I am trying to scrape Prometheus metrics from etcd, kube-scheduler and kube-controller-manager. However, scraping fails with errors. I have tried multiple relabel configurations to get the target address correct, but it is still not working.

I exec'd into the pod and used curl against the respective pod IP targets; all of them worked, but with the scrape config it does not work.

Steps to Reproduce

Place the following scrape config under the receivers section of the collector configuration:


    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: etcd
              scheme: https
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: keep
                  source_labels:
                    - __meta_kubernetes_namespace
                    - __meta_kubernetes_pod_name
                  separator: "/"
                  regex: "kube-system/etcd.+"

                # This did not work:
                # - source_labels:
                #     - __address__
                #   action: replace
                #   target_label: __address__
                #   regex: (.+?)(\\:\\d)?
                #   replacement: $1:2379

                # Specify the port
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  replacement: ${__meta_kubernetes_pod_ip}:2379
                  # Other replacement values I tried; none of them worked:
                  # replacement: $1:2379
                  # replacement: ${1}:2379

              tls_config:
                insecure_skip_verify: true
                ca_file: /etc/etcd/ca.crt
                cert_file: /etc/etcd/server.crt
                key_file: /etc/etcd/server.key

            - job_name: kube-controller-manager
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-controller-manager;control-plane

                # Replace the address to use the pod IP with port 10257
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  # replacement: $1:10259
                  replacement: ${__meta_kubernetes_pod_ip}:10259
                  # Other replacement values I tried; none of them worked:
                  # replacement: ${1}:10259

                  

            - job_name: kube-scheduler
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-scheduler;control-plane

                # Replace the address to use the pod IP with port 10250
                # - source_labels: [__meta_kubernetes_pod_ip]
                #   action: replace
                #   target_label: __address__
                #   # regex: (.*)
                #   regex: ^(.*)$
                  # replacement: ${__meta_kubernetes_pod_ip}:10257

                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: ^(.*)$  # Captures the entire IP address
                  replacement: ${1}:10257
                  replacement: ${__meta_kubernetes_pod_ip}:10257
                  # Other replacement values I tried; none of them worked:
                  # replacement: ${1}:10257


Expected Result

Metrics from all three components should be scraped and show up wherever they are exported.

Actual Result

I tried multiple configurations and got a different error for each; all of them are posted here. First:



2024-07-23T07:19:14.872Z        warn    expandconverter@v0.102.1/expand.go:107  Configuration references unset environment variable     {"name": "__meta_kubernetes_pod_ip"}
2024-07-23T07:19:14.872Z        warn    expandconverter@v0.102.1/expand.go:107  Configuration references unset environment variable     {"name": "__meta_kubernetes_pod_ip"}
Error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/07/23 07:19:14 collector server run finished with error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

Second:

Error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/07/23 07:21:46 collector server run finished with error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

Third:

It seems the complete target address is not being built: in the errors below, the instance label contains only the port, with no IP, for all 3 components.

2024-07-23T07:24:25.931Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719465929, "target_labels": "{__name__=\"up\", instance=\":2379\", job=\"etcd\"}"}


2024-07-23T07:24:33.006Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719473005, "target_labels": "{__name__=\"up\", instance=\":10257\", job=\"kube-controller-manager\"}"}


2024-07-23T07:25:17.586Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719517584, "target_labels": "{__name__=\"up\", instance=\":10259\", job=\"kube-scheduler\"}"}

Collector version

Latest contrib image: otel/opentelemetry-collector-contrib:latest

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 22")
It is a multi-control-plane K8s cluster with 3 control plane nodes, so I have 3 etcd instances, 3 kube-controller-managers and 3 kube-schedulers.

OpenTelemetry Collector configuration

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: otelcontribcol
  name: otelcontribcol
  namespace: default
data:
  config.yaml: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: etcd
              scheme: https
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: keep
                  source_labels:
                    - __meta_kubernetes_namespace
                    - __meta_kubernetes_pod_name
                  separator: "/"
                  regex: "kube-system/etcd.+"

                # - source_labels:
                #     - __address__
                #   action: replace
                #   target_label: __address__
                #   regex: (.+?)(\\:\\d)?
                #   replacement: $1:2379

                # Specify the port
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  # replacement: $1:2379
                  replacement: ${__meta_kubernetes_pod_ip}:2379

              tls_config:
                insecure_skip_verify: true
                ca_file: /etc/etcd/ca.crt
                cert_file: /etc/etcd/server.crt
                key_file: /etc/etcd/server.key

            - job_name: kube-controller-manager
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-controller-manager;control-plane

                # Replace the address to use the pod IP with port 10257
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  # replacement: $1:10257
                  replacement: ${__meta_kubernetes_pod_ip}:10257
                  

            - job_name: kube-scheduler
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-scheduler;control-plane

                # Replace the address to use the pod IP with port 10250
                # - source_labels: [__meta_kubernetes_pod_ip]
                #   action: replace
                #   target_label: __address__
                #   # regex: (.*)
                #   regex: ^(.*)$
                # replacement: ${__meta_kubernetes_pod_ip}:10257

                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: ^(.*)$  # Captures the entire IP address
                  # replacement: ${1}:10257
                  replacement: ${__meta_kubernetes_pod_ip}:10259



    processors:
      batch:
        timeout: 1s
        send_batch_size: 1000
        send_batch_max_size: 2000

    exporters:
      debug:
        verbosity: detailed

    service:
      telemetry:
        metrics:
          address: 0.0.0.0:8881
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [debug]

Log output

More logs are attached above.


2024-07-23T07:24:25.931Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719465929, "target_labels": "{__name__=\"up\", instance=\":2379\", job=\"etcd\"}"}


2024-07-23T07:24:33.006Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719473005, "target_labels": "{__name__=\"up\", instance=\":10257\", job=\"kube-controller-manager\"}"}


2024-07-23T07:25:17.586Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719517584, "target_labels": "{__name__=\"up\", instance=\":10259\", job=\"kube-scheduler\"}"}

Additional context

It is a multi-control-plane K8s cluster.

I have tried to build the targets, but it does not seem to work. If my scrape config is not correct, please share a correct scrape config for these 3 K8s components.

Here are the pods of all 3 components:

kube-scheduler-master01                    1/1     Running       8 (18d ago)   237d   component=kube-scheduler,tier=control-plane
kube-scheduler-master02                    1/1     Running       4 (43d ago)   237d   component=kube-scheduler,tier=control-plane
kube-scheduler-master03                    1/1     Running       6 (18d ago)   237d   component=kube-scheduler,tier=control-plane

kube-controller-manager-master01           1/1     Running       8 (18d ago)   237d   component=kube-controller-manager,tier=control-plane
kube-controller-manager-master02           1/1     Running       4 (43d ago)   237d   component=kube-controller-manager,tier=control-plane
kube-controller-manager-master03           1/1     Running       6 (18d ago)   237d   component=kube-controller-manager,tier=control-plane

etcd-master01                              1/1     Running       4 (27d ago)   237d   component=etcd,tier=control-plane
etcd-master02                              1/1     Running       2 (43d ago)   237d   component=etcd,tier=control-plane
etcd-master03                              1/1     Running       1 (27d ago)   237d   component=etcd,tier=control-plane

Thank you

@developer1622 developer1622 added bug Something isn't working needs triage New item requiring triage labels Jul 23, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the receiver/prometheus Prometheus receiver label Jul 23, 2024
@developer1622
Author

With my standard Prometheus deployment (YAML attached below), I can see the targets below being built, but the OTel Prometheus receiver throws errors.

[Image: Screenshot 2024-07-23 at 3 59 44 PM]

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: prometheus
data:
  prometheus.yml: |
    global:
      scrape_interval: 2m
      evaluation_interval: 2m
    scrape_configs:
      - job_name: etcd
        scheme: https
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only etcd pods in the kube-system namespace
          - action: keep
            source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
            separator: /
            regex: "kube-system/etcd.+"

          # Replace the address to use the pod IP with port 2379
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            regex: (.*)
            replacement: $1:2379

        tls_config:
          insecure_skip_verify: true
          ca_file: /etc/etcd/ca.crt
          cert_file: /etc/etcd/server.crt
          key_file: /etc/etcd/server.key

      - job_name: kube-controller-manager
        honor_labels: true
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - kube-system
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        scheme: https
        tls_config:
          insecure_skip_verify: true
        relabel_configs:
          # Keep pods with the specified labels
          - source_labels: [__meta_kubernetes_pod_label_component, __meta_kubernetes_pod_label_tier]
            action: keep
            regex: kube-controller-manager;control-plane

          # Replace the address to use the pod IP with port 10257
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            regex: (.*)
            replacement: $1:10257

      - job_name: kube-scheduler
        honor_labels: true
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - kube-system
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        scheme: https
        tls_config:
          insecure_skip_verify: true
        relabel_configs:
          # Keep pods with the specified labels
          - source_labels: [__meta_kubernetes_pod_label_component, __meta_kubernetes_pod_label_tier]
            action: keep
            regex: kube-scheduler;control-plane

          # Replace the address to use the pod IP with port 10259
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            regex: (.*)
            replacement: $1:10259

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus-cont
          image: prom/prometheus
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus/prometheus.yml
              subPath: prometheus.yml
            - mountPath: /etc/etcd
              name: etcd-certs
          ports:
            - containerPort: 9090
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - configMap:
            name: etcd-certs
          name: etcd-certs
      hostNetwork: true
      serviceAccount: otelcontribcol
      serviceAccountName: otelcontribcol
---
kind: Service
apiVersion: v1
metadata:
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - name: promui
      nodePort: 30900
      protocol: TCP
      port: 9090
      targetPort: 9090
  type: NodePort


Thank you.

@dashpole
Contributor

I skimmed the issue, so apologies if I missed this. The OTel collector interprets $1 as an environment variable reference. You need to escape it as $$1.
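
For reference, a minimal sketch of the escaped relabel rule inside the collector's Prometheus receiver, using the etcd job from the report. The only change from the plain Prometheus config is `$$1` in place of `$1`; the TLS settings and the keep rule from the report are omitted for brevity, and the same substitution would apply to the port 10257 and 10259 rules of the other two jobs.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: etcd
          scheme: https
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_ip]
              action: replace
              target_label: __address__
              regex: (.*)
              # The collector turns "$$" into a literal "$", so the
              # Prometheus relabeling code receives "$1:2379" and
              # substitutes the first capture group (the pod IP).
              replacement: $$1:2379
```

With an unescaped `$1` or `${1}`, the collector tries to resolve an environment variable named `1`, which matches the "environment variable "1" has invalid name" error in the report.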

@dashpole
Contributor

LMK if that was your issue, or if I misread

@developer1622
Author

Hi @dashpole.

Thank you very much for the response; it worked after using two dollar signs ($$). You really saved me.

So, does whatever works in standard Prometheus need tweaking to run in the OTel Prometheus receiver?

Is the deviation in OTel from the standard Prometheus scrape config something architecturally specific that end users need to know?

Thank you.

@dashpole
Contributor

The difference exists because the Prometheus server config doesn't support environment variables, but the OTel collector config does.
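
To illustrate the two behaviours side by side, here is a hedged sketch of a scrape config that uses both an environment variable reference and an escaped capture group. `ETCD_CA_FILE` is a hypothetical variable name introduced only for this example; it does not appear in the issue.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: etcd
          scheme: https
          kubernetes_sd_configs:
            - role: pod
          tls_config:
            # Expanded by the collector before the Prometheus config is parsed.
            # ETCD_CA_FILE is a hypothetical environment variable.
            ca_file: ${env:ETCD_CA_FILE}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_ip]
              action: replace
              target_label: __address__
              regex: (.*)
              # Escaped, so the collector leaves a literal "$1" for
              # Prometheus relabeling to fill in with the pod IP.
              replacement: $$1:2379
```

This also explains the `instance=":2379"` targets in the report: `${__meta_kubernetes_pod_ip}` was treated as an unset environment variable and expanded to an empty string, leaving only the port.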

@developer1622
Author

developer1622 commented Aug 1, 2024

Hi @dashpole, I forgot to ask one more question.

I have the kube-scheduler scrape config below, which is working fine:

            - job_name: kube-scheduler
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-scheduler;control-plane

                # Replace the address to use the pod IP with port 10259
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  replacement: $$1:10259

With this configuration, I can see metrics from only one scheduler instance (I have 3 control plane nodes, so I have 3 schedulers).

Is this expected behaviour in multi-control-plane (multi-master) K8s clusters? Scrapes of the other 2 control plane schedulers are failing, while one of them succeeds.

Here is a sample log of the 2 failing instances:

Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1722538620170, "target_labels": "{__name__=\"up\", instance=\".11:10259\", job=\"kube-scheduler\"}"}

Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1722538620179, "target_labels": "{__name__=\"up\", instance=\".10:10259\", job=\"kube-scheduler\"}"}

Thank you.

@developer1622 developer1622 reopened this Aug 1, 2024
@dashpole
Contributor

dashpole commented Aug 1, 2024

I would expect 3 metrics. Try raising the logging verbosity to DEBUG to see the detailed scrape failure reason
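
A sketch of what raising the verbosity could look like, assuming the `service.telemetry` section from the configuration in the report; the `logs` block is the only addition.

```yaml
service:
  telemetry:
    logs:
      # Default level is info; debug should surface the underlying
      # scrape error for each failing target.
      level: debug
    metrics:
      address: 0.0.0.0:8881
```

The "Failed to scrape Prometheus endpoint" warnings only name the target, so the debug output is where the underlying cause for the two failing scheduler instances would be expected to appear.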

@dashpole dashpole added question Further information is requested and removed needs triage New item requiring triage labels Aug 21, 2024