The units of kubernetes.cpu.usage.total has changed from agent5 #1552

Open
nimeshksingh opened this issue Apr 3, 2018 · 13 comments

@nimeshksingh

Output of the info page (if this is a bug)

root@datadog-uds-datadog-6l9p2:/# agent status
Getting the status from the agent.

==============
Agent (v6.1.0)
==============

  Status date: 2018-04-03 22:04:44.453018 UTC
  Pid: 356
  Python Version: 2.7.14
  Logs: 
  Check Runners: 2
  Log Level: WARNING

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 0.000941067 s
    System UTC time: 2018-04-03 22:04:44.453018 UTC

  Host Info
  =========
    bootTime: 2018-03-21 17:33:22.000000 UTC
    kernelVersion: 4.4.111+
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.4
    procs: 68
    uptime: 1.138638e+06

  Hostnames
  =========
    host_aliases: [gke-lightstep-staging-pool-3-e7fc6500-wrtr.helpful-cat-109717]
    hostname: gke-lightstep-staging-pool-3-e7fc6500-wrtr.c.helpful-cat-109717.internal
    socket-fqdn: datadog-uds-datadog-6l9p2
    socket-hostname: datadog-uds-datadog-6l9p2

=========
Collector
=========

  Running Checks
  ==============
    cpu
    ---
      Total Runs: 56
      Metrics: 6, Total Metrics: 330
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    disk
    ----
      Total Runs: 56
      Metrics: 158, Total Metrics: 8848
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    docker
    ------
      Total Runs: 56
      Metrics: 378, Total Metrics: 21379
      Events: 0, Total Events: 13
      Service Checks: 1, Total Service Checks: 56
  
    file_handle
    -----------
      Total Runs: 56
      Metrics: 1, Total Metrics: 56
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    io
    --
      Total Runs: 56
      Metrics: 156, Total Metrics: 8628
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    kube_dns
    --------
      Total Runs: 55
      Metrics: 80, Total Metrics: 4400
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    kubelet
    -------
      Total Runs: 55
      Metrics: 704, Total Metrics: 38925
      Events: 0, Total Events: 0
      Service Checks: 3, Total Service Checks: 165
  
    kubernetes_apiserver
    --------------------
      Total Runs: 55
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    load
    ----
      Total Runs: 55
      Metrics: 6, Total Metrics: 330
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    memory
    ------
      Total Runs: 55
      Metrics: 14, Total Metrics: 770
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    network
    -------
      Total Runs: 55
      Metrics: 146, Total Metrics: 8030
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    ntp
    ---
      Total Runs: 55
      Metrics: 1, Total Metrics: 55
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 55
  
    uptime
    ------
      Total Runs: 55
      Metrics: 1, Total Metrics: 55
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    zk
    --
      Total Runs: 55
      Metrics: 30, Total Metrics: 1650
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 55
  
    zk
    --
      Total Runs: 55
      Metrics: 30, Total Metrics: 1650
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 55
  
========
JMXFetch
========

  Initialized checks
  ==================
    jmx
      instance_name : jmx_instance
      message : 
      metric_count : 13
      service_check_count : 0
      status : OK
  Failed checks
  =============
    no checks
    
=========
Forwarder
=========

  CheckRunsV1: 56
  IntakeV1: 16
  RetryQueueSize: 0
  Success: 128
  TimeseriesV1: 56

  API Keys status
  ===============
    https://6-1-0-app.agent.datadoghq.com,*************************a39e7: API Key valid

==========
Logs Agent
==========

  Logs Agent is not running

=========
DogStatsD
=========

  Checks Metric Sample: 97414
  Event: 14
  Events Flushed: 14
  Number Of Flushes: 56
  Series Flushed: 94103
  Service Check: 1273
  Service Checks Flushed: 1319
  Dogstatsd Metric Sample: 236300

Describe what happened:
I have a daemonset running datadog agents in a kubernetes 1.8 cluster. After adding a new daemonset with a v6 agent and deleting the v5 agent daemonset, the kubernetes.cpu metrics have changed units. kubernetes.cpu.limits and kubernetes.cpu.requests are now 1/1000 of what they were before. kubernetes.cpu.usage.total is now a much smaller number; it usually looks like roughly 1/1000 of the previous value, but not exactly.

Describe what you expected:
The cpu metrics should be unchanged.

Steps to reproduce the issue:
Upgrade from dd-agent 5 to datadog-agent 6.

Additional environment details (Operating System, Cloud provider, etc):
Kubernetes 1.8, datadog-agent jmx docker image, GKE

@JulienBalestra
Contributor

@nimeshksingh thank you for submitting this issue.

You're right, some metrics collected from the kubelet endpoint /metrics/cadvisor changed.

We are aware of that and we're working on making the kubernetes.cpu.usage.total metric scale by cores.
It should then be consistent with the other cpu metrics (load average, docker cpu, ...).

@nimeshksingh
Author

Cool, good to hear. If it's a known issue, it may be worth adding to this page: https://github.com/DataDog/datadog-agent/blob/master/docs/agent/changes.md#kubernetes-support

@epinzur

epinzur commented Apr 5, 2018

I was going to report this too... is there any combination of metrics I can use until a fix is available?

@JulienBalestra
Contributor

This is the ongoing fix to keep the kubernetes.cpu.usage.total metric consistent between the two agents:
DataDog/integrations-core#1361

@kiyose

kiyose commented Apr 9, 2018

@JulienBalestra, it would be very good if we had an option to keep the new behavior. Agent 5's scaled metrics require nearly every graph to have a / 1000000000 added to them.
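For example, making an agent5-scale graph line up with core-based metrics means roughly every query ends up looking like this (illustrative query only):

    avg:kubernetes.cpu.usage.total{*} / 1000000000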

@nimeshksingh
Author

Just to confirm - the difference in kubernetes.cpu.usage.total between agent5 and agent6 is not just a scaling factor, right? With agent5, it's absolute, and with agent6, it seems to be relative to cpu limits (not sure if it's kubernetes.cpu.limits or kubernetes.cpu.requests). It doesn't seem like DataDog/integrations-core#1361 will address that change.

@nimeshksingh
Author

Actually, the more I look at this, the more it seems that the kubernetes.cpu and kubernetes.mem metrics are just wrong a lot of the time with agent6. I'm seeing things like two pods that have the same cpu usage according to the Kubernetes dashboard showing wildly different cpu usage in Datadog. Also, for processes with very steady memory usage, memory in Datadog jumps around in ways that don't match the Kubernetes dashboard's view.

Should I open a bug in integrations-core?

@JulienBalestra
Contributor

@nimeshksingh the scaling factor is a first step to allow the metric kubernetes.cpu.usage.total to scale by cores on both agents.

We can keep this issue open to continue to track this problem.

We need to keep investigating this (both /metrics/cadvisor and the original cAdvisor endpoint).

@nimeshksingh
Author

Okay, cool. In the meantime is there a good way to turn off just the kubelet check in agent6, so I can run it alongside an agent5 that just gets these metrics? I don't want to just set KUBERNETES=false for agent6 because I still want the kubelet to be used for autodiscovery and labels for other checks.

@JulienBalestra
Contributor

You can do that by deleting any associated file: /etc/datadog-agent/conf.d/kubelet.d/conf.yaml*

Currently it's set up by the entrypoint: https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/entrypoint/50-kubernetes.sh#L15
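Something along these lines (sketch only, adapt to your setup):

    # Drop the kubelet check config that the entrypoint generates.
    # The entrypoint re-creates it on each container start, so this needs to
    # run after 50-kubernetes.sh (or the file needs to be masked another way).
    rm -f /etc/datadog-agent/conf.d/kubelet.d/conf.yaml*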

@xvello
Contributor

xvello commented Apr 11, 2018

Hello,

Running two agents on the same host is not supported, as host metadata will get mangled, and this is not a use case we are planning to support.

I am currently working on a port of the cadvisor collection logic into the new kubelet check to support legacy versions of kubernetes: DataDog/integrations-core#1339
This is still WIP and requires extensive testing before release, but I'll be posting a test docker image to the PR once it is ready for testing.

@nimeshksingh
Author

For reference, I am temporarily solving the problem by running both agents in parallel and using DD_AC_INCLUDE and DD_AC_EXCLUDE to get the new agent to ignore most containers. We're not running a 'legacy' version of k8s in the sense that PR describes - it's 1.8. I'm not really concerned about which particular units are used, more that the data is actually correct. As I said above, the new prometheus-based check just seems to be wrong a lot of the time, and I'm nervous about having to use a 'legacy' check as the solution to get correct data. Are there any tests that compare the data produced by the cadvisor check with the data produced by the prometheus check?
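In case it helps anyone else, the relevant part of our agent6 DaemonSet looks roughly like this (the filter patterns here are placeholders, not our real ones):

    env:
      - name: DD_AC_EXCLUDE      # make the new agent ignore most containers
        value: "image:.*"
      - name: DD_AC_INCLUDE      # whitelist only the images it should still check
        value: "image:myorg/important-service.*"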

@nimeshksingh
Author

As an example of the strange numbers with the new agents, here is a screenshot of a graph where cpu usage suddenly jumped to a seemingly arbitrarily scaled number for pods that actually have very similar cpu usage. This happened when I upgraded our datadog daemonset from 6.1.0-jmx to 6.2.0-rc.1-jmx.
[screenshot: screen shot 2018-05-01 at 3 14 03 pm]
