The units of kubernetes.cpu.usage.total has changed from agent5 #1552

Open
nimeshksingh opened this issue Apr 3, 2018 · 13 comments

@nimeshksingh

Output of the info page (if this is a bug)

root@datadog-uds-datadog-6l9p2:/# agent status
Getting the status from the agent.

==============
Agent (v6.1.0)
==============

  Status date: 2018-04-03 22:04:44.453018 UTC
  Pid: 356
  Python Version: 2.7.14
  Logs: 
  Check Runners: 2
  Log Level: WARNING

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 0.000941067 s
    System UTC time: 2018-04-03 22:04:44.453018 UTC

  Host Info
  =========
    bootTime: 2018-03-21 17:33:22.000000 UTC
    kernelVersion: 4.4.111+
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.4
    procs: 68
    uptime: 1.138638e+06

  Hostnames
  =========
    host_aliases: [gke-lightstep-staging-pool-3-e7fc6500-wrtr.helpful-cat-109717]
    hostname: gke-lightstep-staging-pool-3-e7fc6500-wrtr.c.helpful-cat-109717.internal
    socket-fqdn: datadog-uds-datadog-6l9p2
    socket-hostname: datadog-uds-datadog-6l9p2

=========
Collector
=========

  Running Checks
  ==============
    cpu
    ---
      Total Runs: 56
      Metrics: 6, Total Metrics: 330
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    disk
    ----
      Total Runs: 56
      Metrics: 158, Total Metrics: 8848
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    docker
    ------
      Total Runs: 56
      Metrics: 378, Total Metrics: 21379
      Events: 0, Total Events: 13
      Service Checks: 1, Total Service Checks: 56
  
    file_handle
    -----------
      Total Runs: 56
      Metrics: 1, Total Metrics: 56
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    io
    --
      Total Runs: 56
      Metrics: 156, Total Metrics: 8628
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    kube_dns
    --------
      Total Runs: 55
      Metrics: 80, Total Metrics: 4400
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    kubelet
    -------
      Total Runs: 55
      Metrics: 704, Total Metrics: 38925
      Events: 0, Total Events: 0
      Service Checks: 3, Total Service Checks: 165
  
    kubernetes_apiserver
    --------------------
      Total Runs: 55
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    load
    ----
      Total Runs: 55
      Metrics: 6, Total Metrics: 330
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    memory
    ------
      Total Runs: 55
      Metrics: 14, Total Metrics: 770
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    network
    -------
      Total Runs: 55
      Metrics: 146, Total Metrics: 8030
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    ntp
    ---
      Total Runs: 55
      Metrics: 1, Total Metrics: 55
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 55
  
    uptime
    ------
      Total Runs: 55
      Metrics: 1, Total Metrics: 55
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
  
    zk
    --
      Total Runs: 55
      Metrics: 30, Total Metrics: 1650
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 55
  
    zk
    --
      Total Runs: 55
      Metrics: 30, Total Metrics: 1650
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 55
  
========
JMXFetch
========

  Initialized checks
  ==================
    jmx
      instance_name : jmx_instance
      message : 
      metric_count : 13
      service_check_count : 0
      status : OK
  Failed checks
  =============
    no checks
    
=========
Forwarder
=========

  CheckRunsV1: 56
  IntakeV1: 16
  RetryQueueSize: 0
  Success: 128
  TimeseriesV1: 56

  API Keys status
  ===============
    https://6-1-0-app.agent.datadoghq.com,*************************a39e7: API Key valid

==========
Logs Agent
==========

  Logs Agent is not running

=========
DogStatsD
=========

  Checks Metric Sample: 97414
  Event: 14
  Events Flushed: 14
  Number Of Flushes: 56
  Series Flushed: 94103
  Service Check: 1273
  Service Checks Flushed: 1319
  Dogstatsd Metric Sample: 236300

Describe what happened:
I have a daemonset running datadog agents in a kubernetes 1.8 cluster. After adding a new daemonset with a v6 agent and deleting the v5 agent daemonset, the kubernetes.cpu metrics have changed units. kubernetes.cpu.limits and kubernetes.cpu.requests are now 1/1000 of what they were before. kubernetes.cpu.usage.total is now a much smaller number; it usually looks like roughly 1/1000 of the previous value, but not exactly.

Describe what you expected:
The cpu metrics should be unchanged.

Steps to reproduce the issue:
Upgrade from dd-agent 5 to datadog-agent 6.

Additional environment details (Operating System, Cloud provider, etc):
Kubernetes 1.8, datadog-agent jmx docker image, GKE

@JulienBalestra
Contributor

@nimeshksingh thank you for submitting this issue.

You're right, some metrics collected from the kubelet endpoint /metrics/cadvisor changed.

We are aware of that and we're working on making the kubernetes.cpu.usage.total metric scale by cores.
It should then be consistent with the other cpu metrics (load average, docker cpu, ...).

@nimeshksingh
Author

Cool, good to hear. If it's a known issue, it may be worth adding to this page: https://github.com/DataDog/datadog-agent/blob/master/docs/agent/changes.md#kubernetes-support

@epinzur

epinzur commented Apr 5, 2018

I was going to report this too... is there any combination of metrics I can use until a fix is available?

@JulienBalestra
Contributor

This is the ongoing fix to keep the kubernetes.cpu.usage.total metric consistent between the two agents:
DataDog/integrations-core#1361

@kiyose

kiyose commented Apr 9, 2018

@JulienBalestra, it would be very good if we had an option to keep the new behavior. Agent 5's scaled metrics require nearly every graph to have a / 1000000000 added to them.
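For example, making an agent5-scale graph line up with core-based metrics means roughly every query ends up looking like this (illustrative query only):

    avg:kubernetes.cpu.usage.total{*} / 1000000000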

@nimeshksingh
Author

Just to confirm - the difference in kubernetes.cpu.usage.total between agent5 and agent6 is not just a scaling factor, right? With agent5, it's absolute, and with agent6, it seems to be relative to cpu limits (not sure if it's kubernetes.cpu.limits or kubernetes.cpu.requests). It doesn't seem like DataDog/integrations-core#1361 will address that change.

@nimeshksingh
Author

Actually, the more I look at this, the more it seems that the kubernetes.cpu and kubernetes.mem metrics are just wrong a lot of the time with agent6. I'm seeing things like two pods that have the same cpu usage according to the Kubernetes dashboard showing wildly different cpu usage in Datadog. Also, for processes with very steady memory usage, memory in Datadog jumps around in ways that don't match the Kubernetes dashboard's view.

Should I open a bug in integrations-core?

@JulienBalestra
Contributor

@nimeshksingh the scaling factor is a first step to allow the metric kubernetes.cpu.usage.total to scale by cores on both agents.

We can keep this issue open to continue to track this problem.

We need to keep investigating this (both /metrics/cadvisor and the original cAdvisor endpoint).

@nimeshksingh
Author

Okay, cool. In the meantime is there a good way to turn off just the kubelet check in agent6, so I can run it alongside an agent5 that just gets these metrics? I don't want to just set KUBERNETES=false for agent6 because I still want the kubelet to be used for autodiscovery and labels for other checks.

@JulienBalestra
Contributor

You can do that by deleting any associated file: /etc/datadog-agent/conf.d/kubelet.d/conf.yaml*

Currently it's set up by the entrypoint: https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/entrypoint/50-kubernetes.sh#L15
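Something along these lines (sketch only, adapt to your setup):

    # Drop the kubelet check config that the entrypoint generates.
    # The entrypoint re-creates it on each container start, so this needs to
    # run after 50-kubernetes.sh (or the file needs to be masked another way).
    rm -f /etc/datadog-agent/conf.d/kubelet.d/conf.yaml*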

@xvello
Contributor

xvello commented Apr 11, 2018

Hello,

Running two agents on the same host is not supported, as host metadata will get mangled, and this is not a use case we are planning to support.

I am currently working on a port of the cadvisor collection logic into the new kubelet check to support legacy versions of kubernetes: DataDog/integrations-core#1339
This is still WIP and requires extensive testing before release, but I'll be posting a test docker image to the PR once it is ready for testing.

@nimeshksingh
Author

For reference, I am temporarily solving the problem by running both agents in parallel and using DD_AC_INCLUDE and DD_AC_EXCLUDE to get the new agent to ignore most containers. We're not running a 'legacy' version of k8s in the sense that PR describes - it's 1.8. I'm not really concerned about which particular units are used, more that the data is actually correct. As I said above, the new prometheus-based check just seems to be wrong a lot of the time, and I'm nervous about having to use a 'legacy' check as the solution to get correct data. Are there any tests that compare the data produced by the cadvisor check with the data produced by the prometheus check?
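In case it helps anyone else, the relevant part of our agent6 DaemonSet looks roughly like this (the filter patterns here are placeholders, not our real ones):

    env:
      - name: DD_AC_EXCLUDE      # make the new agent ignore most containers
        value: "image:.*"
      - name: DD_AC_INCLUDE      # whitelist only the images it should still check
        value: "image:myorg/important-service.*"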

@nimeshksingh
Author

As an example of the strange numbers with the new agents, here is a screenshot of a graph where cpu usage suddenly jumped to a seemingly arbitrarily scaled number for pods that actually have very similar cpu usage. This happened when I upgraded our datadog daemonset from 6.1.0-jmx to 6.2.0-rc.1-jmx.
[screenshot: screen shot 2018-05-01 at 3 14 03 pm]
