Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kubelet] port cadvisor metric collection from legacy kubernetes check #1339

Merged
merged 4 commits into from
Apr 17, 2018

Conversation

xvello
Copy link
Contributor

@xvello xvello commented Apr 4, 2018

What does this PR do?

Port the cadvisor collection logic from agent5's kubernetes check to agent6. This PR:

  • adds the ContainerFilter helper class to consume the new agent interface. Only used for cadvisor mode for now, prometheus mode will be patched in another PR
  • factors-out common parts to a common.py file
  • copy agent5 cadvisor logic to a CadvisorScraper helper class for clearer separation
  • update the cadvisor logic to support agent6 facilities (tagger, filter)
  • update the cadvisor logic to report network metrics at the pod cardinality for consistency with prometheus mode (change from agent5): see comment lower
  • add the missing disk metric in the kubernetes.csv metadata file

Motivation

Support legacy k8s clusters

Testing Guidelines

Pushed the datadog/agent-dev:xvello-test1-3 and datadog/agent-dev:xvello-test1-3-jmx images for testing

Cadvisor mode can be triggered with the following confd configmap:

  confd:
    kubelet.yaml: |-
      init_config:
      instances:
        - cadvisor_port: 4194

Versioning

  • Bumped the check version in manifest.json
  • Bumped the check version in datadog_checks/{integration}/__init__.py
  • Updated CHANGELOG.md. Please use Unreleased as the date in the title
    for the new section.
  • If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo

Additional Notes

Anything else we should know when reviewing?

@xvello
Copy link
Contributor Author

xvello commented Apr 12, 2018

As autodiscovering prometeus/cadvisor might be pretty tricky, we aggreed on going with manual configuration. This can be done by overriding the kubelet.d/conf.yaml.default by a custom kubelet.yaml. For example, with the helm chart:

  confd:
    kubelet.yaml: |-
      init_config:
      instances:
        - cadvisor_port: 4194

README.md Outdated
@@ -119,6 +119,8 @@ the new testing approach:

For checks that are not listed here, please refer to [Legacy development Setup](docs/dev/legacy.md).

If you updated the test requirements for a check, you will need to run `tox --recreate` for changes to be effective.
Copy link
Contributor

@l0k0ms l0k0ms Apr 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r/you will need to run `tox --recreate` for changes/run `tox --recreate` for changes

###
### Metric collection for legacy (< 1.7.6) clusters via the kubelet's
### cadvisor port.
### This port is closed by default on k8s 1.7 and OpenShift, make sure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r/make sure you enable it/enable it

@l0k0ms
Copy link
Contributor

l0k0ms commented Apr 12, 2018

I left some minor nit phrasing comment to follow the contributing guidelines: https://github.com/DataDog/documentation/blob/master/CONTRIBUTING.md

It seems that we are collecting a new metric:

  • kubernetes.diskio.io_service_bytes.stats.total

should we update https://github.com/DataDog/integrations-core/blob/master/kubelet/metadata.csv or https://github.com/DataDog/integrations-core/blob/master/kubernetes/metadata.csv accordingly?

@xvello
Copy link
Contributor Author

xvello commented Apr 12, 2018

Cadvisor exposes network metrics per container, but as there's one network namespace per pod, the new endpoint exposes them per pod. The agent6 cadvisor mode will align with prometheus mode (that will have correct host sums), and break compat with agent5.

Agent5, sending one gauge per container in the pod, only pause container has a non-zero value

 (u'kubernetes.network.tx_bytes',
  1523537701,
  0.0,
  {'hostname': u'gke-xvello-19-default-pool-be004370-pjvz.c.datadog-sandbox.internal',
   'tags': [u'image_tag:1.14.8',
            u'image_name:asia.gcr.io/google_containers/k8s-dns-sidecar-amd64',
            'agent:5',
            u'kube_k8s-app:kube-dns',
            u'kube_namespace:kube-system',
            u'kube_container_name:sidecar',
            u'kube_replica_set:kube-dns-5c5884448b',
            u'pod_name:kube-dns-5c5884448b-z4d79',
            u'kube_replication_controller:kube-dns-5c5884448b',
            u'container_name:k8s_sidecar_kube-dns-5c5884448b-z4d79_kube-system_4874125c-3363-11e8-a97b-42010a840074_1',
            u'kube_pod-template-hash:1714400046',
            u'kube_deployment:kube-dns',

Agent6 + prometeus: correctly tagging by pod tags only

      "metric": "kubernetes.network.tx_bytes",
      "points": [
        [
          1523537810,
          0
        ]
      ],
      "tags": [
        "kube_deployment:kube-dns",
        "kube_namespace:kube-system",
        "kube_replica_set:kube-dns-5c5884448b",
        "pod_name:kube-dns-5c5884448b-z4d79"
      ],
      "host": "gke-xvello-19-default-pool-be004370-pjvz.c.datadog-sandbox.internal",
      "type": "gauge",
      "interval": 0,

Agent6 + cadvisor : move the network metric to pod level, mirroring prometeus mode

      "metric": "kubernetes.network.tx_bytes",
      "points": [
        [
          1523546563,
          0
        ]
      ],
      "tags": [
        "agent:6",
        "kube_daemon_set:fluentd-gcp-v2.0.10",
        "kube_namespace:kube-system",
        "pod_name:fluentd-gcp-v2.0.10-csz7w"
      ],
      "host": "gke-xvello-19-default-pool-be004370-pjvz.c.datadog-sandbox.internal",
      "type": "gauge",
      "interval": 0,
      "source_type_name": "System"
    },

@xvello xvello force-pushed the xvello/kubelet-cadvisor branch 4 times, most recently from 51ab121 to e2c9487 Compare April 13, 2018 12:05
@xvello xvello requested a review from a team April 13, 2018 12:43
@xvello xvello changed the title [WIP] port cadvisor metric collection from legacy kubernetes check [kubelet] port cadvisor metric collection from legacy kubernetes check Apr 13, 2018
self._update_metrics(instance)

def _update_metrics(self, instance):
def parse_quantity(s):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you define this function in this method ?
This function isn't used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead-code copy-pasted from agent5, removing

except Exception as e:
self.log.error("Unable to collect metrics for container: {0} ({1})".format(c_id, e))

def _publish_raw_metrics(self, metric, dat, tags, is_pod, depth=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a docstrings for this method ?
Like:

def _publish_raw_metrics(self, metric, dat, tags, is_pod, depth=0):
"""
Blahblah
metric: type
dat: type
...
"""

LEGACY_CADVISOR_METRICS_PATH = '/api/v1.3/subcontainers/'


class CadvisorScraper():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using the new style Python 2.7 class:

class CadvisorScraper(object):

WDYT ?

return url

def retrieve_cadvisor_metrics(self, timeout=10):
return requests.get(self.cadvisor_legacy_url, timeout=timeout).json()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is self.cadvisor_legacy_url attribute defined in this class ?

metrics = self.retrieve_cadvisor_metrics()

if not metrics:
raise Exception('No metrics retrieved cmd=%s' % self.metrics_cmd)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is self.metrics_cmd attribute defined in this class ?

try:
self._update_container_metrics(instance, subcontainer)
except Exception as e:
self.log.error("Unable to collect metrics for container: {0} ({1})".format(c_id, e))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.log isn't defined, this could be a global from (IIRC):

logger = logging.getLogger(__name__)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class is a mixin intended to be used inside an AgentCheck class, so it'll use the agentcheck's self.log as it uses its self.gauge. Adding a docstring

self._publish_raw_metrics(metric, dat[-1], tags, is_pod, depth + 1)

def _update_container_metrics(self, instance, subcontainer):
tags = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tags definition isn't needed as you are redefining it in each condition below.

# Let's see who we have here
if is_pod:
tags = tags_for_pod(pod_uid, True)
elif (in_static_pod and k_container_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the parenthesis are redundant

tags += tags_for_pod(pod_uid, True)
tags.append("kube_container_name:%s" % k_container_name)
else: # Standard container
if self.container_filter.is_excluded(cid):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for self.container_filter, what's this scope ?

return False


class ContainerFilter:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old style class, see the other mention.


self._update_metrics(instance, cadvisor_url, pod_list, container_filter)

def _retrieve_cadvisor_metrics(self, cadvisor_url, timeout=10):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method doesn't use the instance, can it be static ?

"""
Recusively parses and submit metrics for a given entity, until
reaching self.max_depth.
Nested metric names are flattened: memory/usage -> memory.usage
Copy link
Contributor

@JulienBalestra JulienBalestra Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I really enjoy docstrings with the type of each parameter

self._publish_raw_metrics(metric + '.%s' % k.lower(), v, tags, is_pod, depth + 1)

elif isinstance(dat, list):
self._publish_raw_metrics(metric, dat[-1], tags, is_pod, depth + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be useful to catch a potential else here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else would only be a pass. I'm not sure we should log it

is_pod = False
in_static_pod = False
cid = subcontainer.get('id')
pod_uid = subcontainer.get('labels', []).get('io.kubernetes.pod.uid')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the default be a dict instead of a list ?
Especially with .get('io.kubernetes.pod.uid') over it (list doesn't support get on it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, ouch

in_static_pod = False
cid = subcontainer.get('id')
pod_uid = subcontainer.get('labels', []).get('io.kubernetes.pod.uid')
k_container_name = subcontainer.get('labels', []).get('io.kubernetes.container.name')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

return
tags = list(set(tags + instance.get('tags', [])))

stats = subcontainer['stats'][-1] # take the latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure of the existence of stats and a len > 0 ?
Does it make sense to add a try except here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception will be caught in the parent _update_metrics

stats = subcontainer['stats'][-1] # take the latest
self._publish_raw_metrics(NAMESPACE, stats, tags, is_pod)

if is_pod is False and subcontainer.get("spec", {}).get("has_filesystem") and stats.get('filesystem', []) != []:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what about doing this instead:

if is_pod is False and subcontainer.get("spec", {}).get("has_filesystem") and stats.get('filesystem'):

It doesn't create two empty lists to just compare if it's not None.

return get_tags('docker://%s' % cid, cardinality)


def get_pod_by_uid(uid, podlist):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like that kind of docstrings 👍

def __init__(self, podlist):
self.containers = {}

for pod in podlist.get('items') or []:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace this by:

for pod in podlist.get('items', []):

Copy link
Contributor Author

@xvello xvello Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have

            if self.pod_list.get("items") is None:
                # Sanitize input: if no pod are running, 'items' is a NoneObject
                self.pod_list['items'] = []

in check() we can. Before, we had a items key with a None value, that made the iteration fail when no pod was running

- adds the ContainerFilter helper class to consume the new agent
interface. Only used for cadvisor mode for now
- factors-out common parts to a common.py file
- copy agent5 cadvisor logic to a CadvisorScraper helper class for separation
- update the cadvisor logic to support agent6 facilities (tagger, filter)
- update the cadvisor logic to report network metrics at the pod cardinality
for consistency with prometheus mode (change from agent5): see comment lower
- add the missing disk metric in the kubernetes.csv metadata file
@xvello
Copy link
Contributor Author

xvello commented Apr 16, 2018

Final testing with datadog/agent-dev:xvello-test1-3 image show metrics are equivalent to agent5, and CPU usage is lower (~5% average) thant agent5 (~9% average)

JulienBalestra
JulienBalestra previously approved these changes Apr 17, 2018
@xvello
Copy link
Contributor Author

xvello commented Apr 17, 2018

@l0k0ms can we sync about this PR and DataDog/datadog-agent#1550 ? Do you see other docs to update?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants