[kubelet] port cadvisor metric collection from legacy kubernetes check #1339
0243101
to
b908076
Compare
b951a17
to
69871ee
Compare
As autodiscovering prometheus/cadvisor might be pretty tricky, we agreed on going with manual configuration. This can be done by overriding the
README.md
Outdated
@@ -119,6 +119,8 @@ the new testing approach:

For checks that are not listed here, please refer to [Legacy development Setup](docs/dev/legacy.md).

If you updated the test requirements for a check, you will need to run `tox --recreate` for changes to be effective.
r/you will need to run `tox --recreate` for changes
/run `tox --recreate` for changes
kubelet/conf.yaml.example
Outdated
###
### Metric collection for legacy (< 1.7.6) clusters via the kubelet's
### cadvisor port.
### This port is closed by default on k8s 1.7 and OpenShift, make sure
r/make sure you enable it
/enable it
b7c6587
to
a5ca319
Compare
I left some minor phrasing nits to follow the contributing guidelines: https://github.com/DataDog/documentation/blob/master/CONTRIBUTING.md
It seems that we are collecting a new metric:
should we update https://github.com/DataDog/integrations-core/blob/master/kubelet/metadata.csv or https://github.com/DataDog/integrations-core/blob/master/kubernetes/metadata.csv accordingly?
Cadvisor exposes network metrics per container, but as there's one network namespace per pod, the new endpoint exposes them per pod. The agent6 cadvisor mode will align with prometheus mode (which will have correct host sums) and break compatibility with agent5:
- Agent5: sends one gauge per container in the pod; only the pause container has a non-zero value
- Agent6 + prometheus: correctly tags by pod tags only
- Agent6 + cadvisor: moves the network metric to pod level, mirroring prometheus mode
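To illustrate the cardinality change, here is a minimal sketch of collapsing per-container network values into a single per-pod value. The function name, the `pod_uid`/`rx_bytes` keys, and the sample data are all hypothetical; this is not the PR's code nor the cadvisor payload format.

```python
from collections import defaultdict


def network_by_pod(container_stats, field="rx_bytes"):
    """Collapse per-container network values into one value per pod UID.

    `container_stats` is a hypothetical list of dicts with a "pod_uid" key
    and network counters; it is not the real cadvisor payload.
    """
    totals = defaultdict(float)
    for entry in container_stats:
        totals[entry["pod_uid"]] += entry.get(field, 0)
    return dict(totals)


# Only the pause container carries a non-zero value, as described above,
# so summing per pod yields the pod-level figure.
sample = [
    {"pod_uid": "pod-a", "name": "pause", "rx_bytes": 1200},
    {"pod_uid": "pod-a", "name": "app", "rx_bytes": 0},
    {"pod_uid": "pod-b", "name": "pause", "rx_bytes": 300},
]
per_pod = network_by_pod(sample)
```

With one value per pod, the metric can be tagged with pod tags only, which is what the prometheus mode is described as doing.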
51ab121
to
e2c9487
Compare
self._update_metrics(instance)

def _update_metrics(self, instance):
    def parse_quantity(s):
Why do you define this function in this method?
This function isn't used.
dead-code copy-pasted from agent5, removing
except Exception as e:
    self.log.error("Unable to collect metrics for container: {0} ({1})".format(c_id, e))

def _publish_raw_metrics(self, metric, dat, tags, is_pod, depth=0):
Is it possible to have a docstring for this method?
Like:
def _publish_raw_metrics(self, metric, dat, tags, is_pod, depth=0):
    """
    Blahblah
    metric: type
    dat: type
    ...
    """
LEGACY_CADVISOR_METRICS_PATH = '/api/v1.3/subcontainers/'


class CadvisorScraper():
return url

def retrieve_cadvisor_metrics(self, timeout=10):
    return requests.get(self.cadvisor_legacy_url, timeout=timeout).json()
Is the `self.cadvisor_legacy_url` attribute defined in this class?
metrics = self.retrieve_cadvisor_metrics()

if not metrics:
    raise Exception('No metrics retrieved cmd=%s' % self.metrics_cmd)
Is the `self.metrics_cmd` attribute defined in this class?
try:
    self._update_container_metrics(instance, subcontainer)
except Exception as e:
    self.log.error("Unable to collect metrics for container: {0} ({1})".format(c_id, e))
`self.log` isn't defined, this could be a global from (IIRC):
logger = logging.getLogger(__name__)
This class is a mixin intended to be used inside an `AgentCheck` class, so it'll use the AgentCheck's `self.log`, just as it uses its `self.gauge`. Adding a docstring.
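For readers unfamiliar with the pattern, a minimal sketch of a mixin that relies on its host class for `self.log` and `self.gauge`. `FakeAgentCheck`, `ScraperMixin`, `DemoCheck` and `report` are hypothetical names made up for this illustration; `FakeAgentCheck` only stands in for the real `AgentCheck` and does not reproduce its API.

```python
import logging


class FakeAgentCheck(object):
    """Hypothetical stand-in for AgentCheck: provides self.log and self.gauge."""

    def __init__(self):
        self.log = logging.getLogger("fake_check")
        self.submitted = []

    def gauge(self, name, value, tags=None):
        # Record the submission instead of sending it anywhere.
        self.submitted.append((name, value, tuple(tags or [])))


class ScraperMixin(object):
    """Defines no log or gauge of its own: both come from the host class."""

    def report(self, name, value, tags=None):
        try:
            self.gauge(name, float(value), tags)
        except (TypeError, ValueError) as e:
            self.log.error("Unable to report %s: %s", name, e)


class DemoCheck(ScraperMixin, FakeAgentCheck):
    pass


check = DemoCheck()
check.report("demo.metric", "1.5", ["pod_name:demo"])
check.report("demo.bad", None)  # float(None) raises TypeError, which is logged
```

Used on its own, `ScraperMixin().report(...)` would fail with an `AttributeError`, which is why the docstring should state the dependency on the host class.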
self._publish_raw_metrics(metric, dat[-1], tags, is_pod, depth + 1)

def _update_container_metrics(self, instance, subcontainer):
    tags = []
The `tags` definition isn't needed as you are redefining it in each condition below.
# Let's see who we have here
if is_pod:
    tags = tags_for_pod(pod_uid, True)
elif (in_static_pod and k_container_name):
nit: the parentheses are redundant
    tags += tags_for_pod(pod_uid, True)
    tags.append("kube_container_name:%s" % k_container_name)
else:  # Standard container
    if self.container_filter.is_excluded(cid):
Same for `self.container_filter`, what's its scope?
return False


class ContainerFilter:
Old style class, see the other mention.
self._update_metrics(instance, cadvisor_url, pod_list, container_filter)

def _retrieve_cadvisor_metrics(self, cadvisor_url, timeout=10):
This method doesn't use the instance, can it be static?
"""
Recursively parses and submits metrics for a given entity, until
reaching self.max_depth.
Nested metric names are flattened: memory/usage -> memory.usage
nit: I really enjoy docstrings with the type of each parameter
self._publish_raw_metrics(metric + '.%s' % k.lower(), v, tags, is_pod, depth + 1)

elif isinstance(dat, list):
    self._publish_raw_metrics(metric, dat[-1], tags, is_pod, depth + 1)
Could it be useful to catch a potential `else` here?
An `else` would only be a `pass`. I'm not sure we should log it.
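As a rough illustration of the recursion discussed in this thread, the sketch below flattens nested stats into dotted metric names, takes the last element of lists, and silently skips anything else (the implicit `else: pass`). `flatten_metrics`, its `max_depth` default, and the sample payload are assumptions made for this example; unlike the PR's method, it returns a dict instead of submitting gauges.

```python
import numbers


def flatten_metrics(prefix, data, depth=0, max_depth=10):
    """Return {flattened_name: value} for the numeric leaves of `data`."""
    out = {}
    if depth >= max_depth:
        return out
    if isinstance(data, numbers.Number):
        out[prefix] = float(data)
    elif isinstance(data, dict):
        for key, value in data.items():
            name = "%s.%s" % (prefix, key.lower())
            out.update(flatten_metrics(name, value, depth + 1, max_depth))
    elif isinstance(data, list) and data:
        # Only the most recent sample of a list is kept.
        out.update(flatten_metrics(prefix, data[-1], depth + 1, max_depth))
    # Any other type (strings, None, ...) falls through without being reported.
    return out


sample = {
    "memory": {"usage": 1024, "Working_Set": 512},
    "cpu": [{"total": 1}, {"total": 7}],
    "name": "ignored",
}
flat = flatten_metrics("kubernetes", sample)
```

Here `flat` holds `kubernetes.memory.usage`, `kubernetes.memory.working_set` and `kubernetes.cpu.total`; the string value under `name` is dropped, which is the case an explicit `else` would cover.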
is_pod = False
in_static_pod = False
cid = subcontainer.get('id')
pod_uid = subcontainer.get('labels', []).get('io.kubernetes.pod.uid')
Can the default be a dict instead of a list? Especially with `.get('io.kubernetes.pod.uid')` called on it (a list doesn't support `get`).
good catch, ouch
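A small sketch of the failure mode, using a made-up `subcontainer` dict that has no `labels` key:

```python
subcontainer = {"id": "abc"}  # hypothetical entry without a 'labels' key

# With a list default, the chained .get() fails as soon as 'labels' is missing.
try:
    subcontainer.get("labels", []).get("io.kubernetes.pod.uid")
    failed = False
except AttributeError:
    failed = True

# With a dict default, the chained lookup simply yields None.
pod_uid = subcontainer.get("labels", {}).get("io.kubernetes.pod.uid")
```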
in_static_pod = False
cid = subcontainer.get('id')
pod_uid = subcontainer.get('labels', []).get('io.kubernetes.pod.uid')
k_container_name = subcontainer.get('labels', []).get('io.kubernetes.container.name')
ditto
return
tags = list(set(tags + instance.get('tags', [])))

stats = subcontainer['stats'][-1]  # take the latest
Are we sure that `stats` exists and that its length is > 0?
Does it make sense to add a `try`/`except` here?
The exception will be caught in the parent _update_metrics
stats = subcontainer['stats'][-1]  # take the latest
self._publish_raw_metrics(NAMESPACE, stats, tags, is_pod)

if is_pod is False and subcontainer.get("spec", {}).get("has_filesystem") and stats.get('filesystem', []) != []:
nit: what about doing this instead:
if is_pod is False and subcontainer.get("spec", {}).get("has_filesystem") and stats.get('filesystem'):
It avoids creating two empty lists just to check that the value is not `None`.
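A quick comparison of the two conditions, reduced to the `filesystem` part. The helper names and the sample `stats` dicts are made up for this sketch:

```python
def has_fs_compare(stats):
    # Original form: compares against a freshly built empty list.
    return stats.get("filesystem", []) != []


def has_fs_truthy(stats):
    # Suggested form: relies on truthiness of the stored value.
    return bool(stats.get("filesystem"))


missing = {}
empty = {"filesystem": []}
null = {"filesystem": None}
filled = {"filesystem": [{"usage": 1}]}
```

The two forms agree except when the key is present with a `None` value: the comparison form returns `True` there (since `None != []`), while the truthiness form returns `False`, which looks like the safer behaviour before indexing into the list.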
return get_tags('docker://%s' % cid, cardinality)


def get_pod_by_uid(uid, podlist):
Really like that kind of docstrings 👍
def __init__(self, podlist):
    self.containers = {}

    for pod in podlist.get('items') or []:
Can we replace this with:
for pod in podlist.get('items', []):
Now that we have
if self.pod_list.get("items") is None:
    # Sanitize input: if no pod are running, 'items' is a NoneObject
    self.pod_list['items'] = []
in `check()`, we can. Before, we had an `items` key with a `None` value, which made the iteration fail when no pod was running.
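The difference between the two forms only shows up when the key exists with a `None` value. A minimal sketch, with hypothetical helper names and a made-up pod list:

```python
def pods_with_default(podlist):
    # The default is only used when the key is absent.
    return podlist.get("items", [])


def pods_with_or(podlist):
    # Falls back to [] for a missing key and for a None value alike.
    return podlist.get("items") or []


def sanitize(podlist):
    # Same idea as the check() snippet quoted above: normalise None to [].
    if podlist.get("items") is None:
        podlist["items"] = []
    return podlist


no_pods = {"items": None}  # what the thread describes when no pod is running
before = pods_with_default(no_pods)
fallback = pods_with_or(no_pods)
after = pods_with_default(sanitize({"items": None}))
```

Once the input is sanitised up front, the plain `.get('items', [])` form is safe to iterate over.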
f7b29c2
to
95a3ebc
Compare
- adds the ContainerFilter helper class to consume the new agent interface. Only used for cadvisor mode for now
- factors out common parts to a common.py file
- copies the agent5 cadvisor logic to a CadvisorScraper helper class for separation
- updates the cadvisor logic to support agent6 facilities (tagger, filter)
- updates the cadvisor logic to report network metrics at the pod cardinality for consistency with prometheus mode (change from agent5): see comment lower
- adds the missing disk metric to the kubernetes.csv metadata file
95a3ebc
to
d1a7593
Compare
23acc34
to
463a8f9
Compare
Final testing with
@l0k0ms can we sync about this PR and DataDog/datadog-agent#1550? Do you see other docs to update?
What does this PR do?
Port the cadvisor collection logic from agent5's kubernetes check to agent6. This PR:
- adds the `ContainerFilter` helper class to consume the new agent interface. Only used for cadvisor mode for now, prometheus mode will be patched in another PR
- factors out common parts to a `common.py` file
- copies the agent5 cadvisor logic to a `CadvisorScraper` helper class for clearer separation
- updates the cadvisor logic to support agent6 facilities (tagger, filter)
- updates the cadvisor logic to report network metrics at the pod cardinality for consistency with prometheus mode (change from agent5)
- adds the missing disk metric to the `kubernetes.csv` metadata file

Motivation
Support legacy k8s clusters

Testing Guidelines
Pushed the `datadog/agent-dev:xvello-test1-3` and `datadog/agent-dev:xvello-test1-3-jmx` images for testing.
Cadvisor mode can be triggered with the following confd configmap:

Versioning
- `manifest.json`
- `datadog_checks/{integration}/__init__.py`
- `CHANGELOG.md`. Please use `Unreleased` as the date in the title for the new section.

Additional Notes
Anything else we should know when reviewing?