Support for gathering metrics from prometheus endpoint for the kubelet itself. #1581
Conversation
@@ -7,7 +7,13 @@
from .. import AgentCheck
from ...errors import CheckException


class Scraper(PrometheusScraper):
The name of this class is admittedly bad and I'd take any suggestions for renaming. Maybe rename the mixin to `PrometheusScraperMixin` and also have `PrometheusScraper` as the standalone version?
Your suggestion sounds good to me. You'll need to update a few things (check base, exports, Istio check) for the renaming. I'd say it'd be nice to have, but since it's in the `prometheus` module it's not shocking for it to be named just `Scraper`.
Force-pushed from b2dc90e to cac5d98
kubelet/conf.yaml.example (Outdated)
 # Pass an empty string, or set the cadvisor_port option to disable
-# prometheus metrics collection
+# prometheus cadvisor metrics collection
 # metrics_endpoint: http://10.8.0.1:10255/metrics/cadvisor
 #
Does it make more sense to have a single config option `kubelet_api_endpoint: http://10.8.0.1:10255` plus a `collect_cadvisor_metrics` flag, or to have separate options `kubelet_metrics_endpoint: http://10.8.0.1:10255/metrics` and `cadvisor_metrics_endpoint: http://10.8.0.1:10255/metrics/cadvisor`?
I like the first option better, but it won't be easy to keep retro-compatibility. The second option sounds simpler. We can update the yaml with `metrics_endpoint` --> `cadvisor_endpoint` and add `kubelet_endpoint`, and fall back to `metrics_endpoint` in the check if no `cadvisor_endpoint` is found in the config object.
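A minimal sketch of that fallback, assuming the option names proposed in this thread (the default value here is illustrative, not the actual integration default):

```python
# Hypothetical sketch: prefer the new cadvisor_endpoint option, fall back
# to the deprecated metrics_endpoint if it is absent from the instance config.
DEFAULT_CADVISOR_ENDPOINT = "http://10.8.0.1:10255/metrics/cadvisor"  # assumed default


def resolve_cadvisor_endpoint(instance):
    # instance is the parsed config object for one check instance
    if "cadvisor_endpoint" in instance:
        return instance["cadvisor_endpoint"]
    return instance.get("metrics_endpoint", DEFAULT_CADVISOR_ENDPOINT)
```

This keeps old configs working unchanged while letting new configs use the clearer name.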
IMHO the best would be to have only `kubelet_api_endpoint`, and then we can guess `/metrics` and `/metrics/cadvisor`, because there's no way to change the path in the kubelet. So I guess we can keep the current `metrics_endpoint`, parse it, and add paths if needed.
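A sketch of that parsing, under the assumption that only the scheme and host of the configured endpoint matter, since the kubelet paths are fixed:

```python
from urllib.parse import urlparse


# Hypothetical helper: derive both fixed kubelet paths from whatever
# metrics_endpoint the user configured, keeping only its scheme and host.
def derive_endpoints(metrics_endpoint):
    parsed = urlparse(metrics_endpoint)
    base = "%s://%s" % (parsed.scheme, parsed.netloc)
    return base + "/metrics", base + "/metrics/cadvisor"
```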
@@ -70,39 +70,9 @@ def __init__(self, name, init_config, agentConfig, instances=None):
self.cadvisor_legacy_port = inst.get('cadvisor_port', CADVISOR_DEFAULT_PORT)
It seems like `self.kube_node_labels` is not used anywhere in this check 🤔
Looks like it's a leftover of https://github.com/DataDog/integrations-core/blob/master/kubernetes/conf.yaml.example#L140-L146. @hkaj should it be implemented or dropped?
It can be dropped; this has been moved to the main agent settings: https://github.com/DataDog/datadog-agent/blob/0aa3114a92a568b8b84a02919789fbc2d9358477/pkg/config/config.go#L176
'etcd_helper_cache_entry_count': 'kubelet.etcd.cache.entry.count',
'etcd_helper_cache_hit_count': 'kubelet.etcd.cache.hit.count',
'etcd_helper_cache_miss_count': 'kubelet.etcd.cache.miss.count',
'go_goroutines': 'kubelet.goroutines',
cc @JulienBalestra for additional useful metrics
@devonboyer the switch to a regular check with two scrapers LGTM 👍 Don't forget to update the `metadata.csv` for the metrics you'll be adding, bump the check version, and update the changelog 📝
kubelet/tests/test_kubelet.py (Outdated)
'kubernetes.kubelet.etcd.cache.entry.count',
'kubernetes.kubelet.etcd.cache.hit.count',
'kubernetes.kubelet.etcd.cache.miss.count',
'kubernetes.kubelet.goroutines',
We should add some metrics like:
- apiserver_client_certificate_expiration_seconds
And what do you think of:
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.001"} 2
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.002"} 13
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.004"} 18
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.008"} 32
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.016"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.032"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.064"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.128"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.256"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.512"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="+Inf"} 34
rest_client_request_latency_seconds_sum{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST"} 0.13816789299999999
rest_client_request_latency_seconds_count{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST"} 34
This metric could be useful in terms of security: a spike could indicate unexpected requests on the kubelet API (someone attempting to access the logs or doing an exec). But it's hard to parse, because the request URL is in the label itself (http://127.0.0.1:8080 is the endpoint of the kube-apiserver).
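One way around the URL-in-label problem is to parse the sample and keep only the path of the `url` label as a tag value. A sketch against the exposition line above (the parsing helper here is hypothetical, not the agent's actual prometheus parser):

```python
import re
from urllib.parse import urlparse

# One of the rest_client_request_latency_seconds_bucket samples from above.
SAMPLE = ('rest_client_request_latency_seconds_bucket'
          '{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",'
          'verb="POST",le="0.008"} 32')


def parse_sample(line):
    # Split a prometheus text-format line into (metric name, labels, value).
    name, rest = line.split("{", 1)
    labels_raw, value = rest.rsplit("} ", 1)
    labels = dict(re.findall(r'(\w+)="([^"]*)"', labels_raw))
    return name, labels, float(value)


name, labels, value = parse_sample(SAMPLE)
# Keep only the path, dropping the apiserver host baked into the label.
path = urlparse(labels["url"]).path
```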
The REST counter could also be valuable, to identify an overwhelmed kubelet:
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="DELETE"} 2
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="GET"} 506
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="PATCH"} 173
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="PUT"} 80
rest_client_requests_total{code="201",host="127.0.0.1:8080",method="POST"} 309
rest_client_requests_total{code="404",host="127.0.0.1:8080",method="GET"} 1
rest_client_requests_total{code="409",host="127.0.0.1:8080",method="POST"} 1
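For alerting, these per-(code, method) counters are easier to reason about rolled up into status classes. A small sketch using the values from the sample above:

```python
from collections import Counter

# Per-(code, method) values copied from the rest_client_requests_total
# sample above; the rollup into 2xx/4xx classes is illustrative.
samples = {
    ("200", "DELETE"): 2, ("200", "GET"): 506, ("200", "PATCH"): 173,
    ("200", "PUT"): 80, ("201", "POST"): 309,
    ("404", "GET"): 1, ("409", "POST"): 1,
}

by_class = Counter()
for (code, _method), count in samples.items():
    by_class[code[0] + "xx"] += count  # e.g. "200" -> "2xx"
```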
The following metrics look interesting:
# HELP kubelet_runtime_operations Cumulative number of runtime operations by operation type.
# TYPE kubelet_runtime_operations counter
kubelet_runtime_operations{operation_type="container_status"} 156
kubelet_runtime_operations{operation_type="create_container"} 33
kubelet_runtime_operations{operation_type="exec"} 16
kubelet_runtime_operations{operation_type="exec_sync"} 233
kubelet_runtime_operations{operation_type="image_status"} 73
kubelet_runtime_operations{operation_type="list_containers"} 3800
kubelet_runtime_operations{operation_type="list_images"} 278
kubelet_runtime_operations{operation_type="list_podsandbox"} 3766
kubelet_runtime_operations{operation_type="podsandbox_status"} 146
kubelet_runtime_operations{operation_type="pull_image"} 5
kubelet_runtime_operations{operation_type="remove_container"} 9
kubelet_runtime_operations{operation_type="run_podsandbox"} 28
kubelet_runtime_operations{operation_type="start_container"} 33
kubelet_runtime_operations{operation_type="status"} 398
kubelet_runtime_operations{operation_type="stop_container"} 5
kubelet_runtime_operations{operation_type="stop_podsandbox"} 42
kubelet_runtime_operations{operation_type="version"} 194
# TYPE kubelet_runtime_operations_errors counter
kubelet_runtime_operations_errors{operation_type="container_status"} 1
kubelet_runtime_operations_errors{operation_type="exec_sync"} 6
kubelet_runtime_operations_errors{operation_type="run_podsandbox"} 4
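The two counters pair naturally into a per-operation error ratio. A sketch using the sample values above (the aggregation is illustrative; it is not how the check reports these):

```python
# Per-operation totals and errors, copied from the payloads above.
ops = {"container_status": 156, "exec_sync": 233, "run_podsandbox": 28}
errors = {"container_status": 1, "exec_sync": 6, "run_podsandbox": 4}


def error_ratios(ops, errors):
    # Operations absent from the errors counter simply had no failures.
    return {op: errors.get(op, 0) / float(total)
            for op, total in ops.items() if total}


ratios = error_ratios(ops, errors)
```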
self.metrics_mapper = {
    'kubelet_runtime_operations_errors': 'kubelet.runtime.errors',
}
This is actually collected from `/metrics` and not `/metrics/cadvisor`, so it's been moved to the `metrics_mapper` for kubelet metrics.
# url of the kubelet metrics prometheus endpoint
# Pass an empty string to disable kubelet metrics collection
# kubelet_metrics_endpoint: http://10.8.0.1:10255/metrics
#
It ended up being much more straightforward to turn off kubelet metrics and/or cadvisor metrics separately by setting the option to "" than to try to share `kubelet_api_endpoint` between them. I added a deprecation note to `metrics_endpoint` in favour of the more clearly named `cadvisor_metrics_endpoint`. Does it make sense to log a warning/info if someone is using it?
yes, a deprecation warning would be nice.
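A minimal sketch of that warning, assuming the option names discussed above (this is not the actual check code):

```python
import logging

log = logging.getLogger(__name__)


# Hypothetical helper: resolve the cadvisor endpoint and warn when the
# deprecated metrics_endpoint option is still present in the instance config.
def resolve_cadvisor_metrics_endpoint(instance):
    if "cadvisor_metrics_endpoint" in instance:
        return instance["cadvisor_metrics_endpoint"]
    if "metrics_endpoint" in instance:
        log.warning("metrics_endpoint is deprecated, "
                    "please use cadvisor_metrics_endpoint instead")
        return instance["metrics_endpoint"]
    return None
```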
from .. import AgentCheck
from ...errors import CheckException

-class Scraper(PrometheusScraper):
+class PrometheusScraper(PrometheusScraperMixin):
Anyone object to these renames?
love it 👍
@devonboyer version will be automatically bumped by the same tool generating the changelog, no need to do that manually.
kubelet/metadata.csv (Outdated)
kubernetes.apiserver.certificate.expiration.sum,gauge,,second,,The sum of remaining lifetime on the certificate used to authenticate a request,,1,kubelet,
kubernetes.etcd.cache.entry.count,gauge,,operation,,The number of etcd helper cache entries,,0,kubelet,
kubernetes.etcd.cache.hit.count,gauge,,operation,,The number of etcd helper cache hits,,0,kubelet,
kubernetes.etcd.cache.miss.count,gauge,,operation,,The number of etcd helper cache misses,,0,kubelet,
Did you see this metric change (`kubernetes.etcd.cache*`)? I wasn't able to see it. What is it for?
I just thought it might be something interesting to have. If you don't think it's a relevant metric I can remove it; I don't really know a good use-case for it.
@devonboyer I think we're better off removing them if we're not sure how they're useful.
@DataDog/kenafeh Anyone able to give me a 👍 on this PR?
LGTM, nice rework
What does this PR do?
This PR adds support for scraping kubelet metrics from the kubelet's prometheus endpoint. This involved refactoring the code for scraping the /metrics/cadvisor prometheus endpoint into a separate class. Much of the code is the same, except that the kubelet check now instantiates two different prometheus scrapers instead of inheriting from PrometheusCheck.

Motivation
This endpoint also has some metrics that may be of interest to customers.

Review checklist
- no-changelog label attached

Additional Notes
I would be happy for any feedback on which metrics to gather/ignore for the KubeletPrometheusScraper. Right now, it only gathers a few metrics that looked interesting for the purpose of testing.
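A minimal sketch of the structure this PR describes, with two independent scrapers owned by the check (class and option names here are illustrative, not the actual integration code):

```python
# Hypothetical sketch: the check composes two scrapers instead of
# inheriting from PrometheusCheck.
class Scraper(object):
    def __init__(self, endpoint):
        self.endpoint = endpoint

    def enabled(self):
        # An empty string disables this scraper, per the conf.yaml.example.
        return bool(self.endpoint)


class KubeletCheck(object):
    def __init__(self, instance):
        self.kubelet_scraper = Scraper(instance.get("kubelet_metrics_endpoint", ""))
        self.cadvisor_scraper = Scraper(instance.get("cadvisor_metrics_endpoint", ""))

    def check(self):
        # Scrape whichever endpoints are enabled; return them for illustration.
        return [s.endpoint
                for s in (self.kubelet_scraper, self.cadvisor_scraper)
                if s.enabled()]
```

Composition makes it possible to disable either endpoint independently, which the single-inheritance design could not express cleanly.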