Support for gathering metrics from prometheus endpoint for the kubelet itself. #1581
Conversation
@@ -7,7 +7,13 @@
from .. import AgentCheck
from ...errors import CheckException


class Scraper(PrometheusScraper):
The name of this class is admittedly bad and I'd take any suggestions for renaming. Maybe rename the mixin to `PrometheusScraperMixin` and also have `PrometheusScraper` as the standalone version?
Your suggestion sounds good to me. You'll need to update a few things (check base, exports, Istio check) for the renaming. I'd say it'd be nice to have, but since it's in the `prometheus` module it's not shocking for it to be named just `Scraper`.
Force-pushed from b2dc90e to cac5d98
kubelet/conf.yaml.example (Outdated)
 # Pass an empty string, or set the cadvisor_port option to disable
-# prometheus metrics collection
+# prometheus cadvisor metrics collection
 # metrics_endpoint: http://10.8.0.1:10255/metrics/cadvisor
 #
Does it make more sense to have a single config option `kubelet_api_endpoint: http://10.8.0.1:10255` plus a `collect_cadvisor_metrics` flag, or to have separate options `kubelet_metrics_endpoint: http://10.8.0.1:10255/metrics` and `cadvisor_metrics_endpoint: http://10.8.0.1:10255/metrics/cadvisor`?
I like the first option better, but it won't be easy to keep retro-compatibility. The second option sounds simpler. We can update the yaml with `metrics_endpoint` --> `cadvisor_endpoint` and add `kubelet_endpoint`, and fall back to `metrics_endpoint` in the check if no `cadvisor_endpoint` is found in the config object.
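A minimal sketch of that fallback, assuming the option names proposed in this thread (the default value here is illustrative, not the actual integration default):

```python
# Hypothetical sketch: prefer the new cadvisor_endpoint option, fall back
# to the deprecated metrics_endpoint if it is absent from the instance config.
DEFAULT_CADVISOR_ENDPOINT = "http://10.8.0.1:10255/metrics/cadvisor"  # assumed default


def resolve_cadvisor_endpoint(instance):
    # instance is the parsed config object for one check instance
    if "cadvisor_endpoint" in instance:
        return instance["cadvisor_endpoint"]
    return instance.get("metrics_endpoint", DEFAULT_CADVISOR_ENDPOINT)
```

This keeps old configs working unchanged while letting new configs use the clearer name.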
IMHO the best would be to have only `kubelet_api_endpoint`, and then we can guess `/metrics` and `/metrics/cadvisor`, because there's no way to change the path in the kubelet. So I guess we can keep the current `metrics_endpoint`, parse it, and add paths if needed.
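A sketch of that parsing, under the assumption that only the scheme and host of the configured endpoint matter, since the kubelet paths are fixed:

```python
from urllib.parse import urlparse


# Hypothetical helper: derive both fixed kubelet paths from whatever
# metrics_endpoint the user configured, keeping only its scheme and host.
def derive_endpoints(metrics_endpoint):
    parsed = urlparse(metrics_endpoint)
    base = "%s://%s" % (parsed.scheme, parsed.netloc)
    return base + "/metrics", base + "/metrics/cadvisor"
```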
@@ -70,39 +70,9 @@ def __init__(self, name, init_config, agentConfig, instances=None):
self.cadvisor_legacy_port = inst.get('cadvisor_port', CADVISOR_DEFAULT_PORT)
It seems like `self.kube_node_labels` is not used anywhere in this check 🤔
Looks like it's a leftover of https://github.com/DataDog/integrations-core/blob/master/kubernetes/conf.yaml.example#L140-L146. @hkaj should it be implemented or dropped?
It can be dropped; this has been moved to the main agent settings: https://github.com/DataDog/datadog-agent/blob/0aa3114a92a568b8b84a02919789fbc2d9358477/pkg/config/config.go#L176
'etcd_helper_cache_entry_count': 'kubelet.etcd.cache.entry.count',
'etcd_helper_cache_hit_count': 'kubelet.etcd.cache.hit.count',
'etcd_helper_cache_miss_count': 'kubelet.etcd.cache.miss.count',
'go_goroutines': 'kubelet.goroutines',
cc @JulienBalestra for additional useful metrics
@devonboyer the switch to a regular check with two scrapers LGTM 👍 Don't forget to update the `metadata.csv` for the metrics you'll be adding, bump the check version, and update the changelog 📝
kubelet/tests/test_kubelet.py (Outdated)
'kubernetes.kubelet.etcd.cache.entry.count',
'kubernetes.kubelet.etcd.cache.hit.count',
'kubernetes.kubelet.etcd.cache.miss.count',
'kubernetes.kubelet.goroutines',
We should add some metrics like:
- apiserver_client_certificate_expiration_seconds
And what do you think of:
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.001"} 2
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.002"} 13
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.004"} 18
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.008"} 32
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.016"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.032"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.064"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.128"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.256"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.512"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="+Inf"} 34
rest_client_request_latency_seconds_sum{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST"} 0.13816789299999999
rest_client_request_latency_seconds_count{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST"} 34
This metric could be useful in terms of security: a spike could indicate unexpected requests on the kubelet API (someone attempting to access the logs or doing an exec). But it's hard to parse, because the request URL is in the label itself (http://127.0.0.1:8080 is the endpoint of the kube-apiserver).
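One way around the URL-in-label problem is to parse the sample and keep only the path of the `url` label as a tag value. A sketch against the exposition line above (the parsing helper here is hypothetical, not the agent's actual prometheus parser):

```python
import re
from urllib.parse import urlparse

# One of the rest_client_request_latency_seconds_bucket samples from above.
SAMPLE = ('rest_client_request_latency_seconds_bucket'
          '{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",'
          'verb="POST",le="0.008"} 32')


def parse_sample(line):
    # Split a prometheus text-format line into (metric name, labels, value).
    name, rest = line.split("{", 1)
    labels_raw, value = rest.rsplit("} ", 1)
    labels = dict(re.findall(r'(\w+)="([^"]*)"', labels_raw))
    return name, labels, float(value)


name, labels, value = parse_sample(SAMPLE)
# Keep only the path, dropping the apiserver host baked into the label.
path = urlparse(labels["url"]).path
```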
The REST counter could also be valuable, to identify an overwhelmed kubelet:
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="DELETE"} 2
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="GET"} 506
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="PATCH"} 173
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="PUT"} 80
rest_client_requests_total{code="201",host="127.0.0.1:8080",method="POST"} 309
rest_client_requests_total{code="404",host="127.0.0.1:8080",method="GET"} 1
rest_client_requests_total{code="409",host="127.0.0.1:8080",method="POST"} 1
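For alerting, these per-(code, method) counters are easier to reason about rolled up into status classes. A small sketch using the values from the sample above:

```python
from collections import Counter

# Per-(code, method) values copied from the rest_client_requests_total
# sample above; the rollup into 2xx/4xx classes is illustrative.
samples = {
    ("200", "DELETE"): 2, ("200", "GET"): 506, ("200", "PATCH"): 173,
    ("200", "PUT"): 80, ("201", "POST"): 309,
    ("404", "GET"): 1, ("409", "POST"): 1,
}

by_class = Counter()
for (code, _method), count in samples.items():
    by_class[code[0] + "xx"] += count  # e.g. "200" -> "2xx"
```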
The following metrics look interesting:
# HELP kubelet_runtime_operations Cumulative number of runtime operations by operation type.
# TYPE kubelet_runtime_operations counter
kubelet_runtime_operations{operation_type="container_status"} 156
kubelet_runtime_operations{operation_type="create_container"} 33
kubelet_runtime_operations{operation_type="exec"} 16
kubelet_runtime_operations{operation_type="exec_sync"} 233
kubelet_runtime_operations{operation_type="image_status"} 73
kubelet_runtime_operations{operation_type="list_containers"} 3800
kubelet_runtime_operations{operation_type="list_images"} 278
kubelet_runtime_operations{operation_type="list_podsandbox"} 3766
kubelet_runtime_operations{operation_type="podsandbox_status"} 146
kubelet_runtime_operations{operation_type="pull_image"} 5
kubelet_runtime_operations{operation_type="remove_container"} 9
kubelet_runtime_operations{operation_type="run_podsandbox"} 28
kubelet_runtime_operations{operation_type="start_container"} 33
kubelet_runtime_operations{operation_type="status"} 398
kubelet_runtime_operations{operation_type="stop_container"} 5
kubelet_runtime_operations{operation_type="stop_podsandbox"} 42
kubelet_runtime_operations{operation_type="version"} 194
# TYPE kubelet_runtime_operations_errors counter
kubelet_runtime_operations_errors{operation_type="container_status"} 1
kubelet_runtime_operations_errors{operation_type="exec_sync"} 6
kubelet_runtime_operations_errors{operation_type="run_podsandbox"} 4
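The two counters pair naturally into a per-operation error ratio. A sketch using the sample values above (the aggregation is illustrative; it is not how the check reports these):

```python
# Per-operation totals and errors, copied from the payloads above.
ops = {"container_status": 156, "exec_sync": 233, "run_podsandbox": 28}
errors = {"container_status": 1, "exec_sync": 6, "run_podsandbox": 4}


def error_ratios(ops, errors):
    # Operations absent from the errors counter simply had no failures.
    return {op: errors.get(op, 0) / float(total)
            for op, total in ops.items() if total}


ratios = error_ratios(ops, errors)
```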
self.metrics_mapper = {
    'kubelet_runtime_operations_errors': 'kubelet.runtime.errors',
}
This is actually collected from `/metrics` and not `/metrics/cadvisor`, so it's been moved to the `metrics_mapper` for kubelet metrics.
# url of the kubelet metrics prometheus endpoint
# Pass an empty string to disable kubelet metrics collection
# kubelet_metrics_endpoint: http://10.8.0.1:10255/metrics
#
It ended up being much more straightforward to turn off kubelet metrics and/or cadvisor metrics separately by setting the option to "" than to try to share `kubelet_api_endpoint` between them. I added a deprecation note to `metrics_endpoint` in favour of the more clearly named `cadvisor_metrics_endpoint`. Does it make sense to log a warning/info if someone is using it?
yes, a deprecation warning would be nice.
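A minimal sketch of that warning, assuming the option names discussed above (this is not the actual check code):

```python
import logging

log = logging.getLogger(__name__)


# Hypothetical helper: resolve the cadvisor endpoint and warn when the
# deprecated metrics_endpoint option is still present in the instance config.
def resolve_cadvisor_metrics_endpoint(instance):
    if "cadvisor_metrics_endpoint" in instance:
        return instance["cadvisor_metrics_endpoint"]
    if "metrics_endpoint" in instance:
        log.warning("metrics_endpoint is deprecated, "
                    "please use cadvisor_metrics_endpoint instead")
        return instance["metrics_endpoint"]
    return None
```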
from .. import AgentCheck
from ...errors import CheckException

-class Scraper(PrometheusScraper):
+class PrometheusScraper(PrometheusScraperMixin):
Anyone object to these renames?
love it 👍
@devonboyer version will be automatically bumped by the same tool generating the changelog, no need to do that manually.
kubelet/metadata.csv (Outdated)
kubernetes.apiserver.certificate.expiration.sum,gauge,,second,,The sum of remaining lifetime on the certificate used to authenticate a request,,1,kubelet,
kubernetes.etcd.cache.entry.count,gauge,,operation,,The number of etcd helper cache entries,,0,kubelet,
kubernetes.etcd.cache.hit.count,gauge,,operation,,The number of etcd helper cache hits,,0,kubelet,
kubernetes.etcd.cache.miss.count,gauge,,operation,,The number of etcd helper cache misses,,0,kubelet,
Did you see this metric change (`kubernetes.etcd.cache*`)? I wasn't able to see it. What is it for?
I just thought it might be something interesting to have. If you don't think it's a relevant metric I can remove it; I don't really know a good use-case for it.
@devonboyer I think we're better off removing them if we're not sure how they're useful.
@DataDog/kenafeh Anyone able to give me a 👍 on this PR?
LGTM, nice rework
What does this PR do?
This PR adds support for scraping kubelet metrics from the kubelet's prometheus endpoint. This involved refactoring the code for scraping the /metrics/cadvisor prometheus endpoint into a separate class. Much of the code is the same, except that the kubelet check now instantiates two different prometheus scrapers instead of inheriting from PrometheusCheck.

Motivation
This endpoint also has some metrics that may be of interest to customers.

Review checklist
- no-changelog label attached

Additional Notes
I would be happy for any feedback on which metrics to gather/ignore for the KubeletPrometheusScraper. Right now, it only gathers a few metrics that looked interesting for the purpose of testing.
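A minimal sketch of the structure this PR describes, with two independent scrapers owned by the check (class and option names here are illustrative, not the actual integration code):

```python
# Hypothetical sketch: the check composes two scrapers instead of
# inheriting from PrometheusCheck.
class Scraper(object):
    def __init__(self, endpoint):
        self.endpoint = endpoint

    def enabled(self):
        # An empty string disables this scraper, per the conf.yaml.example.
        return bool(self.endpoint)


class KubeletCheck(object):
    def __init__(self, instance):
        self.kubelet_scraper = Scraper(instance.get("kubelet_metrics_endpoint", ""))
        self.cadvisor_scraper = Scraper(instance.get("cadvisor_metrics_endpoint", ""))

    def check(self):
        # Scrape whichever endpoints are enabled; return them for illustration.
        return [s.endpoint
                for s in (self.kubelet_scraper, self.cadvisor_scraper)
                if s.enabled()]
```

Composition makes it possible to disable either endpoint independently, which the single-inheritance design could not express cleanly.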