Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for gathering metrics from prometheus endpoint for the kubelet itself. #1581

Merged
merged 6 commits into from
May 25, 2018

Conversation

devonboyer
Copy link
Contributor

@devonboyer devonboyer commented May 18, 2018

What does this PR do?

This PR adds support for scraping metrics from prometheus endpoint for kubelet metrics.

This involved refactoring the code for scraping the /metrics/cadvisor prometheus endpoint into a separate class. Much of the code is the same except that the kubelet check now instantiates two different prometheus scrapers instead of inheriting from PrometheusCheck.

screen shot 2018-05-18 at 5 56 48 pm

Motivation

This endpoint also has some metrics that may be of interest to customers.

Review checklist

  • PR has a meaningful title or PR has the no-changelog label attached
  • Feature or bugfix has tests
  • Git history is clean
  • If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo

Additional Notes

I would be happy for any feedback on which metrics to gather/ignore for the KubeletPrometheusScraper. Right now, it only gathers a few metrics that looked interesting for the purpose of testing.

@devonboyer devonboyer requested a review from a team May 18, 2018 18:09
@@ -7,7 +7,13 @@
from .. import AgentCheck
from ...errors import CheckException


class Scraper(PrometheusScraper):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this class is admittedly bad and I'd take any suggestions for renaming. Maybe rename the mixin to PrometheusScraperMixin and also have PrometheusScraper as the standalone version?

Copy link
Contributor

@mfpierre mfpierre May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your suggestion sounds good to me, you'll need to update a few things (check base,exports, Istio check) for renaming.

I'd say it'd be nice to have but as it's in the prometheus module it's not shocking to be named just Scraper.

@devonboyer devonboyer force-pushed the devon/prometheus-kubelet-check branch from b2dc90e to cac5d98 Compare May 18, 2018 18:12
@devonboyer devonboyer changed the title Add support for scraping metrics from prometheus endpoint for kubelet metrics. [Added] Support for gathering metrics from prometheus endpoint for the kubelet itself. May 18, 2018
@rishabh rishabh added this to the 6.3 milestone May 18, 2018
# Pass an empty string, or set the cadvisor_port option to disable
# prometheus metrics collection
# prometheus cadvisor metrics collection
# metrics_endpoint: http://10.8.0.1:10255/metrics/cadvisor
#
Copy link
Contributor Author

@devonboyer devonboyer May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make more sense to have a single config option for kubelet_api_endpoint:http://10.8.0.1:10255 and a collect_cadvisor_metrics flag or have separate options for kubelet_metrics_endpoint:http://10.8.0.1:10255/metrics/cadvisor and cadvisor_metrics_endpoint:http://10.8.0.1:10255/metrics?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like the first option better, but it won't be easy to keep retro-compatibility. The second option sounds simpler. We can update the yaml with metrics_endpoint --> cadvisor_endpoint and add kubelet_endpoint, and fallback to metrics_endpoint in the check if no cadvisor_endpoint is found in the config object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO the best woud be to have only kubelet_api_endpoint and then we can guess /metrics and /metrics/cadvisor because there's no way to change the path in the kubelet.
So I guess we can keep the current metrics_endpoint parse it and add paths if needed

@@ -70,39 +70,9 @@ def __init__(self, name, init_config, agentConfig, instances=None):
self.cadvisor_legacy_port = inst.get('cadvisor_port', CADVISOR_DEFAULT_PORT)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like self.kube_node_labels is not used anywhere in this check 🤔

Copy link
Contributor

@mfpierre mfpierre May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it's a left over of https://github.com/DataDog/integrations-core/blob/master/kubernetes/conf.yaml.example#L140-L146
@hkaj should it be implemented or dropped?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masci masci changed the title [Added] Support for gathering metrics from prometheus endpoint for the kubelet itself. Support for gathering metrics from prometheus endpoint for the kubelet itself. May 19, 2018
'etcd_helper_cache_entry_count': 'kubelet.etcd.cache.entry.count',
'etcd_helper_cache_hit_count': 'kubelet.etcd.cache.hit.count',
'etcd_helper_cache_miss_count': 'kubelet.etcd.cache.miss.count',
'go_goroutines': 'kubelet.goroutines',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @JulienBalestra for additional useful metrics

Copy link
Contributor

@mfpierre mfpierre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devonboyer the switch to a regular check with two scrapers LGTM 👍

don't forget to update the metadata.csv for the metrics you'll be adding, bump the check version and update the changelog 📝

'kubernetes.kubelet.etcd.cache.entry.count',
'kubernetes.kubelet.etcd.cache.hit.count',
'kubernetes.kubelet.etcd.cache.miss.count',
'kubernetes.kubelet.goroutines',
Copy link
Contributor

@JulienBalestra JulienBalestra May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some metrics like:

  • apiserver_client_certificate_expiration_seconds

And what do you think of:

rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.001"} 2
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.002"} 13
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.004"} 18
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.008"} 32
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.016"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.032"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.064"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.128"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.256"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="0.512"} 34
rest_client_request_latency_seconds_bucket{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST",le="+Inf"} 34
rest_client_request_latency_seconds_sum{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST"} 0.13816789299999999
rest_client_request_latency_seconds_count{url="http://127.0.0.1:8080/apis/authentication.k8s.io/v1beta1/tokenreviews",verb="POST"} 34

This metric could be useful in term of security.
A spike could indicate some unexpected requests on kubelet API (someone tempting to access to the logs or doing an exec).
But it's hard to parse because the requests is in the label itself (http://127.0.0.1:8080 is the endpoint of the kube-apiserver)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The REST counter could be also valuable, to identify any overwhelming kubelet:

# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="DELETE"} 2
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="GET"} 506
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="PATCH"} 173
rest_client_requests_total{code="200",host="127.0.0.1:8080",method="PUT"} 80
rest_client_requests_total{code="201",host="127.0.0.1:8080",method="POST"} 309
rest_client_requests_total{code="404",host="127.0.0.1:8080",method="GET"} 1
rest_client_requests_total{code="409",host="127.0.0.1:8080",method="POST"} 1

Copy link
Contributor

@JulienBalestra JulienBalestra May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following metrics looks like being interesting:

# HELP kubelet_runtime_operations Cumulative number of runtime operations by operation type.
# TYPE kubelet_runtime_operations counter
kubelet_runtime_operations{operation_type="container_status"} 156
kubelet_runtime_operations{operation_type="create_container"} 33
kubelet_runtime_operations{operation_type="exec"} 16
kubelet_runtime_operations{operation_type="exec_sync"} 233
kubelet_runtime_operations{operation_type="image_status"} 73
kubelet_runtime_operations{operation_type="list_containers"} 3800
kubelet_runtime_operations{operation_type="list_images"} 278
kubelet_runtime_operations{operation_type="list_podsandbox"} 3766
kubelet_runtime_operations{operation_type="podsandbox_status"} 146
kubelet_runtime_operations{operation_type="pull_image"} 5
kubelet_runtime_operations{operation_type="remove_container"} 9
kubelet_runtime_operations{operation_type="run_podsandbox"} 28
kubelet_runtime_operations{operation_type="start_container"} 33
kubelet_runtime_operations{operation_type="status"} 398
kubelet_runtime_operations{operation_type="stop_container"} 5
kubelet_runtime_operations{operation_type="stop_podsandbox"} 42
kubelet_runtime_operations{operation_type="version"} 194
# TYPE kubelet_runtime_operations_errors counter
kubelet_runtime_operations_errors{operation_type="container_status"} 1
kubelet_runtime_operations_errors{operation_type="exec_sync"} 6
kubelet_runtime_operations_errors{operation_type="run_podsandbox"} 4

self.metrics_mapper = {
'kubelet_runtime_operations_errors': 'kubelet.runtime.errors',
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually collected from /metrics and not /metrics/cadvisor so its been moved to the metrics_mapper for kubelet metrics.

# url of the kubelet metrics prometheus endpoint
# Pass an empty string to disable kubelet metrics collection
# kubelet_metrics_endpoint: http://10.8.0.1:10255/metrics
#
Copy link
Contributor Author

@devonboyer devonboyer May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ended up being much more straightforward to be able to turn off kublet metrics and/or cadvisor metrics separately by setting the option to "" than trying to share kubelet_api_endpoint between them.

I added a deprecated note to metrics_endpoint in favour of the more clearly named cadvisor_metrics_endpoint. Does it make sense to log a warning/info if someone is using it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, a deprecation warning would be nice.


from .. import AgentCheck
from ...errors import CheckException

class Scraper(PrometheusScraper):

class PrometheusScraper(PrometheusScraperMixin):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyone object to these renames?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it 👍

@devonboyer
Copy link
Contributor Author

devonboyer commented May 22, 2018

@mfpierre I think the changelog updates are generated automatically now at release time using the PR labels and title. Does check version need to be bumped manually still @masci?

@masci
Copy link
Contributor

masci commented May 23, 2018

@devonboyer version will be automatically bumped by the same tool generating the changelog, no need to do that manually

kubernetes.apiserver.certificate.expiration.sum,gauge,,second,,The sum of remaining lifetime on the certificate used to authenticate a request,,1,kubelet,
kubernetes.etcd.cache.entry.count,gauge,,operation,,The number of etcd helper cache entries,,0,kubelet,
kubernetes.etcd.cache.hit.count,gauge,,operation,,The number of etcd helper cache hits,,0,kubelet,
kubernetes.etcd.cache.miss.count,gauge,,operation,,The number of etcd helper cache misses,,0,kubelet,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see this metric change ? (kubernetes.etcd.cache*) I wasn't able to see it. What's for ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought it might be something interesting to have. If you don't think its a relevant metric I can remove it, I don't really know a good use-case for having it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devonboyer I think we're better of removing them if we're not sure on how they're useful

@devonboyer
Copy link
Contributor Author

@DataDog/kenafeh Anyone able to give me a 👍on this PR?

Copy link
Contributor

@mfpierre mfpierre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice rework :shipit:

@devonboyer devonboyer merged commit e3985fe into master May 25, 2018
@devonboyer devonboyer deleted the devon/prometheus-kubelet-check branch May 25, 2018 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants