Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prometheus metrics to CSI external-provisioner #386

Closed
wants to merge 2 commits into from

Conversation

saad-ali
Copy link
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds 3 metrics to the external-provisioner for the CreateVolume and DeleteVolume CSI calls:

  • csi_external_provisioner_operations_count counter
  • csi_external_provisioner_operations_seconds histogram
  • csi_external_provisioner_errors_count counter

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add prometheus metrics to CSI external-provisioner 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 16, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 16, 2019
@saad-ali
Copy link
Member Author

/assign @msau42
CC @verult

pmm.csiErrorMetric.WithLabelValues(
driverName, operationName).Inc()
} else {
// Observe duration for successful operations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should record duration for failed operations too. Would it be strange to combine error code and latency into one metric?

Copy link
Member Author

@saad-ali saad-ali Dec 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Will keep latency separate from count, but will also add error codes as a label.

driverName, operationName).Inc()

if operationFailed {
// Observe error count metric in case of error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log error code too so that we can distinguish between configuration errors vs system errors. Check out kubernetes/kubernetes#75750

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made grpc error code a dimension. And dropped separate error counter

CSIDeleteVolumeOperationName = "DeleteVolume"

// Common metric strings
subsystem = "csi_external_provisioner"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about some potential state in the future where we combine all sidecars into one process, should we have a more generic name like "csi_sidecar" as the component? We can figure out which sidecar it is based on the operation name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@@ -71,6 +74,8 @@ var (
featureGates map[string]bool
provisionController *controller.ProvisionController
version = "unknown"
metricsEndpoint = "/metrics"
metricsPort = "8080"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the port configurable, considering that we will eventually want all sidecars to export metrics?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Can't this clash with existing, used ports?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made endpoint (including port) and path both configurable parameters.

@msau42
Copy link
Collaborator

msau42 commented Dec 16, 2019

cc @logicalhan

Copy link

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does versioning work for this component? Will it follow along the kubernetes versions?

@@ -71,6 +74,8 @@ var (
featureGates map[string]bool
provisionController *controller.ProvisionController
version = "unknown"
metricsEndpoint = "/metrics"
metricsPort = "8080"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Can't this clash with existing, used ports?

labelCSIOperationName = "operation_name"

// CSI Operation Total - Count Metric
operationsMetricName = "operations_count"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan wdyt of combining both latency and error metrics into one histogram? The dimensions would be:

  • plugin name
  • operation
  • status

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that sounds okay to me. What is the cardinality of plugin name and operation look like, respectively?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a single cluster, plugin_name would be 1 for the most part. Across an entire fleet, there could be 5-10.

Operation name would probably be around 10-15.

Status would probably be around 5.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed _count to _total. Reduced from 3 metrics to 2: one counter and one histogram. Both have status as a dimension.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan do you see benefit to having both a counter and a histogram? The histogram dimensions contains everything the counter has

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Histograms actually have a counter metric embedded into them, so having another counter is probably unnecessary.

rep, err = p.csiClient.CreateVolume(ctx, &req)

p.metricsManager.RecordMetrics(
Copy link
Collaborator

@msau42 msau42 Dec 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of adding the metrics in the common grpc handler, so that we automatically record metrics for every single csi call?

https://github.com/kubernetes-csi/external-provisioner/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L179

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll work on another PR to do this. If I can finish that in the next couple of days, we can abandon this PR. Otherwise, we can merge this to unblock the larger effort.

@saad-ali
Copy link
Member Author

How does versioning work for this component? Will it follow along the kubernetes versions?

Versioning for side-car containers is independent of Kubernetes versions. See https://kubernetes-csi.github.io/docs/external-provisioner.html for example.

@saad-ali
Copy link
Member Author

Comments addressed. PTAL. I'll work on moving this to kubernetes-csi/csi-lib-utils/connection/connection.go so we can reuse it from all sidecars with minimal work.

@@ -68,6 +71,9 @@ var (
leaderElectionNamespace = flag.String("leader-election-namespace", "", "Namespace where the leader election resource lives. Defaults to the pod namespace if not set.")
strictTopology = flag.Bool("strict-topology", false, "Passes only selected node topology to CreateVolume Request, unlike default behavior of passing aggregated cluster topologies that match with topology keys of the selected node.")

metricsAddress = flag.String("metrics-address", ":8080", "The TCP network address address where the prometheus metrics endpoint will listen. Default is ':8080'.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should make this opt-in instead of default, in case there are other containers in the pod listening on the same address?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably good to make it opt-in in general. Changing.

// Common metric strings
subsystem = "csi_sidecar"
labelCSIDriverName = "driver_name"
labelCSIOperationName = "operation_name"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

operation name or csi method? (ie ControllerPublish, etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer operation over method. But let me know if you feel strongly can change.

labelCSIOperationName = "operation_name"

// CSI Operation Total - Count Metric
operationsMetricName = "operations_count"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan do you see benefit to having both a counter and a histogram? The histogram dimensions contains everything the counter has

@saad-ali
Copy link
Member Author

Feedback addressed PTAL

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Dec 18, 2019

@saad-ali: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-csi-external-provisioner-unit 7266237 link /test pull-kubernetes-csi-external-provisioner-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@saad-ali
Copy link
Member Author

This PR is superseded by #388 which uses the new CSI metrics library kubernetes-csi/csi-lib-utils#35

@saad-ali saad-ali closed this Dec 24, 2019
kbsonlong pushed a commit to kbsonlong/external-provisioner that referenced this pull request Dec 29, 2023
add dependabot github action for auto dependency update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants