
Add unremovable_nodes_count metric #3690

Merged: 1 commit merged into kubernetes:master on Feb 17, 2021

Conversation

@evgenii-petrov-arrival (Contributor)

I want to set up alerting for cases where nodes are unremovable due to user error, such as a missing "safe to deschedule" annotation, so this pull request adds a Gauge sliced by fmt.Sprintf("%s", simulator.UnremovableReason).

I wasn't sure if there is any concurrent access involved, so I added a mutex for the counter map; please let me know if it is unnecessary.

I also wasn't sure how to test this; any suggestions are welcome.
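A minimal sketch of what a reason-labelled gauge like the one the description mentions can look like, written against the plain prometheus/client_golang API; the namespace, metric name, and helper function below are illustrative stand-ins rather than the autoscaler's actual metrics code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// unremovableNodesCount is a gauge sliced by the string form of the
// unremovable reason, so each reason gets its own time series.
var unremovableNodesCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler", // assumed namespace, for illustration
		Name:      "unremovable_nodes_count",
		Help:      "Number of nodes currently considered unremovable, by reason.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(unremovableNodesCount)
}

// UpdateUnremovableNodesCount overwrites the gauge with the latest
// per-reason counts computed during the current scale-down loop.
func UpdateUnremovableNodesCount(counts map[string]int) {
	for reason, count := range counts {
		unremovableNodesCount.WithLabelValues(reason).Set(float64(count))
	}
}
```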

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Nov 13, 2020
@k8s-ci-robot (Contributor)

Welcome @evgenii-petrov-arrival!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Nov 13, 2020
@mwielgus (Contributor) left a comment

A related test is failing:

--- FAIL: TestFindUnneededNodes (0.02s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3b7791f]

Please fix your code.

@evgenii-petrov-arrival (Contributor, Author)

> Please fix your code.

Done.

Sorry for not running gofmt/golint/tests before creating the pull request. I hadn't figured out how vendoring is structured in the repo and was relying on CI to catch any errors while I figured that out.

@evgenii-petrov-arrival (Contributor, Author)

@mwielgus, as far as I can tell there is no way for me to mark your request for changes as addressed, so, just in case, I'll repeat the request for re-review as a comment.

I think I've addressed your request for changes and the tests are passing now.

cluster-autoscaler/core/scale_down.go: review thread (outdated, resolved)
@MaciekPytel (Contributor)

It seems to me that this is duplicating the logic for gathering unremovable reasons that we already have for ScaleDownStatusProcessor. I think implementing this as a processor should be preferred, to avoid duplication, keep the aggregation logic encapsulated, and avoid bloating the scale-down logic even more.

@evgenii-petrov-arrival (Contributor, Author)

> It seems to me that this is duplicating the logic for gathering unremovable reasons that we already have for ScaleDownStatusProcessor.

I've looked at ScaleDownStatusProcessor after your comment (I wasn't aware of it before), and I don't think that adding a processor here would reduce complexity. Currently, NoOpScaleDownStatusProcessor is the default ScaleDownStatusProcessor, and changing from NoOp to SomeOp seems to be a much larger change than adding a metric next to another similar metric.

That said, moving both metrics.UpdateUnneededNodesCount and metrics.UpdateUnremovableNodesCount to ScaleDownStatusProcessor seems like a good idea, but that refactoring is a much larger change than this one and, I think, shouldn't block the addition of this feature.

@evgenii-petrov-arrival (Contributor, Author)

@mwielgus , is there something I can do to advance this pull request through the review?

@towca (Collaborator) commented Jan 15, 2021

Adding a processor only for updating one metric does seem like a bit of overkill to me. This PR isn't actually duplicating any existing logic, since we don't pivot the unremovableNodeReasons map into per-reason counters anywhere. That said, I see two major problems with the proposed approach.

First of all, unremovable reasons are also set in TryToScaleDown(), which is called after UpdateUnneededNodes(). If you set the metric at the end of UpdateUnneededNodes(), you'll miss all of the reasons found in TryToScaleDown(). I think a good place to update this metric is just after TryToScaleDown() is called in RunOnce().

Secondly, I think you're missing the fact that the unremovableNodeReasons map is cleared every loop, at the beginning of the scale-down logic in RunOnce() - in scaleDown.CleanUp(). So you don't need to keep track of how many reasons were added in the current loop in another structure. I'd recommend adding a getUnremovableNodeReasonCounters() map[simulator.UnremovableReason]int method to ScaleDown, which would go through all nodes in unremovableNodeReasons and count how many times each reason occurs. Then you can just pass its result to the function which updates the metric.
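A minimal, self-contained sketch of the counting helper described above, using hypothetical stand-in types; the real simulator.UnremovableReason values and the layout of the unremovableNodeReasons field on ScaleDown may differ:

```go
package main

import "fmt"

// UnremovableReason stands in for simulator.UnremovableReason; the constants
// below are illustrative, not the autoscaler's real enum values.
type UnremovableReason string

const (
	NotAutoscaled               UnremovableReason = "NotAutoscaled"
	ScaleDownDisabledAnnotation UnremovableReason = "ScaleDownDisabledAnnotation"
)

// countUnremovableReasons pivots a per-node reason map (analogous to the
// unremovableNodeReasons map discussed above, which is rebuilt every loop)
// into per-reason counters that can be written into the metric in one pass.
func countUnremovableReasons(reasonsByNode map[string]UnremovableReason) map[UnremovableReason]int {
	counters := make(map[UnremovableReason]int)
	for _, reason := range reasonsByNode {
		counters[reason]++
	}
	return counters
}

func main() {
	reasons := map[string]UnremovableReason{
		"node-a": ScaleDownDisabledAnnotation,
		"node-b": ScaleDownDisabledAnnotation,
		"node-c": NotAutoscaled,
	}
	for reason, count := range countUnremovableReasons(reasons) {
		fmt.Printf("%s: %d\n", reason, count)
	}
}
```

Since unremovableNodeReasons is only reset at the start of each loop, calling a helper like this once per iteration, right after TryToScaleDown() as suggested, would capture reasons recorded by both UpdateUnneededNodes() and TryToScaleDown().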

@evgenii-petrov-arrival (Contributor, Author)

Thanks @towca, both very good points. I think I've addressed them now; please take a look.

@towca (Collaborator) commented Feb 12, 2021

Sorry for missing your reply, thanks for addressing my comments!

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Feb 12, 2021
@mwielgus (Contributor)

Please squash commits to just 1-2 and we are good to merge.

@k8s-ci-robot removed the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Feb 12, 2021
@evgenii-petrov-arrival (Contributor, Author)

Squashed to one commit, please take another look.

@towca (Collaborator) commented Feb 15, 2021

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Feb 15, 2021
@mwielgus (Contributor) left a comment

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evgenii-petrov-arrival, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 17, 2021
@k8s-ci-robot merged commit 1fc6705 into kubernetes:master on Feb 17, 2021