
Add unremovable_nodes_count metric #3690

Merged: 1 commit merged into kubernetes:master on Feb 17, 2021

Conversation

@evgenii-petrov-arrival (Contributor)

I want to set up alerting for cases where nodes are unremovable due to user error, such as a missing "safe to deschedule" annotation, so this pull request adds a Gauge sliced by fmt.Sprintf("%s", simulator.UnremovableReason).

I wasn't sure if there is any concurrent access involved, so I added a mutex for the counter map; please let me know if it is unnecessary.

I also wasn't sure how to test this; any suggestions are welcome.
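A minimal sketch of what a reason-labelled gauge like the one the description mentions can look like, written against the plain prometheus/client_golang API; the namespace, metric name, and helper function below are illustrative stand-ins rather than the autoscaler's actual metrics code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// unremovableNodesCount is a gauge sliced by the string form of the
// unremovable reason, so each reason gets its own time series.
var unremovableNodesCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler", // assumed namespace, for illustration
		Name:      "unremovable_nodes_count",
		Help:      "Number of nodes currently considered unremovable, by reason.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(unremovableNodesCount)
}

// UpdateUnremovableNodesCount overwrites the gauge with the latest
// per-reason counts computed during the current scale-down loop.
func UpdateUnremovableNodesCount(counts map[string]int) {
	for reason, count := range counts {
		unremovableNodesCount.WithLabelValues(reason).Set(float64(count))
	}
}
```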

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Nov 13, 2020
@k8s-ci-robot (Contributor)

Welcome @evgenii-petrov-arrival!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Nov 13, 2020
@mwielgus (Contributor) left a comment

A related test is failing:

--- FAIL: TestFindUnneededNodes (0.02s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3b7791f]

Please fix your code.

@evgenii-petrov-arrival (Contributor, Author)

> Please fix your code.

Done.

Sorry for not running gofmt/golint/tests before creating the pull request. I hadn't figured out how vendoring is structured in the repo and was relying on CI to catch any errors while I figured that out.

@evgenii-petrov-arrival (Contributor, Author)

@mwielgus, as far as I can tell there is no way for me to mark your request for changes as addressed, so, just in case, I'll repeat the request for re-review as a comment.

I think I've addressed your request for changes and the tests are passing now.

cluster-autoscaler/core/scale_down.go: review thread (outdated, resolved)
@MaciekPytel (Contributor)

It seems to me that this is duplicating the logic for gathering unremovable reasons that we already have for ScaleDownStatusProcessor. I think implementing this as a processor should be preferred, to avoid duplication, keep the aggregation logic encapsulated, and avoid bloating the scale-down logic even more.

@evgenii-petrov-arrival (Contributor, Author)

> It seems to me that this is duplicating the logic for gathering unremovable reasons that we already have for ScaleDownStatusProcessor.

I've looked at ScaleDownStatusProcessor after your comment (I wasn't aware of it before), and I don't think that adding a processor here would reduce complexity. Currently, NoOpScaleDownStatusProcessor is the default ScaleDownStatusProcessor, and changing from NoOp to SomeOp seems to be a much larger change than adding a metric next to another similar metric.

That said, moving both metrics.UpdateUnneededNodesCount and metrics.UpdateUnremovableNodesCount to ScaleDownStatusProcessor seems like a good idea, but that refactoring is a much larger change than this one and, I think, shouldn't block the addition of this feature.

@evgenii-petrov-arrival (Contributor, Author)

@mwielgus , is there something I can do to advance this pull request through the review?

@towca (Collaborator) commented Jan 15, 2021

Adding a processor only for updating one metric does seem like a bit of overkill to me. This PR isn't actually duplicating any existing logic, since we don't pivot the unremovableNodeReasons map into per-reason counters anywhere. That said, I see two major problems with the proposed approach.

First of all, unremovable reasons are also set in TryToScaleDown(), which is called after UpdateUnneededNodes(). If you set the metric at the end of UpdateUnneededNodes(), you'll miss all of the reasons found in TryToScaleDown(). I think a good place to update this metric is just after TryToScaleDown() is called in RunOnce().

Secondly, I think you're missing the fact that the unremovableNodeReasons map is cleared every loop, at the beginning of the scale-down logic in RunOnce() - in scaleDown.CleanUp(). So you don't need to keep track of how many reasons were added in the current loop in another structure. I'd recommend adding a getUnremovableNodeReasonCounters() map[simulator.UnremovableReason]int method to ScaleDown, which would go through all nodes in unremovableNodeReasons and count how many times each reason occurs. Then you can just pass its result to the function which updates the metric.
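A minimal, self-contained sketch of the counting helper described above, using hypothetical stand-in types; the real simulator.UnremovableReason values and the layout of the unremovableNodeReasons field on ScaleDown may differ:

```go
package main

import "fmt"

// UnremovableReason stands in for simulator.UnremovableReason; the constants
// below are illustrative, not the autoscaler's real enum values.
type UnremovableReason string

const (
	NotAutoscaled               UnremovableReason = "NotAutoscaled"
	ScaleDownDisabledAnnotation UnremovableReason = "ScaleDownDisabledAnnotation"
)

// countUnremovableReasons pivots a per-node reason map (analogous to the
// unremovableNodeReasons map discussed above, which is rebuilt every loop)
// into per-reason counters that can be written into the metric in one pass.
func countUnremovableReasons(reasonsByNode map[string]UnremovableReason) map[UnremovableReason]int {
	counters := make(map[UnremovableReason]int)
	for _, reason := range reasonsByNode {
		counters[reason]++
	}
	return counters
}

func main() {
	reasons := map[string]UnremovableReason{
		"node-a": ScaleDownDisabledAnnotation,
		"node-b": ScaleDownDisabledAnnotation,
		"node-c": NotAutoscaled,
	}
	for reason, count := range countUnremovableReasons(reasons) {
		fmt.Printf("%s: %d\n", reason, count)
	}
}
```

Since unremovableNodeReasons is only reset at the start of each loop, calling a helper like this once per iteration, right after TryToScaleDown() as suggested, would capture reasons recorded by both UpdateUnneededNodes() and TryToScaleDown().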

@evgenii-petrov-arrival (Contributor, Author)

Thanks @towca, both very good points. I think I've addressed them now; please take a look.

@towca (Collaborator) commented Feb 12, 2021

Sorry for missing your reply, thanks for addressing my comments!

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Feb 12, 2021
@mwielgus (Contributor)

Please squash commits to just 1-2 and we are good to merge.

@k8s-ci-robot removed the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Feb 12, 2021
@evgenii-petrov-arrival (Contributor, Author)

Squashed to one commit, please take another look.

@towca (Collaborator) commented Feb 15, 2021

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Feb 15, 2021
@mwielgus (Contributor) left a comment

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evgenii-petrov-arrival, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 17, 2021
@k8s-ci-robot merged commit 1fc6705 into kubernetes:master on Feb 17, 2021