Backport HPA replica count calculation fixes #18216

RobertKrawitz
Contributor

This PR backports multiple upstream fixes to the HPA logic governing the interaction between the max replica count and the scale-up limit. Without them, the replica count could temporarily exceed the max replica count.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 22, 2018
@openshift-merge-robot openshift-merge-robot added the vendor-update Touching vendor dir or related files label Jan 22, 2018
Fix #53670

Fix a bug where `desiredReplicas` could be greater than `maxReplicas`
when the original value of `desiredReplicas` exceeded `scaleUpLimit` and
`scaleUpLimit` exceeded `maxReplicas`. Previously, when that happened, we
would scale up to `scaleUpLimit`, and then in the next auto-scaling run,
scale down to `maxReplicas`. Address this issue and introduce a regression
test.
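
The ordering problem described above can be illustrated with a minimal sketch. This is not the upstream code: `buggyClamp` and `fixedClamp` are hypothetical names, and the sketch assumes the bug amounts to checking the scale-up limit before (and instead of) the max replica count.

```go
package main

import "fmt"

// buggyClamp is a hypothetical reconstruction of the faulty ordering:
// once desired exceeds scaleUpLimit, maxReplicas is never consulted.
func buggyClamp(desired, scaleUpLimit, maxReplicas int32) int32 {
	switch {
	case desired > scaleUpLimit:
		return scaleUpLimit
	case desired > maxReplicas:
		return maxReplicas
	default:
		return desired
	}
}

// fixedClamp (also hypothetical) first takes the smaller of the two
// upper bounds and then clamps desired against that single bound.
func fixedClamp(desired, scaleUpLimit, maxReplicas int32) int32 {
	upper := scaleUpLimit
	if maxReplicas < upper {
		upper = maxReplicas
	}
	if desired > upper {
		return upper
	}
	return desired
}

func main() {
	// desired > scaleUpLimit and scaleUpLimit > maxReplicas: the failing case.
	var desired, scaleUpLimit, maxReplicas int32 = 20, 8, 5
	fmt.Println(buggyClamp(desired, scaleUpLimit, maxReplicas)) // 8, above maxReplicas
	fmt.Println(fixedClamp(desired, scaleUpLimit, maxReplicas)) // 5
}
```

With the faulty ordering the controller would settle on 8 replicas and only drop to 5 on the next run, which matches the temporary overshoot reported here.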
@RobertKrawitz RobertKrawitz force-pushed the bugs/fix_hpa_replica_count branch 2 times, most recently from 2e90d35 to f68229b Compare January 23, 2018 20:32
@RobertKrawitz
Contributor Author

Resubmitted after testing with resource-consumer, which confirmed that the replica count did not exceed the limit.

There have been a couple of recent bugs in the "normalizing" part of the
`reconcileAutoscaler` method. This part of the code base is responsible
for, among other things, taking the suggested desired replicas based on
the metrics, ensuring it conforms to certain conditions, and updating it
if it does not. Isolate the part that converts the desired replicas
based on a given set of rules into its own function.

We are refactoring this part of the code base to make the logic simpler
and to make it easier to write unit tests.
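
As a rough illustration of why isolating the conversion helps testing, here is a sketch of a pure function that applies the rules and can be exercised with a table of cases. The function name `convertDesiredReplicas`, the reason strings, and the scale-up constants (factor 2, minimum 4) are assumptions made for this example and are not taken from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// Assumed values for illustration only.
const (
	scaleUpLimitFactor  = 2.0
	scaleUpLimitMinimum = 4.0
)

// calculateScaleUpLimit bounds how far a single run may scale up.
func calculateScaleUpLimit(currentReplicas int32) int32 {
	return int32(math.Max(scaleUpLimitFactor*float64(currentReplicas), scaleUpLimitMinimum))
}

// convertDesiredReplicas is a hypothetical pure function: it has no
// controller state, so every rule can be checked with plain inputs.
// It returns the normalized replica count and the rule that applied.
func convertDesiredReplicas(current, desired, minReplicas, maxReplicas int32) (int32, string) {
	upper, upperReason := maxReplicas, "TooManyReplicas"
	if limit := calculateScaleUpLimit(current); limit < upper {
		upper, upperReason = limit, "ScaleUpLimit"
	}
	switch {
	case desired < minReplicas:
		return minReplicas, "TooFewReplicas"
	case desired > upper:
		return upper, upperReason
	default:
		return desired, "DesiredWithinRange"
	}
}

func main() {
	cases := []struct{ current, desired, min, max int32 }{
		{3, 20, 1, 5},  // max replicas is the tighter bound
		{3, 20, 1, 10}, // scale-up limit is the tighter bound
		{3, 0, 2, 10},  // below the minimum
		{3, 4, 1, 10},  // already within range
	}
	for _, c := range cases {
		got, reason := convertDesiredReplicas(c.current, c.desired, c.min, c.max)
		fmt.Printf("current=%d desired=%d -> %d (%s)\n", c.current, c.desired, got, reason)
	}
}
```

Because the function depends only on its arguments, a regression case such as the first row can be added without constructing an autoscaler object or a fake client.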
@DirectXMan12
Contributor

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DirectXMan12, RobertKrawitz

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2018
@RobertKrawitz
Contributor Author

/retest

@RobertKrawitz
Contributor Author

From test log of ci/openshift-jenkins/gcp (https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18216/test_pull_request_origin_extended_conformance_gce/15016/):

• [SLOW TEST:28.129 seconds]
[k8s.io] SchedulerPredicates [Serial]
/tmp/openshift/build-rpm-release/tito/rpmbuild-originM4UNWL/BUILD/origin-3.7.1/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:620
validates that NodeSelector is respected if matching [Conformance] [Suite:openshift/conformance/serial] [Suite:k8s]
/tmp/openshift/build-rpm-release/tito/rpmbuild-originM4UNWL/BUILD/origin-3.7.1/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:296

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSJan 25 17:40:52.381: INFO: Running AfterSuite actions on all node
Jan 25 17:40:52.381: INFO: Running AfterSuite actions on node 1

Ran 5 of 805 Specs in 170.185 seconds
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 800 Skipped Jan 25 17:40:52.387: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec /data/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cluster/log-dump/log-dump.sh: no such file or directory
PASS

Ginkgo ran 1 suite in 2m50.839534058s
Test Suite Passed
[INFO] [CLEANUP] Beginning cleanup routines...
[INFO] [CLEANUP] Dumping cluster events to _output/scripts/conformance/artifacts/events.txt
Logged into "https://internal-api.prtest-5a37c28-15016.origin-ci-int-gce.dev.rhcloud.com:8443" as "system:admin" using existing credentials.

You have access to the following projects and can switch between them with 'oc project <projectname>':

  • default
    kube-public
    kube-system
    logging
    management-infra
    openshift
    openshift-infra
    openshift-node

Using project "default".
[INFO] [CLEANUP] Dumping container logs to _output/scripts/conformance/logs/containers
[INFO] [CLEANUP] Truncating log files over 200M
[INFO] [CLEANUP] Stopping docker containers
[INFO] [CLEANUP] Removing docker containers
[INFO] [CLEANUP] Killing child processes
[INFO] [CLEANUP] Pruning etcd data directory
rm: cannot remove ‘/tmp/etcd’: Operation not permitted
[ERROR] test/extended/conformance.sh exited with code 1 after 00h 14m 12s
make: *** [test-extended] Error 1

This appears to match issue #16917

@RobertKrawitz
Contributor Author

Case /origin-ci-test/pr-logs/directory/test_pull_request_origin_extended_conformance_crio has been failing consistently for others for at least a day, and build 3177 (following my build 3175) failed in the same way mine did: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18286/test_pull_request_origin_extended_conformance_crio/3177/

@RobertKrawitz
Contributor Author

Case /origin-ci-test/pr-logs/directory/test_pull_request_origin_extended_conformance_gce has been failing sporadically; builds 15012 and 15019 failed in the same way mine (15016) did: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18224/test_pull_request_origin_extended_conformance_gce/15012/ and https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18273/test_pull_request_origin_extended_conformance_gce/15019/

@RobertKrawitz
Contributor Author

Case /origin-ci-test/pr-logs/directory/test_pull_request_origin_extended_conformance_install has been failing reliably since 10:30 ET 2018-01-25 with the same error I got, e.g. build 6332 vs. my 6333: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18279/test_pull_request_origin_extended_conformance_install/6332/

@RobertKrawitz
Contributor Author

Case /origin-ci-test/pr-logs/pull/18216/test_pull_request_origin_extended_conformance_install_update has been failing consistently; build 10375 got the same error as my 10376: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18199/test_pull_request_origin_extended_conformance_install_update/10375/

@RobertKrawitz
Contributor Author

This appears to account for all failures reported on my job.

@sjenning
Contributor

Blocking test_pull_request_origin_extended_conformance_install #18294

@RobertKrawitz
Contributor Author

/retest

@sjenning
Contributor

Flake #17901 and a (likely transient) Fedora 25 mirror failure.
/retest

@sjenning
Contributor

/retest

@sjenning
Contributor

opened new flake #18306 for GCP auth issue

@RobertKrawitz
Contributor Author

/retest

@RobertKrawitz
Contributor Author

/retest

@RobertKrawitz
Contributor Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit b59a547 into openshift:release-3.7 Jan 28, 2018
@RobertKrawitz RobertKrawitz deleted the bugs/fix_hpa_replica_count branch January 29, 2018 15:30
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. vendor-update Touching vendor dir or related files
7 participants