
bump(*): update etcd to 3.2.16 and grpc to 1.7.5 #18731

Merged: 4 commits merged into openshift:master on Feb 27, 2018

Conversation

sttts commented Feb 23, 2018

Replaces #18660 by fixing an upstream bug in the test etcd setup.

Copied from #18660:

Updates etcd from 3.2.8 to 3.2.16 and gRPC to 1.7.5 (the version matching that etcd release).

Fixes: #18496 (data race in grpc lib used by etcd client)

List of interesting changes and gRPC-related changes:

3.2.10: etcd-io/etcd@6d40628: update grpc, grpc-gateway (1.4.2 -> 1.7.3)
3.2.10: etcd-io/etcd@a8c84ff: clientv3: fix client balancer with gRPC v1.7
3.2.10: etcd-io/etcd@939337f: add max requests bytes, keepalive to server, blackhole methods to integration
3.2.10: etcd-io/etcd@8de0c04: Switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.3 (<- concerning?)
3.2.11: etcd-io/etcd@5921b2c: log grpc stream send/recv errors in server-side
3.2.11: etcd-io/etcd@ff1f08c: upgrade grpc/grpc-go to v1.7.4
3.2.12: etcd-io/etcd@e82f055: clientv3: configure gRPC message limits in Config (see the sketch after this list)
3.2.12: etcd-io/etcd@c67e6d5: clientv3: call KV/Txn APIs with default gRPC call options
3.2.12: etcd-io/etcd@348b25f: clientv3: call other APIs with default gRPC call options
3.2.13: etcd-io/etcd@288ef7d: embed: fix gRPC server panic on GracefulStop
3.2.16: etcd-io/etcd@e08abbe: mvcc: restore unsynced watchers
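
As a concrete illustration of the 3.2.12 items above — a minimal sketch assuming the github.com/coreos/etcd import path of that era; the endpoint and limits are invented — the new Config fields bound gRPC message sizes per client:

package main

import (
	"log"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// MaxCallSendMsgSize/MaxCallRecvMsgSize are translated into default
	// gRPC call options applied to KV/Txn and the other client APIs.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{"localhost:2379"},
		MaxCallSendMsgSize: 2 * 1024 * 1024, // 2 MiB outgoing request limit
		MaxCallRecvMsgSize: 4 * 1024 * 1024, // 4 MiB incoming response limit
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}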

@smarterclayton @deads2k @liggitt maybe too late in the 3.9 cycle, but I don't see any high-risk change, and this is a minor version bump that contains plenty of bug fixes.

This was a clean bump: no build errors or panics during server start. There were zero picks/carries on grpc or etcd.

@deads2k I wonder whether we need to add grpc to our glide.yaml... If I don't, and just bump etcd, there are no changes in grpc, only etcd. I was worried that whatever higher-level clients we have that use grpc would use a different version than etcd does.
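
For reference, pinning both together would look roughly like this in glide.yaml (a hypothetical snippet, not origin's actual file):

import:
- package: github.com/coreos/etcd
  version: v3.2.16
- package: google.golang.org/grpc
  version: v1.7.5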

openshift-ci-robot added the size/XXL label Feb 23, 2018
openshift-merge-robot added the vendor-update label Feb 23, 2018
sttts added the dependency/etcd and priority/important-soon labels and removed the vendor-update label Feb 23, 2018
openshift-merge-robot added the vendor-update label Feb 23, 2018
sttts commented Feb 23, 2018

Upstream kubernetes/kubernetes#60299


deads2k commented Feb 23, 2018

@juanvallejo this is highlighting shortcomings in the dep set detection. bolt really shouldn't be considered "ours". We'll never pin it. We have to think of a different rule.

@sttts thanks for the list
@soltysh @mfojtik we need to get that list from people so we can be sure nothing is getting accidentally reverted.

deads2k commented Feb 23, 2018

@soltysh @mfojtik we need to get that list from people so we can be sure nothing is getting accidentally reverted.

And once we get the publisher bot going and the dep set detection stuff completed, I think we'll be in good shape.

liggitt commented Feb 23, 2018

still timing out:

FAIL	github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/master	122.046s

sttts commented Feb 23, 2018

still timing out:

Which Go version do we use? 1.9? I saw a panic deep inside etcd on Go 1.10, with the consequence that the test timed out.

sttts commented Feb 23, 2018

still timing out:

I also dropped @mfojtik's wip timeout commit, so that might have an influence here, independent of my commit.

liggitt commented Feb 23, 2018

yes, go 1.9

liggitt commented Feb 23, 2018

I also dropped @mfojtik's wip timeout commit.

sure, though before this PR that package took ~9 seconds, and with this PR it takes ~450s (timing out at 120s by default)

sttts commented Feb 23, 2018

and with this PR it takes ~450s (timing out at 120s by default)

Ouch, has anybody looked into the reason? Now I understand @mfojtik's 600 in that commit.
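
(For anyone reproducing this locally: the 120s ceiling is the -timeout passed to the test binary, apparently set by origin's test harness, since Go's own default is 10m. Raising it per run, as the wip commit's 600 presumably did, would look like:)

go test -timeout 600s ./vendor/k8s.io/kubernetes/pkg/master/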

liggitt commented Feb 23, 2018

Ouch, has anybody looked into the reason? Now I understand @mfojtik's 600 in that commit.

I got as far as isolating which tests jumped in runtime; I hadn't dug beyond that: #18660 (comment)

sttts commented Feb 23, 2018

I cannot reproduce the long runtimes locally.

What I found in TestLegacyRestStorageStrategies is that the integration etcd server terminated early:

=== RUN   TestLegacyRestStorageStrategies
2018-02-23 13:41:15.895839 I | integration: launching 7927324665964986513 (unix://localhost:79273246659649865130)
2018-02-23 13:41:15.902644 I | etcdserver: name = 7927324665964986513
2018-02-23 13:41:15.902712 I | etcdserver: data dir = /tmp/etcd595091931
2018-02-23 13:41:15.902764 I | etcdserver: member dir = /tmp/etcd595091931/member
2018-02-23 13:41:15.902812 I | etcdserver: heartbeat = 10ms
2018-02-23 13:41:15.902860 I | etcdserver: election = 100ms
2018-02-23 13:41:15.902901 I | etcdserver: snapshot count = 0
2018-02-23 13:41:15.902970 I | etcdserver: advertise client URLs = unix://127.0.0.1:2100215573
2018-02-23 13:41:15.903028 I | etcdserver: initial advertise peer URLs = unix://127.0.0.1:2100115573
2018-02-23 13:41:15.903103 I | etcdserver: initial cluster = 7927324665964986513=unix://127.0.0.1:2100115573
2018-02-23 13:41:15.910571 I | etcdserver: starting member 6acce32c272f622e in cluster fdce2b555e23edf3
2018-02-23 13:41:15.910694 I | raft: 6acce32c272f622e became follower at term 0
2018-02-23 13:41:15.910790 I | raft: newRaft 6acce32c272f622e [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-02-23 13:41:15.910853 I | raft: 6acce32c272f622e became follower at term 1
2018-02-23 13:41:15.921585 W | auth: simple token is not cryptographically signed
2018-02-23 13:41:15.927332 I | etcdserver: set snapshot count to default 100000
2018-02-23 13:41:15.927404 I | etcdserver: starting server... [version: 3.2.16, cluster version: to_be_decided]
2018-02-23 13:41:15.932085 I | integration: launched 7927324665964986513 (unix://localhost:79273246659649865130)
2018-02-23 13:41:15.937529 I | etcdserver/membership: added member 6acce32c272f622e [unix://127.0.0.1:2100115573] to cluster fdce2b555e23edf3
2018-02-23 13:41:16.001594 I | raft: 6acce32c272f622e is starting a new election at term 1
2018-02-23 13:41:16.001718 I | raft: 6acce32c272f622e became candidate at term 2
2018-02-23 13:41:16.001796 I | raft: 6acce32c272f622e received MsgVoteResp from 6acce32c272f622e at term 2
2018-02-23 13:41:16.001878 I | raft: 6acce32c272f622e became leader at term 2
2018-02-23 13:41:16.001945 I | raft: raft.node: 6acce32c272f622e elected leader 6acce32c272f622e at term 2
2018-02-23 13:41:16.014811 I | etcdserver: setting up the initial cluster version to 3.2
2018-02-23 13:41:16.015438 N | etcdserver/membership: set the initial cluster version to 3.2
2018-02-23 13:41:16.018706 I | etcdserver: published {Name:7927324665964986513 ClientURLs:[unix://127.0.0.1:2100215573]} to cluster fdce2b555e23edf3
2018-02-23 13:41:16.022316 I | etcdserver/api: enabled capabilities for version 3.2
2018-02-23 13:41:17.873216 I | integration: terminating 7927324665964986513 (unix://localhost:79273246659649865130)
WARNING: 2018/02/23 13:41:17 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:79273246659649865130: connect: no such file or directory"; Reconnecting to {localhost:79273246659649865130 0  <nil>}
...
panic: test timed out after 2m0s

sttts commented Feb 23, 2018

/retest

sttts commented Feb 23, 2018

Running a test locally in an infinite loop. Maybe this early termination is just a flake.

sttts commented Feb 23, 2018

Running a test locally in an infinite loop. Maybe this early termination is just a flake.

does not help.

mfojtik commented Feb 23, 2018

@sttts I swear I saw a comment in kubernetes claiming that all these etcd messages are "log spam".

sttts commented Feb 24, 2018

Latest finding: the termination of the cluster is the normal deferred Terminate call of the test; i.e., it does not return in time.

Continuing to dig.

sttts commented Feb 24, 2018

This blocks during termination:

mfojtik commented Feb 24, 2018

/retest

mfojtik commented Feb 26, 2018

@sttts the grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:27365759778032236110: connect: no such file or directory"; Reconnecting to {localhost:27365759778032236110 0 <nil>} is weird. If I understand the code under it, we should not see these messages if the grpc connections are marked as "shutdown". It might be that we are forgetting to close something when the server is terminated?

sttts commented Feb 26, 2018

@mfojtik something is definitely blocked on shutdown, leading to those messages. But the messages themselves are not the issue.

mfojtik commented Feb 26, 2018

Update: I added defer func() { go s.Terminate() }() to rule out termination as the cause of the delay, but it barely cut ~10s from the runtime:

# on master branch:
~/.../openshift/origin → go test ./vendor/k8s.io/kubernetes/pkg/master/
ok  	github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/master	87.359s
# on this branch:
~/.../openshift/origin → go test ./vendor/k8s.io/kubernetes/pkg/master/
ok  	github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/master	167.152s
# on this branch with async terminate
~/.../openshift/origin → go test ./vendor/k8s.io/kubernetes/pkg/master/
ok  	github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/master	156.632s

That means the cause is not Terminate, and this needs more debugging :-(

mfojtik commented Feb 26, 2018

@mfojtik something is definitely blocked on shutdown, leading to those messages. But the messages themselves are not the issue.

@sttts +1, the messages seem to be just log spam, but what is interesting is that I would expect them to show the status of the connection as "shutdown", not "connecting"... IOW, when we terminate the etcd server we also terminate all connections, and the etcd server should terminate all gRPC connections. In code, that means the connection status is updated to "shutdown" and we should not see those log messages. The fact that we see them might indicate that something is not right in the termination code. However, even if we make that termination non-blocking in the defer, the time improves by barely ~10s, which also indicates that the problem might be somewhere else.
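
A minimal sketch of the connection-state transitions being described, using grpc-go's connectivity API (GetState was still experimental in the v1.7 series; the address below is arbitrary, since Dial is non-blocking):

package main

import (
	"fmt"

	"google.golang.org/grpc"
)

func main() {
	// Dial returns immediately; the connection starts out IDLE/CONNECTING
	// and the balancer keeps redialing on failure -- the "Reconnecting
	// to ..." warnings above -- until Close() moves it to SHUTDOWN.
	conn, err := grpc.Dial("localhost:12379", grpc.WithInsecure())
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	fmt.Println(conn.GetState()) // IDLE or CONNECTING

	conn.Close()
	fmt.Println(conn.GetState()) // SHUTDOWN: no further redial attempts
}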

deads2k commented Feb 26, 2018

It appears to be the StorageWithCacher. If you're running with that in your unit tests, the destroy func isn't being called and it's not failing nicely.
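
The idiom in question, loosely sketched: StorageWithCacher-style decorators in k8s.io/apiserver hand back the storage together with a destroy callback, and skipping that callback leaves etcd connections open. Every name below is a hypothetical stand-in, not the real API:

package main

import "fmt"

// destroyFunc mirrors the apiserver's factory.DestroyFunc idea: a callback
// that tears down the storage's etcd connections and watches.
type destroyFunc func()

// newCachedStorage stands in for a StorageWithCacher-style decorator.
func newCachedStorage() (storage string, destroy destroyFunc) {
	return "cacher-backed storage", func() {
		fmt.Println("watches closed, etcd connections torn down")
	}
}

func main() {
	storage, destroy := newCachedStorage()
	defer destroy() // the call the failing unit tests were skipping
	fmt.Println("using", storage)
}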

deads2k commented Feb 26, 2018

Try pulling in kubernetes/kubernetes#60430

mfojtik commented Feb 26, 2018

@deads2k tested and verified: this makes the test run in 87s vs. 167s before.

mfojtik commented Feb 26, 2018

/approve no-issue

openshift-ci-robot commented Feb 26, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, sttts


Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label Feb 26, 2018
mfojtik added this to the 3.9.0 milestone Feb 26, 2018
mfojtik commented Feb 26, 2018

all tests are green

/lgtm

openshift-ci-robot added the lgtm label Feb 26, 2018
mfojtik commented Feb 26, 2018

@smarterclayton FYI

openshift-merge-robot commented:

/test all [submit-queue is verifying that this PR is safe to merge]

liggitt commented Feb 27, 2018

/retest

smarterclayton commented:

/retest

smarterclayton commented:

that which is not closed can eternal flake

mfojtik commented Feb 27, 2018

/retest

mfojtik commented Feb 27, 2018

The e2e test failure seems to be a permanent break; I asked @miminar and @legionus to investigate. We can't hold this until it is resolved, so I'm green-button merging this.

Issue: #18762

mfojtik closed this Feb 27, 2018
mfojtik reopened this Feb 27, 2018
mfojtik merged commit 9e7dc61 into openshift:master Feb 27, 2018
openshift-ci-robot commented Feb 27, 2018

@sttts: The following tests failed, say /retest to rerun them all:

Test name Commit Rerun command
ci/openshift-jenkins/extended_networking_minimal 373bdee /test extended_networking_minimal
ci/openshift-jenkins/end_to_end 373bdee /test end_to_end

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

