(error "EOF", ServerName "") error on etcd servers #10587

Closed
jcrowthe opened this issue Mar 26, 2019 · 14 comments

@jcrowthe

Environment

Server version: 3.3.10
Client version: 3.3.10

Issue

Running

etcdctl --endpoints "https://10.0.0.13:2379,https://10.0.0.11:2379,https://10.0.0.12:2379" member list

will cause etcd servers to log errors. However, when adding spaces after the commas, such as

etcdctl --endpoints "https://10.0.0.13:2379, https://10.0.0.11:2379, https://10.0.0.12:2379" member list

no errors will be logged on the etcd servers. This is reproducible both when the --endpoints flag is specified inline and when the endpoints are set via the ETCDCTL_ENDPOINTS environment variable.

This issue was discovered while upgrading from 3.2.24 to 3.3.10; we did not see it with 3.2.24. Note also that these etcdctl member list commands return successfully with correct data. This is purely a logging issue.
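For what it is worth, the two invocations do hand the client different endpoint strings: splitting the comma-separated list without trimming leaves a leading space on every endpoint after the first. The snippet below only illustrates that difference with a plain shell split; it is not etcdctl's actual parsing code, and the variable names are made up for the example.

NO_SPACES="https://10.0.0.13:2379,https://10.0.0.11:2379,https://10.0.0.12:2379"
WITH_SPACES="https://10.0.0.13:2379, https://10.0.0.11:2379, https://10.0.0.12:2379"

for list in "$NO_SPACES" "$WITH_SPACES"; do
  echo "--- $list"
  # Split on commas without trimming, then print each entry in brackets.
  IFS=',' read -ra eps <<< "$list"
  for e in "${eps[@]}"; do
    printf '[%s]\n' "$e"   # entries after the first keep a leading space in the second list
  done
done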

Example error logs:

Mar 26 20:53:34 vm-etcd-1 etcd[27805]: rejected connection from "10.0.0.11:57684" (error "EOF", ServerName "")
Mar 26 20:53:38 vm-etcd-1 etcd[27805]: rejected connection from "10.0.0.11:57708" (error "EOF", ServerName "")

Full example command for testing:

ETCDCTL_API=3 \
ETCDCTL_ENDPOINTS="https://10.0.0.13:2379,https://10.0.0.11:2379,https://10.0.0.12:2379" \
ETCDCTL_CACERT="/etc/kubernetes/certs/kubernetes_ca.pem" \
ETCDCTL_KEY="/etc/kubernetes/certs/etcd_server.key" \
ETCDCTL_CERT="/etc/kubernetes/certs/etcd_server.pem" \
etcdctl member list

Relevant other issues: #10040 and #10391 were both closed as duplicates of #9949, however #9949 does not appear to be related to this particular issue.

@johncming
Contributor

This is a problem with the Go TLS standard library (see issue 13523).
It happens on the master version too.

Related issues:
#8534
#8798
#8803

@gyuho

@TheKangaroo
Contributor

I can confirm that the rejected connection log entries do not appear when I use the --endpoints flag with spaces between the endpoints. But with the ETCDCTL_ENDPOINTS env var, I still get the rejected connection log entries even with spaces between the endpoints. My env var is set like this:

ETCDCTL_ENDPOINTS="https://k8s-etcd-00.example.com:2379, https://k8s-etcd-01.example.com:2379, https://k8s-etcd-02.example.com:2379"

@jejer

jejer commented Oct 23, 2019

I can confirm the same issue.

[root@justin-cwes-03 ~]# etcdctl version
etcdctl version: 3.3.15
API version: 3.3

@jejer

jejer commented Oct 23, 2019

It seems etcdctl does not use the correct server IP to validate the server certificate; it always uses the first endpoint's IP to validate all servers.

[root@justin-cwes-03 ~]# etcdctl --endpoints="https://192.16.1.18:2379,https://192.16.1.25:2379,https://192.16.1.17:2379" get /xxx --debug
ETCDCTL_CACERT=/etc/etcd/ssl/ca.pem
ETCDCTL_CERT=/etc/etcd/ssl/etcd-client.pem
ETCDCTL_COMMAND_TIMEOUT=5s
ETCDCTL_DEBUG=true
ETCDCTL_DIAL_TIMEOUT=2s
ETCDCTL_DISCOVERY_SRV=
ETCDCTL_ENDPOINTS=[https://192.16.1.18:2379,https://192.16.1.25:2379,https://192.16.1.17:2379]
ETCDCTL_HEX=false
ETCDCTL_INSECURE_DISCOVERY=true
ETCDCTL_INSECURE_SKIP_TLS_VERIFY=false
ETCDCTL_INSECURE_TRANSPORT=true
ETCDCTL_KEEPALIVE_TIME=2s
ETCDCTL_KEEPALIVE_TIMEOUT=6s
ETCDCTL_KEY=/etc/etcd/ssl/etcd-client-key.pem
ETCDCTL_USER=
ETCDCTL_WRITE_OUT=simple
WARNING: 2019/10/23 06:15:03 Adjusting keepalive ping interval to minimum period of 10s
WARNING: 2019/10/23 06:15:03 Adjusting keepalive ping interval to minimum period of 10s
INFO: 2019/10/23 06:15:03 parsed scheme: "endpoint"
INFO: 2019/10/23 06:15:03 ccResolverWrapper: sending new addresses to cc: [{https://192.16.1.18:2379 0  <nil>} {https://192.16.1.25:2379 0  <nil>} {https://192.16.1.17:2379 0  <nil>}]
WARNING: 2019/10/23 06:15:03 grpc: addrConn.createTransport failed to connect to {https://192.16.1.25:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 127.0.0.1, 192.16.1.25, 192.16.1.25, ::1, not 192.16.1.18". Reconnecting...
WARNING: 2019/10/23 06:15:03 grpc: addrConn.createTransport failed to connect to {https://192.16.1.25:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 127.0.0.1, 192.16.1.25, 192.16.1.25, ::1, not 192.16.1.18". Reconnecting...
WARNING: 2019/10/23 06:15:03 grpc: addrConn.createTransport failed to connect to {https://192.16.1.17:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 127.0.0.1, 192.16.1.17, 192.16.1.17, ::1, not 192.16.1.18". Reconnecting...
WARNING: 2019/10/23 06:15:03 grpc: addrConn.createTransport failed to connect to {https://192.16.1.17:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 127.0.0.1, 192.16.1.17, 192.16.1.17, ::1, not 192.16.1.18". Reconnecting...
/xxx
xxx
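The x509 errors above show every handshake being verified against the first endpoint's IP (192.16.1.18). A quick way to confirm which IPs and DNS names each member's serving certificate actually covers is to pull the certificate with openssl and read its Subject Alternative Names. This is only a diagnostic sketch; the member IPs and certificate paths below are copied from the debug output above and will differ in other clusters.

for ep in 192.16.1.18 192.16.1.25 192.16.1.17; do
  echo "== ${ep} =="
  # Fetch the serving certificate from each member and print its SAN list.
  openssl s_client -connect "${ep}:2379" \
    -CAfile /etc/etcd/ssl/ca.pem \
    -cert /etc/etcd/ssl/etcd-client.pem \
    -key /etc/etcd/ssl/etcd-client-key.pem </dev/null 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 'Subject Alternative Name'
done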

@jejer

jejer commented Oct 23, 2019

okay, already fixed by #11184

@spzala
Member

spzala commented Oct 23, 2019

@jcrowthe as @jejer has verified, the issue seems to be fixed now. I am closing it, but please feel free to reopen if needed. Thanks!

@spzala spzala closed this as completed Oct 23, 2019
@cwdsuzhou

I hit the same issue with etcd 3.3.18 after upgrading from 3.2.24. It seems a similar issue still exists.

etcd Version: 3.3.18
Git SHA: 3c8740a79
Go Version: go1.12.9
Go OS/Arch: linux/amd64

@davissp14
Contributor

We are seeing this issue on 3.3.18 as well.

@aeoluswing

@spzala Sorry for commenting again on this closed issue.
Unfortunately, I hit this on version 3.4.9 after upgrading from version 3.3.10.

Jun 06 11:16:08 kubernetes-master-03 etcd[9262]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Jun 06 11:16:08 kubernetes-master-03 etcd[9262]: serving client requests on 10.3.145.10:2379
Jun 06 11:16:08 kubernetes-master-03 etcd[9262]: established a TCP streaming connection with peer ba80ef9cc6549fd9 (stream Message reader)
Jun 06 11:16:08 kubernetes-master-03 etcd[9262]: established a TCP streaming connection with peer e48bf6e4b3b874d (stream Message reader)
Jun 06 11:16:08 kubernetes-master-03 etcd[9262]: da1ce4c886f80f9c initialized peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Jun 06 11:16:12 kubernetes-master-03 etcd[9262]: updated the cluster version from 3.0 to 3.4
Jun 06 11:16:12 kubernetes-master-03 etcd[9262]: enabled capabilities for version 3.4
Jun 06 11:19:58 kubernetes-master-03 etcd[9262]: rejected connection from "10.3.145.8:54824" (error "EOF", ServerName "")
Jun 06 11:22:03 kubernetes-master-03 etcd[9262]: rejected connection from "10.3.145.8:55326" (error "EOF", ServerName "")
Jun 06 11:22:24 kubernetes-master-03 etcd[9262]: rejected connection from "10.3.145.8:55432" (error "EOF", ServerName "")
Jun 06 11:22:29 kubernetes-master-03 etcd[9262]: rejected connection from "10.3.145.8:55458" (error "EOF", ServerName "")

The endpoint status is below:

+------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  10.3.145.8:2379 | ba80ef9cc6549fd9 |   3.4.9 |   23 MB |      true |      false |         2 |          9 |                  9 |        |
|  10.3.145.9:2379 |  e48bf6e4b3b874d |   3.4.9 |   23 MB |     false |      false |         2 |          9 |                  9 |        |
| 10.3.145.10:2379 | da1ce4c886f80f9c |   3.4.9 |   23 MB |     false |      false |         2 |          9 |                  9 |        |
+------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

I have followed issues #11184, #10391, and #10634, and the cluster is healthy and has been serving our Kubernetes production cluster stably. Did I miss anything important about this?

@kennyqn

kennyqn commented Nov 11, 2020

Still hitting this on v3.4.13 as well

@kerberoszhan

I hit the same issue on 3.4.14 when I start flanneld.

etcdctl version

etcdctl version: 3.4.14
API version: 3.4

flanneld -version

v0.13.1-rc1

The log messages are below:
Apr 9 16:08:56 cscnode01 etcd: rejected connection from "192.168.1.73:43180" (error "EOF", ServerName "")
Apr 9 16:13:42 cscnode01 systemd: Starting Flanneld overlay address etcd agent...
Apr 9 16:13:42 cscnode01 flanneld: ERROR: logging before flag.Parse: I0409 16:13:42.907419 37969 main.go:519] Determining IP address of default interface
Apr 9 16:13:42 cscnode01 flanneld: ERROR: logging before flag.Parse: I0409 16:13:42.907761 37969 main.go:532] Using interface with name ens33 and address 192.168.1.75
Apr 9 16:13:42 cscnode01 flanneld: ERROR: logging before flag.Parse: I0409 16:13:42.907790 37969 main.go:549] Defaulting external address to interface address (192.168.1.75)
Apr 9 16:13:42 cscnode01 flanneld: warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Apr 9 16:13:42 cscnode01 flanneld: ERROR: logging before flag.Parse: I0409 16:13:42.908166 37969 main.go:253] Created subnet manager: Etcd Local Manager with Previous Subnet: None
Apr 9 16:13:42 cscnode01 flanneld: ERROR: logging before flag.Parse: I0409 16:13:42.908172 37969 main.go:256] Installing signal handlers
Apr 9 16:13:42 cscnode01 flanneld: ERROR: logging before flag.Parse: E0409 16:13:42.913778 37969 main.go:387] Couldn't fetch network config: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.

Could anyone suggest where the issue is? Why does flanneld fail to start?

@Joseph94m

@jejer @spzala Sorry for commenting again on this closed issue, but I thought this could be useful.

I tried applying this idea to our own clusters, which are deployed with kubeadm and have etcd managed locally by the kubelet (following the docs).

Versions

k8s/kubeadm 1.17.4
etcd 3.4.3-0 (I don't specify it in the kubeadmcfg.yaml)

In the kubeadmcfg.yaml config file, as per the documentation, the endpoints are not separated by spaces:
[screenshot]

So the generated manifest (/etc/kubernetes/manifests/etcd.yaml) also doesn't have spaces between endpoints:
[screenshot]

This results in the famous (error "EOF", ServerName "") messages on the etcd servers.

If I edit the manifest (/etc/kubernetes/manifests/etcd.yaml) on each node, the kubelet restarts each etcd pod, the error messages disappear, and everything is fine.

Issue

The main issue arose when I tried initializing a new cluster with the spaces in the kubeadmcfg.yaml file:
[screenshot]

The generated manifest (/etc/kubernetes/manifests/etcd.yaml) also has these spaces, as expected:
[screenshot]

This results in the kubelet reporting that the etcd pods are in CrashLoopBackOff, and as you can expect, nothing works. Adding quotes didn't seem to help in this case either.

[screenshot]

Is that a bug, or have I applied it incorrectly?

Does that mean I should be concerned about the clusters where I manually edited /etc/kubernetes/manifests/etcd.yaml so that the error messages no longer appear?

Cheers

@stale

stale bot commented Jan 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 20, 2022
@stale stale bot closed this as completed Feb 10, 2022
gengwg added a commit to gengwg/k8s that referenced this issue Feb 18, 2023
@WilliamDEdwards
Contributor

WilliamDEdwards commented Feb 23, 2024

For anyone else coming across this issue:

After rebooting an etcd node, my clients couldn't reconnect due to timeouts. The following messages were being logged multiple times per second:

rejected connection from "[2a0c:eb00:0:f9:3a1:63f:798:8ef]:41400" (error "EOF", ServerName "etcd.cyberfusion.cloud")

The culprit turned out to be a lack of CPU cores. https://etcd.io/docs/v3.4/op-guide/hardware/ says:

Few etcd deployments require a lot of CPU capacity. Typical clusters need two to four cores to run smoothly. Heavily loaded etcd deployments, serving thousands of clients or tens of thousands of requests per second, tend to be CPU bound since etcd can serve requests from memory. Such heavy deployments usually need eight to sixteen dedicated cores.

My etcd nodes had 2 CPU cores each. Contrary to the documentation, that was not nearly enough. With 'only' a few hundred clients, etcd now easily hogs 8 CPU cores when starting. I still see EOF messages, but only for a few seconds while all CPU cores are busy. There are two possible reasons, and the log does not reveal which is the case:

  • Because all CPU cores are in use, etcd has not started fully, and therefore rejects connections.
  • Because all CPU cores are in use, etcd cannot handle incoming connections, and therefore rejects them. I think this manifests itself by network buffers being full, causing the EOF (which would be more of an internal issue).
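To sanity-check the CPU-starvation theory on a node that is logging these rejections, a rough check is to compare the node's core count against etcd's CPU usage during startup and watch for the rejection messages to stop once startup settles. This assumes etcd runs as a systemd unit named etcd, as in the logs below.

nproc                                             # cores available on the node
top -b -n 1 -p "$(pidof -s etcd)" | tail -n 2     # etcd's current CPU usage
journalctl -fu etcd | grep --line-buffered 'rejected connection'   # should go quiet after startup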

Right now, the tail of my startup log looks like the output below: all CPU cores are still hogged on startup, but only for a few seconds. Note that before, it would not get further than "rejected connection from ...".

root@etcd0:~# journalctl -fu etcd
Feb 23 13:16:32 etcd0.cyberfusion.cloud etcd[603]: waiting for ReadIndex response took too long, retrying (sent-request-id: 13010492323331209987, retry-timeout: 500ms)
Feb 23 13:16:32 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f7:185:233:175:155]:39668" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:32 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f6:185:233:175:12]:42620" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:33 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:33 INFO: 62c9400d8248b48e is starting a new election at term 3920
Feb 23 13:16:33 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:33 INFO: 62c9400d8248b48e became candidate at term 3921
Feb 23 13:16:33 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:33 INFO: 62c9400d8248b48e received MsgVoteResp from 62c9400d8248b48e at term 3921
Feb 23 13:16:33 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:33 INFO: 62c9400d8248b48e [logterm: 3918, index: 29535846] sent MsgVote request to 2c110e48d3d01f4d at term 3921
Feb 23 13:16:33 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:33 INFO: 62c9400d8248b48e [logterm: 3918, index: 29535846] sent MsgVote request to 89419b39d533c2fd at term 3921
Feb 23 13:16:33 etcd0.cyberfusion.cloud etcd[603]: waiting for ReadIndex response took too long, retrying (sent-request-id: 13010492323331209987, retry-timeout: 500ms)
Feb 23 13:16:33 etcd0.cyberfusion.cloud etcd[603]: waiting for ReadIndex response took too long, retrying (sent-request-id: 13010492323331209987, retry-timeout: 500ms)
Feb 23 13:16:34 etcd0.cyberfusion.cloud etcd[603]: waiting for ReadIndex response took too long, retrying (sent-request-id: 13010492323331209987, retry-timeout: 500ms)
Feb 23 13:16:34 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f7:185:233:175:155]:39674" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:34 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f6:185:233:175:12]:42612" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:34 etcd0.cyberfusion.cloud etcd[603]: waiting for ReadIndex response took too long, retrying (sent-request-id: 13010492323331209987, retry-timeout: 500ms)
Feb 23 13:16:34 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f6:185:233:175:12]:42618" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:35 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f7:185:233:175:155]:53198" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:35 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:35 INFO: 62c9400d8248b48e received MsgVoteResp from 89419b39d533c2fd at term 3921
Feb 23 13:16:35 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:35 INFO: 62c9400d8248b48e has received 2 MsgVoteResp votes and 0 vote rejections
Feb 23 13:16:35 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:35 INFO: 62c9400d8248b48e became leader at term 3921
Feb 23 13:16:35 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:35 INFO: raft.node: 62c9400d8248b48e elected leader 62c9400d8248b48e at term 3921
Feb 23 13:16:35 etcd0.cyberfusion.cloud etcd[603]: first commit in current term: resending ReadIndex request
Feb 23 13:16:35 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f7:185:233:175:203]:52574" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f9:3a1:63f:798:8ef]:41338" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f9:9eb:b09:3de:905]:45938" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f9:3a1:63f:798:8ef]:41400" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: rejected connection from "[2a0c:eb00:0:f9:9eb:b09:3de:905]:45748" (error "EOF", ServerName "etcd.cyberfusion.cloud")
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: waiting for ReadIndex response took too long, retrying (sent-request-id: 13010492323331210063, retry-timeout: 500ms)
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: failed to send out heartbeat on time (exceeded the 100ms timeout for 372.003162ms, to 89419b39d533c2fd)
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: server is likely overloaded
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: failed to send out heartbeat on time (exceeded the 100ms timeout for 372.435176ms, to 2c110e48d3d01f4d)
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: server is likely overloaded
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: failed to send out heartbeat on time (exceeded the 100ms timeout for 15.768388ms, to 89419b39d533c2fd)
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: server is likely overloaded
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: failed to send out heartbeat on time (exceeded the 100ms timeout for 15.909769ms, to 2c110e48d3d01f4d)
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: server is likely overloaded
Feb 23 13:16:36 etcd0.cyberfusion.cloud etcd[603]: request "header:<ID:13010492323331210099 > authenticate:<name:dproxy.cyberfusion.email simple_token:<stripped>>" with result "size:57" took too long (111.436785ms) to execute
Feb 23 13:16:39 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:39 INFO: 62c9400d8248b48e [logterm: 3921, index: 29535958, vote: 62c9400d8248b48e] ignored MsgVote from 2c110e48d3d01f4d [logterm: 3921, index: 29535950] at term 3921: lease is not expired (remaining ticks: 9)
Feb 23 13:16:39 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:39 INFO: 62c9400d8248b48e [term: 3921] received a MsgHeartbeat message with higher term from 2c110e48d3d01f4d [term: 3922]
Feb 23 13:16:39 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:39 INFO: 62c9400d8248b48e became follower at term 3922
Feb 23 13:16:39 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:39 INFO: raft.node: 62c9400d8248b48e changed leader from 62c9400d8248b48e to 2c110e48d3d01f4d at term 3922
Feb 23 13:16:41 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:41 INFO: found conflict at index 29535951 [existing term: 3921, conflicting term: 3922]
Feb 23 13:16:41 etcd0.cyberfusion.cloud etcd[603]: raft2024/02/23 13:16:41 INFO: replace the unstable entries from index 29535951
