Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clientconn: set dial target "Authority" with target address #2650

Closed
wants to merge 1 commit into from
Closed

clientconn: set dial target "Authority" with target address #2650

wants to merge 1 commit into from

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Feb 21, 2019

I am creating this PR, in order to start a discussion :)

When user dials with "grpc.WithDialer", "grpc.DialContext" "cc.parsedTarget"
update only happpens once. This is problematic, because when TLS is enabled,
retries happen through "grpc.WithDialer" with static "cc.parsedTarget" from
the initial dial call.

If the server authenticates by IP addresses, we want to set a new endpoint as
a new authority. Otherwise

transport: authentication handshake failed: x509: certificate is valid for 127.0.0.1, 192.168.154.254, not 192.168.208.149

when the new dial target is "192.168.154.254" whose certificate host name is also "192.168.154.254"
but client tries to authenticate with previously set "cc.parsedTarget" field "192.168.208.149"

ref.

/cc @xiang90 @jpbetz

To add more details, this is the etcd use case.

Server 1 certificate:

X509v3 extensions:
    X509v3 Key Usage: critical
        Digital Signature, Key Encipherment
    X509v3 Extended Key Usage:
        TLS Web Server Authentication, TLS Web Client Authentication
    X509v3 Basic Constraints: critical
        CA:FALSE
    X509v3 Subject Alternative Name:
        DNS:localhost, IP Address:127.0.0.1, IP Address:192.168.208.149

Server 2 certificate:

X509v3 extensions:
    X509v3 Key Usage: critical
        Digital Signature, Key Encipherment
    X509v3 Extended Key Usage:
        TLS Web Server Authentication, TLS Web Client Authentication
    X509v3 Basic Constraints: critical
        CA:FALSE
    X509v3 Subject Alternative Name:
        DNS:localhost, IP Address:127.0.0.1, IP Address:192.168.154.254

Server 3 certificate:

X509v3 extensions:
    X509v3 Key Usage: critical
        Digital Signature, Key Encipherment
    X509v3 Extended Key Usage:
        TLS Web Server Authentication, TLS Web Client Authentication
    X509v3 Basic Constraints: critical
        CA:FALSE
    X509v3 Subject Alternative Name:
        DNS:localhost, IP Address:127.0.0.1, IP Address:192.168.152.228
  1. Set up the 3-node cluster with those IPs and SAN fields.
  2. Now shut down the Server 1.
  3. Send etcd client requests (that uses grpc.WithDialer while passing the first endpoint in the slice to the balancer, later to be used for parsed target)
# ssh into remote machine (public IP "54.190.195.210", private IP "192.168.84.19")
ETCD_CLIENT_DEBUG=1 ETCDCTL_API=3 /usr/local/bin/etcdctl \
  --endpoints 192.168.208.149:2379,192.168.154.254:2379,192.168.152.228:2379 \
  --cacert ${HOME}/certs/etcd-root-ca.pem \
  --cert ${HOME}/certs/s1.pem \
  --key ${HOME}/certs/s1-key.pem \
  put foo bar

192.168.84.19 is the IP requesting the client calls.

connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 127.0.0.1, 192.168.152.228, not 192.168.208.149"

When user dials with "grpc.WithDialer", "grpc.DialContext" "cc.parsedTarget"
update only happpens once. This is problematic, because when TLS is enabled,
retries happen through "grpc.WithDialer" with static "cc.parsedTarget" from
the initial dial call.
If the server authenticates by IP addresses, we want to set a new endpoint as
a new authority. Otherwise
"transport: authentication handshake failed: x509: certificate is valid for 127.0.0.1, 192.168.154.254, not 192.168.208.149"
when the new dial target is "192.168.154.254" whose certificate host name is also "192.168.154.254"
but client tries to authenticate with previously set "cc.parsedTarget" field "192.168.208.149"

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@gyuho
Copy link
Contributor Author

gyuho commented Feb 21, 2019

I am not sure if this is the right fix, but at least this was the workaround that resolved the issue for us.

@xiang90
Copy link
Contributor

xiang90 commented Feb 28, 2019

/cc @menghanl can you take a quick look at this PR? it affects many projects right now.

@menghanl
Copy link
Contributor

Sorry for the late reply.

What's the dial target you passed to grpc.Dial()? The target is used as the authority for TLS verification.
If it worked before, I'm assuming you were dialing to the IP directly, is that right?
In that case, how did the ClientConn get a different IP later?

@gyuho
Copy link
Contributor Author

gyuho commented Mar 1, 2019

In that case, how did the ClientConn get a different IP later?

We have a base balancer and picker to implement balancer.Balancer interface. And picker gets multiple endpoints, but all dial logic happens in gRPC? All we do is pass the dialer with grpc.WithBalancerName

@menghanl
Copy link
Contributor

menghanl commented Mar 1, 2019

It's unexpected to create a ClientConn to an IP, and then connect to a different IP using the same ClientConn.
A normal use case would be, Dial to a server-name, then resolve to IPs. The IPs can be more than one, and also can change at any time. So the SubConns are just connections to different backends of the same server (thus they share the server name).

Also I think certificates should normally be signed for an abstract name, like "example.com", not IP.


Some ideas:

There's an option for balancer to control what credentials to use for each SubConn.
It's supposed to be used to set a different type of credentials (but still verifying the same authority).
But if you set it to a TLS creds with a hardcoded authority and ignore the one passed in authority, it should work. You may need to copy the base balancer to do this.

You can configure the client to skip the verification (InsecureSkipVerify). This won't protect you from man-in-the-middle attacks, but if you trust your connections, it may be fine.

You can also configure the server so it uses a different certificate based on the server name clients sends (SNI).
If some of the nodes are normal nodes, and others are fallback nodes, you only need to configure the fallback nodes to support all the normal nodes. Also it still doesn't sound right if the certificates are signed for IPs (you will have a server using certificates signed for other node's IP). If the normal nodes have server-names each, the fallback node will just support all the names.

@menghanl
Copy link
Contributor

menghanl commented Mar 7, 2019

Closing for now. Please reply if you have more questions. Thanks!

@menghanl menghanl closed this Mar 7, 2019
@xiang90
Copy link
Contributor

xiang90 commented Mar 8, 2019

@menghanl

It's unexpected to create a ClientConn to an IP, and then connect to a different IP using the same ClientConn.

Why cannot we dial to an IP, and then supply the balancer with alternative IP addresses as different backends? I think this is also a valid use case.

@menghanl
Copy link
Contributor

@xiang90
A ClientConn abstractly represents a client for a service (example.com). It may contain multiple connections, but all of them are connected to the backends serving the same service, and all the IPs are resolved from example.com.
Based on this assumption, the authority is set to example.com.
To use TLS, the user would sign the certificate for example.com,and deploy it on all the backends. It also makes adding backends easier because they can deploy the same certificate instead of signing a new one.

With that being said, dialing to an IP but connect to a different IP works with dialer or balancer, but it breaks the address hierarchy that was assumed, so extra work need to be done to fix the gap, like the balancer option to configure credentials I mentioned above.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants