
New balancer wrapper tearDown does not call down function from Up #1649

Closed

gyuho opened this issue Nov 3, 2017 · 9 comments

Comments

@gyuho
Contributor

gyuho commented Nov 3, 2017

What version of gRPC are you using?

v1.7.2

What version of Go are you using (go version)?

Go 1.9.2

What operating system (Linux, Windows, …) and version?

Linux 4.10.0-38-generic 16.04.1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux

What did you do?

etcd with gRPC v1.6.0 implements the old (v1) balancer interface with Up, Get, Notify, etc.
Now we are upgrading to v1.7.2 while temporarily keeping the current etcd implementation; we will rewrite the balancer to use the new interface later.

What did you expect to see?

The same behavior in the etcd client balancer with gRPC v1.7.2's balancer v1 wrapper.

What did you see instead?

A lot of etcd test failures.

Our balancer depends on the v1.6.0 lbWatcher, which creates/drains connections from the Notify channel and calls the down function to propagate the grpc: the connection is drained error to the balancer. We use this error to unpin/switch endpoints.

The v1.7.2 lbWatcher creates/drains connections the same way, except for tear-down: it calls RemoveSubConn and tearDown, but never the down function with grpc: the connection is drained.
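
For context, here is a minimal sketch of the pattern our balancer relies on. The names (simpleBalancer, pinAddr) are hypothetical, not the actual etcd code:

```go
package balancer

import (
	"sync"

	"google.golang.org/grpc"
)

// simpleBalancer is a hypothetical stand-in for etcd's v1 balancer.
type simpleBalancer struct {
	mu      sync.Mutex
	pinAddr string
}

// Up pins the endpoint and returns the down func; gRPC calls down when
// the connection goes away, and the error tells us why.
func (b *simpleBalancer) Up(addr grpc.Address) func(error) {
	b.mu.Lock()
	b.pinAddr = addr.Addr
	b.mu.Unlock()

	return func(err error) { // down
		b.mu.Lock()
		defer b.mu.Unlock()
		// In v1.6.0, tearDown passed errConnDrain here, which is how
		// we detected a drained connection and unpinned the endpoint.
		if b.pinAddr == addr.Addr {
			b.pinAddr = "" // unpin; the next Get picks another endpoint
		}
	}
}
```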

Relevant changes:

v1.6.0

grpc-go/clientconn.go

Lines 1212 to 1221 in 84671c5

```go
func (ac *addrConn) tearDown(err error) {
	ac.cancel()
	ac.mu.Lock()
	ac.curAddr = Address{}
	defer ac.mu.Unlock()
	if ac.down != nil {
		ac.down(downErrorf(false, false, "%v", err))
		ac.down = nil
	}
	// ...
}
```

v1.7.2

grpc-go/clientconn.go

Lines 644 to 655 in 5ffe308

```go
// removeAddrConn removes the addrConn in the subConn from clientConn.
// It also tears down the ac with the given error.
func (cc *ClientConn) removeAddrConn(ac *addrConn, err error) {
	cc.mu.Lock()
	if cc.conns == nil {
		cc.mu.Unlock()
		return
	}
	delete(cc.conns, ac)
	cc.mu.Unlock()
	ac.tearDown(err)
}
```

grpc-go/clientconn.go

Lines 1109 to 1130 in 5ffe308

```go
func (ac *addrConn) tearDown(err error) {
	ac.cancel()
	ac.mu.Lock()
	ac.curAddr = resolver.Address{}
	defer ac.mu.Unlock()
	if err == errConnDrain && ac.transport != nil {
		// GracefulClose(...) may be executed multiple times when
		// i) receiving multiple GoAway frames from the server; or
		// ii) there are concurrent name resolver/Balancer triggered
		// address removal and GoAway.
		ac.transport.GracefulClose()
	}
	if ac.state == connectivity.Shutdown {
		return
	}
	ac.state = connectivity.Shutdown
	ac.tearDownErr = err
	if ac.cc.balancerWrapper != nil {
		ac.cc.balancerWrapper.handleSubConnStateChange(ac.acbw, ac.state)
	} else {
		ac.cc.csMgr.updateState(ac.state)
	}
	// ...
}
```



Isn't the v1 balancer wrapper supposed to keep the old behavior?
Could anybody help us understand why the down(errConnDrain) call was removed?

Thanks in advance!
/cc @xiang90

@menghanl
Contributor

menghanl commented Nov 3, 2017

down is called by the v1 balancer wrapper when the SubConn state changes from Ready to non-Ready:

```go
if oldS != connectivity.Ready && s == connectivity.Ready {
	scSt.down = bw.balancer.Up(scSt.addr)
} else if oldS == connectivity.Ready && s != connectivity.Ready {
	if scSt.down != nil {
		scSt.down(errConnClosing)
	}
}
```

It's called with "grpc: the connection is closing" though.

Did you see that your down function is never called?
Or do you expect the drain error message?

@gyuho
Contributor Author

gyuho commented Nov 3, 2017

> It's called with "grpc: the connection is closing" though.

Right. We are also seeing that.

But our current balancer logic was built based on the v1.6.0 logic.
So we were expecting to see the errConnDrain message first, and then grpc: the connection is closing.

Was there any reason why errConnDrain was removed and no longer triggers down in v1.7.x?

@menghanl
Contributor

menghanl commented Nov 3, 2017

down is called to notify the balancer that the connection is not working.
It should only be called once. What do you mean by seeing errConnDrain and then the connection is closing?

down is not removed; it's moved to the v1 balancer wrapper. When tearDown() is called, the down function should eventually be called by the v1 balancer wrapper. If your down is never called, that would be a bug.

@gyuho
Contributor Author

gyuho commented Nov 3, 2017

> What do you mean by seeing errConnDrain and then the connection is closing?

v1.6.0 has this in lbWatcher:

grpc-go/clientconn.go

Lines 651 to 653 in 84671c5

```go
for _, c := range del {
	c.tearDown(errConnDrain)
}
```

So when we delete connections, the v1.6.0 tearDown calls the down function, and the custom balancer knows that this connection has been drained.

v1.7.2 has this in lbWatcher:

```go
for _, c := range del {
	bw.cc.RemoveSubConn(c)
}
```

And RemoveSubConn calls removeAddrConn and tearDown:

```go
func (ccb *ccBalancerWrapper) RemoveSubConn(sc balancer.SubConn) {
	grpclog.Infof("ccBalancerWrapper: removing subconn")
	acbw, ok := sc.(*acBalancerWrapper)
	if !ok {
		return
	}
	ccb.cc.removeAddrConn(acbw.getAddrConn(), errConnDrain)
}
```

grpc-go/clientconn.go

Lines 646 to 655 in 57ebb0f

```go
func (cc *ClientConn) removeAddrConn(ac *addrConn, err error) {
	cc.mu.Lock()
	if cc.conns == nil {
		cc.mu.Unlock()
		return
	}
	delete(cc.conns, ac)
	cc.mu.Unlock()
	ac.tearDown(err)
}
```

But the v1.7.2 tearDown does not call down, as I linked above.

So I meant that tearDown(errConnDrain) is called in both versions, but the down function is not, due to the change in the tearDown method.

@menghanl
Contributor

menghanl commented Nov 4, 2017

tearDown calls handleSubConnStateChange, which will eventually trigger the v1 balancer wrapper's HandleSubConnStateChange with the Shutdown state.

The HandleSubConnStateChange function in the v1 balancer wrapper calls the down function whenever the SubConn's state changes from Ready to non-Ready.

So, when a SubConn's tearDown gets called, the corresponding down function should eventually be called by the v1 balancer wrapper (because of the state transition that happens after tearDown).
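
Putting the snippets together, the path looks roughly like this (a paraphrase of the code above, not a verbatim call stack):

```go
// Simplified trace of the v1.7.x path (paraphrased, not verbatim):
//
//   bw.cc.RemoveSubConn(c)                  // v1 wrapper's lbWatcher
//     -> cc.removeAddrConn(ac, errConnDrain)
//       -> ac.tearDown(errConnDrain)
//            ac.state = connectivity.Shutdown
//            balancerWrapper.handleSubConnStateChange(ac.acbw, Shutdown)
//              -> (asynchronously) HandleSubConnStateChange sees
//                 Ready -> Shutdown and calls scSt.down(errConnClosing)
//                 // note: errConnClosing, not errConnDrain
```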

@gyuho
Contributor Author

gyuho commented Nov 4, 2017

> The HandleSubConnStateChange function in the v1 balancer wrapper calls the down function whenever the SubConn's state changes from Ready to non-Ready.

Then down(errConnClosing) is only called in that case, as:

```go
if oldS != connectivity.Ready && s == connectivity.Ready {
	scSt.down = bw.balancer.Up(scSt.addr)
} else if oldS == connectivity.Ready && s != connectivity.Ready {
	if scSt.down != nil {
		scSt.down(errConnClosing)
	}
}
```

This is different from how v1.6.0 handles drained connections, right?
v1.6.0 calls down(errConnDrain) directly when removing a connection in lbWatcher.

Please correct me if I am wrong.

Thanks!

@menghanl
Contributor

menghanl commented Nov 6, 2017

v1.6.0 calls down(errConnDrain) only if down is not nil, which means Up was called.
If the ac never went Up, down will be nil, and tearDown won't call it.

In the balancer wrapper, Up is called when the state goes from non-Ready to Ready, and down is called when it goes from Ready to non-Ready. tearDown for a Ready SubConn will call its down function.
The behavior should be the same.
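
So the lifecycle guarantees line up, roughly (a paraphrase using the connectivity state names, not exact code):

```go
// SubConn lifecycle as seen by the v1 balancer wrapper (paraphrase):
//
//   Connecting -> Ready           => scSt.down = balancer.Up(addr)
//   Ready -> Shutdown (tearDown)  => scSt.down(errConnClosing)
//
// A SubConn that never reached Ready has no down to call, which matches
// v1.6.0's "if ac.down != nil" check in tearDown.
```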

@menghanl
Contributor

menghanl commented Nov 6, 2017

If you mean the timing is different, that might be right.
In v1.6.0, down is called immediately when tearDown is called, while with the balancer wrapper, down is called when the state change is propagated to the balancer.
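
If client code relied on the synchronous call, one possible bridge (a sketch with a hypothetical downCalled channel, not etcd's actual fix) is to have the down closure signal when it has actually been delivered:

```go
package main

import "time"

func main() {
	// Hypothetical sketch: bridge the timing gap by having the down
	// closure signal when the v1 balancer wrapper has delivered it.
	downCalled := make(chan struct{})

	// In a real balancer, this closure would be returned from Up.
	down := func(err error) {
		// ... unpin/switch endpoints as before ...
		close(downCalled)
	}
	_ = down

	select {
	case <-downCalled:
		// down has been delivered; endpoint state is up to date
	case <-time.After(3 * time.Second):
		// down not delivered yet; the race must be handled explicitly
	}
}
```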

@gyuho
Contributor Author

gyuho commented Nov 6, 2017

@menghanl I will see if our client can internally handle this.
Thanks!

gyuho closed this as completed Nov 6, 2017