Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpcproxy: fix memberlist results not update when proxy node down #15835

Conversation

yellowzf
Copy link
Contributor

@yellowzf yellowzf commented May 6, 2023

If start grpc proxy with --resolver-prefix, memberlist will return all alive proxy nodes, when one grpc proxy node is down, it is expected to not return the down node, but it is still return

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

case endpoints.Delete:
delete(cp.umap, up.Endpoint.Addr)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when op delete, up.Endpoint.Addr is not specified, can not delete the endpoint from map, use up.Key instead

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, I think using Key is ok because it's already used in resolver.watch https://github.com/etcd-io/etcd/blob/249c0d71d4e106e0d2f5576b9638f2e8bd75ca47/client/v3/naming/resolver/resolver.go#LL97C6-L97C12
Not sure why Endpoint.Addr was used initially

Copy link

@lavacat lavacat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but needs a test.
case endpoints.Add: is covered in TestClusterProxyMemberList. But case endpoints.Delete isn't covered at all.

case endpoints.Delete:
delete(cp.umap, up.Endpoint.Addr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, I think using Key is ok because it's already used in resolver.watch https://github.com/etcd-io/etcd/blob/249c0d71d4e106e0d2f5576b9638f2e8bd75ca47/client/v3/naming/resolver/resolver.go#LL97C6-L97C12
Not sure why Endpoint.Addr was used initially

@yellowzf yellowzf force-pushed the grpcproxy_fix_memberlist_results_not_update_when_proxy_node_down branch from 87355e9 to eb54d18 Compare May 8, 2023 12:09
@yellowzf yellowzf closed this May 8, 2023
@yellowzf yellowzf reopened this May 8, 2023
@yellowzf
Copy link
Contributor Author

yellowzf commented May 8, 2023

LGTM, but needs a test. case endpoints.Add: is covered in TestClusterProxyMemberList. But case endpoints.Delete isn't covered at all.

tests added

@yellowzf
Copy link
Contributor Author

yellowzf commented May 9, 2023

PTAL, it's ready for review @lavacat

Copy link

@lavacat lavacat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please cleanup the logic and add comments to any non-obvious test setup and checks.

There is also TestRegister test that uses NewWatchChannel to listen to updates, it might be easier to update. Up to you.

@@ -90,6 +129,27 @@ func (cts *clusterproxyTestServer) close(t *testing.T) {
}
}

func registDelayDeleteMember(lg *zap.Logger, endpoints []string, addr string, delay time.Duration, ttl int, t *testing.T) chan struct{} {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: registerMemberWithTTL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this func in new code, use DeleteEndpoint is better

time.Sleep(time.Second)
donec := make(chan struct{})
go func() {
time.Sleep(delay)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the point this goroutine if it's just waiting? I think this can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this goroutine used to delay close the etcd client, if client is closed, lease will expire soon, then the key which lease attached will delete, it is complex, use DeleteEndpoint is more convenient

}

//check del member succ
time.Sleep(wait)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait > delay. What is the point if delay?

Copy link
Contributor Author

@yellowzf yellowzf May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use delay then test case can check proxy member has successfully updated

Endpoints: endpoints,
DialTimeout: 5 * time.Second,
}
client, err := integration2.NewClient(t, cfg)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to pass client from the test without creating a new one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is

@yellowzf yellowzf force-pushed the grpcproxy_fix_memberlist_results_not_update_when_proxy_node_down branch from eb54d18 to 2fd92e4 Compare May 10, 2023 03:48
@yellowzf
Copy link
Contributor Author

Please cleanup the logic and add comments to any non-obvious test setup and checks.

There is also TestRegister test that uses NewWatchChannel to listen to updates, it might be easier to update. Up to you.

TestRegister only cover watch channel logic, TestClusterProxyMemberList cover member list api and watch channel.
in new code, use DeleteEndpoint to delete key of member, seems more clear.
it's ready for review, @lavacat , PTAL

if len(mresp.Members) != 2 {
t.Fatalf("len(mresp.Members) expected 2, got %d (%+v)", len(mresp.Members), mresp.Members)
}
if mresp.Members[1].ClientURLs[0] != newMemberAddr {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think Members is populated from a map, so order isn't guarantied. Might cause test flakiness.

Can we add a check for Members[0] and add a comment about order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map can not guarantee order, we should check if the new member is in mresp.Member, codes have updated, PTAL

@lavacat
Copy link

lavacat commented May 11, 2023

I wonder if anything can be done with those time.Sleep(time.Second) in the test? I understand that you are reusing prior code, but if you see some other solution would be nice to update.

@yellowzf yellowzf force-pushed the grpcproxy_fix_memberlist_results_not_update_when_proxy_node_down branch from 2fd92e4 to c50e7bd Compare May 12, 2023 02:31
@yellowzf
Copy link
Contributor Author

I wonder if anything can be done with those time.Sleep(time.Second) in the test? I understand that you are reusing prior code, but if you see some other solution would be nice to update.

proxy watch specific prefix for members update, when register or delete members by put/delete key, it updates members by process watch resp, it is asynchronous, so use time.Sleep(time.Second) to wait for update, i think prior code is the same meaning

@@ -67,6 +71,47 @@ func TestClusterProxyMemberList(t *testing.T) {
if mresp.Members[0].ClientURLs[0] != cts.caddr {
t.Fatalf("mresp.Members[0].ClientURLs[0] expected %q, got %q", cts.caddr, mresp.Members[0].ClientURLs[0])
}

//test proxy member up/down
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: test proxy member add

}

succ := false
for _, member := range mresp.Members {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: replace with

	hostname, _ := os.Hostname()
	...
	assert.Contains(t, mresp.Members, &pb.Member{Name: hostname, ClientURLs: []string{newMemberAddr}})

@@ -67,6 +71,47 @@ func TestClusterProxyMemberList(t *testing.T) {
if mresp.Members[0].ClientURLs[0] != cts.caddr {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assert.Contains(t, mresp.Members, &pb.Member{Name: hostname, ClientURLs: []string{cts.caddr}})

if len(mresp.Members) != 1 {
t.Fatalf("len(mresp.Members) expected 1, got %d (%+v)", len(mresp.Members), mresp.Members)
}
if mresp.Members[0].ClientURLs[0] != cts.caddr {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assert.Contains(t, mresp.Members, &pb.Member{Name: hostname, ClientURLs: []string{cts.caddr}})

@lavacat
Copy link

lavacat commented May 12, 2023

LGTM, please use assert.Contains as per comments. cc @ahrtr for final review

newMemberAddr := "127.0.0.2:6789"
grpcproxy.Register(lg, cts.c, prefix, newMemberAddr, 7)
// wait some time for proxy update members
time.Sleep(time.Second)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we can switch to time.Sleep(200 * time.Millisecond) for all waits in this test. 1s seams too long.

@ahrtr
Copy link
Member

ahrtr commented May 13, 2023

If start grpc proxy with --resolver-prefix, memberlist will return all alive proxy nodes, when one grpc proxy node is down, it is expected to not return the down node, but it is still return

I do not get the real issue. Please provide the detailed steps to reproduce the issue.

If start grpc proxy with --resolver-prefix, memberlist will return all alive proxy nodes, when one grpc proxy node is down, it is expected to not return the down node, but it is still return

Signed-off-by: yellowzf <zzhf3311@163.com>
@yellowzf yellowzf force-pushed the grpcproxy_fix_memberlist_results_not_update_when_proxy_node_down branch from c50e7bd to ca22120 Compare May 15, 2023 02:59
@yellowzf
Copy link
Contributor Author

If start grpc proxy with --resolver-prefix, memberlist will return all alive proxy nodes, when one grpc proxy node is down, it is expected to not return the down node, but it is still return

I do not get the real issue. Please provide the detailed steps to reproduce the issue.

step1: start a grpc-proxy cluster which contains 3 member, member1/member2/member3, use --resolver-prefix flag when member start.
step2: use etcd client connect to member1, call MemberList, get a response which contains member1/member2/member3.
step3: stop member3, then lease of member3 in etcd server will expire and key will be deleted.
step4: use etcd client connect to member1, call MemberList, still get a response which contains member1/member2/member3, it is expect to only contains member1/member2 because member3 is stoped. @ahrtr

@yellowzf
Copy link
Contributor Author

LGTM, please use assert.Contains as per comments. cc @ahrtr for final review

all codes memtioned on comments has updated, @lavacat PTAL

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @yellowzf for the fix.

Could you backport this PR to 3.4 and 3.5?

@yellowzf
Copy link
Contributor Author

yellowzf commented May 16, 2023

LGTM

Thanks @yellowzf for the fix.

Could you backport this PR to 3.4 and 3.5?

3.4 use PreKv to assign addr, no need to backport, https://github.com/etcd-io/etcd/blob/release-3.4/clientv3/naming/grpc.go#L91-L100

3.5 backport at #15907

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants