
kv: controlling range leadership #6929

Closed
andreimatei opened this issue May 26, 2016 · 16 comments

@andreimatei
Contributor

I started working on a TestCluster class that's supposed to wrap a configurable number of TestServers and let tests control how ranges are replicated and who the leader of each one is - a higher-level MultiTestContext, basically. It's gonna be used for testing DistSQL in general and in particular this wrapper I'm building over the LeaderCache and the RangeDescriptorCache to be used for enquiring about the leadership of key spans. And David has tried to use something like this in the past for running more realistic ("multi-node") benchmarks. Seems like a good thing, right? It's gonna be great.

One area where I quickly ran into problems is how controlling the leadership of ranges is gonna work (technically leadership in the sense of holder of the LeaderLease, not Raft leadership). I would like the TestCluster to let the client say that it wants node n1 to become the leader of range r1, and 1ms later, that it wants n2 to become the leader of that range.
Seems like we don't currently have a way for a node to release its LeaderLease, and we also don't have a clear way to force a node to become a leader when there's no leader.
Besides being necessary for testing, we can imagine that this capability would be useful in the real world when we start talking about collocating ranges (and their leadership).
I've talked to Tobias a bit about the topic, and here's a plan I want to drop in the pool and see if it makes a splash.

  1. We introduce a ShortenLease(t) raft command. This command can only be proposed by the current lease holder, and has the effect of moving up the expiration of the current lease to time t. t is chosen as the current clock of the leader (which is higher than the timestamps from its read cache - i.e. the highest timestamp it has already served a read at).
    In between when the command is proposed and when it's applied, the leader replica is gonna remember this t and refuse to serve reads above it.
    The refusal is done by returning a NotLeaderError, redirecting to the nextLeader. nextLeader is a hint given to the leader replica when ShortenLease is proposed, indicating whether there's another replica that we intend to make the leader after the current lease expires. This is so that clients following the redirect don't proceed to a random node and inadvertently force it to become the leader (see the sketch below).
    This ShortenLease sounds like it will also be useful for draining a node, particularly since Tobi says our LeaderLeases are gonna become longer than the current 1s when they get... tied... to... raft ticks(?).
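To make this concrete, here's a minimal Go sketch of the refuse-reads-above-t behavior; everything in it (the replica struct, Timestamp, the error) is a simplified stand-in for illustration, not the actual CockroachDB types:

```go
// A simplified sketch, not CockroachDB's actual types: a replica that,
// once ShortenLease(t) has been proposed, refuses reads above t and
// redirects clients to the hinted next leader.
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Timestamp stands in for a hybrid logical clock timestamp.
type Timestamp int64

var errNotLeader = errors.New("NotLeaderError")

type replica struct {
	mu sync.Mutex
	// pendingShortenTo is set between proposing ShortenLease(t) and
	// applying it; reads above it must be refused in the meantime.
	pendingShortenTo *Timestamp
	nextLeader       string // hint for redirecting clients
}

// proposeShortenLease remembers t (and the next-leader hint) so the
// replica stops serving reads above t before the command even applies.
func (r *replica) proposeShortenLease(t Timestamp, nextLeader string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pendingShortenTo = &t
	r.nextLeader = nextLeader
	// ... the actual raft proposal would go here ...
}

// serveRead refuses reads above the pending shortened expiration.
func (r *replica) serveRead(ts Timestamp) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.pendingShortenTo != nil && ts > *r.pendingShortenTo {
		return fmt.Errorf("%w: try %s", errNotLeader, r.nextLeader)
	}
	return nil // serve the read normally
}

func main() {
	r := &replica{}
	r.proposeShortenLease(100, "n2")
	fmt.Println(r.serveRead(50))  // <nil>: at or below t, still served
	fmt.Println(r.serveRead(150)) // NotLeaderError: try n2
}
```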

We now have a way to expire leases. We still need a way to (probabilistically) elect a leader of the puppeteer's choosing.

  2. We introduce a GetOrElectLeader(replicaID) RPC method, which returns the current leader if there's a valid leader lease, or tries to acquire the lease otherwise - through the usual LeaderLease raft command. This is to be used for asking a node to become leader. It's also to be used for LeaderLease cache population purposes, when you don't know who the leader is and would also like to favor one particular node to become leader if there is none (this will be useful for SQL planning, when you need to figure out who a leader is but you also have a preference on the subject). A sketch follows below.
    An alternative here is to somehow use the DistSender directly to send a read command for the range to the node that you want to elect leader. The downside is that some DistSender methods for talking directly to a particular replica would have to be exposed for testing, and that we'd be doing a read without caring about the actual values, just for its side effect of electing a leader.
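For illustration, a sketch of the GetOrElectLeader flow under the assumptions above; the server type, its lease bookkeeping, and requestLeaderLease are hypothetical stand-ins, not an actual CockroachDB API:

```go
// Hypothetical sketch of the GetOrElectLeader flow; all names here are
// illustrative stand-ins.
package main

import "fmt"

type (
	RangeID   int
	ReplicaID int
)

type lease struct {
	holder ReplicaID
	valid  bool
}

type server struct {
	leases map[RangeID]lease
}

// requestLeaderLease stands in for proposing the usual LeaderLease raft
// command on the preferred replica.
func (s *server) requestLeaderLease(r RangeID, preferred ReplicaID) error {
	s.leases[r] = lease{holder: preferred, valid: true}
	return nil
}

// GetOrElectLeader returns the current leader if there's a valid lease;
// otherwise it tries to make the preferred replica acquire one.
func (s *server) GetOrElectLeader(r RangeID, preferred ReplicaID) (ReplicaID, error) {
	if l, ok := s.leases[r]; ok && l.valid {
		return l.holder, nil
	}
	if err := s.requestLeaderLease(r, preferred); err != nil {
		return 0, err
	}
	return preferred, nil
}

func main() {
	s := &server{leases: map[RangeID]lease{}}
	leader, _ := s.GetOrElectLeader(1, 3) // no lease yet: elect replica 3
	fmt.Println(leader)                   // 3
}
```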

So with these two together, a test has the means to try to move leadership at will. It's all probabilistic - there's no guarantee that a rando doesn't grab the lease in between shortening the old lease and acquiring the new one on the desired target, but the test can keep trying until it gets what it wants.
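A sketch of the retry loop such a test might run, with the race to grab the lease simulated by a random competitor; nothing here is real API, it just shows the shape of the loop:

```go
// Illustrative sketch of a test's retry loop; the lease and the
// competing "rando" are simulated.
package main

import (
	"fmt"
	"math/rand"
)

type ReplicaID int

var leader ReplicaID = 1 // current lease holder; 0 means no leader

func shortenLease() { leader = 0 } // step 1: expire the current lease

// electLeader simulates step 2: usually the target wins, but sometimes
// a random replica grabs the lease in the window first.
func electLeader(target ReplicaID) {
	if rand.Intn(4) == 0 {
		leader = ReplicaID(rand.Intn(3) + 1) // a rando got there first
		return
	}
	leader = target
}

// transferLeadership keeps retrying until the lease lands on target.
func transferLeadership(target ReplicaID, attempts int) error {
	for i := 0; i < attempts; i++ {
		shortenLease()
		electLeader(target)
		if leader == target {
			return nil
		}
	}
	return fmt.Errorf("lease never landed on replica %d", target)
}

func main() {
	fmt.Println(transferLeadership(2, 10)) // <nil> with high probability
}
```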

Does this sound sane?

@tschottdorf @bdarnell @cuongdo @RaduBerinde

@andreimatei andreimatei self-assigned this May 26, 2016
@tbg
Member

tbg commented May 26, 2016

  1. We introduce a ShortenLease(t) raft command.

This would just be the existing LeaderLease command but it would allow for an Expiration which is smaller than that of the existing lease. t can be chosen arbitrarily. I don't want to get into the details because what's above is worthy of some discussion to get right, but that's for when that's actually implemented. The major aspect to get correct is that no reads have been served with a timestamp that's higher than what the new expiration of the lease is at apply time.
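For concreteness, the apply-time safety condition being described amounts to something like this; the names (and the idea of a single read high-water mark) are illustrative, not the real API:

```go
// Illustrative only: shortening is safe iff no read has been served at a
// timestamp above the proposed new expiration.
package main

import "fmt"

type Timestamp int64

// canShortenTo reports whether shortening the lease to newExpiration is
// safe, given the highest timestamp at which a read has been served.
func canShortenTo(newExpiration, readHighWater Timestamp) bool {
	return newExpiration >= readHighWater
}

func main() {
	fmt.Println(canShortenTo(100, 90))  // true: safe
	fmt.Println(canShortenTo(100, 120)) // false: a read was served at 120
}
```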

I think this sounds like the sanest, most non-invasive way to get what you want. Lease shortening also has some general use cases, though it could be a little tricky to get right.

Rather than diving into the deep end of this, I think it's worth speccing out what you actually want from this magical test cluster before this turns into a moloch.

@andreimatei
Contributor Author

What I want out of this TestCluster immediately is for testing this wrapper I have on top of the LeaderCache. I want to set up a bunch of ranges and check that I'm able to figure out who the leaders of the corresponding key spans are. Then I want to move leadership around and see that my cache gets updated properly. And when there's no leader for a range, I want to see that the "cache" actually forces leadership onto the node that it considers most desirable - generally collocated with other ranges that are part of the same request.
Then, what we generally need for testing DistSQL planning is to be able to construct arbitrary topologies and see that the plans we produce make sense - data that's colocated gets to be part of the same flow, processors are scheduled close to their data, etc. And also we'll probably want to change leadership in the middle of operations and see that queries still work.

@vivekmenezes
Contributor

I think it is valuable and gonna take enough time to develop that it is worth writing up an RFC before you move forward on it. This is going to be useful in testing not just distributed sql functionality.


@andreimatei
Contributor Author

Upon further consultations, seems like we can drop the ShortenLease(t) command and instead expand the existing LeaderLease to support lease transfers. So, I guess the old leader would propose this command, with a lease naming another node and with timestamp=clock() + max clock offset. After proposing, the old leader starts a "stasis" period and refuses any reads (to avoid the "single-register linearizability failure" described here: https://github.com/cockroachdb/cockroach/blob/ce67233/roachpb/data.proto#L329).
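A minimal sketch of that transfer-plus-stasis idea, assuming a simplified clock and a made-up maxOffset constant; none of these names are the actual implementation:

```go
// Illustrative sketch: the old leader proposes a LeaderLease naming the
// next leader and refuses all reads while the transfer is in flight.
package main

import "fmt"

type Timestamp int64

const maxOffset Timestamp = 250 // stand-in for the max clock offset

type oldLeader struct {
	clock    Timestamp
	inStasis bool
}

// proposeTransfer proposes a LeaderLease naming the next leader, with a
// start time of clock() + maxOffset, and enters the stasis period.
func (l *oldLeader) proposeTransfer(next string) Timestamp {
	start := l.clock + maxOffset
	l.inStasis = true // from here on, refuse all reads
	fmt.Printf("proposing LeaderLease for %s starting at %d\n", next, start)
	return start
}

// serveRead refuses everything during stasis, avoiding the
// single-register linearizability failure.
func (l *oldLeader) serveRead() error {
	if l.inStasis {
		return fmt.Errorf("NotLeaderError: lease transfer in progress")
	}
	return nil
}

func main() {
	l := &oldLeader{clock: 1000}
	l.proposeTransfer("n2")
	fmt.Println(l.serveRead()) // refused: transfer in progress
}
```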

GetOrElectLeader stays, to be used independently of TransferLease, for cache population purposes. Except instead of being an RPC as I said before, it can just be a (non-raft) admin command.

@vivekmenezes I might put this in an RFC if more discussion follows or if the scope grows somehow. Otherwise I don't know what more to say than these paragraphs. I'm sure the devil is in the details to be seen at implementation time; I wish I understood the code and indeed these issues better beforehand, but alas.

@tamird
Contributor

tamird commented May 27, 2016

The point of an RFC is to document the devilish details before making code changes which are implicitly assumed to eventually be merged.

@vivekmenezes
Contributor

Changing leadership seems a lot simpler than picking a leader. Is there a strong need to pick a leader?

@andreimatei
Contributor Author

Well, what does it mean to change leadership if you can't assign a new one? If the same node gets the lease again, have you changed anything?
Why do you think that simply "changing" is simpler? It seems to me that a transfer is just as difficult.

@vivekmenezes
Contributor

I think it's about giving up leadership and not participating in the election for a bit. It appears to be straightforward to do that. I suppose picking a specific leader is also not complex, but I'm curious if you need that functionality too.


@tbg
Member

tbg commented May 27, 2016

Leadership is opportunistic, so if a node gives up leadership it'll regain it with the next incoming read/write (or even just queue activity). That's why transfer is important.


@bdarnell
Contributor

GetOrElectLeader stays, to be used independently of TransferLease

What is TransferLease? (This is the only time that name is used)

What exactly is the flow you're now proposing? Does the client send LeaderLease requests directly, or does it use a new TransferLease RPC to ask the old lease holder to do it?

GetLeader() (with the implication that it elects whichever node it lands on if there is no leader) is fine; GetOrElectLeader(replicaID) would need to be fleshed out in more detail (if we need it at all - can we just use GetLeader and transfer leadership separately if it's not where we want it?). In particular I'd prefer to hide all knowledge of replicas at the DistSender level and below; I'd rather not have higher-level code manipulating range leadership (except in tests, which is why I'd prefer to use a blind GetLeader from the sql code and make lease transfers test-only).

@andreimatei
Contributor Author

andreimatei commented May 31, 2016

What is TransferLease? (This is the only time that name is used)

Sorry, there's no TransferLease. What I'm really doing is using LeaderLease for transfers, setting the ReplicaID in the request to the next leader replica. So, the old leader has to be the one that proposes this command, and it will do some things before proposing to make sure it doesn't serve more reads after proposing.

And again you're right - there's no GetOrElectLeader, just GetLeader. But GetLeader would be an RPC, not a Raft command, right?

@bdarnell
Contributor

bdarnell commented Jun 3, 2016

So if the old leader must propose the new LeaderLease, what causes that to happen? I'm still not clear on the overall flow.

But GetLeader would be an RPC, not a Raft command, right?

Right, GetLeader would be a read RPC, not a raft command.

@tbg
Member

tbg commented Jun 7, 2016

ping @andreimatei re: @bdarnell's qs above.

@andreimatei
Contributor Author

The old leader would propose the new LeaderLease when a test tells it to do so, through some sort of test interface that accesses the old leader replica directly (TBD exactly how that's going to look, maybe just a method on the replica). It would also happen on a node shutdown - the node would iterate over its replicas, see which ones are leaders, and scramble to transfer leadership away to another random replica.
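For illustration, that shutdown drain might look roughly like this; the replica type and its fields are hypothetical stand-ins, not the actual code:

```go
// Illustrative sketch of draining leases on shutdown: walk the node's
// replicas and transfer away any lease it holds.
package main

import "fmt"

type replica struct {
	rangeID  int
	isLeader bool
	peers    []string // other replicas of the range
}

// drainLeases transfers any held lease to another (arbitrary) replica
// by proposing a LeaderLease naming it.
func drainLeases(replicas []*replica) {
	for _, r := range replicas {
		if !r.isLeader || len(r.peers) == 0 {
			continue
		}
		target := r.peers[0] // any other replica will do
		fmt.Printf("r%d: transferring lease to %s\n", r.rangeID, target)
		r.isLeader = false // the LeaderLease proposal would go here
	}
}

func main() {
	drainLeases([]*replica{
		{rangeID: 1, isLeader: true, peers: []string{"n2", "n3"}},
		{rangeID: 2, isLeader: false, peers: []string{"n1"}},
	})
}
```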
I think it's likely other uses in production will come up where we'll want to control leadership - once the sql planner / sql statistics become more sophisticated and we want to collocate leadership, analogous to how we'll want to collocate replicas for related ranges. At that point, I guess we'd introduce an RPC.
Makes sense?

@bdarnell
Contributor

SGTM for tests and orderly shutdown; I think we'll probably want a full RFC once we're thinking about introducing an RPC for use in live clusters.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jul 11, 2016
Allow the leader of a range to propose a LeaderLease request naming
another replica as the leader.

referencing cockroachdb#6929
@andreimatei
Contributor Author

A TransferLeaseRequest has been introduced in the commit above.
