This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

add new 'query' cluster mode and better name for modes #1243

Merged: 10 commits into master from cluster-modes, Mar 27, 2019

Conversation

@Dieterbe (Contributor) commented Mar 19, 2019:

fix #989
fix #1013

In my testing, it works fine.
You can test it yourself via the included docker-cluster-query stack: run `fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 1000` and issue some fakedata queries against the metrictank data source.

@codelingo bot left a comment:

0 issues found before hitting an error.
Error: Sorry, an error occurred while processing your request. Please try again.

@Dieterbe force-pushed the cluster-modes branch 2 times, most recently from dccd7eb to 194f28c on March 19, 2019 07:54
cluster/manager.go (Outdated)
@@ -62,7 +62,7 @@ var (
chunkMaxStaleStr = flag.String("chunk-max-stale", "1h", "max age for a chunk before to be considered stale and to be persisted to Cassandra.")
metricMaxStaleStr = flag.String("metric-max-stale", "3h", "max age for a metric before to be considered stale and to be purged from memory.")
gcIntervalStr = flag.String("gc-interval", "1h", "Interval to run garbage collection job.")
warmUpPeriodStr = flag.String("warm-up-period", "1h", "duration before secondary nodes start serving requests")
warmUpPeriodStr = flag.String("warm-up-period", "1h", "duration before non-primary (secondary or query ) nodes start serving requests")

Member:

why do query nodes need a warm-up period?

# duration before secondary nodes start serving requests
# shorter warmup means metrictank will need to query cassandra more if it doesn't have requested data yet.
# in clusters, best to assure the primary has saved all the data that a newly warmup instance will need to query, to prevent gaps in charts
# duration before non-primary (secondary or query) nodes start serving requests

Member:

It doesn't make sense to me to have the warm-up period applied to query nodes. If waiting for gossip to settle is needed for query nodes, then it would also be needed for primary nodes, in which case warm-up-period should always apply.

The intent of the warm-up-period is to set the amount of time a node should spend rebuilding in-memory chunks. If we need to set a time to wait for gossip to settle, we should add a new config setting.

@Dieterbe (Contributor, Author) Mar 22, 2019:

Let me explain where I'm coming from.

First of all, by "gossip settle" I mean "become aware of all other cluster peers", or at the very least "become aware of at least one ready replica of each shard" (assuming a healthy cluster wherein all shards are ready).

Here's what happens during startup:

  1. Gossip starting and the http server coming online happen basically at the same time.
  2. Then we load the index, start the index-persist notifiers, and start data consumption.
  3. Now we set the ready state (currently: immediately for primaries, after the warm-up period otherwise).

My position is that we should not speculate that step 2 will always take more time than what's needed for gossip to settle.
In most cases it'll be a cluster that already has some data, and it'll take some time until we reach step 3, enough time for gossip to settle.
But the cluster may contain very little data and step 2 may go very quickly; or maybe the network is slow and gossip is taking some time; or it's an overprovisioned cluster; or peers are slow to gossip because they're overloaded.
The point is, we cannot guarantee that by the time we reach step 3, gossip has settled.
Thus, if the node receives render queries, it may not be able to respond to them properly (not as completely as other nodes that have been up longer would have), or it may error, depending on min-available-shards.

So generally speaking (for sharded instances that receive queries), I think we need to make sure to leave some time for gossip to settle.
Your argument that we shouldn't piggyback this on the warm-up-period setting, but instead introduce a second setting for it, is fine with me. I have no strong preference either way.

As for why primaries become ready immediately at step 3: IIRC this is based on some old logic that if you're querying a primary that just started, you probably have no other instances online that serve the same data this primary is holding (otherwise, why would you query the primary?), so we may as well try our best and serve what we can. This works well in our deployments, where we have primaries that we don't query. But for clusters wherein you query your secondaries and primaries, I think it would make sense to subject the primary to the same warm-up-period.
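
To make "gossip settle" concrete, here is a minimal sketch of that condition (hypothetical code, not metrictank's actual API; the peer type and shard bookkeeping are made up for illustration):

// hypothetical sketch: "gossip has settled" meaning the node knows about at least
// one ready replica of every shard in the cluster.
type peer struct {
	shard int32
	ready bool
}

func gossipSettled(peers []peer, totalShards int32) bool {
	readyShards := make(map[int32]bool)
	for _, p := range peers {
		if p.ready {
			readyShards[p.shard] = true
		}
	}
	return int32(len(readyShards)) == totalShards
}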

Member:

After your explanation it is even more clear that the time for gossip to settle needs to be a separate config setting.

In a cluster, nodes have 3 functions. A single MT node can provide one, two, or all three:
a) ingesting and storing data in memory
b) writing data to backend store
c) executing queries.

for "a" and "b", gossip has nothing to do with whether the node is "ready". Readiness is determined solely by the current state of the node. A node should not be marked as ready unless the union of data in-memory and data in the backend store is the complete set of data. This can be ensured by

  • replaying from kafka
  • waiting a fixed "warm-up-period" to ensure that, for any data older than what a node has in memory, the "primary" node has written it to the backend store. The "warm-up-period" is skipped if a node is a "primary", because the data it has in memory + what is in the store is already the full set of all data.

"c" (executing queries) is the only function that relies on gossip as it needs to know which other shards exist, and which nodes are "ready" for each shard.

@Dieterbe (Contributor, Author):

The "warm-up-period" is skipped If a node is a "primary" because the data it has in-memory + what is in the store is already the full set of all data.

It depends: when not backfilling from kafka (i.e. with offset=newest), if a primary just restarted and we skip warm-up, recent data won't have been written to the store yet and won't be consumed from kafka either.
Hence my earlier point that we may want to subject primaries to the warm-up period as well, though that is out of scope for this PR.

"c" (executing queries) is the only function that relies on gossip as it needs to know which other shards exist, and which nodes are "ready" for each shard.

yes

> After your explanation it is even more clear that the time for gossip to settle needs to be a separate config setting.

OK. I will add a new setting, cluster.gossip-settle-period or something like that, instead of abusing warm-up-period (note: it should probably not go into the swim section, because those settings all directly correspond to memberlist config settings).
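
For illustration, such a setting could be declared alongside the existing cluster flags; this is only a sketch, and the name, default, and description below are not necessarily what was merged:

import "flag"

// hypothetical sketch: a separate setting for gossip settling, instead of reusing warm-up-period.
var gossipSettlePeriodStr = flag.String("gossip-settle-period", "15s",
	"duration to wait after startup for gossip to settle (become aware of the other cluster peers) before serving queries")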

# in clusters, best to assure the primary has saved all the data that a newly warmup instance will need to query, to prevent gaps in charts
# duration before non-primary (secondary or query) nodes start serving requests
# shorter warmup on secondary nodes means metrictank will need to query cassandra more if it doesn't have requested data yet.
# in clusters, best to assure the primary has saved all the data that a newly warmed up instance will need to query,

Member:

This sentence doesn't make sense. Perhaps:

To prevent gaps in charts when running a cluster of nodes you need to either have the nodes backfill data from Kafka (set via "offset" in the "kafka-mdm-in" config) or set the warm-up-period to a value long enough to ensure all data received before the node started has been persisted to the backend store.

@Dieterbe (Contributor, Author) Mar 22, 2019:

It makes sense to me, but yours is much more specific, and thus better.

cluster/node.go (Outdated)

const (
	ModeShard NodeMode = iota
	ModeFull

Member:

This seems like the wrong name for a single node. "ModeAll" seems like a better fit as the node is acting as both a shard (it ingested data) and a querier.

@Dieterbe (Contributor, Author):

"all" may imply it does "all the things". But in this mode there is no clustering (gossip) participation.
Can you articulate the problem with 'full'? I suspect you mean that a cluster can have multiple ModeFull nodes, each of which is not fully responsible for the operation of the cluster (though it does get a full view of all the data).

No name is perfect, I guess.

Member:

To me, modeFull implies a "full" production deployment. How it is used here is really a single dev instance.

Given that you shouldn't run with "modeFull" in production, perhaps modeDev would be more appropriate.

@Dieterbe (Contributor, Author):

It's a good suggestion, but I think running multiple instances that each consume all data (like we used to run worldping way back) - so without sharding and gossip - is a legitimate way to set up a production cluster.

Note that "mode" has changed meaning from mode of the cluster to the mode of the individual instance, so 'full' means 'the node is full' not 'the cluster is full'

If we can't find a better name by end of today let's keep it.

@woodsaj (Member) Mar 25, 2019:

> but I think running multiple instances that each consume all data (like we used to run worldping way back) - so without sharding and gossip - is a legitimate way to set up a production cluster.

I completely disagree with this. We only ran that topology because MT didn't have gossip back then. If you have multiple MT instances (which you need for production workloads to allow upgrades and protect against node failures), you definitely should be using gossip.

If users want to do this, they still can, but by calling it modeDev we make it clear that a production deployment should be using gossip.

@Dieterbe (Contributor, Author):

> multiple MT instances (which you need for production workloads to allow upgrades and protect against node failures) you definitely should be using gossip.

Redundancy for upgrades and protection against node failures only requires running multiple replicas.
Gossip is for sharding, which is for load distribution, which is an orthogonal concern.

Member:

Production deployments should use gossip.

Just because you can deploy in a different way doesn't mean you should.

With the speculativeQuery feature, even a single-shard deployment will benefit from using gossip.

@Dieterbe (Contributor, Author):

That makes sense: we now have a legitimate use case for gossip beyond load distribution, namely speculative queries across replicas.
I think ModeDev is a good name then, and I will change it.
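
Putting the naming discussion together, the per-node modes would look roughly like the sketch below (based on this thread, not necessarily the exact merged code):

type NodeMode int

const (
	// ModeShard: participates in gossip and ingests a shard of the data.
	ModeShard NodeMode = iota
	// ModeDev: a single instance that consumes all the data itself, without gossip
	// (the mode called "full" earlier in this discussion).
	ModeDev
	// ModeQuery: the new mode added by this PR; it ingests no data and only executes
	// queries, relying on gossip to find ready nodes for each shard.
	ModeQuery
)

func (m NodeMode) String() string {
	switch m {
	case ModeShard:
		return "shard"
	case ModeDev:
		return "dev"
	case ModeQuery:
		return "query"
	}
	return "unknown"
}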

# see https://github.com/grafana/metrictank/blob/master/docs/cassandra.md for more details

# enable the cassandra backend store plugin
enabled = true

Contributor:

shouldn't this be disabled for the query node?

Contributor:

Oh, never mind, it's overridden via env vars.

@Dieterbe force-pushed the cluster-modes branch 2 times, most recently from 89e6b81 to c278d4e on March 27, 2019 10:38

@Dieterbe (Contributor, Author):

@replay @woodsaj can you have another look? This should be ready to go now.

@Dieterbe merged commit 293b55b into master on Mar 27, 2019.
@Dieterbe deleted the cluster-modes branch on March 27, 2019 11:48.
Successfully merging this pull request may close these issues:
  • cluster-mode single vs multi is confusing
  • Support external query layer
3 participants