This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

add new 'query' cluster mode and better name for modes #1243

Merged: 10 commits into master from cluster-modes, Mar 27, 2019

Conversation

@Dieterbe (Contributor) commented Mar 19, 2019:

fix #989
fix #1013

In my testing, it works fine.
You can test it yourself via the included docker-cluster-query stack: run `fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 1000` and issue some fakedata queries against the metrictank data source.

@codelingo bot left a comment:

0 issues found before hitting an error.
Error: Sorry, an error occurred while processing your request. Please try again.

@Dieterbe force-pushed the cluster-modes branch 2 times, most recently from dccd7eb to 194f28c on March 19, 2019 07:54
cluster/manager.go (Outdated)
@@ -62,7 +62,7 @@ var (
chunkMaxStaleStr = flag.String("chunk-max-stale", "1h", "max age for a chunk before to be considered stale and to be persisted to Cassandra.")
metricMaxStaleStr = flag.String("metric-max-stale", "3h", "max age for a metric before to be considered stale and to be purged from memory.")
gcIntervalStr = flag.String("gc-interval", "1h", "Interval to run garbage collection job.")
warmUpPeriodStr = flag.String("warm-up-period", "1h", "duration before secondary nodes start serving requests")
warmUpPeriodStr = flag.String("warm-up-period", "1h", "duration before non-primary (secondary or query ) nodes start serving requests")

Member:

why do query nodes need a warm-up period?

# duration before secondary nodes start serving requests
# shorter warmup means metrictank will need to query cassandra more if it doesn't have requested data yet.
# in clusters, best to assure the primary has saved all the data that a newly warmup instance will need to query, to prevent gaps in charts
# duration before non-primary (secondary or query) nodes start serving requests

Member:

It doesn't make sense to me to have the warm-up period applied to query nodes. If waiting for gossip to settle is needed for query nodes, then it would also be needed for primary nodes, in which case warm-up-period should always apply.

The intent of the warm-up-period is to set the amount of time a node should spend rebuilding in-memory chunks. If we need to set a time to wait for gossip to settle, we should add a new config setting.

@Dieterbe (Contributor, Author) Mar 22, 2019:

Let me explain where I'm coming from.

First of all, by "gossip settle" I mean "become aware of all other cluster peers", or at the very least "become aware of at least one ready replica of each shard" (assuming a healthy cluster wherein all shards are ready).

Here's what happens during startup:

  1. Gossip starting and the http server coming online happen basically at the same time.
  2. Then we load the index, start the index-persist notifiers, and start data consumption.
  3. Now we set the ready state (currently: immediately for primaries, after the warm-up period otherwise).

My position is that we should not speculate that step 2 will always take more time than what's needed for gossip to settle.
In most cases it'll be a cluster that already has some data, and it'll take some time until we reach step 3, enough time for gossip to settle.
But the cluster may contain very little data and step 2 may go very quickly; or maybe the network is slow and gossip is taking some time; or it's an overprovisioned cluster; or peers are slow to gossip because they're overloaded.
The point is, we cannot guarantee that by the time we reach step 3, gossip has settled.
Thus, if the node receives render queries, it may not be able to respond to them properly (not as completely as other nodes that have been up longer would have), or it may error, depending on min-available-shards.

So generally speaking (for sharded instances that receive queries), I think we need to make sure to leave some time for gossip to settle.
Your argument that we shouldn't piggyback this on the warm-up-period setting, but instead introduce a second setting for it, is fine with me. I have no strong preference either way.

As for why primaries become ready immediately at step 3: IIRC this is based on some old logic that if you're querying a primary that just started, you probably have no other instances online that serve the same data this primary is holding (otherwise, why would you query the primary?), so we may as well try our best and serve what we can. This works well in our deployments, where we have primaries that we don't query. But for clusters wherein you query your secondaries and primaries, I think it would make sense to subject the primary to the same warm-up-period.
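
To make "gossip settle" concrete, here is a minimal sketch of that condition (hypothetical code, not metrictank's actual API; the peer type and shard bookkeeping are made up for illustration):

// hypothetical sketch: "gossip has settled" meaning the node knows about at least
// one ready replica of every shard in the cluster.
type peer struct {
	shard int32
	ready bool
}

func gossipSettled(peers []peer, totalShards int32) bool {
	readyShards := make(map[int32]bool)
	for _, p := range peers {
		if p.ready {
			readyShards[p.shard] = true
		}
	}
	return int32(len(readyShards)) == totalShards
}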

Member:

After your explanation it is even more clear that the time for gossip to settle needs to be a separate config setting.

In a cluster, nodes have 3 functions. A single MT node can provide one, two, or all three:
a) ingesting and storing data in memory
b) writing data to backend store
c) executing queries.

for "a" and "b", gossip has nothing to do with whether the node is "ready". Readiness is determined solely by the current state of the node. A node should not be marked as ready unless the union of data in-memory and data in the backend store is the complete set of data. This can be ensured by

  • replaying from kafka
  • waiting a fixed "warm-up-period" to ensure that, for any data older than what a node has in memory, the "primary" node has written it to the backend store. The "warm-up-period" is skipped if a node is a "primary", because the data it has in memory + what is in the store is already the full set of all data.

"c" (executing queries) is the only function that relies on gossip as it needs to know which other shards exist, and which nodes are "ready" for each shard.

@Dieterbe (Contributor, Author):

The "warm-up-period" is skipped If a node is a "primary" because the data it has in-memory + what is in the store is already the full set of all data.

It depends: when not backfilling from kafka (i.e. with offset=newest), if a primary just restarted and we skip warm-up, recent data won't have been written to the store yet and won't be consumed from kafka either.
Hence my earlier point that we may want to subject primaries to the warm-up period as well, though that is out of scope for this PR.

"c" (executing queries) is the only function that relies on gossip as it needs to know which other shards exist, and which nodes are "ready" for each shard.

yes

> After your explanation it is even more clear that the time for gossip to settle needs to be a separate config setting.

OK. I will add a new setting, cluster.gossip-settle-period or something like that, instead of abusing warm-up-period (note: it should probably not go into the swim section, because those settings all directly correspond to memberlist config settings).
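
For illustration, such a setting could be declared alongside the existing cluster flags; this is only a sketch, and the name, default, and description below are not necessarily what was merged:

import "flag"

// hypothetical sketch: a separate setting for gossip settling, instead of reusing warm-up-period.
var gossipSettlePeriodStr = flag.String("gossip-settle-period", "15s",
	"duration to wait after startup for gossip to settle (become aware of the other cluster peers) before serving queries")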

# in clusters, best to assure the primary has saved all the data that a newly warmup instance will need to query, to prevent gaps in charts
# duration before non-primary (secondary or query) nodes start serving requests
# shorter warmup on secondary nodes means metrictank will need to query cassandra more if it doesn't have requested data yet.
# in clusters, best to assure the primary has saved all the data that a newly warmed up instance will need to query,

Member:

This sentence doesn't make sense. Perhaps:

To prevent gaps in charts when running a cluster of nodes you need to either have the nodes backfill data from Kafka (set via "offset" in the "kafka-mdm-in" config) or set the warm-up-period to a value long enough to ensure all data received before the node started has been persisted to the backend store.

@Dieterbe (Contributor, Author) Mar 22, 2019:

It makes sense to me, but yours is much more specific, and thus better.

cluster/node.go (Outdated)

const (
	ModeShard NodeMode = iota
	ModeFull

Member:

This seems like the wrong name for a single node. "ModeAll" seems like a better fit as the node is acting as both a shard (it ingested data) and a querier.

@Dieterbe (Contributor, Author):

"all" may imply it does "all the things". But in this mode there is no clustering (gossip) participation.
Can you articulate the problem with 'full'? I suspect you mean that a cluster can have multiple ModeFull nodes, each of which is not fully responsible for the operation of the cluster (though it does get a full view of all the data).

No name is perfect, I guess.

Member:

To me, modeFull implies a "full" production deployment. How it is used here is really a single dev instance.

Given that you shouldn't run with "modeFull" in production, perhaps modeDev would be more appropriate.

@Dieterbe (Contributor, Author):

It's a good suggestion, but I think running multiple instances that each consume all data (like we used to run worldping way back) - so without sharding and gossip - is a legitimate way to set up a production cluster.

Note that "mode" has changed meaning from mode of the cluster to the mode of the individual instance, so 'full' means 'the node is full' not 'the cluster is full'

If we can't find a better name by end of today let's keep it.

@woodsaj (Member) Mar 25, 2019:

> but I think running multiple instances that each consume all data (like we used to run worldping way back) - so without sharding and gossip - is a legitimate way to set up a production cluster.

I completely disagree with this. We only ran that topology because MT didn't have gossip back then. If you have multiple MT instances (which you need for production workloads to allow upgrades and protect against node failures), you definitely should be using gossip.

If users want to do this, they still can, but by calling it modeDev we make it clear that a production deployment should be using gossip.

@Dieterbe (Contributor, Author):

> multiple MT instances (which you need for production workloads to allow upgrades and protect against node failures) you definitely should be using gossip.

Redundancy for upgrades and protection against node failures only requires running multiple replicas.
Gossip is for sharding, which is for load distribution, which is an orthogonal concern.

Member:

Production deployments should use gossip.

Just because you can deploy in a different way doesn't mean you should.

With the speculativeQuery feature, even a single-shard deployment will benefit from using gossip.

@Dieterbe (Contributor, Author):

That makes sense: we now have a legitimate use case for gossip beyond load distribution, namely speculative queries across replicas.
I think ModeDev is a good name then, and I will change it.
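
Putting the naming discussion together, the per-node modes would look roughly like the sketch below (based on this thread, not necessarily the exact merged code):

type NodeMode int

const (
	// ModeShard: participates in gossip and ingests a shard of the data.
	ModeShard NodeMode = iota
	// ModeDev: a single instance that consumes all the data itself, without gossip
	// (the mode called "full" earlier in this discussion).
	ModeDev
	// ModeQuery: the new mode added by this PR; it ingests no data and only executes
	// queries, relying on gossip to find ready nodes for each shard.
	ModeQuery
)

func (m NodeMode) String() string {
	switch m {
	case ModeShard:
		return "shard"
	case ModeDev:
		return "dev"
	case ModeQuery:
		return "query"
	}
	return "unknown"
}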

# see https://github.com/grafana/metrictank/blob/master/docs/cassandra.md for more details

# enable the cassandra backend store plugin
enabled = true

Contributor:

shouldn't this be disabled for the query node?

Contributor:

Oh, never mind, it's overridden via env vars.

@Dieterbe force-pushed the cluster-modes branch 2 times, most recently from 89e6b81 to c278d4e on March 27, 2019 10:38

@Dieterbe (Contributor, Author):

@replay @woodsaj can you have another look? This should be ready to go now.

@Dieterbe merged commit 293b55b into master on Mar 27, 2019.
@Dieterbe deleted the cluster-modes branch on March 27, 2019 11:48.
Successfully merging this pull request may close these issues:
  • cluster-mode single vs multi is confusing
  • Support external query layer
3 participants