
apply max-series-per-req to non-tagged queries #1926

Merged
merged 11 commits on Oct 22, 2020

Conversation


@Dieterbe Dieterbe commented Oct 16, 2020

max-series-per-req has been implemented for tagged queries for quite a while, but wasn't implemented yet for non-tagged queries. We have seen customers run excessively large queries that lead to lots of allocated memory; we want to reject such queries instead. Fixes #1916

This PR does the following:

  • implement a fractional limit. E.g. when a query node receives a query with a limit of 100, and the cluster has 10 shards and 5 shardgroups (meaning each read node owns 2 of the 10 shards), then each read node will see a limit of 100 * 2 / 10 = 20 (see the sketch after this list).
  • the read nodes will try to detect limit breach as early as possible without first allocating a bunch of stuff. also, limit breaches are cached in the findCache
  • some minor refactoring to make this possible and some small doc tweaks. In particular, fetchFunc now gets the full peersGroup map, so it has the cluster awareness it needs to give each peer its correct fractional limit (though there's a caveat here: if shardgroups are completely down - a degraded cluster - then the fractional limit will be higher than it should be. Suboptimal, but acceptable)
  • add a "global" limit: on a query node, even if all individual fan-out requests somehow succeed but the aggregate still breaches the limit (unlikely), return an error
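To make the fractional-limit arithmetic from the first bullet concrete, here is a minimal sketch; the function and parameter names are illustrative, not the actual metrictank identifiers:

```go
package main

import "fmt"

// fractionalLimit scales the cluster-wide series limit down to the fraction
// of shards that a single read node owns. Hypothetical helper, for
// illustration only.
func fractionalLimit(globalLimit, shardsOwnedByPeer, totalShards int) int {
	if totalShards == 0 {
		return globalLimit
	}
	return globalLimit * shardsOwnedByPeer / totalShards
}

func main() {
	// The example from the description: limit 100, 10 shards, each read node
	// owns 2 of them -> each read node sees a limit of 20.
	fmt.Println(fractionalLimit(100, 2, 10)) // 20
}
```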

see all individual commits for more details.

fetchFunc can use this to determine the ratio of how much
data the target peer owns compared to the cluster as a whole.
Caveat: this is based on live cluster state. If shardgroups
go completely down it'll look like the target peer owns more of the
cluster than it actually does.
If query limits are set based on this, the limits would loosen up as
shards leave the cluster.
We do this by adjusting queryAllShards such that it doesn't assume
a simple peer.Post but allows passing in the fetchFunc, so that
we can define a custom fetchFunc that can figure out the fraction
of the data that our target peer is responsible for.
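The shape of that refactor can be sketched roughly as follows; Peer, fetchFunc and queryAllShards here are simplified stand-ins, not the exact metrictank API:

```go
package cluster

import "context"

// Peer is a stand-in for a cluster member that can be queried.
type Peer interface {
	Name() string
	Post(ctx context.Context, path string, body []byte) ([]byte, error)
}

// fetchFunc is called once per target peer. Because it receives the full
// shardgroup -> peers map, it can work out what fraction of the cluster the
// target peer owns and hand it an appropriately scaled (fractional) limit.
type fetchFunc func(ctx context.Context, target Peer, peerGroups map[int32][]Peer) ([]byte, error)

// queryAllShards fans a request out to one peer per shardgroup via the
// supplied fetch function, rather than assuming a plain peer.Post call.
func queryAllShards(ctx context.Context, peerGroups map[int32][]Peer, fetch fetchFunc) ([][]byte, error) {
	var results [][]byte
	for _, peers := range peerGroups {
		if len(peers) == 0 {
			continue
		}
		target := peers[0] // replica selection / speculative retries elided
		resp, err := fetch(ctx, target, peerGroups)
		if err != nil {
			return nil, err
		}
		results = append(results, resp)
	}
	return results, nil
}
```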

Dieterbe commented Oct 16, 2020

tested in docker-cluster-query with these tweaks:

-max-series-per-req = 250000
+max-series-per-req = 15
 # require x-org-id authentication to auth as a specific org. otherwise orgId 1 is assumed
 multi-tenant = true
 # in case our /render endpoint does not support the requested processing, proxy the request to this graphite
 fallback-graphite-addr = http://graphite
 # proxy to graphite when metrictank considers the request bad
-proxy-bad-requests = true //otherwise the http 400's will cause a proxy to graphite. it would still error the same, but the messages are harder to read
+proxy-bad-requests = false

and

-      MT_LOG_LEVEL: info
+      MT_LOG_LEVEL: debug

on all metrictank processes

mt-fakemetrics feed --kafka-mdm-addr localhost:9092 --period 10s --mpo 100
docker-compose logs -f metrictank0 metrictank1 metrictank2 metrictank3 metrictank-q0 | grep -v 'memberlist|CLU manager|Sarama|AM:|kafkamdm|already in index|stats flushing.*to graphite|kafka-cluster|cassandra-store: (save complete|starting to save)|updating.*in.*index'  # easy logging view

demonstration: single target

wget -q 'http://localhost:6061/render?target=some.id.of.a.metric.1*&from=-60s' -O - | jsonpp | grep target
        "target": "some.id.of.a.metric.1",
        "target": "some.id.of.a.metric.10",
        "target": "some.id.of.a.metric.100",
        "target": "some.id.of.a.metric.11",
        "target": "some.id.of.a.metric.12",
        "target": "some.id.of.a.metric.13",
        "target": "some.id.of.a.metric.14",
        "target": "some.id.of.a.metric.15",
        "target": "some.id.of.a.metric.16",
        "target": "some.id.of.a.metric.17",
        "target": "some.id.of.a.metric.18",
        "target": "some.id.of.a.metric.19",
wget --server-response 'http://localhost:6061/render?target=some.id.of.a.metric.*&from=-60s' -O -
--2020-10-16 22:43:16--  http://localhost:6061/render?target=some.id.of.a.metric.*&from=-60s
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:6061... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 400 Bad Request
  Content-Length: 15
  Content-Type: text/plain
  Date: Fri, 16 Oct 2020 20:43:16 GMT
  Server: Caddy
  Trace-Id: 28522102b04971f2
  Vary: Origin
2020-10-16 22:43:16 ERROR 400: Bad Request.

demonstration of breaching limit across multiple targets

wget -q 'http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.1&from=-60s' -O - | jsonpp | grep target
        "target": "some.id.of.a.metric.1",
        "target": "some.id.of.a.metric.10",
        "target": "some.id.of.a.metric.100",
        "target": "some.id.of.a.metric.11",
        "target": "some.id.of.a.metric.12",
        "target": "some.id.of.a.metric.13",
        "target": "some.id.of.a.metric.14",
        "target": "some.id.of.a.metric.15",
        "target": "some.id.of.a.metric.16",
        "target": "some.id.of.a.metric.17",
        "target": "some.id.of.a.metric.18",
        "target": "some.id.of.a.metric.19",
        "target": "some.id.of.a.metric.1",
wget --server-response  'http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.{1,2}&from=-60s' -O -
--2020-10-16 22:39:35--  http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.%7B1,2%7D&from=-60s
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:6061... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 400 Bad Request
  Content-Length: 15
  Content-Type: text/plain
  Date: Fri, 16 Oct 2020 20:39:35 GMT
  Server: Caddy
  Trace-Id: 2153aa0824baf179
  Vary: Origin
2020-10-16 22:39:35 ERROR 400: Bad Request.

Note: this one works because we dedup identical targets. Bit of an (unlikely) loophole: it returns all targets twice.

wget --server-response  'http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.1*&from=-60s'

I also verified that the HTTP 400 check correctly aborts the spec-exec, leading to a cancellation.
(Unfortunately we don't actually implement the cancellation on the read nodes yet.)

metrictank-q0_1  | ABORT SPECEXEC
metrictank-q0_1  | 2020-10-16 20:46:43.103 [ERROR] Peer metrictank3 responded with error = "400 Bad Request"
metrictank-q0_1  | 2020-10-16 20:46:43.103 [INFO] ts=2020-10-16T20:46:43.103780874Z traceID=79c160bd75d20d0b, sampled=true msg="GET /render?from=-60s&target=some.id.of.a.metric.1%2A&target=some.id.of.a.metric.%2A (400) 1.055897ms" orgID=1 sourceIP="172.25.0.1" error="400%20Bad%20Request"
metrictank-q0_1  | 2020-10-16 20:46:43.103 [INFO] CLU HTTPNode: context canceled on request to peer metrictank0
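The cancellation pattern shown in this log can be illustrated with a small sketch; names and signatures are assumptions, and the actual speculative-execution code in metrictank is more involved:

```go
package cluster

import (
	"context"
	"fmt"
)

// fanOut queries all peers concurrently. If any peer answers with a
// non-retryable error (e.g. a 400 from a breached series limit), the shared
// context is cancelled so the remaining in-flight requests are abandoned:
// a request that breaches the limit on one replica will breach it on any
// other replica of the same shard too, so retrying is pointless.
func fanOut(parent context.Context, peers []string, fetch func(ctx context.Context, peer string) error) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	errCh := make(chan error, len(peers))
	for _, p := range peers {
		go func(p string) { errCh <- fetch(ctx, p) }(p)
	}

	var firstErr error
	for range peers {
		if err := <-errCh; err != nil && firstErr == nil {
			firstErr = fmt.Errorf("peer responded with error: %w", err)
			cancel() // abort the speculative / outstanding requests
		}
	}
	return firstErr
}
```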

I also added a couple of debug prints for the findCache and tested querying 1* (ok) and * (not ok).
The cache seemed to handle this fine.

metrictank2_1    | DIETER ADDING TO CACHE some.id.of.a.metric.1*
metrictank2_1    | ([]*memory.Node) (len=6 cap=8) {
metrictank2_1    |  (*memory.Node)(0xc001a72500)(leaf - some.id.of.a.metric.16),
metrictank2_1    |  (*memory.Node)(0xc001a72680)(leaf - some.id.of.a.metric.14),
metrictank2_1    |  (*memory.Node)(0xc001a728c0)(leaf - some.id.of.a.metric.15),
metrictank2_1    |  (*memory.Node)(0xc001a72900)(leaf - some.id.of.a.metric.1),
metrictank2_1    |  (*memory.Node)(0xc001a729c0)(leaf - some.id.of.a.metric.13),
metrictank2_1    |  (*memory.Node)(0xc0017bfc80)(leaf - some.id.of.a.metric.100)
metrictank2_1    | }
metrictank2_1    | (interface {}) <nil>
...
metrictank2_1    | DIETER FINDCACHE SAID some.id.of.a.metric.1*
metrictank2_1    | (memory.CacheResult) {
metrictank2_1    |  nodes: ([]*memory.Node) (len=6 cap=8) {
metrictank2_1    |   (*memory.Node)(0xc001a72500)(leaf - some.id.of.a.metric.16),
metrictank2_1    |   (*memory.Node)(0xc001a72680)(leaf - some.id.of.a.metric.14),
metrictank2_1    |   (*memory.Node)(0xc001a728c0)(leaf - some.id.of.a.metric.15),
metrictank2_1    |   (*memory.Node)(0xc001a72900)(leaf - some.id.of.a.metric.1),
metrictank2_1    |   (*memory.Node)(0xc001a729c0)(leaf - some.id.of.a.metric.13),
metrictank2_1    |   (*memory.Node)(0xc0017bfc80)(leaf - some.id.of.a.metric.100)
metrictank2_1    |  },
metrictank2_1    |  err: (error) <nil>
metrictank2_1    | }
...
metrictank2_1    | DIETER ADDING TO CACHE some.id.of.a.metric.*
metrictank2_1    | ([]*memory.Node) <nil>
metrictank2_1    | (errors.BadRequest) (len=15) limit exhausted
metrictank2_1    | 2020-10-16 20:59:49.429 [INFO] ts=2020-10-16T20:59:49.428962139Z traceID=650ec2db82e2365d, sampled=true msg="POST /index/find (400) 413.898µs" orgID=0 sourceIP="172.25.0.14" error="limit%20exhausted"
metrictank2_1    | 2020/10/16 21:00:08 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.0.13:7946
...
metrictank2_1    | DIETER FINDCACHE SAID some.id.of.a.metric.*
metrictank2_1    | (memory.CacheResult) {
metrictank2_1    |  nodes: ([]*memory.Node) <nil>,
metrictank2_1    |  err: (errors.BadRequest) (len=15) limit exhausted
metrictank2_1    | }
metrictank2_1    | 2020-10-16 21:01:38.173 [INFO] ts=2020-10-16T21:01:38.173685973Z traceID=34079b8664f88809, sampled=true msg="POST /index/find (400) 365.66µs" orgID=0 sourceIP="172.25.0.14" error="limit%20exhausted"

@Dieterbe Dieterbe marked this pull request as ready for review October 16, 2020 21:06
Note that we also now cache limit breaches in the findCache:
if the request is bad, there is no point retrying on a different replica
of the same shard. Breaking the limit on one will also break it on the
2nd. See also #985
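A simplified sketch of that idea, mirroring the CacheResult shape visible in the debug output above; the names and methods here are illustrative rather than the actual memory package API:

```go
package memory

import "sync"

// Node is a stand-in for the index node type returned by a find.
type Node struct{ Path string }

// CacheResult caches the outcome of a find: either the matching nodes or the
// error it produced (e.g. a BadRequest "limit exhausted"), so a known limit
// breach is answered from cache instead of walking the index again.
type CacheResult struct {
	nodes []*Node
	err   error
}

type findCache struct {
	sync.Mutex
	entries map[string]CacheResult
}

func newFindCache() *findCache {
	return &findCache{entries: make(map[string]CacheResult)}
}

func (c *findCache) Add(pattern string, nodes []*Node, err error) {
	c.Lock()
	defer c.Unlock()
	c.entries[pattern] = CacheResult{nodes: nodes, err: err}
}

func (c *findCache) Get(pattern string) (CacheResult, bool) {
	c.Lock()
	defer c.Unlock()
	res, ok := c.entries[pattern]
	return res, ok
}
```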

Dieterbe commented Oct 16, 2020

caveat:
if find is called with the same patterns but different limits, this may lead to a proliferation of entries in the findCache.
This would only happen if the shard topology changes (or shardgroups go down) on a live cluster (rare), and also if the same patterns are queried but as different targets (e.g. target=foo, target=bar&target=foo, target=foo&target=consolidateBy(foo,'sum')). These may all lead to different limits being applied (the limit takes into account already resolved series) and thus the size of the findCache increasing.

This all sounds reasonable though. In practice this should typically not have an adverse effect.
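As an illustrative example (made-up numbers): with max-series-per-req = 15, a request for target=bar&target=foo that resolves bar to 5 series first would issue the find for foo with a remaining limit of 10, while a request for just target=foo issues it with the full 15, so the same foo pattern can end up in the findCache under two different limits.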

@robert-milan robert-milan (Contributor) left a comment

I've reviewed it and tested it locally. Overall I am pretty sure it is fine, but would like a little more time to look into 1 or 2 things.

@robert-milan robert-milan (Contributor) left a comment

I think this should work fine.

Successfully merging this pull request may close these issues.

max-series-per-req should also apply to non-tagged metrics