Range query not looked up from the results cache when start=end and step=86400 #5253

pracucci · 2023-06-14T16:17:12Z

I'm observing some recurring range query requests from a customer which I was expected to be looked up from the results cache, but they're not. The queries look the exact same in terms of query parameters and the end time is older than "10m ago", the customer has no OOO ingestion enabled, so I'm bit puzzled why they're picked up from the cache. I'm opening this issue to track the investigation.

Sample of one of such queries:

Current time: 2023-06-14 16:00 UTC
start=1686088800 (2023-06-06 22:00:00 UTC)
end=1686693600 (2023-06-13 22:00:00 UTC)
step=86400
query=REDACTED (but the exact same across all requests)

Edit: these query results are not cached because not step aligned.

The text was updated successfully, but these errors were encountered:

56quarters · 2023-06-14T17:45:11Z

This seems roughly expected since we don't cache queries with start and end times that aren't step-aligned? logic

pracucci · 2023-06-14T18:41:16Z

This seems roughly expected since we don't cache queries with start and end times that aren't step-aligned? logic

Oh sure, you're right. I totally forgot about the step alignment.

pracucci · 2023-06-15T03:47:04Z

We have a configuration option to force the caching of non-step aligned requests (-query-frontend.cache-unaligned-requests) but it can only be configured globally. Enabling it in a large multi-tenant cluster may not be a good idea, because would pollute the query results cache with a bunch of results not reusable. However, could be beneficial being able to configure it on a per-tenant basis, so that we can selectively enable it for a specific tenant when we see an opportunity (like in this case).

pstibrany · 2023-06-15T06:15:48Z

Our main reason for disabling step-alignment was compatibility with PromQL.

pracucci · 2023-06-15T06:18:27Z

Our main reason for disabling step-alignment was compatibility with PromQL.

My previous comment is about a different config option. It does not change the query, but force the caching of non-step aligned queries too. If a customer sends repetitive non-step aligned queries with the same exact parameters, it would benefit from the cache anyway.

francoposa · 2023-06-21T22:04:14Z

picking this up - is there prior art on duplicating or migrating an existing config option into the limits section of the yaml so that we can set it per-tenant instead of globally?

We have deprecation labels, but it appears that would also mark the CLI option as deprecated, which we don't need because the CLI option -query-frontend.cache-unaligned-requests can remain as-is.

pstibrany · 2023-06-22T07:30:43Z

picking this up - is there prior art on duplicating or migrating an existing config option into the limits section of the yaml so that we can set it per-tenant instead of globally?

You can take a look at #4287

It introduces per-tenant QueryIngestersWithin field, where we only had global value before. CLI flag is still the same (-querier.query-ingesters-within, now setting per-tenant default), but old YAML field is deprecated. However it's not removed, and when it's set (= doesn't have default value), we actually set the per-tenant default value from it. (Old field QueryIngestersWithin is defined in querier.go)

francoposa · 2023-06-22T17:39:26Z

is there any reason we would want to take this opportunity to set a separate TTL for these queries, a la ResultsCacheTTLForOutOfOrderTimeWindow and ResultsCacheTTLForCardinalityQuery?

56quarters · 2023-06-22T20:10:53Z

is there any reason we would want to take this opportunity to set a separate TTL for these queries, a la ResultsCacheTTLForOutOfOrderTimeWindow and ResultsCacheTTLForCardinalityQuery?

It seems like the plan for this feature is not to enable it by default, but only when we notice a repeated query so I don't think a separate TTL is necessary in that case but I don't feel strongly about it

francoposa · 2023-06-27T19:06:48Z

As of now, all tenant-specific request cache options under limits are decided with multiple tenantIDs as the input:

func (s *splitAndCacheMiddleware) getCacheOptions(tenantIDs []string) (ttl, ttlInOOO, oooWindow time.Duration) {
	ttl = validation.SmallestPositiveNonZeroDurationPerTenant(tenantIDs, s.limits.ResultsCacheTTL)
	ttlInOOO = validation.SmallestPositiveNonZeroDurationPerTenant(tenantIDs, s.limits.ResultsCacheTTLForOutOfOrderTimeWindow)
	oooWindow = validation.MaxDurationPerTenant(tenantIDs, s.limits.OutOfOrderTimeWindow)
	return
}

These are pretty straightforward as they are scalar values, we have decided to take either the min or max for all tenants depending on the parameter.

For boolean values such as cache_unaligned_requests, how do we want to handle this?
The cache key is calculated on the concatenation of the tenant IDs.
Imagining that of tenants A, B, and C, only A has cache_unaligned_requests enabled:

Option 1: cache only if all tenants being queried have cache_unaligned_requests is enabled
1. Query A: cache miss, response cached
2. Query A: cache hit
3. Query A & B: cache miss, response not cached
4. Query A & C: cache miss, response not cached

Option 2: cache if any tenant being queried has cache_unaligned_requests enabled
1. Query A: cache miss, response cached
2. Query A: cache hit
3. Query A & B: cache miss , response cached - now two cache entries
4. Query A & C: cache miss, response cached - now three cache entries
6. Query A, or A & B or A & C: cache hit

My initial instinct is that Option 1 makes the most sense from a user perspective.
Option 2 is likely to create a lot of cache entries that never get used, and an operator always has the option to enable it for more tenants if they need multi-tenant non-step-aligned queries cached.

56quarters · 2023-06-27T20:30:15Z

My initial instinct is that Option 1 makes the most sense from a user perspective.
Option 2 is likely to create a lot of cache entries that never get used, and an operator always has the option to enable it for more tenants if they need multi-tenant non-step-aligned queries cached.

Agreed, I think option 1 makes the most sense. FWIW queries that involve multiple tenants are a small fraction of single-tenant queries in Grafana Cloud and multi-tenant queries (tenant federation) are off by default in Mimir.

pracucci added the component/query-frontend label Jun 14, 2023

francoposa mentioned this issue Jun 22, 2023

issues/5253 convert cache-unaligned-requests to per-tenant config #5312

Merged

3 tasks

pstibrany closed this as completed in #5312 Jul 4, 2023

osg-grafana mentioned this issue Nov 28, 2023

Docs: Document how query alignment affects cacheability #6711

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Range query not looked up from the results cache when start=end and step=86400 #5253

Range query not looked up from the results cache when start=end and step=86400 #5253

pracucci commented Jun 14, 2023 •

edited

Loading

56quarters commented Jun 14, 2023

pracucci commented Jun 14, 2023

pracucci commented Jun 15, 2023

pstibrany commented Jun 15, 2023

pracucci commented Jun 15, 2023

francoposa commented Jun 21, 2023

pstibrany commented Jun 22, 2023

francoposa commented Jun 22, 2023

56quarters commented Jun 22, 2023

francoposa commented Jun 27, 2023

56quarters commented Jun 27, 2023

Range query not looked up from the results cache when start=end and step=86400 #5253

Range query not looked up from the results cache when start=end and step=86400 #5253

Comments

pracucci commented Jun 14, 2023 • edited Loading

56quarters commented Jun 14, 2023

pracucci commented Jun 14, 2023

pracucci commented Jun 15, 2023

pstibrany commented Jun 15, 2023

pracucci commented Jun 15, 2023

francoposa commented Jun 21, 2023

pstibrany commented Jun 22, 2023

francoposa commented Jun 22, 2023

56quarters commented Jun 22, 2023

francoposa commented Jun 27, 2023

56quarters commented Jun 27, 2023

pracucci commented Jun 14, 2023 •

edited

Loading