Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range query not looked up from the results cache when start=end and step=86400 #5253

Closed
pracucci opened this issue Jun 14, 2023 · 11 comments · Fixed by #5312
Closed

Range query not looked up from the results cache when start=end and step=86400 #5253

pracucci opened this issue Jun 14, 2023 · 11 comments · Fixed by #5312

Comments

@pracucci
Copy link
Collaborator

pracucci commented Jun 14, 2023

I'm observing some recurring range query requests from a customer which I was expected to be looked up from the results cache, but they're not. The queries look the exact same in terms of query parameters and the end time is older than "10m ago", the customer has no OOO ingestion enabled, so I'm bit puzzled why they're picked up from the cache. I'm opening this issue to track the investigation.

Sample of one of such queries:

  • Current time: 2023-06-14 16:00 UTC
  • start=1686088800 (2023-06-06 22:00:00 UTC)
  • end=1686693600 (2023-06-13 22:00:00 UTC)
  • step=86400
  • query=REDACTED (but the exact same across all requests)

Edit: these query results are not cached because not step aligned.

@56quarters
Copy link
Contributor

This seems roughly expected since we don't cache queries with start and end times that aren't step-aligned? logic

@pracucci
Copy link
Collaborator Author

This seems roughly expected since we don't cache queries with start and end times that aren't step-aligned? logic

Oh sure, you're right. I totally forgot about the step alignment.

@pracucci
Copy link
Collaborator Author

We have a configuration option to force the caching of non-step aligned requests (-query-frontend.cache-unaligned-requests) but it can only be configured globally. Enabling it in a large multi-tenant cluster may not be a good idea, because would pollute the query results cache with a bunch of results not reusable. However, could be beneficial being able to configure it on a per-tenant basis, so that we can selectively enable it for a specific tenant when we see an opportunity (like in this case).

@pstibrany
Copy link
Member

Our main reason for disabling step-alignment was compatibility with PromQL.

@pracucci
Copy link
Collaborator Author

Our main reason for disabling step-alignment was compatibility with PromQL.

My previous comment is about a different config option. It does not change the query, but force the caching of non-step aligned queries too. If a customer sends repetitive non-step aligned queries with the same exact parameters, it would benefit from the cache anyway.

@francoposa
Copy link
Member

picking this up - is there prior art on duplicating or migrating an existing config option into the limits section of the yaml so that we can set it per-tenant instead of globally?

We have deprecation labels, but it appears that would also mark the CLI option as deprecated, which we don't need because the CLI option -query-frontend.cache-unaligned-requests can remain as-is.

@pstibrany
Copy link
Member

picking this up - is there prior art on duplicating or migrating an existing config option into the limits section of the yaml so that we can set it per-tenant instead of globally?

You can take a look at #4287

It introduces per-tenant QueryIngestersWithin field, where we only had global value before. CLI flag is still the same (-querier.query-ingesters-within, now setting per-tenant default), but old YAML field is deprecated. However it's not removed, and when it's set (= doesn't have default value), we actually set the per-tenant default value from it. (Old field QueryIngestersWithin is defined in querier.go)

@francoposa
Copy link
Member

is there any reason we would want to take this opportunity to set a separate TTL for these queries, a la ResultsCacheTTLForOutOfOrderTimeWindow and ResultsCacheTTLForCardinalityQuery?

@56quarters
Copy link
Contributor

is there any reason we would want to take this opportunity to set a separate TTL for these queries, a la ResultsCacheTTLForOutOfOrderTimeWindow and ResultsCacheTTLForCardinalityQuery?

It seems like the plan for this feature is not to enable it by default, but only when we notice a repeated query so I don't think a separate TTL is necessary in that case but I don't feel strongly about it

@francoposa
Copy link
Member

As of now, all tenant-specific request cache options under limits are decided with multiple tenantIDs as the input:

func (s *splitAndCacheMiddleware) getCacheOptions(tenantIDs []string) (ttl, ttlInOOO, oooWindow time.Duration) {
	ttl = validation.SmallestPositiveNonZeroDurationPerTenant(tenantIDs, s.limits.ResultsCacheTTL)
	ttlInOOO = validation.SmallestPositiveNonZeroDurationPerTenant(tenantIDs, s.limits.ResultsCacheTTLForOutOfOrderTimeWindow)
	oooWindow = validation.MaxDurationPerTenant(tenantIDs, s.limits.OutOfOrderTimeWindow)
	return
}

These are pretty straightforward as they are scalar values, we have decided to take either the min or max for all tenants depending on the parameter.

For boolean values such as cache_unaligned_requests, how do we want to handle this?
The cache key is calculated on the concatenation of the tenant IDs.
Imagining that of tenants A, B, and C, only A has cache_unaligned_requests enabled:

Option 1: cache only if all tenants being queried have cache_unaligned_requests is enabled
1. Query A: cache miss, response cached
2. Query A: cache hit
3. Query A & B: cache miss, response not cached
4. Query A & C: cache miss, response not cached

Option 2: cache if any tenant being queried has cache_unaligned_requests enabled
1. Query A: cache miss, response cached
2. Query A: cache hit
3. Query A & B: cache miss , response cached - now two cache entries
4. Query A & C: cache miss, response cached - now three cache entries
6. Query A, or A & B or A & C: cache hit

My initial instinct is that Option 1 makes the most sense from a user perspective.
Option 2 is likely to create a lot of cache entries that never get used, and an operator always has the option to enable it for more tenants if they need multi-tenant non-step-aligned queries cached.

@56quarters
Copy link
Contributor

My initial instinct is that Option 1 makes the most sense from a user perspective.
Option 2 is likely to create a lot of cache entries that never get used, and an operator always has the option to enable it for more tenants if they need multi-tenant non-step-aligned queries cached.

Agreed, I think option 1 makes the most sense. FWIW queries that involve multiple tenants are a small fraction of single-tenant queries in Grafana Cloud and multi-tenant queries (tenant federation) are off by default in Mimir.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants