
Search Memory Tracking - track memory used during a shard search #1009

Open · 6 tasks
malpani opened this issue Jul 26, 2021 · 4 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Search:Performance, Search (Search query, autocomplete ...etc)

Comments

@malpani (Contributor) commented Jul 26, 2021

Is your feature request related to a problem? Please describe.
There is limited visibility into how much memory is consumed by a query. In an ideal world, resource consumption details would be abstracted away from users and everything would auto-tune/auto-reject. But we are not there (yet!), and with every query treated equally, certain memory-heavy queries can end up tripping the memory breakers for all requests. It will be helpful to track and surface the memory consumed by a query. This visibility can help users tune their queries better.

Describe the solution you'd like
The plan is to make this generic and expose these stats via the tasks framework. The tasks framework already tracks latency and has some context about the query/work being done. The idea is to enhance it to track additional stats for memory and CPU consumed per task. Because tasks have a nice parent -> child hierarchy, this mechanism will allow tracking the cluster-wide resource consumption of a query. So the plan is to update Task to track additional context and stats. When a task completes, its task info will be pushed to a sink; the sink can be logs or a system index to enable additional insights.
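To make the parent -> child rollup concrete, here is a rough sketch of how per-task memory could be accumulated under the parent task. The class and method names are purely illustrative assumptions, not part of the existing Task framework:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: each shard-level child task reports the memory it
// measured, and the coordinator-side parent task sums the contributions to
// get a cluster-wide view of the query's resource consumption.
final class TaskResourceUsage {
    // parentTaskId -> total bytes allocated across all child (shard) tasks
    private final Map<Long, LongAdder> bytesByParentTask = new ConcurrentHashMap<>();

    // Called when a child task completes with the memory it measured.
    void recordChildCompletion(long parentTaskId, long allocatedBytes) {
        bytesByParentTask
            .computeIfAbsent(parentTaskId, id -> new LongAdder())
            .add(allocatedBytes);
    }

    // Cluster-wide memory attributed to the query identified by its parent task.
    long totalBytesForParent(long parentTaskId) {
        LongAdder adder = bytesByParentTask.get(parentTaskId);
        return adder == null ? 0L : adder.sum();
    }
}
```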

For search-side tracking of stats, the proposed solution is to leverage the single-threaded nature of searching within a shard. I plan to use ThreadMXBean.getCurrentThreadAllocatedBytes to track memory consumption and expose this in two forms.
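A minimal sketch of that per-thread measurement, assuming a HotSpot JDK where ManagementFactory.getThreadMXBean() can be cast to com.sun.management.ThreadMXBean and getCurrentThreadAllocatedBytes() is available (JDK 14+; older JDKs expose getThreadAllocatedBytes(threadId) instead). The class and method names here are hypothetical:

```java
import java.lang.management.ManagementFactory;

import com.sun.management.ThreadMXBean;

public final class ShardSearchMemoryProbe {

    // On HotSpot the platform ThreadMXBean also implements the
    // com.sun.management extension that exposes allocation counters.
    private static final ThreadMXBean THREAD_MX_BEAN =
        (ThreadMXBean) ManagementFactory.getThreadMXBean();

    /**
     * Runs the shard-level search work on the calling thread and returns the
     * number of bytes that thread allocated while the work ran. This relies
     * on searching within a shard being single threaded, as described above.
     */
    static long measureAllocatedBytes(Runnable shardSearchWork) {
        long before = THREAD_MX_BEAN.getCurrentThreadAllocatedBytes();
        shardSearchWork.run();
        return THREAD_MX_BEAN.getCurrentThreadAllocatedBytes() - before;
    }
}
```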

Based on some initial rally benchmarks on a POC, the overhead does not look high. Having said that, my plan is to gate this behind a cluster setting search.track_resources that defaults to false (disabled).
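A sketch of how that gate might be declared, assuming the standard org.opensearch.common.settings.Setting API; only the setting name and default come from the proposal above, and the wrapping class is hypothetical:

```java
import org.opensearch.common.settings.Setting;

public final class SearchResourceTrackingSettings {

    // Dynamic node-scoped boolean setting, disabled by default as proposed.
    public static final Setting<Boolean> SEARCH_TRACK_RESOURCES =
        Setting.boolSetting(
            "search.track_resources",
            false,                          // disabled by default
            Setting.Property.Dynamic,       // toggleable via cluster settings update
            Setting.Property.NodeScope);
}
```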

Describe alternatives you've considered

  • Instead of exposing this via the generic tasks framework, the change could have exposed this information via:
    • Slow logs - search slow logs
    • Node stats - adding a new search_stats section to the /_nodes/stats API that returns the top N expensive queries
      However, this model is restricted to search and will require additional work to track, at the parent level, the cluster-wide impact of a query. Hence this alternative, while less work, is not as powerful.
  • Performance Analyzer also tracks metrics, but I did not go down that route because eventually this could serve as feedback to improve memory estimations prior to executing a query. Further, the slow-log plumbing is already well defined in the core.

Planning

malpani added the enhancement label on Jul 26, 2021
@AmiStrn (Contributor) commented Jul 27, 2021

How about having a way to stop/deprioritise memory-heavy queries, kind of like the way a query timeout works?

This is different from the observability issue, but it makes sense to prevent these really intensive queries to begin with.
(In addition to, not instead of...)

@Bukhtawar (Collaborator) commented Jul 27, 2021

Nice proposal. Maybe we need an extension for aggregation reduce phases on the coordinator as well (major contributors to memory), while also being cautious about deserialisation overhead.

@AmiStrn maybe we need special handling for query prioritization; for instance, async searches should have a different priority than a usual search #1017. Also, we might need to track/estimate memory prior to the allocation in order to terminate early. I guess both of the above can be tracked separately. Thoughts?

@malpani (Contributor, author) commented Jul 27, 2021

@AmiStrn Today a query execution can be stopped in scenarios like hitting the bucket limit or the parent breakers. There is value in adding some notion of a memory sandbox and preempting the query when it hits a 'per query memory limit' as the next phase, and eventually improving the memory estimation (prior to executing).
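A rough illustration of that 'per query memory limit' idea, building on the same per-thread allocation counter; the limit, exception type, and check placement are all hypothetical assumptions, not an agreed design:

```java
// Hypothetical guard object created at the start of a shard search with the
// thread's allocation counter at that point; checkBudget would be called
// periodically (e.g. per segment or per collected batch) to preempt a query
// that exceeds its memory budget.
final class QueryMemoryGuard {
    private final long limitBytes;
    private final long startAllocatedBytes;

    QueryMemoryGuard(long limitBytes, long startAllocatedBytes) {
        this.limitBytes = limitBytes;
        this.startAllocatedBytes = startAllocatedBytes;
    }

    void checkBudget(long currentAllocatedBytes) {
        long used = currentAllocatedBytes - startAllocatedBytes;
        if (used > limitBytes) {
            throw new IllegalStateException(
                "query exceeded per-query memory limit: used=" + used
                    + " bytes, limit=" + limitBytes + " bytes");
        }
    }
}
```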

@Bukhtawar good point. This approach will not capture the reduce-phase overhead, and I will explore that as a follow-up.

@malpani (Contributor, author) commented Oct 7, 2021

Finally got some time to explore this more; here are some thoughts:

  1. The utility of exposing a top N via a new search_stats section in the /_nodes/stats API, returning the N most expensive queries, is limited: it may not help answer questions like "What queries between October 4 and 5 were most expensive in terms of their memory footprint?", since the N most expensive queries might have run 60 days ago.
  2. Implementing this via the tasks framework provides a hook to track on parent task ids rather than being restricted to isolated shard-level memory tracking (thanks @sohami for the idea). It also allows other actions (not just search, if they choose to) to track memory usage. The existing tasks API already tracks latency, and adding memory consumption could be useful.
  3. On completion of a task, the task info, which will include memory used (for search tasks), can be dumped into a sink. The sink could be configurable: a simple log file or a system index for further analysis (a rough sketch of the sink idea follows this list).
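To show what "configurable sink" could mean in item 3, here is a hypothetical sketch of the two variants mentioned (log file vs. system index); the interface and both implementations are illustrative, not existing classes:

```java
import java.util.logging.Logger;

// Receives the serialized task info (including memory used) when a task completes.
interface TaskStatsSink {
    void accept(String taskInfoJson);
}

// Variant 1: append completed-task info to a log for offline analysis.
final class LogFileSink implements TaskStatsSink {
    private static final Logger LOGGER = Logger.getLogger("task.resource.stats");

    @Override
    public void accept(String taskInfoJson) {
        LOGGER.info(taskInfoJson);
    }
}

// Variant 2: index completed-task info into a system index so it can be
// queried later (e.g. "most expensive queries between October 4 and 5").
final class SystemIndexSink implements TaskStatsSink {
    @Override
    public void accept(String taskInfoJson) {
        // In a real implementation this would submit an index request for the
        // document into a dedicated system index; omitted in this sketch.
    }
}
```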

Projects
Status: Later (6 months plus)