[Doc] Server-side rejection of search requests based on resource #795

hdhalter · 2022-07-11T23:07:04Z

RFC- opensearch-project/OpenSearch#1329
Tracking issue- opensearch-project/OpenSearch#1181
POCs: Prabhu Senthamarai, Suresh N S, Ketan Verma, Pritkumar Ladani

hdhalter · 2022-07-12T21:09:00Z

@JeffH-AWS - Hi Jeff, do you mind taking this one? It involves measuring shard resource consumption when running search related tasks. Thanks.

JeffHuss · 2022-07-12T21:11:57Z

Sure - is there a specific ask/scope defined somewhere? I'm not really sure what is being requested specifically after glancing at the other issues. Seems like it could be super broad.

JeffHuss · 2022-07-13T21:11:15Z

Also relevant: https://opensearch.org/blog/feature/2022/02/shard-indexing-backpressure-in-opensearch/

Naarcha-AWS · 2022-07-14T16:13:47Z

@JeffH-AWS: Scope and ask still TBD. There's active development going on for Task Consumer Integration, however, we'll have to wait till the PR is merged in to start documenting it.

JeffHuss · 2022-07-14T18:50:00Z

I'm going to split this into two issues and work them separately. One item is search back pressure, the other is task consumer integration.

hdhalter · 2022-07-15T19:00:47Z

Sounds good! Feel free to close this as duplicate.

hdhalter · 2022-07-21T15:12:10Z

I'm switching the label from 2.2 to 2.3, as discussed in the project roadmap meeting.

JeffHuss · 2022-08-15T19:22:31Z

Will start looking at this shortly as we're not in the run-up to the 2.3 release.

JeffHuss · 2022-08-16T14:59:34Z

It looks like milestone 2 is supposed to be included in 2.3:

Milestone 2

Goals
- Improved fairness in Search request rejections
- Reducing chances of node getting overwhelmed due to Search request load
- Ability to stabilise overloaded nodes by identifying and cancelling resource guzzling queries.
- Achieving resiliency with reduced dependency on Circuit breaker and Threadpool queue configurations as the accuracy of rejections due to these depends on user input.
2.1 Server Side rejection of in-coming search requests
Currently, search rejections are based solely on the number of tasks queued for the Search thread pool. That doesn't provide fairness in rejections: multiple smaller queries can exhaust this limit even though they are not resource intensive and the node could take in many more requests, and vice versa. Essentially, the count does not reflect the actual work.

Hence, based on the metrics in point 1.1 above, we want to build a framework that can perform more informed rejections based on point-in-time resource utilisation. The new model will make the admission decision for search requests on the node. These admission decisions or rejection limits can have different levels:

Level 1: At this point, the system has detected overload due to search requests and will prioritise which requests to accept. Example: it will accept Fetch requests over Query requests, because for the Fetch phase we have already done some work to reach this point, whereas a Query request is going to be more resource intensive and wastes the least work if rejected. Similar logic can be applied to Force search requests as well.
Level 2: At this point, we'll start rejecting all search requests beyond capacity to prevent any impact on the availability of the node.
This can be further evolved to support a shard-level priority model, where the user can set a priority on an index or on every request, so that the framework can use it when taking admission/rejection decisions.
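
For illustration only, the two admission levels above could be sketched roughly as follows. This is not taken from the RFC or from OpenSearch code: the `admit` function, the `Phase` enum, the single `utilisation` number, and both threshold values are assumptions made up for the example.

```python
from enum import Enum


class Phase(Enum):
    QUERY = "query"
    FETCH = "fetch"


# Hypothetical thresholds; the RFC does not specify any values.
LEVEL1_THRESHOLD = 0.80  # overload detected: start prioritising requests
LEVEL2_THRESHOLD = 0.95  # beyond capacity: reject every search request


def admit(phase: Phase, utilisation: float) -> bool:
    """Decide whether to accept a search request.

    `utilisation` is an assumed point-in-time resource utilisation
    for the node, expressed as a fraction between 0.0 and 1.0.
    """
    if utilisation >= LEVEL2_THRESHOLD:
        # Level 2: protect node availability by rejecting everything.
        return False
    if utilisation >= LEVEL1_THRESHOLD:
        # Level 1: prefer Fetch, since work was already spent to get there;
        # rejecting a Query request wastes the least work.
        return phase is Phase.FETCH
    # No overload detected: accept all requests.
    return True
```

With these made-up thresholds, `admit(Phase.QUERY, 0.85)` is rejected while `admit(Phase.FETCH, 0.85)` is accepted, which is the Fetch-over-Query preference described for Level 1.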

If the user has configured partial results to be true, then these rejections, combined with the Coordinator's inability to retry the request on another shard on a different node, might result in the user getting a partial response.

The above will provide the required isolation of accounting and fairness in the rejections, which is currently missing. This is still a reactive back-pressure mechanism, as it focuses only on current consumption and does not estimate the future work to be done for these search requests.

2.2 Server side Cancellation of in-flight search requests based on resource consumption
This is the 3rd level, which kicks in after we're cancelling all search requests coming to a node. Here, we decide to cancel ongoing requests if the resource usage for that shard/node has started breaching the assigned limits (point 2.1) and no recovery is seen for a certain time threshold. The back-pressure model should support identifying the queries that are most resource guzzling, with minimal wasteful work. These can then be cancelled so that a node under load recovers and continues doing useful work.
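
As a rough sketch of the cancellation step described above, and not the RFC's or OpenSearch's actual algorithm: the `SearchTask` type, the heap-based accounting, the grace period, and the "largest consumer first" ordering are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchTask:
    task_id: str
    heap_bytes: int  # hypothetical: heap currently attributed to this task


def select_tasks_to_cancel(
    tasks: List[SearchTask],
    heap_limit_bytes: int,
    breach_seconds: float,
    grace_seconds: float,
) -> List[str]:
    """Pick in-flight tasks to cancel once a breach has persisted.

    Returns task ids, largest consumer first, until the projected
    total usage is back under the assumed limit.
    """
    # Only act when no recovery was seen for the whole grace period.
    if breach_seconds < grace_seconds:
        return []

    total = sum(t.heap_bytes for t in tasks)
    victims: List[str] = []
    # Cancelling the heaviest tasks first keeps the number of
    # cancelled requests (and so the discarded work) small.
    for task in sorted(tasks, key=lambda t: t.heap_bytes, reverse=True):
        if total <= heap_limit_bytes:
            break
        victims.append(task.task_id)
        total -= task.heap_bytes
    return victims
```

A real implementation would presumably also weigh how much work each task has already completed; this sketch ignores that and ranks by current consumption only.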

[rramachand21](https://github.com/rramachand21) commented [26 days ago](https://github.com/opensearch-project/OpenSearch/issues/1329#issuecomment-1191618947)
This (milestone 2) will come in 2.3 - we are merging in the changes for resource tracking framework in 2.2 (milestone 1)

JeffHuss · 2022-08-17T20:33:34Z

Still trying to get a hold of @rramachand21 for details about milestone 2 from the meta/epic issue.

JeffHuss · 2022-08-25T14:50:33Z

I still have not received any information from the devs and there hasn't been a response on the feature issue.

hdhalter added enhancement New feature or request untriaged v2.2.0 and removed enhancement New feature or request labels Jul 11, 2022

hdhalter added this to the v2.2 milestone Jul 11, 2022

hdhalter assigned JeffHuss Jul 12, 2022

JeffHuss changed the title ~~[Doc] Improve resliency in memory management~~ [Doc] Improve resiliency in memory management - back pressure in search path Jul 19, 2022

JeffHuss added the 1 - Backlog Issue: The issue is unassigned or assigned but not started label Jul 20, 2022

hdhalter added v2.3.0 and removed v2.2.0 labels Jul 21, 2022

Naarcha-AWS removed this from the v2.2 milestone Jul 21, 2022

Naarcha-AWS removed the untriaged label Jul 21, 2022

hdhalter added this to the v 3.0 milestone Aug 1, 2022

JeffHuss added xx-documentation Improvements or additions to documentation feedback needed Needs SME Waiting on input from subject matter expert labels Aug 18, 2022

Naarcha-AWS modified the milestones: v 3.0, v2.3 Aug 22, 2022

JeffHuss mentioned this issue Aug 25, 2022

[RFC] Backpressure in Search Path opensearch-project/OpenSearch#1329

Open

hdhalter removed the xx-documentation Improvements or additions to documentation label Aug 29, 2022

Naarcha-AWS added v2.4.0 'Issues and PRs related to version v2.4.0' and removed v2.3.0 labels Sep 13, 2022

Naarcha-AWS modified the milestones: v2.3, v2.4 Sep 13, 2022

Naarcha-AWS removed Needs SME Waiting on input from subject matter expert feedback needed labels Sep 27, 2022

Naarcha-AWS assigned kolchfa-aws and unassigned JeffHuss Sep 27, 2022

Naarcha-AWS changed the title ~~[Doc] Improve resiliency in memory management - back pressure in search path~~ [Doc] Server-side cancellation of search requests based on resource Sep 27, 2022

Naarcha-AWS changed the title ~~[Doc] Server-side cancellation of search requests based on resource~~ [Doc] Server-side rejection of search requests based on resource Sep 27, 2022

kolchfa-aws mentioned this issue Nov 2, 2022

Adds search backpressure documentation #1790

Merged

hdhalter added 2 - In progress Issue/PR: The issue or PR is in progress. and removed 1 - Backlog Issue: The issue is unassigned or assigned but not started labels Nov 3, 2022

kolchfa-aws closed this as completed in #1790 Nov 10, 2022
