-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add profiling and documentation for dfs phase #90536
Conversation
Documentation preview: |
Pinging @elastic/es-search (Team:Search) |
Hi @jdconrad, I've created a changelog YAML for you. |
As an example, let's first setup an index with multiple shards and index | ||
a pair of documents with different values on a keyword field. | ||
|
||
[source,console] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do [source,console,id=profile_dfs]
this'll name the yaml test it makes and make an invisible but linkable anchor tag in the docs. I've been naming everything lately because it makes the error messages when the test fails nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
... | ||
} | ||
-------------------------------------------------- | ||
// NOTCONSOLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make this [source,console-result]
with a bit of hacking to replace the ...
. Something like
// TESTRESPONSE[s/: \{\.\.\.\}/: $body.$_path/]
// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]
might do it, depending on how much you are willing to fight with the test generator. The advantage of doing this is that it'll fail if the names of the results change so it'll force us to keep at least the json keys up to date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nik9000 Thank you for walking me through and creating the appropriate test responses with me!
[[profiling-knn]] | ||
===== Profiling kNN | ||
|
||
A k-nearest neighbor (kNN) search runs as part of the dfs phase. To |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe should link to <<approximate-knn>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
@@ -1070,7 +1281,7 @@ have a drastic effect compared to other components in the profiled query. | |||
responses on the coordinating node, or additional work such as building global | |||
ordinals (an internal data structure used to speed up search). | |||
- Profiling statistics are currently not available for suggestions, | |||
highlighting, `dfs_query_then_fetch`. | |||
highlighting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove highlighting
as well? Profiling is available for highlighting as part of the fetch phase work I did a while back. I just never noticed this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed!
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/370_profile.yml
Show resolved
Hide resolved
- is_true: profile.shards.0.dfs.statistics.breakdown | ||
- match: { profile.shards.1.dfs.statistics.type: "statistics" } | ||
- match: { profile.shards.1.dfs.statistics.description: "collect term statistics" } | ||
- gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think asserting that both shards have time > 0 relies on the hashing of the _ids landing on different shards, right? Maybe worth a comment, but ok, I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this cause test flakes though, in cases where all docs happen to be hashed to the same shard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is which shards the docs are on shouldn't matter as we still need to hit each shard to collect dfs information, so each shard will have a dfs profile even if no info is ultimately aggregated. (Am I missing some caching or shard skipping in this case?)
I have updated the test to change/add the following:
- changed gt to gte for time_in_nanos
- updated the test to add a no documents case where we still expect to get dfs profile info
- updated the test to add a single document case where we still expect to get dfs profile info
- updated the test to add a three document case where two share the same keyword and we still expect to get dfs profile info
- updated the test to ensure we get no profile info if search type is
query_then_fetch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this makes sense. For me maybe we don't need both tests (3) and (4) since having the same keyword shouldn't affect shard routing, or whether we use DFS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed 3 and 4 since they are extraneous as you mentioned.
@@ -356,6 +356,8 @@ | |||
|
|||
opens org.elasticsearch.common.logging to org.apache.logging.log4j.core; | |||
|
|||
exports org.elasticsearch.search.profile.dfs; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this bit for? Should this be in alphabetic order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why spotless didn't pick this up. I moved it to the appropriate location and it mimics the behavior of the rest of the of profile package to allow for other modules to use these classes.
server/src/main/java/org/elasticsearch/search/profile/SearchProfileDfsPhaseResult.java
Show resolved
Hide resolved
QueryProfileShardResult queryProfileShardResult = profiledVectorQuery | ||
? new QueryProfileShardResult(queryProfiler.getTree(), queryProfiler.getRewriteTime(), queryProfiler.getCollector()) | ||
: null; | ||
return new SearchProfileDfsPhaseResult(dfsProfileResult, queryProfileShardResult); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to use the "emptiness" of some object to infer this? Like, if the breakdown map is empty it's all the statistics? I don't know that that's a perfect thing, but it might make this easier for someone to read in nine months.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on this being a bit confusing. I'll give it some more thought on a better way to work with this existing. It's a bit tricky because of what things are available at what times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 it'd be nice to avoid these mutable booleans like profiledVectorQuery
if possible (but I haven't looked into it deeply how we could do that!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clean this up a bit. If the dfs phase runs, we always attempt to collect term stats, so there's no reason to check anything. For knn, we have to set the collector on the profiler, so I track whether or not we've set the collector in DfsProfiler
to determine if we should have a knn profile section.
- is_true: profile.shards.0.dfs.statistics.breakdown | ||
- match: { profile.shards.1.dfs.statistics.type: "statistics" } | ||
- match: { profile.shards.1.dfs.statistics.description: "collect term statistics" } | ||
- gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this cause test flakes though, in cases where all docs happen to be hashed to the same shard?
QueryProfileShardResult queryProfileShardResult = profiledVectorQuery | ||
? new QueryProfileShardResult(queryProfiler.getTree(), queryProfiler.getRewriteTime(), queryProfiler.getCollector()) | ||
: null; | ||
return new SearchProfileDfsPhaseResult(dfsProfileResult, queryProfileShardResult); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 it'd be nice to avoid these mutable booleans like profiledVectorQuery
if possible (but I haven't looked into it deeply how we could do that!)
@nik9000 @jtibshirani Thank you both for the review. I will work on addressing all of the feedback. |
@nik9000 @jtibshirani I think I have addressed all of the given feedback, so this is ready for another round of review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me too, I just left some tiny last comments.
the dfs phase. | ||
|
||
The following is an example of setting `profile` to `true` on a search | ||
that has a knn section: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny comment, we usually style the names of API parameters as literals ("knn
section").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
the of timings for <<query-section, query>>, <<rewrite-section, rewrite>>, | ||
and <<collectors-section, collector>>. Unlike many other queries, kNN | ||
search does the bulk of the work during the query rewrite. This means | ||
rewrite_time represents the time spent on kNN search. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here, we should say rewrite_time
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -1069,8 +1286,7 @@ have a drastic effect compared to other components in the profiled query. | |||
- Profiling also does not account for time spent in the queue, merging shard | |||
responses on the coordinating node, or additional work such as building global | |||
ordinals (an internal data structure used to speed up search). | |||
- Profiling statistics are currently not available for suggestions, | |||
highlighting, `dfs_query_then_fetch`. | |||
- Profiling statistics are currently not available for suggestions. | |||
- Profiling of the reduce phase of aggregation is currently not available. | |||
- The Profiler is instrumenting internals that can change from version to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this existing comment and glad we have it -- it will let us tweak the DFS output format if needed!
- is_true: profile.shards.0.dfs.statistics.breakdown | ||
- match: { profile.shards.1.dfs.statistics.type: "statistics" } | ||
- match: { profile.shards.1.dfs.statistics.description: "collect term statistics" } | ||
- gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this makes sense. For me maybe we don't need both tests (3) and (4) since having the same keyword shouldn't affect shard routing, or whether we use DFS?
@elasticmachine retest this please |
* main: (150 commits) Remove ToXContent interface from ChunkedToXContent (elastic#90409) Remove extra SearchService constructor (elastic#90733) Update min version for the diagnosis yaml test (elastic#90731) Use the AggTestConfig object in testCase (elastic#90699) [DOCS] Add links to clear trained model deployment cache API (elastic#90727) Assert wildcards are not expanded as specified by request options (elastic#90641) [TEST] Fix exit snapshot restore exit condition (elastic#90696) [TEST] Change to atomic file contents save (elastic#90695) Update forbiddenapis to 3.4 (elastic#90624) [Tests] Don't use concurrent search in scripted field type tests (elastic#90712) [ML] Move scaling is possible check for starting trained model (elastic#90706) Add new base test case for chunked xcontent types (elastic#90707) Fix testRedNoBlockedIndicesAndRedAllRoleNodes (elastic#90671) Fix nullpointer in docs test setup (elastic#90660) Don't produce build logs artifact when in a composite build Fixing a race condition in EnrichCoordinatorProxyAction that can leave an item stuck in its queue (elastic#90688) docs: update fleet/agent pipeline docs (elastic#90659) [HealthAPI] Use plural consistently in resource types (elastic#90682) [Testing] Enable bwc and fix sorting for 500_date_range (elastic#90681) Add profiling and documentation for dfs phase (elastic#90536) ... # Conflicts: # x-pack/plugin/mapper-aggregate-metric/src/test/java/org/elasticsearch/xpack/aggregatemetric/mapper/AggregateDoubleMetricFieldMapperTests.java
Introduced in: #90536 Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between. The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
Introduced in: elastic#90536 Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between. The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
Introduced in: #90536 Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between. The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
This change adds profiling statistics for the dfs phase that look like the following:
This change also adds documentation for both the above dfs phase profiling and kNN profiling.
Closes #89713