Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add profiling and documentation for dfs phase #90536

Merged
merged 17 commits into from
Oct 5, 2022

Conversation

jdconrad
Copy link
Contributor

This change adds profiling statistics for the dfs phase that look like the following:

{
    ...
    "dfs" : {
        "statistics" : {
            "type" : "statistics",
            "description" : "collect term statistics",
            "time_in_nanos" : 236955,
            "breakdown" : {
                "term_statistics" : 4815,
                "collection_statistics" : 27081,
                "collection_statistics_count" : 1,
                "create_weight" : 153278,
                "term_statistics_count" : 1,
                "rewrite_count" : 0,
                "create_weight_count" : 1,
                "rewrite" : 0
            }
        }
    }
    ...
}

This change also adds documentation for both the above dfs phase profiling and kNN profiling.

Closes #89713

@jdconrad jdconrad added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.6.0 labels Sep 29, 2022
@github-actions
Copy link
Contributor

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Sep 29, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @jdconrad, I've created a changelog YAML for you.

As an example, let's first setup an index with multiple shards and index
a pair of documents with different values on a keyword field.

[source,console]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do [source,console,id=profile_dfs] this'll name the yaml test it makes and make an invisible but linkable anchor tag in the docs. I've been naming everything lately because it makes the error messages when the test fails nicer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

...
}
--------------------------------------------------
// NOTCONSOLE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make this [source,console-result] with a bit of hacking to replace the .... Something like

// TESTRESPONSE[s/: \{\.\.\.\}/: $body.$_path/]
// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]

might do it, depending on how much you are willing to fight with the test generator. The advantage of doing this is that it'll fail if the names of the results change so it'll force us to keep at least the json keys up to date.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nik9000 Thank you for walking me through and creating the appropriate test responses with me!

[[profiling-knn]]
===== Profiling kNN

A k-nearest neighbor (kNN) search runs as part of the dfs phase. To
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should link to <<approximate-knn>>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@@ -1070,7 +1281,7 @@ have a drastic effect compared to other components in the profiled query.
responses on the coordinating node, or additional work such as building global
ordinals (an internal data structure used to speed up search).
- Profiling statistics are currently not available for suggestions,
highlighting, `dfs_query_then_fetch`.
highlighting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove highlighting as well? Profiling is available for highlighting as part of the fetch phase work I did a while back. I just never noticed this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

- is_true: profile.shards.0.dfs.statistics.breakdown
- match: { profile.shards.1.dfs.statistics.type: "statistics" }
- match: { profile.shards.1.dfs.statistics.description: "collect term statistics" }
- gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think asserting that both shards have time > 0 relies on the hashing of the _ids landing on different shards, right? Maybe worth a comment, but ok, I guess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this cause test flakes though, in cases where all docs happen to be hashed to the same shard?

Copy link
Contributor Author

@jdconrad jdconrad Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is which shards the docs are on shouldn't matter as we still need to hit each shard to collect dfs information, so each shard will have a dfs profile even if no info is ultimately aggregated. (Am I missing some caching or shard skipping in this case?)

I have updated the test to change/add the following:

  1. changed gt to gte for time_in_nanos
  2. updated the test to add a no documents case where we still expect to get dfs profile info
  3. updated the test to add a single document case where we still expect to get dfs profile info
  4. updated the test to add a three document case where two share the same keyword and we still expect to get dfs profile info
  5. updated the test to ensure we get no profile info if search type is query_then_fetch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this makes sense. For me maybe we don't need both tests (3) and (4) since having the same keyword shouldn't affect shard routing, or whether we use DFS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed 3 and 4 since they are extraneous as you mentioned.

@@ -356,6 +356,8 @@

opens org.elasticsearch.common.logging to org.apache.logging.log4j.core;

exports org.elasticsearch.search.profile.dfs;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this bit for? Should this be in alphabetic order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why spotless didn't pick this up. I moved it to the appropriate location and it mimics the behavior of the rest of the of profile package to allow for other modules to use these classes.

QueryProfileShardResult queryProfileShardResult = profiledVectorQuery
? new QueryProfileShardResult(queryProfiler.getTree(), queryProfiler.getRewriteTime(), queryProfiler.getCollector())
: null;
return new SearchProfileDfsPhaseResult(dfsProfileResult, queryProfileShardResult);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to use the "emptiness" of some object to infer this? Like, if the breakdown map is empty it's all the statistics? I don't know that that's a perfect thing, but it might make this easier for someone to read in nine months.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on this being a bit confusing. I'll give it some more thought on a better way to work with this existing. It's a bit tricky because of what things are available at what times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 it'd be nice to avoid these mutable booleans like profiledVectorQuery if possible (but I haven't looked into it deeply how we could do that!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clean this up a bit. If the dfs phase runs, we always attempt to collect term stats, so there's no reason to check anything. For knn, we have to set the collector on the profiler, so I track whether or not we've set the collector in DfsProfiler to determine if we should have a knn profile section.

- is_true: profile.shards.0.dfs.statistics.breakdown
- match: { profile.shards.1.dfs.statistics.type: "statistics" }
- match: { profile.shards.1.dfs.statistics.description: "collect term statistics" }
- gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this cause test flakes though, in cases where all docs happen to be hashed to the same shard?

docs/reference/search/profile.asciidoc Outdated Show resolved Hide resolved
docs/reference/search/profile.asciidoc Outdated Show resolved Hide resolved
docs/reference/search/profile.asciidoc Outdated Show resolved Hide resolved
docs/reference/search/profile.asciidoc Outdated Show resolved Hide resolved
docs/reference/search/profile.asciidoc Outdated Show resolved Hide resolved
docs/reference/search/profile.asciidoc Outdated Show resolved Hide resolved
QueryProfileShardResult queryProfileShardResult = profiledVectorQuery
? new QueryProfileShardResult(queryProfiler.getTree(), queryProfiler.getRewriteTime(), queryProfiler.getCollector())
: null;
return new SearchProfileDfsPhaseResult(dfsProfileResult, queryProfileShardResult);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 it'd be nice to avoid these mutable booleans like profiledVectorQuery if possible (but I haven't looked into it deeply how we could do that!)

@jdconrad
Copy link
Contributor Author

jdconrad commented Oct 3, 2022

@nik9000 @jtibshirani Thank you both for the review. I will work on addressing all of the feedback.

@jdconrad
Copy link
Contributor Author

jdconrad commented Oct 4, 2022

@nik9000 @jtibshirani I think I have addressed all of the given feedback, so this is ready for another round of review.

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me too, I just left some tiny last comments.

the dfs phase.

The following is an example of setting `profile` to `true` on a search
that has a knn section:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny comment, we usually style the names of API parameters as literals ("knn section").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

the of timings for <<query-section, query>>, <<rewrite-section, rewrite>>,
and <<collectors-section, collector>>. Unlike many other queries, kNN
search does the bulk of the work during the query rewrite. This means
rewrite_time represents the time spent on kNN search.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here, we should say rewrite_time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -1069,8 +1286,7 @@ have a drastic effect compared to other components in the profiled query.
- Profiling also does not account for time spent in the queue, merging shard
responses on the coordinating node, or additional work such as building global
ordinals (an internal data structure used to speed up search).
- Profiling statistics are currently not available for suggestions,
highlighting, `dfs_query_then_fetch`.
- Profiling statistics are currently not available for suggestions.
- Profiling of the reduce phase of aggregation is currently not available.
- The Profiler is instrumenting internals that can change from version to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this existing comment and glad we have it -- it will let us tweak the DFS output format if needed!

- is_true: profile.shards.0.dfs.statistics.breakdown
- match: { profile.shards.1.dfs.statistics.type: "statistics" }
- match: { profile.shards.1.dfs.statistics.description: "collect term statistics" }
- gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this makes sense. For me maybe we don't need both tests (3) and (4) since having the same keyword shouldn't affect shard routing, or whether we use DFS?

@mark-vieira
Copy link
Contributor

@elasticmachine retest this please

@jdconrad jdconrad merged commit 8b0d071 into elastic:main Oct 5, 2022
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 10, 2022
* main: (150 commits)
  Remove ToXContent interface from ChunkedToXContent (elastic#90409)
  Remove extra SearchService constructor (elastic#90733)
  Update min version for the diagnosis yaml test (elastic#90731)
  Use the AggTestConfig object in testCase (elastic#90699)
  [DOCS] Add links to clear trained model deployment cache API (elastic#90727)
  Assert wildcards are not expanded as specified by request options  (elastic#90641)
  [TEST] Fix exit snapshot restore exit condition (elastic#90696)
  [TEST] Change to atomic file contents save (elastic#90695)
  Update forbiddenapis to 3.4 (elastic#90624)
  [Tests] Don't use concurrent search in scripted field type tests (elastic#90712)
  [ML] Move scaling is possible check for starting trained model (elastic#90706)
  Add new base test case for chunked xcontent types  (elastic#90707)
  Fix testRedNoBlockedIndicesAndRedAllRoleNodes (elastic#90671)
  Fix nullpointer in docs test setup (elastic#90660)
  Don't produce build logs artifact when in a composite build
  Fixing a race condition in EnrichCoordinatorProxyAction that can leave an item stuck in its queue (elastic#90688)
  docs: update fleet/agent pipeline docs (elastic#90659)
  [HealthAPI] Use plural consistently in resource types (elastic#90682)
  [Testing] Enable bwc and fix sorting for 500_date_range (elastic#90681)
  Add profiling and documentation for dfs phase (elastic#90536)
  ...

# Conflicts:
#	x-pack/plugin/mapper-aggregate-metric/src/test/java/org/elasticsearch/xpack/aggregatemetric/mapper/AggregateDoubleMetricFieldMapperTests.java
benwtrent added a commit that referenced this pull request Jan 3, 2023
Introduced in: #90536

Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between.

The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jan 3, 2023
Introduced in: elastic#90536

Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between.

The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
elasticsearchmachine pushed a commit that referenced this pull request Jan 3, 2023
Introduced in: #90536

Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between.

The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.6.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include ANN in search profile output
5 participants