Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tracing Instrumentation] Add instrumentation at deep indexing level #8555

Open
Tracked by #12082
Gaganjuneja opened this issue Jul 9, 2023 · 16 comments · Fixed by #10273
Open
Tracked by #12082

[Tracing Instrumentation] Add instrumentation at deep indexing level #8555

Gaganjuneja opened this issue Jul 9, 2023 · 16 comments · Fixed by #10273
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing Search Search query, autocomplete ...etc v2.12.0 Issues and PRs related to version 2.12.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@Gaganjuneja
Copy link
Contributor

Gaganjuneja commented Jul 9, 2023

Is your feature request related to a problem? Please describe.
Now the distributed tracing framework #7543 is available. We need to add the tracing instrumentation at several places to effectively monitoring the overall OpenSearch. Creating this task to add instrumentation at deep indexing path like coordinator, shard, segments, lucene calls, etc.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@Gaganjuneja Gaganjuneja added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 9, 2023
@minalsha minalsha added the Search Search query, autocomplete ...etc label Jul 14, 2023
@macohen macohen added Indexing Indexing, Bulk Indexing and anything related to indexing and removed untriaged labels Jul 17, 2023
@macohen
Copy link
Contributor

macohen commented Jul 17, 2023

@Gaganjuneja thanks for opening this issue! can you please add more information to your request about what metrics you want to capture? What outcome are you looking for? What analytics do you want? What will you do with this data?

@Gaganjuneja
Copy link
Contributor Author

@Gaganjuneja thanks for opening this issue! can you please add more information to your request about what metrics you want to capture? What outcome are you looking for? What analytics do you want? What will you do with this data?

Thanks @macohen for your reply, I have created a Meta task #8553 for instrumentations in OpenSearch code paths. I have updated the description. Contributor can add further details while picking it up for implementation.

@sarthakaggarwal97
Copy link
Contributor

@Gaganjuneja
I was looking at tracing core indexing flows like document parsing, updates, version / id look ups, shard permits (locks) and other essential parts of Indexing. Most of these operations are at present happening in the IndexShard (by extension InternalEngine class)

If I want to introduce a tracer in the IndexShard class, I could think of one way where I instantiate IndicesServices in Node.java with Tracer, then IndexModule, then IndexService and then IndexShard. The concern with this approach is the breaking of dependency with other plugins or classes and backward compatibility. Also, this might lead to some volumetric changes in terms of tests and refactoring of constructors.
Additionally I checked, we are not passing a listener in core functions like applyIndexOperation where TraceableActionListener might have helped.

I would like to check with you if there is a better or different approach that we can follow to accomplish this?

cc: @mgodwan

@Gaganjuneja
Copy link
Contributor Author

@sarthakaggarwal97 Unfortunately this is the only way for now. Can you please list down the couple of plugins going to break.

@reta , This is going to be a problem everywhere. Should we think about exposing the tracer instance in some static way?

@Gaganjuneja
Copy link
Contributor Author

Other option is that we address these issues case by case, If the implementations are isolated and all will be instrumenting it in different ways then we can overload the constructors with NoopTracer as default value to avoid the volumetric changes but we need to make sure that it's not introducing any uncertainties about instrumentation behaviour.

@reta
Copy link
Collaborator

reta commented Oct 2, 2023

@sarthakaggarwal97 the way we are trying to add instrumentation with @Gaganjuneja is from the top to bottom: basically, we are starting from network layer (rest actions and transport actions) and going deeper with thread pools and tasks, those are very key to get done first before going deep on any other flow.

IndexShard has ~25 dependencies right now, I think it is a bit premature to think about adding yet another one (like tracer) at this moment, we may have clear understanding if that is needed once we get core pieces instrumented.

@mgodwan
Copy link
Member

mgodwan commented Oct 6, 2023

@reta @Gaganjuneja
Instrumenting flows wihtin IndexShard will provide benefits to get more details about a lot of critical indexing/search paths (e.g. document parsing, indexing, gets during updates, refreshes, translog wait etc ) and should prove beneficial to gain more insights about workloads and OpenSearch performance characteristics.

While profiling transport layer, thread pool, etc provide a good starting point, it will help if we can start instrumenting the individual pieces through tracing/metrics and collect data leveraging the current state of tracing framework.

The points of creation of IndexShard seem limited to IndexService and tests, and hence based on the look through the code, it may not create a lot of churn if we decide to change the framework behavior in future as well.

@reta
Copy link
Collaborator

reta commented Oct 6, 2023

Instrumenting flows wihtin IndexShard will provide benefits to get more details about a lot of critical indexing/search paths (e.g. document parsing, indexing, gets during updates, refreshes, translog wait etc ) and should prove beneficial to gain more insights about workloads and OpenSearch performance characteristics.

@mgodwan The indexing is as important as all other flows (search, aggregations, ...). The goal of tracing is to have a complete flow (trace) from the moment the request entered the system till the moment it leaves it. This is why we are instrumenting from the bottom (network) to top. Just creating out of the blue trace spans is not the way to approach the problem - please use logs for that, they fit perfectly the purpose.

@mgodwan
Copy link
Member

mgodwan commented Oct 9, 2023

Just creating out of the blue trace spans is not the way to approach the problem

This is not what I'm suggesting 😄. The general idea with instrumentation, as you suggested, is to get an overview of the journey of a request itself within the system, not just across nodes and threads, but also include spans/metrics which can reveal insights about the general behaviour/performance of a specific code path which may not be possible unless we include custom spans within the parent spans.

From what I understand, given with the network traces already available and thread context propagation of spans in place, we have the ability to use these spans as the parent, and create child spans for specific code paths as needed.

please use logs for that, they fit perfectly the purpose

Traces and metrics provides a mechanism to actually monitor these easily. It will be good to be able to use the framework if you feel it can be used.

@backslasht
Copy link
Contributor

I agree with @mgodwan, it is really hard to debug the problems in specific code paths (indexing, search, aggregations, etc.) with just the logs. While I see the merit with what @reta has proposed on bottom up approach, it does help to start on specific code paths (indexing in this case) in parallel given that we have the base covered now.

Having tracing on specific code paths does help in debugging deep indexing issues which otherwise is very hard to debug especially while doing postmortem analysis.

@reta
Copy link
Collaborator

reta commented Oct 11, 2023

From what I understand, given with the network traces already available and thread context propagation of spans in place, we have the ability to use these spans as the parent, and create child spans for specific code paths as needed.

Thanks @mgodwan , some paths are instrumented (still issues but we are fixing them), some are not, the relevant example is #10291 (we need to have it done for local transport).

While I see the merit with what @reta has proposed on bottom up approach, it does help to start on specific code paths (indexing in this case) in parallel given that we have the base covered now.

@backslasht it does not help, I sadly failed to convey the message why tracing has to be done this way, but I would like to point out to the formal definition here [1]. Without prior instrumentation this implementation is no go:

  • it does not respect the configured sampling
  • it does have no context why it has happened at all
  • it has not correlation with any other related flows that might have happened in parallel

To conclude: the indexing path should and will be instrumented, however it depends other on ongoing instrumentation efforts. You could wait for them to happen or you could help to make them happen faster.

[1] https://opentelemetry.io/docs/specs/otel/overview/#traces

@rayshrey
Copy link
Contributor

Just creating out of the blue trace spans is not the way to approach the problem

@reta We are not trying to create spans without any context.
A reference to the approach we are talking about can be seen here - #10273

In the images attached in the above PR you can see that the spans have proper parents and child and have context with reference to the timelines of the request (when did the primary write start, when did the replica write start etc) along with useful attributes such as the index name, shard id etc.

I hope it helps in clearing what @backslasht and @mgodwan have been trying to convey all along.

@reta
Copy link
Collaborator

reta commented Oct 16, 2023

I hope it helps in clearing what @backslasht and @mgodwan have been trying to convey all along.

@rayshrey for sure we getting there, the instrumentation for transport action was merged a few weeks ago (this is why we needed to have that done), so the context propagation now works. However we still need to cover local transport I believe, I will double check and comment on the pull request.

@reta
Copy link
Collaborator

reta commented Oct 17, 2023

@rayshrey we should be all set now!

@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.12.0 Issues and PRs related to version 2.12.0 labels Oct 17, 2023
@sarthakaggarwal97
Copy link
Contributor

Keeping this issue open for now. We can create sub-issues while we instrument different deep indexing paths and tag it here.

@kiranprakash154
Copy link
Contributor

Hi, are we on track for this to be released in 2.12 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing Search Search query, autocomplete ...etc v2.12.0 Issues and PRs related to version 2.12.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: 🏗 In progress
Development

Successfully merging a pull request may close this issue.

9 participants