[Internal] Designing and adding capabilities to support Distributed Tracing #3459
Closed
17 of 24 tasks
Labels
Diagnostics
Issues around diagnostics and troubleshooting
documentation
Engineering
engineering improvements (CI, tests, etc.)
Telemetry
Milestone
Purpose of statement
This is not directly related to a problem. The goal is to be able to implement and deliver distributed tracing transactions downstream of the Cosmos DB SDK (.NET/Java) into the compute gateway, routing gateway, and/or the backend services.
Tasks
Engineering Tasks
Stakeholders
Iman Malik
Scope of work
While conforming to the OpenTelemetry Guidelines, Standards and Best Practices, we will build the following capabilities:
The ability to generate a new TraceId, as well as additional intrinsic (essential) and extrinsic context information, that represents a new trace transaction from the SDK client automatically, when one is not provided by the user.
TraceId generated by the Cosmos DB SDK (.NET/Java) MUST be generated by the OpenTelemetry API.
The TraceId will not be generated by the OpenTelemetry API.
The ability to accept a TraceId, as well as additional intrinsic (essential) and extrinsic contextual information, that represents a new or existing trace transaction from the SDK client manually, when one is provided by the user.
TraceId provided by the user MUST be generated by the OpenTelemetry API.
The TraceId will not be generated by the OpenTelemetry API.
The ability to propagate a generated or manually created TraceId, as well as additional intrinsic (essential) and extrinsic contextual information, from the SDK downstream throughout multi-service architectures.
The ability to store and read TraceId in SDK Diagnostics via Trace.AddDatum for supportability and troubleshooting.
The ability to generate SpanId, as well as additional intrinsic (essential) and extrinsic contextual information, that represents a new unit of work, or operation within a transaction, in processing the request. A Trace can consist of more than one Span.
SpanId generated by the Cosmos DB SDK (.NET/Java) MUST be generated by the OpenTelemetry API.
The SpanId will not be generated by the OpenTelemetry API.
The ability to propagate a generated SpanId, as well as additional intrinsic (essential) and extrinsic contextual information, from the SDK downstream throughout multi-service architectures.
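To make the shape of these identifiers concrete, here is a minimal Java sketch of how a TraceId and SpanId could be produced and serialized for downstream propagation. It assumes the W3C Trace Context `traceparent` header format (the default propagation format in OpenTelemetry); the issue itself does not name a header format. The class and method names (`TraceContextSketch`, `resolveTraceId`, `toTraceparent`) are hypothetical and exist only for illustration. The IDs are generated with `SecureRandom` purely to keep the example dependency-free; under the requirement above, a real implementation would obtain them from the OpenTelemetry API instead.

```java
import java.security.SecureRandom;

// Hypothetical helper for illustration only. It is not part of the
// Cosmos DB SDK or the OpenTelemetry API.
final class TraceContextSketch {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Returns byteCount random bytes as lowercase hex. Retries when every
    // byte is zero, because W3C Trace Context treats all-zero IDs as invalid.
    static String randomHexId(int byteCount) {
        byte[] bytes = new byte[byteCount];
        boolean allZero;
        do {
            RANDOM.nextBytes(bytes);
            allZero = true;
            for (byte b : bytes) {
                if (b != 0) {
                    allZero = false;
                    break;
                }
            }
        } while (allZero);
        StringBuilder sb = new StringBuilder(byteCount * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        return sb.toString();
    }

    // A trace ID is 16 bytes (32 hex characters).
    static String newTraceId() {
        return randomHexId(16);
    }

    // A span ID is 8 bytes (16 hex characters).
    static String newSpanId() {
        return randomHexId(8);
    }

    // Reuses a user-supplied trace ID when one is provided,
    // otherwise generates a new one automatically.
    static String resolveTraceId(String userTraceId) {
        return (userTraceId == null || userTraceId.isEmpty()) ? newTraceId() : userTraceId;
    }

    // Serializes the IDs as version-00 traceparent: 00-<trace-id>-<span-id>-<flags>.
    static String toTraceparent(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }
}
```

Calling `TraceContextSketch.toTraceparent(TraceContextSketch.resolveTraceId(null), TraceContextSketch.newSpanId(), true)` yields a header value that a downstream service could read to continue the same trace; passing a non-empty user-supplied ID to `resolveTraceId` keeps that ID unchanged.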
Use cases and scenarios
Please use Gherkin syntax (Given, When and Then)
Scenario #1
Scenario #2
Scenario #3
Scenario #4
Scenario #5
Test Scenarios
Control plane:
make a call that would create a new DB (201 response)
make a call that would create a new DB that already exists (409 conflict)
make a call that would create a new DB that already exists (408 request timeout)
make a call that would delete an existing DB (200 response)
make a call that would delete a non-existing DB (404 response)
Data plane:
Direct: TCP request
make a call that would add an item to a container (201 response)
make a call that would delete an existing item from a container (200 response)
make a call that would get an existing item from a container (200 response)
modify the get request and verify that the request carries the TraceId and SpanId headers
make a call that would delete/get a non-existing record in a container (404 response)
Gateway: HTTP request
make a call that would add an item to a container (201 response)
make a call that would delete an existing item from a container (200 response)
make a call that would get an existing item from a container (200 response)
make a call that would delete/get a non-existing record in a container (404 response)
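The header verification step in the scenarios above could be sketched roughly as follows. This is only an illustration under stated assumptions: the captured request headers are available as a `Map<String, String>` with lowercase keys, and the TraceId and SpanId travel together in a W3C Trace Context `traceparent` header. The issue names "TraceId and SpanId headers" without fixing their wire format, so both the header name and the class `TraceHeaderCheckSketch` are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test helper for illustration only. It is not part of the
// Cosmos DB SDK test suite.
final class TraceHeaderCheckSketch {
    // version-traceid-spanid-flags, all lowercase hex.
    private static final Pattern TRACEPARENT =
        Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    private static final String ZERO_TRACE_ID = "00000000000000000000000000000000";
    private static final String ZERO_SPAN_ID = "0000000000000000";

    // Returns true when the captured headers contain a well-formed
    // traceparent value with a non-zero trace ID and a non-zero span ID.
    static boolean hasValidTraceContext(Map<String, String> headers) {
        String value = headers.get("traceparent"); // assumed header name
        if (value == null) {
            return false;
        }
        Matcher m = TRACEPARENT.matcher(value);
        if (!m.matches()) {
            return false;
        }
        // Version "ff" and all-zero IDs are invalid in W3C Trace Context.
        return !m.group(1).equals("ff")
            && !m.group(2).equals(ZERO_TRACE_ID)
            && !m.group(3).equals(ZERO_SPAN_ID);
    }
}
```

A test for the get scenario would capture the outgoing request, pass its headers to `hasValidTraceContext`, and assert that the result is true.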
Where will the work be done
Azure Cosmos SDK Team
Implementation Details
DistributedTracingOptions: currently internal, needs to be public.
That is not the same as the OpenTelemetry API. How should we deal with this if we have to reference and implement TraceId and SpanId? Judging by the code, an implementation for things like DistributedTracing already exists on the client options (by @sourabh1007), so it is not 100% clear why adding OpenTelemetry and DistributedTracing needed to be scoped (@kirankumarkolli, @FabianMeiswinkel) when it seems to exist already. It seems we just need to add TraceId and SpanId to the existing DistributedTracing type and add functionality to autogenerate them both using ActivitySource.Start and Stop, or DiagnosticScope.ActivitySourceStartActivity.
azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/OSS/Azure.Core/DiagnosticScope.cs
Line 567 in 69d5ef4
Repositories
Other resources
Architectural assets
C4 model diagrams
Approach comparison chart
Milestones
TBD
Deliverables
Microsoft.Azure.Cosmos
Microsoft.Azure.Cosmos.Samples
Dependencies
TBD
Schedules
TBD
Standards and Testing
Define Project Success
Project Requirements
TBD
Other
TBD
Closure
TBD
Clarifying Questions
AttributeCountLimit and AttributeValueLengthLimit.
What is the impact of implementing OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
How would we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
When should we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
Where should we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
Why should we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
Who should implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
Meeting Notes
[x] Action Item: Iman will add his feedback to the approach comparison matrix.
[x] Action Item: Iman will start on C4 Level 1 diagram.
[ ] Action Item: Iman will start on C4 Level 2 diagram.
[ ] Action Item: Iman will start on C4 Level 3 diagram.