Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Internal] Designing and adding capabilities to support Distributed Tracing #3459

Closed
17 of 24 tasks
philipthomas-MSFT opened this issue Sep 14, 2022 · 2 comments
Closed
17 of 24 tasks
Assignees
Labels
Diagnostics Issues around diagnostics and troubleshooting documentation Engineering engineering improvements (CI, tests, etc.) Telemetry
Milestone

Comments

@philipthomas-MSFT
Copy link
Contributor

philipthomas-MSFT commented Sep 14, 2022

Purpose of statement

This is not directly related to a problem. The goal is to be able to implement and deliver distributed tracing transactions downstream of the Cosmos DB SDK (.NET/Java) into the compute gateway, routing gateway, and/or the backend services.

Tasks

  • Identify stakeholders
  • Defining scope of work
  • Researching various approaches
  • Exhaustive list of use cases and scenarios
  • Designing and presenting
  • Defining milestones
  • Define MVP
  • Establishing deliverable dates

Engineering Tasks

  • Create design document for Distributed Tracing in .Net SDK
  • POC/Investigations and Implementation for Distributed Tracing in .Net SDK
  • Implementation of Distributed Tracing for Direct Mode and Direct Release
  • Update V3 for enabled/disabled logic for distributed tracing in Direct/Gateway Mode: In Code Review
  • Performance Testing: Done
  • Sampling and DistributedContextPropagator Update: In Progress Assigned To: @sourabh1007 Done

Stakeholders

  • Monitoring and Observability
    • Dipanshu Bhola
    • Jang-uk In
    • Aditya Jain
    • Debarshi Bhowal
  • Microsoft Azure Cosmos DB .NET SDK
    • Kiran Kumar Koli
    • Iman Malik
    • Philip Thomas
    • Sourabh Jain
    • Arooshi Avasthy
  • Microsoft Azure Cosmos DB Java SDK
    • Kushagra Thapar
    • Fabian Meiswinkel
    • Annie Liang
  • Compute
    • Siddhesh Vethe
  • Routing
    • Soham Mehta
  • Backend
    • Arun Khetarpal
  • Azure.Core
    • Liudmila Molkova
    • Josh Love

Scope of work

While conforming to the OpenTelemetry Guidelines, Standards and Best Practices, we will build the following capabilities:

  • The ability to generate a new TraceId, as well as additional intrinsic (essential) and extrinsic context information, that represents a new trace transaction from the SDK client automatically, when one is not provided by the user.

    • TraceId generated by the Cosmos DB SDK (.NET/Java) MUST be generated by the OpenTelemetry API. The TraceId will not be generated by the OpenTelemetry API
  • The ability to create accept a TraceId, as well as additional intrinsic (essential) and extrinsic contextual information, that represents a new or existing trace transaction from the SDK client manually, when one is provided by the user.

    • TraceId provided by the user MUST be generated by the OpenTelemetry API The TraceId will not be generated by the OpenTelemetry API
  • The ability to propagate a generated or manually created TraceId, as well as additional intrinsic (essential) and extrinsic contextual information, from the SDK downstream throughout multi-service architectures.

  • The ability to store and read TraceId in SDK Diagnostics via Trace.AddDatum for supportability and troubleshooting.

  • The ability to generate SpanId, as well as additional intrinsic (essential) and extrinsic contextual information, that represents a new unit of work, or operation within a transaction, in processing the request. A Trace can consist of more than one Span

    • SpanId generated by the Cosmos DB SDK (.NET/Java) MUST be generated by the OpenTelemetry API The SpanId will not be generated by the OpenTelemetry API
    • SpanId is generated for each retry that occurs.
  • The ability to propagate a generated SpanId, as well as additional intrinsic (essential) and extrinsic contextual information, from the SDK downstream throughout multi-service architectures.

Use cases and scenarios

Please use Gherkin syntax (Given, When and Then)

Scenario #1

Given I am a CX that has an application that uses Azure Cosmos SDK
    And the Azure Cosmos SDK client is configured for Direct Mode
    And the CX is not providing a TraceId
When an Azure Cosmos DB SDK operation is performed (Create, Delete, GetChangeFeedIterator, etc.)
Then a TraceId is auto-generated for that application and propagated to the Routing Gateway
    And a SpanId is auto-generated for the Azure Cosmos SDK client
    And a SpanId is auto-generated for the Routing Gateway
    And a SpanId is auto-generated for the Backend

Scenario #2

Given I am a CX that has an application that uses Azure Cosmos SDK
    And the Azure Cosmos SDK client is configured for Gateway Mode
    And the CX is not providing a TraceId
When an Azure Cosmos DB SDK operation is performed (Create, Delete, GetChangeFeedIterator, etc.)
Then a TraceId is auto-generated for that application and propagated to the Routing Gateway
    And a SpanId is auto-generatedfor the Azure Cosmos SDK client
    And a SpanId is auto-generated for the Routing Gateway
    And a SpanId is auto-generated for the Compute Gateway
    And a SpanId is auto-generated for the Backend

Scenario #3

REVISIT: Since TraceId is generated on the application level, do we need to allow CX to provide a TraceId manually?
Given I am a CX that has an application that uses Azure Cosmos SDK
    And the Azure Cosmos SDK client is configured for Direct Mode
    And the CX is not providing a TraceId
When an Azure Cosmos DB SDK operation is performed (Create, Delete, GetChangeFeedIterator, etc.)
Then a TraceId is provided by the CX and passed into that application and propagated to the Routing Gateway
    And a SpanId is auto-generated for the Azure Cosmos SDK client
    And a SpanId is auto-generated for the Routing Gateway
    And a SpanId is auto-generated for the Backend

Scenario #4

REVISIT: Since TraceId is generated on the application level, do we need to allow CX to provide a TraceId manually?
Given I am a CX that wishes would perform any CosmosClient.Container operation (Create, Delete, GetChangeFeedIterator, etc.)
Given I am a CX that has an application that uses Azure Cosmos SDK
    And the Azure Cosmos SDK client is configured for Gateway Mode
    And the CX is not providing a TraceId
When an Azure Cosmos DB SDK operation is performed (Create, Delete, GetChangeFeedIterator, etc.)
Then a TraceId is provided by the CX and passed into that application and propagated to the Routing Gateway
    And a SpanId is auto-generated for the Azure Cosmos SDK client
    And a SpanId is auto-generated for the Routing Gateway
    And a SpanId is auto-generated for the Compute Gateway
    And a SpanId is auto-generated for the Backend

Scenario #5

REVISIT: Since TraceId is generated on the application level, do we need to allow CX to provide a TraceId manually?
Given I am a CX that wishes would perform any CosmosClient.Container operation (Create, Delete, GetChangeFeedIterator, etc.)
Given I am a CX that has an application that uses Azure Cosmos SDK
    And the Azure Cosmos SDK client is configured for Gateway Mode
    And the CX is not providing a TraceId
When an Azure Cosmos DB SDK operation is performed (Create, Delete, GetChangeFeedIterator, etc.)
Then a TraceId is provided by the CX and passed into that application and propagated to the Routing Gateway
    And a SpanId is auto-generated for the Azure Cosmos SDK client
    And a SpanId is auto-generated for the Routing Gateway
    And a SpanId is auto-generated for the Compute Gateway
    And a SpanId is auto-generated for the Backend

Test Scenarios

Control plane:

  • Gateway: HTTP request
    • make a call that would create a new DB(201 response)

      • modifying the db create request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would create a new DB that already exists(409 conflict)

      • modifying the db create request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would create a new DB that already exists(408 request timeout)

      • modifying all retry db create requests and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would delete an existing DB(200 response)

      • modifying the db delete request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would delete an non-existing DB(404 response)

      • modifying the db delete request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?

Data plane:

  • Direct: TCP request

    • make a call that would add item to a container(201 response)

      • modifying the add item request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would delete an existing item from container (200 response)

      • modifying the delete request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would get an existing item from container (200 response)

      • modifying the get request and verify that the request has headers TraceId and SpanId headers

        • how does we test and verify this?
        • do we need to create tests or we already have?
      • make a call that would delete/get an non-existing record in container(404 response)

        • modifying the delete/get request and verify that the request has headers TraceId and SpanId headers
          • how does we test and verify this?
          • do we need to create tests or we already have?
  • Gateway: HTTP request

    • make a call that would add item to a container(201 response)

      • modifying the add item request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would delete an existing item from container (200 response)

      • modifying the delete request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would get an existing item from container (200 response)

      • modifying the get request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?
    • make a call that would delete/get an non-existing record in container(404 response)

      • modifying the delete/get request and verify that the request has headers TraceId and SpanId headers
        • how does we test and verify this?
        • do we need to create tests or we already have?

Where will the work be done

Azure Cosmos SDK Team

Implementation Details

  • SCOPE IS LARGER AND EXTENDS TO ADDING TRACEID, SPANID TO AZURE.CORE
  • DistributedTracingOptions
    • Currently internal, needs to be public
    • Need to add a capability that allows the CX to set the TraceId, as well as any intrinsic/extrinsic contextual information (attributes).
    • Need to add a capability that allows SDK to auto-generate the TraceId for the CX.
    • Need to add a capability that propagates the TraceId to downstream services (Routing, Compute gateways) using a defined request header (support for both Direct and Gateway modes)
    • Need to add a capability that auto-generates the SpanId, as well as any intrinsic/extrinsic contextual information (attributes), for the CX for every downstream service call, including retries.
    • Need to add TraceId to the diagnostic string for support.
  • The Azure Cosmos .NET v3 SDK has a concept of OpenTelemetry that is not the same as the OpenTelemetry API. How should we deal with this if we have to reference and implement TraceId and SpanId? It seems, just by looking at the code, that there is an implementation for things like DistributedTracing already existing on client options by @sourabh1007 so not 100% sure why this was something that needed to be scoped to add OpenTelemetry and DistributedTracing (@kirankumarkolli, @FabianMeiswinkel), when it seems like it already exists. Seems like we just need to include TraceId and SpanId to the existing DistributedTracing type and add functionality to autogenerate them both using ActivitySource.Start and Stop, or DiagnosticScope.ActivitySourceStartActivity.

Repositories

Other resources

Architectural assets

  • Required
    • C4 model diagrams
    • Sequence diagrams
    • Approach comparison chart
  • Optional

Milestones

TBD

Deliverables

  • Microsoft.Azure.Cosmos
  • Microsoft.Azure.Cosmos.Samples

Dependencies

TBD

Schedules

TBD

Standards and Testing

  • OpenTelemetry Client Design Principles
  • Pure or first-class functions (optional due to readability and supportability)
  • Higher-order functions
  • Avoiding primitive obsession
  • Avoiding side-effects
  • Immutability
  • Please take a Test-Driven Development approach
  • Mock and unit testing
  • Integration testing
  • End-to-End testing
  • Performance testing

Define Project Success

  • See Purpose of statement
  • Instrumentation done with OpenTelemetry
  • Exporters using Geneva Exporters
  • Visualization using Jarvis Trace Explorer
  • Impact of change on the SDK at a minimal
  • Reusability across all Cosmos SDKs (optional)
  • Performance impact at a minimal
  • Maintainability
  • Supportability
  • Usability
  • Security
  • Reliability

Project Requirements

TBD

Other

TBD

Closure

TBD

Clarifying Questions

  • What are the restrictions on collecting, storing, and transmitting Trace and Span intrinsic and extrinsic contextual information that can be stored?
    • There are no restrictions enforced by any downstream service teams. However, there are limitation that can be defined by OpenTelemetry. There is more information here on attributes and attribute limits using AttributeCountLimit and AtrributeValueLengthLimit.
  • Other than Jarvis, should this Trace and Span intrinsic and extrinsic contextual information be displayed in other places, like KUSTO?
    • No
  • Can you provide us with most common to least common scenarios?
    • Action Item: Collectively, we need to come up with close to an exhaustive list of scenarios.
  • Does the initial release need to be supported across all SDKs?
    • Cosmos DB .NET SDK for SQL API seems to be the critical SDK at the moment that adds more value and opportunities to learn from to make changes to other SDKs some time after.
  • What is the request header for sending this to downstream services?
  • Should this initially be limited to specific request coming from the SDK? Like Change Feed? Patch Operations?
    • All request should be supported and there is no need to add contextual extrinsic information for type of operation.
  • Does this require public or legacy emulator code changes?
    • Action Item: For testability, there would more than likely need some support at the public emulator level, but I think we need to have more conversations with the right stakeholders before this is determined.
  • Does this require client telemetry changes?
    • Action Item: Need to check with SDK teams to determine if both Trace and Span Ids need to be included in the SDK client telemetry and what value it could potentially bring.
  • Who are the pull request and continuous integration signoffs?
    • Outside of the SDK teams, we will potentially need the following to also be reviewers, but may be optional or willing to have a code review session:
      • Aditya Jain
      • Dipanshu Bhowal
      • Jang-uk In
  • What is the impact for implementing OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
  • How would we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
  • When should we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
  • Where should we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
  • Why should we implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
  • Who should implement OpenTelemetry API in the Cosmos DB .NET SQL API SDK? Java SDK?
  • Since TraceId is generated on the application-level, do we need to allow CX to provide a TraceId manually?
  • Can you explain again where the Trace information that is being consumed by Jarvis is stored?

Meeting Notes

  • Thursday, September 22, 2002
    • Scope of work and clarification questions
    • Action Item: Collectively, we need to come up with close to an exhaustive list of scenarios.
    • Action Item: Backend or gateway teams will work of providing the request header http constant, https://msdata.visualstudio.com/DefaultCollection/CosmosDB/_git/CosmosDB?
    • Action Item: For testability, there would more than likely need some support at the public emulator level, but I think we need to have more conversations with the right stakeholders before this is determined.
    • Action Item: Need to check with SDK teams to determine if both Trace and Span Ids need to be included in the SDK client telemetry and what value it could potentially bring.
  • Thursday, September 29, 2022
    • Scope of work, stakeholders and scenario discussions. For supportability, the need to include TraceId into the diagnostic string. The importance of making sure that scope of work includes use cases that require Compute to make SDK calls, not just calls from the "public" SDK. Talk about where we are with tasks and what are the next steps and expectations.
    • Action Item: Dipanshu will work on including scenario that involve expected behavior for exceptions.
    • [x] Action Item: Iman will add his feedback to the approach comparison matrix
    • [x] Action Item: Iman will start on C4 Level 1 diagram.
    • [ ] Action Item: Iman will start on C4 Level 2 diagram.
    • [ ] Action Item: Iman will start on C4 Level 3 diagram.
    • Action Item: Sequence diagram.
    • Action Item: Philip will add to scope of work, adding TraceId to Diagnostic string
    • Action Item: Dipanshu will provide the names of additional stakeholders
    • Action Item: Dipanshu will work on including scenarios that involve calls into the SDK made by Compute, not just "public" SDK calls.
    • Make updates to scenarios to have TraceId and SpanId starting at application level and not cosmos client creation.
@philipthomas-MSFT philipthomas-MSFT added documentation Engineering engineering improvements (CI, tests, etc.) Diagnostics Issues around diagnostics and troubleshooting Telemetry labels Sep 14, 2022
@philipthomas-MSFT philipthomas-MSFT added this to the Azure Cosmos SDKs milestone Sep 14, 2022
@philipthomas-MSFT philipthomas-MSFT self-assigned this Sep 14, 2022
@philipthomas-MSFT philipthomas-MSFT changed the title [Internal] Designing capabilities to support Distributed Tracing [Internal] Designing and adding capabilities to support Distributed Tracing Sep 14, 2022
@philipthomas-MSFT
Copy link
Contributor Author

Issues with Compute and Routing Gateway build failure.
PR is out, need more reviews.

@aavasthy
Copy link
Contributor

Issue closed with this PR #3801

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Diagnostics Issues around diagnostics and troubleshooting documentation Engineering engineering improvements (CI, tests, etc.) Telemetry
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants