Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Activities for Http Connections, Dns, Sockets and SslStream #103922

Merged
merged 50 commits into from
Jul 13, 2024

Conversation

antonfirsov
Copy link
Member

@antonfirsov antonfirsov commented Jun 24, 2024

Edit: for the final design see:

#103922 (comment)
open-telemetry/semantic-conventions#1192


This is a replacement for #101814 following our agreement that Http Connection sub-operations (Dns, Socket, SslStream) should be represented with their sub-activities instead of emitting events on the Http Connection activity.

The PR implements the following activity-tree across multiple libraries:

System.Net.Http.Connections.HttpConnection
  |- System.Net.NameResolution.DsnLookup
  |- System.Net.Sockets.Connect
  |- System.Net.Security.TlsHandshake

Where System.Net.Http.Connections.HttpConnection doesn't have a parent (has it's own Trace ID), and every request is linked to their connection via activity.AddLink() when the connection is being dispatched to a request.

The 4 new activities have their own ActivitySources which have to be enabled separately. By using OpenTelemetry.Exporter.Console this can be done via wildcards:

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
                .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("ConnAct"))
                .AddSource("System.Net.*")
                .AddConsoleExporter()
                .Build();

A sampe program & output can be found here..

Fixes #93832.

/cc @samsp-msft @noahfalk @lmolkova

@davidfowl
Copy link
Member

A sampe program & output can be found here..

Can I ask you to show an example with the aspire dashboard? https://aspiredashboard.com/

It should be a simple docker command to run, pointing the app at it with the otlp exporter. Looking at console output for traces is quite hard.

# Conflicts:
#	src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3Connection.cs
#	src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionBase.cs
@antonfirsov
Copy link
Member Author

@davidfowl here you go:

image

Where expanding the connection will see:

image

A few notes:

  • The connection span covers the entire connection lifetime until closure, while the subspans cover the networking operations (DNS lookup, TCP connection establishment, TLS handshake)
  • There are no attributes since thy are not yet implemented. We are currently discussing what is reasonable to include.
  • Currently, it is not possible to navigate to connections from requests. For that, Aspire dashboard needs to implement Aspire Dashboard: Traces: Span links aren't shown in span details aspire#2577.
    image
  • With the current design, you will not be able to navigate to failing connections from their initiating requests. For that we would need to add another Activity Link and tag those links with names ("initiated connection" vs "serving connection"). This might raise the concern that we bake too many connection pool implementation details into the telemetry we produce.

@stephentoub
Copy link
Member

stephentoub commented Jun 26, 2024

The connection span covers the entire connection lifetime until closure

Is this going to be useful in a trace? I don't know one way or the other, but it seems like it could result in a lot of noise.

Is there an equivalent span that's only about setting up the connection, from the time it was requested until the time it became usable?

@davidfowl
Copy link
Member

Is this going to be useful in a trace? I don't know one way or the other, but it seems like it could result in a lot of noise.

Is there an equivalent span that's only about setting up the connection, from the time it was requested until the time it became usable?

I agree, we care about how long it took to connect, not how long the connection itself lasts. That feels like it could be another span, but not the most useful one. I think about this as zooming into why a request may have taken so long.

@antonfirsov
Copy link
Member Author

antonfirsov commented Jun 26, 2024

That feels like it could be another span, but not the most useful one.

The proposed connection span matches the semantics of the http.client.request.duration metric.

One solution to have more useful traces could be to decompose it as following:

[http] connection {peer}                 // Covers the entire pooled connection lifetime
  |- [http] connection_initiation {peer} // Covers the connection establishment
      |- DNS {host}
      |- [socket] connection_initiation {peer}
      |- TLS {server_name}

Another is to only introduce connection_initiation, however that feels problematic to link from follow-up requests reusing the same connection.

cc @noahfalk @lmolkova

@antonfirsov antonfirsov marked this pull request as draft June 26, 2024 23:29
@lmolkova
Copy link

lmolkova commented Jun 26, 2024

The overall connection span could be useful to understand:

  • which connection my request was done over (if we can link connection span to request span). alternatives could be:
    • some connection id attribute on the request span
    • just logs
  • could help debugging abandoned response streams and connections that are not returned to the pool (some additional stuff is probably needed to make it obvious)
  • not sure if HTTP pipelining is still a thing, but it makes it easy to detect
  • how connection ended (server closed, error happened, etc), when did it end comparing to the request

Many other useful things are already exposed as metrics (number of active connections, connection duration), so I agree span usefulness is limited.

Regarding the noise: connections should be rare comparing to requests (if there is any significant load). The spans are reported by a different ActivitySource, so users decide if they want them.

Anyway, I don't have a strong opinion on how necessary it is - it can always be added later.

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

@antonfirsov
Copy link
Member Author

@ManickaP addressed:

#103922 (comment)
#103922 (comment)
class renaming from: #103922 (comment)

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last batch of comments. Thank you for your patience!

}
}

static (string?, string?) GetNameAndVersionString(SslProtocols protocol) => protocol switch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SslProtocols is flags, so is this behaving as expected? E.g. returning nothing in case there are more protocols set, or we're sure that at this point there will always be just one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EventSource telemetry does the same after acquiring the enum with GetSslProtocolInternal() so I assumed it should be a single value after handshake completion normally. If not, our telemetry doesn't emit the protocol info.

switch (protocol)
{
#pragma warning disable SYSLIB0039 // TLS 1.0 and 1.1 are obsolete
case SslProtocols.Tls:
protocolSessionsOpen = ref _sessionsOpenTls10;
handshakeDurationCounter = _handshakeDurationTls10Counter;
break;
case SslProtocols.Tls11:
protocolSessionsOpen = ref _sessionsOpenTls11;
handshakeDurationCounter = _handshakeDurationTls11Counter;
break;
#pragma warning restore SYSLIB0039
case SslProtocols.Tls12:
protocolSessionsOpen = ref _sessionsOpenTls12;
handshakeDurationCounter = _handshakeDurationTls12Counter;
break;
case SslProtocols.Tls13:
protocolSessionsOpen = ref _sessionsOpenTls13;
handshakeDurationCounter = _handshakeDurationTls13Counter;
break;
}

@@ -15,6 +15,11 @@ namespace System.Net.Security
{
public partial class SslStream
{
private const string ActivitySourceName = "Experimental.System.Net.Security";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this part of NetSecurityTelemetry like in Dns and Socket?
So we have:

  • an extra class in Http
  • part of telemetry class in Dns and Socket
  • part of production class in Ssl

Copy link
Member Author

@antonfirsov antonfirsov Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These pieces of code evolved like dust balls in a dirty room, but there is some logic:

  • In HTTP, various telemetry pillars follow complex requirements across multiple protocols and various aspects (request vs connection telemetry, Metrics vs EventSource vs Activities, http1-2 vs http3). When there is cohesion, a class has been defined to put related things into a single place, when there isn't, things were kept separate.
  • In DNS and Sockets, telemetry can be tracked together across the 3 pillars
  • In SSL, the Activity logic is so tiny, that I didn't bother to move it to a separate class.

I will do a second round of thinking about the last point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved around things a bit for SslStream

antonfirsov and others added 8 commits July 12, 2024 16:26
Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
@antonfirsov
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

public const string RequestActivityStartName = RequestActivityName + ".Start";
public const string RequestActivityStopName = RequestActivityName + ".Stop";

public const string ConnectionsNamespace = "Experimental.System.Net.Http.Connections";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Experimental?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out that our solution results is an unexpected volume of span links for tools like Azure Monitor, blowing up their UI, see open-telemetry/opentelemetry-specification#4130.

We need to work with OTel and Azure Monitor to introduce a standard way for categorizing links so visualization tools can distinguish them. The plan is to drive this work forward in order to stabilize the feature in .NET 10.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider adding activities for potentially long running network operations
9 participants