Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Do Not Merge] Add description doc for DotNet Runtime metrics #404

Closed
wants to merge 15 commits into from
Closed
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
198 changes: 198 additions & 0 deletions src/OpenTelemetry.Instrumentation.Runtime/description.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
# Runtime metrics description
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "Runtime metrics overview" or "Runtime metric details" here instead of "description"? Also we should link to this from README.


Metrics name are prefixed with the `process.runtime.dotnet.` namespace, following
the general guidance for runtime metrics in the
[specs](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md#runtime-environment-specific-metrics---processruntimeenvironment).
Instrument Units [should follow](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/README.md#instrument-units)
the Unified Code for Units of Measure.
Comment on lines +3 to +7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-worded this a bit:

Metric names are prefixed with the process.runtime.dotnet namespace following
the general guidance in the OpenTelemetry Specification. Instrument units follow the Unified Code for Units of Measure as outlined in the OpenTelemetry specification.


## GC related metrics

The metrics in this section can be enabled by setting the
`RuntimeMetricsOptions.IsGcEnabled` switch.

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-----------------------------------------------|--------------------------|-----------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**gc.totalmemorysize** | Total allocated bytes | `By` | ObservableGauge | `Int64` | | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think many of the ObservableGauge here should be ObservableUpDownCounter once .NET has the support for up-down counters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably need to to get more precise on this one. Some different possible interpretations:

  • Total amount of memory ever allocated from the GC since process start
  • Total amount of memory commited by the GC currently
  • Total amount of memory for live .NET objects + memory for free objects (objects that existed in the past and were collected but the memory hasn't yet been reused)
  • Total amount of memory for live .NET objects only
  • Total amount of memory for rooted .NET objects (objects that would still be live after a full GC)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some raw notes from our discussion:

process.runtime.dotnet.gc.heapsize: live object (not necessarily hold by the user objects/threads, could be hold by the GC itself) + fragmentation

all the gen adding up should <= committed (since GC commits at the granularity of page, and it normally would ask for multiple pages instead of just one for caching purpose)

neither of the above includes GC's own memory usage

github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md

GC.GetGenerationSize(n) is not providing the transactional behavior since we have to call it multiple times for different gen, it is okay as long as we mention it in the description/doc

remove gc.totalmemorysize since it can be covered by adding up different gens

| process.runtime.dotnet.**gc.count** | Garbage Collection count | `{times}` | ObservableCounter | `Int64` | gen | gen0, gen1, gen2 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is suggested by @noahfalk - for folks who have been using the EventCounters, we might consider something like "legacy time in GC" and make it an explicit opt-in (with a clear API name indicating that it has an unclear definition, and we don't recommend it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw we should be able to give a precise definition for the legacy counter too, it just isn't always useful or intuitive. I'm pretty sure the existing counter is computed like this:

GC_n = most recent GC that has ended prior to computing the value
GC_n-1 = GC that ended prior to GC_n

T_end_n = time when GC_n ended
T_end_n-1 = time when GC_n-1 ended
T_start_n = time when GC_n started

inter_gc_time = T_end_n - T_end_n-1
gc_duration = T_end_n - T_start_n
counter value = 100 * gc_duration / inter_gc_time

As an example if GC_n-1 ran from 2.1 sec to 2.2 sec and GC_n ran from 2.5 sec to 3.0 sec then
inter_gc_time = 3.0 - 2.2 = 0.8
gc_duration = 3.0 - 2.5 = 0.5
counter_value = 100 * 0.5 / 0.8 = 62.5%

People usually assume that the value measures the fraction of GC time during a fixed sampling interval (say every 5 minutes) but really it is measuring the fraction of GC time during a variable interval between the most recent GC and the one before that. If the most recent two GCs happen to be close together the counter may appear unexpectedly high and this often causes confusion.


Question for .NET team: is GC.GetTotalMemory(false) always equal to the sum of
xiang17 marked this conversation as resolved.
Show resolved Hide resolved
GC.GetGCMemoryInfo().GenerationInfo[i].SizeAfterBytes (i from 0 to 4)?

I need to decide whether it makes sense to include both of them in the memory/GC
size metrics.

- [GC.GetTotalMemory](https://docs.microsoft.com/dotnet/api/system.gc.gettotalmemory):
The number of bytes currently thought to be allocated.
It does not wait for garbage collection to occur before returning.

- [GC.CollectionCount](https://docs.microsoft.com/dotnet/api/system.gc.collectioncount):
The number of times garbage collection has occurred for the specified generation
of objects.

### Additional GC metrics only available for NETCOREAPP3_1_OR_GREATER

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|---------------------------------------------------|--------------------------------------------------|-------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**gc.allocated.bytes** | Bytes allocated over the lifetime of the process | `By` | ObservableCounter | `Int64` | | |
xiang17 marked this conversation as resolved.
Show resolved Hide resolved
| process.runtime.dotnet.**gc.fragmentation.ratio** | GC fragmentation ratio | `1` | ObservableGauge | `Double` | | |
xiang17 marked this conversation as resolved.
Show resolved Hide resolved

- [GC.GetTotalAllocatedBytes](https://docs.microsoft.com/dotnet/api/system.gc.gettotalallocatedbytes):
Gets a count of the bytes allocated over the lifetime of the process. The returned
value does not include any native allocations. The value is an approximate count.

- GC fragmentation ratio is calculated as:
If `gcMemoryInfo.HeapSizeBytes != 0`,
the value is
`gcMemoryInfo.FragmentedBytes * 1.0d / gcMemoryInfo.HeapSizeBytes`,
otherwise the value is `0`, where `var gcMemoryInfo = GC.GetGCMemoryInfo()`.

- [GCMemoryInfo.FragmentedBytes](https://docs.microsoft.com/dotnet/api/system.gcmemoryinfo.fragmentedbytes?view=netcore-3.1):
Gets the total fragmentation when the last garbage collection occurred.
- [GCMemoryInfo.HeapSizeBytes](https://docs.microsoft.com/dotnet/api/system.gcmemoryinfo.heapsizebytes?view=netcore-3.1#system-gcmemoryinfo-heapsizebytes):
Gets the total heap size when the last garbage collection occurred.

### Additional GC metrics only available for NET6_0_OR_GREATER

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-----------------------------------------|--------------------|-------|-----------------|------------|------------------|----------------------------|
| process.runtime.dotnet.**gc.committed** | GC Committed Bytes | `By` | ObservableGauge | `Int64` | | |
| process.runtime.dotnet.**gc.heapsize** | | `By` | ObservableGauge | `Int64` | gen | gen0, gen1, gen2, loh, poh |

- [GCMemoryInfo.TotalCommittedBytes](https://docs.microsoft.com/dotnet/api/system.gcmemoryinfo.totalcommittedbytes?view=net-6.0#system-gcmemoryinfo-totalcommittedbytes):
Gets the total committed bytes of the managed heap.

- [GC.GetGCMemoryInfo().GenerationInfo[i].SizeAfterBytes](https://docs.microsoft.com/dotnet/api/system.gcgenerationinfo):
Represents the size in bytes of a generation on exit of the GC reported in GCMemoryInfo.
(The number of generations `i` is limited by [GC.MaxGeneration](https://docs.microsoft.com/dotnet/api/system.gc.maxgeneration))
xiang17 marked this conversation as resolved.
Show resolved Hide resolved

## JIT Compiler related metrics

The metrics in this section can be enabled by setting the
`RuntimeMetricsOptions.IsJitEnabled` switch.

These metrics are only available for NET6_0_OR_GREATER.
Copy link
Member

@CodeBlanch CodeBlanch Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the preprocessor directives in this doc like NET6_0_OR_GREATER is very technical. Maybe just say like...

These metrics are available when targeting .NET6 or later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is opened to just solicit feedback only.


| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-------------------------------------------------|--------------------------|-------------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**il.bytes.jitted** | IL Bytes Jitted | `By` | ObservableCounter | `Int64` | | |
| process.runtime.dotnet.**methods.jitted.count** | Number of Methods Jitted | `{methods}` | ObservableCounter | `Int64` | | |
| process.runtime.dotnet.**time.in.jit** | Time spent in JIT | `ns` | ObservableCounter | `Int64` | | |

[JitInfo.GetCompiledILBytes](https://docs.microsoft.com/dotnet/api/system.runtime.jitinfo.getcompiledilbytes?view=net-6.0#system-runtime-jitinfo-getcompiledilbytes(system-boolean)):
xiang17 marked this conversation as resolved.
Show resolved Hide resolved
Gets the number of bytes of intermediate language that have been compiled.
The scope of this value is global.

[JitInfo.GetCompiledMethodCount](https://docs.microsoft.com/dotnet/api/system.runtime.jitinfo.getcompiledmethodcount?view=net-6.0#system-runtime-jitinfo-getcompiledmethodcount(system-boolean)):
Gets the number of methods that have been compiled.
The scope of this value is global.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to state The scope of this value is global. on all of these? Could probably just drop that since it seems obvious (IMO) or move it to the heading so we only state it once? 🤷


[JitInfo.GetCompilationTime](https://docs.microsoft.com/dotnet/api/system.runtime.jitinfo.getcompilationtime?view=net-6.0#system-runtime-jitinfo-getcompilationtime(system-boolean)):
Gets the amount of time the JIT Compiler has spent compiling methods.
The scope of this value is global.

## Threading related metrics

The metrics in this section can be enabled by setting the
`RuntimeMetricsOptions.IsThreadingEnabled` switch.

These metrics are only available for NETCOREAPP3_1_OR_GREATER.

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-------------------------------------------------------------|--------------------------------------|-------------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**monitor.lock.contention.count** | Monitor Lock Contention Count | `{times}` | ObservableCounter | `Int64` | | |
| process.runtime.dotnet.**threadpool.thread.count** | ThreadPool Thread Count | `{threads}` | ObservableGauge | `Int32` | | |
| process.runtime.dotnet.**threadpool.completed.items.count** | ThreadPool Completed Work Item Count | `{items}` | ObservableCounter | `Int64` | | |
| process.runtime.dotnet.**threadpool.queue.length** | ThreadPool Queue Length | `{items}` | ObservableGauge | `Int64` | | |
| process.runtime.dotnet.**active.timer.count** | Number of Active Timers | `{timers}` | ObservableGauge | `Int64` | | |

- [Monitor.LockContentionCount](https://docs.microsoft.com/dotnet/api/system.threading.monitor.lockcontentioncount?view=netcore-3.1):
Gets the number of times there was contention when trying to take the monitor's
lock.
- [ThreadPool.ThreadCount](https://docs.microsoft.com/dotnet/api/system.threading.threadpool.threadcount?view=netcore-3.1):
Gets the number of thread pool threads that currently exist.
- [ThreadPool.CompletedWorkItemCount](https://docs.microsoft.com/dotnet/api/system.threading.threadpool.completedworkitemcount?view=netcore-3.1):
Gets the number of work items that have been processed so far.
- [ThreadPool.PendingWorkItemCount](https://docs.microsoft.com/dotnet/api/system.threading.threadpool.pendingworkitemcount?view=netcore-3.1):
Gets the number of work items that are currently queued to be processed.
- [Timer.ActiveCount](https://docs.microsoft.com/dotnet/api/system.threading.timer.activecount?view=netcore-3.1):
Gets the number of timers that are currently active. An active timer is registered
to tick at some point in the future, and has not yet been canceled.

## Process related metrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a general question - do we want these to be in the OpenTelemetry.Instrumentation.Runtime package, or we want them to be in a different package (and if yes, what should be the name of that package)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote separate package and OpenTelemetry.Instrumentation.Process as a potential name?

Copy link
Contributor Author

@xiang17 xiang17 Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a PR to remove it: #446, and an issue to add them in a new package: #447.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the CPU metrics percentage collector... and will also push that to new namespace... #437 thx @reyang for ping about info of new namespace...


The metrics in this section can be enabled by setting the
`RuntimeMetricsOptions.IsProcessEnabled` switch.

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-------------------------|----------------------------------------|----------------|-------------------|------------|------------------|------------------|
| process.cpu.utilization | CPU utilization of this process | `1` | ObservableGauge | `Double` | | |
| process.cpu.time | Processor time of this process | `s` | ObservableCounter | `Int64` | state | user, system |
| process.cpu.count | The number of available logical CPUs | `{processors}` | ObservableGauge | `Int64` | | |
xiang17 marked this conversation as resolved.
Show resolved Hide resolved
| process.memory.usage | The amount of physical memory in use | `By` | ObservableGauge | `Int64` | | |
| process.memory.virtual | The amount of committed virtual memory | `By` | ObservableGauge | `Int64` | | |
xiang17 marked this conversation as resolved.
Show resolved Hide resolved

- CPU utilization
- [Process.TotalProcessorTime](https://docs.microsoft.com/dotnet/api/system.diagnostics.process.totalprocessortime)
divided by ([Environment.ProcessorCount](https://docs.microsoft.com/dotnet/api/system.environment.processorcount)
\* ([DateTime.Now](https://docs.microsoft.com/dotnet/api/system.datetime.now) -
[Process.StartTime](https://docs.microsoft.com/dotnet/api/system.diagnostics.process.starttime)))

- CPU Time:
- [Process.UserProcessorTime](https://docs.microsoft.com/dotnet/api/system.diagnostics.process.userprocessortime):
Gets the user processor time for this process.
- [Process.PrivilegedProcessorTime](https://docs.microsoft.com/dotnet/api/system.diagnostics.process.privilegedprocessortime):
Gets the privileged processor time for this process.

- [Environment.ProcessorCount](https://docs.microsoft.com/dotnet/api/system.environment.processorcount):
Gets the number of processors available to the current process.
- Memory usage: [Process.GetCurrentProcess().WorkingSet64](https://docs.microsoft.com/dotnet/api/system.diagnostics.process.workingset64):
Gets the amount of physical memory, in bytes, allocated for the currently
active process.
- Memory virtual: [Process.GetCurrentProcess().VirtualMemorySize64](https://docs.microsoft.com/dotnet/api/system.diagnostics.process.virtualmemorysize64):
Gets the amount of the virtual memory, in bytes, allocated for the currently
active process.

Question: EventCounter implementation exposes a metric named `working-set` with
`Environment.WorkingSet`. Is it equal to `Process.GetCurrentProcess().WorkingSet64`
property? I need to decide on which is more suitable for showing users the memory
usage for the process, or whether to include both.

- [Environment.WorkingSet](https://docs.microsoft.com/en-us/dotnet/api/system.environment.workingset?view=net-6.0):
A 64-bit signed integer containing the number of bytes of physical memory mapped
to the process context.

## Assemblies related metrics

The metrics in this section can be enabled by setting the
`RuntimeMetricsOptions.IsAssembliesEnabled` switch.

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-------------------------------------------|-----------------------------|----------------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**assembly.count** | Number of Assemblies Loaded | `{assemblies}` | ObservableCounter | `Int64` | | |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.NET runtime has the ability to unload assemblies, so AppDomain.GetAssemblies().Length could decrease in some occasions. The instrument type should be ObservableUpDownCounter (before ObservableUpDownCounter is ready, I'll use ObservableGauge in implementation). I'll update the doc later.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your comment is for .NET Framework and not for .NET Core as .NET Core doesn't allow more than one appomain in the process.

AppDomain.GetAssemblies().Length could decrease in some occasions.

I am not sure if this is accurate either. You may unload the whole appdomain and not individual assembly. So, technically the AppDomain.GetAssemblies().Length doesn't decrease. but the total assemblies in the process can decrease.

In .NET Core, uses AssemblyLoadContext for unloading all assemblies in that context by unloading the context.

CC @noahfalk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested it, but I would assume that assemblies in any AssemblyLoadContext, including an unloadable one, appear in the AppDomain.GetAssemblies() list. In that case AppDomain.GetAssemblies().Length could decrease because a load context unloaded.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will such a counter have really a useful meaning in .NET core apps? And how can users handle such a counter in .NET Framework apps? or we don't care much about .NET Framework cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My answer is YES. I can think of a valid use case - certain online service / background job run out of memory after hours and the engineers can tell from the "assembly loaded" metrics that the process kept loading plugins without unloading them.

This happened in my previous team, where a backend processing job detects new XML schema and compiles it to assembly (for sake of perf) and load it dynamically, the backend job ended up loading tens of thousands of assemblies.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good then!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Tarek and Noah! I'm getting the assembly count for the "current domain" AppDomain.CurrentDomain.GetAssemblies().Length, and the metric has to work for both .NET Framework and .NET Core. If the value could decrease in any case, I couldn't use ObservableCounter because a metric data point could be dropped.

I found one remark about a special case when the assembly is in the default application domain:

If an assembly is loaded into the default application domain, it cannot be unloaded from memory while the process is running.

If Noah's right about assemblies in AssemblyLoadContext would appear in AppDomain.CurrentDomain.GetAssemblies(), then I should use ObservableGauge (and use ObservableUpDownCounter when available). Otherwise it might lead to loss of data.

Copy link

@tarekgh tarekgh Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting the assembly count for the "current domain" AppDomain.CurrentDomain.GetAssemblies().Length

If we are reporting the number from the current appdomain only, I'll question the @reyang's scenario certain online service / background job run out of memory after hours and the engineers can tell from the "assembly loaded" metrics that the process kept loading plugins without unloading them. when running on .NET Framework. Looks what we report wouldn't be enough for that scenario.

To clarify, .NET Framework can have multiple appdomains. I guess it will be more useful reporting the total count of assemblies in all appdomains and not necessary the current appdomain.

In .NET Core, there is only one appdomain, so reporting the number from the current appdomain would work, I guess. We need to confirm loading and unloading AssemblyLoadContext is going to be reflected to the number we report. When I get some time, I can try to do that.

Copy link

@tarekgh tarekgh Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have confirmed in .NET Core apps when unloading the AssemblyLoadContext, the unloaded assemblies inside that context will be reflected in the AppDomain.CurrentDomain.GetAssemblies(). In other word, the number of assemblies in the appsomain can decrease.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the instrument type to ObservableGauge.

For implementation on getting all AppDomains, I found three articles 1, 2 and 3, but they all seem like very old school code depending on MSCorEE.dll.

I still need to figure out whether I could use those APIs, or find better alternatives.


- [AppDomain.GetAssemblies](https://docs.microsoft.com/dotnet/api/system.appdomain.getassemblies):
Gets the number of the assemblies that have been loaded into the execution context
of this application domain.

## Exception counter metric

The metrics in this section can be enabled by setting the
`RuntimeMetricsOptions.IsExceptionCounterEnabled` switch.

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|--------------------------------------------|--------------------------------------------|------------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**exception.count** | Number of exception thrown in managed code | `{timers}` | ObservableCounter | `Int64` | | |

- [AppDomain.FirstChanceException](https://docs.microsoft.com/dotnet/api/system.appdomain.firstchanceexception)
Occurs when an exception is thrown in managed code, before the runtime searches
the call stack for an exception handler in the application domain.

## Currently out of scope

Regarding process.runtime.dotnet.**time-in-gc**: (DisplayName in [EventCounter implementation](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs#L96)
is "% Time in GC since last GC".) A new metric should replace it by calling a new
API GC.GetTotalPauseDuration().
The new API is added in code but not available yet.
It is targeted for 7.0.0 milestone in .NET Runtime repo.
See [dotnet/runtime#65989](https://github.com/dotnet/runtime/issues/65989)