Add Structure and Continuity; remove instantaneous Temporality #181

jmacd · 2020-07-28T07:42:02Z

This is simplified from #168 by removing INSTANTANEOUS Temporality and making Structure and Continuity independent fields, yielding 12 combinations of Temporality, Structure, and Continuity.

Like #168, this moves Monotonicity into a Structure value, removes it from Type.

The removal of INSTANTANEOUS is accommodated by making StartTimeUnixNano optional when data is unaggregated. Raw measurements correspond to the API specification for raw data, meaning:

CONTINUOUS ADDING(_MONTONIC) raw measurements are changes to a sum
SNAPSHOT ADDING(_MONTONIC) raw measurements are direct sums

bogdandrutu · 2020-07-28T14:42:53Z

opentelemetry/proto/metrics/v1/metrics.proto

+
+  // Continuity describes how the measurement was captured,
+  // indicating whether measurements represent application-level
+  // events or were generated through a callback invoked by the SDK.


Is not only the callback that can generate snapshots, aggregating any already aggregated data is a snapshot. Think in collector I can receive points from a Counter aggregated as a sum/delta then building a Histogram there makes the input snapshot

Also for a Counter (aggregating cumulative Sum) vs SumObserver (cumulative Sum calculated outside) - both sums were calculated by an aggregator (in our outside our library), and at the moment you export the sum there is minimal information that Snapshot vs Continuous provide - no more information about the number of measurements available. But in case of a Histogram where we also include the count of measurements this information may be useful (and probably is). So I am not 100% convinced that Continuity makes sense for scalar metrics.

aggregating any already aggregated data is a snapshot

I would have stated that:

aggregating already-aggregated SNAPSHOT data is SNAPSHOT data

aggregating already-aggregated CONTINUOUS data is CONTINUOUS data

Combining multiple input metrics of different Continuity is not well defined in this (variation of the) protocol, and I'm not sure we should be aiming for a protocol that supports multiple input metrics regardless of whether they agree on Structure/Continuity. In all cases, the presence of StartTimeUnixNano != 0 indicates that the input data spanned a period of time as opposed to being defined at an instant where StartTimeUnixNano == TimeUnixNano.

at the moment you export the sum there is minimal information that Snapshot vs Continuous provide

My intention is that both of these fields tell us about the input measurements, which do not change when we aggregate.

There may be minimal information, but there is information here. If you see two raw measurements with the same timestamp and they are SNAPSHOT, they were made by the same callback and can be compared as a ratio. If you see two raw measurements with the same timestamp and they are CONTINUOUS, they are merely coincidental and should not be compared as a ratio.

A sum is a sum, and we know that it is sensible to compute a rate of a sum. A "gauge" (or non-sum) should not be exposed as a rate, because we don't know that the distance between A and B is the same as the distance between (A+x) and (B+x). This quality is the reason that Sums can be displayed as rates regardless of whether they are monotonic or not.

My intention is that both of these fields tell us about the input measurements, which do not change when we aggregate.

I think this is not clear to me, will try to show an example to show where my confusion comes from:

An "UpDownSumObserver" for "memory usage", and having an export interval of 10seconds, every 10 second I report the cumulative sum of all "new/delete" calls. User configures the library to just export the value read from the kernel with no changes.

An "UpDownCounter" for "new/delete" calls, by instrumenting tcmalloc library. We have configured a cumulative sum aggregation (maybe without exemplars if that will be possible). Export interval is the same 10 seconds.

In this two cases I get exactly the same data - every 10 second a scalar metric with the sum of new/delete from the beginning of the process. The only difference may be "exemplars" which may or may not be present for the second metric (not guaranteed that they are present, if the user configures to not keep exemplars in the view configuration).

Now the idea is, way do we need to know and complicate the model for this case with continuous vs snapshot? Both are snapshots because even in the second metric we actually take a snapshot of the aggregator at a specific moment of time.

If you see two raw measurements with the same timestamp and they are SNAPSHOT, they were made by the same callback and can be compared as a ratio. If you see two raw measurements with the same timestamp and they are CONTINUOUS, they are merely coincidental and should not be compared as a ratio.

Here is where I agree with you, but I see things a bit different as proposed in #179, we have "Raw measurements" a.k.a CONTINUOUS which represents measurements recorded using the sync instruments with no other aggregation and any other "measurements" are snapshot independent of where that aggregation happens (in our library like in the case of the second metric, or outside our library in case of the first metric) because we actually take a snapshot of an Aggregator.

Probably I miss something, but here is where my current understanding is.

So in my proposal to clarify "Raw Measurements" imply continuos all the other types imply snapshot because we take a snapshot of an Aggregator that somewhere exists (inside or outside of our library)

In this two cases I get exactly the same data - every 10 second a scalar metric with the sum of new/delete from the beginning of the process

The count of label sets means something different in these cases. The Observer instrument produces N points at the same logical instant in time, whereas the Recorder instrument produces N points over a window of time. If I ask how many distinct label sets there are from a Recorder (CONTINUOUS), I have to specify a time window for the question to be meaningful. For an Observer (SNAPSHOT), I can answer the question ("how many distinct label sets") without specifying a time range.

The count of label sets means something different in these cases. The Observer instrument produces N points at the same logical instant in time, whereas the Recorder instrument produces N points over a window of time. If I ask how many distinct label sets there are from a Recorder (CONTINUOUS), I have to specify a time window for the question to be meaningful. For an Observer (SNAPSHOT), I can answer the question ("how many distinct label sets") without specifying a time range.

I think this is where I miss something (maybe here is where the disagreement comes from). In both cases for me when I look at the N points (N different label sets) they accumulated (sums) values over that time interval:

-------- // start time t0'

t0 - new object with location "stack" size 10

t1 - new object with location "heap" size 100

t2 - delete object with location "stack" size 10

t3 - new object with location "stack" size 20

-------- // exporting t3'

t4 - delete object with location "stack" size 20

t5 - new object with location "stack" size 30

t6 - delete object with location "heap" size 100

t7 - new object with location "heap" size 200

-------- // exporting t7'

t8 - delete object with location "stack" size 30

t9 - delete object with location "heap" size 200

-------- // exporting t9'

First Metric:

first export (start_time= t0'; end_time = t3' - "location"="stack" 20; "location"="heap" 100)

second export (start_time= t0'; end_time = t7' - "location"="stack" 30; "location"="heap" 200)

third export (start_time= t0'; end_time = t9' - "location"="stack" 0; "location"="heap" 0)

Second Metric:

first export (start_time= t0'; end_time = t3' - "location"="stack" 20; "location"="heap" 100)

second export (start_time= t0'; end_time = t7' - "location"="stack" 30; "location"="heap" 200)

third export (start_time= t0'; end_time = t9' - "location"="stack" 0; "location"="heap" 0)

In both cases we have the same number of unique label sets, and we cannot assign a specific moment when a value for a specific label set was updated, because that information is lost during aggregation, the only thing we know is that in the interval t0' - tx' we had this number of distinct label sets and these are the "current values" corresponding to every label set.

To get a bit more confused, even for the "second metric" we actually read the value at one specific moment "exporting time", so why is that different? I think once we left the library by adding the data to the wire, we see these metrics exactly the same, inside the library there may be some interpretations like the one you had, but once we put the data on the wire they are identical to me.

I think I misunderstood this:

In all cases, the presence of StartTimeUnixNano != 0 indicates that the input data spanned a period of time as opposed to being defined at an instant where StartTimeUnixNano == TimeUnixNano.

I am trying again to see if that makes sense:

There are some Observers where we know the start time, for example a SumObserver may have the start time equal with the start of the process, and in this case their data will look exactly the same as the "UpDownCounter" example. In this case the metric is considered "CONTINUOUS".

There are some Observers where we cannot determine the start time correctly, even for a SumObserver, in this case the collected points will miss the start_time and hence they will be considered "SNAPSHOT".

If this is the example that you have in mind, I completely agree that this is a possibility in the real world, but this makes me to think that Observers especially "SumObserver" and "UpDownSumObserver" need to support a configuration where users will specify the start_time (or just a boolean to say if it is equal to the start of the process).

jmacd · 2020-07-28T16:00:14Z

There is a part of this proposal that doesn't sit well with me--the distinction between raw snapshot and raw continuous measurements is arbitrary. In open-telemetry/oteps#88 we discussed a number of potential instruments that are sensible but not included in the API specification, those would have hypothetical names SumRecorder, UpDownSumRecorder (CONTINUOUS) and DeltaObserver, UpDownDeltaObserver (SNAPSHOT) in current terms. I'd like it if OTLP supported this kind of data, even if the API doesn't expose such instruments.

If we were to introduce these instruments, we'd need a way to distinguish whether a raw sum was entered as a change or as a sum, which is something that @bogdandrutu included under the oneof in #179, as this seems to be independent of Temporality in the case where there is no temporal aggregation. I was thinking of a new type-related field to indicate this property, but can't find the right name.

… the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) IMPORTANT: * This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>

… the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) * Decide if support for INSTANTANEOUS temporality is still needed. IMPORTANT: * This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>

jmacd · 2020-07-29T19:36:10Z

See #182

…tails about the aggregation and temporality. (#182) * Change Metric type to be a more detailed structure with details about the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) * Decide if support for INSTANTANEOUS temporality is still needed. IMPORTANT: * This PR is not equivalent with #168 or #181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com> * Address feedback, added more TODOs and created issues Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Aaron Abbott <aaronabbott@google.com> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * Fix indentation and comment for Type Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com> Co-authored-by: Aaron Abbott <aaronabbott@google.com> Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

jmacd added 5 commits July 28, 2020 00:24

Add Structure and Continuity

b257c60

Tidy comments

f3baec4

More tidy

073dd4d

More tidy

4ac91cd

Histogram and Summary MUST set StartTimeUnixNano

4c21cfd

jmacd added the spec:metrics label Jul 28, 2020

bogdandrutu reviewed Jul 28, 2020

View reviewed changes

bogdandrutu mentioned this pull request Jul 29, 2020

Add Gauge. Change Metric type to be a more detailed structure with details about the aggregation and temporality. #182

Merged

jmacd closed this Jul 29, 2020

jmacd mentioned this pull request Aug 18, 2020

Converting Traces and Metrics into logs. open-telemetry/oteps#120

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Structure and Continuity; remove instantaneous Temporality #181

Add Structure and Continuity; remove instantaneous Temporality #181

jmacd commented Jul 28, 2020

bogdandrutu Jul 28, 2020

bogdandrutu Jul 28, 2020

jmacd Jul 28, 2020

jmacd Jul 28, 2020

bogdandrutu Jul 28, 2020 •

edited

Loading

bogdandrutu Jul 28, 2020

jmacd Jul 28, 2020

bogdandrutu Jul 28, 2020 •

edited

Loading

bogdandrutu Jul 28, 2020 •

edited

Loading

jmacd commented Jul 28, 2020

jmacd commented Jul 29, 2020

Add Structure and Continuity; remove instantaneous Temporality #181

Add Structure and Continuity; remove instantaneous Temporality #181

Conversation

jmacd commented Jul 28, 2020

bogdandrutu Jul 28, 2020

Choose a reason for hiding this comment

bogdandrutu Jul 28, 2020

Choose a reason for hiding this comment

jmacd Jul 28, 2020

Choose a reason for hiding this comment

jmacd Jul 28, 2020

Choose a reason for hiding this comment

bogdandrutu Jul 28, 2020 • edited Loading

Choose a reason for hiding this comment

bogdandrutu Jul 28, 2020

Choose a reason for hiding this comment

jmacd Jul 28, 2020

Choose a reason for hiding this comment

bogdandrutu Jul 28, 2020 • edited Loading

Choose a reason for hiding this comment

bogdandrutu Jul 28, 2020 • edited Loading

Choose a reason for hiding this comment

jmacd commented Jul 28, 2020

jmacd commented Jul 29, 2020

bogdandrutu Jul 28, 2020 •

edited

Loading

bogdandrutu Jul 28, 2020 •

edited

Loading

bogdandrutu Jul 28, 2020 •

edited

Loading