Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename exception.* to error.* (aligning with ECS) #3198

Closed
Tracked by #1012
trask opened this issue Feb 9, 2023 · 26 comments
Closed
Tracked by #1012

Rename exception.* to error.* (aligning with ECS) #3198

trask opened this issue Feb 9, 2023 · 26 comments
Assignees
Labels
area:semantic-conventions Related to semantic conventions semconv:HTTP spec:trace Related to the specification/trace directory

Comments

@trask
Copy link
Member

trask commented Feb 9, 2023

What are you trying to achieve?

Rename exception.* attributes to the corresponding ECS names under error.* (see open-telemetry/oteps#222 for motivation).

Rename:

@trask trask added area:semantic-conventions Related to semantic conventions spec:trace Related to the specification/trace directory semconv:HTTP labels Feb 9, 2023
@trask trask self-assigned this Feb 9, 2023
@trask
Copy link
Member Author

trask commented Feb 10, 2023

@AlexanderWert it's not clear to us the best mapping from OTel's all-in-one-string (unstructured) exception.stacktrace to ECS's multi-field/wildcard error.stack_trace.

Is error.stack_trace.text the convention for capturing an all-in-one-string (unstructured) stack trace?

@AlexanderWert
Copy link
Member

@trask In ECS (and Elasticsearch) there are different types of strings: keyword (indexed as is for direct access) and text (optimized for search in text). A field can have both characteristics, in that case fieldname is the keyword and fieldname.text is the text type field.

error.stack_trace is such a wildcard type which is a special case of keyword type. And it also searchable, so it has in addition the error.stack_trace.text field.

For OTel semconv the keyword / wildcard typed field is relevant and we can ignore the text field.

@mwear
Copy link
Member

mwear commented Mar 14, 2023

I don't have strong feelings on this either way, but wanted to link two previous issues where this was discussed. #427 and its successor #697. The original proposal named the conventions error.*. It was changed due to the discussions on those PRs.

@trask
Copy link
Member Author

trask commented Apr 12, 2023

Some potential reasons not to rename exception.* to error.*:

@AlexanderWert
Copy link
Member

AlexanderWert commented Apr 24, 2023

Thinking more about this, I'm wondering if it wouldn't be better for exception.* and error.* to coexist:

  1. The main reasoning for renaming error.* to exception.* in Exception reporting specification #697 was that "we record an exception, which may or may not be an error". So, not every exception is an error.
  2. On the other hand, not every error is an exception (think of ERROR log messages that do not expose exception information, or HTTP 5XX errors, etc.).

So, while in the scope of spans exception.* might be the only relevant namespace, for other signals (e.g. logging), error.* is reasonable as an additional attribute set as well.

So, how about keeping OTel's exception.* attributes as is, BUT adding ECS' error.* fields as well (not necessarily as span attributes, but as attributes to general semantic conventions)?

@trask
Copy link
Member Author

trask commented Apr 24, 2023

Thinking more about this

same...

I'd like to see if we can avoid having both error.* and exception.*, as that seems a bit confusing.

And I think error.* is the more general term, so I'd lean towards keeping that (it seems reasonable to say exceptions are a kind of error).

I think the existing API method Span#RecordException is still accurate, since it specifically records an exception on a span. The fact that it would record it using (the more general) attributes error.* seems reasonable.

@Oberon00
Copy link
Member

Oberon00 commented Apr 24, 2023

And I think error.* is the more general term, so I'd lean towards keeping that (it seems reasonable to say exceptions are a kind of error).

Hmm, but not in the sense of the original argument of (re)naming that thing exception. An exception may indicate that some problem occurred but was handled, in which case, was it really an error? Or an exception might be used for other things, like Python's GeneratorExit

@trask
Copy link
Member Author

trask commented Apr 24, 2023

ya, I don't disagree, I guess my preference is to stretch the meaning a bit instead of having both error.* and exception.* since the distinction is pretty subtle for most(?) users

@pyohannes
Copy link
Contributor

I'd like to see if we can avoid having both error.* and exception.*, as that seems a bit confusing.

I agree. Having both error and exception namespaces with type, message, and stacktrace fields would be very confusing.

And I think error.* is the more general term, so I'd lean towards keeping that (it seems reasonable to say exceptions are a kind of error).

I agree with the first part of the statement. An error is a generic concept, whereas an exception is a language specific construct. Not all languages have an exception construct, e. g. Go and Rust.

To summarize, I'd be strongly in favor of having just one of error or exception. I have no strong opinion on which one of these two to settle, but I have slight preference for error for the following reasons:

  • An "error" is a more generic concept than an "exception". There is value in capturing errors that are no exceptions.
  • For telemetry consumers, the crucial distinction often is not whether something is an error or an exception, but whether something is expected or unexpected. Both errors and exceptions can be either expected or unexpected and need to be interpreted in context. For HTTP client spans with 4xx status codes we require the span kind to be set to ERROR, but this can be expected (e. g. for checking whether a certain DB entry does not exist). With that in mind, I don't see much benefit in using the "exception" term rather than "error".

@Oberon00
Copy link
Member

Oberon00 commented Apr 25, 2023

What about the Span status? An error can be recorded by setting the span status to ERROR and setting a appropriate status text. Cf. #306.

I think there is totally space for separate error and exception conventions, and they might not even look that similar: #306 (comment)

Actually, that ECONNREFUSED highlights another use case that is not well-supported by that statusCode field: If we had some generic convention for OS errors, you could do something like setting os.errcode = "ECONNREFUSED", os.errfunc = "connect" on an HTTP span and you would know that there was an OS-level error, not an HTTP-level error.

EDIT: Or maybe err.kind = "os", err.code = "ECONNREFUSED", err.os.func = "connect". Could be adapted to err.kind = "exception" err.code = "java.net.ConnectException", err.msg = "Connection to [::2]:14426 refused.". We need semantic conventions for errors and exceptions!

Here, by @jmacd on the original name decision: #427 (comment)

I would support changing "error.type" and "error.message" to "exception.type" and "exception.message". It's OK to have an exception that is not an error. When you wrote throw new RuntimeException("message") that translates into exception.type=java.lang.RuntimeException and exception.message=message.

Likewise, it's fine to record a stacktrace for a non-error. Can we just call the stacktrace attrribute stacktrace? That's unambiguous, I think.
[...]

=> So the intention of the OTel exception.* conventions is to be used just for exceptions. For errors (and even to indicate whether an exception is an error or not), you'd use other conventions (e.g. http.status_code) or the span Status (message+ERROR/OK/UNSET)

@ruflin
Copy link

ruflin commented Apr 25, 2023

One of the problems error always had in ECS was that it was used in too many different ways. Here the current description:

Use them for errors that happen while fetching events or in cases where the event itself contains an error.

In addition to the mentioned two use cases above, it is also used for errors happening during ingest. If we already touch error, we should narrow down on what our targeted use case is no matter if it is error or exception. Most cases I see described above are focused on "cases where the event itself contains an error." That is also what I would focus on and scrap "fetching" or "ingest" errors.

@AlexanderWert
Copy link
Member

=> So the intention of the OTel exception.* conventions is to be used just for exceptions. For errors (and even to indicate whether an exception is an error or not), you'd use other conventions (e.g. http.status_code) or the span Status (message+ERROR/OK/UNSET)

Agree with that for spans!

How would that look like for other use cases (e.g. error logs that are not exceptions)?

For example, an Apache error log:

[Thu Mar 13 19:04:13 2014] [error] [client 50.0.134.125] File does not exist: /var/www/favicon.ico

An example of a MySQL error message:

2018-10-02T03:20:39.410387Z 768 [ERROR] [MY-010045] [Server] Event Scheduler:
[evtuser@localhost][myschema.e_daily] Unknown database 'mydb'

So for capturing the above examples as OTel logs, right now the only way is to write them into exception.* (which feels wrong because it's not an exception) and SpanStatus and http.status_code are not applicable either. So that's why I think we need error.* either as an additional concept or as more general one.

@Oberon00
Copy link
Member

Oberon00 commented Apr 25, 2023

I also think a separate error convention would be the way to go. If we want consistency with span's status, it would need to be status.category (optional enum with just "OK", "ERROR" and "UNSET" might be represented as actually unset) and status.message.

@pyohannes
Copy link
Contributor

I also think a separate error convention would be the way to go.

I understand some of the reasoning behind this, but when I try to picture a concrete solution, it feels quite murky to me:

  • Exceptions that are no errors: you'll have exception.[type,message,stacktrace], the span status will be likely be OK or UNSET, if it wasn't set to ERROR due to another reason not related to the recorded exception.
  • Errors that are no exceptions: you'll have error.[code,message,stacktrace], and span status will likely be ERROR, if the error wasn't handled in the scope of this span or if the error was expected.
  • Exceptions that are errors. Should both exception.* and error.* attributes be set? The span status might be ERROR if the exception wasn't handled in the scope of the span, and something else otherwise?

A lot of emphasis is put on exceptions that aren't errors (but instead used for flow control). There might be some controversy about this, but I would argue that recording details about such exceptions has only limited value in the context of telemetry data and should be avoided if possible (what value would we get from e. g. GeneratorExit stack traces on spans and logs). With the exception (no pun intended) of Python, this usage of exceptions is also very niche in all languages I'm familiar with.

To summarize, separating errors and exceptions seems cleaner on a conceptual level, however, it adds a fair amount of complexity. The additional value provided seems small to me, especially in the context of telemetry data, and I'm not convinced it is worth the additional complexity.

@tsloughter
Copy link
Member

tsloughter commented Apr 25, 2023

T> With the exception (no pun intended) of Python, this usage of exceptions is also very niche in all languages I'm familiar with.

Erlang is another one that uses exceptions for control flow at times and I'd completely agree that these should not be recorded in an exception.* attribute. So I support replacing exception with error.

The only thing I'd add is error.stacktrace shouldn't be a required attribute for having an error -- in recent Erlang versions you can only get the stacktrace if you have an exception :)

@Oberon00
Copy link
Member

Oberon00 commented Apr 25, 2023

Errors that are no exceptions: you'll have error.[code,message,stacktrace], and span status will likely be ERROR

For spans, you won't have error.message, since you would store it in the status (which needs to be error then). As for stacktrace, I don't see how it is related to errors in particular. We might want a general stacktrace attribute though (note that the exception's stacktrace attribute is not actually a "pure" stacktrace (as in, a list of frames), but the full exception detail representation, usually including exception, causing exception, etc. I protested that name quite vehemently when it was introduced, but since a few languages do call it stacktace and there is no commonly used specific other name for the concept, it stuck)

@trask
Copy link
Member Author

trask commented Apr 26, 2023

What about the Span status? An error can be recorded by setting the span status to ERROR and setting a appropriate status text.

That's a great point that spans already have both concepts:

  • Span.RecordException(Exception)
  • Span.SetStatus(StatusCode.ERROR, Description)

So maybe having separate error.* attributes would not be so confusing.

As far as introducing error.* attributes, how about?

  • error.type (low-cardinality) - equivalent of Span's StatusCode.ERROR Description
  • error.code (low-cardinality) - this could be a useful span attribute for systems which have error codes
  • error.message (not low-cardinality) - this could be useful on spans which want to report a more detailed reason associated with StatusCode.ERROR

(and error.* would not need a stack trace attribute, since that's typically the domain of exceptions)

@Oberon00
Copy link
Member

Oberon00 commented Apr 27, 2023

Hmm, I think error.message would be the status description. At least most usages of the status description I have seen don't use a fixed string, but put in some exception message or similar.
Also, I think we need something to represent explicit OK/ERROR status for other telemetry if we want to go this direction.

And actuallly, I think we should use the ones defined here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/mapping-to-non-otlp.md#span-status: otel.status_code and otel.status_description (NB: these names are marked stable)

@trask
Copy link
Member Author

trask commented Apr 28, 2023

Hmm, I think error.message would be the status description. At least most usages of the status description I have seen don't use a fixed string, but put in some exception message or similar.

interesting, we (@open-telemetry/java-instrumentation-maintainers) have interpreted this differently so far: open-telemetry/opentelemetry-java-instrumentation#6710 (comment)

using exception message contradicts the language used in spec which requires for the description to be predictable and documented. I interpret it to mean that there should be a fixed list of descriptions.

@trask
Copy link
Member Author

trask commented Apr 28, 2023

I think we need something to represent explicit OK/ERROR status for other telemetry if we want to go this direction.

on the logging side we have Severity, though it doesn't map 1-1 with SpanKind OK/ERROR, and it doesn't solve the issue for metrics signal

@lmolkova
Copy link
Contributor

lmolkova commented Apr 29, 2023

wrt ote.status_ attributes:

otel.status_code definition is very strict and cannot be used as an equivalent of general-purpose error.code, but it's probably fine:

  • for systems that have status codes, status code is not necessarily an error or success, so they won't use error.code for generic code. They can define a system-specific code plus use otel.status_code to describe if it's error/success/unknown.

Functionally, we can extend otel.status_code and otel.status_description to be cross-signal attributes, but the whole "otel.*" namespace seem to be reserved for internal otel business. Also, given how diverse and mature logging world is, having OTel-specific attributes for errors is a bit bold.

@Oberon00
Copy link
Member

Oberon00 commented May 1, 2023

@trask Your interpretation may very well be what was originially intended, but it's not generally used that way, e.g. the Python API provides an option to automatically record the exception messge as span status: https://github.com/open-telemetry/opentelemetry-python/blob/v1.17.0/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py#L951

@trask
Copy link
Member Author

trask commented May 10, 2023

@Oberon00 do you think I should raise a spec issue to clarify, or do you think the spec is reasonably clear and I should open a bug report in the python repo? thx

@Oberon00
Copy link
Member

Oberon00 commented May 10, 2023

This is a spec issue IMHO. Python was just one example, the same happens in some .NET instrumentations, e.g.: https://github.com/open-telemetry/opentelemetry-dotnet/blob/fea6793a5590a18ff6d5e99460f9dfd8cf4c95b2/src/OpenTelemetry.Instrumentation.Http/Implementation/HttpHandlerDiagnosticListener.cs#L248

At this point, I think you have to adapt the spec to explicitly allow this. What these do was not originally the intention maybe, but it is useful and not obviously wrong.

@cbandy
Copy link

cbandy commented Aug 13, 2023

Some exception attributes are "reserved" here:

## Reserved Attributes
Semantic conventions MUST provide the following attributes:
- Resource
- `service.name`
- `telemetry.sdk.language`
- `telemetry.sdk.name`
- `telemetry.sdk.version`
- Event
- `exception.escaped`
- `exception.message`
- `exception.stacktrace`
- `exception.type`

@trask
Copy link
Member Author

trask commented Apr 23, 2024

exception.* attributes are now stable: open-telemetry/semantic-conventions#619

@trask trask closed this as completed Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions semconv:HTTP spec:trace Related to the specification/trace directory
Development

No branches or pull requests

10 participants