
New component: Blob Attribute Uploader Connector #33737

Open
3 tasks
michaelsafyan opened this issue Jun 24, 2024 · 16 comments
Labels
needs triage New item requiring triage Sponsor Needed New component seeking sponsor

Comments

@michaelsafyan
Contributor

michaelsafyan commented Jun 24, 2024

The purpose and use-cases of the new component

The Blob Attribute Uploader Connector takes selected attributes (from spans, span events, logs, etc.) and:

  • Writes them to a large blob storage system
  • Replaces them in the original signal with a "Foreign Attribute Reference" referencing the URI of where it was written
  • Forwards the signal to a pipeline for the same signal type for further processing, export

This component is intended to address a number of concerns:

  • Sensitivity of data: certain data may be necessary to retain for debugging but may not be suitable for access by all oncallers or others with access to general operational data; writing certain attributes to a separate blob storage system may allow for finer-grained, alternative access restrictions to be applied compared with the general ops backend.
  • Size of the data: some operational backends may have limitations around the size of the data they can receive; sending large attributes to a separate blob storage backend may avoid these limitations.
  • Costs of storage: while most operational data may need to be available quickly to address incidents, certain attributes may be needed to be accessed less frequently and may be suitable for lower cost, long-term storage options.

Motivating Examples:

  • HTTP request/response pairs stored in span attributes (http.request.body.content and http.response.body.content)
  • LLM prompt/response pairs stored in span event attributes ( gen_ai.prompt and gen_ai.completion)

Use Cases Related to the Examples:

  • Additional access restrictions are needed beyond those of the general operations solution; writing to a separate blob storage system allows additional access controls to be applied. Links to the destination enable the results to be located in a separate backend storage system that provides the necessary access checks.

  • Full requests/responses are used rarely by oncallers, typically only when an end user opens a ticket through the support mechanism; writing this data to a separate, low-cost storage system allows the user to save on ops storage costs.

Example configuration for the component (subject to change)

The following is intended to illustrate the general idea, but is subject to change:

The configuration consists of a list of ConfigStanzas:

config := LIST[ConfigStanza]

Each config stanza defines how it will handle exactly one type of attribute. The properties of the stanza are:

  • match_attribute_key: (REQUIRED) The exact attribute key to match (e.g. http.request.body.content)
  • match_attribute_only_in: (OPTIONAL) Allows the key to be matched in only a specific part of the signal.
    • Supported values include:
      • SPAN: only look at span-level attributes (not resource, scope, or event attributes)
      • RESOURCE: only look at resource-level attributes (not span, scope, or event attributes)
      • SCOPE: only look at scope-level attributes (not span, resource, or event attributes)
      • EVENT: only look at event-level attributes (not span, resource, or scope attributes)
  • destination_uri: (REQUIRED) The pattern to which to write the data.
    • Ex: gs://example-bucket/full-http/request/payloads/${trace_id}/${span_id}.txt
    • Patterns may reference other parts of the signal, including:
      • trace_id
      • span_id
      • resource.attributes
      • span.attributes
      • scope.attributes
    • Keys can be referenced with dot or bracket notation (e.g. span.attributes.foo or span.attributes[foo]).
  • content_type: (OPTIONAL) Indicates the content type of the attribute (default: AUTO)
    • Options include:
      • AUTO: attempt to infer the content type automatically
      • extract_from: expr: derive it from other information in the signal
        - Ex: extract_from: span.attributes["http.request.header.content-type"]
      • any literal string (e.g. "application/json"): to use a static value
  • fraction_to_write: (OPTIONAL) Allows downsampling of the payloads. Defaults to 1.0 (i.e., 100%)
  • fraction_written_behavior: (OPTIONAL) Defaults to REPLACE_WITH_REFERENCE.
    • Options include:
      • REPLACE_WITH_REFERENCE: replace the value with a reference to the destination location.
      • KEEP: the write is a copy, but the original data is not altered.
      • DROP: the fact that a write happened will not be recorded in the attribute
  • fraction_not_written_behavior: (OPTIONAL) Defaults to DROP.
    • Options include:
      • DROP: remove the attribute in its entirety
      • KEEP: don't modify the original data if this fraction wasn't matched

Here is a full example with the above in mind:

 - match_attribute_key: http.request.body.content
   match_attribute_only_in: SPAN
   destination_uri: "gs://${env.GCS_BUCKET}/${trace_id}/${span_id}/request.json"
   content_type: "application/json"

 - match_attribute_key: http.response.body.content
   match_attribute_only_in: SPAN
   destination_uri: "gs://${env.GCS_BUCKET}/${trace_id}/${span_id}/response.json"
   content_type: "application/json"
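The ${...} pattern interpolation in destination_uri could be implemented along these lines (a Python sketch of the substitution rules described above, illustrative only; the helper name is an assumption, and the actual component may lean on OTTL instead):

```python
# Illustrative sketch of ${...} placeholder interpolation for destination_uri.
import re

def interpolate(pattern: str, context: dict) -> str:
    """Replace ${key} placeholders with values looked up in `context`.

    Keys may use dot notation (span.attributes.foo) or bracket notation
    (span.attributes[foo]); both normalize to the same lookup path.
    """
    def lookup(match):
        key = match.group(1)
        # Normalize bracket notation to dot notation.
        key = re.sub(r"\[([^\]]+)\]", r".\1", key)
        value = context
        for part in key.split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\$\{([^}]+)\}", lookup, pattern)

ctx = {
    "trace_id": "0123456789abcdef",
    "span_id": "89abcdef",
    "span": {"attributes": {"foo": "bar"}},
}
assert interpolate("gs://b/${trace_id}/${span_id}.txt", ctx) == \
    "gs://b/0123456789abcdef/89abcdef.txt"
assert interpolate("${span.attributes[foo]}", ctx) == "bar"
assert interpolate("${span.attributes.foo}", ctx) == "bar"
```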

Telemetry data types supported

Traces

Is this a vendor-specific component?

  • This is a vendor-specific component
  • If this is a vendor-specific component, I am a member of the OpenTelemetry organization.
  • If this is a vendor-specific component, I am proposing to contribute and support it as a representative of the vendor.

Code Owner(s)

braydonk, michaelsafyan, dashpole

Sponsor (optional)

dashpole

Additional context

No response

@michaelsafyan michaelsafyan added needs triage New item requiring triage Sponsor Needed New component seeking sponsor labels Jun 24, 2024
@dashpole
Contributor

I am willing to potentially sponsor this, but I would love to see if any others have needed to store very large or sensitive attributes separately. I plan to raise this tomorrow at the SIG meeting.

@dashpole
Contributor

I raised this at the SIG meeting today, but this wasn't an issue people on the call had run into before.

@dashpole
Contributor

There is some consideration of moving the "larger" genai attributes. open-telemetry/semantic-conventions#483 (comment)

@karthikscale3

We at Langtrace are also interested in testing out this span processor, as we are also thinking about this problem. We currently have 2 GenAI OTEL instrumentation libraries: Python and TypeScript.

@lmolkova

The LLM Semconv WG is considering reporting prompts and completions in event payloads (and breaking them down into individual structured pieces) - open-telemetry/semantic-conventions#980

Still, there is a possibility that prompts/completion messages could be big. There is interest in the community to record generated images, audio, etc for debugging/evaluation purposes.

From a general semconv perspective, we don't usually define span attributes that may contain unbounded data (gen_ai.prompt and completion are temporary exceptions), and are likely to recommend events/logs payloads for this.

In this context, it could make sense to also support blob uploads with a LogProcessor. See also open-telemetry/semantic-conventions#1217, where similar concerns have been raised for logs.

@michaelsafyan
Contributor Author

In the interests of transparency, I have started related work on this here:

https://github.com/michaelsafyan/open-telemetry.opentelemetry-collector-contrib/tree/blob_writer_span_processor

I originally started with a "processor", but I'm having doubts about whether this functionality is possible with a processor, and am now looking into representing it as an "exporter" that wraps another exporter (but perhaps this is incorrect?). In any event, the (very early, not yet complete) code is in development here:

https://github.com/michaelsafyan/open-telemetry.opentelemetry-collector-contrib/tree/blob_writer_span_processor/exporter/blobattributeexporter

I appreciate the insight that this may shift to a different representation... with that in mind, I am going to try to make this more general. While I will start with span attributes to handle current representations, I will keep the naming general and allow this to grow to address write-aside to blob storage from other signal types and other parts of the signal.

@michaelsafyan
Contributor Author

Quick Status update:

  • Still working on this
  • Current ETA expectation is ~2 weeks to get a working demo

Will give another update in 2 weeks time or when this is working, whichever is sooner.

@michaelsafyan
Contributor Author

Apologies that this is taking longer than expected. I am, however, still working on this.

@michaelsafyan
Contributor Author

The general shape of this is now present and can be found in:

https://github.com/michaelsafyan/open-telemetry.opentelemetry-collector-contrib/tree/blob_writer_span_processor/connector/blobattributeuploadconnector

I still need to polish this and create end-to-end testing, but there is probably enough here to get early feedback.

Note that while the original scope was intended to focus on spans, the above covers BOTH spans AND span events, given the pivot of the GenAI semantic conventions towards span event attributes.

I also pivoted from hand-rolling the string interpolation, to trying to leverage OTTL to do it:

... this required some hackery in OTTL, though, and I am wondering if there is an even cleaner approach than this.

@codefromthecrypt
Contributor

@michaelsafyan thanks! To catch you up to date, the current semconv 1.27.0 already uses span events, so this is relevant.

What remains an open question for many is the change to log events. For example, not all backends know what to do with them, and there is some implied indexing. So, I would expect that once this is in, folks will want to transform log events (with span context) back to span events.

Do you feel up to adding a function like interpolateSpanEvent to do that? Something like logEventWithSpanContextToSpanEvent?

@michaelsafyan
Contributor Author

@codefromthecrypt can you elaborate on what you mean by "folks will want to transform log events (with span context) back to span events"? Is that so that separate logs can get processed by this connector?

The way that I'm thinking about this is that blobattributeuploadconnector will be a generic component that enables:

  1. Uploading attribute content to a blob storage destination.
  2. Replacing the original attribute value with a "Foreign Attribute Reference" (see foreignattr.go)

What I have there now targets:

  • span attributes
  • span event attributes

A logical expansion of this logic would be to also handle:

  • log attributes
  • (maybe?) log body

Other types of conversions (such as span events to logs, or logs back into span events) make sense and would be useful, but should probably be considered out of scope for this particular component and tracked in a separate issue. That said, I agree it is important for different users to be able to decide whether their events data is recorded as events attached to a span or as separate logs (and a connector is likely a good way to implement that).
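As a rough illustration of step 2 above (a Python sketch; the field and function names here are assumptions, not the actual contents of foreignattr.go, and the real component is Go):

```python
# Hypothetical sketch of replacing an attribute value with a
# "Foreign Attribute Reference"; field names are assumptions.
def replace_with_foreign_ref(attributes: dict, key: str,
                             uri: str, content_type: str) -> dict:
    """Swap a large attribute value for a small reference to the uploaded blob."""
    if key in attributes:
        attributes[key] = {
            "ref.uri": uri,                    # where the blob was written
            "ref.content_type": content_type,  # what the blob contains
        }
    return attributes

attrs = {"http.request.body.content": "...large payload..."}
out = replace_with_foreign_ref(
    attrs, "http.request.body.content",
    "gs://example-bucket/abc123/def456/request.json", "application/json")
assert out["http.request.body.content"]["ref.uri"].startswith("gs://")
assert out["http.request.body.content"]["ref.content_type"] == "application/json"
```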

@codefromthecrypt
Contributor

@michaelsafyan so the main question about log events was in relation to the genai spec, which is about to switch to them. Since this spec is noted in the description, that's why I thought it might be in scope for this change/PR.

What do you think is a better place to move the topic of transform "span events to log events" to? If you don't have a specific idea, I'll open a new issue, just didn't want to duplicate this, if it was in scope.

@michaelsafyan
Contributor Author

I think new, separate issues for "Log Events -> Span Event Connector" and "Span Events -> Logs Connector" would make sense.

@michaelsafyan michaelsafyan changed the title New component: blob writer span processor New component: Blob Attribute Uploader Connector Aug 14, 2024
@codefromthecrypt
Contributor

cool. I opened #34695 first, and if I made any mistakes in the description please correct them if you have the karma to do so, or ask me to if you don't.

@michaelsafyan
Contributor Author

Just providing another update, since it has been a while.

I was out on vacation last week and had other work to catch up on this past week.

I am hoping to resume this work this coming week.

This is still on my plate.

@michaelsafyan
Contributor Author

Quick status update:

  • I believe the code (for spans and span events) is largely complete, but bugs may turn up as tests are written
  • Iterating on unit tests (traces_test.go).

I am, however, encountering merge conflicts when attempting to sync from upstream ... so this may require some additional work to resolve.
