[receiver/windowseventlog] Extract parsing logic #34131

djaglowski · 2024-07-17T13:14:18Z

Component(s)

receiver/windowseventlog

Is your feature request related to a problem? Please describe.

The receiver currently forces users to choose between raw XML events and parsed events. There are cases where users may need both. (See open-telemetry/semantic-conventions#1217 and open-telemetry/opentelemetry-specification#3932.)

Describe the solution you'd like

Instead of forcing users to choose between raw and parsed, I propose that we standardize on raw within the receiver and separate out the parsing functionality. Parsing can be provided as both a stanza operator and an OTTL function.

Suggested migration path:

1. Extract a stanza parser from the windows event log input operator. Use the raw flag on the input operator to control whether this parser is embedded and used within the input operator. At this point there has been no change to user-facing functionality.
2. Add a feature gate (e.g. wel.alwaysRaw) controlling whether the raw flag may be used at all. In alpha stage, the flag may still be used.
3. When the feature gate moves to beta, also deprecate the raw parameter. It may still be used, but requires disabling the feature gate.
4. When the gate moves to stable, remove the parameter altogether. Users who want to parse the raw xml can still attach the new parser operator to do so.

Later:

- Extract the parsing functionality in a way where it can also be used by OTTL. See [pkg/ottl] Parse uri string to url.* SemConv attributes #32433 for a similar example.
- Update the parser to automatically stash the original log body into attributes["log.record.original"] (once the semantic convention is released).

Describe alternatives you've considered

No response

Additional context

No response

github-actions · 2024-07-17T13:14:36Z

Pinging code owners:

receiver/windowseventlog: @armstrmi @pjanotti

See Adding Labels via Comments if you do not have permissions to add labels yourself.

pjanotti · 2024-07-18T02:08:18Z

@djaglowski my understanding is that this can also help with issues like #32952, correct?

djaglowski · 2024-07-18T13:29:59Z

> @djaglowski my understanding is that this can also help with issues like #32952, correct?

I suppose by delaying parsing logic until later it may be easier to offer alternative parsing logic. Is that what you're getting at or something else?

pjanotti · 2024-07-19T16:58:16Z

Yes, my assumption is that by delaying parsing logic and ensuring that we can handle all data via OTTL there won't be the need to keep changing formats as suggested in #32952.

djaglowski · 2024-07-19T17:02:40Z

I think that the windows event xml is complex enough that there's value in having a dedicated parser for it, kind of like the new container parser. But if it can be replaced by granular parsing operations, then I'm not necessarily opposed.

djaglowski · 2024-08-08T18:37:19Z

Looking into this a bit more, it's not entirely clear to me that it is possible to parse the raw format.

The problem is that both raw and formatted logs are created from syscalls which require an event handle, but the event handle is not available after we emit the log record.

There might be some way to recreate the equivalent logic, but my interpretation at this point is that the events may need to be rendered in the receiver. This would mean that users needing both formats (e.g. for different backends) must set up two receivers to read the same data. Perhaps a better alternative here is to allow one payload to carry both formats (likely by replacing the raw config option with format: raw | parsed | both). If using the both setting, we could emit what would normally be the raw body as attributes[log.record.original].

@pjanotti, do you reach the same conclusion or am I missing a way to postpone parsing?

pjanotti · 2024-08-08T22:52:57Z

No, @djaglowski, you are not missing a way to postpone that. I had the mental model of ETW events in mind, but here we are dealing with Event Logs.

For Event Logs, the API only gives the opaque handle and returns XML from it. The XML has the same schema for raw and formatted; the raw one just doesn't have the RenderingInfo element. In a sense the complete log record is the formatted one, since it contains the message (and other user-friendly renderings) that the "raw" one doesn't have.

The same event, as the body of raw and of formatted, in that order:

<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
    <System>
        <Provider Name='otelcorecol' />
        <EventID Qualifiers='0'>1</EventID>
        <Version>0</Version>
        <Level>4</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime='2024-08-08T22:20:04.7760478Z' />
        <EventRecordID>3005055</EventRecordID>
        <Correlation />
        <Execution ProcessID='36124' ThreadID='0' />
        <Channel>Application</Channel>
        <Computer>MyComputer</Computer>
        <Security UserID='S-1-5-21-1783686499-2158177463-2193993347-31799' />
    </System>
    <EventData>
        <Data>Creating event provider for 'otelcorecol'</Data>
    </EventData>
</Event>

<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
    <System>
        <Provider Name='otelcorecol' />
        <EventID Qualifiers='0'>1</EventID>
        <Version>0</Version>
        <Level>4</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime='2024-08-08T22:20:04.7760478Z' />
        <EventRecordID>3005055</EventRecordID>
        <Correlation />
        <Execution ProcessID='36124' ThreadID='0' />
        <Channel>Application</Channel>
        <Computer>MyComputer</Computer>
        <Security UserID='S-1-5-21-1783686499-2158177463-2193993347-31799' />
    </System>
    <EventData>
        <Data>Creating event provider for 'otelcorecol'</Data>
    </EventData>
    <RenderingInfo Culture='en-US'>
        <Message>Creating event provider for 'otelcorecol'</Message>
        <Level>Information</Level>
        <Opcode>Info</Opcode>
        <Keywords>
            <Keyword>Classic</Keyword>
        </Keywords>
    </RenderingInfo>
</Event>

As you can see, the formatted one carries all the info from the raw one. I don't see a scenario that someone needs both.
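The difference between the two payloads above can be checked mechanically. The following is a minimal Go sketch, assuming a hypothetical trimmed-down `event` struct (not the receiver's actual EventXML type) and hypothetical helper names `decodeEvent` and `hasRenderingInfo`. It relies only on the standard `encoding/xml` package, which leaves a pointer field nil when the corresponding element is absent.

```go
package main

import "encoding/xml"

// event is a hypothetical, trimmed-down struct for this sketch. It is not the
// receiver's EventXML type and covers only a few of the fields shown above.
type event struct {
	XMLName       xml.Name       `xml:"Event"`
	Provider      provider       `xml:"System>Provider"`
	EventID       uint32         `xml:"System>EventID"`
	Channel       string         `xml:"System>Channel"`
	Data          []string       `xml:"EventData>Data"`
	RenderingInfo *renderingInfo `xml:"RenderingInfo"` // nil when the element is absent
}

type provider struct {
	Name string `xml:"Name,attr"`
}

type renderingInfo struct {
	Message string `xml:"Message"`
	Level   string `xml:"Level"`
}

// decodeEvent unmarshals either payload; both share the same schema.
func decodeEvent(raw []byte) (*event, error) {
	var e event
	if err := xml.Unmarshal(raw, &e); err != nil {
		return nil, err
	}
	return &e, nil
}

// hasRenderingInfo reports whether the payload carried a RenderingInfo element.
func (e *event) hasRenderingInfo() bool {
	return e.RenderingInfo != nil
}
```

Under these assumptions, decoding the first payload yields `hasRenderingInfo() == false` and decoding the second yields `true`, while the System and EventData fields decode identically in both cases.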

djaglowski · 2024-08-09T13:43:29Z

Thanks for the detailed example @pjanotti.

> I don't see a scenario that someone needs both.

There are indeed times when both are needed. In short, it is necessary because some backends require the raw xml bytes, while others prefer that it be parsed ahead of time. I've described such cases in more detail here when arguing for a dedicated original body field (which ended up being a semantic convention).

The examples you gave show important distinctions in terms of the information conveyed, but I'm more focused on the format in which the information is represented. The difference between raw and formatted is quite substantial in this regard. A raw log body is a []byte containing xml, but a formatted body is an object with predefined fields that we've populated from the contents of the xml.

pjanotti · 2024-08-09T15:53:55Z

Ah, I think I got it now @djaglowski

In this case, my first reaction is to treat the XML as the raw log body, no matter if it contains the RenderingInfo or not. We could also unify the formatted body, since no matter how we obtained the XML we can always fill most of the predefined fields that you linked above.

With the unified formatted body we could also use your suggestion to, optionally, pass the original XML (with or without RenderingInfo) via attributes[log.record.original].

djaglowski · 2024-08-09T16:08:50Z

Great, that sounds just right to me.

Circling back to the configuration then, do you think we should have a format: raw | parsed | both instead of the current raw bool? The alternative that I can think of would be to add a rendering_info bool and define the behavior something like this:

| raw   | rendering_info | Result |
|-------|----------------|--------|
| true  | false          | same as "raw: true" today |
| false | true           | same as "raw: false" today |
| false | false          | similar to "raw: false" today, but don't include the rendering info in the formatted body |
| true  | true           | similar to "raw: false" today, but include the raw body as "log.record.original" attribute |

This feels a little more complex but I suppose it gives users the option to emit a formatted body without having to make the syscall to retrieve the rendering info.
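To make the four combinations concrete, here is a small Go sketch of the table as a pure function. The `behavior` struct and the `resolve` function are hypothetical names for illustration only, and the mapping reflects one reading of the table: "raw: true" today is taken to mean an XML body without rendering info, and "raw: false" today a structured body with it.

```go
package main

// behavior describes what the input operator would emit for one table row.
// It is a hypothetical type used only in this sketch.
type behavior struct {
	ParsedBody          bool // body is a structured object rather than the XML string
	RenderingInfo       bool // the rendering info syscall is made and its result included
	OriginalAsAttribute bool // raw XML is also stored as the "log.record.original" attribute
}

// resolve maps the two proposed booleans onto the behavior listed in the table.
func resolve(raw, renderingInfo bool) behavior {
	switch {
	case raw && !renderingInfo:
		// Same as "raw: true" today.
		return behavior{}
	case !raw && renderingInfo:
		// Same as "raw: false" today.
		return behavior{ParsedBody: true, RenderingInfo: true}
	case !raw && !renderingInfo:
		// Formatted body without paying for the rendering info.
		return behavior{ParsedBody: true}
	default:
		// raw && renderingInfo: formatted body plus the original XML as an attribute.
		return behavior{ParsedBody: true, RenderingInfo: true, OriginalAsAttribute: true}
	}
}
```

Writing it out this way shows why the two-boolean form feels more complex: in the last row, setting raw to true no longer produces a raw body.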

pjanotti · 2024-08-09T17:10:05Z

The current raw option mixes the final output format with the question of whether to include RenderingInfo. I think we should separate them. If starting from zero I would have (using long names to reduce ambiguity):

- output_format:
  1. raw (or perhaps xml): the log entry is the XML string as of today, i.e. the same as raw equal to true (but with the XML including the rendering info unless suppressed, see below).
  2. parsed: the log entry is a map similar to the current object (but updated to really cover all of the XML; that's a separate issue).
- suppress_rendering_info: a boolean that tells whether the user really wants to skip the attempt to generate the rendering info. Notes:
  1. This is a type of optimization. Setting it means: don't pay the cost of attempting to render the event's friendly message.
  2. By default there is no guarantee that the rendering info will always be present; the default only means that there will be an attempt to get it.
- log_original_record: another boolean, ignored if output_format is raw. It adds the XML to the output of parsed. Alternatively, we could add a 3rd output_format, something like parsed_with_original_record, and remove this option.

This is a larger change to current settings, but, I think it is more comprehensible and better reflects the options.
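As a rough illustration of how these three settings could fit together, here is a Go sketch of a config struct with validation. Everything in it is hypothetical: the `Config` type, its field names, and the `FromLegacyRaw` helper are not part of the receiver. The legacy mapping assumes that today's raw: true output has no rendering info, so it sets suppress_rendering_info to reproduce that behavior.

```go
package main

import "fmt"

// Config is a hypothetical struct mirroring the three proposed settings.
type Config struct {
	OutputFormat          string // "raw" or "parsed"
	SuppressRenderingInfo bool   // skip the attempt to render the friendly message
	LogOriginalRecord     bool   // ignored when OutputFormat is "raw"
}

// Validate rejects any output_format other than the two proposed values.
func (c Config) Validate() error {
	switch c.OutputFormat {
	case "raw", "parsed":
		return nil
	default:
		return fmt.Errorf("output_format must be \"raw\" or \"parsed\", got %q", c.OutputFormat)
	}
}

// KeepOriginal reports whether the original XML should accompany the record.
// The setting only applies to parsed output, as described in the proposal.
func (c Config) KeepOriginal() bool {
	return c.OutputFormat == "parsed" && c.LogOriginalRecord
}

// FromLegacyRaw maps the current raw bool onto the proposed settings.
// Assumption: raw output today does not contain rendering info.
func FromLegacyRaw(raw bool) Config {
	if raw {
		return Config{OutputFormat: "raw", SuppressRenderingInfo: true}
	}
	return Config{OutputFormat: "parsed"}
}
```

With a mapping like this, each existing configuration would have an equivalent under the new settings, which is what a painless migration would require.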

djaglowski · 2024-08-09T18:45:31Z

I like your design. I think we could migrate to it relatively painlessly as long as each of the current behaviors has an equivalent to some combination of the settings. Even better if defaults for the new settings achieve the current default behavior.

djaglowski · 2024-08-12T17:38:24Z

Circling back to the original proposal here, I think it should be possible to move "parsing" into a separate stanza operator, and eventually a processor or OTTL function.

The input to this parser is an xml string, which can be unmarshaled into the EventXML struct. Then values can be moved into a structured body as we do today.

Incorporating this into the design in your most recent comment, this would mean suppress_rendering_info is still a setting we want in the input operator, since it directly influences which syscalls must be made. However, it removes the need for output_format and log_original_record settings - as if raw: true is always the case. The parser or processor could then be applied to the body if desired. WDYT?
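The parser described above (xml string in, structured body out) could look roughly like the following Go sketch. The `eventXML` struct here is a reduced stand-in for the real EventXML struct, `parseToBody` is a hypothetical name, and the map keys are illustrative rather than the keys the receiver actually emits.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// eventXML is a reduced stand-in for the struct the xml is unmarshaled into.
type eventXML struct {
	Provider struct {
		Name string `xml:"Name,attr"`
	} `xml:"System>Provider"`
	EventID  uint32   `xml:"System>EventID"`
	Channel  string   `xml:"System>Channel"`
	Computer string   `xml:"System>Computer"`
	Data     []string `xml:"EventData>Data"`
	Message  string   `xml:"RenderingInfo>Message"` // empty when rendering info is absent
}

// parseToBody unmarshals the xml string and moves values into a structured
// body. It needs no event handle, only the string emitted by the input operator.
func parseToBody(raw string) (map[string]any, error) {
	var e eventXML
	if err := xml.Unmarshal([]byte(raw), &e); err != nil {
		return nil, fmt.Errorf("unmarshal event xml: %w", err)
	}
	body := map[string]any{
		"provider":   map[string]any{"name": e.Provider.Name},
		"event_id":   e.EventID,
		"channel":    e.Channel,
		"computer":   e.Computer,
		"event_data": e.Data,
	}
	// Only set the message when rendering info was present in the xml.
	if e.Message != "" {
		body["message"] = e.Message
	}
	return body, nil
}
```

Because the function depends only on the xml string, the same logic could in principle back a stanza operator, a processor, or an OTTL function.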

pjanotti · 2024-08-12T19:35:41Z

My understanding of the stanza operator is superficial, but, it seems reasonable: the lower level is always the XML, a stanza operator could transform it into the EventXML struct if needed.

Some Qs:

Would EventXML always carry the original XML?
Do we want to keep/drop the original XML as configurable option?
What should be the default?

djaglowski · 2024-08-12T19:52:06Z

> Would EventXML always carry the original XML?

I am imagining that EventXML is just the Go struct which is used to unmarshal the original xml. Then we immediately convert this into a map[string]any and assign it (typically to the body). The map[string]any can be the same as we currently produce in parseBody.

> Do we want to keep/drop the original XML as configurable option? What should be the default?

I think there should be a bool that automatically moves the original xml to attributes[log.record.original]. I'm not sure what the default should be but I think this is a question for many stanza parsers (and perhaps many OTTL log functions). For now I would say the default is to not preserve the original in this way, since it may increase the size of the record substantially.
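A minimal Go sketch of that bool, assuming a hypothetical `record` type and `applyParser` helper (neither is stanza's actual entry type or API). The parse function is passed in so that the sketch stays independent of any particular parser.

```go
package main

import "fmt"

// record is a hypothetical stand-in for a log entry with a body and attributes.
type record struct {
	Body       any
	Attributes map[string]any
}

// applyParser replaces a string body with its parsed form. When
// preserveOriginal is true, the original string is first copied to
// attributes["log.record.original"]. On any error the record is left untouched.
func applyParser(r *record, parse func(string) (map[string]any, error), preserveOriginal bool) error {
	raw, ok := r.Body.(string)
	if !ok {
		return fmt.Errorf("body is %T, want string", r.Body)
	}
	parsed, err := parse(raw)
	if err != nil {
		return err
	}
	if preserveOriginal {
		if r.Attributes == nil {
			r.Attributes = map[string]any{}
		}
		r.Attributes["log.record.original"] = raw
	}
	r.Body = parsed
	return nil
}
```

Defaulting preserveOriginal to false matches the concern above: keeping the original roughly doubles the payload carried by each record.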

pjanotti · 2024-08-12T19:55:06Z

Sounds reasonable to me @djaglowski

djaglowski added the enhancement (New feature or request) and needs triage (New item requiring triage) labels Jul 17, 2024

github-actions bot added the receiver/windowseventlog label Jul 17, 2024

djaglowski changed the title ~~[receiver/windowseventlog]~~ [receiver/windowseventlog] Extract parsing logic Jul 18, 2024

djaglowski linked a pull request Aug 16, 2024 that will close this issue

[receiver/windowseventlog] Add suppress_rendering_info parameter and simplify internal logic. #34720

Draft
