Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/windowseventlog] Extract parsing logic #34131

Open
djaglowski opened this issue Jul 17, 2024 · 16 comments · May be fixed by #34720
Open

[receiver/windowseventlog] Extract parsing logic #34131

djaglowski opened this issue Jul 17, 2024 · 16 comments · May be fixed by #34720
Labels
enhancement New feature or request needs triage New item requiring triage receiver/windowseventlog

Comments

@djaglowski
Copy link
Member

Component(s)

receiver/windowseventlog

Is your feature request related to a problem? Please describe.

The receiver currently forces users to choose between raw xml events or parsed. There are cases where users may need both. (See open-telemetry/semantic-conventions#1217 and open-telemetry/opentelemetry-specification#3932.)

Describe the solution you'd like

Instead of forcing users to make a choice between raw or parsed, I proposed that we should standardize on raw within the receiver and separate the parsing functionality. Parsing can be provided as both a stanza operator and OTTL function.

Suggested migration path:

  1. Extract a stanza parser from the windows event log input operator. Use the raw flag on the input operator to control whether this parser is embedded and used within the input operator. At this point there has been no change to user-facing functionality.
  2. Add a feature gate (e.g. wel.alwaysRaw) controlling whether the raw flag may be used at all. In alpha stage, the flag may still be used.
  3. When the feature gate moves to beta, also deprecate the raw parameter. It may still be used, but requires disabling the feature gate.
  4. When the gate moves to stable, remove the parameter altogether. Users who want to parse the raw xml can still attach the new parser operator to do so.

Later:

Describe alternatives you've considered

No response

Additional context

No response

@djaglowski djaglowski added enhancement New feature or request needs triage New item requiring triage labels Jul 17, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@pjanotti
Copy link
Contributor

@djaglowski my understanding is that this can also help with issues like #32952, correct?

@djaglowski djaglowski changed the title [receiver/windowseventlog] [receiver/windowseventlog] Extract parsing logic Jul 18, 2024
@djaglowski
Copy link
Member Author

@djaglowski my understanding is that this can also help with issues like #32952, correct?

I suppose by delaying parsing logic until later it may be easier to offer alternative parsing logic. Is that what you're getting at or something else?

@pjanotti
Copy link
Contributor

Yes, my assumption is that by delaying parsing logic and ensuring that we can handle all data via OTTL there won't be the need to keep changing formats as suggested in #32952.

@djaglowski
Copy link
Member Author

I think that the windows event xml is complex enough that there's value in having a dedicated parser for it, kind of like the new container parser but if it can be replaced by granular parsing operations then I'm not necessarily opposed.

@djaglowski
Copy link
Member Author

Looking into this a bit more, it's not entirely clear to me that it is possible to parse the raw format.

The problem is that both raw and formatted logs are created from syscalls which require an event handle, but the event handle is not available after we emit the log record.

There might be some way to recreate the equivalent logic, but my interpretation at this point is that the events may need to be rendered in the receiver. This would mean that users needing both formats (e.g. for different backends) must set up two receivers to read the same data. Perhaps a better alternative here is to allow one payload to carry both formats (likely by replacing the raw config option with format: raw | parsed | both). If using the both setting, we could emit what would normally be the raw body as attributes[log.record.original].

@pjanotti, do you reach the same conclusion or am I missing a way to postpone parsing?

@pjanotti
Copy link
Contributor

pjanotti commented Aug 8, 2024

No, @djaglowski you are not missing a way to postpone that. I was with the mental model of ETW events, but, here we are dealing with Event Logs.

For Event Logs, the API only gives the opaque handle and returns XML from it. The XML has the same schema for raw and formatted, the raw just don't have the RenderingInfo. In a sense the complete log record is the formatted one since it contains the message (and other user friendly renderings) that the "raw" doesn't have.

Same event, as body of raw and formatted:

<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
    <System>
        <Provider Name='otelcorecol' />
        <EventID Qualifiers='0'>1</EventID>
        <Version>0</Version>
        <Level>4</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime='2024-08-08T22:20:04.7760478Z' />
        <EventRecordID>3005055</EventRecordID>
        <Correlation />
        <Execution ProcessID='36124' ThreadID='0' />
        <Channel>Application</Channel>
        <Computer>MyComputer</Computer>
        <Security UserID='S-1-5-21-1783686499-2158177463-2193993347-31799' />
    </System>
    <EventData>
        <Data>Creating event provider for 'otelcorecol'</Data>
    </EventData>
</Event>
<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
    <System>
        <Provider Name='otelcorecol' />
        <EventID Qualifiers='0'>1</EventID>
        <Version>0</Version>
        <Level>4</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime='2024-08-08T22:20:04.7760478Z' />
        <EventRecordID>3005055</EventRecordID>
        <Correlation />
        <Execution ProcessID='36124' ThreadID='0' />
        <Channel>Application</Channel>
        <Computer>MyComputer</Computer>
        <Security UserID='S-1-5-21-1783686499-2158177463-2193993347-31799' />
    </System>
    <EventData>
        <Data>Creating event provider for 'otelcorecol'</Data>
    </EventData>
    <RenderingInfo Culture='en-US'>
        <Message>Creating event provider for 'otelcorecol'</Message>
        <Level>Information</Level>
        <Opcode>Info</Opcode>
        <Keywords>
            <Keyword>Classic</Keyword>
        </Keywords>
    </RenderingInfo>
</Event>

As you can see the formatted carries all the info from the raw one. I don't see a scenario that someone needs both.

@djaglowski
Copy link
Member Author

Thanks for the detailed example @pjanotti.

I don't see a scenario that someone needs both.

There are indeed times when both are needed. In short, it is necessary because some backends require the raw xml bytes, while others prefer that it be parsed ahead of time. I've described such cases in more detail here when arguing for a dedicated original body field (which ended up being a semantic convention).

The examples you gave show important distinctions in terms of the information conveyed, but I'm more focused on the format in which the information is represented. The difference between raw and formatted is quite substantial in this regard. A raw log body is a []byte containing xml, but a formatted body is an object with predefined fields that we've populated from the contents of the xml.

@pjanotti
Copy link
Contributor

pjanotti commented Aug 9, 2024

Ah, I think I got it now @djaglowski

In this case, my first reaction is to treat the XML as the raw log body, no matter if it contains the RenderingInfo or not. We could also unify the formatted body, since no matter how we obtained the XML we can always fill most of the predefined fields that you linked above.

With the unified formatted body we could also use your suggestion to, optionally, pass the original XML (with or without RenderingInfo) via attributes[log.record.original].

@djaglowski
Copy link
Member Author

Great, that's sounds just right to me.


Circling back to the configuration then, do you think we should have a format: raw | parsed | both instead of the current raw bool? The alternative that I can think of would be to add a rendering_info bool and define the behavior something like this:

raw rendering_info Result
true false same as "raw: true" today
false true same as "raw: false" today
false false similar to "raw: false" today, but don't include the rendering info in the formatted body
true true similar to "raw: false" today, but include the raw body as "log.record.original" attribute

This feels a little more complex but I suppose it gives users the option to emit a formatted body without having to make the syscall to retrieve the rendering info.

@pjanotti
Copy link
Contributor

pjanotti commented Aug 9, 2024

The current option raw mixes the final output with the question of including or not RenderingInfo, I think we should separate them. If starting from zero I would have (using long names to reduce ambiguity):

  • output_format:
    1. raw or perhaps xml: log entry is the XML string as of today, i.e.: the same as raw equal true (but with the XML including the rendering info unless suppressed, see below)
    2. parsed: log entry is a map similar as current object (but updated to really cover all the XML, anyway, that's a separated issue)
  • suppress_rendering_info: a boolean to tell if the user really wants to skip the attempt to generate the rendering info. Obs.:
    1. This is a type of optimization: the user setting it means: don't pay the cost of attempting to render the event friendly message
    2. On the default there is no guarantee that the rendering info is going to be always present, it only tells that there will be an attempt to get it
  • log_original_record: another boolean, ignored if output_format is raw. It adds the XML to the output of parsed. Alternatively, we could add a 3rd output_format, something like parsed_with_original_record, and remove this option.

This is a larger change to current settings, but, I think it is more comprehensible and better reflects the options.

@djaglowski
Copy link
Member Author

I like your design. I think we could migrate to it relatively painlessly as long as each of the current behaviors has an equivalent to some combination of the settings. Even better if defaults for the new settings achieve the current default behavior.

@djaglowski
Copy link
Member Author

Circling back to the original proposal here, I think it should be possible to move "parsing" into a seperate stanza operator, and eventually a processor or OTTL function.

The input to this parser is an xml string, which can be unmarshaled into the EventXML struct. Then values can be moved into a structured body as we do today.

Incorporating this into the design in your most recent comment, this would mean suppress_rendering_info is still a setting we want in the input operator, since it directly influences which syscalls must be made. However, it removes the need for output_format and log_original_record settings - as if raw: true is always the case. The parser or processor could then be applied to the body if desired. WDYT?

@pjanotti
Copy link
Contributor

My understanding of the stanza operator is superficial, but, it seems reasonable: the lower level is always the XML, a stanza operator could transform it into the EventXML struct if needed.

Some Qs:

  • Would EventXML always carry the original XML?
  • Do we want to keep/drop the original XML as configurable option?
  • What should be the default?

@djaglowski
Copy link
Member Author

Would EventXML always carry the original XML?

I am imagining that EventXML is just the go struct which is used to unmarshal the original xml. Then we immediately convert this into a map[string]any and assign it (typically to the body). The map[string]any can be the same as we currently produce in parseBody.

Do we want to keep/drop the original XML as configurable option? What should be the default?

I think there should be a bool that automatically moves the original xml to attributes[log.record.original]. I'm not sure what the default should be but I think this is a question for many stanza parsers (and perhaps many OTTL log functions). For now I would say the default is to not preserve the original in this way, since it may increase the size of the record substantially.

@pjanotti
Copy link
Contributor

Sounds reasonable to me @djaglowski

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs triage New item requiring triage receiver/windowseventlog
Projects
None yet
2 participants