Inconsistent usage of .raw, and discussion about .raw vs .keyword naming #87

Closed
webmat opened this issue Aug 16, 2018 · 6 comments

@webmat
Contributor

webmat commented Aug 16, 2018

This issue is about two distinct problems that I think are closely related, hence the creation of a single issue.

Inconsistent usage of .raw

We currently have 7 fields ending in .raw in ECS. Here they are, listed with their type, and the "meaning" of .raw in their context:

| Field Name | Type | Meaning of .raw |
| --- | --- | --- |
| event.raw | keyword | Original value, prior to parsing |
| file.path.raw | keyword | Nested field of a multi-field |
| file.target_path.raw | keyword | Nested field of a multi-field |
| url.href.raw | keyword | Nested field of a multi-field |
| url.path.raw | keyword | Nested field of a multi-field |
| url.query.raw | keyword | Nested field of a multi-field |
| user_agent.raw | text | Original value, prior to parsing |

In this list, the middle 5 are following one of the conventions for multi-field: text indexing for the top level field (e.g. file.path) and keyword indexing for the nested field (e.g. file.path.raw).
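As a rough illustration of that multi-field convention, here is a minimal sketch of what such a mapping fragment could look like, written as a Python dict. The `multi_field` helper is hypothetical and only for illustration; the `type` and `fields` keys are the standard Elasticsearch mapping parameters, but this is not copied from the actual ECS templates.

```python
def multi_field(sub_name: str = "raw") -> dict:
    """Build a mapping fragment: text at the top level, keyword in the nested field.

    Hypothetical helper for illustration; not part of ECS or any client library.
    """
    return {
        "type": "text",  # top-level field, analyzed for full text search
        "fields": {
            sub_name: {"type": "keyword"},  # nested field, usable for aggregations
        },
    }


# file.path as a multi-field whose nested keyword field is file.path.raw
mapping = {
    "properties": {
        "file": {
            "properties": {
                "path": multi_field("raw"),
            }
        }
    }
}
```

With this shape, a query on file.path hits the analyzed text, while file.path.raw holds the exact value.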

The other two are not following that convention:

  • user_agent.raw breaks from the convention the most. It's not part of a multi-field, and it is actually of type text, not type keyword. This means a user cannot use this field for aggregations, contrary to what the .raw convention establishes. It's named .raw because it's the full user agent string, prior to breaking it up into name, OS, version fields and so on.
  • event.raw happens to be of type keyword which is good. But it's not actually part of a multi-field. It just means the original message is stored there.
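The check described above can be sketched in a few lines of Python. The .raw types come from the table in this issue; the text type of the parent fields is an assumption based on the convention as described, and `follows_raw_convention` is a hypothetical helper, not part of any ECS tooling.

```python
# Types of the .raw fields as listed in the table above. The parent entries
# (e.g. "file.path") are assumed to be text, per the multi-field convention.
FIELD_TYPES = {
    "event.raw": "keyword",
    "file.path": "text",
    "file.path.raw": "keyword",
    "file.target_path": "text",
    "file.target_path.raw": "keyword",
    "url.href": "text",
    "url.href.raw": "keyword",
    "url.path": "text",
    "url.path.raw": "keyword",
    "url.query": "text",
    "url.query.raw": "keyword",
    "user_agent.raw": "text",
}


def follows_raw_convention(name: str, field_types: dict = FIELD_TYPES) -> bool:
    """True if `name` is a keyword sub-field nested under a text field."""
    if not name.endswith(".raw"):
        return False
    parent = name[: -len(".raw")]
    return field_types.get(parent) == "text" and field_types.get(name) == "keyword"


# The .raw fields that do not fit the multi-field convention
outliers = sorted(
    name
    for name in FIELD_TYPES
    if name.endswith(".raw") and not follows_raw_convention(name)
)
```

Under these assumptions, `outliers` contains only event.raw and user_agent.raw, matching the two exceptions called out above.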

Naming of the nested field for multi-field

I'm not 100% clear on the exact timeline, but in my experience with Elasticsearch in monitoring pipelines, .raw has been the name used for a sub-field of type keyword since time immemorial. Since about Stack v5 (or perhaps ES 2.x and Kibana 4.x), I've seen the naming shift to using .keyword for the nested field, instead of .raw.

Given the inconsistency I've outlined in the first part of this issue, I wonder if we shouldn't move to the new naming convention of using .keyword for multi-field, and having the freedom to use .raw for fields where we actually mean an original value.

If we decide to stick with the .raw naming convention for multi-field, I think we should address the event.raw and user_agent.raw inconsistencies, however. Perhaps rename them event.original and user_agent.original or something to that effect.

I don't have a strong preference for .keyword or .raw, but I think we need to address the inconsistency. I was curious what folks think about this.

@ruflin
Member

ruflin commented Aug 16, 2018

We should definitely fix user_agent.raw; I would argue this is even a bug in ECS.

I'm a bit hesitant to use .keyword because it implies that someone needs knowledge of Elasticsearch. ECS is built with a focus on Elasticsearch, but it should not require knowledge of it on the consumer side (the user who writes the queries and looks at the data). Whether .raw is a multi-field or just a raw value (like event.raw) matters to the person running the process that ingests the data, but not to the person consuming it. As consuming happens much more often, and hopefully by a wider variety of people, I think we should focus on the consumer in naming.

@ruflin ruflin added the discuss label Aug 16, 2018
@vbohata

vbohata commented Aug 16, 2018

Personally I use .keyword for text fields that rarely need to be processed as keyword, and .integer for other text fields that rarely need to be processed as integer, if possible. In my experience this is more understandable for our users than .raw. In cases like user_agent.raw I use user_agent.line, although with user_agent.raw you can also see that this is the unparsed user agent. For multi-fields, .raw is very confusing because nobody can quickly see what .raw is: is it keyword/text/integer, raw of what ...

@MikePaquette
Contributor

I have no objection to changing event.raw to event.original
I generally agree with @ruflin about the consumer focus in naming.

@webmat
Contributor Author

webmat commented Aug 17, 2018

Even though keyword happens to be the actual Elasticsearch type, I don't think it's more obscure than raw. I actually remember wondering what .raw meant when I first started seeing it, in my first serious ELK stack.

My preference would actually be to go with the ES default naming of keyword. Two reasons:

  • I think it actually makes it easier to explain to a non-technical person that this field is indexed as a full "keyword" and therefore it's not broken up for full text search.
  • keyword also has the benefit of being aligned with the new Elasticsearch default, so most of the documentation and training already shows fields ending in .keyword, and ECS would simply follow that as well.

I initially stated that I didn't have a strong opinion, but as I'm writing this I'm convincing myself more and more ;-)

@webmat
Contributor Author

webmat commented Aug 23, 2018

We've discussed this internally and came to the conclusion that the ambiguous fields should be renamed to .original (#102) and that we should align with the Elasticsearch default of using .keyword instead of .raw (#103).
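For anyone migrating field names, the two decisions can be sketched as a small rename rule. This is only an illustration of the conclusion stated here, not the actual changes made in #102 or #103; `rename_raw_field` and `ORIGINAL_VALUE_FIELDS` are hypothetical names.

```python
# The two fields from this issue where .raw means "original value"
# rather than "keyword sub-field of a multi-field".
ORIGINAL_VALUE_FIELDS = {"event.raw", "user_agent.raw"}


def rename_raw_field(field: str) -> str:
    """Map an old .raw field name to the naming agreed on in this issue."""
    if not field.endswith(".raw"):
        return field  # not a .raw field, leave untouched
    base = field[: -len(".raw")]
    suffix = ".original" if field in ORIGINAL_VALUE_FIELDS else ".keyword"
    return base + suffix


renamed = {name: rename_raw_field(name) for name in ("event.raw", "url.path.raw", "url.path")}
```

Here event.raw maps to event.original, url.path.raw maps to url.path.keyword, and url.path is left as it is.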

@webmat webmat closed this as completed Aug 23, 2018
@webmat
Contributor Author

webmat commented Aug 23, 2018

We also discussed revisiting the current list of fields to see if more of them should be multi-field (#104).

webmat pushed a commit to webmat/ecs that referenced this issue Sep 19, 2018
The field that actually started this whole discussion had been forgotten from elastic#87 and elastic#103 :-)