further .text and keyword discussion #570

neu5ron · 2019-10-01T14:46:07Z

For security use cases, large error messages, etc., .text multi-fields are an important part of searching.

pro(s):

faster searches
severely reduces impact on the backend (and ultimately customer/user experience)

con(s):

increases storage

Temporary proposal to meet in the middle (for now), instead of either adding .text to the dynamic template for all string fields or having no .text fields at all:
ECS, and thus the corresponding Beats index templates, would define a .text field for high-value/high-impact fields. The list of fields off the top of my head would be:

process.executable.text
process.args.text
url.original
user.email
user_agent.original
error.stack_trace
file.path
host.name
any .user.name fields
any .domain fields
any .as.organization.name fields
http.request.body.content
http.response.body.content
os.name.full

existing ECS reference:
#340
#104
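To make the proposal concrete, here is a minimal sketch of what a keyword field with a .text multi-field could look like in an index mapping. The three field names are an arbitrary subset of the list above, and `ignore_above=1024` and the helper names are assumptions made for illustration, not a statement of what ECS or the Beats templates ship.

```python
import json

# Hypothetical subset of the fields proposed above; not an official ECS list.
FIELDS_WITH_TEXT = ["process.executable", "process.args", "error.stack_trace"]


def keyword_with_text(ignore_above=1024):
    """Return a keyword mapping that also indexes an analyzed .text multi-field."""
    return {
        "type": "keyword",
        "ignore_above": ignore_above,  # assumed value, chosen for illustration
        "fields": {"text": {"type": "text"}},
    }


def build_properties(field_names):
    """Nest dotted field names into an Elasticsearch 'properties' tree."""
    root = {}
    for name in field_names:
        node = root
        parts = name.split(".")
        for part in parts[:-1]:
            # Descend into (or create) the object that holds the next level.
            node = node.setdefault(part, {"properties": {}})["properties"]
        node[parts[-1]] = keyword_with_text()
    return root


text_mappings = {"properties": build_properties(FIELDS_WITH_TEXT)}
print(json.dumps(text_mappings, indent=2))
```

With a mapping like this, `process.args` stays available for exact matches and aggregations, while `process.args.text` is analyzed for full-text search, at the cost of the extra storage listed under the cons.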


webmat · 2019-10-07T14:01:30Z

Thanks @neu5ron! Yes, the subtext here is that when only keyword fields are available, people will do searches with wildcards, potentially multiple wildcards. Prefix searches (aka trailing wildcard: "myword*") are very fast on keyword fields, but a search such as "*myword*" is essentially equivalent to a full table scan.
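The difference between the two kinds of wildcard can be sketched with a toy model. This is not how Lucene actually stores or walks terms; it only assumes the terms of a keyword field are kept in sorted order, and the term values and function names are made up for illustration.

```python
import bisect
import fnmatch

# Toy stand-in for a keyword field's sorted term dictionary (hypothetical values).
TERMS = sorted(
    ["cmd.exe", "mimikatz.exe", "mywordpad.exe", "notmyword.dll", "powershell.exe"]
)


def prefix_search(terms, prefix):
    """Find terms starting with `prefix`; returns (matches, terms_examined)."""
    # Binary search jumps straight to the first candidate term.
    start = bisect.bisect_left(terms, prefix)
    matches = []
    examined = 0
    for term in terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches, examined


def wildcard_search(terms, pattern):
    """Match a pattern with a leading wildcard; every term has to be examined."""
    matches = [t for t in terms if fnmatch.fnmatchcase(t, pattern)]
    return matches, len(terms)


print(prefix_search(TERMS, "myword"))      # (['mywordpad.exe'], 2)
print(wildcard_search(TERMS, "*myword*"))  # (['mywordpad.exe', 'notmyword.dll'], 5)
```

The prefix search touches only the neighbourhood of the match, while the leading wildcard has to test every term, which is the "full table scan" behaviour described above.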

Another subtext is that users are free to add their own .text multi-fields (as custom fields) wherever they need them, even if they're not in ECS. On the other hand, I don't think people do that a lot in real life. They either don't know they can do that, or it's way too late and they are threat hunting right now. A project to go back and identify which fields should have a .text multi-field, modify the template, then reindex their data is obviously out of the question when that happens. So "multiple wildcard search" it is... And the cluster goes 💥

So I think this is worth addressing in ECS, making sure that a baseline of fields where full-text search is useful is defined from the get-go.

Other suggestions and thoughts about this are welcome :-)

webmat · 2019-10-10T19:43:57Z

I would add os.name.full to the list. See why in #576 (comment)

randomuserid · 2019-10-24T19:52:10Z

So I reviewed all of our current rule sets and there are a few more ECS fields that are used by hunting searches. This list is not ranked but the process fields are clearly the most heavily used and process.args / process.command_line may be the most frequently used. These are the fields not listed above:

process.command_line
process.name
process.parent.args
process.parent.executable
process.pe.original_file_name
process.pe.signer_name
process.working_directory

There are also 13 non-ECS fields that are used by hunting searches or by ML job datafeed queries in addition to 12 ECS fields for a total of 25. The complete breakdown is in here:
https://docs.google.com/spreadsheets/d/1xOicLbct4Vk10hj4_Xaa1ViYjU9W_unFZ_Rv_mTe-ks/edit#gid=0

webmat · 2019-10-29T17:25:20Z

@randomuserid process.parent isn't in ECS yet, but this should get in soon.

But what's process.pe?

randomuserid (inline edit): "pe" = portable executable, the modern Microsoft executable file format. The original file name is a recent addition in Sysmon, intended to catch malware that renames itself during process init. The signer name is the name associated with the code-signing certificate; most Windows binaries are signed in order to identify where they came from.

webmat · 2019-10-29T17:33:34Z

Another question that came up (this is for everyone reading this) is whether we need to set a specific Elasticsearch analyzer and/or search_analyzer for some of these fields, or is the default analyzer good enough here?

I'm thinking the potentially big fields that have an arbitrary structure (request payloads, user agent) should likely use the default one.

But fields that are structured (path, url, domain, email) may have analyzers that are better suited for specific needs?
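As one possible reading of the question above, the sketch below pairs structured fields with structure-aware tokenizers and leaves free-form fields on the default analyzer. The `path_hierarchy` and `uax_url_email` tokenizers are built into Elasticsearch; the analyzer names, the helper, and the choice of which field gets which analyzer are assumptions for illustration, not a recommendation from this thread.

```python
# Index settings defining two custom analyzers (names are hypothetical).
analysis_settings = {
    "analysis": {
        "analyzer": {
            "path_analyzer": {"type": "custom", "tokenizer": "path_hierarchy"},
            "url_email_analyzer": {
                "type": "custom",
                "tokenizer": "uax_url_email",
                "filter": ["lowercase"],
            },
        }
    }
}


def text_field(analyzer=None):
    """Build a text mapping; omit 'analyzer' to fall back to the default one."""
    field = {"type": "text"}
    if analyzer is not None:
        field["analyzer"] = analyzer
    return field


# Assumed pairing of .text multi-fields and analyzers, for illustration only.
text_multi_fields = {
    "file.path": text_field("path_analyzer"),
    "user.email": text_field("url_email_analyzer"),
    "user_agent.original": text_field(),  # arbitrary structure: default analyzer
}

for name, mapping in text_multi_fields.items():
    print(name, mapping)
```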

randomuserid · 2019-10-29T23:04:26Z

After doing some counting today, the lack of wildcard support prevents us from using the best rules we have and practically 100% of the Endgame rules. Sigma rules are also nearly 100% blocked. We may have fewer than 40 signal rules in 7.6 if we have to hold back all of the wildcard or regex search rules. With wildcards, we would have between 200 and 300, or more.

neu5ron · 2019-10-29T23:17:18Z

@randomuserid thanks for the analysis! I just want to make sure: this isn't a discussion about getting rid of keyword fields, which support wildcard/regex better. It is solely about adding back the .text fields that were taken away, which are very useful for many aspects of Elasticsearch, i.e.:

fast matching by a single value/search that hits an indexed parameter (the true root/purpose of elasticsearch)
Interval queries for ordering of values/parameters
the points mentioned above about scale for when users shouldn't use wildcards and should leverage indexed/analyzed values
probably a lot of other use cases I am missing that analyzed fields allow

randomuserid · 2019-10-29T23:25:04Z

I don't think Mat's idea involved removing the keyword fields. Rather, it was to add text fields as a possible way to enable the hunting searches that use wildcards today, if we rewrote them to do substring matches instead of wildcards.
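A rough sketch of what such a rewrite could look like in Query DSL terms, using `wildcard` and `match` queries. The helper names and the `mimikatz` search term are hypothetical, and the `.text` multi-field is assumed to exist on the field. Note that a `match` query matches whole analyzed tokens, so it is not a drop-in replacement for every wildcard pattern.

```python
def wildcard_query(field, pattern):
    """Wildcard query against a keyword field (a leading '*' forces a term scan)."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


def text_match_query(field, words):
    """Match query against the analyzed .text multi-field of the same field."""
    return {
        "query": {"match": {field + ".text": {"query": words, "operator": "and"}}}
    }


# Today's hunting search, and a possible token-based rewrite of it.
before = wildcard_query("process.command_line", "*mimikatz*")
after = text_match_query("process.command_line", "mimikatz")

print(before)
print(after)
```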

neu5ron · 2019-10-29T23:31:41Z

> Another question that came up (this is for everyone reading this) is whether we need to set a specific Elasticsearch analyzer and/or search_analyzer for some of these fields, or is the default analyzer good enough here?
>
> I'm thinking the potentially big fields that have an arbitrary structure (request payloads, user agent) should likely use the default one.
>
> But fields that are structured (path, url, domain, email) may have analyzers that are better suited for specific needs?

I think it could be good to add documentation on the possibility of using custom analyzers for these, and I definitely agree that different analyzers are useful, which is what I have always done in the past. However, I think this could turn into a long, drawn-out debate about what to use or not use, so I think we just need to move forward: it would never hurt to change the analyzer later, versus continuing to not have one, as is the case now :)
Either way, the one that we use in HELK for CLI and files can be used as a starting point to help understand the options:
https://github.com/Cyb3rWard0g/HELK/blob/e990fd21a09acc80867c4572a9293bf5ece881ca/docker/helk-logstash/output_templates/50-logs-winevent-all.json#L8
Case change is one thing I always like to add for file names/paths. For example, if the file is SecretPayroll.docx, it gets split into the tokens secret, payroll, and docx, rather than secretpayroll and docx.
One would imagine users commonly use casing to separate words in file names when they are not splitting by spaces/dashes/underscores.
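The effect of splitting on case changes can be approximated in a few lines. This is a plain-regex emulation written for illustration, not the HELK analyzer linked above and not Elasticsearch's own implementation; the function names are hypothetical. In Elasticsearch itself, this behaviour is available through the `split_on_case_change` option of the `word_delimiter` token filter.

```python
import re


def split_plain(name):
    """Split on non-alphanumerics only, then lowercase."""
    return [chunk.lower() for chunk in re.split(r"[^A-Za-z0-9]+", name) if chunk]


def split_on_case_change(name):
    """Split on non-alphanumerics and on lower-to-upper case transitions."""
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9]+", name):
        # Insert a space at each lower/digit -> upper boundary, e.g. "tP" in SecretPayroll.
        spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", chunk)
        tokens.extend(piece.lower() for piece in spaced.split())
    return tokens


print(split_plain("SecretPayroll.docx"))           # ['secretpayroll', 'docx']
print(split_on_case_change("SecretPayroll.docx"))  # ['secret', 'payroll', 'docx']
```

With the second tokenization, a search for `payroll` can hit the file as an indexed token, without resorting to `*payroll*`.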

Again, I want to add: analyzed field additions matter at scale. If we don't have them, I can be pretty certain that using leading and trailing wildcards is going to cause some major issues...

ypid-geberit · 2021-09-02T18:26:38Z

I thought about how to make host.name easier to search for our users and came up with #1599.

kgeller · 2021-11-30T17:50:17Z

Closing this issue as we have since added match_only_text and wildcard types into ECS.
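For readers unfamiliar with the two types mentioned here, a minimal sketch of a mapping that combines them is shown below. Both `wildcard` and `match_only_text` are Elasticsearch field types; pairing them on `process.command_line` and the helper name are chosen for illustration and should be checked against the current ECS field definitions.

```python
def wildcard_with_text():
    """Wildcard-typed field with a match_only_text multi-field for token search."""
    return {
        "type": "wildcard",
        "fields": {"text": {"type": "match_only_text"}},
    }


# Illustrative mapping fragment; the field choice is an assumption.
resolved_mappings = {
    "properties": {
        "process": {"properties": {"command_line": wildcard_with_text()}}
    }
}

print(resolved_mappings)
```

The `wildcard` type targets the leading-wildcard and regex searches discussed earlier in the thread, while `match_only_text` offers full-text matching with a smaller storage footprint than a regular `text` field.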

This was referenced Nov 29, 2019:
Adopting ECS in Logstash elastic/logstash#11306 (Closed)
Case sensitivity for keywords #623 (Closed)

webmat mentioned this issue Dec 6, 2019:
Add the default text analyzer to some fields #680 (Merged)

kgeller closed this as completed Nov 30, 2021

ebeahan mentioned this issue Mar 14, 2022:
match_only_text and general necessity for case insensitive and exact match (search) #1837 (Closed)
