further .text and keyword discussion #570

Closed
neu5ron opened this issue Oct 1, 2019 · 11 comments

Comments

@neu5ron

neu5ron commented Oct 1, 2019

For security use cases, large error messages, and similar data, a .text multi-field is an important part of searching.

pro(s):

  • faster searches
  • severely reduces impact on the backend (and ultimately customer/user experience)

con(s):

  • increases storage

Temporary proposal to meet in the middle (for now), instead of either adding .text to the dynamic template for all string fields or having no .text fields at all:
ECS, and thus the corresponding Beats index templates, would have a .text field for high-value/high-impact fields. The list of fields off the top of my head would be:

  • process.executable.text
  • process.args.text
  • url.original
  • user.email
  • user_agent.original
  • error.stack_trace
  • file.path
  • host.name
  • any .user.name fields
  • any .domain fields
  • any .as.organization.name fields
  • http.request.body.content
  • http.response.body.content
  • os.name.full

existing ECS reference:
#340
#104
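
The keyword-plus-.text layout proposed above can be sketched as an index-template fragment. This is a minimal illustration, not the actual ECS or Beats template: the helper name `keyword_with_text` is hypothetical, and the `ignore_above` and `norms` values are assumptions chosen for the example.

```python
import json

# Hypothetical helper: a keyword field that also carries an analyzed .text multi-field.
def keyword_with_text(ignore_above: int = 1024) -> dict:
    return {
        "type": "keyword",
        "ignore_above": ignore_above,  # assumed limit, adjust as needed
        "fields": {
            # The analyzed sub-field is what a query on "<field>.text" searches.
            "text": {"type": "text", "norms": False},
        },
    }

# Two of the fields from the list above, used purely as examples.
mappings = {
    "properties": {
        "process": {
            "properties": {
                "executable": keyword_with_text(),
                "args": keyword_with_text(),
            }
        }
    }
}

print(json.dumps(mappings, indent=2))
```

The storage cost mentioned under con(s) comes from indexing each value twice: once as a whole keyword and once as analyzed tokens.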

@webmat
Contributor

webmat commented Oct 7, 2019

Thanks @neu5ron! Yes, the subtext here is that when only keyword fields are available, people will do searches with wildcards, potentially multiple wildcards. Prefix searches (aka trailing wildcard: "myword*") are very fast on keyword fields, but a search such as "*myword*" is essentially equivalent to a full table scan.
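
The difference between the two wildcard shapes can be illustrated with a toy model. This is only a sketch that treats the keyword terms as a sorted Python list; it is not how Lucene is actually implemented, and the function names and sample terms are made up for the example.

```python
from bisect import bisect_left

def prefix_search(sorted_terms, prefix):
    """Jump to the first candidate with a binary search, then walk only matching terms."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches

def infix_search(sorted_terms, needle):
    """A "*needle*" pattern gives no starting point, so every term has to be checked."""
    return [term for term in sorted_terms if needle in term]

terms = sorted([
    "cmd.exe",
    "mypowershell.exe",
    "powershell.exe",
    "powershell_ise.exe",
    "svchost.exe",
])

print(prefix_search(terms, "powershell"))  # visits only the matching neighbourhood
print(infix_search(terms, "powershell"))   # visits every term
```

With five terms the difference is invisible; with millions of unique command lines, the second function is the "full table scan" case.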

Another subtext is that users are free to add their own .text multi-fields (as custom fields) wherever they need them, even if they're not in ECS. On the other hand, I don't think people do that a lot in real life. They either don't know they can do that, or it's way too late and they are threat hunting right now. A project to go back & identify which fields should have a .text multi-field, modify the template, then reindex their data is obviously out of the question when that happens. So "multiple wildcard search" it is... And the cluster goes 💥

So I think this is worth addressing in ECS, and making sure that a baseline of fields where full text search is useful is defined from the get-go.

Other suggestions and thoughts about this are welcome :-)

@webmat
Contributor

webmat commented Oct 10, 2019

I would add os.name.full to the list. See why in #576 (comment)

@randomuserid

So I reviewed all of our current rule sets and there are a few more ECS fields that are used by hunting searches. This list is not ranked but the process fields are clearly the most heavily used and process.args / process.command_line may be the most frequently used. These are the fields not listed above:

process.command_line
process.name
process.parent.args
process.parent.executable
process.pe.original_file_name
process.pe.signer_name
process.working_directory

There are also 13 non-ECS fields that are used by hunting searches or by ML job datafeed queries in addition to 12 ECS fields for a total of 25. The complete breakdown is in here:
https://docs.google.com/spreadsheets/d/1xOicLbct4Vk10hj4_Xaa1ViYjU9W_unFZ_Rv_mTe-ks/edit#gid=0

@webmat
Contributor

webmat commented Oct 29, 2019

@randomuserid process.parent isn't in ECS yet, but this should get in soon.

But what's process.pe?

"pe" = Portable Executable, the modern Microsoft executable file format. The original file name is a recent addition in Sysmon, intended to catch malware that renames itself during process init. The signer name is the name associated with the code-signing certificate; most Windows binaries are signed in order to identify where they came from.

@webmat
Contributor

webmat commented Oct 29, 2019

Another question that came up (this is for everyone reading this) is whether we need to set a specific Elasticsearch analyzer and/or search_analyzer for some of these fields, or is the default analyzer good enough here?

I'm thinking the potentially big fields that have an arbitrary structure (request payloads, user agent) should likely use the default one.

But fields that are structured (path, url, domain, email) may have analyzers that are better suited for specific needs?

@randomuserid

randomuserid commented Oct 29, 2019

After doing some counting today, the lack of wildcard support prevents us from using the best rules we have and practically 100% of the Endgame rules. Sigma rules are also nearly 100% blocked. We may have fewer than 40 signal rules in 7.6 if we have to hold back all of the wildcard or regex search rules. With wildcards, we would have between 200 and 300, or more.

@neu5ron
Author

neu5ron commented Oct 29, 2019

@randomuserid thanks for the analysis! I just want to make sure: this isn't a discussion about getting rid of keyword fields, which support wildcard/regex better. It is solely about adding back the .text fields that were taken away, which are very useful for many aspects of Elasticsearch, i.e.:

  • fast matching by a single value/search that hits an indexed parameter (the true root/purpose of elasticsearch)
  • Interval queries for ordering of values/parameters
  • the points mentioned above about scale for when users shouldn't use wildcards and should leverage indexed/analyzed values
  • probably a lot of other use cases I am missing that analyzed fields allow
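
As an illustration of the interval-query point above, here is a sketch of a request body that asks for two terms in order within an analyzed field. The helper name, the field, the search terms, and the `max_gaps` value are all assumptions for the example, and it presumes a `process.args.text` multi-field exists in the mapping.

```python
import json

# Hypothetical helper: builds an intervals query body for an analyzed (.text) field.
def ordered_intervals_query(field: str, terms: str, max_gaps: int = 0) -> dict:
    return {
        "query": {
            "intervals": {
                field: {
                    "match": {
                        "query": terms,        # analyzed into individual terms
                        "max_gaps": max_gaps,  # unmatched tokens allowed in between
                        "ordered": True,       # terms must appear in this order
                    }
                }
            }
        }
    }

body = ordered_intervals_query(
    "process.args.text", "powershell encodedcommand", max_gaps=3
)
print(json.dumps(body, indent=2))
```

A query like this needs token positions, which only an analyzed text field provides, so it cannot be expressed against the keyword field alone.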

@randomuserid

I don't think Mat's idea involved removing the keyword fields. Rather, it was to add text fields as a possible way to enable the hunting searches that use wildcards today, if we rewrote them to do substring matches instead of wildcards.
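
A rough sketch of what such a rewrite could look like for the simplest case. The function name is hypothetical and the field and pattern are illustrative. Note the important caveat: a match query works on analyzed tokens, so it returns the same hits as the wildcard only when the searched term lines up with whole tokens in the .text field.

```python
# Hypothetical helper: rewrite a simple "*term*" wildcard into a match query
# against the field's .text multi-field (assumed to exist in the mapping).
def wildcard_to_match(field: str, pattern: str) -> dict:
    term = pattern.strip("*")
    if not term or "*" in term or "?" in term:
        # Inner wildcards have no direct token-based equivalent in this sketch.
        raise ValueError("only simple *term* patterns are handled here")
    return {"query": {"match": {f"{field}.text": {"query": term}}}}

print(wildcard_to_match("process.command_line", "*mimikatz*"))
```

Rules with wildcards in the middle of a pattern would still need the keyword field or a different approach.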

@neu5ron
Author

neu5ron commented Oct 29, 2019

> Another question that came up (this is for everyone reading this) is whether we need to set a specific Elasticsearch analyzer and/or search_analyzer for some of these fields, or is the default analyzer good enough here?
>
> I'm thinking the potentially big fields that have an arbitrary structure (request payloads, user agent) should likely use the default one.
>
> But fields that are structured (path, url, domain, email) may have analyzers that are better suited for specific needs?

I think it could be good to add documentation on the possibility of using custom analyzers for these, and I definitely agree that different analyzers are useful, which is what I have always done in the past. However, I think this could turn into a long, drawn-out discussion/debate about what to use or not use. So I think we just need to move forward: it would never hurt to change the analyzer later, versus continuing to not have one, as is the case now :)
Either way, the one that we use in HELK for CLI and files can be used as a starting point to help some people understand.
https://github.com/Cyb3rWard0g/HELK/blob/e990fd21a09acc80867c4572a9293bf5ece881ca/docker/helk-logstash/output_templates/50-logs-winevent-all.json#L8
Case change is one thing I always like to add for file names/paths. For example, if the file is SecretPayroll.docx, it gets split into the tokens secret, payroll, and docx, versus the tokens secretpayroll and docx.
One would imagine users commonly name files with casing when they are not splitting by spaces/dashes/underscores.
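
The case-change splitting described above can be imitated in a few lines of plain Python. This is only a sketch of the idea: it is neither the HELK analyzer linked above nor Elasticsearch's own analysis chain, and the function and parameter names are made up for the example.

```python
import re
from typing import List

def tokenize_filename(name: str, split_on_case_change: bool = True) -> List[str]:
    """Split on non-alphanumeric characters, optionally on lower-to-upper case changes."""
    tokens: List[str] = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        if not part:
            continue
        if split_on_case_change:
            # Insert a break wherever a lowercase letter or digit meets an uppercase letter.
            part = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part)
        tokens.extend(piece.lower() for piece in part.split())
    return tokens

print(tokenize_filename("SecretPayroll.docx"))
print(tokenize_filename("SecretPayroll.docx", split_on_case_change=False))
```

With the case-change split enabled, a search for payroll can match the file name; without it, only the full secretpayroll token would.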

Again, I want to add: analyzed field additions matter at scale. If we don't have them, then I can be pretty certain that using leading and trailing wildcards is going to cause some major issues...

@ypid-geberit
Contributor

ypid-geberit commented Sep 2, 2021

I thought about how to make host.name easier to search for our users and came up with #1599.

@kgeller
Contributor

kgeller commented Nov 30, 2021

Closing this issue as we have since added match_only_text and wildcard types into ECS.
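
For readers landing here later, a mapping that combines the two types mentioned above might look roughly like the following. The choice of `process.command_line` is only an illustration; check the ECS field reference for which fields actually use these types.

```python
import json

# Illustrative mapping fragment: a wildcard field for pattern searches,
# with a match_only_text multi-field for token-based full-text matching.
mappings = {
    "properties": {
        "process": {
            "properties": {
                "command_line": {
                    "type": "wildcard",
                    "fields": {"text": {"type": "match_only_text"}},
                }
            }
        }
    }
}

print(json.dumps(mappings, indent=2))
```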
