[Auditbeat] Cherry-pick #9693 to 6.6: Report process errors #9845
Cherry-pick of PR #9693 to 6.6 branch. Original message:
So far, the `process` metricset has been rather strict. If an unexpected error occurred while collecting process information, the whole collection would stop and return an error.

This changes it to keep iterating through processes even when that happens. The unexpected error will be stored in the `Process` object and sent to Elasticsearch as well as logged as a warning. This only happens the first time the error is encountered for a process, not on subsequent collection cycles (with a typical collection frequency of 1s, that would flood the log and ES).

For error documents, it sets `event.kind: error` and `event.action: process_error`.

Fyi, I have renamed `ProcessInfo` to `Process` not just because it now contains more than just `types.ProcessInfo`, but also to bring it in line with `Socket` in `socket.go`. `Socket` already contains an `Error` field (and that was the inspiration for this change).

Beware: The diff GitHub shows is misleading in places; it shows replacements/deletions where a few lines have just moved down a bit.
Some additional background on why this change was made can be found in this comment thread on a PR that introduced some error catching during process collection.
If anybody wants to test what happens with errors, run it as non-root and comment out the `continue` statement in line 375. It will then report errors for processes of other users. At some point, we might want to have a test that simulates an error.