-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix leak caused by input runners created when checking their configuration #23722
Fix leak caused by input runners created when checking their configuration #23722
Conversation
330929e
to
eba6805
Compare
if err != nil { | ||
return err | ||
} | ||
runner.Stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stop
is not blocking I think. Or is it? We also have a Wait
method on some inputs. If we do not 'block' there might be a chance that our tests become flaky due to the go-routine check and resources being freed eventually asynchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this stops the runner. Wait
is not exposed in the runner.
The goroutine checker has a wait loop with a configurable timeout, we already have non-flaky tests that stop the inputs and wait for their goroutines to finish, also with other features that stop goroutines asynchronously. I wouldn't expect this to introduce additional flakiness.
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. Expand to view the summary
Test stats 🧪
|
…scover-check-config
Pinging @elastic/integrations (Team:Integrations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filebeat/fileset/factory.go
Outdated
if err != nil { | ||
return nil, err | ||
return nil, fmt.Errorf("create module registry for filesets: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, fmt.Errorf("create module registry for filesets: %w", err) | |
return nil, fmt.Errorf("could not create module registry for filesets: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on changing the wording that something did not work out
Disclaimer: I'm no native speaker
I'd prefer if our logs do not use didn't
, couldn't
or similar. Instead put some emphasis on the not
by using could not
, can not
, must not
. If something "Failed", let's use failed to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼 edited to use the full form of negation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, adding could not
.
I have been changing my opinion on these kind of messages over time. Lately I am wondering if, when wrapping, it is not enough to mention what was being done. The wrapped error err
already has the root cause, and the caller knows how it affects upper layers or the user, the context provided in places like here give an orientation of what was being done.
Thinking in a made up example, these are the kind of messages logged when one or the other strategy are used:
Failed to read configuration: could not read configuration from file /somefile: could not open file: permission denied
.Failed to read configuration: read configuration from file /somefile: open file: permission denied
.
Both messages provide the same information: what failed, what was it trying to do and what was the root cause of the issue. In the first version the could not
s don't provide additional information, and when many errors are chained they can look too redundant. Second option looks enough, more compact and easier to read.
In the code, the message looks more complete when mentioning that something didn't went as expected, but this information is actually redundant with the fact that an error is being wrapped and returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, part of the problem is we do not capture stack traces (or locations) on errors + generic error types normally do not carry additional 'context' (and we do not define our own error types when we should). Plus there is no generic support to add diagnostic context to add extra information to an error value (I'm using sderr like sderr.With("file", ...).Wrap(err, "read configuration")
if I need this). Aaannnd we do not spend too much time thinking how the final log message might look like :)
If you look at the error from the stdlib, it includes the operation and the cause like open file: permission denied
. In order to create user-readable errors one might use an encoding like <operation>: <optional context>: <optional message>: <cause>
, where <cause>
should follow a similar pattern.
Plus our logs are sometimes redundant to our error message. Following the aforementioned schema + reduce redundancy in the log (we do not need to mention that "reading configuration" failed), the log message would become:
Error: read configuration: filename /somefile: open file: permission denied
.
Beside the Error
and the permission denied
there is no mention of failing
, can not
, or similar. The message more or less resembles that call stack (at the cost of having "readable" english).
But our logs/errors normally follow the schema Error doing X: failed to <message similar to X>: could not <message>: <cause>
. Although it does not read nice, we start each sentence with "failed" and might want to keep that.
Some related readings:
filebeat/fileset/factory.go
Outdated
return err | ||
_, pConfigs, err := f.createRegistry(c) | ||
if err != nil { | ||
return fmt.Errorf("create module registry for filesets: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("create module registry for filesets: %w", err) | |
return fmt.Errorf("could not create module registry for filesets: %w", err) |
) | ||
|
||
// Outlet is an empty outlet for testing. | ||
type Outlet struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have some "support" in order to create more extensive tests. e.g.
type Outlet struct {
DoneSig chan struct{} // optional channel to simulate shutdown
Event func(beat.Event) bool // optional callback
}
e.g. the callback can be used to 'record' events in unit tests in order to validate them afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, but we wouldn't be using these features here by now, wdyt about adding them when there is a use case?
jenkins run the tests please |
…ation (elastic#23722) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)
…ation (elastic#23722) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)
…ation (elastic#23722) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)
…ation (#23722) (#23803) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)
…ation (#23722) (#23804) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)
…ation (#23722) (#23805) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)
…ation (elastic#23722) (elastic#23805) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit 488e020)
What does this PR do?
Stop input v1 runners created to check config.
CheckConfig
for v1 inputs actually calls the constructors of the inputs. In some cases, as in thelog
input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations.As discovered in #23658, this started to cause leaks since 7.8.0 (specifically since #16715), but I am not sure why this was not an issue before as config checkers were moved there but not really changed. Perhaps before this change input configs were not actually checked.
Considered alternatives
Start()
Run()
doesn't return errors, and it may be too late to do some checks expected now in the builder.CheckConfig
. Not sure if this would be very different at the end, and we would need to rely on inputs releasing their resources if the context is done.CheckConfig
for inputs. A feature would be lost.Why is it important?
Avoid goroutines and other possible leaks with autodiscover in Filebeat.
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
How to test this PR locally
-httpprof :6060
.curl http://localhost:6060/debug/pprof/goroutine -o /tmp/goroutines-base.prof
).go tool pprof -top -base /tmp/goroutines-base.prof http://localhost:6060/debug/pprof/goroutine
CloseOnSignal
orSubOutlets
.cleanup_timeout
) the started goroutines are not in pprof anymore.Related issues