Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix leak caused by input runners created when checking their configuration #23722

Merged

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Jan 27, 2021

What does this PR do?

Stop input v1 runners created to check config.

CheckConfig for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations.

As discovered in #23658, this started to cause leaks since 7.8.0 (specifically since #16715), but I am not sure why this was not an issue before as config checkers were moved there but not really changed. Perhaps before this change input configs were not actually checked.

Considered alternatives

  • Avoid creating goroutines in v1 input builders. I may give a try to this option, but the problem I see is that Start() Run() doesn't return errors, and it may be too late to do some checks expected now in the builder.
  • Add a runner builder that creates a runner using fake contexts and connectors that can be controlled from CheckConfig. Not sure if this would be very different at the end, and we would need to rely on inputs releasing their resources if the context is done.
  • Migrate all inputs to v2. Long term is the best solution, but it can be a big effort to be done now, and we would still need a fix to backport to released branches.
  • Do nothing on CheckConfig for inputs. A feature would be lost.

Why is it important?

Avoid goroutines and other possible leaks with autodiscover in Filebeat.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works.
    • Not easy to add a test for this. It needs at least the module loader, the autodiscover process, and a v1 input. Doing it as system test would be flaky because there are several other goroutines.
    • Added tests to check that v1 inputs don't leak goroutines if their context is stopped after being created.
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Confirm that all v1 inputs can be stopped without being started before.

How to test this PR locally

  • Start filebeat with docker (or kubernetes) autodiscover and -httpprof :6060.
  • Collect a base profile (curl http://localhost:6060/debug/pprof/goroutine -o /tmp/goroutines-base.prof).
  • Start a container, and check that filebeat starts collecting its logs.
  • Check with pprof that new goroutines have been started
    • go tool pprof -top -base /tmp/goroutines-base.prof http://localhost:6060/debug/pprof/goroutine
    • Look for goroutines like CloseOnSignal or SubOutlets.
  • Stop the container.
  • Confirm that eventually (~after cleanup_timeout) the started goroutines are not in pprof anymore.

Related issues

@jsoriano jsoriano added the Team:Integrations Label for the Integrations team label Jan 27, 2021
@jsoriano jsoriano self-assigned this Jan 27, 2021
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 27, 2021
@jsoriano jsoriano force-pushed the stop-runners-autodiscover-check-config branch from 330929e to eba6805 Compare January 27, 2021 20:12
filebeat/fileset/factory.go Outdated Show resolved Hide resolved
@jsoriano jsoriano added the needs_backport PR is waiting to be backported to other branches. label Jan 27, 2021
if err != nil {
return err
}
runner.Stop()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop is not blocking I think. Or is it? We also have a Wait method on some inputs. If we do not 'block' there might be a chance that our tests become flaky due to the go-routine check and resources being freed eventually asynchronously.

Copy link
Member Author

@jsoriano jsoriano Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this stops the runner. Wait is not exposed in the runner.

The goroutine checker has a wait loop with a configurable timeout, we already have non-flaky tests that stop the inputs and wait for their goroutines to finish, also with other features that stop goroutines asynchronously. I wouldn't expect this to introduce additional flakiness.

@elasticmachine
Copy link
Collaborator

elasticmachine commented Jan 29, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: jsoriano commented: jenkins run the tests please

    • Start Time: 2021-02-02T09:50:22.840+0000
  • Duration: 45 min 38 sec

  • Commit: 904ec10

Test stats 🧪

Test Results
Failed 0
Passed 12999
Skipped 2033
Total 15032

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 12999
Skipped 2033
Total 15032

@jsoriano jsoriano marked this pull request as ready for review February 1, 2021 12:09
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jsoriano! I only left some nits. Leaving the final +1 for @urso who already commented and knows the specific codebase.

if err != nil {
return nil, err
return nil, fmt.Errorf("create module registry for filesets: %w", err)
Copy link
Member

@ChrsMark ChrsMark Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("create module registry for filesets: %w", err)
return nil, fmt.Errorf("could not create module registry for filesets: %w", err)

Copy link

@urso urso Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on changing the wording that something did not work out

Disclaimer: I'm no native speaker

I'd prefer if our logs do not use didn't, couldn't or similar. Instead put some emphasis on the not by using could not, can not, must not. If something "Failed", let's use failed to

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼 edited to use the full form of negation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, adding could not.

I have been changing my opinion on these kind of messages over time. Lately I am wondering if, when wrapping, it is not enough to mention what was being done. The wrapped error err already has the root cause, and the caller knows how it affects upper layers or the user, the context provided in places like here give an orientation of what was being done.

Thinking in a made up example, these are the kind of messages logged when one or the other strategy are used:

  • Failed to read configuration: could not read configuration from file /somefile: could not open file: permission denied.
  • Failed to read configuration: read configuration from file /somefile: open file: permission denied.

Both messages provide the same information: what failed, what was it trying to do and what was the root cause of the issue. In the first version the could nots don't provide additional information, and when many errors are chained they can look too redundant. Second option looks enough, more compact and easier to read.

In the code, the message looks more complete when mentioning that something didn't went as expected, but this information is actually redundant with the fact that an error is being wrapped and returned.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, part of the problem is we do not capture stack traces (or locations) on errors + generic error types normally do not carry additional 'context' (and we do not define our own error types when we should). Plus there is no generic support to add diagnostic context to add extra information to an error value (I'm using sderr like sderr.With("file", ...).Wrap(err, "read configuration") if I need this). Aaannnd we do not spend too much time thinking how the final log message might look like :)

If you look at the error from the stdlib, it includes the operation and the cause like open file: permission denied. In order to create user-readable errors one might use an encoding like <operation>: <optional context>: <optional message>: <cause>, where <cause> should follow a similar pattern.

Plus our logs are sometimes redundant to our error message. Following the aforementioned schema + reduce redundancy in the log (we do not need to mention that "reading configuration" failed), the log message would become:

Error: read configuration: filename /somefile: open file: permission denied.

Beside the Error and the permission denied there is no mention of failing, can not, or similar. The message more or less resembles that call stack (at the cost of having "readable" english).

But our logs/errors normally follow the schema Error doing X: failed to <message similar to X>: could not <message>: <cause>. Although it does not read nice, we start each sentence with "failed" and might want to keep that.

Some related readings:

return err
_, pConfigs, err := f.createRegistry(c)
if err != nil {
return fmt.Errorf("create module registry for filesets: %w", err)
Copy link
Member

@ChrsMark ChrsMark Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("create module registry for filesets: %w", err)
return fmt.Errorf("could not create module registry for filesets: %w", err)

)

// Outlet is an empty outlet for testing.
type Outlet struct{}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have some "support" in order to create more extensive tests. e.g.

type Outlet struct {
  DoneSig chan struct{} // optional channel to simulate shutdown
  Event func(beat.Event) bool // optional callback
}

e.g. the callback can be used to 'record' events in unit tests in order to validate them afterwards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, but we wouldn't be using these features here by now, wdyt about adding them when there is a use case?

@jsoriano
Copy link
Member Author

jsoriano commented Feb 2, 2021

jenkins run the tests please

@jsoriano jsoriano merged commit e6bb5c9 into elastic:master Feb 2, 2021
@jsoriano jsoriano deleted the stop-runners-autodiscover-check-config branch February 2, 2021 12:00
@jsoriano jsoriano added v7.12.0 and removed needs_backport PR is waiting to be backported to other branches. labels Feb 2, 2021
jsoriano added a commit to jsoriano/beats that referenced this pull request Feb 2, 2021
…ation (elastic#23722)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit e6bb5c9)
jsoriano added a commit to jsoriano/beats that referenced this pull request Feb 2, 2021
…ation (elastic#23722)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit e6bb5c9)
jsoriano added a commit to jsoriano/beats that referenced this pull request Feb 2, 2021
…ation (elastic#23722)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit e6bb5c9)
jsoriano added a commit that referenced this pull request Feb 2, 2021
…ation (#23722) (#23803)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit e6bb5c9)
jsoriano added a commit that referenced this pull request Feb 5, 2021
…ation (#23722) (#23804)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit e6bb5c9)
jsoriano added a commit that referenced this pull request Feb 5, 2021
…ation (#23722) (#23805)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit e6bb5c9)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…ation (elastic#23722) (elastic#23805)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit 488e020)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Goroutine leak in Filebeat with Kubernetes autodiscovery
4 participants