Fix leak caused by input runners created when checking their configuration #23722

jsoriano · 2021-01-27T20:10:08Z

What does this PR do?

Stop input v1 runners created to check config.

CheckConfig for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations.

As discovered in #23658, this started to cause leaks since 7.8.0 (specifically since #16715), but I am not sure why this was not an issue before as config checkers were moved there but not really changed. Perhaps before this change input configs were not actually checked.

Considered alternatives

Avoid creating goroutines in v1 input builders. I may give a try to this option, but the problem I see is that ~~Start()~~ Run() doesn't return errors, and it may be too late to do some checks expected now in the builder.
Add a runner builder that creates a runner using fake contexts and connectors that can be controlled from CheckConfig. Not sure if this would be very different at the end, and we would need to rely on inputs releasing their resources if the context is done.
Migrate all inputs to v2. Long term is the best solution, but it can be a big effort to be done now, and we would still need a fix to backport to released branches.
Do nothing on CheckConfig for inputs. A feature would be lost.

Why is it important?

Avoid goroutines and other possible leaks with autodiscover in Filebeat.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works.
- Not easy to add a test for this. It needs at least the module loader, the autodiscover process, and a v1 input. Doing it as system test would be flaky because there are several other goroutines.
- Added tests to check that v1 inputs don't leak goroutines if their context is stopped after being created.
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

Confirm that all v1 inputs can be stopped without being started before.

How to test this PR locally

Start filebeat with docker (or kubernetes) autodiscover and -httpprof :6060.
Collect a base profile (curl http://localhost:6060/debug/pprof/goroutine -o /tmp/goroutines-base.prof).
Start a container, and check that filebeat starts collecting its logs.
Check with pprof that new goroutines have been started
- go tool pprof -top -base /tmp/goroutines-base.prof http://localhost:6060/debug/pprof/goroutine
- Look for goroutines like CloseOnSignal or SubOutlets.
Stop the container.
Confirm that eventually (~after cleanup_timeout) the started goroutines are not in pprof anymore.

Related issues

Fix Goroutine leak in Filebeat with Kubernetes autodiscovery #23658.

filebeat/fileset/factory.go

filebeat/input/testing/input.go

urso · 2021-01-28T15:12:45Z

filebeat/input/runnerfactory.go

+	if err != nil {
+		return err
+	}
+	runner.Stop()


Stop is not blocking I think. Or is it? We also have a Wait method on some inputs. If we do not 'block' there might be a chance that our tests become flaky due to the go-routine check and resources being freed eventually asynchronously.

Actually this stops the runner. Wait is not exposed in the runner.

The goroutine checker has a wait loop with a configurable timeout, we already have non-flaky tests that stop the inputs and wait for their goroutines to finish, also with other features that stop goroutines asynchronously. I wouldn't expect this to introduce additional flakiness.

…scover-check-config

elasticmachine · 2021-01-29T12:08:43Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Build Cause: jsoriano commented: jenkins run the tests please
- Start Time: 2021-02-02T09:50:22.840+0000
Duration: 45 min 38 sec
Commit: 904ec10

Test stats 🧪

Test	Results
Failed	0
Passed	12999
Skipped	2033
Total	15032

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test	Results
Failed	0
Passed	12999
Skipped	2033
Total	15032

…scover-check-config

elasticmachine · 2021-02-01T12:09:35Z

Pinging @elastic/integrations (Team:Integrations)

ChrsMark

Thanks @jsoriano! I only left some nits. Leaving the final +1 for @urso who already commented and knows the specific codebase.

ChrsMark · 2021-02-01T13:08:33Z

filebeat/fileset/factory.go

 	if err != nil {
-		return nil, err
+		return nil, fmt.Errorf("create module registry for filesets: %w", err)


Suggested change

return nil, fmt.Errorf("create module registry for filesets: %w", err)

return nil, fmt.Errorf("could not create module registry for filesets: %w", err)

+1 on changing the wording that something did not work out

Disclaimer: I'm no native speaker

I'd prefer if our logs do not use didn't, couldn't or similar. Instead put some emphasis on the not by using could not, can not, must not. If something "Failed", let's use failed to

👍🏼 edited to use the full form of negation

Ok, adding could not.

I have been changing my opinion on these kind of messages over time. Lately I am wondering if, when wrapping, it is not enough to mention what was being done. The wrapped error err already has the root cause, and the caller knows how it affects upper layers or the user, the context provided in places like here give an orientation of what was being done.

Thinking in a made up example, these are the kind of messages logged when one or the other strategy are used:

Failed to read configuration: could not read configuration from file /somefile: could not open file: permission denied.

Failed to read configuration: read configuration from file /somefile: open file: permission denied.

Both messages provide the same information: what failed, what was it trying to do and what was the root cause of the issue. In the first version the could nots don't provide additional information, and when many errors are chained they can look too redundant. Second option looks enough, more compact and easier to read.

In the code, the message looks more complete when mentioning that something didn't went as expected, but this information is actually redundant with the fact that an error is being wrapped and returned.

Well, part of the problem is we do not capture stack traces (or locations) on errors + generic error types normally do not carry additional 'context' (and we do not define our own error types when we should). Plus there is no generic support to add diagnostic context to add extra information to an error value (I'm using sderr like sderr.With("file", ...).Wrap(err, "read configuration") if I need this). Aaannnd we do not spend too much time thinking how the final log message might look like :)

If you look at the error from the stdlib, it includes the operation and the cause like open file: permission denied. In order to create user-readable errors one might use an encoding like <operation>: <optional context>: <optional message>: <cause>, where <cause> should follow a similar pattern.

Plus our logs are sometimes redundant to our error message. Following the aforementioned schema + reduce redundancy in the log (we do not need to mention that "reading configuration" failed), the log message would become:

Error: read configuration: filename /somefile: open file: permission denied.

Beside the Error and the permission denied there is no mention of failing, can not, or similar. The message more or less resembles that call stack (at the cost of having "readable" english).

But our logs/errors normally follow the schema Error doing X: failed to <message similar to X>: could not <message>: <cause>. Although it does not read nice, we start each sentence with "failed" and might want to keep that.

Some related readings:

https://commandcenter.blogspot.com/2017/12/error-handling-in-upspin.html

https://middlemost.com/failure-is-your-domain/

go-txfile error doc

ChrsMark · 2021-02-01T13:08:58Z

filebeat/fileset/factory.go

-	return err
+	_, pConfigs, err := f.createRegistry(c)
+	if err != nil {
+		return fmt.Errorf("create module registry for filesets: %w", err)


Suggested change

return fmt.Errorf("create module registry for filesets: %w", err)

return fmt.Errorf("could not create module registry for filesets: %w", err)

urso · 2021-02-01T15:00:45Z

filebeat/input/inputtest/input.go

+)
+
+// Outlet is an empty outlet for testing.
+type Outlet struct{}


It would be nice to have some "support" in order to create more extensive tests. e.g.

type Outlet struct { DoneSig chan struct{} // optional channel to simulate shutdown Event func(beat.Event) bool // optional callback }

e.g. the callback can be used to 'record' events in unit tests in order to validate them afterwards.

+1, but we wouldn't be using these features here by now, wdyt about adding them when there is a use case?

jsoriano · 2021-02-02T09:49:54Z

jenkins run the tests please

…ation (elastic#23722) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)

…ation (#23722) (#23803) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)

…ation (#23722) (#23804) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)

…ation (#23722) (#23805) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit e6bb5c9)

…ation (elastic#23722) (elastic#23805) Stop input v1 runners created to check config. `CheckConfig` for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations. (cherry picked from commit 488e020)

jsoriano added the Team:Integrations Label for the Integrations team label Jan 27, 2021

jsoriano requested review from urso and ChrsMark January 27, 2021 20:10

jsoriano self-assigned this Jan 27, 2021

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 27, 2021

Stop runners created when checking their configuration

eba6805

jsoriano force-pushed the stop-runners-autodiscover-check-config branch from 330929e to eba6805 Compare January 27, 2021 20:12

jsoriano commented Jan 27, 2021

View reviewed changes

filebeat/fileset/factory.go Outdated Show resolved Hide resolved

jsoriano added the needs_backport PR is waiting to be backported to other branches. label Jan 27, 2021

jsoriano added 2 commits January 28, 2021 11:33

Use CheckConfig for individual inputs in filesets

18d5e08

Add tests to check that v1 inputs don't leak goroutines on check config

4afe023

urso reviewed Jan 28, 2021

View reviewed changes

filebeat/input/testing/input.go Outdated Show resolved Hide resolved

urso reviewed Jan 28, 2021

View reviewed changes

jsoriano added 3 commits January 29, 2021 12:18

Merge remote-tracking branch 'origin/master' into stop-runners-autodi…

bd6d8b5

…scover-check-config

Rename input testing module to inputtest

df2bcfe

Fix license headers

7f2131e

jsoriano added 2 commits January 29, 2021 13:29

Fix GCP test

92f11a0

Merge remote-tracking branch 'origin/master' into stop-runners-autodi…

526df51

…scover-check-config

jsoriano marked this pull request as ready for review February 1, 2021 12:09

ChrsMark reviewed Feb 1, 2021

View reviewed changes

urso reviewed Feb 1, 2021

View reviewed changes

Fix error messages

904ec10

urso approved these changes Feb 1, 2021

View reviewed changes

jsoriano merged commit e6bb5c9 into elastic:master Feb 2, 2021

jsoriano deleted the stop-runners-autodiscover-check-config branch February 2, 2021 12:00

jsoriano mentioned this pull request Feb 2, 2021

Cherry-pick #23722 to 7.x: Fix leak caused by input runners created when checking their configuration #23803

Merged

7 tasks

jsoriano added v7.12.0 and removed needs_backport PR is waiting to be backported to other branches. labels Feb 2, 2021

jsoriano mentioned this pull request Feb 2, 2021

Cherry-pick #23722 to 7.11: Fix leak caused by input runners created when checking their configuration #23804

Merged

7 tasks

jsoriano added the v7.11.0 label Feb 2, 2021

jsoriano mentioned this pull request Feb 2, 2021

Cherry-pick #23722 to 7.10: Fix leak caused by input runners created when checking their configuration #23805

Merged

7 tasks

jsoriano added the v7.10.3 label Feb 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix leak caused by input runners created when checking their configuration #23722

Fix leak caused by input runners created when checking their configuration #23722

jsoriano commented Jan 27, 2021 •

edited

Loading

urso Jan 28, 2021

jsoriano Jan 29, 2021 •

edited

Loading

elasticmachine commented Jan 29, 2021 •

edited by jenkins-beats-ci bot

Loading

Build stats

Test stats 🧪

Test stats 🧪

elasticmachine commented Feb 1, 2021

ChrsMark left a comment

ChrsMark Feb 1, 2021 •

edited

Loading

urso Feb 1, 2021 •

edited

Loading

ChrsMark Feb 1, 2021

jsoriano Feb 1, 2021

urso Feb 1, 2021

ChrsMark Feb 1, 2021 •

edited

Loading

urso Feb 1, 2021

jsoriano Feb 1, 2021

jsoriano commented Feb 2, 2021

	return nil, fmt.Errorf("create module registry for filesets: %w", err)
	return nil, fmt.Errorf("could not create module registry for filesets: %w", err)

	return fmt.Errorf("create module registry for filesets: %w", err)
	return fmt.Errorf("could not create module registry for filesets: %w", err)

Fix leak caused by input runners created when checking their configuration #23722

Fix leak caused by input runners created when checking their configuration #23722

Conversation

jsoriano commented Jan 27, 2021 • edited Loading

What does this PR do?

Considered alternatives

Why is it important?

Checklist

Author's Checklist

How to test this PR locally

Related issues

urso Jan 28, 2021

Choose a reason for hiding this comment

jsoriano Jan 29, 2021 • edited Loading

Choose a reason for hiding this comment

elasticmachine commented Jan 29, 2021 • edited by jenkins-beats-ci bot Loading

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

Test stats 🧪

elasticmachine commented Feb 1, 2021

ChrsMark left a comment

Choose a reason for hiding this comment

ChrsMark Feb 1, 2021 • edited Loading

Choose a reason for hiding this comment

urso Feb 1, 2021 • edited Loading

Choose a reason for hiding this comment

ChrsMark Feb 1, 2021

Choose a reason for hiding this comment

jsoriano Feb 1, 2021

Choose a reason for hiding this comment

urso Feb 1, 2021

Choose a reason for hiding this comment

ChrsMark Feb 1, 2021 • edited Loading

Choose a reason for hiding this comment

urso Feb 1, 2021

Choose a reason for hiding this comment

jsoriano Feb 1, 2021

Choose a reason for hiding this comment

jsoriano commented Feb 2, 2021

jsoriano commented Jan 27, 2021 •

edited

Loading

jsoriano Jan 29, 2021 •

edited

Loading

elasticmachine commented Jan 29, 2021 •

edited by jenkins-beats-ci bot

Loading

ChrsMark Feb 1, 2021 •

edited

Loading

urso Feb 1, 2021 •

edited

Loading

ChrsMark Feb 1, 2021 •

edited

Loading