Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test: TestAddCertsToWatch_remove_ca #2586

Closed
albertteoh opened this issue Oct 25, 2020 · 2 comments · Fixed by #2610
Closed

Flaky test: TestAddCertsToWatch_remove_ca #2586

albertteoh opened this issue Oct 25, 2020 · 2 comments · Fixed by #2610
Labels
bug good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@albertteoh
Copy link
Contributor

albertteoh commented Oct 25, 2020

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
https://travis-ci.com/github/albertteoh/jaeger/jobs/406895163

=== RUN   TestAddCertsToWatch_remove_ca
    cert_watcher_test.go:307: 
        	Error Trace:	cert_watcher_test.go:307
        	Error:      	Should be true
        	Test:       	TestAddCertsToWatch_remove_ca
--- FAIL: TestAddCertsToWatch_remove_ca (0.00s)

Expected behavior
Passing test.

What troubleshooting steps did you try?
Adding the following at the end of the test:

fmt.Println(logObserver.All())

Prints:

[
{{warn 2020-10-25 16:37:28.214382 +1100 AEDT m=+0.001567434  Certificate has been removed, using the last known version undefined } [{certificate 15 0 /var/folders/3p/yms48z2s6v7c8fy2m_1481g00000gn/T/ca.crt611136954 <nil>}]}
{{warn 2020-10-25 16:37:28.214443 +1100 AEDT m=+0.001628482  Certificate has been removed, using the last known version undefined } [{certificate 15 0 /var/folders/3p/yms48z2s6v7c8fy2m_1481g00000gn/T/clientCa.crt633105361 <nil>}]}
]

I believe the root cause is a race where the following waitUntil finishes after the first detected "Certificate has been removed..." log, and before the second clientCAFile deletion is detected by the certWatcher:

waitUntil(func() bool {
	return logObserver.FilterMessage("Certificate has been removed, using the last known version").Len() > 0
}, 100, time.Millisecond*100)

Causing the second assertion to fail:

assert.True(t, logObserver.FilterMessage("Certificate has been removed, using the last known version").FilterField(zap.String("certificate", clientCaFile.Name())).Len() > 0)

Proposed Fix

Modify the log observer check to wait until at least 2 cert deletions are seen (>= 2 for readability):

return logObserver.FilterMessage("Certificate has been removed, using the last known version").Len() >= 2

Confirmed that this change doesn't just pass tests due to timeout and hence, does not negatively impact test execution time:

  • With >= 2:
=== RUN   TestAddCertsToWatch_remove_ca
--- PASS: TestAddCertsToWatch_remove_ca (0.10s)
  • Whereas with >=3 still passes but requires a 10s wait to timeout the waitUntil:
=== RUN   TestAddCertsToWatch_remove_ca
--- PASS: TestAddCertsToWatch_remove_ca (10.19s)
@albertteoh albertteoh added the bug label Oct 25, 2020
chlunde added a commit to chlunde/jaeger that referenced this issue Oct 27, 2020
There are a few file descriptor leaks in the unit tests, and on an error
condition in certwatcher.

This is an issue when checking for flaky tests with
`go test -test.count ..`. One leak is in an error condition in
certwatcher but it is unlikely to be a runtime issue because.

Related jaegertracing#2586

Signed-off-by: Carl Henrik Lunde <chlunde@ifi.uio.no>
yurishkuro pushed a commit that referenced this issue Oct 28, 2020
There are a few file descriptor leaks in the unit tests, and on an error
condition in certwatcher.

This is an issue when checking for flaky tests with
`go test -test.count ..`. One leak is in an error condition in
certwatcher but it is unlikely to be a runtime issue because.

Related #2586

Signed-off-by: Carl Henrik Lunde <chlunde@ifi.uio.no>
@yurishkuro
Copy link
Member

@albertteoh the fix might have introduced a different issue (possible with the order of asserts), the test again failed on master:

https://travis-ci.com/github/jaegertracing/jaeger/jobs/428075939#L2104

@yurishkuro yurishkuro reopened this Nov 8, 2020
@yurishkuro yurishkuro added help wanted Features that maintainers are willing to accept but do not have cycles to implement good first issue Good for beginners and removed needs-triage labels Nov 8, 2020
@albertteoh
Copy link
Contributor Author

@albertteoh the fix might have introduced a different issue (possible with the order of asserts), the test again failed on master:

https://travis-ci.com/github/jaegertracing/jaeger/jobs/428075939#L2104

@yurishkuro the fix for this issue is for the TestAddCertsToWatch_remove_ca test, which is independent of, and should not share state with, the failing test in the aforementioned travis build which is TestReload. I've created a new issue for it in #2622. Please let me know if the analysis and proposed fix are incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants