Failure of test_coproc_delete_topic unit test #3384

Closed
andrewhsu opened this issue Jan 3, 2022 · 21 comments · Fixed by #3340
Labels: area/coproc (Legacy WASM coprocessor, please use area/wasm instead), ci-disabled-test, ci-failure, kind/bug (Something isn't working)

Comments

@andrewhsu
Member

andrewhsu commented Jan 3, 2022

Version & Environment

Nightly runs of tests using code from dev branch.

What went wrong?

Buildkite jobs are red and logs indicate failure in coproc_fixture_rpunit test.

What should have happened instead?

Awesomeness.

How to reproduce the issue?

Since I've seen Buildkite jobs run the same tests at the same git commit with no failures, this looks like a flaky test to me.
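
One way to put a number on "flaky" is to rerun the same test command several times at one commit and count the failures. The sketch below is only an illustration: `count_failures` is a hypothetical helper, and the stand-in command exists so the snippet runs anywhere. The real invocation would be whatever the CI job runs, for example `ctest -R coproc_fixture_rpunit` from the build directory (an assumption about the local setup).

```python
import subprocess
import sys


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Stand-in command so the sketch is runnable without a build tree.
# Replace with the real test invocation, e.g. ["ctest", "-R", "coproc_fixture_rpunit"].
always_passes = [sys.executable, "-c", "raise SystemExit(0)"]
print(count_failures(always_passes, runs=3))
```

A non-zero count that is smaller than `runs` at a fixed commit would support the flaky-test reading.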

Additional information

Example buildkite job log of failure with git commit f0a443d:
https://buildkite.com/vectorized/redpanda/builds/5850#a4eeb029-c210-4436-87d9-88ac956ec94f/6-6019

The following tests FAILED:
	 18 - coproc_fixture_rpunit (Failed)
Errors while running CTest

Earlier in the same buildkite job log, it says:

../../../src/v/coproc/tests/kafka_api_materialized_tests.cc(0): Leaving test case "find_coordinator_for_non_replicatable_topic"; testing time: 1914865us
Leaving test module "../../../src/v/coproc/tests/retry_logic_tests.cc"; testing time: 183155114us
 
*** 1 failure is detected in the test module "../../../src/v/coproc/tests/retry_logic_tests.cc"
Test Exit code 201
@andrewhsu added the kind/bug, area/coproc, and ci-failure labels on Jan 3, 2022
@andrewhsu
Member Author

andrewhsu commented Jan 3, 2022

I found an older issue #2613 that seems to have described the same failure.

@andrewhsu
Member Author

Another occurrence of the same failure with the same git commit f0a443d:

https://buildkite.com/vectorized/redpanda/builds/5857#8893589b-556e-44dd-8a78-18b0330356e4/6-6014

The following tests FAILED:
	 18 - coproc_fixture_rpunit (Failed)
Errors while running CTest

There were also green builds with the same git commit:

@graphcareful
Contributor

Unfortunately, ctest isn't that helpful for identifying exactly which test within the suite failed. It reports the failure as being in the file retry_logic_tests.cc every time, even when that may not be the case. Searching through the logs, I found this:

unknown location(0): fatal error: in "test_copro_delete_topic": seastar::timed_out_error: timedout

Looks like this is the cause. I actually have a PR open for this now that never managed to go in before the winter break: #3340
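
The log search described above can be scripted. The following is a minimal sketch, assuming the fatal-error lines always follow the shape of the one quoted in this comment (`fatal error: in "<test case>": <message>`); `find_fatal_errors` and `FATAL_RE` are hypothetical names, and the sample log is assembled from lines quoted in this issue.

```python
import re

# Matches fatal-error lines shaped like the one quoted above:
#   unknown location(0): fatal error: in "<test case>": <message>
FATAL_RE = re.compile(r'fatal error: in "(?P<case>[^"]+)": (?P<message>.*)')


def find_fatal_errors(log_text):
    """Return (test_case, message) pairs for every fatal-error line in the log."""
    failures = []
    for line in log_text.splitlines():
        match = FATAL_RE.search(line)
        if match:
            failures.append((match.group("case"), match.group("message").strip()))
    return failures


sample_log = """\
Leaving test module "../../../src/v/coproc/tests/retry_logic_tests.cc"; testing time: 183155114us
unknown location(0): fatal error: in "test_copro_delete_topic": seastar::timed_out_error: timedout
*** 1 failure is detected in the test module "../../../src/v/coproc/tests/retry_logic_tests.cc"
"""

for case, message in find_fatal_errors(sample_log):
    print(f"{case}: {message}")
```

This surfaces the test case name directly instead of the module name that the summary line reports.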

@jcsp changed the title from "Occasional failure of coproc_fixture_rpunit test" to "Failure of coproc_fixture_rpunit test" on Jan 14, 2022
@graphcareful
Contributor

Closing, as this was resolved by #3340.

@ajfabbri
Contributor

Just hit a very similar CI failure here.

@ajfabbri ajfabbri reopened this Jan 28, 2022
@ajfabbri
Contributor

ajfabbri commented Jan 28, 2022

Double-checked I have the commit from #3340:

$ git branch --contains c721f18e390a4f3688228a4a986c8beeaa0b6963
  dev
* full-disk-analysis
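
The same check can be automated. This is a rough sketch with hypothetical helper names: `is_ancestor` wraps `git merge-base --is-ancestor`, which exits with status 0 when the commit is an ancestor of the given ref and 1 when it is not, and `current_branch_in` reads output like the listing above, relying on git marking the checked-out branch with a leading `* `.

```python
import subprocess


def is_ancestor(commit, ref="HEAD"):
    """Return True if `commit` is an ancestor of `ref` in the current repository."""
    result = subprocess.run(["git", "merge-base", "--is-ancestor", commit, ref])
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    raise RuntimeError(f"git merge-base failed with exit status {result.returncode}")


def current_branch_in(branch_output):
    """Return the checked-out branch named in `git branch --contains` output, or None."""
    for line in branch_output.splitlines():
        if line.startswith("* "):
            return line[2:].strip()
    return None


# Output copied from the listing above.
output = """\
  dev
* full-disk-analysis
"""
print(current_branch_in(output))
```

`is_ancestor` is defined but not called here because it needs a checkout of the repository; in a detached-HEAD state the starred line is a description rather than a branch name.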

@graphcareful
Contributor

@ajfabbri you'll also need this: #3493

@ajfabbri
Contributor

ajfabbri commented Feb 1, 2022

> @ajfabbri you'll also need this: #3493

I have that fix as well. My branch was recently rebased on dev, whereas that fix is about two weeks old.

$ git branch --contains 66ad1dbe5ec3644ed3632db841d309114039d8e5
  dev
* full-disk-analysis

@graphcareful
Contributor

OK, leaving this issue open.

@andrewhsu
Member Author

@jcsp
Contributor

jcsp commented Mar 4, 2022

@jcsp
Contributor

jcsp commented Mar 8, 2022

This is still one of the more frequent failures (https://buildkite.com/redpanda/redpanda/builds/7908#fc5ec8e2-c5e2-4208-be8e-77220257956e last night).

@graphcareful do you have a sense of what is going wrong here?

@gousteris
Contributor

jcsp added a commit to jcsp/redpanda that referenced this issue Mar 9, 2022
@graphcareful
Contributor

I investigated and found that the culprit is a single test within the coproc_unit_test binary, here

I believe the test fails because it does not account for an edge case in which coproc attempts to re-create a deleted log.
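
A toy illustration of how that edge case could turn into the timeout seen earlier in this thread: if a test deletes a log and then waits for it to be gone, a background actor that keeps re-creating the log means the wait never succeeds. Everything here is hypothetical (`ToyStorage`, `wait_until`, the `recreate` flag); it is not Redpanda or Seastar code, and the interleaving is forced to be deterministic, whereas the real race is not.

```python
import time


class ToyStorage:
    """Hypothetical stand-in for a log store; not Redpanda's storage API."""

    def __init__(self):
        self.logs = set()


def wait_until(predicate, timeout_s, poll_s=0.01, on_poll=None):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if on_poll is not None:
            on_poll()  # lets the simulated background actor run first
        if predicate():
            return True
        time.sleep(poll_s)
    return False


def delete_and_wait(recreate):
    storage = ToyStorage()
    storage.logs.add("materialized")

    def background_tick():
        # Models the edge case: something re-creates the log after deletion.
        if recreate:
            storage.logs.add("materialized")

    storage.logs.discard("materialized")  # the delete request
    return wait_until(
        lambda: "materialized" not in storage.logs,
        timeout_s=0.1,
        on_poll=background_tick,
    )


print(delete_and_wait(recreate=True))   # False: the wait times out
print(delete_and_wait(recreate=False))  # True: the deletion is observed
```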

@gousteris
Contributor

https://buildkite.com/redpanda/redpanda/builds/8249#89788d01-31d9-4859-add0-258ccaa221c9/6-6195

The following tests FAILED:
	 20 - coproc_fixture_rpunit (Failed)
task: Failed to run task "rp:test": exit status 8
task: Failed to run task "ci:rp": task: Failed to run task "ci:rp:test": task: Failed to run task "docker:task": exit status 1
🚨 Error: The command exited with status 1

@graphcareful
Contributor

@gousteris this seems like a new issue: a memory leak in the test find_coordinator_for_non_replicatable_topic. I will file a separate issue.

@graphcareful
Contributor

Filed here: #4053

@jcsp
Contributor

jcsp commented Mar 21, 2022

@graphcareful did you mean to close this issue? I only see a PR disabling the test.

@graphcareful
Contributor

I filed #4053 because the discussion in this issue pertains to a different, already resolved fix.

@jcsp
Contributor

jcsp commented Mar 21, 2022

I get that #4053 is for find_coordinator_for_non_replicatable_topic -- but this ticket's last activity related to test_copro_delete_topic. That test is still disabled, so unless there's another ticket elsewhere for test_copro_delete_topic, then this ticket is still live.

@graphcareful
Contributor

OK, good points. To avoid confusion, I will re-open this and change the title of the issue.

@graphcareful graphcareful reopened this Mar 21, 2022
@graphcareful changed the title from "Failure of coproc_fixture_rpunit test" to "Failure of test_coproc_delete_topic unit test" on Mar 21, 2022
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jun 3, 2022
- These tests all attempt to remove a materialized topic while the
coprocessor is still running.

- However, due to the initial design of the system, coproc will attempt
to recreate the topic and repopulate it up to the previous high
watermark. If it did not, there would be an inconsistency between the
coprocessor's defined metadata and random commands sent to the
cluster by the user.

- The tests have been useful in showing that this type of
concurrent delete can occur without any crashes.

- A user who truly wants to delete a materialized topic must
shut down the coprocessor first.

- Fixes: redpanda-data#3384

(cherry picked from commit 964bb41)
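
The behaviour the commit message describes can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the actual coproc implementation: `ToyCoprocessor`, its methods, and the uppercase transform are all hypothetical, and the high watermark is modelled as a plain record count.

```python
class ToyCoprocessor:
    """Hypothetical model of the behaviour described in the commit message."""

    def __init__(self, source_records):
        self.source = list(source_records)
        self.running = True
        self.materialized = None  # None means the topic does not exist
        self.high_watermark = 0   # number of records materialized so far

    def tick(self):
        if not self.running:
            return
        if self.materialized is None:
            # Topic missing: recreate it and replay up to the previous high watermark.
            self.materialized = [r.upper() for r in self.source[: self.high_watermark]]
        elif self.high_watermark < len(self.source):
            # Normal progress: transform and append the next source record.
            self.materialized.append(self.source[self.high_watermark].upper())
            self.high_watermark += 1

    def delete_topic(self):
        self.materialized = None

    def shutdown(self):
        self.running = False


coproc = ToyCoprocessor(["a", "b", "c"])
for _ in range(4):  # one tick to create the topic, three to fill it
    coproc.tick()

coproc.delete_topic()  # delete while the coprocessor is still running
coproc.tick()
print(coproc.materialized)  # ['A', 'B', 'C']: recreated and repopulated

coproc.shutdown()  # shut down first, then delete
coproc.delete_topic()
coproc.tick()
print(coproc.materialized)  # None: the deletion sticks
```

In this model, deleting while the coprocessor runs is undone on the next tick, while shutting down first lets the deletion persist, which matches the ordering the commit message recommends.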
This issue was closed.