Failure in kafka_streams_test.KafkaStreamsWikipedia #2889

Closed
NyaliaLui opened this issue Nov 6, 2021 · 9 comments · Fixed by #2960 or #6920

Comments

@NyaliaLui
Contributor

NyaliaLui commented Nov 6, 2021

https://buildkite.com/vectorized/redpanda/builds/4042#a185db01-f1ab-494c-b801-7083a5547589

TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ExampleRunner-1-140671614982704 node 1 on docker_n_19']

According to the debug log, the program that generates load for this test produced no output, and the test needs that output to check for successful execution.

@jcsp
Contributor

jcsp commented Nov 17, 2021

Looks like this failed again post-fixup here: https://buildkite.com/vectorized/redpanda/builds/4434#7613b197-2d5a-4916-b96f-4c472fefd21e

@NyaliaLui
Contributor Author

In prep for ci-party

Problem description

In CI, the driver (load generator) occasionally fails to generate data. When that happens the driver produces no output, and the functional test cannot be validated without that output.

Reproducing the issue

Difficulty: hard
Reason: It is difficult to get the driver to fail to generate data. I ran the test 30 times overnight and every run passed locally.

Steps to reproduce:

  • Check out the following branch on my redpanda fork: https://github.com/NyaliaLui/redpanda/tree/kstreams-wikipedia-fix
  • Build redpanda in release mode.
  • Run the KafkaStreamsWikipedia example in ducktape: task rp:run-ducktape-tests DUCKTAPE_ARGS="tests/rptest/tests/compatibility/kafka_streams_test.py::KafkaStreamsWikipedia"
  • Hope that the driver fails. There is a driver.wait() on line 84, which has a default timeout of 600s. Feel free to change that. Line 84 is where the failure should occur.
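
Because the failure is intermittent, one way to chase it is to run the same command in a loop and count the failures. The following is a minimal sketch, not part of the Redpanda tooling: `count_failures` is a hypothetical helper, and it assumes the `task` command from the steps above exits with a nonzero status when the test fails and is launched from the directory where you would run it by hand.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited nonzero.

    Hypothetical helper for illustration; it assumes a failed test run
    is reported through a nonzero exit status.
    """
    failures = 0
    for i in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
            print(
                f"run {i + 1}/{runs} failed with exit status {result.returncode}",
                file=sys.stderr,
            )
    return failures


# The command from the steps above, split into an argument list.
DUCKTAPE_CMD = [
    "task",
    "rp:run-ducktape-tests",
    "DUCKTAPE_ARGS=tests/rptest/tests/compatibility/kafka_streams_test.py::KafkaStreamsWikipedia",
]

# Example call (not executed here): count_failures(DUCKTAPE_CMD, runs=30)
```

Each iteration can block for up to the full driver.wait() timeout when the driver hangs, so a long loop is best left running unattended.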

@NyaliaLui
Contributor Author

I fixed this CI failure in the linked PR but the PR is stale now. Once I get my fork back up I can re-create the PR and get it merged.

@VadimPlh
Contributor

PR with the fix: #2962

NyaliaLui pushed a commit to NyaliaLui/redpanda that referenced this issue Apr 27, 2022
The KStreams example classes use the same names as the KStreams tests.
Rename them and distinguish between an example and a driver. This will
improve code readability.

Within this commit are changes for redpanda-data#3032. The main problem in redpanda-data#3032 is
that the internal Java application sometimes fails to count all input
messages. This may be related to the 1min window used within the
program. Re-enable with ok_to_fail for now until a fix is proposed.

This commit also re-enables KStreams wikipedia test since it was
disabled due to redpanda-data#2889. Running the test multiple times (50+) did not
reproduce the issue in redpanda-data#2889. Redpanda has changed a lot since redpanda-data#2889
was last seen, so re-enable with ok_to_fail to see if the problem
still exists.
@NyaliaLui
Contributor Author

The PR with fix is now #4461 since I lost access to the previous PR.

NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jul 25, 2022
This test was disabled due to redpanda-data#2889 but a lot has changed since then.
Re-enable the test with ok_to_fail to see if it continues to fail.
NyaliaLui pushed a commit to NyaliaLui/redpanda that referenced this issue Jul 27, 2022
The KStreams example classes use the same names as the KStreams tests.
Rename them and distinguish between an example and a driver. This will
improve code readability.

This commit also re-enables KStreamsWikipedia with ok_to_fail.
The test was removed from the test suite but Redpanda has changed a lot
since the issue was reported. Let's re-enable to see if the issue
continues. See redpanda-data#2889 for details.
@NyaliaLui
Contributor Author

The PR to re-enable this test with ok_to_fail is #5615.
The test has been absent from the tree because of the old way of "disabling" tests.

NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jul 28, 2022
This test was disabled due to redpanda-data#2889 but a lot has changed since then.
Re-enable the test with ok_to_fail to see if it continues to fail.
andrwng pushed a commit to andrwng/redpanda that referenced this issue Aug 5, 2022
This test was disabled due to redpanda-data#2889 but a lot has changed since then.
Re-enable the test with ok_to_fail to see if it continues to fail.
@NyaliaLui
Contributor Author

I'm able to repro the issue locally. It fails in 10 of 1000 runs, so now I can make observations to track down the cause. Identifying the root cause will likely take a few days, so I'll update the ticket as I find new information.
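
A failure rate of 10 in 1000 runs also helps explain why the earlier, shorter local runs never hit the problem. As a rough back-of-the-envelope check, assuming each run fails independently with probability 0.01 (an assumption drawn from the count above, not a measured property of the test), the chance of seeing no failure in n runs is 0.99^n. `prob_all_pass` is a hypothetical helper name.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass."""
    return (1.0 - failure_rate) ** runs


# Assumed per-run failure rate, taken from the 10/1000 observation.
FAILURE_RATE = 10 / 1000

for runs in (30, 50, 1000):
    p = prob_all_pass(FAILURE_RATE, runs)
    print(f"{runs} runs: P(no failure) = {p:.3f}")
```

Under that assumption, a clean batch of 30 runs happens roughly three times out of four, and a clean batch of 50 runs roughly six times out of ten, so those earlier results are not evidence that the bug was gone.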

@NyaliaLui
Contributor Author

NyaliaLui commented Oct 20, 2022

I discovered that in every failed run, the _schemas topic internal to the schema registry was never created. The _schemas topic is created on the first request to the registry, so this could be a lead on the root cause.

I need to inspect the KStreams code base more for when/what is supposed to submit requests to the registry.

@NyaliaLui
Contributor Author

NyaliaLui commented Oct 24, 2022

Weekend test runs revealed that the producers within KafkaStreams sometimes fail to generate data. So I'm testing some adjustments to the KafkaStreams code now to see if that resolves the issue.

NyaliaLui added a commit to redpanda-data/kafka-streams-examples that referenced this issue Oct 25, 2022
The wikipedia driver was originally coded to produce messages with a
random number generator. The generator was for the range [0,99] which
meant there was a chance that no records were produced. This caused a
problem in Redpanda integration testing where success is determined by
parsing output.

This commit ensures that at least 1 record is always sent to the example
topics.

Fixes: redpanda-data/redpanda#2889
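
The change described in this commit message can be sketched as follows. This is an illustrative Python sketch, not the driver's actual code; `records_to_send_old` and `records_to_send_new` are hypothetical names. It assumes the record count was drawn uniformly from the integers 0 to 99, in which case a single draw is 0 with probability 1/100. The real fix may have shifted the range rather than clamped it; the commit message only says that at least 1 record is always sent.

```python
import random


def records_to_send_old(rng):
    """Old behaviour as described: a count in [0, 99], so 0 is possible."""
    return rng.randint(0, 99)


def records_to_send_new(rng):
    """Fixed behaviour as described: always at least 1 record."""
    return max(1, rng.randint(0, 99))


rng = random.Random(2889)  # fixed seed so the sketch is repeatable
old_counts = [records_to_send_old(rng) for _ in range(100_000)]
new_counts = [records_to_send_new(rng) for _ in range(100_000)]

# Share of simulated runs that would have produced no records at all.
print("old: share of runs with 0 records =", old_counts.count(0) / len(old_counts))
# With the clamp in place, no simulated run sends fewer than 1 record.
print("new: minimum records per run =", min(new_counts))
```

If a single draw like this decides whether a run produces any data, a 1-in-100 chance of an empty run would be in line with the 10 failures in 1000 runs reported earlier in this thread.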
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Oct 25, 2022
The new commit incorporates a small change to the KStreams Wikipedia
example. Previously it was possible for 0 messages to be sent to topics.
Now a minimum of 1 message is sent. This problem manifested as a ducktape
timeout error since the test relies on parsing output.

Fixes: redpanda-data#2889
This issue was closed.