Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExampleRunner times out in KafkaStreamsPageView.test_kafka_streams_page_view #4637

Closed
rystsov opened this issue May 10, 2022 · 9 comments · Fixed by #4805
Closed

ExampleRunner times out in KafkaStreamsPageView.test_kafka_streams_page_view #4637

rystsov opened this issue May 10, 2022 · 9 comments · Fixed by #4805
Assignees
Labels

Comments

@rystsov
Copy link
Contributor

rystsov commented May 10, 2022

https://buildkite.com/redpanda/redpanda/builds/9879#340a6f85-c026-4e64-a8ba-59ccbd86db2f

test_id:    rptest.tests.compatibility.kafka_streams_test.KafkaStreamsPageView.test_kafka_streams
status:     FAIL
run time:   10 minutes 49.499 seconds


    TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ExampleRunner-1-140248366961616 node 1 on docker-rp-3']")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/compatibility/kafka_streams_test.py", line 84, in test_kafka_streams
    driver.wait()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ExampleRunner-1-140248366961616 node 1 on docker-rp-3']
@rystsov rystsov added kind/bug Something isn't working area/tests ci-failure labels May 10, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue May 16, 2022
This is commented out rather than ok_to_fail, because
of the way these tests are defined, as subclasses rather
than unique test methods.

Related redpanda-data#4637
@piyushredpanda piyushredpanda assigned NyaliaLui and unassigned VadimPlh May 16, 2022
@NyaliaLui
Copy link
Contributor

NyaliaLui commented May 16, 2022

On CI, the failure is happening because the KafkaStreams Driver is failing to report output which is necessary to check for test success.

I'm still working on a reproducer for our local setups.

@dimitriscruz
Copy link
Contributor

New occurrence:

FAIL test: KafkaStreamsPageView.test_kafka_streams (1/30 runs)
  failure at 2022-05-16T07:48:39.505Z: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ExampleRunner-1-140165779509104 node 1 on docker-rp-14']")
      in job https://buildkite.com/redpanda/redpanda/builds/10173#3d435509-259b-4f78-b100-ead4703ded91
test_id:    rptest.tests.compatibility.kafka_streams_test.KafkaStreamsPageView.test_kafka_streams
--
  | status:     FAIL
  | run time:   10 minutes 49.025 seconds
  |  
  |  
  | TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ExampleRunner-1-140165779509104 node 1 on docker-rp-14']")
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/compatibility/kafka_streams_test.py", line 84, in test_kafka_streams
  | driver.wait()
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/services/background_thread.py", line 72, in wait
  | super(BackgroundThreadService, self).wait(timeout_sec)
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/services/service.py", line 267, in wait
  | raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
  | ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ExampleRunner-1-140165779509104 node 1 on docker-rp-14']

jcsp added a commit to jcsp/redpanda that referenced this issue May 17, 2022
This is commented out rather than ok_to_fail, because
of the way these tests are defined, as subclasses rather
than unique test methods.

Related redpanda-data#4637
jcsp added a commit to jcsp/redpanda that referenced this issue May 17, 2022
This is commented out rather than ok_to_fail, because
of the way these tests are defined, as subclasses rather
than unique test methods.

Related redpanda-data#4637
@jcsp
Copy link
Contributor

jcsp commented May 24, 2022

I think this is still live, right? Just now with logging so that we can figure it out when it has an OFAIL state.

@jcsp jcsp reopened this May 24, 2022
@NyaliaLui
Copy link
Contributor

NyaliaLui commented May 24, 2022

I think this is still live, right? Just now with logging so that we can figure it out when it has an OFAIL state.

Correct. I'm letting the test run in CI until Friday at which point I'll check CI results across OFAIL-ed instances.
Thanks for re-opening 👍🏿

@NyaliaLui
Copy link
Contributor

Since improving logging, this test OFAIL'd once in the last 7 days https://buildkite.com/redpanda/redpanda/builds/10528#0180f9ec-3da4-404b-af15-0058aa83c463/562-6876

That failure revealed that the internal Java application is throwing a NullPointerException at runtime. I do not immediately know what's causing the Null Pointer so I'll need to do a deeper investigation. If the issue is not caused by something on the Redpanda side then I may need to post the problem upstream on the KStreams original repo.

@NyaliaLui
Copy link
Contributor

NyaliaLui commented Jun 8, 2022

TL;DR - the root cause is unknown but a hacky fix is to edit the java application to handle null pointers

More investigation revealed that the "region" (a field within an Avro Record) can sometimes be null.

.leftJoin(userRegions, (view, region) -> {
  final GenericRecord viewRegion = new GenericData.Record(schema);
  viewRegion.put("user", view.get("user"));
  viewRegion.put("page", view.get("page"));
  viewRegion.put("region", region); // <-- region here is null sometimes
  return viewRegion;
})

That line of code is here

But that should not happen because the load generator is selecting a region within the proper bounds of its region array

    final String[] regions = {"europe", "usa", "asia", "africa"};
   ...
        final String some_region = regions[random.nextInt(regions.length)]; // <-- selects a region within [0,3] which is legal
        userProfileBuilder.set("region", some_region);

That line of code is here

This particular application has a list of users whom reside in a "region" and each user has their own "web page." The load generator issues a random number of page views on each user's webpage and sends the view count to a separate topic from the region name.

Furthermore, the application joins records between two topics (shown here). The javadoc reveals that it's possible for "region" to be null when the system does not read a valid record from the topic stream If no KTable record was found during lookup, a null value will be provided to ValueJoiner. Doc is here.

It is likely that there is some inconsistency between the ratio of page views and region names that leads to a null region.
It is not clear if the inconsistency comes from Redpanda but I would believe so since confluent, the authors of this application, run this test frequently and have not reported a similar issue. The redpanda log does not reveal anything troubling however, so more investigation is needed. I could hack the java application to handle null values but that doesn't solve the root cause.

@jcsp jcsp changed the title ExampleRunner times out in KafkaStreamsPageView.test_kafka_streams ExampleRunner times out in KafkaStreamsPageView.test_kafka_streams_page_view Jul 27, 2022
@NyaliaLui
Copy link
Contributor

Recently failed for an issue separate to the NullPointer. The new one is on the schema registry but could be related.
https://buildkite.com/redpanda/redpanda/builds/13222#0182430a-350f-4821-890d-e64b70649e24

Going back through the logs of previous failures, there is an backtrace that I can go through and need to investigate.

@NyaliaLui
Copy link
Contributor

There is a pattern In all occurrences of this failure where the Null value is read close around app startup.

In this test there are two java apps running concurrently. Given how uncommon this failure occurs (once every couple months) it is likely that there is some data race going on here. I designed this test in a way where the load generator can start before the other application is ready to receive data. I'll submit a patch that redesigns the tests to prevent this.

@rystsov
Copy link
Contributor Author

rystsov commented Sep 13, 2022

We haven't seen it for more than two weeks, closing assuming it's fixed (if it isn't then it'll pop up as a pr blocker).

@rystsov rystsov closed this as completed Sep 13, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants