ExampleRunner times out in KafkaStreamsPageView.test_kafka_streams_page_view #4637
Comments
This is commented out rather than ok_to_fail, because of the way these tests are defined, as subclasses rather than unique test methods. Related redpanda-data#4637
On CI, the failure is happening because the KafkaStreams Driver is failing to report output which is necessary to check for test success. I'm still working on a reproducer for our local setups. |
New occurrence:
|
I think this is still live, right? Just now with logging so that we can figure it out when it has an OFAIL state. |
Correct. I'm letting the test run in CI until Friday at which point I'll check CI results across OFAIL-ed instances. |
Since improving logging, this test OFAIL'd once in the last 7 days (https://buildkite.com/redpanda/redpanda/builds/10528#0180f9ec-3da4-404b-af15-0058aa83c463/562-6876). That failure revealed that the internal Java application is throwing a |
TL;DR: the root cause is unknown, but a hacky fix is to edit the Java application to handle null pointers. More investigation revealed that the "region" (a field within an Avro record) can sometimes be null.
That line of code is here. But that should not happen, because the load generator selects a region within the proper bounds of its region array.
That line of code is here. This particular application has a list of users who each reside in a "region", and each user has their own "web page." The load generator issues a random number of page views on each user's web page and sends the view count to a separate topic from the region name. The application then joins records between the two topics (shown here). The javadoc reveals that "region" can be null when the system does not read a valid record from the topic stream. It is likely that some inconsistency in the ratio of page views to region names leads to a null region. |
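To make the suspected failure mode concrete, here is a minimal Python model of a left join between page views and regions. It is only a sketch of the join semantics described above, not the Java application's code: the names `left_join`, `region_or_default`, and the `"UNKNOWN"` placeholder are all hypothetical. It assumes that a view arriving before its user's region record is joined against a missing (null) region.

```python
from typing import Dict, List, Optional, Tuple


def left_join(
    views: List[Tuple[str, int]], regions: Dict[str, str]
) -> List[Tuple[str, int, Optional[str]]]:
    """Model of a left join: every view is emitted, and the region is
    None when no region record exists (yet) for that user."""
    return [(user, count, regions.get(user)) for user, count in views]


def region_or_default(region: Optional[str], default: str = "UNKNOWN") -> str:
    """The 'hacky fix' in miniature: tolerate a missing region instead of
    dereferencing it. The default value is an assumption for illustration."""
    return default if region is None else region


# Assumed scenario: bob's page views are read before his region record.
regions = {"alice": "europe"}
views = [("alice", 3), ("bob", 5)]

joined = left_join(views, regions)

# Aggregate view counts per region without failing on the missing value.
counts_by_region: Dict[str, int] = {}
for user, count, region in joined:
    key = region_or_default(region)
    counts_by_region[key] = counts_by_region.get(key, 0) + count
```

A guard like this hides the symptom only; it does not explain why a view would be read before its region record.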
KafkaStreamsPageView.test_kafka_streams
KafkaStreamsPageView
.test_kafka_streams_page_view
Recently failed for an issue separate from the NullPointer. The new one is on the schema registry but could be related. Going back through the logs of previous failures, there is a backtrace that I still need to investigate. |
There is a pattern: in all occurrences of this failure, the null value is read close to app startup. In this test there are two Java apps running concurrently. Given how rarely this failure occurs (once every couple of months), it is likely that there is a data race here. I designed this test in a way where the load generator can start before the other application is ready to receive data. I'll submit a patch that redesigns the tests to prevent this. |
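One way to remove that ordering race is to gate the load generator on a readiness check for the consuming application. The sketch below is an assumption about what such a redesign could look like, not the actual patch: `FakeStreamsApp`, `is_ready`, and the event strings are hypothetical stand-ins, and the polling helper is written out so the example is self-contained (ducktape provides a similar `wait_until` utility).

```python
import time
from typing import Callable, List


def wait_until(
    condition: Callable[[], bool], timeout_sec: float, backoff_sec: float = 0.1
) -> None:
    """Poll `condition` until it returns True or `timeout_sec` elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(backoff_sec)


class FakeStreamsApp:
    """Hypothetical stand-in for the consuming Java app: it reports ready
    only after it has been polled a given number of times."""

    def __init__(self, polls_until_ready: int) -> None:
        self._polls = 0
        self._polls_until_ready = polls_until_ready

    def is_ready(self) -> bool:
        self._polls += 1
        return self._polls >= self._polls_until_ready


events: List[str] = []
app = FakeStreamsApp(polls_until_ready=3)

# Start the load generator only after the consumer reports readiness.
wait_until(app.is_ready, timeout_sec=5, backoff_sec=0.01)
events.append("consumer ready")
events.append("load generator started")
```

What counts as "ready" for the real application (for example, a specific log line or an assigned partition) is left open here, since the thread does not say.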
We haven't seen it for more than two weeks, so closing on the assumption that it's fixed (if it isn't, it'll pop up as a PR blocker). |
https://buildkite.com/redpanda/redpanda/builds/9879#340a6f85-c026-4e64-a8ba-59ccbd86db2f