Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_kafka_streams_page_view: "Unexpected files in data directory" #7680

Closed
dotnwat opened this issue Dec 9, 2022 · 7 comments
Closed

test_kafka_streams_page_view: "Unexpected files in data directory" #7680

dotnwat opened this issue Dec 9, 2022 · 7 comments
Assignees
Labels
area/kafka ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@dotnwat
Copy link
Member

dotnwat commented Dec 9, 2022

https://buildkite.com/redpanda/redpanda/builds/19571#0184f582-22ce-49b9-80cb-5c237459b23f

[INFO  - 2022-12-09 06:56:51,014 - service - free - lineno:311]: RedpandaService-0-281473029448528 node 3 on docker-rp-27: freeing node
[INFO  - 2022-12-09 06:56:51,014 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.compatibility.kafka_streams_test.KafkaStreamsPageView.test_kafka_streams_page_view: Summary: RuntimeError('Unexpected files in data directory')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 133, in run
    self.setup_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 218, in setup_test
    self.test.setup()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/test.py", line 91, in setup
    self.setUp()
  File "/root/tests/rptest/tests/redpanda_test.py", line 99, in setUp
    self.redpanda.start()
  File "/root/tests/rptest/services/redpanda.py", line 936, in start
    raise RuntimeError("Unexpected files in data directory")
RuntimeError: Unexpected files in data directory
Test requested 5 nodes, used only 3
[INFO  - 2022-12-09 06:56:51,016 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.compatibility.kafka_streams_test.KafkaStreamsPageView.test_kafka_streams_page_view: Data: None
@dotnwat dotnwat added kind/bug Something isn't working ci-failure labels Dec 9, 2022
@piyushredpanda
Copy link
Contributor

@NyaliaLui -- this is back again? Can you please quickly triage this? Does the test need an overhaul? Would love to have this test be consistently passing and not be back with failures often (as I think it has been -- but I could be wrong?)

@NyaliaLui
Copy link
Contributor

NyaliaLui commented Dec 19, 2022

The test failed because the __consumer_offsets topic was still on the disk at startup. It is likely that another test, using the same node(s), failed and did not clean __consumer_offsets from disk. I'm searching the DT run now to see what test caused this.

@NyaliaLui
Copy link
Contributor

NyaliaLui commented Dec 19, 2022

Yeah it looks like the failing ConsumerGroupTest at #7681 may have left state in the data directory. This is still a working hypothesis as I'm still looking for confirmation but it is likely that 7681 also caused all the other failures in that DT run.

@dotnwat dotnwat added the sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages label Jan 22, 2023
@rystsov
Copy link
Contributor

rystsov commented Jan 26, 2023

@dotnwat why do you think it's sev/low? (not criticizing, asking mostly for the educational purposes and to make sure we have an explicit reason expressed in case the conditions change and we need to re-evaluate)

@dotnwat
Copy link
Member Author

dotnwat commented Jan 27, 2023

@rystsov my interpretation of @NyaliaLui's investigation was that this test passes the vast majority of the time, but some other test in this case had a clean-up issue that this test discovered causing the failure. so there wasn't an indication of any underlying problem with redpanda.

happy to change sev level. still trying to understand how to apply sev label.

@NyaliaLui
Copy link
Contributor

NyaliaLui commented Jan 31, 2023

Looking back at this failure, it looks like the KafkaCliConsumer from ConsumerGroupTest.test_dead_group_recovery failed to stop. This means that when other tests are started, the __consumer_offsets topic is created because the consumer is still running. This would explain why many tests, not just Kstreams, failed with "Unexpected files in data directory".

This is the same reason that #6490 failed but in that ticket, the consumer failed on consumer.wait instead of consumer.stop. #6501 was the fix.

So, I think this failure is not related to this test and therefore can be closed.
I'll speak with the team as well as as @VladLazar as he is the author for #6501 before closing but I think the approprate action here is to include an overload to service.clean_node method in the KafkaCliConsumer

@NyaliaLui
Copy link
Contributor

NyaliaLui commented Feb 9, 2023

TLDR - I confirmed locally that the failure in this ticket is caused by the consumer in ConsumerGroupTest.test_dead_group_recovery failing to stop

I attempted to recreate this error locally by:
Commenting out the below lines in kafa_cli_consumer.py which is the consumer used in ConsumerGroupTest.test_dead_group_recovery:

    def stop_node(self, node):
        self._stopping.set()
        # node.account.kill_process("java", clean_shutdown=True)

        # try:
        #     wait_until(lambda: self._done is None or self._done == True,
        #                timeout_sec=10)
        # except:
        #     self.logger.warn(
        #         f"{self._instance_name} running on {node.name} failed to stop gracefully"
        #     )
        #     node.account.kill_process("java", clean_shutdown=False)
        #     wait_until(
        #         lambda: self._done is None or self._done == True,
        #         timeout_sec=5,
        #         err_msg=
        #         f"{self._instance_name} running on {node.name} failed to stop after SIGKILL"
        #     )

I then ran the consumer group test with the kafka streams test to see what happens task rp:run-ducktape-tests DUCKTAPE_ARGS="tests/rptest/tests/consumer_group_test.py::ConsumerGroupTest.test_dead_group_recovery tests/rptest/tests/compatibility/kafka_streams_test.py::KafkaStreamsPageView" BUILD_TYPE=debug. Here are the results ...

First we see that the consumer group test with static_members=False fails because some consumers are still alive (that is expected due to my changes above)

[INFO:2023-02-09 21:22:23,682]: RunnerClient: rptest.tests.consumer_group_test.ConsumerGroupTest.test_dead_group_recovery.static_members=False: Setting up...
[WARNING - 2023-02-09 21:22:24,789 - redpanda - cov_enabled - lineno:2590]: enable_cov should be one of 'ON', or 'OFF'
[WARNING - 2023-02-09 21:22:24,792 - redpanda - cov_enabled - lineno:2590]: enable_cov should be one of 'ON', or 'OFF'
[WARNING - 2023-02-09 21:22:24,826 - redpanda - cov_enabled - lineno:2590]: enable_cov should be one of 'ON', or 'OFF'
[INFO:2023-02-09 21:22:28,906]: RunnerClient: rptest.tests.consumer_group_test.ConsumerGroupTest.test_dead_group_recovery.static_members=False: Running...
[WARNING - 2023-02-09 21:22:31,232 - service - clean_node - lineno:304]: KafkaCliConsumer-0-140525057230112: clean_node has not been overriden. This may be fine if the service leaves no persistent state.
[WARNING - 2023-02-09 21:22:31,233 - service - clean_node - lineno:304]: KafkaCliConsumer-1-140525057239904: clean_node has not been overriden. This may be fine if the service leaves no persistent state.
[WARNING - 2023-02-09 21:22:35,531 - service - clean_node - lineno:304]: RpkProducer-0-140525057230400: clean_node has not been overriden. This may be fine if the service leaves no persistent state.
[ERROR - 2023-02-09 21:32:41,686 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_group_test.py", line 391, in test_dead_group_recovery
    c.wait()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['KafkaCliConsumer-0-140525057230112 node 1 on docker-rp-4']
[INFO:2023-02-09 21:32:41,888]: RunnerClient: rptest.tests.consumer_group_test.ConsumerGroupTest.test_dead_group_recovery.static_members=False: FAIL: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['KafkaCliConsumer-0-140525057230112 node 1 on docker-rp-4']")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_group_test.py", line 391, in test_dead_group_recovery
    c.wait()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['KafkaCliConsumer-0-140525057230112 node 1 on docker-rp-4']

Then the next tests fail because of unexpected files in the data directory

test_id:    rptest.tests.consumer_group_test.ConsumerGroupTest.test_dead_group_recovery.static_members=True
status:     FAIL
run time:   11.379 seconds


    RuntimeError('Unexpected files in data directory')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 133, in run
    self.setup_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 218, in setup_test
    self.test.setup()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/test.py", line 91, in setup
    self.setUp()
  File "/root/tests/rptest/tests/redpanda_test.py", line 99, in setUp
    self.redpanda.start()
  File "/root/tests/rptest/services/redpanda.py", line 1033, in start
    raise RuntimeError("Unexpected files in data directory")
RuntimeError: Unexpected files in data directory
Test requested 6 nodes, used only 3
--------------------------------------------------------------------------------
test_id:    rptest.tests.compatibility.kafka_streams_test.KafkaStreamsPageView.test_kafka_streams_page_view
status:     FAIL
run time:   11.359 seconds


    RuntimeError('Unexpected files in data directory')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 133, in run
    self.setup_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 218, in setup_test
    self.test.setup()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/test.py", line 91, in setup
    self.setUp()
  File "/root/tests/rptest/tests/redpanda_test.py", line 99, in setUp
    self.redpanda.start()
  File "/root/tests/rptest/services/redpanda.py", line 1033, in start
    raise RuntimeError("Unexpected files in data directory")
RuntimeError: Unexpected files in data directory
Test requested 5 nodes, used only 3

So I look at the debug log to see what files are in the data directory

[ERROR - 2023-02-09 21:33:05,595 - redpanda - start - lineno:1030]: node docker-rp-6: unexpected namespace: kafka, topics: {'__consumer_offsets'}
[INFO  - 2023-02-09 21:33:05,596 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.compatibility.kafka_streams_test.KafkaStreamsPageView.test_kafka_streams_page_view: FAIL: RuntimeError('Unexpected files in data directory')

We see that __consumer_offsets is still in the data directory causing the failure. Considering that our docker containers share the same disk, both locally and in CI, it makes sense that we would get a Runtime error here due to leftover state.

Therefore, I am closing this ticket because it seems that the kafka_streams_page_view test and the other tests in this DT run failed because the Kafka CLI consumer went rogue in ConsumerGroupTest.test_dead_group_recovery. The issue with the consumer group test is already reported in #7681.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kafka ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages
Projects
None yet
Development

No branches or pull requests

5 participants