failure of rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic test #3400

Closed
andrewhsu opened this issue Jan 5, 2022 · 6 comments · Fixed by #3433 or #3538
Labels: ci-failure, kind/bug (Something isn't working)

Comments

@andrewhsu
Member

Version & Environment

Scheduled nightly run of tests using code from dev branch on git commit 80f8a78.

In the debug-clang-amd64 step:
https://buildkite.com/vectorized/redpanda/builds/5938#e00fdbdd-8b61-4cef-b0f8-ddb04a138711/1411-4370

What went wrong?

The Buildkite job is red, and the logs indicate a failure in the rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic test.

What should have happened instead?

World peace.

How to reproduce the issue?

I haven't tried to reproduce this on my local dev env and there has not been another buildkite build on the dev branch yet.

Additional information

From the logs:

test_id:    rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic
status:     FAIL
run time:   5 minutes 31.199 seconds
 
    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 340, in test_dynamic
    self._move_and_verify()
  File "/root/tests/rptest/tests/partition_movement_test.py", line 155, in _move_and_verify
    wait_until(status_done, timeout_sec=90, backoff_sec=2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
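
The traceback ends in an empty TimeoutError('') because no err_msg was passed to wait_until. As a rough illustration of the polling pattern visible in the traceback, here is a minimal standalone sketch. It is not ducktape's actual implementation: it raises Python's built-in TimeoutError as a stand-in for ducktape.errors.TimeoutError, and the status_done function and the message text are hypothetical placeholders for the test's real check.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll `condition` until it returns a truthy value or `timeout_sec` elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            # err_msg may be a plain string or a zero-argument callable,
            # mirroring the raise statement shown in the traceback.
            raise TimeoutError(err_msg() if callable(err_msg) else err_msg)
        time.sleep(backoff_sec)


# Hypothetical stand-in for the test's status_done check:
# it reports "done" on the third poll.
attempts = {"count": 0}


def status_done():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(
    status_done,
    timeout_sec=1,
    backoff_sec=0.01,
    err_msg=lambda: f"move not done after {attempts['count']} polls",
)
```

Passing a descriptive err_msg (string or callable) would make a failure like this one easier to triage, since the report would show which wait timed out instead of an empty message.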
andrewhsu added the kind/bug and ci-failure labels Jan 5, 2022
@andrewhsu
Member Author

andrewhsu commented Jan 5, 2022

I saw an older issue about the partition movement test test_dynamic, but I'm not sure whether it has a related cause: #2588

The log message from that older issue looks really similar:

test_id:    rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic
status:     FAIL
run time:   4 minutes 40.767 seconds
 
    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 340, in test_dynamic
    self._move_and_verify()
  File "/root/tests/rptest/tests/partition_movement_test.py", line 163, in _move_and_verify
    wait_until(derived_done, timeout_sec=90, backoff_sec=2)
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

@dswang
Contributor

dswang commented Jan 10, 2022

@mmaslankaprv Would you please take a look?

@andrewhsu
Member Author

Seen again today:
https://buildkite.com/vectorized/redpanda/builds/6023#979671b6-510d-4123-a0d6-6b0b53ec31e4/6677-9661

test_id:    rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic
status:     FAIL
run time:   5 minutes 11.028 seconds
 
    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 340, in test_dynamic
    self._move_and_verify()
  File "/root/tests/rptest/tests/partition_movement_test.py", line 155, in _move_and_verify
    wait_until(status_done, timeout_sec=90, backoff_sec=2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

Note: a different test also failed in the same build, but I'm not sure whether it is related: #3277 (comment)

@mmaslankaprv
Member

The last instance of the error occurred because recovering the offset translator state took a very long time. I opened a PR that moves the offset_translator state together with the partition when it is moved across cores.

@NyaliaLui
Contributor

@mmaslankaprv
Member

This failed because of a bad log line.

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 19, 2022
Added configuration replication error to log entries error allow list.
The configuration replication may fail when there was a truncation
during partition movement operation. This is perfectly fine,
configuration update operation should be retried.

Fixes: redpanda-data#3400

Signed-off-by: Michal Maslanka <michal@vectorized.io>
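
For readers unfamiliar with the idea behind the commit above, a log allow list can be sketched roughly as follows: error lines that match a known-benign pattern are ignored, and any other error line fails the test. This is only an illustration under stated assumptions. The pattern, the function name unexpected_error_lines, and the sample log lines are all made up here; the actual allow-list entry and log format are not shown in this thread.

```python
import re

# Hypothetical allow-list pattern; the real entry added by the commit
# is not shown in this thread.
ALLOW_LIST = [re.compile(r"configuration replication")]


def unexpected_error_lines(log_lines, allow_list):
    """Return the ERROR lines that match no allow-list pattern."""
    unexpected = []
    for line in log_lines:
        if "ERROR" not in line:
            continue  # only error lines are checked
        if any(pattern.search(line) for pattern in allow_list):
            continue  # known-benign error, ignore it
        unexpected.append(line)
    return unexpected


# Made-up sample lines, not real Redpanda log output.
sample = [
    "INFO example: partition move started",
    "ERROR example: configuration replication failed, will retry",
    "ERROR example: disk failure",
]

bad = unexpected_error_lines(sample, ALLOW_LIST)
```

With these sample lines, only the line that matches no allow-list pattern is reported, so a retried configuration replication error would no longer fail the run.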
This issue was closed.
4 participants