Ducktape test to execute workloads through multi-release upgrades #8253

Merged
merged 8 commits into from
Jun 30, 2023

andijcr
Contributor

@andijcr andijcr commented Jan 16, 2023

A framework to execute tests on a sequence of Redpanda versions, composed of two components:

  • PWorkload, an interface to implement
  • RedpandaUpgradeTest, a test that runs a collection of PWorkloads through a cluster while upgrading it

A workload can implement PWorkload, spin up an external producer/consumer, and check progress on the external service in both a stable and an upgrading cluster.
Workloads run in parallel, so care should be taken not to disrupt other workloads, for example by changing cluster properties or removing nodes.
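
As a rough illustration of how several workloads can share one cluster, the sketch below polls a set of workloads against each version until every one reports completion. Only `PWorkload.DONE` appears in this PR; the values of `DONE` and `NOT_DONE`, `CountingWorkload`, and `run_through_versions` are hypothetical, polling is interleaved rather than truly parallel, and the real runner drives an actual Redpanda cluster rather than a list of version tuples.

```python
from typing import Tuple

Version = Tuple[int, int, int]  # stand-in for RedpandaVersionTriple

DONE = -1     # assumption: sentinel mirroring PWorkload.DONE
NOT_DONE = 1  # assumption: hypothetical "poll me again" value


class CountingWorkload:
    """Hypothetical workload that needs a few polls per version to finish."""
    def __init__(self, name: str, polls_needed: int):
        self.name = name
        self.polls_needed = polls_needed
        self.polls = 0
        self.completed: list[Version] = []

    def on_cluster_upgraded(self, version: Version) -> int:
        self.polls += 1
        if self.polls < self.polls_needed:
            return NOT_DONE
        self.polls = 0
        self.completed.append(version)
        return DONE


def run_through_versions(workloads, versions, max_polls=100):
    """Poll all workloads on each version until all of them report DONE."""
    for version in versions:
        pending = list(workloads)
        for _ in range(max_polls):
            # Keep only the workloads that still need more polls.
            pending = [
                w for w in pending if w.on_cluster_upgraded(version) != DONE
            ]
            if not pending:
                break
        else:
            stuck = [w.name for w in pending]
            raise TimeoutError(f"workloads stuck on {version}: {stuck}")
```

Because every workload is polled against the same cluster state, a workload that changed cluster properties or removed nodes would affect all the others, which is the caveat mentioned above.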

Various tests that follow an upgrade-then-test pattern can be moved into this test, reducing the number of tests and total CI execution time while testing more rigorously.

Fixes #7310

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

  • none

@andijcr andijcr force-pushed the feat/test/upgrade_tests branch 2 times, most recently from e6c2394 to eb08344 Compare March 24, 2023 12:12
@andijcr andijcr marked this pull request as ready for review March 24, 2023 12:13
@andijcr andijcr requested review from andrwng and jcsp March 24, 2023 12:23
@andijcr andijcr changed the title from "Draft: Feat/test/upgrade tests" to "test that executes workloads during multi-release upgrades" Mar 24, 2023
@andijcr andijcr changed the title from "test that executes workloads during multi-release upgrades" to "Ducktape test to execute workloads through multi-release upgrades" Mar 27, 2023
Contributor

@andrwng andrwng left a comment

Just a couple thoughts on things to consider w.r.t versioning, nothing really blocking. I think I need to digest the workload protocol and adapters a bit more and let them sit, but at first glance this looks great!

tests/rptest/services/redpanda_installer.py (outdated, resolved)
tests/rptest/services/redpanda_installer.py (resolved)
tests/rptest/services/workload_protocol.py (resolved)
tests/rptest/tests/workload_upgrade_runner_test.py (outdated, resolved)
tests/rptest/tests/workload_upgrade_runner_test.py (outdated, resolved)
@andijcr
Contributor Author

andijcr commented Jun 28, 2023

https://buildkite.com/redpanda/redpanda/builds/32089#0188fed7-8481-4e6d-92ae-d25de019d22f
failures are
#10836
#11449
#11169

and a failure related to this PR in debug mode:

File "/root/tests/rptest/tests/workload_producer_consumer.py", line 86, in end
    consumer.wait()

edit: actually, the timeout happens before the check in the test itself; it's related to the kgo consumer

@andijcr
Contributor Author

andijcr commented Jun 28, 2023

/ci-repeat 3 debug skip-debug dt-repeat=3 tests/rptest/tests/workload_upgrade_runner_test.py::RedpandaUpgradeTest.test_workloads_through_releases

Contributor

@VladLazar VladLazar left a comment

The test code looks fine to me. Does the new test pass reliably-ish?

timeout_sec=30,
backoff_sec=1)

old_version = current_version
Contributor

nit: is this assignment needed?

Contributor Author

old_version is passed on line 258 to mid_upgrade_check(), for the nodes that are not yet updated
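
To illustrate why the old version is still needed mid-upgrade, here is a sketch that builds the per-node version map a mid-upgrade check could receive. The node names, the `rolling_upgrade` helper, and the callback signature are assumptions for illustration; only the idea that not-yet-upgraded nodes still run `old_version`, and that the check happens once half the cluster is upgraded, comes from this conversation.

```python
from typing import Callable, Dict, List, Tuple

Version = Tuple[int, int, int]  # stand-in for RedpandaVersionTriple


def rolling_upgrade(
    nodes: List[str],
    old_version: Version,
    new_version: Version,
    mid_upgrade_check: Callable[[Dict[str, Version]], None],
) -> Dict[str, Version]:
    """Upgrade the first half of the nodes, run a check, then upgrade the rest."""
    versions = {node: old_version for node in nodes}
    half = len(nodes) // 2
    for node in nodes[:half]:
        versions[node] = new_version
    # The cluster is now mixed: the first half runs new_version,
    # the remaining nodes still run old_version.
    mid_upgrade_check(dict(versions))
    for node in nodes[half:]:
        versions[node] = new_version
    return versions
```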

Comment on lines +439 to +447
head_line = self.head_version()[0:2]
oldest_supported_line = (head_line[0] - 1, head_line[1])
latest_unsupported_line = (oldest_supported_line[0],
                           oldest_supported_line[1] - 1)
if latest_unsupported_line[1] == 0:
    # if going back, version vX.0 is v(X-1).3
    latest_unsupported_line = (latest_unsupported_line[0] - 1, 3)
return latest_unsupported_line
Contributor

nit: it would be nice to wrap this version back and forth in a type (i.e. RedpandaVersionTriple becomes a dataclass and this stuff is wrapped in a member)

Contributor Author

The pain with a dataclass is that we use "head" as a version in the codebase, and sometimes we need to convert it to an actual triple. That requires RedpandaInstaller to do it, and in principle a query to a running instance of redpanda, which somewhat breaks the encapsulation of a dataclass.
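
For reference, a minimal sketch of what wrapping the version-line arithmetic in a type might look like. `VersionLine`, `previous`, and the free function below are hypothetical names, not part of the PR; the sketch assumes, as the snippet under review does, that each major release has minors 1 to 3. It takes an already-resolved head line as input, so resolving "head" to concrete numbers stays the caller's job, which is the encapsulation concern raised in the reply.

```python
from typing import NamedTuple


class VersionLine(NamedTuple):
    """Hypothetical (major, minor) release line, e.g. v23.1 -> VersionLine(23, 1)."""
    major: int
    minor: int

    def previous(self) -> "VersionLine":
        # Assumption from the reviewed snippet: minors run 1..3,
        # so the line before vX.1 is v(X-1).3.
        if self.minor <= 1:
            return VersionLine(self.major - 1, 3)
        return VersionLine(self.major, self.minor - 1)


def latest_unsupported_line(head_line: VersionLine) -> VersionLine:
    """Line just below the oldest supported one (head minus one major)."""
    oldest_supported = VersionLine(head_line.major - 1, head_line.minor)
    return oldest_supported.previous()
```

A `NamedTuple` still compares equal to a plain tuple, so code that passes `(major, minor)` tuples around would keep working with this shape.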

Comment on lines 82 to 83
# old release have a bug where cloud_storage_enabled is sticky. forcing a leadership transfer is a workaround for this
#Admin(self.ctx.redpanda).partition_transfer_leadership("kafka", self.topic.name, 1)
Contributor

I remember discussing this. Did this solution work?

Contributor Author

Forgot to remove the comment, but it didn't appear to work reliably. The solution used here is to just start with v22.3 for cloud storage

return PWorkload.DONE

@abstractmethod
def progress(self, version: RedpandaVersionTriple) -> int:
Contributor

nit: I'd rename this to something like on_cluster_upgraded

Contributor Author

✔️

"""
return

def partial_progress(self, versions: dict[Any,
Contributor

nit: I'd rename this to on_partial_cluster_upgrade

Contributor Author

✔️

def partial_progress(self, versions: dict[Any,
RedpandaVersionTriple]) -> int:
"""
This method is called while upgrading a cluster, after each node is upgraded.
Contributor

nit: Is this comment correct? Looks like this is called after half the cluster was upgraded

Contributor Author

Right, comment out of sync

and some other Python-related typing

This callback is used to run checks mid-upgrade.

This Protocol defines the interface that a workload has to implement to
run inside the workload upgrade runner, introduced in the next commit.

Only the methods marked as abstract need to be implemented; the others
are optional.

Additionally, tests/rptest/tests/workload_dummy.py provides an example
implementation.
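
A minimal sketch of the shape such a Protocol could take, with one abstract method and optional hooks that have default bodies. The method names `on_cluster_upgraded` and `on_partial_cluster_upgrade` follow the renames agreed in review; `begin`, the sentinel values, the default return values, and `DummyWorkload` are assumptions for illustration and are not copied from workload_protocol.py or workload_dummy.py.

```python
from abc import abstractmethod
from typing import Any, Protocol, Tuple

Version = Tuple[int, int, int]  # stand-in for RedpandaVersionTriple


class WorkloadSketch(Protocol):
    DONE = -1     # assumption: the real PWorkload.DONE value is not shown here
    NOT_DONE = 1  # assumption: hypothetical counterpart to DONE

    def begin(self) -> None:
        """Optional: start external producers/consumers."""
        return

    @abstractmethod
    def on_cluster_upgraded(self, version: Version) -> int:
        """Required: check progress on a fully upgraded cluster."""
        raise NotImplementedError

    def on_partial_cluster_upgrade(self, versions: dict[Any, Version]) -> int:
        """Optional: check progress while the cluster runs mixed versions."""
        return WorkloadSketch.DONE

    def end(self) -> None:
        """Optional: stop external services."""
        return


class DummyWorkload(WorkloadSketch):
    """Implements only the abstract method and inherits the optional hooks."""
    def __init__(self) -> None:
        self.seen: list[Version] = []

    def on_cluster_upgraded(self, version: Version) -> int:
        self.seen.append(version)
        return WorkloadSketch.DONE
```

Subclassing the Protocol explicitly, as `DummyWorkload` does, is one way to inherit the optional defaults; a class that only matches the method signatures structurally would have to define every method itself.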
@andijcr andijcr force-pushed the feat/test/upgrade_tests branch 2 times, most recently from faadc8d to be61a1f Compare June 29, 2023 07:45
@andijcr
Contributor Author

andijcr commented Jun 29, 2023

force push: fixed a missing function rename in the code

@andijcr
Contributor Author

andijcr commented Jun 29, 2023

/ci-repeat 1 debug tests/rptest/tests/workload_producer_consumer.py

The test sets up a producer and a consumer, ensures that data gets written
to cloud storage, and checks the content of the partition manifest to
ensure progress.

This test is a runner for a collection of PWorkloads. It creates an
upgrade path, inserts patch downgrades, sets up a cluster, and runs the
workloads concurrently against the cluster. At the end of the test it
reports failed workloads.

WorkloadAdapter is a wrapper that keeps track of workload state and stores
any thrown exception.
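
A sketch of the wrapper idea described above, assuming a simple state enum and a `guarded` helper; `AdapterSketch`, `State`, and `report_failures` are hypothetical names and do not reproduce the actual WorkloadAdapter. The point it illustrates is that an exception in one workload is recorded instead of propagating, so the remaining workloads keep running and failures can be reported together at the end.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, Optional


class State(Enum):
    NOT_STARTED = "not_started"
    RUNNING = "running"
    FAILED = "failed"


class AdapterSketch:
    """Hypothetical wrapper: tracks state and captures the first exception."""
    def __init__(self, name: str):
        self.name = name
        self.state = State.NOT_STARTED
        self.error: Optional[Exception] = None

    def guarded(self, step: Callable[[], object]) -> None:
        """Run one workload step; on failure, record the error and stop."""
        if self.state == State.FAILED:
            return  # a failed workload is skipped so the others can continue
        try:
            self.state = State.RUNNING
            step()
        except Exception as e:
            self.state = State.FAILED
            self.error = e


def report_failures(adapters: Iterable[AdapterSketch]) -> Dict[str, Exception]:
    """Collect the stored exception of every failed workload, keyed by name."""
    return {a.name: a.error for a in adapters if a.state == State.FAILED}
```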
the method is used in a loop by RedpandaUpgradeTest and it's a bit noisy
@andijcr
Contributor Author

andijcr commented Jun 29, 2023

/ci-repeat 1 debug tests/rptest/tests/workload_upgrade_runner_test.py

@andijcr
Contributor Author

andijcr commented Jun 29, 2023

force push: comment fix and new commit to skip-debug since

@andijcr
Contributor Author

andijcr commented Jun 29, 2023

https://buildkite.com/redpanda/redpanda/builds/32226#0189080c-5ffe-40e2-b5be-257439b688c7
issues are
#11062
#11508

and one for this test, in debug mode. The test is marked as @skip_debug_mode, so there is some weird interaction

====================================================================================================
test_id:    rptest.tests.workload_upgrade_runner_test.RedpandaUpgradeTest.test_workloads_through_releases
status:     FAIL
run time:   17.606 seconds


    JSONDecodeError('Expecting value: line 1 column 1 (char 0)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 155, in wrapped
    self.redpanda.stop_and_scrub_object_storage()
  File "/root/tests/rptest/services/redpanda.py", line 3657, in stop_and_scrub_object_storage
    report = json.loads(output)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

edit: last issue was solved by moving @skip_debug_build to the top of the stack of decorators
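
One plausible reading of this fix is that it comes down to decorator order: decorators apply bottom-up, so the topmost one wraps everything beneath it, including any cleanup wrapper. The sketch below uses hypothetical `skip_if_debug` and `with_cleanup` decorators (not the real rptest ones) to show that a skip placed underneath a cleanup wrapper still lets the cleanup run, while a skip placed on top prevents it.

```python
import functools

DEBUG_BUILD = True  # assumption: stands in for however the build type is detected
calls = []


def skip_if_debug(fn):
    """Hypothetical skip decorator: returns early without calling fn on debug builds."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if DEBUG_BUILD:
            calls.append("skipped")
            return None
        return fn(*args, **kwargs)
    return wrapper


def with_cleanup(fn):
    """Hypothetical wrapper that always runs cleanup after the wrapped function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        finally:
            calls.append("cleanup")
    return wrapper


@with_cleanup
@skip_if_debug
def skip_inside():
    calls.append("test body")


@skip_if_debug
@with_cleanup
def skip_on_top():
    calls.append("test body")
```

Calling `skip_inside()` records both the skip and the cleanup, whereas `skip_on_top()` records only the skip. That matches the traceback above in spirit, where a cleanup step in the cluster wrapper ran for a test that was meant to be skipped.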

@andijcr andijcr requested a review from VladLazar June 29, 2023 20:14
The transition v23.1 -> v23.2 seems to be flaky in debug mode with a
producer/consumer workload running.
@andijcr
Contributor Author

andijcr commented Jun 30, 2023

Contributor

@VladLazar VladLazar left a comment

LGTM

@jcsp jcsp merged commit 5b5b726 into redpanda-data:dev Jun 30, 2023
17 checks passed
@andijcr andijcr deleted the feat/test/upgrade_tests branch June 30, 2023 15:35
Successfully merging this pull request may close these issues.

tests: fix upgrade tests not to skip versions
4 participants