assertion error in rptest.tests.topic_recovery_test.TopicRecoveryTest.test_fast2 #4601

Closed
abhijat opened this issue May 6, 2022 · 1 comment · Fixed by #5700
Assignees
Labels
area/cloud-storage Shadow indexing subsystem area/tests ci-failure kind/bug Something isn't working

Comments


abhijat commented May 6, 2022

https://buildkite.com/redpanda/redpanda/builds/9821#83e28000-1277-4baa-bc41-a5b30be75e1a/1527-7522
seen during #4404

test_id:    rptest.tests.topic_recovery_test.TopicRecoveryTest.test_fast2
status:     FAIL
run time:   3 minutes 8.709 seconds

    <BadLogLines nodes=docker-rp-3(2) example="ERROR 2022-05-06 06:13:32,581 [shard 1] assert - Assert failure: (../../../src/v/utils/retry_chain_node.cc:166) '_num_children == 0' Fiber stopped before its dependencies">
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 47, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
File "/root/tests/rptest/services/redpanda.py", line 1062, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-3(2) example="ERROR 2022-05-06 06:13:32,581 [shard 1] assert - Assert failure: (../../../src/v/utils/retry_chain_node.cc:166) '_num_children == 0' Fiber stopped before its dependencies">

@abhijat abhijat added kind/bug Something isn't working ci-failure labels May 6, 2022
@jcsp jcsp added the area/cloud-storage Shadow indexing subsystem label May 6, 2022
@Lazin Lazin self-assigned this May 6, 2022
@piyushredpanda piyushredpanda assigned BenPope and unassigned Lazin Jul 28, 2022

BenPope commented Jul 28, 2022

From the log:

DEBUG 2022-05-06 06:13:32,580 [shard 1] archival - [fiber6 kafka/panda-topic-1/0] - ntp_archiver_service.cc:261 - upload candidate not found, start_upload_offset: 6294, last_stable_offset: 6294
ERROR 2022-05-06 06:13:32,581 [shard 1] assert - Assert failure: (../../../src/v/utils/retry_chain_node.cc:166) '_num_children == 0' Fiber stopped before its dependencies
ERROR 2022-05-06 06:13:32,581 [shard 1] assert - Backtrace below:
0x2bb704ea 0x3b77dca4 0x3b77d967 0x3b77fe89 0x3b780ce2 0x37c08968 0x2c8ae42f 0x2c8ae3ac 0x2c8ae343 0x2c8ae284 0x2c8ae0d0 0x2c8ae070 0x2c8ae03a 0x2c8adb64 0x2c8ac974 0x2c8abda3 0x2c8abb52 0x2c8abab8 0x2c8aba68 0x2c8aba18 0x2c8ab9c8 0x2c8ab978 0x2c8ab89f 0x2c8ab818 0x2c8ab760 0x2c8ab7b8 0x2c0c6708 0x2c8a8f5a 0x2c8d3411 0x2c8d3164 0x2c8d1d23 0x2c8d68a3 0x3a8709bd 0x3a8769f9 0x3a87b57b 0x3a9c89e7 0x3a9c76ac 0x3a9c752c 0x3a9c7494 0x3a9c21f0 0x2bed3ee5 0x2be27d48 0x3a793598 /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03a6ab6914247a274-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/lib/libpthread.so.0+0x9298 /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03a6ab6914247a274-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/lib/libc.so.6+0x1006a2
   --------
   seastar::smp_message_queue::async_work_item<seastar::sharded<archival::scheduler_service>::stop()::'lambda'(seastar::future<void>)::operator()(seastar::future<void>) const::'lambda'(unsigned int)::operator()(unsigned int) const::'lambda'()>

It appears that there is outstanding work on fiber6. The gates look fine.
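To make the assertion easier to read, here is a minimal sketch of the kind of invariant it seems to guard, assuming `_num_children` counts child nodes that are still alive. The `chain_node` class and both helper functions are hypothetical stand-ins written for this illustration; this is not the code in retry_chain_node.cc.

```cpp
#include <cstddef>

// Hypothetical model: a node that counts the children still attached to it.
class chain_node {
public:
    chain_node() : _parent(nullptr), _num_children(0) {}

    // A child registers itself with its parent on construction.
    explicit chain_node(chain_node* parent)
      : _parent(parent), _num_children(0) {
        if (_parent) {
            ++_parent->_num_children;
        }
    }

    // A child unregisters itself when it finishes.
    ~chain_node() {
        if (_parent) {
            --_parent->_num_children;
        }
    }

    chain_node(const chain_node&) = delete;
    chain_node& operator=(const chain_node&) = delete;

    // The condition that the real code appears to assert on teardown.
    bool safe_to_destroy() const { return _num_children == 0; }

private:
    chain_node* _parent;
    std::size_t _num_children;
};

// Parent is checked while a child still has outstanding work.
bool parent_safe_while_child_alive() {
    chain_node parent;
    chain_node child(&parent);
    return parent.safe_to_destroy();
}

// Parent is checked after the child has finished.
bool parent_safe_after_child_done() {
    chain_node parent;
    { chain_node child(&parent); }
    return parent.safe_to_destroy();
}
```

In this model the first helper returns false and the second returns true, which matches the reading that the fiber was torn down while a dependent operation was still running.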

The implementation of scheduler_service::stop is:

ss::future<> scheduler_service_impl::stop() {
    vlog(_rtclog.info, "Scheduler service stop");
    _timer.cancel();
    std::vector<ss::future<>> outstanding;
    for (auto& it : _archivers) {
        outstanding.emplace_back(it.second->stop());
    }
    return ss::do_with(
      std::move(outstanding), [this](std::vector<ss::future<>>& outstanding) {
          return ss::when_all_succeed(outstanding.begin(), outstanding.end())
            .finally([this] { return _gate.close(); });
      });
}

When the ntp_archiver is added:

ss::future<> scheduler_service_impl::add_ntp_archiver(
  ss::lw_shared_ptr<ntp_archiver> archiver) {
    vassert(
      !_archivers.contains(archiver->get_ntp()),
      "archiver for ntp {} already added!",
      archiver->get_ntp());

    if (_gate.is_closed()) {
        return ss::now();
    }
    return archiver->download_manifest().then(
      [this, archiver](cloud_storage::download_result result) {
          auto ntp = archiver->get_ntp();
          switch (result) {
          case cloud_storage::download_result::success:
              vlog(
                _rtclog.info,
                "Found manifest for partition {}",
                archiver->get_ntp());
              _probe.start_archiving_ntp();

              _archivers.emplace(archiver->get_ntp(), archiver);
              archiver->run_upload_loop();

The manifest download is performed before the archiver is added to _archivers.

It looks like, during shutdown, an archiver that has just been started and is still downloading its manifest is not yet in _archivers, so stop() is never called on it, its gate is not waited on, and this assertion can fire.
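The ordering problem can be reproduced in a small model without Seastar. This is only a sketch under stated assumptions: `archiver`, `scheduler`, `complete_downloads` and `stopped_when_shutdown_races_download` are hypothetical names, the pending download is modelled as a deferred callback, and registering the archiver before the download is one way to read "always adding it to _archivers", not necessarily the exact shape of the patch.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ntp_archiver: only records whether stop() ran.
struct archiver {
    std::string ntp;
    bool stopped = false;
    void stop() { stopped = true; }
};

// Hypothetical stand-in for scheduler_service_impl.
class scheduler {
public:
    explicit scheduler(bool register_before_download)
      : _register_first(register_before_download) {}

    void add_ntp_archiver(std::shared_ptr<archiver> a) {
        if (_register_first) {
            // Registered up front, so stop() can always see it.
            _archivers.emplace(a->ntp, a);
        }
        // The "download" completes later, when complete_downloads() runs.
        _pending.push_back([this, a] {
            if (!_register_first) {
                // Original ordering: registered only after the download.
                _archivers.emplace(a->ntp, a);
            }
        });
    }

    // Run the continuations of all downloads that were in flight.
    void complete_downloads() {
        std::vector<std::function<void()>> pending = std::move(_pending);
        _pending.clear();
        for (auto& continuation : pending) {
            continuation();
        }
    }

    // Like the real stop(), this only reaches registered archivers.
    void stop() {
        for (auto& entry : _archivers) {
            entry.second->stop();
        }
    }

private:
    bool _register_first;
    std::map<std::string, std::shared_ptr<archiver>> _archivers;
    std::vector<std::function<void()>> _pending;
};

// Returns true if an archiver added just before shutdown gets stopped.
bool stopped_when_shutdown_races_download(bool register_before_download) {
    scheduler s(register_before_download);
    auto a = std::make_shared<archiver>();
    a->ntp = "kafka/panda-topic-1/0";
    s.add_ntp_archiver(a);  // download still in flight
    s.stop();               // shutdown arrives before it completes
    s.complete_downloads();
    return a->stopped;
}
```

With `register_before_download` set to false the archiver is never stopped, which is the situation described above; with it set to true, stop() reaches the archiver even though its download has not finished.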

BenPope added a commit to BenPope/redpanda that referenced this issue Jul 28, 2022
If an ntp_archiver is added, and then the service is stopped prior
to completion of manifest download, the ntp_archiver will not be
stopped, and its gate not waited on.

Fix that by always adding the archiver to _archivers.

Fix redpanda-data#4601

Signed-off-by: Ben Pope <ben@redpanda.com>
BenPope added a commit to BenPope/redpanda that referenced this issue Jul 29, 2022
If an `ntp_archiver` is added, and the service is stopped prior
to completion of manifest download, the `ntp_archiver` will not be
stopped, and its gate not waited on.

Fix that by always adding the `ntp_archiver` to `_archivers`.

Fix redpanda-data#4601

Signed-off-by: Ben Pope <ben@redpanda.com>
(cherry picked from commit f10c42c)
This issue was closed.