
DAOS-14679 pool: Report on stopping sp_stopping #14374

Merged 9 commits into master from liw/ds_pool_stop on Jun 3, 2024

Conversation

@liw (Contributor) commented May 15, 2024

When destroying a pool on an engine (as part of a pool destroy command),
we may block in

ds_pool_stop
  pool_put_sync // putting the last reference
    pool_free_ref // called after the hash rec deletion
      pool_child_delete_one
        pool_child_stop

waiting for some ds_pool_child references. If the user retries the pool
destroy command, the new ds_pool_stop call on this engine can't find the
ds_pool object that is in the sp_stopping state, and the new pool
destroy command usually succeeds, even though the storage capacity
allocated to the pool hasn't been released yet.

This patch makes the following changes (a sketch of the resulting ds_pool_stop flow follows the list):

  • Move the pool_child_delete_one collective call from pool_free_ref to
    before the last ds_pool_put call in ds_pool_stop. This ensures that
    the ds_pool object remains in the LRU cache in the sp_stopping state
    while we wait for ds_pool_child references (or some other upper
    layer resource).

  • Wait in ds_pool_stop for all other ds_pool references before calling
    the pool_child_delete_one collective.

  • Remove the pool_put_sync trick, which was introduced because of the
    pool_child_delete_one collective call in pool_free_ref.

  • Return -DER_AGAIN from ds_pool_stop when the ds_pool object is in
    the sp_stopping state, so that the new pool destroy command in the
    aforementioned scenario will wait for the original pool destroy
    operation.

  • Register a reply aggregation callback for MGMT_TGT_DESTROY so that
    an error can reach the control plane.

Features: pool

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force landed and the forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask the submitter for a new summary.

github-actions bot commented May 15, 2024

Ticket title is 'LRZ: dmg pool destroy fails'
Status is 'Blocked'
Labels: 'LRZ,triaged'
https://daosio.atlassian.net/browse/DAOS-14679

When destroying a pool on an engine (as part of a pool destroy command),
we may block in

  ds_pool_stop
    pool_put_sync // putting the last reference
      pool_free_ref // called after the hash rec deletion
        pool_child_delete_one
          pool_child_stop

waiting for some ds_pool_child references. If the user retries the pool
destroy command, the new ds_pool_stop call on this engine can't find the
ds_pool object that is in the sp_stopping state, and the new pool
destroy command usually succeeds, even though the storage capacity
allocated to the pool hasn't been released yet.

This patch makes the following changes:

  - Move the pool_child_delete_one collective call from pool_free_ref to
    before the ds_pool_put call in ds_pool_stop. This makes sure that
    the ds_pool object remains in the LRU cache in the sp_stopping state
    while we wait for ds_pool_child references (or some other upper
    layer resource).

  - Remove the pool_put_sync trick, which was introduced because of the
    pool_child_delete_one collective call in pool_free_ref.

  - Return an error from ds_pool_stop when the ds_pool object is in the
    sp_stopping state, so that the new pool destroy command in the
    aforementioned scenario will return an explicit error.

  - Register a reply aggregation callback for MGMT_TGT_DESTROY so that
    an error can reach the control plane.

Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14374/3/testReport/

@liw (Contributor, Author) commented May 20, 2024

container/boundary: DAOS-15857

@liw marked this pull request as ready for review May 20, 2024 01:20
@liw requested review from a team as code owners May 20, 2024 01:20
@@ -1263,9 +1247,13 @@ ds_pool_stop(uuid_t uuid)

ds_rebuild_abort(pool->sp_uuid, -1, -1, -1);
ds_migrate_stop(pool, -1, -1);

pool_stop_children(pool);
NiuYawei (Contributor):

Looks like this needs to be called in the error cleanup path of ds_pool_start() too (and in pool_alloc_ref() when the collective call partially fails?).

@liw (Contributor, Author):

Indeed... Will fix. Thanks, Niu.

liw added 2 commits May 20, 2024 14:35
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
Features: pool
Required-githooks: true
src/pool/srv_target.c (discussion resolved)
@@ -1263,9 +1261,13 @@ ds_pool_stop(uuid_t uuid)

ds_rebuild_abort(pool->sp_uuid, -1, -1, -1);
ds_migrate_stop(pool, -1, -1);

pool_child_delete_all(pool);
Nasf-Fan (Contributor):

If the caller of ds_pool_stop() is not the last user of the ds_pool, the other reference holders may still be using the related pool_child objects. In that case, is it possible to cause a race?

@liw (Contributor, Author):

Yes, I thought about this question too. I think that if there is no other bug, there should be no other user of this ds_pool object. Moreover, even if there were such a user, there would be no ds_pool_child pointer in ds_pool, so although that user might encounter some errors, it should not hit serious problems like segfaults.

@Nasf-Fan (Contributor) commented May 21, 2024:

The race may happen, for example, during ds_pool_tgt_map_update(): if someone triggers ds_pool_stop(), some pool_child may be released before the collective tasks for ds_pool_tgt_map_update complete, so dtx_resync may fail. That is risky because dtx_resync is a global operation; if it misses some pool_child, the DTX check result, and hence the decision, will be affected. Furthermore, if the DTX leader wants to commit or abort some DTX but the related pool shard is closed, it may leave an orphan DTX entry on that pool shard. If the pool is going to be destroyed, that's OK; otherwise it will be trouble. I am not aware of any known segmentation-fault issues in such a case, but I am really not sure about that.

@liw (Contributor, Author):

I think I can pull in one more change originally planned for future work: Wait for all other ds_pool references to be released. If we do that before calling pool_child_delete_all, the situation should become easier for other ds_pool reference holders.

The dtx_resync concern sounds like a different problem. Let's chat to understand where the problem is.

Nasf-Fan (Contributor):

The root cause of the dtx_resync trouble is not related to this patch, but the changes in this patch will increase the risk of exposing it.

@liw (Contributor, Author) left a comment:

@Nasf-Fan, thanks for the review. Here are my replies/explanations.

src/pool/srv_target.c (discussion resolved)

Features: pool
Required-githooks: true
@liw (Contributor, Author) commented May 21, 2024

Merged with master to fix a minor conflict in daos_srv/pool.h.

@liw (Contributor, Author) left a comment:

@Nasf-Fan, thanks for the details. Here are my replies.

src/pool/srv_target.c (discussion resolved)

Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
@liw requested review from Nasf-Fan and NiuYawei May 22, 2024 06:22
@NiuYawei previously approved these changes May 22, 2024
D_INFO(DF_UUID": pool stopped\n", DP_UUID(uuid));

while (!daos_lru_is_last_user(&pool->sp_entry))
        dss_sleep(1000 /* ms */);
NiuYawei (Contributor):

I don't see why we have to wait here for the ds_pool references to be dropped; it would be better to print some warning messages if it hangs here due to potential defects.

@liw (Contributor, Author) commented May 22, 2024:

@NiuYawei, please see https://github.com/daos-stack/daos/pull/14374/files#r1609159549 (just below). There may be RPC handlers calling functions that looked up the ds_pool object before we set sp_stopping in ds_pool_stop. I think Fan Yong's point is that such a function may assume that, because it holds a ds_pool reference, the children are present.

NiuYawei (Contributor):

I see. I don't think any user currently expects that; a ds_pool_child can be stopped at any time due to a faulty target. Anyway, let's see whether this new change uncovers any potential ds_pool reference issues.

@liw (Contributor, Author):

I have indeed told Fan Yong that nowadays a ds_pool_child may be stopped individually.

@liw (Contributor, Author):

I'm reluctant to log repeated D_INFO messages here, so I only added a completion message, so that we know whether we are blocked in the reference wait or in pool_child_delete_all.

Nasf-Fan (Contributor):

The patch seems somewhat complex. Following your patch's logic, the wait here is necessary, but the potential risk is whether the related reference holders can get the stopping signal and release their references in time. As a next step, we may have to resolve such issues one by one as we hit long-held references.

On the other hand, as I understand it, the main issue to be resolved is that a pool destroy is in progress but times out (for some unknown reason), and the subsequent retried pool destroy returns success because it cannot find the pool in the pool cache. If that is true, why not just return -DER_AGAIN, as your current patch does, without the other changes? That seems enough.

@liw (Contributor, Author):

@Nasf-Fan, active cancellation seems to be a large project. It is out of the scope of this PR, but has been in my plan for a long time.

"That" is not enough. Without this PR, when we wait for ds_pool_child references, we are in pool_free_ref, which is called after the ds_pool object has been deleted from the LRU cache. Hence a retried pool destroy operation will not find the ds_pool object. Please see the PR description. (I began this PR with only such a change, but testing proved it to be insufficient.) On the other hand, stopping activities when freeing the last reference is not a good model, so I plan to move that out anyway.

Nasf-Fan (Contributor):

I have no better idea, then. Let's resolve the potential issues when we really hit them in the future.

Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true

@jolivier23 (Contributor) commented:

Perhaps you should add

Allow-unstable-test: true

So it at least runs all hardware tests if it's blocked by a CI bug on VM tests.

Features: pool
Allow-unstable-test: true
Required-githooks: true
Features: pool
Allow-unstable-test: true
Required-githooks: true
@liw (Contributor, Author) commented May 28, 2024

Merged master again because the previous CI job was stuck in "Test RPMs".

@daosbuild1 (Collaborator) commented:

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14374/8/display/redirect

Features: pool
Allow-unstable-test: true
Required-githooks: true
@liw requested a review from a team June 2, 2024 23:31
@NiuYawei merged commit dd530c9 into master Jun 3, 2024
51 of 52 checks passed
@NiuYawei deleted the liw/ds_pool_stop branch June 3, 2024 04:22
jolivier23 added a commit that referenced this pull request Jun 12, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
DAOS-15145 pool: add pool collective function (#13764)
*DAOS-14105 object: collectively punch object (#13493)

* partial backport, just the bitmap function

Required-githooks: true
Change-Id: I2b21b8121cbdecc79ae49a464a42b1d47fb9be10
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 mentioned this pull request Jun 12, 2024
jolivier23 added a commit that referenced this pull request Jun 17, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
DAOS-15514 container: fix container destroy failure (#14108)
DAOS-15672 rebuild: Fix pool destroy hangs (#14183)
DAOS-15145 pool: add pool collective function (#13764)
*DAOS-14105 object: collectively punch object (#13493)

partial backport, just the bitmap function

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Li Wei <wei.g.li@intel.com>
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
liw added a commit that referenced this pull request Jun 25, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
liw added a commit to liw/daos that referenced this pull request Jul 25, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
@liw mentioned this pull request Jul 26, 2024
liw added a commit that referenced this pull request Jul 29, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
gnailzenh pushed a commit that referenced this pull request Aug 3, 2024
* DAOS-14679 pool: Report on stopping sp_stopping (#14374)


* DAOS-15998 pool: Let all pool_svc rest on ds_pool (#14681)

Currently, only leader pool_svc objects depend on ds_pool objects. This
unintentionally allows the creation of a PS replica even when the local
ds_pool object is being stopped, leading to the hang in the Jira ticket.
This patch lets every follower pool_svc object depend on the ds_pool
object too, so that if the latter is stopping or doesn't exist, the
pool_svc object won't be created successfully.

  - Initialize pool_svc.ps_pool when allocating pool_svc, rather than
    when stepping up.
      - Move ps_pool up in pool_svc, as it's no longer a leader-only
        field.
      - Modify init_svc_pool accordingly and rename it to
        update_svc_pool.

  - Move ds_pool_svc_stop into ds_pool_stop, because we want to set
    ds_pool.sp_stopping before stopping the PS (if any).
      - Move the PS start code into ds_pool_start for symmetry.
      - Clean up the error handling in and for ds_pool_svc_stop.

  - Change ds_pool_stop_all to stop all pools concurrently.
      - Remove ds_rsvc_stop_all, as it's no longer used anywhere.

Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true

* DAOS-16254 pool: Fix rfcheck_ult infinite looping (#14789)

wenbaoxu@163.com found that pool_svc_rfcheck_ult may enter an infinite
loop due to -DER_NOTLEADERs when the PS leader is stepping down:

    src/rdb/rdb_raft.c:3146 rdb_raft_wait_applied() 564e1a52[1]: waiting
      for entry 0 to be applied
    src/container/srv_container.c:5050 ds_cont_rdb_iterate() 564e1a52
      container iter: rc -2008
    src/pool/srv_pool.c:6405 pool_svc_rfcheck_ult() 564e1a52 check rf
      with -2008 and retry
    [...]

This patch adds a check for whether the "sched" is canceled in
pool_svc_rfcheck_ult before sleeping and retrying, and fixes a nonsense
dss_sleep(0) call.


Signed-off-by: Li Wei <wei.g.li@intel.com>