
DAOS-14105 object: collectively punch object #13493

Merged: 1 commit merged from Nasf-Fan/DAOS-14105_7 into master on Jan 2, 2024

Conversation

@Nasf-Fan (Contributor) commented Dec 13, 2023

Currently, when punching an object that spans multiple redundancy groups, the whole punch is handled as a single internal distributed transaction to guarantee atomicity. The DTX leader forwards the CPD RPC to every object shard within the same transaction. For a large-scale object, such as an SX object, the punch generates N RPCs, where N is the count of all the VOS targets in the system. That is very slow and holds a lot of system resources for a relatively long time. If the system is under heavy load, the related RPC(s) may time out and trigger a DTX abort; the client then resends the RPC to the DTX leader for retry, making the situation progressively worse.

To resolve this, we punch the object collectively.

The basic idea is this: when punching an object with multiple redundancy groups, the client scans the object layout and generates bitmap and target information for each DAOS engine that holds shards of the object. The client then sends an OBJ_COLL_PUNCH RPC to the DTX leader. If the bitmap and target information are too large to fit in the RPC body, they are transferred to the DTX leader via RDMA. The DTX leader does not forward the request directly to all related engines; instead, it splits the engines into multiple engine groups. For each engine group, it randomly chooses a relay engine that dispatches the OBJ_COLL_PUNCH RPC to the other engines in the same group. From the current engine's perspective, it forwards only one OBJ_COLL_PUNCH RPC per engine group, to that group's relay engine, carrying only the bitmap and target information relevant to the engines in that group. Relay engines can apply the same method recursively until all related engines have received the collective punch request.

We control the count of engine groups on the DTX leader to guarantee that the size of the bitmap and target information for each engine group does not exceed the RPC bulk threshold. Consequently, no RDMA is needed among the engines for collectively punching the object.

On each related DAOS engine, the local punch of the related object shards is driven by a collective task across the engine's own local VOS targets. That saves a lot of RPCs and resources.
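
To make the dispatch flow concrete, here is a minimal sketch (illustrative C only; forward_coll_punch() and coll_punch_dispatch() are hypothetical names, not the actual DAOS implementation) of how a leader could split the target engines into bounded groups and forward one RPC per group to a randomly chosen relay engine:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for forwarding OBJ_COLL_PUNCH to a relay engine,
 * together with the bitmap/target information of its whole group. */
static int
forward_coll_punch(int relay_rank, const int *group_ranks, int group_nr)
{
        printf("forward to relay rank %d (group head %d, %d engines)\n",
               relay_rank, group_ranks[0], group_nr);
        return 0;
}

/* Split the engines into groups no larger than max_group_size (chosen so each
 * group's bitmap/target information fits in the RPC body), then send one RPC
 * per group to a random relay engine, which dispatches to the rest of its group. */
static int
coll_punch_dispatch(const int *engine_ranks, int engine_nr, int max_group_size)
{
        int i, rc = 0;

        for (i = 0; i < engine_nr && rc == 0; i += max_group_size) {
                int group_nr = engine_nr - i < max_group_size ?
                               engine_nr - i : max_group_size;
                int relay    = engine_ranks[i + rand() % group_nr];

                rc = forward_coll_punch(relay, engine_ranks + i, group_nr);
        }
        return rc;
}

With, say, 256 engines and a group size of 16, the leader sends only 16 RPCs; each relay then fans out to the remaining members of its group, and relays can recurse the same way on larger systems.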

The user can set the object collective punch threshold via the client-side environment variable "DAOS_OBJ_COLL_PUNCH_THD". The default (and also the minimum) value is 31.
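
For illustration only (a minimal sketch, not the actual client code; obj_coll_punch_threshold() is a hypothetical helper), reading and clamping this environment variable could look like:

#include <stdlib.h>

#define OBJ_COLL_PUNCH_THD_MIN 31U  /* default and minimum value described above */

/* Read DAOS_OBJ_COLL_PUNCH_THD; fall back to, and clamp at, the minimum of 31. */
static unsigned int
obj_coll_punch_threshold(void)
{
        const char   *val = getenv("DAOS_OBJ_COLL_PUNCH_THD");
        unsigned int  thd = OBJ_COLL_PUNCH_THD_MIN;

        if (val != NULL)
                thd = (unsigned int)strtoul(val, NULL, 0);
        if (thd < OBJ_COLL_PUNCH_THD_MIN)
                thd = OBJ_COLL_PUNCH_THD_MIN;
        return thd;
}

An object whose shard count reaches this threshold would then go through the collective punch path, while smaller objects keep using the regular per-shard RPCs.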

On the other hand, for a large-scale object, transferring the full DTX participant information (which can be huge) is a heavy burden whether it goes via the RPC body or RDMA. Therefore, the OBJ_COLL_PUNCH RPC does not transfer the complete dtx_memberships to the DAOS engines; instead, each related engine, whether leader or not, appends its local shard bitmap and target information to the client-provided MBS header. That is enough for collective commit and abort. For DTX resync, the (new) DTX leader needs to re-calculate the complete MBS; since resync is relatively rare, even though such re-calculation can be quite expensive, it does not affect the whole system much.

The patch mainly contains the following components:

  1. General framework for bitmap-based collective tasks on an engine. It
    is suitable for the existing pool/container/object collective tasks.

  2. General framework for generating the execution-target bitmap (and ID)
    array based on the object layout. It can be shared by all object-level
    collective operations, such as punch object or query key.

  3. General framework for handling collective object RPCs on the server.
    Currently it is shared by collective punch and collective query.

  4. General framework for collective DTX: collective DTX commit and
    collective DTX abort, in both synchronous and asynchronous modes.

  5. New DTX membership data structure (compatible with the existing DTX)
    to support collective DTX resync.

  6. General framework for multicast RPC among specified engines. It
    supports dispatching different content to different targets and is
    shared by both client and server.

  7. Object collective punch RPC.

  8. Change multi-layered "if-else" branches into "switch" statements for
    handling per-RPC attributes. That is more efficient for all object RPCs.

  9. Placement: pack target information into the object layout. That
    bypasses pool_map_find_target() when DAOS target information is needed
    for the related object layout.

  10. Some test cases for collectively punching objects.

The patch also fixes some pre-existing bugs that were exposed while passing CI tests:

a. An incarnation log bug in check_equal(): it should check the status
(committed or not) of "id_out" instead of "id_in", because "id_out" is
the in-place entry to be changed. Otherwise, for ilog_update(), "id_in"
always has a zero id_tx_id, which the DTX logic regards as committed,
so checking whether "id_in" is committed is meaningless. This issue may
cause incorrect data visibility:

a.1 A read operation may break data-consistency semantics or return
stale information.

a.2 A modification may be applied on top of a non-committable ilog
entry, silently losing the modification.
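
As a hedged sketch of the fix direction for item (a) only (struct ilog_id_sketch and status_is_committed() are simplified stand-ins, not the real ilog definitions), the commit-status check belongs on the in-place entry "id_out" rather than on the incoming "id_in":

#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the real incarnation-log structures and helpers. */
struct ilog_id_sketch {
        uint64_t id_epoch;
        uint64_t id_tx_id;   /* 0 means "no DTX", which DTX logic treats as committed */
};

static bool
status_is_committed(const struct ilog_id_sketch *id)
{
        return id->id_tx_id == 0; /* stand-in for the real DTX status lookup */
}

/* The gist of the fix: when deciding whether the existing (in-place) entry can
 * be treated as committed, look at "id_out", not the incoming "id_in"; for
 * ilog_update() the incoming "id_in" always has id_tx_id == 0, so checking it
 * would always report "committed" and tells us nothing. */
static bool
existing_entry_committed(const struct ilog_id_sketch *id_out,
                         const struct ilog_id_sketch *id_in)
{
        (void)id_in; /* intentionally ignored */
        return status_is_committed(id_out);
}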

b. We need to persistently store a DTX even if it changed nothing
locally. From the current engine's perspective, it cannot know whether
the other DTX participants changed something remotely. If we do not
save the empty DTX, subsequent DTX resync may misinterpret it as a
failed transaction and abort it, which may further affect other
transactions that depend on it.

c. Sometimes an in-DRAM DTX entry that has already been attached to a
persistent DTX blob may be cancelled because of trouble in subsequent
processing, such as hitting another in-progress DTX. In that case, the
upper-layer sponsor may retry the DTX after a DTX refresh. At that
point, the DTX entry must not reuse the former persistent DTX blob,
because the former local TX has been aborted and the persistent DTX
blob may have been released or reused by others. If the retried DTX
reused that persistent DTX blob, it could overwrite someone else's
modification and corrupt data.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force-landed and the forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask the submitter for a new summary.

@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

github-actions bot commented Dec 13, 2023

Bug-tracker data:
Ticket title is 'Punch large-scaled object collectively'
Status is 'Awaiting Verification'
Labels: 'tds'
https://daosio.atlassian.net/browse/DAOS-14105

@daosbuild1 (Collaborator):
Test stage Fault injection testing on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/1/execution/node/1125/log


@Nasf-Fan Nasf-Fan marked this pull request as ready for review December 14, 2023 15:15
@Nasf-Fan Nasf-Fan requested review from a team as code owners December 14, 2023 15:15
@daosbuild1 (Collaborator) reported CI failures:

  • Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/10/execution/node/1411/log
  • Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/10/execution/node/1349/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-14105_7 branch 2 times, most recently from aa3b8a3 to b8d3c6d Compare December 16, 2023 05:45
@daosbuild1 (Collaborator) reported CI results:

  • Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/12/execution/node/1365/log
  • Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
  • Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
  • Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
  • Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1435/log
  • Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1419/log
  • Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1398/log
  • Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/25/execution/node/1342/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-14105_7 branch 2 times, most recently from dd7b4e7 to 3fc9199 Compare December 26, 2023 06:55
@daosbuild1 (Collaborator):
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/27/execution/node/1319/log

@Nasf-Fan (Contributor, Author) replied:

> Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/27/execution/node/1319/log

ec_mdtest_smoke failed due to the known issue DAOS-12526.

@Nasf-Fan Nasf-Fan added the "priority" label (Ticket has high priority, automatically managed) Dec 27, 2023
@Nasf-Fan (Contributor, Author) commented:
Ping reviewers. Thanks!

@github-actions github-actions bot removed the "priority" label (Ticket has high priority, automatically managed) Dec 29, 2023
@liuxuezhao (Contributor) left a comment:

Just a rough read; generally LGTM, with a few questions to confirm.

continue;

/* Skip the target that (re-)joined the system after the DTX. */
if (target->ta_comp.co_ver > dtx_ver)

Contributor (@liuxuezhao):
For EXTEND or REINT, only co_in_ver is updated and co_ver does not change, so shouldn't this compare against co_in_ver? The same applies to several other places.

Contributor Author (@Nasf-Fan):
I am somewhat confused by the current usage of "co_ver" and "co_in_ver". It does seem that "co_in_ver" should be used instead of "co_ver". @wangdi1, would you please give more explanation about that? Thanks!

Contributor Author (@Nasf-Fan):
@liuxuezhao, you are right: "co_ver" cannot replace "co_in_ver". I will change it, and the other related DTX logic, in the subsequent collective query patch. Thanks!

if (target->ta_comp.co_status != PO_COMP_ST_UP &&
target->ta_comp.co_status != PO_COMP_ST_UPIN &&
target->ta_comp.co_status != PO_COMP_ST_NEW &&
target->ta_comp.co_status != PO_COMP_ST_DRAIN)

Contributor:
It seems a target in UP or NEW status should never get the DTX RPC request (as it should not be in the object layout)?

Contributor Author (@Nasf-Fan):
Before the target is fully integrated into the pool, its status will be UP instead of UPIN, right? If so, the UP target will get the RPC. As for NEW, I am not quite sure. @wangdi1, any idea?

if (dce->dce_ranks == NULL)
D_GOTO(out, rc = -DER_NOMEM);

D_ALLOC_ARRAY(dce->dce_hints, node_nr);

Contributor:
Just to confirm: why is the number of dce_hints not the same as dce_ranks? Is the hint not 1:1 per rank?

Contributor Author (@Nasf-Fan):
The first rank is the current engine, which will not send an RPC to itself, so the length of the ranks array is at most "node_nr - 1", or smaller. The hints array is sparse to speed up lookup: each rank can directly locate its slot based on its rank number, so the length of the hints array equals the total count of ranks in the system.
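
A minimal sketch of that layout, using hypothetical names rather than the actual DAOS structures: the ranks list excludes the local engine, while the hints array has one slot per rank in the system so a rank number can index it directly.

#include <stdint.h>

struct coll_entry_sketch {
        uint32_t *ranks;     /* peer engine ranks, local engine excluded: at most node_nr - 1 */
        uint32_t  rank_nr;
        uint8_t  *hints;     /* sparse: one slot per rank in the system, length node_nr */
        uint32_t  node_nr;
};

/* O(1) lookup: a rank directly indexes its own hint slot. */
static uint8_t
hint_for_rank(const struct coll_entry_sketch *dce, uint32_t rank)
{
        return rank < dce->node_nr ? dce->hints[rank] : 0;
}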

@@ -149,6 +180,20 @@ extern uint32_t dtx_batched_ult_max;
*/
#define DTX_INLINE_MBS_SIZE 512

#define DTX_COLL_TREE_WIDTH 16

Contributor:
Not a problem, but 16 seems too large for a k-nomial tree branch ratio; 4 is probably enough.

Contributor Author (@Nasf-Fan):
Thanks for pointing that out. Let's reduce it in a subsequent patch.


if (out_source->dco_status != 0 &&
(out_target->dco_status == 0 || out_target->dco_status == -DER_NONEXIST))
out_target->dco_status = out_source->dco_status;

Contributor:
Minor: a few log messages might be useful here.

Contributor Author (@Nasf-Fan):
OK.

else if (dct->dct_tgt_cap <= 8)
size = dct->dct_tgt_cap << 1;
else
size = dct->dct_tgt_cap + 8;

Contributor:
Why is a size of 4 enough when "dct->dct_tgt_cap == 0"? It is also not obvious why the size is calculated this way; a few comments would be helpful. Thanks.

Contributor Author (@Nasf-Fan):
We do not know how many shards will be on the rank before completing the scan. So at the beginning we allocate 4 items; if that is not enough, we double the size up to 16; and if we need even more, we add 8 items each time to avoid allocating too much. I will add more comments in a subsequent patch.
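
For reference, that growth policy can be condensed into a small sketch (illustrative only; next_tgt_cap() is a hypothetical helper, not code from the patch):

/* Capacity growth for the per-rank target array: start at 4 (most ranks hold
 * only a few shards), double while small (4 -> 8 -> 16), then grow linearly
 * by 8 to avoid over-allocating for very wide objects. */
static unsigned int
next_tgt_cap(unsigned int cur_cap)
{
        if (cur_cap == 0)
                return 4;
        if (cur_cap <= 8)
                return cur_cap << 1;
        return cur_cap + 8;
}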

}

used = crp_proc_get_size_used(cpca->cpca_proc);
if (unlikely(used > size)) {

Contributor:
Did you hit this case in testing? If so, "size = (*p_size * 9) >> 3" may not be enough.

Contributor Author (@Nasf-Fan):
We have hit a similar issue in another place where proc is used directly. "size = (*p_size * 9) >> 3" may indeed not be enough: we cannot assume the lower-layer proc function will not consume more space for additional packed information. That is why we have the "size = used && goto again" logic.
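
A hedged sketch of that estimate-then-verify pattern (packed_size_used() and pack_with_retry() are hypothetical stand-ins, not the real cart proc callbacks):

#include <stdlib.h>
#include <string.h>

/* Stand-in: report how much space the encoder actually needs for a payload. */
static size_t
packed_size_used(size_t payload_len)
{
        return payload_len + 16; /* pretend the encoder adds a small header */
}

/* Estimate with ~12.5% headroom ("size = (*p_size * 9) >> 3"), then verify the
 * space actually consumed; if the estimate was short, retry with the measured
 * size instead of assuming the encoder never grows the data. */
static void *
pack_with_retry(const void *payload, size_t payload_len, size_t *out_size)
{
        size_t size = (payload_len * 9) >> 3;
        void  *buf;

        if (size == 0)
                size = 1;
again:
        buf = malloc(size);
        if (buf == NULL)
                return NULL;

        if (packed_size_used(payload_len) > size) {
                size = packed_size_used(payload_len);
                free(buf);
                goto again;
        }

        memcpy(buf, payload, payload_len); /* stand-in for the real encoding */
        *out_size = size;
        return buf;
}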


/* Try another for leader. */
leader = (leader + 1) % coa->coa_dct_nr;
goto new_leader;

Contributor:
Does it seem possible to get into a dead loop here?

Contributor Author (@Nasf-Fan):
That should be impossible; otherwise, all the shards would be either in rebuild or in reintegration.

uuid_copy(opi->opi_co_hdl, args->pa_coh_uuid);
uuid_copy(opi->opi_co_uuid, args->pa_cont_uuid);
uuid_copy(opi->opi_co_hdl, shard->do_co->dc_cont_hdl);
uuid_copy(opi->opi_co_uuid, shard->do_co->dc_uuid);

Contributor:
Just curious: shouldn't this be the same as the original code?

Contributor Author (@Nasf-Fan):
Because of space limitations in the related data structure, po_{coh,cont}_uuid are not there any longer.

@@ -63,7 +63,9 @@ struct pl_obj_shard {
uint32_t po_shard; /* shard identifier */
uint32_t po_target; /* target id */
uint32_t po_fseq; /* The latest failure sequence */
uint32_t po_rebuilding:1, /* rebuilding status */
uint16_t po_rank; /* The rank on which the shard exists */

Contributor:
I am not sure it is a good idea to define it as 16 bits; the number of engines could exceed 64K in the near future. Alternatively, just enlarge the structure by another 8 bytes.

Contributor Author (@Nasf-Fan):
That is a good question. Honestly, I am not sure either. Enlarging the structure is possible, but mostly a waste. Could we consider using 20 bits (to support 1M ranks) for it? @gnailzenh, what is your idea?
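
Purely to illustrate the 20-bit option mentioned here (the field names and layout are illustrative; this is not what the patch does):

#include <stdint.h>

/* One possible packing: a 20-bit rank (up to ~1M ranks) sharing one 32-bit
 * word with per-shard status bits, so the shard descriptor does not grow. */
struct pl_obj_shard_sketch {
        uint32_t po_shard;            /* shard identifier */
        uint32_t po_target;           /* target id */
        uint32_t po_fseq;             /* latest failure sequence */
        uint32_t po_rank       : 20,  /* rank on which the shard exists */
                 po_rebuilding : 1,   /* rebuilding status */
                 po_flags      : 11;  /* room for other per-shard flags */
};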

@gnailzenh (Contributor) left a comment:

I think this patch is clean; I just have one more question, because the patch is very large and I cannot understand all the details:
If one engine fails to execute the collective punch, can we guarantee that all the collective members will abort the operation? I am asking because we now have indirect RPC forwarding, which could make things more complex.

if (rc < 0)
D_GOTO(out, rc);

/* The target that (re-)joined the system after DTX cannot be the leader. */

Contributor:
This is new logic, right? I did not see it in the original code.

Contributor Author (@Nasf-Fan):
Not new logic; it is just moved here from dtx_resync.c so the logic can be shared with others.

dcrc->dcrc_ptr = dcr;
dcrc->dcrc_dte = dtx_entry_get(rbund->entry);
d_list_add_tail(&dcrc->dcrc_gl_committable, &cont->sc_dtx_cos_list);
}

Contributor:
Style: the same code exists in dtx_cos_rec_alloc(); it would be nice to have an inline function.

Contributor Author (@Nasf-Fan):
OK, that can be done in a subsequent patch.

*dcks = dck_buf;
*p_dce = dtx_coll_entry_get(dcrc->dcrc_dce);

return 1;

Contributor:
The reason that we only fetch one collective RPC is that it is heavier weight than other RPCs, right? If that is the case, can we do a synchronous commit for collective RPCs? If it would make the code more complex, please ignore this comment.

Contributor Author (@Nasf-Fan):
Firstly, it is difficult to batch multiple collective DTX entries into one RPC; that would make the RPC body too complex.

Secondly, it may cause the RPC body size to exceed the BULK size limitation if the shard layouts of different DTX entries differ on the related engines.

We did implement synchronous collective DTX at one point, and its performance was not good. That is why we re-implemented it as asynchronous collective DTX. From the application's perspective, it decreases latency.

dck_buf[i].oid = dcrc->dcrc_ptr->dcr_oid;
dck_buf[i].dkey_hash = dcrc->dcrc_ptr->dcr_dkey_hash;

if (unlikely(oid != NULL && dcrc->dcrc_ptr == NULL)) {
if (i > 0)

Contributor:
We don't need to call daos_unit_oid_compare() in this case?

Contributor Author (@Nasf-Fan):
"oid != NULL" means we only want to find the DTX entries that touch a specified object. It is used for object sync.

@@ -375,6 +372,140 @@ dtx_handler(crt_rpc_t *rpc)
ds_cont_child_put(cont);
}

static void
dtx_coll_handler(crt_rpc_t *rpc)

Contributor:
This is on xstream-0, right?

Contributor Author (@Nasf-Fan):
Yes. The collective DTX RPC goes via the cart-level broadcast, so the receiving peer must be XS-0; that is different from collective punch (the latter uses our object-level broadcast). But dtx_coll_handler just starts ULTs on the other XSes and does not involve XS-0 in any layout/bitmap scan or calculation, so it is expected to be a very light operation.


/* More ranks joined after obj_coll_oper_args_init(). */
if (unlikely(shard->do_target_rank >= coa->coa_dct_cap)) {
D_REALLOC_ARRAY(dct, coa->coa_dcts, coa->coa_dct_cap, shard->do_target_rank + 2);

Contributor:
Same question.

* It stores the identifiers of shards on the engine, in spite of on which VOS target,
* only for modification case.
*/
uint32_t *dct_tgt_ids;

Contributor:
Why do we need this if we already have the bitmap "dct_bitmap"?

Contributor Author (@Nasf-Fan):
The bitmap is used for punch; the target IDs are used for DTX resync.

uint32_t dce_ver;
uint32_t dce_refs;
d_rank_list_t *dce_ranks;
uint8_t *dce_hints;

Contributor:
Sorry, what is dce_hints?

Contributor Author (@Nasf-Fan):
It is used by collective DTX to indicate on which VOS target the DTX entry can be found. There is a comment about it in dtx_internal.h.

struct dc_pool *pool = obj->cob_pool;

D_RWLOCK_RDLOCK(&pool->dp_map_lock);
if (shard_cnt < pool_map_node_nr(pool->dp_map) << 1)

Contributor:
Not from this patch, but I think pool_map_node_nr() actually returns the rank count, right?

Contributor Author (@Nasf-Fan):
pool_map_node_nr() returns the count of ranks in the system. The comparison here means that if the shard count is not at least double the rank count, it is unnecessary to use a collective operation.
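
Restated as a tiny sketch (illustrative only; use_collective_punch() is a hypothetical helper):

#include <stdbool.h>
#include <stdint.h>

/* Only take the collective path when there are, on average, at least two
 * shards per rank (engine); otherwise plain per-shard RPCs are cheap enough. */
static bool
use_collective_punch(uint32_t shard_cnt, uint32_t rank_nr)
{
        return shard_cnt >= (rank_nr << 1);
}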

size = dct->dct_bitmap_sz << 3;

/* Randomly choose a XS as the local leader on target engine for load balance. */
for (i = 0, pos = (rand != 0 ? rand : d_rand()) % dct->dct_tgt_nr; i < size; i++) {

Contributor:
Does this include xstream-0? If it does, I would suggest not using xstream-0; otherwise never mind.

Contributor Author (@Nasf-Fan):
The collective punch (not including collective DTX) does not touch XS-0.

@Nasf-Fan (Contributor, Author) commented Jan 2, 2024:

> I think this patch is clean; I just have one more question, because the patch is very large and I cannot understand all the details: if one engine fails to execute the collective punch, can we guarantee that all the collective members will abort the operation? I am asking because we now have indirect RPC forwarding, which could make things more complex.

If one engine fails to execute the punch, whether it is a leaf node, a relay engine, or the leader engine, the DTX leader engine will trigger a collective DTX abort to abort the whole transaction. That is similar to how a regular DTX leader handles failures. As for the "guarantee", it is the same as for the current regular DTX abort.

@gnailzenh gnailzenh merged commit 0aae80f into master Jan 2, 2024
47 of 48 checks passed
@gnailzenh gnailzenh deleted the Nasf-Fan/DAOS-14105_7 branch January 2, 2024 12:26
jolivier23 added a commit that referenced this pull request Jun 12, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
DAOS-15145 pool: add pool collective function (#13764)
*DAOS-14105 object: collectively punch object (#13493)

* partial backport, just the bitmap function

Required-githooks: true
Change-Id: I2b21b8121cbdecc79ae49a464a42b1d47fb9be10
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 jolivier23 mentioned this pull request Jun 12, 2024
jolivier23 added a commit that referenced this pull request Jun 17, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
DAOS-15514 container: fix container destroy failure (#14108)
DAOS-15672 rebuild: Fix pool destroy hangs (#14183)
DAOS-15145 pool: add pool collective function (#13764)
*DAOS-14105 object: collectively punch object (#13493)

partial backport, just the bitmap function

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Li Wei <wei.g.li@intel.com>
Signed-off-by: Wang Shilong <shilong.wang@intel.com>