DAOS-14105 object: collectively punch object
Currently, when punching an object that has multiple redundancy groups,
we handle the whole punch via a single internal distributed transaction
to guarantee atomicity. The DTX leader forwards the CPD RPC to every
object shard within the same transaction. For a large-scale object,
such as an SX object, punching it generates N RPCs (where N equals the
count of all the vos targets in the system). That is very slow and
holds a lot of system resources for a relatively long time. If the
system is under heavy load, the related RPC(s) may time out and trigger
DTX abort, and then the client will resend the RPC to the DTX leader
for retry, which makes the situation worse and worse.

To resolve this situation, we will punch the object collectively.

The basic idea is as follows: when punching an object with multiple
redundancy groups, the client scans the object layout and generates a
bitmap and target information for each DAOS engine that has object
shards on it. The client then sends an OBJ_COLL_PUNCH RPC to the DTX
leader. The bitmap and target information may be too large to be
transferred via the RPC body, in which case they are sent to the DTX
leader via RDMA. On the DTX leader side, the request is not directly
forwarded to all related engines; instead, the DTX leader splits the
engines into multiple engine groups. For each engine group, it randomly
chooses a relay engine that helps the current engine dispatch the
OBJ_COLL_PUNCH RPC to the others in the same engine group. From the
current engine's perspective, for each engine group it forwards only
one OBJ_COLL_PUNCH RPC to the relay engine of that group, carrying only
the bitmap and target information related to the engines in that group.
This method can be applied recursively by the relay engines until all
related engines have received the collective punch request.
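
To make the dispatch concrete, below is a minimal C sketch (not the
actual DAOS implementation) of how a leader might partition engines
into groups and pick a random relay per group; the struct and function
names (coll_engine, coll_group, split_into_groups) are hypothetical.

    /* Hypothetical sketch of engine grouping and relay selection. */
    #include <stdlib.h>
    #include <stdint.h>

    struct coll_engine {
            uint32_t         ce_rank;       /* engine rank */
            uint8_t         *ce_bitmap;     /* local vos target bitmap */
            uint32_t         ce_bitmap_sz;
    };

    struct coll_group {
            struct coll_engine      *cg_engines;    /* engines in this group */
            uint32_t                 cg_nr;         /* engine count */
            uint32_t                 cg_relay;      /* index of relay engine */
    };

    /* Split @nr engines into @grp_nr groups, randomly picking one relay per group. */
    static int
    split_into_groups(struct coll_engine *engines, uint32_t nr,
                      uint32_t grp_nr, struct coll_group *groups)
    {
            uint32_t step = (nr + grp_nr - 1) / grp_nr;
            uint32_t i;

            for (i = 0; i < grp_nr; i++) {
                    uint32_t start = i * step;

                    if (start >= nr) {
                            groups[i].cg_nr = 0;
                            continue;
                    }

                    groups[i].cg_engines = &engines[start];
                    groups[i].cg_nr = (start + step <= nr) ? step : nr - start;
                    /* The relay forwards OBJ_COLL_PUNCH to the rest of its group. */
                    groups[i].cg_relay = rand() % groups[i].cg_nr;
            }

            return 0;
    }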

We control the count of engine groups on the DTX leader to guarantee
that the size of the related bitmap and target information for each
engine group does not exceed the RPC bulk threshold. Then there is no
RDMA among the engines for collectively punching the object.
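
As an illustration of that sizing rule, the following sketch computes
the smallest group count such that each group's per-engine payload
stays under an assumed inline bulk threshold; the constant and helper
names are illustrative, not the real DAOS symbols.

    #include <stdint.h>

    #define INLINE_BULK_THRESHOLD   (16 << 10)      /* assumed 16 KiB inline RPC limit */

    /* Per-engine payload: bitmap bytes plus fixed-size target records. */
    static uint32_t
    engine_payload_size(uint32_t bitmap_sz, uint32_t tgt_nr, uint32_t tgt_rec_sz)
    {
            return bitmap_sz + tgt_nr * tgt_rec_sz;
    }

    /* Group count chosen so that no group's payload exceeds the threshold. */
    static uint32_t
    calc_group_count(uint32_t engine_nr, uint32_t bitmap_sz, uint32_t tgt_nr,
                     uint32_t tgt_rec_sz)
    {
            uint32_t per_engine = engine_payload_size(bitmap_sz, tgt_nr, tgt_rec_sz);
            uint32_t max_per_grp = INLINE_BULK_THRESHOLD / per_engine;

            if (max_per_grp == 0)
                    max_per_grp = 1;        /* degenerate case: one engine per group */

            /* Round up so every engine lands in some group. */
            return (engine_nr + max_per_grp - 1) / max_per_grp;
    }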

For each related DAOS engine, the local punch of the related object
shards is driven via a collective task across its own local vos
targets. That saves a lot of RPCs and resources.
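
A rough sketch of that idea: one incoming RPC walks the local target
bitmap and punches the shards on each set target, instead of sending
one RPC per target. The callback type and helper below are
hypothetical, not the DAOS collective-task API.

    #include <stdint.h>

    typedef int (*tgt_punch_cb_t)(uint32_t tgt_idx, void *arg);

    /* Invoke @cb once for every local vos target whose bit is set in @bitmap. */
    static int
    for_each_set_bit(const uint8_t *bitmap, uint32_t bitmap_sz,
                     tgt_punch_cb_t cb, void *arg)
    {
            uint32_t i;
            int      rc = 0;

            for (i = 0; i < bitmap_sz * 8 && rc == 0; i++) {
                    if (bitmap[i / 8] & (1 << (i % 8)))
                            rc = cb(i, arg);        /* punch the shard(s) on target @i */
            }

            return rc;
    }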

We allow the user to set the object collective punch threshold via the
client-side environment variable "DAOS_OBJ_COLL_PUNCH_THD". The default
(and also the minimum) value is 31.
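
A sketch of how such a client-side knob could be parsed, assuming the
documented default/minimum of 31; the helper name is illustrative, not
necessarily the real symbol.

    #include <stdlib.h>
    #include <stdint.h>

    #define OBJ_COLL_PUNCH_THD_MIN  31

    static uint32_t
    obj_coll_punch_thd(void)
    {
            char    *val = getenv("DAOS_OBJ_COLL_PUNCH_THD");
            long     thd = OBJ_COLL_PUNCH_THD_MIN;

            if (val != NULL)
                    thd = strtol(val, NULL, 0);

            /* Values below the minimum (also the default) are clamped to it. */
            if (thd < OBJ_COLL_PUNCH_THD_MIN)
                    thd = OBJ_COLL_PUNCH_THD_MIN;

            return (uint32_t)thd;
    }

Presumably, objects whose redundancy-group count reaches this threshold
take the collective punch path, while smaller objects keep the existing
per-shard transaction.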

On the other hand, for a large-scale object, transferring the related
DTX participant information (which may be huge) would be a heavy
burden, whether via the RPC body or RDMA. So the OBJ_COLL_PUNCH RPC
does not transfer the complete dtx_memberships to the DAOS engines;
instead, each related engine, whether leader or not, appends its local
shard bitmap and target information to the client-provided MBS header.
That is enough for collective commit and abort. As for DTX resync, the
(new) DTX leader needs to re-calculate the complete MBS. Since resync
is relatively rare, even though the overhead of such re-calculation can
be quite high, it does not affect the whole system too much.
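
A hypothetical sketch of that space saving: each engine appends only
its own bitmap and target list after the client-provided MBS header,
rather than carrying the full dtx_memberships. The struct layout below
is illustrative, not the on-wire DAOS format.

    #include <string.h>
    #include <stdint.h>

    struct mbs_header {
            uint32_t        mh_flags;       /* client-provided MBS header fields */
            uint32_t        mh_tgt_nr;      /* running count of appended targets */
    };

    /*
     * Append this engine's bitmap and target list into @buf after the header.
     * The caller is assumed to have sized @buf for the appended records.
     */
    static uint32_t
    mbs_append_local(uint8_t *buf, uint32_t off,
                     const uint8_t *bitmap, uint32_t bitmap_sz,
                     const uint32_t *tgts, uint32_t tgt_nr)
    {
            struct mbs_header *hdr = (struct mbs_header *)buf;

            memcpy(buf + off, bitmap, bitmap_sz);
            off += bitmap_sz;
            memcpy(buf + off, tgts, tgt_nr * sizeof(*tgts));
            off += tgt_nr * sizeof(*tgts);

            hdr->mh_tgt_nr += tgt_nr;
            return off;     /* new end offset; enough for collective commit/abort */
    }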

Required-githooks: true

Signed-off-by: Fan Yong <fan.yong@intel.com>
Nasf-Fan committed Dec 26, 2023
1 parent ea01b50 commit 3095903
Showing 41 changed files with 4,088 additions and 747 deletions.
25 changes: 16 additions & 9 deletions src/container/srv_target.c
@@ -1653,6 +1653,8 @@ ds_cont_tgt_open(uuid_t pool_uuid, uuid_t cont_hdl_uuid,
struct dss_coll_ops coll_ops = { 0 };
struct dss_coll_args coll_args = { 0 };
struct ds_pool *pool;
int *exclude_tgts = NULL;
uint32_t exclude_tgt_nr = 0;
int rc;

/* Only for debugging purpose to compare srv_cont_hdl with cont_hdl_uuid */
@@ -1685,18 +1687,22 @@ ds_cont_tgt_open(pool_uuid, cont_hdl_uuid,
coll_args.ca_func_args = &arg;

/* setting aggregator args */
rc = ds_pool_get_failed_tgt_idx(pool_uuid, &coll_args.ca_exclude_tgts,
&coll_args.ca_exclude_tgts_cnt);
if (rc) {
rc = ds_pool_get_failed_tgt_idx(pool_uuid, &exclude_tgts, &exclude_tgt_nr);
if (rc != 0) {
D_ERROR(DF_UUID "failed to get index : rc "DF_RC"\n",
DP_UUID(pool_uuid), DP_RC(rc));
return rc;
goto out;
}

rc = dss_thread_collective_reduce(&coll_ops, &coll_args, 0);
D_FREE(coll_args.ca_exclude_tgts);
if (exclude_tgts != NULL) {
rc = dss_build_coll_bitmap(exclude_tgts, exclude_tgt_nr, &coll_args.ca_tgt_bitmap,
&coll_args.ca_tgt_bitmap_sz);
if (rc != 0)
goto out;
}

if (rc != 0) {
rc = dss_thread_collective_reduce(&coll_ops, &coll_args, 0);
if (rc != 0)
/* Once it exclude the target from the pool, since the target
* might still in the cart group, so IV cont open might still
* come to this target, especially if cont open/close will be
@@ -1706,9 +1712,10 @@
D_ERROR("open "DF_UUID"/"DF_UUID"/"DF_UUID":"DF_RC"\n",
DP_UUID(pool_uuid), DP_UUID(cont_uuid),
DP_UUID(cont_hdl_uuid), DP_RC(rc));
return rc;
}

out:
D_FREE(coll_args.ca_tgt_bitmap);
D_FREE(exclude_tgts);
return rc;
}

3 changes: 2 additions & 1 deletion src/dtx/SConscript
@@ -18,7 +18,8 @@ def scons():
# dtx
denv.Append(CPPDEFINES=['-DDAOS_PMEM_BUILD'])
dtx = denv.d_library('dtx',
['dtx_srv.c', 'dtx_rpc.c', 'dtx_resync.c', 'dtx_common.c', 'dtx_cos.c'],
['dtx_srv.c', 'dtx_rpc.c', 'dtx_resync.c', 'dtx_common.c', 'dtx_cos.c',
'dtx_coll.c'],
install_off="../..")
denv.Install('$PREFIX/lib64/daos_srv', dtx)

