DAOS-14105 object: collectively punch object
Currently, when punching an object that has multiple redundancy groups,
we handle the whole punch via a single internal distributed transaction
to guarantee atomicity. The DTX leader forwards the CPD RPC to every
object shard within the same transaction. For a large-scale object,
such as an SX object, punching it generates N RPCs (where N equals the
count of all the vos targets in the system). That is very slow and
holds a lot of system resources for a relatively long time. If the
system is under heavy load, the related RPC(s) may time out and trigger
DTX abort, and then the client will resend the RPC to the DTX leader
for retry, which makes the situation worse and worse.

To resolve this situation, we will punch the object collectively.

The basic idea is as follows: when punching an object with multiple
redundancy groups, the client scans the object layout and generates a
bitmap and target information for each DAOS engine that has object
shards on it. The client then sends an OBJ_COLL_PUNCH RPC to the DTX
leader. The bitmap and target information may be too large to be
transferred via the RPC body, in which case they are sent to the DTX
leader via RDMA. On the DTX leader side, the request is not directly
forwarded to all related engines; instead, the DTX leader splits the
engines into multiple engine groups. For each engine group, it randomly
chooses a relay engine that helps the current engine dispatch the
OBJ_COLL_PUNCH RPC to the others in the same engine group. From the
current engine's perspective, for each engine group it forwards only
one OBJ_COLL_PUNCH RPC to the relay engine of that group, carrying only
the bitmap and target information related to the engines in that group.
This method can be applied recursively by the relay engines until all
related engines have received the collective punch request.
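
To make the dispatch concrete, below is a minimal C sketch (not the
actual DAOS implementation) of how a leader might partition engines
into groups and pick a random relay per group; the struct and function
names (coll_engine, coll_group, split_into_groups) are hypothetical.

    /* Hypothetical sketch of engine grouping and relay selection. */
    #include <stdlib.h>
    #include <stdint.h>

    struct coll_engine {
            uint32_t         ce_rank;       /* engine rank */
            uint8_t         *ce_bitmap;     /* local vos target bitmap */
            uint32_t         ce_bitmap_sz;
    };

    struct coll_group {
            struct coll_engine      *cg_engines;    /* engines in this group */
            uint32_t                 cg_nr;         /* engine count */
            uint32_t                 cg_relay;      /* index of relay engine */
    };

    /* Split @nr engines into @grp_nr groups, randomly picking one relay per group. */
    static int
    split_into_groups(struct coll_engine *engines, uint32_t nr,
                      uint32_t grp_nr, struct coll_group *groups)
    {
            uint32_t step = (nr + grp_nr - 1) / grp_nr;
            uint32_t i;

            for (i = 0; i < grp_nr; i++) {
                    uint32_t start = i * step;

                    if (start >= nr) {
                            groups[i].cg_nr = 0;
                            continue;
                    }

                    groups[i].cg_engines = &engines[start];
                    groups[i].cg_nr = (start + step <= nr) ? step : nr - start;
                    /* The relay forwards OBJ_COLL_PUNCH to the rest of its group. */
                    groups[i].cg_relay = rand() % groups[i].cg_nr;
            }

            return 0;
    }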

We control the count of engine groups on the DTX leader to guarantee
that the size of the related bitmap and target information for each
engine group does not exceed the RPC bulk threshold. Then there is no
RDMA among the engines for collectively punching the object.
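
As an illustration of that sizing rule, the following sketch computes
the smallest group count such that each group's per-engine payload
stays under an assumed inline bulk threshold; the constant and helper
names are illustrative, not the real DAOS symbols.

    #include <stdint.h>

    #define INLINE_BULK_THRESHOLD   (16 << 10)      /* assumed 16 KiB inline RPC limit */

    /* Per-engine payload: bitmap bytes plus fixed-size target records. */
    static uint32_t
    engine_payload_size(uint32_t bitmap_sz, uint32_t tgt_nr, uint32_t tgt_rec_sz)
    {
            return bitmap_sz + tgt_nr * tgt_rec_sz;
    }

    /* Group count chosen so that no group's payload exceeds the threshold. */
    static uint32_t
    calc_group_count(uint32_t engine_nr, uint32_t bitmap_sz, uint32_t tgt_nr,
                     uint32_t tgt_rec_sz)
    {
            uint32_t per_engine = engine_payload_size(bitmap_sz, tgt_nr, tgt_rec_sz);
            uint32_t max_per_grp = INLINE_BULK_THRESHOLD / per_engine;

            if (max_per_grp == 0)
                    max_per_grp = 1;        /* degenerate case: one engine per group */

            /* Round up so every engine lands in some group. */
            return (engine_nr + max_per_grp - 1) / max_per_grp;
    }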

For each related DAOS engine, the local punch of the related object
shards is driven via a collective task across its own local vos
targets. That saves a lot of RPCs and resources.
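
A rough sketch of that idea: one incoming RPC walks the local target
bitmap and punches the shards on each set target, instead of sending
one RPC per target. The callback type and helper below are
hypothetical, not the DAOS collective-task API.

    #include <stdint.h>

    typedef int (*tgt_punch_cb_t)(uint32_t tgt_idx, void *arg);

    /* Invoke @cb once for every local vos target whose bit is set in @bitmap. */
    static int
    for_each_set_bit(const uint8_t *bitmap, uint32_t bitmap_sz,
                     tgt_punch_cb_t cb, void *arg)
    {
            uint32_t i;
            int      rc = 0;

            for (i = 0; i < bitmap_sz * 8 && rc == 0; i++) {
                    if (bitmap[i / 8] & (1 << (i % 8)))
                            rc = cb(i, arg);        /* punch the shard(s) on target @i */
            }

            return rc;
    }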

We allow the user to set the object collective punch threshold via the
client-side environment variable "DAOS_OBJ_COLL_PUNCH_THD". The default
(and also the minimum) value is 31.
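
A sketch of how such a client-side knob could be parsed, assuming the
documented default/minimum of 31; the helper name is illustrative, not
necessarily the real symbol.

    #include <stdlib.h>
    #include <stdint.h>

    #define OBJ_COLL_PUNCH_THD_MIN  31

    static uint32_t
    obj_coll_punch_thd(void)
    {
            char    *val = getenv("DAOS_OBJ_COLL_PUNCH_THD");
            long     thd = OBJ_COLL_PUNCH_THD_MIN;

            if (val != NULL)
                    thd = strtol(val, NULL, 0);

            /* Values below the minimum (also the default) are clamped to it. */
            if (thd < OBJ_COLL_PUNCH_THD_MIN)
                    thd = OBJ_COLL_PUNCH_THD_MIN;

            return (uint32_t)thd;
    }

Presumably, objects whose redundancy-group count reaches this threshold
take the collective punch path, while smaller objects keep the existing
per-shard transaction.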

On the other hand, for a large-scale object, transferring the related
DTX participant information (which may be huge) would be a heavy
burden, whether via the RPC body or RDMA. So the OBJ_COLL_PUNCH RPC
does not transfer the complete dtx_memberships to the DAOS engines;
instead, each related engine, whether leader or not, appends its local
shard bitmap and target information to the client-provided MBS header.
That is enough for collective commit and abort. As for DTX resync, the
(new) DTX leader needs to re-calculate the complete MBS. Since resync
is relatively rare, even though the overhead of such re-calculation can
be quite high, it does not affect the whole system too much.
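
A hypothetical sketch of that space saving: each engine appends only
its own bitmap and target list after the client-provided MBS header,
rather than carrying the full dtx_memberships. The struct layout below
is illustrative, not the on-wire DAOS format.

    #include <string.h>
    #include <stdint.h>

    struct mbs_header {
            uint32_t        mh_flags;       /* client-provided MBS header fields */
            uint32_t        mh_tgt_nr;      /* running count of appended targets */
    };

    /*
     * Append this engine's bitmap and target list into @buf after the header.
     * The caller is assumed to have sized @buf for the appended records.
     */
    static uint32_t
    mbs_append_local(uint8_t *buf, uint32_t off,
                     const uint8_t *bitmap, uint32_t bitmap_sz,
                     const uint32_t *tgts, uint32_t tgt_nr)
    {
            struct mbs_header *hdr = (struct mbs_header *)buf;

            memcpy(buf + off, bitmap, bitmap_sz);
            off += bitmap_sz;
            memcpy(buf + off, tgts, tgt_nr * sizeof(*tgts));
            off += tgt_nr * sizeof(*tgts);

            hdr->mh_tgt_nr += tgt_nr;
            return off;     /* new end offset; enough for collective commit/abort */
    }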

Required-githooks: true

Signed-off-by: Fan Yong <fan.yong@intel.com>
Nasf-Fan committed Dec 26, 2023
1 parent ea01b50 commit 3095903
Showing 41 changed files with 4,088 additions and 747 deletions.
25 changes: 16 additions & 9 deletions src/container/srv_target.c
@@ -1653,6 +1653,8 @@ ds_cont_tgt_open(uuid_t pool_uuid, uuid_t cont_hdl_uuid,
struct dss_coll_ops coll_ops = { 0 };
struct dss_coll_args coll_args = { 0 };
struct ds_pool *pool;
int *exclude_tgts = NULL;
uint32_t exclude_tgt_nr = 0;
int rc;

/* Only for debugging purpose to compare srv_cont_hdl with cont_hdl_uuid */
@@ -1685,18 +1687,22 @@ ds_cont_tgt_open(pool_uuid, cont_hdl_uuid,
coll_args.ca_func_args = &arg;

/* setting aggregator args */
rc = ds_pool_get_failed_tgt_idx(pool_uuid, &coll_args.ca_exclude_tgts,
&coll_args.ca_exclude_tgts_cnt);
if (rc) {
rc = ds_pool_get_failed_tgt_idx(pool_uuid, &exclude_tgts, &exclude_tgt_nr);
if (rc != 0) {
D_ERROR(DF_UUID "failed to get index : rc "DF_RC"\n",
DP_UUID(pool_uuid), DP_RC(rc));
return rc;
goto out;
}

rc = dss_thread_collective_reduce(&coll_ops, &coll_args, 0);
D_FREE(coll_args.ca_exclude_tgts);
if (exclude_tgts != NULL) {
rc = dss_build_coll_bitmap(exclude_tgts, exclude_tgt_nr, &coll_args.ca_tgt_bitmap,
&coll_args.ca_tgt_bitmap_sz);
if (rc != 0)
goto out;
}

if (rc != 0) {
rc = dss_thread_collective_reduce(&coll_ops, &coll_args, 0);
if (rc != 0)
/* Once it exclude the target from the pool, since the target
* might still in the cart group, so IV cont open might still
* come to this target, especially if cont open/close will be
@@ -1706,9 +1712,10 @@
D_ERROR("open "DF_UUID"/"DF_UUID"/"DF_UUID":"DF_RC"\n",
DP_UUID(pool_uuid), DP_UUID(cont_uuid),
DP_UUID(cont_hdl_uuid), DP_RC(rc));
return rc;
}

out:
D_FREE(coll_args.ca_tgt_bitmap);
D_FREE(exclude_tgts);
return rc;
}

3 changes: 2 additions & 1 deletion src/dtx/SConscript
@@ -18,7 +18,8 @@ def scons():
# dtx
denv.Append(CPPDEFINES=['-DDAOS_PMEM_BUILD'])
dtx = denv.d_library('dtx',
['dtx_srv.c', 'dtx_rpc.c', 'dtx_resync.c', 'dtx_common.c', 'dtx_cos.c'],
['dtx_srv.c', 'dtx_rpc.c', 'dtx_resync.c', 'dtx_common.c', 'dtx_cos.c',
'dtx_coll.c'],
install_off="../..")
denv.Install('$PREFIX/lib64/daos_srv', dtx)

