DAOS-14105 object: collectively punch object #13287
Conversation
Force-pushed from 34d29aa to 6a872d4.
LGTM. No errors found by checkpatch.
Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/2/testReport/
Force-pushed from 6a872d4 to 7135554.
LGTM. No errors found by checkpatch.
Force-pushed from 7135554 to 62c1859.
LGTM. No errors found by checkpatch.
Currently, when punching an object with multiple redundancy groups, we handle the whole punch via a single internal distributed transaction to guarantee atomicity. The DTX leader forwards the CPD RPC to every object shard within the same transaction. For a large-scale object, such as an SX object, the punch generates N RPCs, where N equals the count of all the VOS targets in the system. That is very slow and holds a lot of system resources for a relatively long time. If the system is under heavy load, the related RPC(s) may time out and trigger a DTX abort; the client then resends the RPC to the DTX leader for retry, which makes the situation progressively worse.

To resolve this, we punch the object collectively. The basic idea: when punching an object with multiple redundancy groups, the client sends an OBJ_COLL_PUNCH RPC to the DTX leader. On the DTX leader, instead of forwarding the request to all related VOS targets, we use a bcast RPC to spread the OBJ_COLL_PUNCH request to all involved engines. The related engines then generate collective tasks to punch the object shards on their own local VOS targets. That saves a lot of RPCs and resources.

On the other hand, for a large-scale object, transferring the related DTX participants information (which can be huge) would be a heavy burden, whether via the RPC body or via RDMA (as bulk data). So the OBJ_COLL_PUNCH RPC does not transfer dtx_memberships; instead, the related engines, leader or not, calculate the dtx_memberships data from the object layout by themselves. That causes some overhead, but compared with broadcasting the huge DTX participants information over the network, it may be the better choice.

Introduce two environment variables to control the collective punch:

DTX_COLL_TREE_TOPO: the bcast RPC tree topo for collective transactions on the server. The valid range is [8, 128]; the default value is 32.

OBJ_COLL_PUNCH_THRESHOLD: the threshold for triggering a collective object punch on the client. The default (and also the minimum) value is 32.

Required-githooks: true

Signed-off-by: Fan Yong <fan.yong@intel.com>
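As a rough, hypothetical sketch of the client-side decision described above (the helper name, the grp_nr source, and the hard-coded values are assumptions for illustration, not the patch's actual code), the path selection reduces to a threshold check on the redundancy-group count:

```c
#include <stdbool.h>
#include <stdio.h>

/* Default threshold in this revision of the patch; overridable via the
 * OBJ_COLL_PUNCH_THRESHOLD environment variable per the message above. */
#define OBJ_COLL_PUNCH_THRESHOLD_DEF	32

/* Hypothetical: choose the punch path from the redundancy-group count. */
static bool obj_punch_use_collective(unsigned int grp_nr, unsigned int thd)
{
	/* Collective punch pays off only when the object spans many
	 * redundancy groups; small objects keep the plain DTX path. */
	return grp_nr >= thd;
}

int main(void)
{
	unsigned int grp_nr = 256;	/* e.g. an SX object on a large system */

	if (obj_punch_use_collective(grp_nr, OBJ_COLL_PUNCH_THRESHOLD_DEF))
		printf("send OBJ_COLL_PUNCH to the DTX leader\n");
	else
		printf("punch via the regular CPD distributed transaction\n");
	return 0;
}
```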
Force-pushed from 62c1859 to 4670bc5.
LGTM. No errors found by checkpatch.
Ping reviewers, thanks!
I have not reviewed, but I was just testing this on Aurora again, with a much smaller set of servers (just for functionality), and the server was crashing / asserting with:
aurora-daos-0247: ERROR: daos_engine:0 11/16-15:23:07.08 aurora-daos-0247 DAOS[5065/13/527240] object EMRG src/object/srv_obj.c:5912 obj_coll_punch_prep() Assertion 'ddt[0].ddt_id == ocpi->ocpi_leader_id' failed
aurora-daos-0247: daos_engine: src/object/srv_obj.c:5912: obj_coll_punch_prep: Assertion `0' failed.
aurora-daos-0247: ERROR: daos_engine:0 *** Process 5065 (daos_engine) received signal 6 (Aborted) ***
aurora-daos-0247: Associated errno: Success (0)
So this cannot land without more validation at larger scale, unfortunately.
Just read part of the code so far...
src/dtx/dtx_internal.h
Outdated
@@ -131,6 +162,13 @@ extern uint32_t dtx_agg_thd_age_lo;
/* The default count of DTX batched commit ULTs. */
#define DTX_BATCHED_ULT_DEF 32

/* The bcast RPC tree topo for collective transaction. */
#define DTX_COLL_TREE_TOPO_MAX 128
The max branch ratio defined in CaRT is CRT_TREE_MAX_RATIO (64), so 128 is too large and will get an error.
And the default value of 32 seems too large; maybe 8 is enough.
To bcast to 1024 nodes: a knomial-32 tree has depth 2 and max children 62;
a knomial-8 tree has depth 4 and max children 22;
a knomial-2 tree has depth 10 and max children 10.
A KARY tree's branch ratio is the number of children, but a KNOMIAL tree can have more children.
So the MIN can be just 2 and the MAX can be 64; a DEFAULT of 8 looks better for a KNOMIAL tree.
Just FYI.
BTW, maybe "DTX_COLL_TREE_BRANCH" is a better name than "DTX_COLL_TREE_TOPO", since the variable does not set the tree type.
For the Aurora case, we have 2K engines, and I do not want the tree depth to be too deep, since that may add latency. So I plan to change the default to 16 to keep the depth within 3; the max will be set to 64.
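The depth and children figures quoted above can be double-checked with a small standalone program. This is only an illustrative sketch: it assumes the common knomial layout where rank 0's children are the ranks j * k^i (1 <= j < k), which may differ from CaRT's actual tree construction.

```c
#include <stdio.h>

/* Depth of a knomial(k) tree over n ranks: smallest d with k^d >= n. */
static int knomial_depth(long n, long k)
{
	int	d = 0;
	long	reach = 1;

	while (reach < n) {
		reach *= k;
		d++;
	}
	return d;
}

/* Children of rank 0 in the assumed knomial layout: ranks j * k^i
 * (1 <= j < k) that fall inside [1, n). */
static int knomial_root_children(long n, long k)
{
	int	children = 0;

	for (long span = 1; span < n; span *= k)
		for (long j = 1; j < k && j * span < n; j++)
			children++;
	return children;
}

int main(void)
{
	long	widths[] = {2, 8, 32};
	long	n = 1024;

	for (int i = 0; i < 3; i++)
		printf("knomial-%ld over %ld ranks: depth %d, root children %d\n",
		       widths[i], n, knomial_depth(n, widths[i]),
		       knomial_root_children(n, widths[i]));
	return 0;
}
```

For 1024 ranks this prints depth 10 / children 10 for knomial-2, depth 4 / children 22 for knomial-8, and depth 2 / children 62 for knomial-32, matching the numbers in the comment above.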
int i;

for (i = 0, off = obj->cob_md.omd_id.lo % obj->cob_shards_nr; i < obj->cob_shards_nr;
     i++, off = (off + 1) % obj->cob_shards_nr) {
the "off" seems no place use it? why need it then?
or you want "obj_shard_open(obj, off, map_ver, &shard);" below?
Right, I will fix it.
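For reference, a toy, runnable version of the intended loop (all names here are stand-ins; in the patch "off" would be passed to obj_shard_open() as the reviewer suggests):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t	oid_lo = 42;	/* stand-in for obj->cob_md.omd_id.lo */
	int		shards_nr = 6;	/* stand-in for obj->cob_shards_nr */
	int		i, off;

	/* Start from a shard offset derived from the object ID and walk
	 * every shard once, actually consuming the rotated index. */
	for (i = 0, off = oid_lo % shards_nr; i < shards_nr;
	     i++, off = (off + 1) % shards_nr)
		printf("open shard %d\n", off);	/* e.g. obj_shard_open(obj, off, ...) */
	return 0;
}
```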
src/object/srv_obj.c
Outdated
@@ -2196,6 +2200,28 @@ obj_ioc_begin_lite(uint32_t rpc_map_ver, uuid_t pool_uuid,
		D_GOTO(out, rc = -DER_STALE);
	} else if (DAOS_FAIL_CHECK(DAOS_DTX_STALE_PM)) {
Minor: the FAIL_CHECK is commonly the last branch.
I will fix it.
@@ -2596,8 +2622,6 @@ ds_obj_tgt_update_handler(crt_rpc_t *rpc)

	if (rc < 0 && rc != -DER_NONEXIST)
		D_GOTO(out, rc);

	dtx_flags |= DTX_RESEND;
Just curious: why is it not needed now?
The DTX_RESEND flag was only used to set dth->dth_resent, which has been useless for a long time, so the patch cleans them up.
@@ -2951,6 +2977,9 @@ ds_obj_rw_handler(crt_rpc_t *rpc)
	/* Execute the operation on all targets */
	rc = dtx_leader_exec_ops(dlh, obj_tgt_update, NULL, 0, &exec_arg);

	if (max_ver < dlh->dlh_rmt_ver)
		max_ver = dlh->dlh_rmt_ver;
Could you please explain a little why the "max_ver" checks are added in this function? Thanks.
Because different DTX participants may have different pool map versions, here we collect the max version from all the other participants and reply it back to the client when the versions do not match.
@@ -3004,6 +3033,9 @@ ds_obj_rw_handler(crt_rpc_t *rpc)
			DP_DTI(&orw->orw_dti), DP_RC(rc1));
	}

	if (ioc.ioc_map_ver < max_ver)
		ioc.ioc_map_ver = max_ver;
Is this the case where the client IO RPC's map_ver is larger than the server-side pool's map version? Why reset ioc.ioc_map_ver here?
It is for the case where some DTX participant (such as a replica) has a larger pool map version than the client; we then reply that version back to the client.
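A minimal standalone sketch of that version aggregation (the function and variable names here are stand-ins for the dlh_rmt_ver / ioc_map_ver fields shown in the diffs above, not the patch's actual code):

```c
#include <stdint.h>
#include <stdio.h>

/* Return the highest pool map version seen by the leader or by any
 * remote DTX participant; the reply carries it so a stale client can
 * refresh its pool map and retry. */
static uint32_t obj_reply_map_ver(uint32_t leader_ver,
				  const uint32_t *remote_vers, int nr)
{
	uint32_t	max_ver = leader_ver;

	for (int i = 0; i < nr; i++)
		if (remote_vers[i] > max_ver)
			max_ver = remote_vers[i];
	return max_ver;
}

int main(void)
{
	uint32_t	remotes[] = {7, 9, 8};	/* versions from replicas */

	/* The client sent version 7; one replica already saw version 9,
	 * so the reply tells the client to refresh to 9. */
	printf("reply map_ver = %u\n", obj_reply_map_ver(7, remotes, 3));
	return 0;
}
```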
Which test did you run, just a simple punch? What is the object class? Was there a rebuild or reintegration during the test? What is your test configuration? Thanks! BTW, I cannot reproduce the issue by punching an SX object on a 12 targets * 8 engines system.
Signed-off-by: Fan Yong <fan.yong@intel.com>
Fix issues from review feedback. Signed-off-by: Fan Yong <fan.yong@intel.com>
LGTM. No errors found by checkpatch.
@mchaarawi, would you please verify the new patch without the collective query patch? Thanks!
Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/6/testReport/
RPC throttling for collective punch. Signed-off-by: Fan Yong <fan.yong@intel.com>
LGTM. No errors found by checkpatch.
Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/7/testReport/
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13287/7/execution/node/1603/log
Replaced by #13386
Currently, when punching an object with multiple redundancy groups,
we handle the whole punch via a single internal distributed
transaction to guarantee atomicity. The DTX leader forwards the
CPD RPC to every object shard within the same transaction. For a
large-scale object, such as an SX object, the punch generates
N RPCs (N equals the count of all the VOS targets in the system).
That is very slow and holds a lot of system resources for a
relatively long time. If the system is under heavy load, the
related RPC(s) may time out and trigger a DTX abort; the client
then resends the RPC to the DTX leader for retry, which makes
the situation progressively worse.

To resolve this, we punch the object collectively. The basic
idea: when punching an object with multiple redundancy groups,
the client sends an OBJ_COLL_PUNCH RPC to the DTX leader. On
the DTX leader, instead of forwarding the request to all related
VOS targets, we use a bcast RPC to spread the OBJ_COLL_PUNCH
request to all involved engines. The related engines then generate
collective tasks to punch the object shards on their own local
VOS targets. That saves a lot of RPCs and resources.

On the other hand, for a large-scale object, transferring the
related DTX participants information (which can be huge) would be
a heavy burden, whether via the RPC body or via RDMA (as bulk
data). So the OBJ_COLL_PUNCH RPC does not transfer dtx_memberships;
instead, the related engines, leader or not, calculate the
dtx_memberships data from the object layout by themselves. That
causes some overhead, but compared with broadcasting the huge DTX
participants information over the network, it may be the better
choice.

Introduce two environment variables to control the collective punch:

DTX_COLL_TREE_WIDTH: the bcast RPC tree width for collective
transactions on the server. The valid range is [4, 64]; the default
value is 16.

OBJ_COLL_PUNCH_THRESHOLD: the threshold for triggering a collective
object punch on the client. The default (and also the minimum)
value is 16.
Required-githooks: true
Signed-off-by: Fan Yong <fan.yong@intel.com>
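A minimal sketch of how such knobs are typically read and clamped. This is not the DAOS implementation: the helper and the unbounded upper limit for the threshold are assumptions based only on the ranges stated in the commit message above.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: read an unsigned env var, fall back to a
 * default, and clamp the result into [min, max]. */
static unsigned int env_uint_clamped(const char *name, unsigned int def,
				     unsigned int min, unsigned int max)
{
	const char	*val = getenv(name);
	unsigned int	 v = def;

	if (val != NULL && *val != '\0')
		v = (unsigned int)strtoul(val, NULL, 10);
	if (v < min)
		v = min;
	if (v > max)
		v = max;
	return v;
}

/* DTX_COLL_TREE_WIDTH: bcast tree width on the server, [4, 64], default 16. */
unsigned int dtx_coll_tree_width(void)
{
	return env_uint_clamped("DTX_COLL_TREE_WIDTH", 16, 4, 64);
}

/* OBJ_COLL_PUNCH_THRESHOLD: client-side trigger; default and minimum
 * are both 16; no upper bound is documented, so UINT_MAX is assumed. */
unsigned int obj_coll_punch_threshold(void)
{
	return env_uint_clamped("OBJ_COLL_PUNCH_THRESHOLD", 16, 16, UINT_MAX);
}
```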