DAOS-14105 object: collectively punch object #13287
Conversation
Force-pushed from 34d29aa to 6a872d4.
LGTM. No errors found by checkpatch.
Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/2/testReport/
Force-pushed from 6a872d4 to 7135554.
LGTM. No errors found by checkpatch.
Force-pushed from 7135554 to 62c1859.
LGTM. No errors found by checkpatch.
Currently, when punching an object with multiple redundancy groups, we handle the whole punch via a single internal distributed transaction to guarantee atomicity. The DTX leader forwards the CPD RPC to every object shard within the same transaction. For a large-scale object, such as an SX object, the punch generates N RPCs, where N equals the count of all the VOS targets in the system. That is very slow and holds a lot of system resources for a relatively long time. If the system is under heavy load, the related RPC(s) may time out and trigger a DTX abort; the client then resends the RPC to the DTX leader for retry, which makes the situation progressively worse.

To resolve this, we punch the object collectively. The basic idea: when punching an object with multiple redundancy groups, the client sends an OBJ_COLL_PUNCH RPC to the DTX leader. On the DTX leader, instead of forwarding the request to all related VOS targets, we use a bcast RPC to spread the OBJ_COLL_PUNCH request to all involved engines. The related engines then generate collective tasks to punch the object shards on their own local VOS targets. That saves a lot of RPCs and resources.

On the other hand, for a large-scale object, transferring the related DTX participants information (which can be huge) would be a heavy burden, whether via the RPC body or via RDMA (as bulk data). So the OBJ_COLL_PUNCH RPC does not transfer dtx_memberships; instead, the related engines, leader or not, calculate the dtx_memberships data from the object layout by themselves. That causes some overhead, but compared with broadcasting the huge DTX participants information over the network, it may be the better choice.

Introduce two environment variables to control the collective punch:

DTX_COLL_TREE_TOPO: the bcast RPC tree topo for collective transactions on the server. The valid range is [8, 128]; the default value is 32.

OBJ_COLL_PUNCH_THRESHOLD: the threshold for triggering a collective object punch on the client. The default (and also the minimum) value is 32.

Required-githooks: true

Signed-off-by: Fan Yong <fan.yong@intel.com>
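As a rough, hypothetical sketch of the client-side decision described above (the helper name, the grp_nr source, and the hard-coded values are assumptions for illustration, not the patch's actual code), the path selection reduces to a threshold check on the redundancy-group count:

```c
#include <stdbool.h>
#include <stdio.h>

/* Default threshold in this revision of the patch; overridable via the
 * OBJ_COLL_PUNCH_THRESHOLD environment variable per the message above. */
#define OBJ_COLL_PUNCH_THRESHOLD_DEF	32

/* Hypothetical: choose the punch path from the redundancy-group count. */
static bool obj_punch_use_collective(unsigned int grp_nr, unsigned int thd)
{
	/* Collective punch pays off only when the object spans many
	 * redundancy groups; small objects keep the plain DTX path. */
	return grp_nr >= thd;
}

int main(void)
{
	unsigned int grp_nr = 256;	/* e.g. an SX object on a large system */

	if (obj_punch_use_collective(grp_nr, OBJ_COLL_PUNCH_THRESHOLD_DEF))
		printf("send OBJ_COLL_PUNCH to the DTX leader\n");
	else
		printf("punch via the regular CPD distributed transaction\n");
	return 0;
}
```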
Force-pushed from 62c1859 to 4670bc5.
LGTM. No errors found by checkpatch.
Ping reviewers, thanks!
I have not reviewed, but I was just testing this on Aurora again, with a much smaller set of servers (just for functionality), and the server was crashing / asserting with:
aurora-daos-0247: ERROR: daos_engine:0 11/16-15:23:07.08 aurora-daos-0247 DAOS[5065/13/527240] object EMRG src/object/srv_obj.c:5912 obj_coll_punch_prep() Assertion 'ddt[0].ddt_id == ocpi->ocpi_leader_id' failed
aurora-daos-0247: daos_engine: src/object/srv_obj.c:5912: obj_coll_punch_prep: Assertion `0' failed.
aurora-daos-0247: ERROR: daos_engine:0 *** Process 5065 (daos_engine) received signal 6 (Aborted) ***
aurora-daos-0247: Associated errno: Success (0)
So this cannot land without more validation at larger scale, unfortunately.
Just read part of the code so far...
src/dtx/dtx_internal.h
Outdated
@@ -131,6 +162,13 @@ extern uint32_t dtx_agg_thd_age_lo;
/* The default count of DTX batched commit ULTs. */
#define DTX_BATCHED_ULT_DEF 32

/* The bcast RPC tree topo for collective transaction. */
#define DTX_COLL_TREE_TOPO_MAX 128
The max branch ratio defined in CaRT is CRT_TREE_MAX_RATIO (64), so 128 is too large and will get an error.
And the default value of 32 seems too large; maybe 8 is enough.
To bcast to 1024 nodes: a knomial-32 tree has depth 2 and max children 62;
a knomial-8 tree has depth 4 and max children 22;
a knomial-2 tree has depth 10 and max children 10.
A KARY tree's branch ratio is the number of children, but a KNOMIAL tree can have more children.
So the MIN can be just 2 and the MAX can be 64; a DEFAULT of 8 looks better for a KNOMIAL tree.
Just FYI.
BTW, maybe "DTX_COLL_TREE_BRANCH" is a better name than "DTX_COLL_TREE_TOPO", since the variable does not set the tree type.
For the Aurora case, we have 2K engines, and I do not want the tree depth to be too deep, since that may add latency. So I plan to change the default to 16 to keep the depth within 3; the max will be set to 64.
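The depth and children figures quoted above can be double-checked with a small standalone program. This is only an illustrative sketch: it assumes the common knomial layout where rank 0's children are the ranks j * k^i (1 <= j < k), which may differ from CaRT's actual tree construction.

```c
#include <stdio.h>

/* Depth of a knomial(k) tree over n ranks: smallest d with k^d >= n. */
static int knomial_depth(long n, long k)
{
	int	d = 0;
	long	reach = 1;

	while (reach < n) {
		reach *= k;
		d++;
	}
	return d;
}

/* Children of rank 0 in the assumed knomial layout: ranks j * k^i
 * (1 <= j < k) that fall inside [1, n). */
static int knomial_root_children(long n, long k)
{
	int	children = 0;

	for (long span = 1; span < n; span *= k)
		for (long j = 1; j < k && j * span < n; j++)
			children++;
	return children;
}

int main(void)
{
	long	widths[] = {2, 8, 32};
	long	n = 1024;

	for (int i = 0; i < 3; i++)
		printf("knomial-%ld over %ld ranks: depth %d, root children %d\n",
		       widths[i], n, knomial_depth(n, widths[i]),
		       knomial_root_children(n, widths[i]));
	return 0;
}
```

For 1024 ranks this prints depth 10 / children 10 for knomial-2, depth 4 / children 22 for knomial-8, and depth 2 / children 62 for knomial-32, matching the numbers in the comment above.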
int i;

for (i = 0, off = obj->cob_md.omd_id.lo % obj->cob_shards_nr; i < obj->cob_shards_nr;
     i++, off = (off + 1) % obj->cob_shards_nr) {
the "off" seems no place use it? why need it then?
or you want "obj_shard_open(obj, off, map_ver, &shard);" below?
Right, I will fix it.
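For reference, a toy, runnable version of the intended loop (all names here are stand-ins; in the patch "off" would be passed to obj_shard_open() as the reviewer suggests):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t	oid_lo = 42;	/* stand-in for obj->cob_md.omd_id.lo */
	int		shards_nr = 6;	/* stand-in for obj->cob_shards_nr */
	int		i, off;

	/* Start from a shard offset derived from the object ID and walk
	 * every shard once, actually consuming the rotated index. */
	for (i = 0, off = oid_lo % shards_nr; i < shards_nr;
	     i++, off = (off + 1) % shards_nr)
		printf("open shard %d\n", off);	/* e.g. obj_shard_open(obj, off, ...) */
	return 0;
}
```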
src/object/srv_obj.c
Outdated
@@ -2196,6 +2200,28 @@ obj_ioc_begin_lite(uint32_t rpc_map_ver, uuid_t pool_uuid,
		D_GOTO(out, rc = -DER_STALE);
	} else if (DAOS_FAIL_CHECK(DAOS_DTX_STALE_PM)) {
Minor: the FAIL_CHECK is commonly the last branch.
I will fix it.
@@ -2596,8 +2622,6 @@ ds_obj_tgt_update_handler(crt_rpc_t *rpc)

	if (rc < 0 && rc != -DER_NONEXIST)
		D_GOTO(out, rc);

	dtx_flags |= DTX_RESEND;
Just curious: why is it not needed now?
The DTX_RESEND flag was only used to set dth->dth_resent, which has been useless for a long time, so the patch cleans them up.
@@ -2951,6 +2977,9 @@ ds_obj_rw_handler(crt_rpc_t *rpc)
	/* Execute the operation on all targets */
	rc = dtx_leader_exec_ops(dlh, obj_tgt_update, NULL, 0, &exec_arg);

	if (max_ver < dlh->dlh_rmt_ver)
		max_ver = dlh->dlh_rmt_ver;
Could you please explain a little why the "max_ver" checks are added in this function? Thanks.
Because different DTX participants may have different pool map versions, here we collect the max version from all the other participants and reply it back to the client when the versions do not match.
@@ -3004,6 +3033,9 @@ ds_obj_rw_handler(crt_rpc_t *rpc)
			DP_DTI(&orw->orw_dti), DP_RC(rc1));
	}

	if (ioc.ioc_map_ver < max_ver)
		ioc.ioc_map_ver = max_ver;
Is this the case where the client IO RPC's map_ver is larger than the server-side pool's map version? Why reset ioc.ioc_map_ver here?
It is for the case where some DTX participant (such as a replica) has a larger pool map version than the client; we then reply that version back to the client.
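A minimal standalone sketch of that version aggregation (the function and variable names here are stand-ins for the dlh_rmt_ver / ioc_map_ver fields shown in the diffs above, not the patch's actual code):

```c
#include <stdint.h>
#include <stdio.h>

/* Return the highest pool map version seen by the leader or by any
 * remote DTX participant; the reply carries it so a stale client can
 * refresh its pool map and retry. */
static uint32_t obj_reply_map_ver(uint32_t leader_ver,
				  const uint32_t *remote_vers, int nr)
{
	uint32_t	max_ver = leader_ver;

	for (int i = 0; i < nr; i++)
		if (remote_vers[i] > max_ver)
			max_ver = remote_vers[i];
	return max_ver;
}

int main(void)
{
	uint32_t	remotes[] = {7, 9, 8};	/* versions from replicas */

	/* The client sent version 7; one replica already saw version 9,
	 * so the reply tells the client to refresh to 9. */
	printf("reply map_ver = %u\n", obj_reply_map_ver(7, remotes, 3));
	return 0;
}
```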
Which test did you run, just a simple punch? What is the object class? Was there a rebuild or reintegration during the test? What is your test configuration? Thanks! BTW, I cannot reproduce the issue by punching an SX object on a 12 targets * 8 engines system.
Signed-off-by: Fan Yong <fan.yong@intel.com>
Fix issues from review feedback. Signed-off-by: Fan Yong <fan.yong@intel.com>
LGTM. No errors found by checkpatch.
@mchaarawi, would you please verify the new patch without the collective query patch? Thanks!
Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/6/testReport/
RPC throttling for collective punch. Signed-off-by: Fan Yong <fan.yong@intel.com>
LGTM. No errors found by checkpatch.
Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/7/testReport/
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13287/7/execution/node/1603/log
Replaced by #13386
Currently, when punching an object with multiple redundancy groups,
we handle the whole punch via a single internal distributed
transaction to guarantee atomicity. The DTX leader forwards the
CPD RPC to every object shard within the same transaction. For a
large-scale object, such as an SX object, the punch generates
N RPCs (N equals the count of all the VOS targets in the system).
That is very slow and holds a lot of system resources for a
relatively long time. If the system is under heavy load, the
related RPC(s) may time out and trigger a DTX abort; the client
then resends the RPC to the DTX leader for retry, which makes
the situation progressively worse.

To resolve this, we punch the object collectively. The basic
idea: when punching an object with multiple redundancy groups,
the client sends an OBJ_COLL_PUNCH RPC to the DTX leader. On
the DTX leader, instead of forwarding the request to all related
VOS targets, we use a bcast RPC to spread the OBJ_COLL_PUNCH
request to all involved engines. The related engines then generate
collective tasks to punch the object shards on their own local
VOS targets. That saves a lot of RPCs and resources.

On the other hand, for a large-scale object, transferring the
related DTX participants information (which can be huge) would be
a heavy burden, whether via the RPC body or via RDMA (as bulk
data). So the OBJ_COLL_PUNCH RPC does not transfer dtx_memberships;
instead, the related engines, leader or not, calculate the
dtx_memberships data from the object layout by themselves. That
causes some overhead, but compared with broadcasting the huge DTX
participants information over the network, it may be the better
choice.

Introduce two environment variables to control the collective punch:

DTX_COLL_TREE_WIDTH: the bcast RPC tree width for collective
transactions on the server. The valid range is [4, 64]; the default
value is 16.

OBJ_COLL_PUNCH_THRESHOLD: the threshold for triggering a collective
object punch on the client. The default (and also the minimum)
value is 16.
Required-githooks: true
Signed-off-by: Fan Yong <fan.yong@intel.com>
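A minimal sketch of how such knobs are typically read and clamped. This is not the DAOS implementation: the helper and the unbounded upper limit for the threshold are assumptions based only on the ranges stated in the commit message above.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: read an unsigned env var, fall back to a
 * default, and clamp the result into [min, max]. */
static unsigned int env_uint_clamped(const char *name, unsigned int def,
				     unsigned int min, unsigned int max)
{
	const char	*val = getenv(name);
	unsigned int	 v = def;

	if (val != NULL && *val != '\0')
		v = (unsigned int)strtoul(val, NULL, 10);
	if (v < min)
		v = min;
	if (v > max)
		v = max;
	return v;
}

/* DTX_COLL_TREE_WIDTH: bcast tree width on the server, [4, 64], default 16. */
unsigned int dtx_coll_tree_width(void)
{
	return env_uint_clamped("DTX_COLL_TREE_WIDTH", 16, 4, 64);
}

/* OBJ_COLL_PUNCH_THRESHOLD: client-side trigger; default and minimum
 * are both 16; no upper bound is documented, so UINT_MAX is assumed. */
unsigned int obj_coll_punch_threshold(void)
{
	return env_uint_clamped("OBJ_COLL_PUNCH_THRESHOLD", 16, 16, UINT_MAX);
}
```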