DAOS-14105 object: collectively punch object #13493
Conversation
LGTM. No errors found by checkpatch.
Test stage Fault injection testing on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/1/execution/node/1125/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/10/execution/node/1411/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/10/execution/node/1349/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/12/execution/node/1365/log
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1435/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1419/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1398/log
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/25/execution/node/1342/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/27/execution/node/1319/log
ec_mdtest_smoke failed for known issue DAOS-12526.
Ping reviewers. Thanks!
Currently, when punching an object with multiple redundancy groups, the whole punch is handled via a single internal distributed transaction to guarantee atomicity. The DTX leader forwards the CPD RPC to every object shard within the same transaction. For a large-scale object, such as an SX object, the punch generates N RPCs, where N equals the count of all the VOS targets in the system. That is very slow and holds a lot of system resources for a relatively long time. If the system is under heavy load, related RPCs may time out and trigger DTX abort, and the client will then resend the RPC to the DTX leader for retry, making the situation worse and worse.

To resolve this, we punch the object collectively.

The basic idea: when punching an object with multiple redundancy groups, the client scans the object layout and generates bitmap and target information for each DAOS engine that has object shards on it. The client then sends an OBJ_COLL_PUNCH RPC to the DTX leader. The bitmap and target information may be too large to transfer via the RPC body, in which case they are sent to the DTX leader via RDMA. The DTX leader does not directly forward the request to all related engines; instead, it splits the engines into multiple engine groups. For each engine group, it randomly chooses a relay engine that helps the current engine dispatch the OBJ_COLL_PUNCH RPC to the others in the same engine group. From the current engine's perspective, for each engine group it forwards only one OBJ_COLL_PUNCH RPC to the relay engine of that group, carrying only the bitmap and target information related to the engines in that group. This method can be applied recursively by the relay engines until all related engines have received the collective punch request.

We control the count of engine groups on the DTX leader to guarantee that the size of the bitmap and target information for each engine group does not exceed the RPC bulk threshold, so there is no RDMA among the engines for collectively punching the object. On each related DAOS engine, the local punch of the related object shards is driven via a collective task across its own local VOS targets. That saves a lot of RPCs and resources. The user can set the object collective punch threshold via the client-side environment variable "DAOS_OBJ_COLL_PUNCH_THD". The default (and also the minimum) value is 31.

On the other hand, for a large-scale object, transferring the (huge) DTX participants information would be a heavy burden whether via RPC body or RDMA. So the OBJ_COLL_PUNCH RPC does not transfer the complete dtx_memberships to the DAOS engines; instead, each related engine, leader or not, appends its local shard bitmap and target information to the client-given MBS header. That is enough for collective commit and abort. For DTX resync, the (new) DTX leader needs to recalculate the complete MBS. Since resync is relatively rare, even though the overhead of this recalculation can be quite high, it will not affect the whole system much.

The patch mainly contains the following components:
1. General framework for bitmap-based collective tasks on an engine, suitable for existing pool/container/object collective tasks.
2. General framework for generating the execution targets bitmap (and ID) array based on the object layout; it can be shared by all object-level collective operations, such as object punch or key query.
3. General framework for handling collective object RPCs on the server; currently shared by collective punch and collective query.
4. General framework for collective DTX: collective DTX commit and collective DTX abort, in both synchronous and asynchronous modes.
5. New DTX membership data structure (compatible with existing DTX) to support collective DTX resync.
6. General framework for multi-cast RPC among specified engines, which supports dispatching different content to different targets; shared by both client and server.
7. Object collective punch RPC.
8. Change multi-layered "if-else" branches to "switch" for handling pre-RPC based attributes; more efficient for all object RPCs.
9. Placement: pack target information into the object layout, bypassing pool_map_find_target() when DAOS target information is needed for the object layout.
10. Some test cases for collective object punch.

The patch also fixes some pre-existing bugs found while passing CI tests:
a. An incarnation log bug in check_equal(): it should check the status (committed or not) of "id_out" instead of "id_in", because "id_out" is the "in place" one to be changed. Otherwise, for ilog_update(), "id_in" always has a zero id_tx_id, which DTX logic regards as committed, so checking whether "id_in" is committed is meaningless. This issue may cause incorrect data visibility:
a.1 A read operation may break data-consistency semantics or return stale information.
a.2 A modification may be applied on top of a non-committable ilog, silently losing the modification.
b. We need to persistently store a DTX even if it changed nothing. From the current engine's perspective, it does not know whether the other DTX participants changed something remotely. If we do not save the empty DTX, subsequent DTX resync may be misled into regarding it as a failed transaction and aborting it, which may further affect other transactions based on it.
c. Sometimes an in-DRAM DTX entry that has already been attached to a persistent DTX blob is cancelled because of trouble in subsequent processing, such as hitting another in-process DTX. In that case, the upper-layer sponsor may retry the DTX after DTX refresh. At that time, the DTX entry must not reuse the former persistent DTX blob, because the former local TX has been aborted and the blob may have been released or reused by others. If the retried DTX reuses that persistent blob, it may overwrite another transaction's modification, causing data corruption.
Required-githooks: true
Signed-off-by: Fan Yong <fan.yong@intel.com>
just a rough read, generally LGTM, a few questions to confirm.
continue;

/* Skip the target that (re-)joined the system after the DTX. */
if (target->ta_comp.co_ver > dtx_ver)
For EXTEND or REINT, only co_in_ver is updated and co_ver is unchanged; should this compare with co_in_ver instead?
Same for several other places.
I am somewhat confused by the current usage of "co_ver" and "co_in_ver". It seems that "co_in_ver" should be used instead of "co_ver". @wangdi1, would you please give more explanation about that? Thanks!
@liuxuezhao, you are right. "co_ver" cannot replace "co_in_ver". I will change it and the other related DTX logic in the subsequent collective query patch. Thanks!
if (target->ta_comp.co_status != PO_COMP_ST_UP &&
    target->ta_comp.co_status != PO_COMP_ST_UPIN &&
    target->ta_comp.co_status != PO_COMP_ST_NEW &&
    target->ta_comp.co_status != PO_COMP_ST_DRAIN)
It seems that a target in UP or NEW status should never get the DTX RPC request (as it should not be in the object layout)?
Before the target is fully integrated into the pool, its status will be UP instead of UPIN, right? If so, the UP target will get the RPC. As for NEW, I am not quite sure. @wangdi1, any idea about that?
if (dce->dce_ranks == NULL)
	D_GOTO(out, rc = -DER_NOMEM);

D_ALLOC_ARRAY(dce->dce_hints, node_nr);
Just to confirm: why is the number of dce_hints not the same as dce_ranks? Is the hint not 1:1 per rank?
The first rank is the current engine, which does not send an RPC to itself, so the length of the ranks array is at most "node_nr - 1" or smaller. The hints array is sparse to speed up lookup: each rank can directly locate its slot by its rank number, so the length of the hints array equals the total count of ranks.
@@ -149,6 +180,20 @@ extern uint32_t dtx_batched_ult_max;
 */
#define DTX_INLINE_MBS_SIZE 512

#define DTX_COLL_TREE_WIDTH 16
Not a problem, but 16 seems too large for a knomial tree branch ratio; 4 is probably enough.
Thanks for pointing that out. Let's reduce it in a subsequent patch.
if (out_source->dco_status != 0 &&
    (out_target->dco_status == 0 || out_target->dco_status == -DER_NONEXIST))
	out_target->dco_status = out_source->dco_status;
Minor: a few logs might be useful here.
OK.
else if (dct->dct_tgt_cap <= 8)
	size = dct->dct_tgt_cap << 1;
else
	size = dct->dct_tgt_cap + 8;
Why is size 4 enough for "dct->dct_tgt_cap == 0"? I am also not sure why the size is calculated this way; a few comments may be helpful. Thanks.
We do not know how many shards will be on the rank before completing the scan. So at the beginning we allocate 4 items; if that is not enough, we double the size up to 16; if still more is needed, we add 8 items each time to avoid allocating too much. I will add more comments in a subsequent patch.
}

used = crp_proc_get_size_used(cpca->cpca_proc);
if (unlikely(used > size)) {
Did you hit this case in testing? If so, "size = (*p_size * 9) >> 3" may not be enough.
We have hit a similar issue before when using proc directly elsewhere. "size = (*p_size * 9) >> 3" may indeed not be enough: we cannot assume that the lower-layer proc function will not consume more space for additional packed information. That is why we have the "size = used && goto again" logic.
/* Try another for leader. */
leader = (leader + 1) % coa->coa_dct_nr;
goto new_leader;
It seems possible to get a dead loop here?
That should be impossible; otherwise, all the shards would be either in rebuilding or in reintegration.
uuid_copy(opi->opi_co_hdl, args->pa_coh_uuid);
uuid_copy(opi->opi_co_uuid, args->pa_cont_uuid);
uuid_copy(opi->opi_co_hdl, shard->do_co->dc_cont_hdl);
uuid_copy(opi->opi_co_uuid, shard->do_co->dc_uuid);
Just curious: should it be the same as the original code?
Because of space limitations in the related data structure, po_{coh,cont}_uuid are no longer there.
@@ -63,7 +63,9 @@ struct pl_obj_shard {
uint32_t po_shard; /* shard identifier */
uint32_t po_target; /* target id */
uint32_t po_fseq; /* The latest failure sequence */
uint32_t po_rebuilding:1, /* rebuilding status */
uint16_t po_rank; /* The rank on which the shard exists */
I am not sure it is a good idea to define it as 16 bits; the number of engines could possibly exceed 64K in the near future. Or just enlarge the structure by another 8 bytes.
That is a good question. Honestly, I am not sure about this. Enlarging it is possible, but mostly a waste. We could consider using 20 bits (to support 1M ranks)? @gnailzenh, what's your idea?
I think this patch is clean; I just have one more question, because the patch is very large and I cannot understand all the details: if one engine fails to execute the collective punch, can we guarantee that all the collective members will abort the operation? I am asking because we now have indirect RPC forwarding, which could make things more complex.
if (rc < 0)
	D_GOTO(out, rc);

/* The target that (re-)joined the system after DTX cannot be the leader. */
This is new logic, right? I did not see it in the original code.
Not new logic; it is just moved from dtx_resync.c to here so that the logic can be shared with others.
dcrc->dcrc_ptr = dcr;
dcrc->dcrc_dte = dtx_entry_get(rbund->entry);
d_list_add_tail(&dcrc->dcrc_gl_committable, &cont->sc_dtx_cos_list);
}
Style: the same code appears in dtx_cos_rec_alloc(); it would be nice to have an inline function.
OK, that can be done in a subsequent patch.
*dcks = dck_buf;
*p_dce = dtx_coll_entry_get(dcrc->dcrc_dce);

return 1;
The reason that we only fetch one collective RPC is that it is heavier weight than other RPCs, right? If that is the case, can we do synchronous commit for collective RPCs? If it would make the code more complex, please ignore this comment.
Firstly, it is difficult to batch multiple collective DTX entries into one RPC; that would make the RPC body too complex.
Secondly, it may cause the RPC body size to exceed the bulk size limitation if the shard layouts for different DTX entries differ on the related engines.
We previously implemented synchronous collective DTX, and its performance was not good; that is why we re-implemented it as asynchronous collective DTX. From the application perspective, it decreases latency.
dck_buf[i].oid = dcrc->dcrc_ptr->dcr_oid;
dck_buf[i].dkey_hash = dcrc->dcrc_ptr->dcr_dkey_hash;

if (unlikely(oid != NULL && dcrc->dcrc_ptr == NULL)) {
	if (i > 0)
Don't we need to call daos_unit_oid_compare() in this case?
"oid != NULL" means we only want to find the DTX entries that touch a specific object. It is used for object sync.
@@ -375,6 +372,140 @@ dtx_handler(crt_rpc_t *rpc)
	ds_cont_child_put(cont);
}

static void
dtx_coll_handler(crt_rpc_t *rpc)
This is on xstream-0, right?
Yes, collective DTX RPC goes via CART-level broadcast, so the receiving peer must be XS-0; that is different from collective punch (the latter is our object-level broadcast). But dtx_coll_handler just starts ULTs on other XSs and is not involved in any layout/bitmap scan or calculation, so it is expected to be a very light operation.
/* More ranks joined after obj_coll_oper_args_init(). */
if (unlikely(shard->do_target_rank >= coa->coa_dct_cap)) {
	D_REALLOC_ARRAY(dct, coa->coa_dcts, coa->coa_dct_cap, shard->do_target_rank + 2);
same question
 * It stores the identifiers of shards on the engine, regardless of which VOS
 * target they are on, only for the modification case.
 */
uint32_t *dct_tgt_ids;
why do we need this if we already have the bitmap "dct_bitmap"
> why do we need this if we already have the bitmap "dct_bitmap"

The bitmap is used for punch; dct_tgt_ids is used for DTX resync.
uint32_t dce_ver;
uint32_t dce_refs;
d_rank_list_t *dce_ranks;
uint8_t *dce_hints;
Sorry, what is dce_hints?
It is used for collective DTX to indicate on which VOS target we can find the DTX entry. There is a comment in dtx_internal.h.
struct dc_pool *pool = obj->cob_pool;

D_RWLOCK_RDLOCK(&pool->dp_map_lock);
if (shard_cnt < pool_map_node_nr(pool->dp_map) << 1)
Not from this patch, but I think pool_map_node_nr() actually returns the rank count, right?
pool_map_node_nr() returns the count of ranks in the system. The comparison means that if the shard count is not at least double the count of ranks, it is unnecessary to use a collective operation.
size = dct->dct_bitmap_sz << 3;

/* Randomly choose a XS as the local leader on target engine for load balance. */
for (i = 0, pos = (rand != 0 ? rand : d_rand()) % dct->dct_tgt_nr; i < size; i++) {
Does this include xstream-0? If it does, I would suggest not using xstream-0; otherwise, never mind.
The collective punch (not including collective DTX) does not touch XS-0.
If one engine fails to execute the punch, whether it is a leaf node, a relay engine, or the leader engine, the DTX leader engine will trigger collective DTX abort to abort the whole transaction. That is similar to how the regular DTX leader handles failure cases. As for the "guarantee", it is the same as the current regular DTX abort.
DAOS-14679 pool: Report on stopping sp_stopping (#14374) DAOS-15145 pool: add pool collective function (#13764) *DAOS-14105 object: collectively punch object (#13493) * partial backport, just the bitmap function Required-githooks: true Change-Id: I2b21b8121cbdecc79ae49a464a42b1d47fb9be10 Signed-off-by: Jeff Olivier <jeffolivier@google.com>
DAOS-14679 pool: Report on stopping sp_stopping (#14374) DAOS-15514 container: fix container destroy failure (#14108) DAOS-15672 rebuild: Fix pool destroy hangs (#14183) DAOS-15145 pool: add pool collective function (#13764) *DAOS-14105 object: collectively punch object (#13493) partial backport, just the bitmap function Signed-off-by: Jeff Olivier <jeffolivier@google.com> Signed-off-by: Li Wei <wei.g.li@intel.com> Signed-off-by: Wang Shilong <shilong.wang@intel.com>