
DAOS-14105 object: collectively punch object #13493

Merged: 1 commit merged from Nasf-Fan/DAOS-14105_7 into master on Jan 2, 2024

Conversation

@Nasf-Fan (Contributor) commented Dec 13, 2023

Currently, when punching an object that spans multiple redundancy groups, the whole punch is handled as a single internal distributed transaction to guarantee atomicity. The DTX leader forwards the CPD RPC to every object shard within the same transaction. For a large-scale object, such as an SX object, the punch generates N RPCs, where N is the count of all the VOS targets in the system. That is very slow and holds a lot of system resources for a relatively long time. If the system is under heavy load, the related RPC(s) may time out and trigger a DTX abort; the client then resends the RPC to the DTX leader for retry, making the situation progressively worse.

To resolve this, we punch the object collectively.

The basic idea is this: when punching an object with multiple redundancy groups, the client scans the object layout and generates bitmap and target information for each DAOS engine that holds shards of the object. The client then sends an OBJ_COLL_PUNCH RPC to the DTX leader. If the bitmap and target information are too large to fit in the RPC body, they are transferred to the DTX leader via RDMA. The DTX leader does not forward the request directly to all related engines; instead, it splits the engines into multiple engine groups. For each engine group, it randomly chooses a relay engine that dispatches the OBJ_COLL_PUNCH RPC to the other engines in the same group. From the current engine's perspective, it forwards only one OBJ_COLL_PUNCH RPC per engine group, to that group's relay engine, carrying only the bitmap and target information relevant to the engines in that group. Relay engines can apply the same method recursively until all related engines have received the collective punch request.

We control the count of engine groups on the DTX leader to guarantee that the size of the bitmap and target information for each engine group does not exceed the RPC bulk threshold. Consequently, no RDMA is needed among the engines for collectively punching the object.

On each related DAOS engine, the local punch of the related object shards is driven by a collective task across the engine's own local VOS targets. That saves a lot of RPCs and resources.
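
To make the dispatch flow concrete, here is a minimal sketch (illustrative C only; forward_coll_punch() and coll_punch_dispatch() are hypothetical names, not the actual DAOS implementation) of how a leader could split the target engines into bounded groups and forward one RPC per group to a randomly chosen relay engine:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for forwarding OBJ_COLL_PUNCH to a relay engine,
 * together with the bitmap/target information of its whole group. */
static int
forward_coll_punch(int relay_rank, const int *group_ranks, int group_nr)
{
        printf("forward to relay rank %d (group head %d, %d engines)\n",
               relay_rank, group_ranks[0], group_nr);
        return 0;
}

/* Split the engines into groups no larger than max_group_size (chosen so each
 * group's bitmap/target information fits in the RPC body), then send one RPC
 * per group to a random relay engine, which dispatches to the rest of its group. */
static int
coll_punch_dispatch(const int *engine_ranks, int engine_nr, int max_group_size)
{
        int i, rc = 0;

        for (i = 0; i < engine_nr && rc == 0; i += max_group_size) {
                int group_nr = engine_nr - i < max_group_size ?
                               engine_nr - i : max_group_size;
                int relay    = engine_ranks[i + rand() % group_nr];

                rc = forward_coll_punch(relay, engine_ranks + i, group_nr);
        }
        return rc;
}

With, say, 256 engines and a group size of 16, the leader sends only 16 RPCs; each relay then fans out to the remaining members of its group, and relays can recurse the same way on larger systems.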

The user can set the object collective punch threshold via the client-side environment variable "DAOS_OBJ_COLL_PUNCH_THD". The default (and also the minimum) value is 31.
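
For illustration only (a minimal sketch, not the actual client code; obj_coll_punch_threshold() is a hypothetical helper), reading and clamping this environment variable could look like:

#include <stdlib.h>

#define OBJ_COLL_PUNCH_THD_MIN 31U  /* default and minimum value described above */

/* Read DAOS_OBJ_COLL_PUNCH_THD; fall back to, and clamp at, the minimum of 31. */
static unsigned int
obj_coll_punch_threshold(void)
{
        const char   *val = getenv("DAOS_OBJ_COLL_PUNCH_THD");
        unsigned int  thd = OBJ_COLL_PUNCH_THD_MIN;

        if (val != NULL)
                thd = (unsigned int)strtoul(val, NULL, 0);
        if (thd < OBJ_COLL_PUNCH_THD_MIN)
                thd = OBJ_COLL_PUNCH_THD_MIN;
        return thd;
}

An object whose shard count reaches this threshold would then go through the collective punch path, while smaller objects keep using the regular per-shard RPCs.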

On the other hand, for a large-scale object, transferring the full DTX participant information (which can be huge) is a heavy burden whether it goes via the RPC body or RDMA. Therefore, the OBJ_COLL_PUNCH RPC does not transfer the complete dtx_memberships to the DAOS engines; instead, each related engine, whether leader or not, appends its local shard bitmap and target information to the client-provided MBS header. That is enough for collective commit and abort. For DTX resync, the (new) DTX leader needs to re-calculate the complete MBS; since resync is relatively rare, even though such re-calculation can be quite expensive, it does not affect the whole system much.

The patch mainly contains the following components:

  1. General framework for bitmap-based collective tasks on an engine. It
    is suitable for the existing pool/container/object collective tasks.

  2. General framework for generating the execution-target bitmap (and ID)
    array based on the object layout. It can be shared by all object-level
    collective operations, such as punch object or query key.

  3. General framework for handling collective object RPCs on the server.
    Currently it is shared by collective punch and collective query.

  4. General framework for collective DTX: collective DTX commit and
    collective DTX abort, in both synchronous and asynchronous modes.

  5. New DTX membership data structure (compatible with the existing DTX)
    to support collective DTX resync.

  6. General framework for multicast RPC among specified engines. It
    supports dispatching different content to different targets and is
    shared by both client and server.

  7. Object collective punch RPC.

  8. Change multi-layered "if-else" branches into "switch" statements for
    handling per-RPC attributes. That is more efficient for all object RPCs.

  9. Placement: pack target information into the object layout. That
    bypasses pool_map_find_target() when DAOS target information is needed
    for the related object layout.

  10. Some test cases for collectively punching objects.

The patch also fixes some pre-existing bugs that were exposed while passing CI tests:

a. An incarnation log bug in check_equal(): it should check the status
(committed or not) of "id_out" instead of "id_in", because "id_out" is
the in-place entry to be changed. Otherwise, for ilog_update(), "id_in"
always has a zero id_tx_id, which the DTX logic regards as committed,
so checking whether "id_in" is committed is meaningless. This issue may
cause incorrect data visibility:

a.1 A read operation may break data-consistency semantics or return
stale information.

a.2 A modification may be applied on top of a non-committable ilog
entry, silently losing the modification.
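
As a hedged sketch of the fix direction for item (a) only (struct ilog_id_sketch and status_is_committed() are simplified stand-ins, not the real ilog definitions), the commit-status check belongs on the in-place entry "id_out" rather than on the incoming "id_in":

#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the real incarnation-log structures and helpers. */
struct ilog_id_sketch {
        uint64_t id_epoch;
        uint64_t id_tx_id;   /* 0 means "no DTX", which DTX logic treats as committed */
};

static bool
status_is_committed(const struct ilog_id_sketch *id)
{
        return id->id_tx_id == 0; /* stand-in for the real DTX status lookup */
}

/* The gist of the fix: when deciding whether the existing (in-place) entry can
 * be treated as committed, look at "id_out", not the incoming "id_in"; for
 * ilog_update() the incoming "id_in" always has id_tx_id == 0, so checking it
 * would always report "committed" and tells us nothing. */
static bool
existing_entry_committed(const struct ilog_id_sketch *id_out,
                         const struct ilog_id_sketch *id_in)
{
        (void)id_in; /* intentionally ignored */
        return status_is_committed(id_out);
}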

b. We need to persistently store a DTX even if it changed nothing
locally. From the current engine's perspective, it cannot know whether
the other DTX participants changed something remotely. If we do not
save the empty DTX, subsequent DTX resync may misinterpret it as a
failed transaction and abort it, which may further affect other
transactions that depend on it.

c. Sometimes an in-DRAM DTX entry that has already been attached to a
persistent DTX blob may be cancelled because of trouble in subsequent
processing, such as hitting another in-progress DTX. In that case, the
upper-layer sponsor may retry the DTX after a DTX refresh. At that
point, the DTX entry must not reuse the former persistent DTX blob,
because the former local TX has been aborted and the persistent DTX
blob may have been released or reused by others. If the retried DTX
reused that persistent DTX blob, it could overwrite someone else's
modification and corrupt data.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force-landed and the forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask the submitter for a new summary.

@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

github-actions bot commented Dec 13, 2023

Bug-tracker data:
Ticket title is 'Punch large-scaled object collectively'
Status is 'Awaiting Verification'
Labels: 'tds'
https://daosio.atlassian.net/browse/DAOS-14105

@daosbuild1 (Collaborator):
Test stage Fault injection testing on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/1/execution/node/1125/log


@Nasf-Fan Nasf-Fan marked this pull request as ready for review December 14, 2023 15:15
@Nasf-Fan Nasf-Fan requested review from a team as code owners December 14, 2023 15:15
@daosbuild1 (Collaborator) reported CI failures:

  • Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/10/execution/node/1411/log
  • Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/10/execution/node/1349/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-14105_7 branch 2 times, most recently from aa3b8a3 to b8d3c6d Compare December 16, 2023 05:45
@daosbuild1 (Collaborator) reported CI results:

  • Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/12/execution/node/1365/log
  • Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
  • Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
  • Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/24/testReport/
  • Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1435/log
  • Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1419/log
  • Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/24/execution/node/1398/log
  • Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13493/25/testReport/
  • Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/25/execution/node/1342/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-14105_7 branch 2 times, most recently from dd7b4e7 to 3fc9199 Compare December 26, 2023 06:55
@daosbuild1 (Collaborator):
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/27/execution/node/1319/log

@Nasf-Fan (Contributor, Author) replied:

> Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13493/27/execution/node/1319/log

ec_mdtest_smoke failed due to the known issue DAOS-12526.

@Nasf-Fan Nasf-Fan added the "priority" label (Ticket has high priority, automatically managed) Dec 27, 2023
@Nasf-Fan (Contributor, Author) commented:
Ping reviewers. Thanks!

@github-actions github-actions bot removed the "priority" label (Ticket has high priority, automatically managed) Dec 29, 2023
@liuxuezhao (Contributor) left a comment:

Just a rough read; generally LGTM, with a few questions to confirm.

continue;

/* Skip the target that (re-)joined the system after the DTX. */
if (target->ta_comp.co_ver > dtx_ver)

Contributor (@liuxuezhao):
For EXTEND or REINT, only co_in_ver is updated and co_ver does not change, so shouldn't this compare against co_in_ver? The same applies to several other places.

Contributor Author (@Nasf-Fan):
I am somewhat confused by the current usage of "co_ver" and "co_in_ver". It does seem that "co_in_ver" should be used instead of "co_ver". @wangdi1, would you please give more explanation about that? Thanks!

Contributor Author (@Nasf-Fan):
@liuxuezhao, you are right: "co_ver" cannot replace "co_in_ver". I will change it, and the other related DTX logic, in the subsequent collective query patch. Thanks!

if (target->ta_comp.co_status != PO_COMP_ST_UP &&
target->ta_comp.co_status != PO_COMP_ST_UPIN &&
target->ta_comp.co_status != PO_COMP_ST_NEW &&
target->ta_comp.co_status != PO_COMP_ST_DRAIN)

Contributor:
It seems a target in UP or NEW status should never get the DTX RPC request (as it should not be in the object layout)?

Contributor Author (@Nasf-Fan):
Before the target is fully integrated into the pool, its status will be UP instead of UPIN, right? If so, the UP target will get the RPC. As for NEW, I am not quite sure. @wangdi1, any idea?

if (dce->dce_ranks == NULL)
D_GOTO(out, rc = -DER_NOMEM);

D_ALLOC_ARRAY(dce->dce_hints, node_nr);

Contributor:
Just to confirm: why is the number of dce_hints not the same as dce_ranks? Is the hint not 1:1 per rank?

Contributor Author (@Nasf-Fan):
The first rank is the current engine, which will not send an RPC to itself, so the length of the ranks array is at most "node_nr - 1", or smaller. The hints array is sparse to speed up lookup: each rank can directly locate its slot based on its rank number, so the length of the hints array equals the total count of ranks in the system.
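
A minimal sketch of that layout, using hypothetical names rather than the actual DAOS structures: the ranks list excludes the local engine, while the hints array has one slot per rank in the system so a rank number can index it directly.

#include <stdint.h>

struct coll_entry_sketch {
        uint32_t *ranks;     /* peer engine ranks, local engine excluded: at most node_nr - 1 */
        uint32_t  rank_nr;
        uint8_t  *hints;     /* sparse: one slot per rank in the system, length node_nr */
        uint32_t  node_nr;
};

/* O(1) lookup: a rank directly indexes its own hint slot. */
static uint8_t
hint_for_rank(const struct coll_entry_sketch *dce, uint32_t rank)
{
        return rank < dce->node_nr ? dce->hints[rank] : 0;
}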

@@ -149,6 +180,20 @@ extern uint32_t dtx_batched_ult_max;
*/
#define DTX_INLINE_MBS_SIZE 512

#define DTX_COLL_TREE_WIDTH 16

Contributor:
Not a problem, but 16 seems too large for a k-nomial tree branch ratio; 4 is probably enough.

Contributor Author (@Nasf-Fan):
Thanks for pointing that out. Let's reduce it in a subsequent patch.


if (out_source->dco_status != 0 &&
(out_target->dco_status == 0 || out_target->dco_status == -DER_NONEXIST))
out_target->dco_status = out_source->dco_status;

Contributor:
Minor: a few log messages might be useful here.

Contributor Author (@Nasf-Fan):
OK.

else if (dct->dct_tgt_cap <= 8)
size = dct->dct_tgt_cap << 1;
else
size = dct->dct_tgt_cap + 8;

Contributor:
Why is a size of 4 enough when "dct->dct_tgt_cap == 0"? It is also not obvious why the size is calculated this way; a few comments would be helpful. Thanks.

Contributor Author (@Nasf-Fan):
We do not know how many shards will be on the rank before completing the scan. So at the beginning we allocate 4 items; if that is not enough, we double the size up to 16; and if we need even more, we add 8 items each time to avoid allocating too much. I will add more comments in a subsequent patch.
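
For reference, that growth policy can be condensed into a small sketch (illustrative only; next_tgt_cap() is a hypothetical helper, not code from the patch):

/* Capacity growth for the per-rank target array: start at 4 (most ranks hold
 * only a few shards), double while small (4 -> 8 -> 16), then grow linearly
 * by 8 to avoid over-allocating for very wide objects. */
static unsigned int
next_tgt_cap(unsigned int cur_cap)
{
        if (cur_cap == 0)
                return 4;
        if (cur_cap <= 8)
                return cur_cap << 1;
        return cur_cap + 8;
}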

}

used = crp_proc_get_size_used(cpca->cpca_proc);
if (unlikely(used > size)) {

Contributor:
Did you hit this case in testing? If so, "size = (*p_size * 9) >> 3" may not be enough.

Contributor Author (@Nasf-Fan):
We have hit a similar issue in another place where proc is used directly. "size = (*p_size * 9) >> 3" may indeed not be enough: we cannot assume the lower-layer proc function will not consume more space for additional packed information. That is why we have the "size = used && goto again" logic.
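
A hedged sketch of that estimate-then-verify pattern (packed_size_used() and pack_with_retry() are hypothetical stand-ins, not the real cart proc callbacks):

#include <stdlib.h>
#include <string.h>

/* Stand-in: report how much space the encoder actually needs for a payload. */
static size_t
packed_size_used(size_t payload_len)
{
        return payload_len + 16; /* pretend the encoder adds a small header */
}

/* Estimate with ~12.5% headroom ("size = (*p_size * 9) >> 3"), then verify the
 * space actually consumed; if the estimate was short, retry with the measured
 * size instead of assuming the encoder never grows the data. */
static void *
pack_with_retry(const void *payload, size_t payload_len, size_t *out_size)
{
        size_t size = (payload_len * 9) >> 3;
        void  *buf;

        if (size == 0)
                size = 1;
again:
        buf = malloc(size);
        if (buf == NULL)
                return NULL;

        if (packed_size_used(payload_len) > size) {
                size = packed_size_used(payload_len);
                free(buf);
                goto again;
        }

        memcpy(buf, payload, payload_len); /* stand-in for the real encoding */
        *out_size = size;
        return buf;
}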


/* Try another for leader. */
leader = (leader + 1) % coa->coa_dct_nr;
goto new_leader;

Contributor:
Does it seem possible to get into a dead loop here?

Contributor Author (@Nasf-Fan):
That should be impossible; otherwise, all the shards would be either in rebuild or in reintegration.

uuid_copy(opi->opi_co_hdl, args->pa_coh_uuid);
uuid_copy(opi->opi_co_uuid, args->pa_cont_uuid);
uuid_copy(opi->opi_co_hdl, shard->do_co->dc_cont_hdl);
uuid_copy(opi->opi_co_uuid, shard->do_co->dc_uuid);

Contributor:
Just curious: shouldn't this be the same as the original code?

Contributor Author (@Nasf-Fan):
Because of space limitations in the related data structure, po_{coh,cont}_uuid are not there any longer.

@@ -63,7 +63,9 @@ struct pl_obj_shard {
uint32_t po_shard; /* shard identifier */
uint32_t po_target; /* target id */
uint32_t po_fseq; /* The latest failure sequence */
uint32_t po_rebuilding:1, /* rebuilding status */
uint16_t po_rank; /* The rank on which the shard exists */

Contributor:
I am not sure it is a good idea to define it as 16 bits; the number of engines could exceed 64K in the near future. Alternatively, just enlarge the structure by another 8 bytes.

Contributor Author (@Nasf-Fan):
That is a good question. Honestly, I am not sure either. Enlarging the structure is possible, but mostly a waste. Could we consider using 20 bits (to support 1M ranks) for it? @gnailzenh, what is your idea?
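
Purely to illustrate the 20-bit option mentioned here (the field names and layout are illustrative; this is not what the patch does):

#include <stdint.h>

/* One possible packing: a 20-bit rank (up to ~1M ranks) sharing one 32-bit
 * word with per-shard status bits, so the shard descriptor does not grow. */
struct pl_obj_shard_sketch {
        uint32_t po_shard;            /* shard identifier */
        uint32_t po_target;           /* target id */
        uint32_t po_fseq;             /* latest failure sequence */
        uint32_t po_rank       : 20,  /* rank on which the shard exists */
                 po_rebuilding : 1,   /* rebuilding status */
                 po_flags      : 11;  /* room for other per-shard flags */
};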

@gnailzenh (Contributor) left a comment:

I think this patch is clean; I just have one more question, because the patch is very large and I cannot understand all the details:
If one engine fails to execute the collective punch, can we guarantee that all the collective members will abort the operation? I am asking because we now have indirect RPC forwarding, which could make things more complex.

if (rc < 0)
D_GOTO(out, rc);

/* The target that (re-)joined the system after DTX cannot be the leader. */

Contributor:
This is new logic, right? I did not see it in the original code.

Contributor Author (@Nasf-Fan):
Not new logic; it is just moved here from dtx_resync.c so the logic can be shared with others.

dcrc->dcrc_ptr = dcr;
dcrc->dcrc_dte = dtx_entry_get(rbund->entry);
d_list_add_tail(&dcrc->dcrc_gl_committable, &cont->sc_dtx_cos_list);
}

Contributor:
Style: the same code exists in dtx_cos_rec_alloc(); it would be nice to have an inline function.

Contributor Author (@Nasf-Fan):
OK, that can be done in a subsequent patch.

*dcks = dck_buf;
*p_dce = dtx_coll_entry_get(dcrc->dcrc_dce);

return 1;

Contributor:
The reason that we only fetch one collective RPC is that it is heavier weight than other RPCs, right? If that is the case, can we do a synchronous commit for collective RPCs? If it would make the code more complex, please ignore this comment.

Contributor Author (@Nasf-Fan):
Firstly, it is difficult to batch multiple collective DTX entries into one RPC; that would make the RPC body too complex.

Secondly, it may cause the RPC body size to exceed the BULK size limitation if the shard layouts of different DTX entries differ on the related engines.

We did implement synchronous collective DTX at one point, and its performance was not good. That is why we re-implemented it as asynchronous collective DTX. From the application's perspective, it decreases latency.

dck_buf[i].oid = dcrc->dcrc_ptr->dcr_oid;
dck_buf[i].dkey_hash = dcrc->dcrc_ptr->dcr_dkey_hash;

if (unlikely(oid != NULL && dcrc->dcrc_ptr == NULL)) {
if (i > 0)

Contributor:
We don't need to call daos_unit_oid_compare() in this case?

Contributor Author (@Nasf-Fan):
"oid != NULL" means we only want to find the DTX entries that touch a specified object. It is used for object sync.

@@ -375,6 +372,140 @@ dtx_handler(crt_rpc_t *rpc)
ds_cont_child_put(cont);
}

static void
dtx_coll_handler(crt_rpc_t *rpc)

Contributor:
This is on xstream-0, right?

Contributor Author (@Nasf-Fan):
Yes. The collective DTX RPC goes via the cart-level broadcast, so the receiving peer must be XS-0; that is different from collective punch (the latter uses our object-level broadcast). But dtx_coll_handler just starts ULTs on the other XSes and does not involve XS-0 in any layout/bitmap scan or calculation, so it is expected to be a very light operation.


/* More ranks joined after obj_coll_oper_args_init(). */
if (unlikely(shard->do_target_rank >= coa->coa_dct_cap)) {
D_REALLOC_ARRAY(dct, coa->coa_dcts, coa->coa_dct_cap, shard->do_target_rank + 2);

Contributor:
Same question.

* It stores the identifiers of shards on the engine, in spite of on which VOS target,
* only for modification case.
*/
uint32_t *dct_tgt_ids;

Contributor:
Why do we need this if we already have the bitmap "dct_bitmap"?

Contributor Author (@Nasf-Fan):
The bitmap is used for punch; the target IDs are used for DTX resync.

uint32_t dce_ver;
uint32_t dce_refs;
d_rank_list_t *dce_ranks;
uint8_t *dce_hints;

Contributor:
Sorry, what is dce_hints?

Contributor Author (@Nasf-Fan):
It is used by collective DTX to indicate on which VOS target the DTX entry can be found. There is a comment about it in dtx_internal.h.

struct dc_pool *pool = obj->cob_pool;

D_RWLOCK_RDLOCK(&pool->dp_map_lock);
if (shard_cnt < pool_map_node_nr(pool->dp_map) << 1)

Contributor:
Not from this patch, but I think pool_map_node_nr() actually returns the rank count, right?

Contributor Author (@Nasf-Fan):
pool_map_node_nr() returns the count of ranks in the system. The comparison here means that if the shard count is not at least double the rank count, it is unnecessary to use a collective operation.
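
Restated as a tiny sketch (illustrative only; use_collective_punch() is a hypothetical helper):

#include <stdbool.h>
#include <stdint.h>

/* Only take the collective path when there are, on average, at least two
 * shards per rank (engine); otherwise plain per-shard RPCs are cheap enough. */
static bool
use_collective_punch(uint32_t shard_cnt, uint32_t rank_nr)
{
        return shard_cnt >= (rank_nr << 1);
}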

size = dct->dct_bitmap_sz << 3;

/* Randomly choose a XS as the local leader on target engine for load balance. */
for (i = 0, pos = (rand != 0 ? rand : d_rand()) % dct->dct_tgt_nr; i < size; i++) {

Contributor:
Does this include xstream-0? If it does, I would suggest not using xstream-0; otherwise never mind.

Contributor Author (@Nasf-Fan):
The collective punch (not including collective DTX) does not touch XS-0.

@Nasf-Fan (Contributor, Author) commented Jan 2, 2024:

> I think this patch is clean; I just have one more question, because the patch is very large and I cannot understand all the details: if one engine fails to execute the collective punch, can we guarantee that all the collective members will abort the operation? I am asking because we now have indirect RPC forwarding, which could make things more complex.

If one engine fails to execute the punch, whether it is a leaf node, a relay engine, or the leader engine, the DTX leader engine will trigger a collective DTX abort to abort the whole transaction. That is similar to how a regular DTX leader handles failures. As for the "guarantee", it is the same as for the current regular DTX abort.

@gnailzenh gnailzenh merged commit 0aae80f into master Jan 2, 2024
47 of 48 checks passed
@gnailzenh gnailzenh deleted the Nasf-Fan/DAOS-14105_7 branch January 2, 2024 12:26
jolivier23 added a commit that referenced this pull request Jun 12, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
DAOS-15145 pool: add pool collective function (#13764)
*DAOS-14105 object: collectively punch object (#13493)

* partial backport, just the bitmap function

Required-githooks: true
Change-Id: I2b21b8121cbdecc79ae49a464a42b1d47fb9be10
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 jolivier23 mentioned this pull request Jun 12, 2024
jolivier23 added a commit that referenced this pull request Jun 17, 2024
DAOS-14679 pool: Report on stopping sp_stopping (#14374)
DAOS-15514 container: fix container destroy failure (#14108)
DAOS-15672 rebuild: Fix pool destroy hangs (#14183)
DAOS-15145 pool: add pool collective function (#13764)
*DAOS-14105 object: collectively punch object (#13493)

partial backport, just the bitmap function

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Li Wei <wei.g.li@intel.com>
Signed-off-by: Wang Shilong <shilong.wang@intel.com>