
DAOS-14105 object: collectively punch object #13287

Closed
wants to merge 4 commits

Conversation

Contributor

@Nasf-Fan Nasf-Fan commented Nov 3, 2023

Currently, when punching an object with multiple redundancy groups, we
handle the whole punch as a single internal distributed transaction to
guarantee atomicity. The DTX leader forwards the CPD RPC to every object
shard within the same transaction. For a large-scaled object, such as an
SX object, punching it generates N RPCs, where N equals the count of all
the VOS targets in the system. That is very slow and holds a lot of system
resources for a relatively long time. If the system is under heavy load,
the related RPC(s) may time out and trigger a DTX abort; the client then
resends the RPC to the DTX leader for retry, which makes the situation
worse and worse.

To resolve this, we punch the object collectively.

The basic idea: when punching an object with multiple redundancy groups,
the client sends an OBJ_COLL_PUNCH RPC to the DTX leader. Instead of
forwarding the request to all related VOS targets, the DTX leader uses a
broadcast RPC to spread the OBJ_COLL_PUNCH request to all involved engines.
Each engine then generates collective tasks to punch the object shards on
its own local VOS targets. That saves a lot of RPCs and resources.

On the other hand, for a large-scaled object, transferring the related DTX
participants information (which can be huge) would be a heavy burden,
whether carried in the RPC body or via RDMA (as bulk data). So the
OBJ_COLL_PUNCH RPC does not transfer dtx_memberships; instead, each related
engine, leader or not, calculates the dtx_memberships data from the object
layout by itself. That causes some overhead, but compared with broadcasting
huge DTX participants information over the network, it may be the better
choice.

Introduce two environment variables to control the collective punch:

DTX_COLL_TREE_WIDTH: the bcast RPC tree width for collective transactions
on the server. The valid range is [4, 64]; the default value is 16.

OBJ_COLL_PUNCH_THRESHOLD: the threshold for triggering collective object
punch on the client. The default (and also the minimum) value is 16.
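
As an illustration only, here is a minimal sketch of how these two knobs
could be consumed; none of this code is from the patch, and the helper
names (env_uint, dtx_coll_tree_width, obj_need_coll_punch) and the exact
fallback behavior are assumptions:

#include <stdbool.h>
#include <stdlib.h>

/* Documented ranges and defaults from the description above. */
#define DTX_COLL_TREE_WIDTH_MIN		4
#define DTX_COLL_TREE_WIDTH_MAX		64
#define DTX_COLL_TREE_WIDTH_DEF		16
#define OBJ_COLL_PUNCH_THRESHOLD_MIN	16

/* Hypothetical helper: read an unsigned env variable, fall back to a default. */
static unsigned int
env_uint(const char *name, unsigned int def)
{
	const char *val = getenv(name);

	return val != NULL ? (unsigned int)strtoul(val, NULL, 10) : def;
}

/* Server side: bcast tree width clamped to [4, 64], default 16. */
static unsigned int
dtx_coll_tree_width(void)
{
	unsigned int width = env_uint("DTX_COLL_TREE_WIDTH", DTX_COLL_TREE_WIDTH_DEF);

	if (width < DTX_COLL_TREE_WIDTH_MIN || width > DTX_COLL_TREE_WIDTH_MAX)
		width = DTX_COLL_TREE_WIDTH_DEF;

	return width;
}

/* Client side: punch collectively once enough redundancy groups are involved. */
static bool
obj_need_coll_punch(unsigned int nr_redundancy_groups)
{
	unsigned int thd = env_uint("OBJ_COLL_PUNCH_THRESHOLD", OBJ_COLL_PUNCH_THRESHOLD_MIN);

	if (thd < OBJ_COLL_PUNCH_THRESHOLD_MIN)
		thd = OBJ_COLL_PUNCH_THRESHOLD_MIN;

	return nr_redundancy_groups >= thd;
}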

Required-githooks: true

Signed-off-by: Fan Yong fan.yong@intel.com

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.


github-actions bot commented Nov 3, 2023

Bug-tracker data:
Ticket title is 'Punch large-scaled object collectively'
Status is 'Awaiting Verification'
Labels: 'tds'
https://daosio.atlassian.net/browse/DAOS-14105

Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/2/testReport/

@Nasf-Fan Nasf-Fan changed the title DAOS-14105 object: collective punch object DAOS-14105 object: collectively punch object Nov 4, 2023
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

Currently, when punching an object with multiple redundancy groups, we
handle the whole punch as a single internal distributed transaction to
guarantee atomicity. The DTX leader forwards the CPD RPC to every object
shard within the same transaction. For a large-scaled object, such as an
SX object, punching it generates N RPCs, where N equals the count of all
the VOS targets in the system. That is very slow and holds a lot of system
resources for a relatively long time. If the system is under heavy load,
the related RPC(s) may time out and trigger a DTX abort; the client then
resends the RPC to the DTX leader for retry, which makes the situation
worse and worse.

To resolve this, we punch the object collectively.

The basic idea: when punching an object with multiple redundancy groups,
the client sends an OBJ_COLL_PUNCH RPC to the DTX leader. Instead of
forwarding the request to all related VOS targets, the DTX leader uses a
broadcast RPC to spread the OBJ_COLL_PUNCH request to all involved engines.
Each engine then generates collective tasks to punch the object shards on
its own local VOS targets. That saves a lot of RPCs and resources.

On the other hand, for a large-scaled object, transferring the related DTX
participants information (which can be huge) would be a heavy burden,
whether carried in the RPC body or via RDMA (as bulk data). So the
OBJ_COLL_PUNCH RPC does not transfer dtx_memberships; instead, each related
engine, leader or not, calculates the dtx_memberships data from the object
layout by itself. That causes some overhead, but compared with broadcasting
huge DTX participants information over the network, it may be the better
choice.

Introduce two environment variables to control the collective punch:

DTX_COLL_TREE_TOPO: the bcast RPC tree topo for collective transactions
on the server. The valid range is [8, 128]; the default value is 32.

OBJ_COLL_PUNCH_THRESHOLD: the threshold for triggering collective object
punch on the client. The default (and also the minimum) value is 32.

Required-githooks: true

Signed-off-by: Fan Yong <fan.yong@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@Nasf-Fan Nasf-Fan marked this pull request as ready for review November 6, 2023 06:18
@Nasf-Fan Nasf-Fan requested review from a team as code owners November 6, 2023 06:18
@Nasf-Fan
Contributor Author

Ping reviewers, thanks!

Contributor

@mchaarawi mchaarawi left a comment

I have not reviewed, but I was just testing this on Aurora again, with a much smaller set of servers (just for functionality), and the server was crashing / asserting with:
aurora-daos-0247: ERROR: daos_engine:0 11/16-15:23:07.08 aurora-daos-0247 DAOS[5065/13/527240] object EMRG src/object/srv_obj.c:5912 obj_coll_punch_prep() Assertion 'ddt[0].ddt_id == ocpi->ocpi_leader_id' failed
aurora-daos-0247: daos_engine: src/object/srv_obj.c:5912: obj_coll_punch_prep: Assertion `0' failed.
aurora-daos-0247: ERROR: daos_engine:0 *** Process 5065 (daos_engine) received signal 6 (Aborted) ***
aurora-daos-0247: Associated errno: Success (0)

So this cannot land without doing more validation at larger scale unfortunately.

Contributor

@liuxuezhao liuxuezhao left a comment

I have only read part of the code so far...

@@ -131,6 +162,13 @@ extern uint32_t dtx_agg_thd_age_lo;
/* The default count of DTX batched commit ULTs. */
#define DTX_BATCHED_ULT_DEF 32

/* The bcast RPC tree topo for collective transaction. */
#define DTX_COLL_TREE_TOPO_MAX 128
Contributor

The max branch ratio defined in cart is CRT_TREE_MAX_RATIO (64), so 128 is too large and will cause an error.
And the default value of 32 seems too large; maybe 8 is enough.
To bcast to 1024 nodes, a knomial-32 tree has depth 2 with at most 62 children per node,
a knomial-8 tree has depth 4 with at most 22 children,
and a knomial-2 tree has depth 10 with at most 10 children.
A KARY tree's branch ratio equals its number of children, but a KNOMIAL tree can have more children than its ratio.
So the MIN can be just 2, the MAX can be 64, and a DEFAULT of 8 looks better for a KNOMIAL tree.
Just FYI (see the arithmetic sketch below).

BTW, maybe "DTX_COLL_TREE_BRANCH" would be a better name than "DTX_COLL_TREE_TOPO", since the variable does not set the tree type.
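
A minimal arithmetic sketch (not part of the patch) of the depth and root-fanout
bound for a k-nomial broadcast tree over N engines; the full-tree bound
(k - 1) * depth slightly overestimates when the last level is partial
(28 vs. the 22 quoted above for k = 8, N = 1024):

#include <stdio.h>

/* Depth and a root-fanout upper bound for a k-nomial broadcast tree. */
static void
knomial_stats(unsigned int nranks, unsigned int k)
{
	unsigned int       depth = 0;
	unsigned long long reach = 1;

	/* Smallest depth d such that k^d >= nranks. */
	while (reach < nranks) {
		reach *= k;
		depth++;
	}

	/* A full k-nomial tree gives the root (k - 1) * depth children. */
	printf("N=%u k=%u: depth=%u, root children <= %u\n",
	       nranks, k, depth, (k - 1) * depth);
}

int
main(void)
{
	knomial_stats(1024, 32);	/* depth 2, <= 62 children */
	knomial_stats(1024, 8);		/* depth 4, <= 28 children */
	knomial_stats(1024, 2);		/* depth 10, <= 10 children */
	return 0;
}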

Contributor Author

For the Aurora case, we have 2K engines, and I do not want the tree depth to be too deep, since that may add more latency. So I plan to change the default to 16 to keep the depth within 3 (ceil(log_16(2048)) = 3), and to set the max to 64.

int i;

for (i = 0, off = obj->cob_md.omd_id.lo % obj->cob_shards_nr; i < obj->cob_shards_nr;
     i++, off = (off + 1) % obj->cob_shards_nr) {
Contributor

the "off" seems no place use it? why need it then?
or you want "obj_shard_open(obj, off, map_ver, &shard);" below?

Contributor Author

Right, I will fix it.

@@ -2196,6 +2200,28 @@ obj_ioc_begin_lite(uint32_t rpc_map_ver, uuid_t pool_uuid,
D_GOTO(out, rc = -DER_STALE);
} else if (DAOS_FAIL_CHECK(DAOS_DTX_STALE_PM)) {
Contributor

Minor: the FAIL_CHECK is commonly the last branch.

Contributor Author

I will fix it.

@@ -2596,8 +2622,6 @@ ds_obj_tgt_update_handler(crt_rpc_t *rpc)

if (rc < 0 && rc != -DER_NONEXIST)
D_GOTO(out, rc);

dtx_flags |= DTX_RESEND;
Contributor

Just curious: why is it not needed now?

Contributor Author

The DTX_RESEND flag was only used to set dth->dth_resent, which has been useless for a long time, so the patch cleans them up.

@@ -2951,6 +2977,9 @@ ds_obj_rw_handler(crt_rpc_t *rpc)
/* Execute the operation on all targets */
rc = dtx_leader_exec_ops(dlh, obj_tgt_update, NULL, 0, &exec_arg);

if (max_ver < dlh->dlh_rmt_ver)
max_ver = dlh->dlh_rmt_ver;
Contributor

Could you please explain a bit why the "max_ver" checks are added in this function? Thanks.

Contributor Author

Because different DTX participants may have different pool map versions, here we collect the max version from all the other participants and reply it back to the client when the versions do not match.

@@ -3004,6 +3033,9 @@ ds_obj_rw_handler(crt_rpc_t *rpc)
DP_DTI(&orw->orw_dti), DP_RC(rc1));
}

if (ioc.ioc_map_ver < max_ver)
ioc.ioc_map_ver = max_ver;
Contributor

Is this the case where the client I/O RPC's map_ver is larger than the server-side pool's map version? Why reset ioc.ioc_map_ver here?

Contributor Author

It is the case where some DTX participant (such as some replica) has a larger pool map version than the client; we then reply that version back to the client.

@Nasf-Fan
Contributor Author

Nasf-Fan commented Nov 25, 2023

I have not reviewed, but I was just testing this on Aurora again, with a much smaller set of server (just for functionality), and the server was crashing / asserting with: aurora-daos-0247: ERROR: daos_engine:0 11/16-15:23:07.08 aurora-daos-0247 DAOS[5065/13/527240] object EMRG src/object/srv_obj.c:5912 obj_coll_punch_prep() Assertion 'ddt[0].ddt_id == ocpi->ocpi_leader_id' failed aurora-daos-0247: daos_engine: src/object/srv_obj.c:5912: obj_coll_punch_prep: Assertion `0' failed. aurora-daos-0247: ERROR: daos_engine:0 *** Process 5065 (daos_engine) received signal 6 (Aborted) *** aurora-daos-0247: Associated errno: Success (0)

So this cannot land without doing more validation at larger scale unfortunately.

Which test did you run, just a simple punch? What is the object class? Was there rebuild or reintegration during the test? What is your test configuration? Thanks!

BTW, I cannot reproduce the issue by punching an SX object on a 12 targets * 8 engines system.

Signed-off-by: Fan Yong <fan.yong@intel.com>
Fix issues for review feedback.

Signed-off-by: Fan Yong <fan.yong@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@Nasf-Fan
Contributor Author

@mchaarawi, would you please verify the new patch without the collective query patch? Thanks!

@daosbuild1
Collaborator

Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/6/testReport/

RPC throttling for collective punch.

Signed-off-by: Fan Yong <fan.yong@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13287/7/testReport/

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13287/7/execution/node/1603/log

@Nasf-Fan
Contributor Author

Nasf-Fan commented Dec 6, 2023

Replaced by #13386

@Nasf-Fan Nasf-Fan closed this Dec 6, 2023
@Nasf-Fan Nasf-Fan deleted the Nasf-Fan/DAOS-14105_5 branch December 21, 2023 03:14