DAOS-14484 cart: Implement per-context inflight queue #13202

Merged
merged 23 commits into from
Jan 4, 2024

Conversation

frostedcmos
Contributor

  • Implement per-context inflight queue

Signed-off-by: Alexander Oganezov alexander.a.oganezov@intel.com

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
- Implement RPC inflight quota

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
@frostedcmos frostedcmos requested a review from a team as a code owner October 19, 2023 06:40
@github-actions

github-actions bot commented Oct 19, 2023

Bug-tracker data:
Ticket title is 'Refactor RPC-in-flight limit functionality'
Status is 'Blocked'
https://daosio.atlassian.net/browse/DAOS-14484

Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13202/1/testReport/

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/2/execution/node/1121/log

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

* failure.
*/
int crt_context_quotas_finalize(crt_context_t crt_ctx);

Collaborator

Why are init / finalize public APIs?

Contributor Author

I am still going back and forth on whether we want a runtime control to disable (and subsequently re-enable) quotas. I started with that idea, which is why these are public, but for now quotas are auto-enabled on every context.

src/include/cart/types.h Show resolved Hide resolved
src/cart/crt_context.c Show resolved Hide resolved
@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/3/execution/node/1266/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/3/execution/node/1404/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/3/execution/node/1358/log

- add comment to not implemented quotas
- set default to 32 inflight

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/4/execution/node/1266/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/4/execution/node/1404/log

its quota reservation for any rpc in the list.

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/5/execution/node/1266/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/5/execution/node/1404/log

when quota limit is reached.

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/6/execution/node/1399/log

Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

jgmoore-or
jgmoore-or previously approved these changes Dec 11, 2023
@@ -139,6 +139,12 @@ This file lists the environment variables used in CaRT.
It its value exceed 256, then will use 256 for flow control.
Set it to zero means disable the flow control in cart.

. D_QUOTA_RPCS
Collaborator

Maybe we should just call it D_RPC_MAX_IN_FLIGHT, even if it's a little longer? :) I feel D_QUOTA_RPCS might be too generic, and it introduces a different nomenclature with D_QUOTA (although I know that's what you're trying to do here).

Contributor Author

Sure, I am open to naming, but I don't like D_RPC_MAX_IN_FLIGHT as it's long and harder to remember.

Contributor

I am okay with D_QUOTA_RPCS.
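
For readers unfamiliar with how such a variable is typically consumed, here is a minimal sketch of reading an integer limit from D_QUOTA_RPCS. The helper names `quota_parse` and `quota_rpcs_from_env`, the `fallback` parameter, and the choice to fall back on unset or malformed values are all assumptions made for illustration; this is not the parsing code from the PR.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not part of CaRT): parse a non-negative integer
 * limit from `str`, returning `fallback` when the string is missing,
 * empty, malformed, negative, or out of range. */
static int
quota_parse(const char *str, int fallback)
{
	char *end = NULL;
	long  val;

	assert(fallback >= 0);
	if (str == NULL || *str == '\0')
		return fallback;

	errno = 0;
	val   = strtol(str, &end, 10);
	if (errno != 0 || *end != '\0' || val < 0 || val > INT_MAX)
		return fallback;

	return (int)val;
}

/* Hypothetical wrapper: look the value up in the environment. */
static int
quota_rpcs_from_env(int fallback)
{
	return quota_parse(getenv("D_QUOTA_RPCS"), fallback);
}
```

A caller would pass whatever compile-time default the library uses as `fallback`, so that an unset variable leaves behavior unchanged.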

* failure.
*/
int crt_context_quota_limit_get(crt_context_t crt_ctx, crt_quota_type_t quota, int *value);

/**
Collaborator

Thinking more about it, I'm not certain we should introduce these new APIs unless we already have a specific use for them. Right now I don't think we really have one yet?

Contributor Author

Yeah, I am mixed on this. I was thinking that one use case we have is self-test (or future perf tools?) being able to adjust quotas on the fly, based on command-line args, without having to set an environment variable.

@@ -257,6 +258,8 @@ prov_data_init(struct crt_prov_gdata *prov_data, crt_provider_t provider,
return DER_SUCCESS;
}

#define CRT_QUOTA_RPCS_DEFAULT 64
Collaborator

I don't think that default should be set in the middle of nowhere; maybe put it in some place where other defaults are set?

Contributor Author

It's not in the middle of nowhere, but sure :) It was declared outside of the function that used it. I'll find a better place :)

/** Total count of supported quotas */
CRT_QUOTA_COUNT,
} crt_quota_type_t;

/** @}
Collaborator

Same comment as above: maybe we should not have all this quota API and keep it simple for now.

Contributor Author

Yeah, if we decide we don't need public APIs, this will move to an internal header.

@@ -11,6 +11,8 @@
#include "crt_internal.h"

static void crt_epi_destroy(struct crt_ep_inflight *epi);
static int context_quotas_init(crt_context_t crt_ctx);
static int context_quotas_finalize(crt_context_t crt_ctx);

Collaborator

Same comment here: unless we have more, maybe it's not necessary to have quotas init and finalize (is finalize really needed, also?)

Contributor Author

We have the quota mutex to destroy in finalize. For other quotas we might want to clean up any lists; that is not needed for the RPC list, as the untrack logic of each RPC will take care of it, but we might need it for allocation queues if we ever add those.
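
The point about finalize existing mainly to destroy the quota mutex can be illustrated with a small init/finalize pair. This is a sketch under assumptions: `struct quotas`, `quotas_init`, and `quotas_finalize` are hypothetical stand-ins built on plain pthread mutexes, not the CaRT structures or the D_MUTEX wrappers used in the PR.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical quota types, loosely mirroring the enum in the diff. */
enum quota_type { QUOTA_RPCS, QUOTA_COUNT };

/* Hypothetical per-context quota state. */
struct quotas {
	int             limit[QUOTA_COUNT];
	int             current[QUOTA_COUNT];
	pthread_mutex_t mutex;
};

/* Set the RPC limit, reset the usage counter, and create the mutex.
 * Returns 0 on success or the pthread error code. */
static int
quotas_init(struct quotas *q, int rpc_limit)
{
	assert(q != NULL);
	q->limit[QUOTA_RPCS]   = rpc_limit;
	q->current[QUOTA_RPCS] = 0;
	return pthread_mutex_init(&q->mutex, NULL);
}

/* The only resource owned here is the mutex, so finalize just destroys
 * it. Queued RPCs are assumed to be cleaned up by their own untrack
 * path, as described in the comment above. */
static int
quotas_finalize(struct quotas *q)
{
	assert(q != NULL);
	return pthread_mutex_destroy(&q->mutex);
}
```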

@@ -1264,6 +1286,10 @@ crt_context_req_track(struct crt_rpc_priv *rpc_priv)
/* reference taken by d_hash_rec_find or "epi->epi_ref = 1" above */
D_MUTEX_LOCK(&crt_ctx->cc_mutex);
d_hash_rec_decref(&crt_ctx->cc_epi_table, &epi->epi_link);

if (quota_rc == -DER_QUOTA_LIMIT)
	d_list_add_tail(&rpc_priv->crp_waitq_link, &crt_ctx->cc_quotas.rpc_waitq);
Collaborator

Why is this not within the block at line 1254?

Contributor Author

@frostedcmos frostedcmos Dec 12, 2023


> why is this not within the block at line 1254 ?

So that it is not done under the epi->epi_mutex lock; it needs the context lock instead.

We can reorganize things more nicely once we get rid of the EP credits-related code, which should greatly simplify this and a few other calls.
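
The locking split described in this reply can be sketched as two separate critical sections: per-endpoint bookkeeping under the endpoint mutex, then wait-queue insertion under the context mutex. Everything below (the `struct rpc`, `struct endpoint`, `struct context` types and `track_rpc`) is a hypothetical simplification using pthread mutexes and a singly linked queue, not the code in crt_context_req_track.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct rpc {
	struct rpc *next;
};

/* Hypothetical per-endpoint state, guarded by its own mutex. */
struct endpoint {
	pthread_mutex_t mutex;
	int             tracked;
};

/* Hypothetical context: owns the wait queue, guarded by the context mutex. */
struct context {
	pthread_mutex_t mutex;
	struct rpc     *waitq_head;
	struct rpc     *waitq_tail;
};

static void
track_rpc(struct context *ctx, struct endpoint *ep, struct rpc *rpc, bool over_quota)
{
	assert(ctx != NULL && ep != NULL && rpc != NULL);

	/* Step 1: endpoint bookkeeping under the endpoint mutex only. */
	pthread_mutex_lock(&ep->mutex);
	ep->tracked++;
	pthread_mutex_unlock(&ep->mutex);

	if (!over_quota)
		return;

	/* Step 2: the wait queue belongs to the context, so it is modified
	 * under the context mutex, after the endpoint mutex was released. */
	pthread_mutex_lock(&ctx->mutex);
	rpc->next = NULL;
	if (ctx->waitq_tail != NULL)
		ctx->waitq_tail->next = rpc;
	else
		ctx->waitq_head = rpc;
	ctx->waitq_tail = rpc;
	pthread_mutex_unlock(&ctx->mutex);
}
```

Keeping the two sections disjoint means the two mutexes are never held at the same time in this path, which avoids having to define a lock order between them.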

int rc;

if (rpc == NULL)
	return;
Collaborator

Shouldn't that be an assert?

Contributor Author

Can just remove it now, or change it to an assert, yes. It's a leftover from a previous behavior.

if (tmp_rpc != NULL)
	dispatch_rpc(tmp_rpc);
else
	crt_context_put_quota_resource(rpc_priv->crp_pub.cr_ctx, CRT_QUOTA_RPCS);
Collaborator

I'm not sure I understand why crt_context_put_quota_resource, which acquires/releases another lock, needs to be invoked all the time?

Contributor Author

Once the RPC is done, you either process the next RPC (reusing the existing quota) or you put the quota back if there is nothing else queued.
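
The hand-off described in this reply can be sketched as follows. The types and names (`struct rpc`, `struct ctx`, `waitq_pop`, `rpc_complete`) are hypothetical, and locking is deliberately omitted to keep the example short; the PR itself guards this state with mutexes.

```c
#include <assert.h>
#include <stddef.h>

struct rpc {
	struct rpc *next;
	int         dispatched;
};

/* Hypothetical context state: in-flight count plus a FIFO wait queue. */
struct ctx {
	int         inflight;
	int         limit;
	struct rpc *waitq_head;
	struct rpc *waitq_tail;
};

/* Remove and return the oldest queued RPC, or NULL if none is waiting. */
static struct rpc *
waitq_pop(struct ctx *c)
{
	struct rpc *r = c->waitq_head;

	if (r != NULL) {
		c->waitq_head = r->next;
		if (c->waitq_head == NULL)
			c->waitq_tail = NULL;
		r->next = NULL;
	}
	return r;
}

/* Called when one in-flight RPC finishes. */
static void
rpc_complete(struct ctx *c)
{
	struct rpc *next;

	assert(c != NULL && c->inflight > 0);
	next = waitq_pop(c);
	if (next != NULL)
		next->dispatched = 1; /* slot handed over: inflight is unchanged */
	else
		c->inflight--;        /* nothing queued: return the slot */
}
```

Handing the slot directly to the next queued RPC means the in-flight count never dips below the limit while work is waiting, so a newly submitted RPC cannot overtake the queue.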

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13202/12/testReport/

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13202/12/execution/node/1552/log

- move default to a diff file

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

jolivier23 added a commit that referenced this pull request Dec 12, 2023
- Add per-context quotas
- Implement RPC inflight quota

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

if (ctx->cc_quotas.current[quota] < ctx->cc_quotas.limit[quota])
	ctx->cc_quotas.current[quota]++;
else {
	D_WARN("Quota limit reached for quota_type=%d\n", quota);
Contributor

This is overly chatty and should be a debug

Contributor

NB: "overly chatty" == gigabytes of client logs under load. ;)

Contributor Author

done
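
For context, the acquire path under review can be sketched as below, with the limit-reached message compiled in only for debug builds so that it cannot flood logs under load. The names (`struct quotas`, `quota_get`, `QUOTA_LIMIT_REACHED`, `QUOTA_DEBUG`) are hypothetical stand-ins; the PR uses the CaRT context, `-DER_QUOTA_LIMIT`, and the DAOS logging macros instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define QUOTA_LIMIT_REACHED (-1) /* stand-in for -DER_QUOTA_LIMIT */

enum quota_type { QUOTA_RPCS, QUOTA_COUNT };

/* Hypothetical per-context quota state. */
struct quotas {
	int             limit[QUOTA_COUNT];
	int             current[QUOTA_COUNT];
	pthread_mutex_t mutex;
};

/* Try to take one unit of `type`. Returns 0 on success, or
 * QUOTA_LIMIT_REACHED when the caller should queue the request. */
static int
quota_get(struct quotas *q, enum quota_type type)
{
	int rc = 0;

	assert(q != NULL && type < QUOTA_COUNT);

	pthread_mutex_lock(&q->mutex);
	if (q->current[type] < q->limit[type])
		q->current[type]++;
	else
		rc = QUOTA_LIMIT_REACHED;
	pthread_mutex_unlock(&q->mutex);

#ifdef QUOTA_DEBUG
	/* Debug-only: hitting the limit is expected under load. */
	if (rc != 0)
		fprintf(stderr, "quota limit reached for quota_type=%d\n", (int)type);
#endif
	return rc;
}
```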

- crt_req_set/get quota resource shortened and static inline now
- changed warning to debug message when exceeding quotas

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
mjmac pushed a commit that referenced this pull request Jan 2, 2024
Backport of in-flight upstream PR #13202

- Add per-context quotas
- Implement RPC inflight quota

Required-githooks: true

Change-Id: I3fdf77082d66d8009ee7099cc838fcff5da72d4b
Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@frostedcmos frostedcmos merged commit f7ab01c into master Jan 4, 2024
46 of 47 checks passed
@frostedcmos frostedcmos deleted the aaoganez/DAOS-14484 branch January 4, 2024 17:29
jolivier23 added a commit that referenced this pull request Jan 23, 2024
Backport of in-flight upstream PR #13202

- Add per-context quotas
- Implement RPC inflight quota

Required-githooks: true

Change-Id: I3fdf77082d66d8009ee7099cc838fcff5da72d4b
Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
frostedcmos added a commit that referenced this pull request Feb 9, 2024
- D_QUOTA_RPCS envariable added. When set, limits the number of RPCs on a wire being sent out by the process.
- RPCs that exceed quota limit (if set), will now be queued by the sender
- Quota support code added to handle and track resources

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
jolivier23 pushed a commit that referenced this pull request Feb 28, 2024
- D_QUOTA_RPCS envariable added. When set, limits the number of RPCs on a wire being sent out by the process.
- RPCs that exceed quota limit (if set), will now be queued by the sender
- Quota support code added to handle and track resources

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
jolivier23 pushed a commit that referenced this pull request Mar 12, 2024
- D_QUOTA_RPCS envariable added. When set, limits the number of RPCs on a wire being sent out by the process.
- RPCs that exceed quota limit (if set), will now be queued by the sender
- Quota support code added to handle and track resources

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
jolivier23 pushed a commit that referenced this pull request Apr 10, 2024
- D_QUOTA_RPCS envariable added. When set, limits the number of RPCs on a wire being sent out by the process.
- RPCs that exceed quota limit (if set), will now be queued by the sender
- Quota support code added to handle and track resources

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>

7 participants