
Publish runtime metric in seconds #5893

Merged: 7 commits into redpanda-data:dev, Aug 11, 2022

Conversation

@VladLazar (Contributor) commented Aug 8, 2022

Note: Depends on redpanda-data/seastar#31.

Cover letter

This PR adds the following new metric:
Name: redpanda_scheduler_runtime_seconds
Description: Accumulated runtime of task queue associated with this
scheduling group
Labels:

  • redpanda_scheduling_group
  • shard

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

  • none

@VladLazar force-pushed the public-metrics-uptime-seconds branch from 230b504 to a9750b2 (August 8, 2022 14:59)
@VladLazar requested a review from BenPope (August 8, 2022 17:51)
Review threads (resolved): src/v/resource_mgmt/scheduling_groups_probe.h (outdated), src/v/resource_mgmt/cpu_scheduling.h, src/v/resource_mgmt/scheduling_groups_probe.h (outdated)
@VladLazar force-pushed the public-metrics-uptime-seconds branch 2 times, most recently from c97d474 to e63aab2 (August 10, 2022 15:31)
@VladLazar marked this pull request as ready for review (August 10, 2022 15:33)
@VladLazar requested a review from BenPope (August 10, 2022 15:40)
BenPope previously approved these changes (Aug 10, 2022)
Review thread (resolved): src/v/resource_mgmt/scheduling_groups_probe.h (outdated)
@BenPope (Member) commented Aug 10, 2022

I don't think the sum of these is the same as vectorized_reactor_cpu_busy_ms. So maybe we should add redpanda_cpu_busy_seconds_total?

@VladLazar (Contributor, Author)

No, it's not. I think the "busy time" includes the wait time too. Sure, we can add it. It's already exposed by the reactor.

Vlad Lazar added 5 commits August 11, 2022 10:57
This patch adds a public method that returns a list containing constant
references to all the scheduling groups created by redpanda.
This patch introduces a probe that queries each scheduling group
for its usage stats and publishes metrics based on that.

The following new metric is introduced:
Name: redpanda_scheduler_runtime_seconds_total
Description: Accumulated runtime of task queue associated with this
scheduling group
Labels:
  - redpanda_scheduling_group
  - shard
This commit wires up a scheduling_groups_probe in order to publish
metrics based on the scheduling groups stats. Note how the probe is
cleared before the scheduling groups are destroyed to prevent publishing
metrics from a destroyed group.
This commit splits the registration of internal and public metrics into
two separate methods. It also adds two new metrics to the
"public_metrics" endpoint:

redpanda_application_uptime_seconds_total
Description: Redpanda uptime in seconds
Labels: none

redpanda_application_busy_seconds_total
Description: Total CPU busy time in seconds
Labels: none
scheduler_runtime_ms used to be replicated from the seastar metrics.
Previous patches introduced redpanda_scheduler_runtime_seconds_total as
a replacement.
@VladLazar (Contributor, Author)

Changes in force push:

  • publish scheduler stats from the default group too
  • added redpanda_application_uptime_seconds_total and redpanda_application_busy_seconds_total

@VladLazar VladLazar requested a review from BenPope August 11, 2022 10:40
@BenPope (Member) left a comment

Very nice

Comment on lines +84 to +85
ss::scheduling_group _default{
seastar::default_scheduling_group()}; // created and managed by seastar
Member:
Good catch!

.count();
},
sm::description("Total CPU busy time in seconds"))
.aggregate({sm::shard_label}),
Member:
I think it's worth keeping the shards here

@VladLazar (Contributor, Author) commented Aug 11, 2022
These metrics are only reported from one shard. That's why I aggregated. Just drops the label basically.

Member:
curl -s  localhost:19644/metrics | grep  cpu_busy_ms
# HELP vectorized_reactor_cpu_busy_ms Total cpu busy time in milliseconds
# TYPE vectorized_reactor_cpu_busy_ms counter
vectorized_reactor_cpu_busy_ms{shard="0"} 6268
vectorized_reactor_cpu_busy_ms{shard="1"} 5003

Member:
Oh, you mean this probe is only on one shard? Can you register a metric for each shard?

@VladLazar (Contributor, Author):
Right. I've confused myself. It's reported from one shard because it's only registered on one shard. We should probably register it on all shards. This means that's probably not the right place for the "busy_time" metric. Let me have a think.

Member:
You can invoke_on, or submit_to, or create a sharded<probe> and .invoke_on_all, or move it.

@VladLazar (Contributor, Author):
I know. Just feels a bit unnatural to do it there.

@VladLazar (Contributor, Author):
Btw. Do we want the uptime for every shard? It would probably be redundant. Busy time for each shard makes sense though.

Member:
Yep, that's how it is:

curl -s localhost:19644/metrics | grep uptime
# HELP vectorized_application_uptime Redpanda uptime in milliseconds
# TYPE vectorized_application_uptime gauge
vectorized_application_uptime{shard="0"} 24680.000000

@VladLazar (Contributor, Author):
Done. I wrapped the metric_groups object into a ss::sharded. uptime is only reported from the home shard, while busy is reported from all shards.

@BenPope (Member) commented Aug 11, 2022

I set this for backport to v22.2.x

Vlad Lazar added 2 commits August 11, 2022 14:41
This patch adds a wrapper for seastar::metrics::metric_groups which is
intended for usage with seastar::sharded. The only interesting thing
about it is that it clears the metrics on stop.
This patch wraps the metric_groups object owned by the application into
a sharded service. This change allows us to register metrics on specific
shards where required. For instance,
redpanda_application_uptime_seconds_total is only registered on one
shard, while redpanda_cpu_busy_seconds_total is registered on all
shards.
@VladLazar VladLazar requested a review from BenPope August 11, 2022 13:45
@BenPope (Member) left a comment

Awesome.

I built and tested this locally

@VladLazar (Contributor, Author)

> Awesome.
> I built and tested this locally

Appreciate that. Thanks for trying!

@BenPope BenPope merged commit be504a9 into redpanda-data:dev Aug 11, 2022
@BenPope (Member) commented Aug 11, 2022

/backport v22.2.x

Comment on lines +340 to +343
_public_metrics
.invoke_on(
ss::this_shard_id(),
[](auto& public_metrics) {
Member:

@VladLazar just curious about using invoke_on(this_shard_id, ...) as opposed to _public_metrics.local(): is there something special about using invoke_on that I'm missing here?

@VladLazar (Contributor, Author) commented Aug 15, 2022

You're not missing anything. I just slightly misused the api (unintentionally). Should have used local(). I don't think there is any difference in behaviour though.

Member:

me neither, looks fine. thx
