
Publish runtime metric in seconds #5893

Merged: 7 commits into redpanda-data:dev, Aug 11, 2022

Conversation

@VladLazar (Contributor) commented Aug 8, 2022

Note: Depends on redpanda-data/seastar#31.

Cover letter

This PR adds the following new metric:
Name: redpanda_scheduler_runtime_seconds
Description: Accumulated runtime of task queue associated with this
scheduling group
Labels:

  • redpanda_scheduling_group
  • shard

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

  • none

@VladLazar force-pushed the public-metrics-uptime-seconds branch from 230b504 to a9750b2 (August 8, 2022 14:59)
@VladLazar requested a review from BenPope (August 8, 2022 17:51)
Review threads (resolved): src/v/resource_mgmt/scheduling_groups_probe.h (outdated), src/v/resource_mgmt/cpu_scheduling.h, src/v/resource_mgmt/scheduling_groups_probe.h (outdated)
@VladLazar force-pushed the public-metrics-uptime-seconds branch 2 times, most recently from c97d474 to e63aab2 (August 10, 2022 15:31)
@VladLazar marked this pull request as ready for review (August 10, 2022 15:33)
@VladLazar requested a review from BenPope (August 10, 2022 15:40)
BenPope previously approved these changes (Aug 10, 2022)
Review thread (resolved): src/v/resource_mgmt/scheduling_groups_probe.h (outdated)
@BenPope (Member) commented Aug 10, 2022

I don't think the sum of these is the same as vectorized_reactor_cpu_busy_ms. So maybe we should add redpanda_cpu_busy_seconds_total?

@VladLazar (Contributor, Author)

No, it's not. I think the "busy time" includes the wait time too. Sure, we can add it. It's already exposed by the reactor.

Vlad Lazar added 5 commits August 11, 2022 10:57
This patch adds a public method that returns a list containing constant
references to all the scheduling groups created by redpanda.
This patch introduces a probe that queries each scheduling group
for its usage stats and publishes metrics based on that.

The following new metric is introduced:
Name: redpanda_scheduler_runtime_seconds_total
Description: Accumulated runtime of task queue associated with this
scheduling group
Labels:
  - redpanda_scheduling_group
  - shard
This commit wires up a scheduling_groups_probe in order to publish
metrics based on the scheduling groups stats. Note how the probe is
cleared before the scheduling groups are destroyed to prevent publishing
metrics from a destroyed group.
This commit splits the registration of internal and public metrics into
two separate methods. It also adds two new metrics to the
"public_metrics" endpoint:

redpanda_application_uptime_seconds_total
Description: Redpanda uptime in seconds
Labels: none

redpanda_application_busy_seconds_total
Description: Total CPU busy time in seconds
Labels: none
scheduler_runtime_ms used to be replicated from the seastar metrics.
Previous patches introduced redpanda_scheduler_runtime_seconds_total as
a replacement.
@VladLazar (Contributor, Author)

Changes in force push:

  • publish scheduler stats from the default group too
  • added redpanda_application_uptime_seconds_total and redpanda_application_busy_seconds_total

@VladLazar VladLazar requested a review from BenPope August 11, 2022 10:40
@BenPope (Member) left a comment

Very nice

Comment on lines +84 to +85
ss::scheduling_group _default{
seastar::default_scheduling_group()}; // created and managed by seastar
Member:
Good catch!

.count();
},
sm::description("Total CPU busy time in seconds"))
.aggregate({sm::shard_label}),
Member:
I think it's worth keeping the shards here

@VladLazar (Contributor, Author) commented Aug 11, 2022
These metrics are only reported from one shard. That's why I aggregated. Just drops the label basically.

Member:
curl -s  localhost:19644/metrics | grep  cpu_busy_ms
# HELP vectorized_reactor_cpu_busy_ms Total cpu busy time in milliseconds
# TYPE vectorized_reactor_cpu_busy_ms counter
vectorized_reactor_cpu_busy_ms{shard="0"} 6268
vectorized_reactor_cpu_busy_ms{shard="1"} 5003

Member:
Oh, you mean this probe is only on one shard? Can you register a metric for each shard?

@VladLazar (Contributor, Author):
Right. I've confused myself. It's reported from one shard because it's only registered on one shard. We should probably register it on all shards. This means that's probably not the right place for the "busy_time" metric. Let me have a think.

Member:
You can invoke_on, or submit_to, or create a sharded<probe> and .invoke_on_all, or move it.

@VladLazar (Contributor, Author):
I know. Just feels a bit unnatural to do it there.

@VladLazar (Contributor, Author):
Btw. Do we want the uptime for every shard? It would probably be redundant. Busy time for each shard makes sense though.

Member:
Yep, that's how it is:

curl -s localhost:19644/metrics | grep uptime
# HELP vectorized_application_uptime Redpanda uptime in milliseconds
# TYPE vectorized_application_uptime gauge
vectorized_application_uptime{shard="0"} 24680.000000

@VladLazar (Contributor, Author):
Done. I wrapped the metric_groups object into a ss::sharded. uptime is only reported from the home shard, while busy is reported from all shards.

@BenPope (Member) commented Aug 11, 2022

I set this for backport to v22.2.x

Vlad Lazar added 2 commits August 11, 2022 14:41
This patch adds a wrapper for seastar::metrics::metric_groups which is
intended for usage with seastar::sharded. The only interesting thing
about it is that it clears the metrics on stop.
This patch wraps the metric_groups object owned by the application into
a sharded service. This change allows us to register metrics on specific
shards where required. For instance,
redpanda_application_uptime_seconds_total is only registered on one
shard, while redpanda_cpu_busy_seconds_total is registered on all
shards.
@VladLazar VladLazar requested a review from BenPope August 11, 2022 13:45
@BenPope (Member) left a comment

Awesome.

I built and tested this locally

@VladLazar (Contributor, Author)

> Awesome.
> I built and tested this locally

Appreciate that. Thanks for trying!

@BenPope BenPope merged commit be504a9 into redpanda-data:dev Aug 11, 2022
@BenPope (Member) commented Aug 11, 2022

/backport v22.2.x

Comment on lines +340 to +343
_public_metrics
.invoke_on(
ss::this_shard_id(),
[](auto& public_metrics) {
Member:

@VladLazar just curious about using invoke_on(this_shard_id, ...) as opposed to _public_metrics.local(): is there something special about using invoke_on that I'm missing here?

@VladLazar (Contributor, Author) commented Aug 15, 2022

You're not missing anything. I just slightly misused the api (unintentionally). Should have used local(). I don't think there is any difference in behaviour though.

Member:

me neither, looks fine. thx
