rpk: grafana-generate - support public metrics #6165

r-vasquez · 2022-08-22T15:52:41Z

Cover letter

rpk generate grafana-dashboard has a summary section that uses some metrics that don't exist in the new /public_metrics endpoint.

This PR creates a new summary section with the new metrics available in the /public_metrics endpoint.

Fixes #5646

Grafana dashboard example: here

Backport Required

v22.2.x

UX changes

Old:

Now:

We are keeping backward compatibility for the /metrics endpoint. The new panels and the modified aggregation criteria apply to the new endpoint only:

Nodes Up and Partitions panels

These now query redpanda_cluster_brokers and redpanda_cluster_partitions.

Latency of Kafka Request

We are splitting produce and consume requests in each percentile. We are querying redpanda_kafka_request_latency_seconds_bucket now.

Throughput

Is now calculated by sum(rate(redpanda_kafka_request_bytes_total[2m])) by (redpanda_request).

Changes that affect all panels

We are removing the shard label, filtering, and aggregation criteria in the /public_metrics endpoint because the new metrics don't have a shard label.

Release notes

Improvements

rpk generate grafana-dashboard now supports /public_metrics endpoint.

VladLazar

I'll defer to someone with better go knowledge (mine's non-existent) for the code review.

Functionally it looks nice. There's a couple of improvements we could make:

Remove the labels that never have a value from the legend. For all the panels in "Redpanda Summary" we can remove the redpanda_request label from the legend as it never has a value.
Same for the panels in the "Internal RPC Latency" section. The panel in the "Throughput" section should only keep the "redpanda_request" label in the legend.
Can we change the Y axis unit for panels that represent time (latency) to seconds?

@BenPope, @travisdowns could you take a look too? I generated some load between 14:20 and 14:25 UTC so the panels should display something around that time.

BenPope

The intervals for rates should prefer [$__rate_interval] over a fixed period such as [2m], but you may need to set a min step of 1m.
The units should always be set; e.g., it looks like the latency is off by some orders of magnitude, and should be set to seconds.
I'd be tempted to also aggregate schema registry and HTTP Proxy errors by redpanda_status - it's in the legend, after all.
In general I'd prefer rate over irate, at least for simple queries - I don't want to get into subqueries here.
Memory stats have shard in the legend, but not in aggregation criteria (probably worth adding a new option that is cluster,instance,shard in the dropdown).
The unit for scheduler stats is seconds per second - it should be percent (0..1), not ops/s.
The storage section has rate queries, and I don't know what we're trying to measure. Probably worth combining them into a ratio of disk used %.
Internal latency of schema registry / http proxy isn't internal, it's the request latency.

travisdowns · 2022-08-24T20:16:57Z

Nice to see support for public metrics. I had some feedback but Ben got all of it in his comments above (particularly important to fix the units for latency before it goes out).

Minor concern about how we detect the endpoint: we look for the string public_metrics in the URL, is that right? Could it be a problem if someone is using URL rewriting so that the string does not appear, or (less likely) they happen to have this string elsewhere in the URL (e.g. in the hostname) so it appears spuriously? I think it's good enough, and this would only be something to consider changing if we have a better solution, but I'm not sure what that is: @VladLazar is there some strong indicator in the metrics results themselves to distinguish public from oldschool metrics?

BenPope · 2022-08-24T20:26:34Z

The new metrics all start with redpanda_, the old ones with vectorized_
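A minimal sketch of what prefix-based detection could look like, assuming the metric family names have already been parsed from the scrape. The function name isPublicMetrics and the "no legacy names present" rule are illustrative assumptions, not what rpk does:

```go
package main

import (
	"fmt"
	"strings"
)

// isPublicMetrics reports whether the scraped metric names look like they
// came from /public_metrics. It relies on the observation above: public
// metrics start with "redpanda_", legacy ones with "vectorized_".
// Hypothetical helper, not part of rpk.
func isPublicMetrics(names []string) bool {
	public, legacy := 0, 0
	for _, n := range names {
		switch {
		case strings.HasPrefix(n, "redpanda_"):
			public++
		case strings.HasPrefix(n, "vectorized_"):
			legacy++
		}
	}
	// Require at least one public name and no legacy names.
	return public > 0 && legacy == 0
}

func main() {
	fmt.Println(isPublicMetrics([]string{"redpanda_cluster_brokers", "redpanda_cluster_partitions"}))
	fmt.Println(isPublicMetrics([]string{"vectorized_application_uptime"}))
}
```

Unlike the URL check, this would not be affected by URL rewriting or by the string appearing in a hostname.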

r-vasquez · 2022-08-26T15:38:39Z

Force Push:

Removed the labels that never have a value from the legend. (Summary and Internal RPC Latency section)
Changed latency units to seconds.
Changed interval for rates from a fixed [2m] to [$__rate_interval] with a minimum of [1m]
Changed unit for scheduler stats to be a percent (0...1)
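As a rough illustration of the interval change, the query string could be assembled like this. The helper rateExpr is hypothetical; the metric and label names are the ones from the throughput panel in the cover letter, and the 1m minimum is set separately on the panel (the "interval":"1m" field visible in the test diff):

```go
package main

import "fmt"

// rateExpr builds a PromQL expression that sums a counter's per-second rate,
// using Grafana's $__rate_interval variable instead of a fixed window.
// Hypothetical helper for illustration only.
func rateExpr(metric, by string) string {
	return fmt.Sprintf("sum(rate(%s[$__rate_interval])) by (%s)", metric, by)
}

func main() {
	fmt.Println(rateExpr("redpanda_kafka_request_bytes_total", "redpanda_request"))
}
```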

New Grafana example can be found here

r-vasquez · 2022-08-26T15:39:50Z

src/go/rpk/pkg/cli/cmd/generate/grafana_test.go

@@ -65,7 +65,7 @@ vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{le="20.000000
 vectorized_memory_allocated_memory_bytes{shard="0",type="bytes"} 40837120
 vectorized_memory_allocated_memory_bytes{shard="1",type="bytes"} 36986880
 `
-	expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"title":"Latency of service handler 
dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}`
+	expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"interval":"1m","title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"interval":"1m","title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: 
{{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"interval":"1m","title":"Latency of service handler dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}`


Single-line change: the expected output changes because I'm now setting the default interval for rate panels to 1m.

In hindsight I think it would have been better to create a golden file instead of putting all this in a JSON single-liner. It would make reviewing these changes easier.

VladLazar

Looking good. I think a few of Ben's comments still need addressing:

Rename "Internal latency of request for schema_registry" to "Schema Registry Request Latency" and change unit to seconds
Rename "Internal latency of rest_proxy" to "REST Proxy Request Latency" and change unit to seconds
Replace the panels in the storage section with two new panels that display the ratio of disk available:
- Disk Usage per Broker (the percentage of disk currently in use): 1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)
Aggregate the rest proxy and schema registry errors by redpanda_status. The queries should change like this:
sum(...) by ($aggr_criteria, redpanda_status). Note the new label we aggregate by.

Also, I would replace the panels in the memory section with the following:

Memory Usage per Broker (the percentage of memory currently in use): redpanda_memory_allocated_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)
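To make the two ratios concrete, here is the arithmetic the suggested queries perform, written as plain functions. The function names and sample values are made up; the zero guards are an addition that the PromQL expressions do not have:

```go
package main

import "fmt"

// diskUsage returns the fraction of disk in use: 1 - free/total.
func diskUsage(freeBytes, totalBytes float64) float64 {
	if totalBytes == 0 {
		return 0 // avoid dividing by zero when nothing is reported
	}
	return 1 - freeBytes/totalBytes
}

// memoryUsage returns the fraction of memory in use:
// allocated / (free + allocated).
func memoryUsage(allocated, free float64) float64 {
	if allocated+free == 0 {
		return 0 // avoid dividing by zero when nothing is reported
	}
	return allocated / (free + allocated)
}

func main() {
	fmt.Println(diskUsage(25, 100))  // 25 of 100 bytes free: prints 0.75
	fmt.Println(memoryUsage(30, 10)) // 30 allocated, 10 free: prints 0.75
}
```

Both results are fractions in the 0..1 range, so the panels would use a percent (0..1) unit.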

r-vasquez · 2022-08-30T14:41:00Z

@VladLazar We take the name of the panel from the HELP of each metric, in the case of Internal Latency of Request for schema_registry and rest_proxy:

# HELP redpanda_schema_registry_request_latency_seconds Internal latency of request for schema_registry
# TYPE redpanda_schema_registry_request_latency_seconds histogram
redpanda_schema_registry_request_latency_seconds_sum{} 0

# HELP redpanda_rest_proxy_request_latency_seconds Internal latency of request for rest_proxy
# TYPE redpanda_rest_proxy_request_latency_seconds histogram
redpanda_rest_proxy_request_latency_seconds_sum{} 0

To avoid hardcoding this, can we change the response from /public_metrics ? lmk so I can make the change if needed.

Aggregate the rest proxy and schema registry errors by redpanda_status.

Is this just for the 2 panels or can we add the aggregation criteria for every other panel?

Replace the panels in the storage section with two new panels that display the ratio of disk available

Is it helpful to leave the ones that we have and add the 2 new panels?

Also, I would replace the panels in the memory section with the following

Same as above.

I'm asking all this because we will have a lot of custom hardcoded logic in rpk for public_metrics vs the way we handle /metrics endpoint's panels and want to make sure that I understand all 😃

VladLazar · 2022-08-31T10:54:42Z

@VladLazar We take the name of the panel from the HELP of each metric, in the case of Internal Latency of Request for schema_registry and rest_proxy:

If that's the case then let's leave it as is for now and we can change the description of the metrics in redpanda.

Is this just for the 2 panels or can we add the aggregation criteria for every other panel?

Just for those two panels. The others don't have this label attached to them.

Is it helpful to leave the ones that we have and add the 2 new panels?

I would just replace them. They'd display the same information, but in a different way. I think what a user actually cares about is how much disk and memory are being used.


twmb · 2022-09-02T00:46:39Z

src/go/rpk/pkg/cli/cmd/generate/grafana.go

-			panel = newCounterPanel(family)
+		// hack around redpanda_storage_* metrics: these should be gauge
+		// panels but the metrics type come as COUNTER
+		if family.GetType() == dto.MetricType_COUNTER && !strings.Contains(name, "redpanda_storage") {


Should this be !strings.HasPrefix?

Also, is the counter vs. gauge thing a redpanda issue?

Yeah this looks like a bug. I think all of the redpanda_storage metrics in public_metrics should be gauges. Filed: #6316

twmb

Two non-blocking questions

emaxerrno · 2022-09-06T06:19:05Z

@tmgstevens may have thoughts on the default panels we show on /public_metrics i think CS found other panels to yield higher information/better signal.

tmgstevens · 2022-09-12T20:49:59Z

@tmgstevens may have thoughts on the default panels we show on /public_metrics i think CS found other panels to yield higher information/better signal.

This is an example of an ops dashboard that we've produced in CS which covers some of the things we'd be encouraging people to monitor. Unfortunately it seems that some of these things aren't available in public_metrics, so we may need a separate issue to add them.
The list is as follows:

Nodes Up
Uptime
No. Partitions
No. Topics
Leadership transfer rate (not present)
Under replicated partitions
Leaderless partitions
CPU Utilisation (not present)
Allocated Memory
Leadership balance
Currently active connections (not present)
Cluster info (build numbers, versions, etc) (not present)
Produce latency
Consumer latency
Storage bytes written (not present)
Storage bytes read (not present)
Network bytes received
Network bytes sent
Under-replicated partitions (by topic) (not present)
Leaderless partitions (list)
Under replicated partitions by cluster (not present)
Number of groups for which a node is a leader
Partition leadership per broker

emaxerrno · 2022-09-12T23:24:16Z

@r-vasquez ^^ see tristan's comments above.

VladLazar · 2022-09-13T10:29:50Z

@emaxerrno would it be a good idea to decouple the fix (this PR) and the improvements suggested by @tmgstevens?
I'm on-board with improving the generated dashboards as Tristan suggests, but lacking integration with Grafana for public_metrics is a regression when compared to the metrics endpoint and it's blocking adoption.

BenPope · 2022-09-13T11:32:28Z

Leadership transfer rate (not present)

Yep, not available.

CPU Utilisation (not present)

You may find these useful:

# HELP redpanda_cpu_busy_seconds_total Total CPU busy time in seconds
# TYPE redpanda_cpu_busy_seconds_total gauge
# HELP redpanda_scheduler_runtime_seconds_total Accumulated runtime of task queue associated with this scheduling group
# TYPE redpanda_scheduler_runtime_seconds_total counter

Currently active connections (not present)

I thought we had this, but apparently not.

Cluster info (build numbers, versions, etc) (not present)

I thought we had this, but apparently not.

Storage bytes written (not present)

Storage bytes read (not present)

Yep, not available.

Under-replicated partitions (by topic) (not present)

# HELP redpanda_kafka_under_replicated_replicas Number of under replicated replicas (i.e. replicas that are live, but not at the latest offest)
# TYPE redpanda_kafka_under_replicated_replicas gauge

Under replicated partitions by cluster (not present)

Can be derived from redpanda_kafka_under_replicated_replicas

tmgstevens · 2022-09-13T12:47:58Z

@emaxerrno would it be a good idea to decouple the fix (this PR) and the improvements suggested by @tmgstevens? I'm on-board with improving the generated dashboards as Tristan suggests, but lacking integration with Grafana for public_metrics is a regression when compared to the metrics endpoint and it's blocking adoption.

I would, however, suggest going through the list I mentioned and making sure that we've got as many of those things as possible on the generated dashboard. For example, IIRC under-replicated partitions isn't anywhere near the top of the page, but it should be, as it's really important to monitor.

twmb · 2022-09-13T14:04:40Z

We need to move further dashboard improvements into a separate PR -- the scope of this PR was originally to migrate from our excessive /metrics to our new /public_metrics.

The best long term fix would be to remove the dashboard generation from pure-Go, rpk-only code and separate it to something that can be maintained and extended by all teams.

If we are ok with the current dashboards in this PR, we should merge this PR. I think the comments above indicate that the current dashboards are good; pending any disagreements, I think we should merge this by EOD.

We should have two followup issues, one tracking how to separate these dashboards from pure-Go code so that the dashboards can be maintained more broadly, and one to further improve the dashboards per @tmgstevens's suggestions above.

Lmk if there are any disagreements here, otherwise the plan is to merge EOD.

0x5d

lgtm besides unresolved comments.

0x5d · 2022-09-13T16:12:58Z

src/go/rpk/pkg/cli/cmd/generate/grafana_test.go

@@ -65,7 +65,7 @@ vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{le="20.000000
 vectorized_memory_allocated_memory_bytes{shard="0",type="bytes"} 40837120
 vectorized_memory_allocated_memory_bytes{shard="1",type="bytes"} 36986880
 `
-	expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"title":"Latency of service handler 
dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}`
+	expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"interval":"1m","title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"interval":"1m","title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: 
{{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"interval":"1m","title":"Latency of service handler dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}`


In hindsight, I think it would have been better to create a golden file instead of putting all this in a JSON single-liner. It would make reviewing these changes easier.

tmgstevens · 2022-09-13T18:33:35Z

We need to move further dashboard improvements into a separate PR; the scope of this PR was originally to migrate from our excessive /metrics endpoint to our new /public_metrics endpoint.

The best long-term fix would be to move the dashboard generation out of pure-Go, rpk-only code and into something that can be maintained and extended by all teams.

If we are OK with the current dashboards in this PR, we should merge it. I think the comments above indicate that the current dashboards are good; pending any disagreements, we should merge this by EOD.

We should have two follow-up issues: one tracking how to separate these dashboards from pure-Go code so that they can be maintained more broadly, and one to further improve the dashboards per @tmgstevens's suggestions above.

Lmk if there are any disagreements here, otherwise the plan is to merge EOD.

I'm cool with that. I haven't managed to actually run the code anywhere; if someone can generate a dashboard for me, I'll happily test it. But generally, bar the comments above, LGTM.

r-vasquez · 2022-09-13T20:53:50Z

Latest Grafana dashboard example: here

Force Push:

More accurate string validation for both metrics endpoints, and a hack around redpanda_storage until we resolve "Some storage metrics should be guages not counters" (#6316)
graf README update with the new variable INTERVAL
Typos

The improvements to the generated dashboard will be handled in #6382, so we can separate the fix that makes public_metrics work from the dashboard improvements. 😄
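To illustrate why the metric type matters for the generated panels, here is a rough sketch, not the code in this PR. It assumes counters are rendered as a rate and gauges as a plain sum, following the expression shapes visible in the test diff, and that the workaround treats anything under the `redpanda_storage` prefix as a gauge. The names `metricKind`, `effectiveKind`, and `panelExpr` are hypothetical, and the `shard` filter is omitted because the /public_metrics series have no shard label.

```go
package main

import (
	"fmt"
	"strings"
)

// metricKind is a hypothetical enum used only in this sketch.
type metricKind int

const (
	counter metricKind = iota
	gauge
)

// effectiveKind applies the assumed workaround: metrics under the
// redpanda_storage prefix are treated as gauges even when the endpoint
// declares them as counters (see #6316).
func effectiveKind(name string, declared metricKind) metricKind {
	if strings.HasPrefix(name, "redpanda_storage") {
		return gauge
	}
	return declared
}

// panelExpr builds a PromQL expression for one metric: counters are
// wrapped in irate over a 2m window, gauges are summed directly.
func panelExpr(name string, declared metricKind) string {
	selector := fmt.Sprintf(`%s{instance=~"$node"}`, name)
	if effectiveKind(name, declared) == counter {
		return fmt.Sprintf("sum(irate(%s[2m])) by ($aggr_criteria)", selector)
	}
	return fmt.Sprintf("sum(%s) by ($aggr_criteria)", selector)
}
```

Taking a rate of a value that is really a gauge produces a misleading panel, which is why an override like this is useful until the metric types are fixed at the source.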

r-vasquez · 2022-09-19T14:17:35Z

/backport v22.1.x : my mistake.

vbotbuildovich · 2022-09-19T14:18:09Z

Branch name "v22.1.x" not found.

Workflow run logs.

r-vasquez · 2022-09-19T14:20:44Z

/backport v22.2.x

vbotbuildovich · 2022-09-19T14:21:13Z

Branch name "v22.2.x" not found.

Workflow run logs.

github-actions bot added the area/rpk label Aug 22, 2022

r-vasquez force-pushed the grafana-with-public-metrics branch from 389d455 to 580ef79 Compare August 22, 2022 15:54

r-vasquez marked this pull request as ready for review August 22, 2022 16:06

r-vasquez requested review from twmb and 0x5d as code owners August 22, 2022 16:06

r-vasquez requested a review from VladLazar August 22, 2022 16:06

VladLazar reviewed Aug 23, 2022

BenPope self-requested a review August 24, 2022 09:51

BenPope reviewed Aug 24, 2022

r-vasquez force-pushed the grafana-with-public-metrics branch 2 times, most recently from 2e1243a to b6e96bf Compare August 26, 2022 15:33

r-vasquez commented Aug 26, 2022

r-vasquez requested review from BenPope and VladLazar August 26, 2022 15:40

VladLazar reviewed Aug 30, 2022

twmb reviewed Sep 2, 2022

src/go/rpk/pkg/cli/cmd/generate/grafana.go (outdated)

twmb reviewed Sep 2, 2022

twmb previously approved these changes Sep 2, 2022

travisdowns mentioned this pull request Sep 5, 2022

Some storage metrics should be guages not counters #6316

Closed

VladLazar requested a review from tmgstevens September 7, 2022 09:50

r-vasquez mentioned this pull request Sep 13, 2022

rpk: Improve generated Grafana dashboard with public_metrics #6382

Closed

0x5d reviewed Sep 13, 2022

r-vasquez dismissed twmb’s stale review via f61cc7c September 13, 2022 20:47

r-vasquez force-pushed the grafana-with-public-metrics branch from b6e96bf to f61cc7c Compare September 13, 2022 20:47

rpk: grafana-generate - support public metrics

0e65d05

rpk generate grafana-dashboard have a summary section that has some metrics that don't exist in the new /public_metrics endpoint.

r-vasquez force-pushed the grafana-with-public-metrics branch from f61cc7c to 0e65d05 Compare September 13, 2022 20:50

twmb approved these changes Sep 14, 2022

twmb merged commit 06a0bd6 into redpanda-data:dev Sep 14, 2022

r-vasquez deleted the grafana-with-public-metrics branch September 14, 2022 15:48

This was referenced Sep 19, 2022

[v22.2.x] Make rpk generated dashboards work with "public_metrics" #6463

Closed

[v22.2.x] rpk: grafana-generate - support public metrics #6464

Merged

mmedenjak added the kind/enhance and kind/bug labels and removed the kind/enhance label Sep 19, 2022

andrewhsu mentioned this pull request Sep 19, 2022

fix slash backport cmd: adding --paginate flag #6461

Merged

tmgstevens mentioned this pull request Nov 2, 2022

Add ops dashboard metrics to public_metrics #7059

Closed
