rpk: Improve generated Grafana dashboard with public_metrics #6382

Closed
r-vasquez opened this issue Sep 13, 2022 · 1 comment
Labels: area/rpk, kind/enhance (New feature or request)

r-vasquez commented Sep 13, 2022

Description

Follow-up to #6165 (comment): improve the generated Grafana dashboard using public_metrics.

1. Refactor the dashboard to include the panels listed below.

The list is taken from an example ops dashboard produced in CS, which covers some of the things we'd encourage people to monitor. Some of these items aren't available in public_metrics (marked "not present" below), so we may need a separate issue to add them.

  1. Nodes Up
  2. Uptime
  3. No. Partitions
  4. No. Topics
  5. Leadership transfer rate (not present)
  6. Under-replicated partitions
  7. Leaderless partitions
  8. CPU Utilisation (not present) -> Check additional info
  9. Allocated Memory
  10. Leadership balance
  11. Currently active connections (not present)
  12. Cluster info (build numbers, versions, etc) (not present)
  13. Produce latency
  14. Consumer latency
  15. Storage bytes written (not present)
  16. Storage bytes read (not present)
  17. Network bytes received
  18. Network bytes sent
  19. Under-replicated partitions (by topic) (not present) -> Check additional info
  20. Leaderless partitions (list)
  21. Under-replicated partitions by cluster (not present) -> Can be derived from redpanda_kafka_under_replicated_replicas
  22. Number of groups for which a node is a leader
  23. Partition leadership per broker
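
As a rough illustration of item 21, the cluster-wide count can be obtained by summing the per-partition samples of redpanda_kafka_under_replicated_replicas, which is what a PromQL `sum(...)` would do. The sketch below is only an assumption-laden model: the sample values are made up, and the label names `redpanda_topic` and `redpanda_partition` are assumed for illustration rather than taken from this issue.

```python
from collections import defaultdict

# Made-up samples of redpanda_kafka_under_replicated_replicas.
# Label names here are assumptions, not confirmed by the issue.
samples = [
    ({"redpanda_topic": "orders", "redpanda_partition": "0"}, 1.0),
    ({"redpanda_topic": "orders", "redpanda_partition": "1"}, 0.0),
    ({"redpanda_topic": "payments", "redpanda_partition": "0"}, 2.0),
]


def sum_by(samples, label=None):
    """Sum sample values, optionally grouped by one label (like PromQL `sum by`)."""
    totals = defaultdict(float)
    for labels, value in samples:
        # With no label, everything collapses into a single "cluster" bucket.
        key = labels.get(label, "") if label else "cluster"
        totals[key] += value
    return dict(totals)


cluster_total = sum_by(samples)                # one value for the whole cluster
per_topic = sum_by(samples, "redpanda_topic")  # hypothetical per-topic grouping
```

Whether a per-topic grouping is actually possible depends on which labels the metric exposes, which is the open question behind item 19.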

2. Replace Memory and Storage Section:

Replace the panels in the storage section with two new panels that display the ratio of disk available:
Disk Usage per Broker (the percentage of disk currently in use): 1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)

Replace the memory section with: Memory Usage per Broker (the percentage of memory currently in use): redpanda_memory_allocated_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)
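
To make the two formulas concrete, here is a minimal sketch of the arithmetic behind both panels. The function names are hypothetical and the input numbers are invented; only the formulas come from the description above. Note that both expressions yield a fraction between 0 and 1, so a panel showing a percentage would need to multiply by 100 or use a percent unit.

```python
def disk_usage_ratio(free_bytes: float, total_bytes: float) -> float:
    """1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1 - free_bytes / total_bytes


def memory_usage_ratio(allocated: float, free: float) -> float:
    """allocated / (free + allocated), using the two redpanda_memory_* gauges."""
    total = free + allocated
    if total <= 0:
        raise ValueError("free + allocated must be positive")
    return allocated / total


# Invented example values: 250 of 1000 bytes free, 3 units allocated and 1 free.
disk_used = disk_usage_ratio(free_bytes=250, total_bytes=1000)
mem_used = memory_usage_ratio(allocated=3, free=1)
```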

3. Aggregate the REST proxy and schema registry errors by redpanda_status

The queries should change like this: sum(...) by ($aggr_criteria, redpanda_status). Note the new label we aggregate by.
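
The change to the query shape can be sketched as a small string builder. This is not rpk's code: the helper name `aggregate_by_status` is hypothetical, and the inner expression in the usage line (including the metric name and the 5m window) is an assumed example, since the issue elides it as `...`.

```python
def aggregate_by_status(inner_expr: str, aggr_criteria: str = "$aggr_criteria") -> str:
    """Wrap an expression in sum(...) grouped by the criteria plus redpanda_status."""
    return f"sum({inner_expr}) by ({aggr_criteria}, redpanda_status)"


# Assumed inner expression, for illustration only.
query = aggregate_by_status("rate(redpanda_rest_proxy_request_errors_total[5m])")
```

Keeping redpanda_status in the grouping means each status value gets its own series instead of being folded into one total.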

Additional Info:

#6165 (comment)

This will require rpk to handle /public_metrics differently from /metrics, so that the old dashboard stays backward compatible.
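
One way to picture that split is to choose the dashboard layout from the endpoint path. The sketch below is purely illustrative and is not how rpk implements it: the function name `dashboard_kind`, the return values "public" and "legacy", and the host and port in the usage lines are all assumptions.

```python
from urllib.parse import urlparse


def dashboard_kind(metrics_endpoint: str) -> str:
    """Pick a dashboard layout based on which metrics endpoint is being scraped."""
    path = urlparse(metrics_endpoint).path.rstrip("/")
    # Check the more specific suffix first, since both paths end in "metrics".
    if path.endswith("/public_metrics"):
        return "public"
    if path.endswith("/metrics"):
        return "legacy"
    raise ValueError(f"unrecognized metrics endpoint: {metrics_endpoint}")


# Example host and port are placeholders.
new_kind = dashboard_kind("http://localhost:9644/public_metrics")
old_kind = dashboard_kind("http://localhost:9644/metrics")
```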

r-vasquez added the kind/enhance and area/rpk labels on Sep 13, 2022
r-vasquez closed this as not planned on Sep 13, 2022
r-vasquez reopened this on Sep 13, 2022
r-vasquez commented

Fixed by: #9662

This issue was closed.