Move usage stats reporting into redpanda #2707

twmb · 2021-10-19T17:05:27Z

(was: Enabling scram breaks rpk status, which is run by default in our systemd file)

By default today, systemd starts rpk status when a user starts redpanda. rpk status uses the Kafka API to scrape the number of topics and number of partitions created on the broker. These two, along with free memory, free space, and the CPU% used by redpanda, are sent to Vectorized as metrics.

If a user enables scram on a cluster, rpk status will be unable to connect to redpanda to scrape the # of topics and # of partitions. The franz-go client internally retries issuing a metadata request a few times, and after a bit the process will die with an error. During these retries, redpanda will continuously be logging that connections are failing due to auth errors. This spams logs and creates gigabytes in syslog over a day.

Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.
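To make the proposal concrete, here is a minimal sketch of rolling up the five metrics named above into one JSON document. It is an illustration only: the `UsageStats` type, the field names, and the `topics` mapping (topic name to partition count) are assumptions, not the payload rpk or redpanda actually sends.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class UsageStats:
    # Hypothetical field names; the real payload schema is not given in this issue.
    topic_count: int
    partition_count: int
    free_memory_bytes: int
    free_space_bytes: int
    cpu_percent: float


def roll_up(topics, free_memory_bytes, free_space_bytes, cpu_percent):
    """Build the stats record from data the broker already holds.

    `topics` maps topic name -> number of partitions (assumed shape).
    """
    return UsageStats(
        topic_count=len(topics),
        partition_count=sum(topics.values()),
        free_memory_bytes=free_memory_bytes,
        free_space_bytes=free_space_bytes,
        cpu_percent=cpu_percent,
    )


def to_json(stats):
    """Serialize the record as a JSON document with stable key order."""
    return json.dumps(asdict(stats), sort_keys=True)
```

Because the broker would read topic and partition counts from its own metadata, no Kafka API connection (and so no SCRAM credentials) is involved in this path.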

twmb · 2021-10-19T17:19:58Z

Note that this only happens if the user has enable_usage_stats: true, which is the default for k8s.

dotnwat · 2021-10-19T18:44:22Z

> Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.

are you thinking in terms of exposing this through the admin api?

another option is to place some admin credentials in /etc/redpanda/ and make them only readable by root.

twmb · 2021-10-19T18:57:13Z

This would be something for redpanda to emit by default, not something exposed for another process. cc @senior7515 for more details.

emaxerrno · 2021-10-19T18:57:44Z

@dotnwat i think the functionality should be moved into the c++ binary, in particular the cluster controller really.

dotnwat · 2021-10-19T19:38:36Z

> Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.

Ah, got it! Thanks for writing all of the context in the issue; I didn't realize how reporting works today. I also didn't read until the end of the issue description, so I missed the important part. My bad 🥇

Yup, this should be straightforward since we have @Lazin's HTTP client!

1. Create periodic service in controller leader
2. Service gathers stats from local metadata cache, but may need to ping other nodes, too
3. Create JSON document
4. POST to configured http endpoint

For (1) something as simple as the periodic leadership re-balancing service could be used as a model for implementation. @jcsp @mmaslankaprv wow 3 use cases in one day for the hypothetical service we've been discussing.
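The steps above could be sketched roughly as follows. This is Python for illustration only, since the real service would live in the C++ controller; `PeriodicReporter`, `post_json`, and the injected `is_leader`, `gather`, and `send` callables are hypothetical names, not redpanda APIs.

```python
import json
import threading
import urllib.request


def post_json(url, document, timeout=5.0):
    """POST a JSON document to the configured endpoint; returns the HTTP status."""
    body = json.dumps(document).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


class PeriodicReporter:
    """Runs a report cycle on a fixed interval, but only on the leader."""

    def __init__(self, is_leader, gather, send, interval_seconds):
        self._is_leader = is_leader  # callable: is this node the controller leader?
        self._gather = gather        # callable: returns a JSON-serializable dict
        self._send = send            # callable: delivers the document
        self._interval = interval_seconds
        self._stop = threading.Event()

    def tick(self):
        """Run one cycle; returns True only if a report was sent."""
        if not self._is_leader():
            return False
        try:
            self._send(self._gather())
        except Exception:
            # A failed report is dropped and attempted again on the next
            # interval, rather than retried in a tight loop.
            return False
        return True

    def run(self):
        # Event.wait returns False on timeout, so this loops once per interval
        # until stop() is called.
        while not self._stop.wait(self._interval):
            self.tick()

    def stop(self):
        self._stop.set()
```

A caller would wire `send` to something like `lambda doc: post_json(endpoint, doc)`, where `endpoint` is whatever URL the cluster is configured with. Keeping `gather` and `send` injectable also leaves room for the "ping other nodes" case in step 2 without changing the loop.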

emaxerrno · 2021-10-19T20:07:46Z

> > Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.
>
> Ah, got it! Thanks for writing all of the context in the issue; I didn't realize how reporting works today. I also didn't read until the end of the issue description, so I missed the important part. My bad 🥇
>
> Yup, this should be straightforward since we have @Lazin's HTTP client!
>
> 1. Create periodic service in controller leader
> 2. Service gathers stats from local metadata cache, but may need to ping other nodes, too
> 3. Create JSON document
> 4. POST to configured http endpoint
>
> For (1) something as simple as the periodic leadership re-balancing service could be used as a model for implementation. @jcsp @mmaslankaprv wow 3 use cases in one day for the hypothetical service we've been discussing.

Exactly.

jcsp · 2021-10-21T14:09:00Z

The version tracking is likely to be rolled into an overall cluster status service, so #2735 will probably depend on this.

0x5d · 2021-11-24T19:02:29Z

Changes needed, besides porting the metrics reporting stuff to redpanda:

Packaging:

- Deleting the systemd timer & service for rpk status
- Deleting the cronjob in the docker image (not sure if it still exists though)
- Updating internal tooling so that the above changes don't break packaging & releases

K8s:

- Removing the rpk-status container from deployments

Rpk:

- Removing rpk status altogether (currently deprecated).

jcsp changed the title ~~Enabling scram breaks rpk status, which is run by default in our systemd file~~ Move usage stats reporting into redpanda Oct 21, 2021

jcsp added the kind/enhance and area/redpanda labels Oct 21, 2021

mmaslankaprv mentioned this issue Nov 25, 2021

Implemented simple metrics reporter #3066

Merged

mmaslankaprv closed this as completed in #3066 Dec 8, 2021

This issue was closed.
