Move usage stats reporting into redpanda #2707

Closed
twmb opened this issue Oct 19, 2021 · 8 comments · Fixed by #3066
Labels
area/redpanda kind/enhance New feature or request

Comments

twmb (Contributor) commented Oct 19, 2021

(was: Enabling scram breaks rpk status, which is run by default in our systemd file)

By default today, systemd starts rpk status when a user starts redpanda. rpk status uses the Kafka API to scrape the number of topics and the number of partitions created on the broker. These two counts, along with free memory, free space, and the CPU% used by redpanda, are sent to Vectorized as metrics.

If a user enables SCRAM on a cluster, rpk status cannot connect to redpanda to scrape the number of topics and partitions. The franz-go client internally retries the metadata request a few times, and after a bit the process dies with an error. During these retries, redpanda continuously logs that connections are failing due to auth errors. This spams the logs and creates gigabytes of syslog output over a day.

Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.
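
To make the report concrete, here is a minimal sketch of the five values described above as one JSON document. The type name, field names, and units are assumptions for illustration; the thread does not specify the actual payload format.

```go
package main

import "encoding/json"

// UsageStats is a hypothetical shape for the usage report. It carries the
// five values named in this issue; the JSON keys and units are assumed.
type UsageStats struct {
	Topics          int     `json:"topics"`
	Partitions      int     `json:"partitions"`
	FreeMemoryBytes uint64  `json:"free_memory_bytes"`
	FreeSpaceBytes  uint64  `json:"free_space_bytes"`
	CPUPercent      float64 `json:"cpu_percent"`
}

// Encode renders the report as a JSON document ready to be sent.
func (s UsageStats) Encode() ([]byte, error) {
	return json.Marshal(s)
}
```

Whichever process emits the report, the payload stays this small, so the cost of today's approach is in how the values are gathered rather than in what is sent.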

twmb (Contributor, Author) commented Oct 19, 2021

Note that this only happens if the user has enable_usage_stats: true, which is the default for k8s.

dotnwat (Member) commented Oct 19, 2021

> Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.

Are you thinking in terms of exposing this through the admin API?

Another option is to place some admin credentials in /etc/redpanda/ and make them readable only by root.

twmb (Contributor, Author) commented Oct 19, 2021

This would be something for redpanda to emit by default, not something exposed for another process. cc @senior7515 for more details.

emaxerrno (Contributor) commented

@dotnwat I think the functionality should be moved into the C++ binary, in particular into the cluster controller.

dotnwat (Member) commented Oct 19, 2021

> Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.

Ah, got it! Thanks for writing all of the context in the issue; I didn't realize how reporting works today. I also didn't read until the end of the issue description, so I missed the important part. My bad 🥇

Yup, this should be straightforward since we have @Lazin's HTTP client!

  1. Create periodic service in controller leader
  2. Service gathers stats from local metadata cache, but may need to ping other nodes, too
  3. Create JSON document
  4. POST to configured http endpoint

For (1) something as simple as the periodic leadership re-balancing service could be used as a model for implementation. @jcsp @mmaslankaprv wow 3 use cases in one day for the hypothetical service we've been discussing.

emaxerrno (Contributor) commented

> > Rather than having an external process gather these metrics (rpk), redpanda itself should roll them up and emit them. This would be easier to enable/disable, would start one fewer process, and would be less surprising and more foolproof.
>
> Ah, got it! Thanks for writing all of the context in the issue; I didn't realize how reporting works today. I also didn't read until the end of the issue description, so I missed the important part. My bad 🥇
>
> Yup, this should be straightforward since we have @Lazin's HTTP client!
>
>   1. Create periodic service in controller leader
>   2. Service gathers stats from local metadata cache, but may need to ping other nodes, too
>   3. Create JSON document
>   4. POST to configured http endpoint
>
> For (1) something as simple as the periodic leadership re-balancing service could be used as a model for implementation. @jcsp @mmaslankaprv wow 3 use cases in one day for the hypothetical service we've been discussing.

Exactly.

jcsp changed the title from "Enabling scram breaks rpk status, which is run by default in our systemd file" to "Move usage stats reporting into redpanda" on Oct 21, 2021

jcsp added the kind/enhance (New feature or request) and area/redpanda labels on Oct 21, 2021

jcsp (Contributor) commented Oct 21, 2021

The version tracking is likely to be rolled into an overall cluster status service, so #2735 will probably depend on this.

0x5d (Contributor) commented Nov 24, 2021

Changes needed, besides porting the metrics reporting stuff to redpanda:

Packaging:

  • Deleting the systemd timer & service for rpk status
  • Deleting the cronjob in the docker image (not sure if it still exists though)
  • Updating internal tooling so that the above changes don't break packaging & releases

K8s:

  • Removing the rpk-status container from deployments

Rpk:

  • Removing rpk status altogether (currently deprecated).

This issue was closed.