Vault 1.5 crashes on Linux on ARM if telemetry is enabled #9553

Closed
joltedbot opened this issue Jul 22, 2020 · 4 comments · Fixed by #9554

Comments

@joltedbot

Describe the bug
After upgrading to Vault 1.5 on Linux/ARM, I enabled telemetry in the Vault config file for the first time. When I do, the active Vault node crashes on restart with a Go panic: "panic: invalid argument to Intn". See the bottom for the rest of the error.

I have a 3-node HA cluster, so it just keeps promoting a new node to active; that node fails and passes the role to the next, around in a circle.
If I remove the telemetry stanza, it works again. If I put the telemetry stanza back with any combination of parameters, I get the same panic.

To Reproduce
Steps to reproduce the behavior:

  1. Update Vault from 1.4.3 to 1.5 on Linux/ARM
  2. Add any telemetry stanza to the config file
  3. Restart Vault

Expected behavior
Vault should successfully start up and expose the Prometheus telemetry metrics at: /v1/sys/metrics

Environment:

  • Vault Server Version (retrieve with vault status): 1.5.0
  • Vault CLI Version (retrieve with vault version): v1.5.0
  • Server Operating System/Architecture: Raspbian (Kernel: 4.19.66-v7+ Arch: armv7l)

Vault server configuration file(s):

ui = true

listener "tcp" {
  address     = "0.0.0.0:8200"
  cluster_address  = "0.0.0.0:8201"
  tls_cert_file = "xxxxxxxxxxxxx"
  tls_key_file = "xxxxxxxxxxxxx"
  tls_min_version = "tls12"
}

storage "consul" {
  address = "x.x.x.x:8501"
  path    = "vault/"
  scheme        = "https"
  token = "xxxxxxxxxxxxx"
  tls_ca_file   = "xxxxxxxxxxxxx"
  tls_cert_file = "xxxxxxxxxxxxx"
  tls_key_file  = "xxxxxxxxxxxxx"
}

api_addr = "https://x.x.x.x:8200"
cluster_addr = "https://x.x.x.x:8201"

log_level = "error"

seal "awskms" {
  region     = "xxxxxxxxxxxxx"
  access_key = "xxxxxxxxxxxxx"
  secret_key = "xxxxxxxxxxxxx"
  kms_key_id = "xxxxxxxxxxxxx"
}

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
}

Additional context
Vault has been working well for a long time, so there were no preexisting issues. The 1.5 upgrade went smoothly, with no issues at all until I enabled telemetry. I have tried a bunch of different telemetry stanza parameters, including just "telemetry {}", and they all behave the same way. I also moved the telemetry stanza around in the HCL file just in case, and there was no noticeable change.

Here is the rest of the error message:

panic: invalid argument to Intn
goroutine 535 [running]:
math/rand.(*Rand).Intn(0x5514120, 0xb2c97000, 0x54b3774)
	/usr/local/go/src/math/rand/rand.go:169 +0x7c
math/rand.Intn(...)
	/usr/local/go/src/math/rand/rand.go:337
github.com/hashicorp/vault/helper/metricsutil.(*GaugeCollectionProcess).delayStart(0x61eaa50, 0x0)
	/go/src/github.com/hashicorp/vault/helper/metricsutil/gauge_process.go:124 +0x48
github.com/hashicorp/vault/helper/metricsutil.(*GaugeCollectionProcess).Run(0x61eaa50)
	/go/src/github.com/hashicorp/vault/helper/metricsutil/gauge_process.go:232 +0x48
created by github.com/hashicorp/vault/vault.(*Core).emitMetrics
	/go/src/github.com/hashicorp/vault/vault/core_metrics.go:198 +0x7e8

Systemd was also logging a bunch of these while trying to start the service; Vault seems to start but then crashes. Otherwise it is able to start all the backends and even unseal before it crashes. These messages go away when I remove the telemetry stanza.

2020-07-21T15:53:39.163-0700 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp y.y.y.y:8201: connect: connection refused""
2020-07-21T15:53:39.163-0700 [ERROR] core: forward request error: error="error during forwarding RPC request"
@mgritter
Contributor

You can disable the gauge collection process, which is the source of this bug, by adding the configuration

usage_gauge_interval = "none"

to your telemetry stanza, which will let the rest of telemetry work.

@joltedbot
Author

Unfortunately, that workaround didn't work for me. I have the same issue with the gauge interval disabled.

@mgritter
Contributor

Sorry, I used the wrong name.

https://www.vaultproject.io/docs/configuration/telemetry

says it's usage_gauge_period, which I checked is what's in the code.

@joltedbot
Author

And I should have looked at the documentation first.

It works. This will get me by till 1.5.1.

Thanks for the help.

mgritter pushed a commit that referenced this issue Jul 22, 2020
mgritter pushed a commit that referenced this issue Jul 22, 2020
* Add upgrade note for #9553.
* Note that these are metrics introduced in 1.5.0.
* Added link to docs.
github-actions bot pushed a commit that referenced this issue Jul 22, 2020
* Add upgrade note for #9553.
* Note that these are metrics introduced in 1.5.0.
* Added link to docs.