Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thanos Rule fails to load all rules files after reload - numFiles is dropping #4445

Closed
sevagh opened this issue Jul 14, 2021 · 4 comments
Closed

Comments

@sevagh
Copy link
Contributor

sevagh commented Jul 14, 2021

Thanos version: 0.22.0-rc.0 downloaded from the GitHub releases: https://github.com/thanos-io/thanos/releases/tag/v0.22.0-rc.0
OS: Debian 10 Buster, AMD64 architecture

The problem: I upgraded Thanos from v0.16.0 to v0.22.0. Since the upgrade, I noticed the Thanos Rule instances are failing to load all of the rules files.

If you do a systemctl restart, all the rules files are loaded. Then after the next systemctl reload signal, the rules files drop:
image

The total number of correct rules is the higher value, a little over 3000 rules. It only got that high immediately after a fresh daemon restart. After the next reload (every 1 hour, we sync new rules files from the self-service git repo onto the Thanos Rule machines and reload the daemon), the rules files dropped.

Initially I assumed it was a file descriptor problem and raised it through systemd, but I don't think that's related. The logs don't show any errors:

Jul 14 14:46:55 thanos1 systemd[1]: Stopped Thanos Ruler.
Jul 14 14:46:55 thanos1 systemd[1]: Started Thanos Ruler.
Jul 14 14:46:55 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:55.741945153Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626264000700 maxt=1626271200000 ulid=01FAJRGBK0A212XSG1CJH73J3C
Jul 14 14:46:55 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:55.742044543Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626271200700 maxt=1626278400000 ulid=01FAJZC2V6HFGWCHM49G72EJZA
Jul 14 14:46:55 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:55.742077653Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626278400000 maxt=1626285600000 ulid=01FAK67QW4Q13KSFSGZFWY2RYH
Jul 14 14:46:55 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:55.742107503Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626285600312 maxt=1626292800000 ulid=01FAKD3F3XSXT82XW67W9BXM3K
Jul 14 14:46:55 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:55.781298424Z caller=head.go:741 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
Jul 14 14:46:55 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:55.893174968Z caller=head.go:755 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=111.804004ms
Jul 14 14:46:55 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:55.893272868Z caller=head.go:761 component=tsdb msg="Replaying WAL, this may take a while"
Jul 14 14:46:58 thanos1 thanos[21646]: level=info ts=2021-07-14T21:46:58.792965494Z caller=head.go:787 component=tsdb msg="WAL checkpoint loaded"
Jul 14 14:47:03 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:03.335463023Z caller=head.go:813 component=tsdb msg="WAL segment loaded" segment=11800 maxSegment=11805
Jul 14 14:47:07 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:07.402785914Z caller=head.go:813 component=tsdb msg="WAL segment loaded" segment=11801 maxSegment=11805
caller=head.go:818 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=2.899793517s wal_replay_duration=17.538505979s total_replay_duration=20.55021845s
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.114126491Z caller=rule.go:647 msg="a leftover lockfile found and removed"
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.115180325Z caller=options.go:24 component=rules protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.117469283Z caller=factory.go:46 component=rules msg="loading bucket configuration"
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.117967064Z caller=rule.go:632 component=rules msg="starting rule node"
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.118125575Z caller=intrumentation.go:48 component=rules msg="changing probe status" status=ready
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.118360956Z caller=intrumentation.go:60 component=rules msg="changing probe status" status=healthy
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.118480786Z caller=http.go:63 component=rules service=http/server component=rule msg="listening for requests and metrics" address=0.0.0.0:10908
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.118565776Z caller=grpc.go:123 component=rules service=gRPC/server component=rule msg="listening for serving gRPC" address=0.0.0.0:10907
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.118622056Z caller=tls_config.go:191 component=rules service=http/server component=rule msg="TLS is disabled." http2=false
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.118951027Z caller=rule.go:800 component=rules msg="reload rule files" numFiles=192
Jul 14 14:48:03 thanos1 systemd[1]: Reloading Thanos Ruler.
Jul 14 14:48:03 thanos1 thanos[21646]: level=info ts=2021-07-14T21:48:03.939113181Z caller=main.go:180 msg="caught signal. Reloading." signal=hangup
Jul 14 14:48:03 thanos1 thanos[21646]: level=info ts=2021-07-14T21:48:03.939215081Z caller=main.go:183 msg="reload dispatched."
Jul 14 14:48:03 thanos1 systemd[1]: Reloaded Thanos Ruler.
Jul 14 15:06:21 thanos1 systemd[1]: Reloading Thanos Ruler.
Jul 14 15:06:21 thanos1 systemd[1]: Reloaded Thanos Ruler.
Jul 14 15:06:21 thanos1 thanos[21646]: level=info ts=2021-07-14T22:06:21.280907675Z caller=main.go:180 msg="caught signal. Reloading." signal=hangup

From the logs, we can numFiles is dropping ever since upgrading to 0.22.0-rc0 and sending a reload. The correct numFiles is 527:

caller=rule.go:829 component=rules msg="reload rule files" numFiles=527
Jul 13 12:29:49 thanos1 thanos[10537]: level=info ts=2021-07-13T19:29:49.452315245Z caller=rule.go:829 component=rules msg="reload rule files" numFiles=527
Jul 13 12:48:24 thanos1 thanos[10537]: level=info ts=2021-07-13T19:48:24.980880791Z caller=rule.go:829 component=rules msg="reload rule files" numFiles=527
Jul 13 13:09:58 thanos1 thanos[10537]: level=info ts=2021-07-13T20:09:58.561769291Z caller=rule.go:829 component=rules msg="reload rule files" numFiles=527
Jul 13 13:22:55 thanos1 thanos[10537]: level=info ts=2021-07-13T20:22:55.36108635Z caller=rule.go:829 component=rules msg="reload rule files" numFiles=527
Jul 13 13:52:47 thanos1 thanos[10537]: level=info ts=2021-07-13T20:52:47.194159858Z caller=rule.go:829 component=rules msg="reload rule files" numFiles=527

Those are historical logs from the old version. From the upgrade to 0.22.0, we can see numFiles dropping below 527 after reloads:

Jul 13 14:24:22 thanos1 thanos[10537]: level=info ts=2021-07-13T21:24:22.613139946Z caller=rule.go:829 component=rules msg="reload rule files" numFiles=527
Jul 13 14:37:19 thanos1 thanos[35733]: level=info ts=2021-07-13T21:37:19.303444105Z caller=rule.go:800 component=rules msg="reload rule files" numFiles=137
Jul 14 10:16:34 thanos1 thanos[20495]: level=info ts=2021-07-14T17:16:34.910480984Z caller=rule.go:800 component=rules msg="reload rule files" numFiles=527
Jul 14 14:47:17 thanos1 thanos[21646]: level=info ts=2021-07-14T21:47:17.118951027Z caller=rule.go:800 component=rules msg="reload rule files" numFiles=192
@sevagh sevagh changed the title Thanos Rule (0.22.0-rc0) fails to load all rules files after reload - numFiles is dropping Thanos Rule fails to load all rules files after reload - numFiles is dropping Jul 14, 2021
@sevagh
Copy link
Contributor Author

sevagh commented Jul 14, 2021

I think duplicate of #4432

@sevagh
Copy link
Contributor Author

sevagh commented Jul 14, 2021

Like was suggested in #4432, using curl http://sv3-thanos1:10908/-/reload -X POST works better than the Linux signal for now, and Thanos successfully reloads all of the files.

@GiedriusS
Copy link
Member

Thanks for testing and the detailed report! This bug appeared somewhere around the 0.19 version so you've hit it after upgrading. Let's try to fix it with #4442

@GiedriusS
Copy link
Member

Fixed by #4442. It's now covered by tests so it's working 100% 💪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants