Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store: Add time & duration based partitioning #1077

Closed
wants to merge 22 commits into from
Closed

Conversation

povilasv
Copy link
Member

@povilasv povilasv commented Apr 24, 2019

Changes

V2 of https://github.com/improbable-eng/thanos/pull/957/files

Doesn't involve any deletions & adds support for durations

#814

Verification

Multiple tests added.

Validated mintime > maxtime behaviour

./thanos store --log.level=debug --min-block-start-time=2020-01-01T00:00:00Z --max-block-start-time=2019-01-01T23:59:59Z --objstore.config-file=$UBUNTU_HOME/thanos/dev-telecom.yaml
level=info ts=2019-05-24T12:07:19.750115246Z caller=flags.go:87 msg="gossip is disabled"
store command failed: invalid argument: --min-block-start-time '2020-01-01 00:00:00 +0000 UTC' can't be greater than --max-block-start-time  '2019-01-01 23:59:59 +0000 UTC'
./thanos store --log.level=debug --min-block-start-time=-10d --max-block-start-time=-15d --objstore.config-file=$UBUNTU_HOME/thanos/dev-telecom.yaml
level=info ts=2019-05-24T12:08:18.891954647Z caller=flags.go:87 msg="gossip is disabled"
store command failed: invalid argument: --min-block-start-time '-10d' can't be greater than --max-block-start-time  '-15d'

validated blocks loaded

./thanos store --log.level=debug --min-block-start-time=-10d --max-block-start-time=-5d --objstore.config-file=$UBUNTU_HOME/thanos/dev-telecom.yaml

Checked the metrics:

# HELP thanos_bucket_store_block_loads_total Total number of remote block loading attempts.
# TYPE thanos_bucket_store_block_loads_total counter
thanos_bucket_store_block_loads_total 14.0
# HELP thanos_bucket_store_blocks_loaded Number of currently loaded blocks.
# TYPE thanos_bucket_store_blocks_loaded gauge
thanos_bucket_store_blocks_loaded 7.0

Actually ran this in dev.

@povilasv povilasv changed the title Store: Add Time Or duration based partitioning Store: Add time Or duration based partitioning Apr 24, 2019
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! Worth to have some tests especially for tdv.

Good work!

Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy that we're getting things in this area rolling. In the future I could also see sharding based on external labels in the block's metadata, for example, as even if we're limiting the time with this, there are other dimensions that are essentially unbound. Block selection + time selection is probably what we ultimately want, but this is a great start.

pkg/store/bucket.go Show resolved Hide resolved
minTime := store.TimeOrDuration(cmd.Flag("min-time", "Start of time range limit to serve. Option can be a constant time in RFC3339 format or time duration relative to current time, such as -1.5d or 2h45m. Valid duration units are ms, s, m, h, d, w, y.").
Default("0000-01-01T00:00:00Z"))

maxTime := store.TimeOrDuration(cmd.Flag("max-time", "End of time range limit to serve. Option can be a constant time in RFC3339 format or time duration relative to current time, such as -1.5d or 2h45m. Valid duration units are ms, s, m, h, d, w, y.").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be more explicit whether the exact timestamp would be included or not. Maybe even an example in the docs of what a sane setup would be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, some short docs would be nice.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally, some suggestions though.

cmd/thanos/store.go Outdated Show resolved Hide resolved
cmd/thanos/store.go Outdated Show resolved Hide resolved
cmd/thanos/store.go Outdated Show resolved Hide resolved
id: id,
dir: dir,
}
if err := b.loadMeta(ctx, id); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we load meta once? I think we do it everythere in small functions ):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should cache those on disk, not sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b.loadMeta already does this:

	if _, err := os.Stat(b.dir); os.IsNotExist(err) {
		if err := os.MkdirAll(b.dir, 0777); err != nil {
			return errors.Wrap(err, "create dir")
		}
		src := path.Join(id.String(), block.MetaFilename)

		level.Debug(b.logger).Log("msg", "download file", "bucket", b.bucket, "src", src, "dir", b.dir)
		if err := objstore.DownloadFile(ctx, b.logger, b.bucket, src, b.dir); err != nil {
			return errors.Wrap(err, "download meta.json")
		}
	} else if err != nil {
		return err
	}
	meta, err := metadata.Read(b.dir)
	if err != nil {
		return errors.Wrap(err, "read meta.json")
	}

pkg/store/bucket.go Show resolved Hide resolved
cmd/thanos/store.go Show resolved Hide resolved
pkg/store/flag.go Outdated Show resolved Hide resolved
)

func TestTimeOrDurationValue(t *testing.T) {
cmd := kingpin.New("test", "test")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need kingpin for this test IMO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka but I really like it :P

I think it's and e2e test for flag parsing and as the package imports kingpin and has a special function for it, IMO it makes sense.

}

// We check for blocks in configured minTime, maxTime range
if b.meta.MinTime < s.minTime.PrometheusTimestamp() || b.meta.MaxTime > s.maxTime.PrometheusTimestamp() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to compare against a blocks mintime, or maxtime, but not both. The scenario I have in mind is something like this:

  • You have a block that covers 01:00:00 to 03:00:00.
  • You have a thanos-store process covering 00:00:00 to 02:00:00
  • You have a second thanos-store process covering 02:00:01 to 04:00:00

I think with the logic above, neither thanos-store process will load the block because it's mintime is inside the first thanos-store process' range, but the maxtime is outside of its range. For the second thanos-store you have the opposite situation. Since neither of them own the entire block, neither of them will serve it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think essentially we need this:

Suggested change
if b.meta.MinTime < s.minTime.PrometheusTimestamp() || b.meta.MaxTime > s.maxTime.PrometheusTimestamp() {
if b.meta.MaxTime < s.minTime.PrometheusTimestamp() || b.meta.MinTime > s.maxTime.PrometheusTimestamp() {

So:

Store range:                         |min --------------max|
Blocks:                  | ------ A--------| | -------- B --------|

Thanks for spotting this @claytono

Also can we have tests that would catch this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka I think in that scenario, we will end up with the same block served by two stores. For example:

Store A range: |min --- max|
Store B range              |min --- max|
Store C range                          |min --- max|
Blocks:        |----- A -----|----- B -----|----- C -----|

In this scenario, I think we want blocks A, B and C to be served by a single store. With my previous PR, I picked minTime as the determining factor, so with that PR the assignments would be like this:

  • Block A -> Store A
  • Block B -> Store A
  • Block C -> Store B

I'm not sure there is a "right" answer for whether or not we should use minTime or maxTime for assigning blocks to stores, but I think in the interest of efficiency, if store time range are contiguous, then each block should only be owned by a single store.

Copy link
Member Author

@povilasv povilasv May 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka @claytono

IMO it's fine in current version. i.e. users just need to be careful then setting right parameters.

So you set |minTime - maxTime| and store only will respond with blocks that fit that range. Making thanos store returning blocks outside the range is IMO broken behavior & it will just confuse people as Thanos store will be returning blocks outside | mnTime - maxTime|

I'll add docs for that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW there are tests for current behaviour. If we decide to change them we can refactor those.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on what your use case is. If the implementation requires that blocks must be entirely within the time range for the store, it may be impossible to pick time ranges where all blocks will be handled by a store, without having overlapping time ranges, which has other efficiency issues.

Our primary desire for time based ranges is to be able to partition by time ranges, so the behavior as implemented now doesn't work well for us. Ss the compactor generates larger and larger blocks, overlap between blocks with different labels makes it more and more likely that you're going to be serving very large blocks from multiple stores. This means you'll be loading large blocks into memory in multiple places and wasting memory. For us, memory usage of the store processes is the primary cost for Thanos.

Perhaps this should be configurable if different people want different behavior.

@povilasv povilasv changed the title Store: Add time Or duration based partitioning Store: Add time & duration based partitioning May 3, 2019
By default Thanos Store Gateway looks at all the data in Object Store and returns it based on query's time range.
You can shard Thanos Store gateway based on time or duration relative to current time.

For example setting: `--min-time=-6w` & `--max-time=-2w` will make Thanos Store Gateway look at blocks that fall within `now - 6 weeks` up to `now - 2 weeks`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now being the time of start of the process or now being query time now? As far as I can tell it's only evaluated based on .InitialSync. I feel like people are going to shoot themselves in the foot a lot with the duration based approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now being time of the sync, we run syncs every 3m. I'll update docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the docs and changed the implementation

@povilasv
Copy link
Member Author

povilasv commented May 3, 2019

@bwplotka @claytono @brancz

Maybe we should actually do filtering based on just block's start time, for duration based approach it makes sense:
Lets's say

Thanos Store1 --min-time=0 to -max-time=2w
Thanos Store2 --min-time=2w to -max-time=1m

blocks would just constantly flow from Store1 to Store2, without gaps

This would also work nicely with constant time partitioning.

@claytono
Copy link
Contributor

claytono commented May 3, 2019

@povilasv Sorry, I didn't see this newer reply before replying to the older one. Filtering on just block start time would work well for us.

@povilasv povilasv requested review from brancz and bwplotka May 7, 2019 13:14
@povilasv
Copy link
Member Author

povilasv commented May 7, 2019

I think looking at block's start time makes most sense for time based partitioning.
If we only look at start time it's really easy to figure which blocks are served by which store and blocks just flow from one store to the next one.

I've changed the implementation and updated docs & tests.

@brancz and @bwplotka PTAL

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for great work guys and sorry for delay.
TL;DR I am not sure if I agree with decision for using block startTime. More below in comments. (: I believe this is bit too confusing for end user and assuming too much. I think we should focus on shipping this in basic, non confusing form and focus on actually mitigating store mem consumption as per #448

Let me know if that makes sense. It's 1:30 AM my time 🤕

@@ -51,11 +52,23 @@ func registerStore(m map[string]setupFunc, app *kingpin.Application, name string
blockSyncConcurrency := cmd.Flag("block-sync-concurrency", "Number of goroutines to use when syncing blocks from object storage.").
Default("20").Int()

minTime := model.TimeOrDuration(cmd.Flag("min-time", "Start of time range limit to serve. Thanos Store serves only blocks, which have start time greater than this value. Option can be a constant time in RFC3339 format or time duration relative to current time, such as -1.5d or 2h45m. Valid duration units are ms, s, m, h, d, w, y.").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all we should name it properly if we choose block min time and block max time. block-min-time and same for max maybe? At some point we could have min-time for have proper time partitioning. (all blocks that overlaps are synced and only exact time range is announced and returned). Also Start of time range limit to serve. is not true then. As e.g 12:30 minTime and when block has startTime 12:29.. even though block finishes after 12:30 we don't serve metrics from it (right?)

However just to revisit the decision for block-time based partitioning instead of just time. Plus the fact that we take start time.

I have mixed feelings @claytono @povilasv

Against:

  • If both max and min are inclusive or both exclusive you use case will not work in all cases. How you will compose chain of store gateways then?
  • Blocks are just some format and it can change a lot for example during compaction. Also we don't assume any particular sync of those timestamps among different soruces (Prometheus). So someone can upload block (let's work on hours for simplicity): 11:30-13:30 and other can upload 11:31,13:31. Now if you want to spread blocks evenly among store gateway this gets quite tricky. Set block-min-time to 11:30? this will include both blocks. However what next store gateway should be then? block-max-time=11:30? This
  • How you will partition time when compactor is compacting blocks a lot? When those block time ranges are super dynamic?
  • I find it extremely not intuitive to work on startTime only. So data newer than block-min-time are NOT served if they are in block with min time older than block-min-time. I set block-max-time and still I get returned the samples AFTER block-max-time (even though we announce something via Info). That's essentially super confusing to use IMO. Maybe naming and documenting properly will help, but it's not great.
  • I think @claytono can achieve block paritioning by pure time based partitioning as well -> just make sure time range is not overlapping with not wanted blocks AFAIK

Pros:

  • We already work on blocks on store gateway?
  • We don't lookup the potentially big blocks that overlaps within single sample for some reason for given time partition (which should not that much of mem overhead IMO)

I think I partially disagree with #1077 (comment) last comment (cannot reply) from @claytono

We never load whole block to memory in store gateway. Sure, there is some certain baseline memory added for essential "index-cache", plus on each query if the time range overlaps with block we fetch matching postings and series and check overlaps against chunk itself. If we would load the whole block in memory we would not need such complex indexing AND for every query you would need TB of memory (: This being said IMO it's fine to sync extra one or more blocks just because there is some overlap.

Awesome thanks @povilasv and @claytono for very productive work here, but let's discuss it bit more @claytono as I think you are trying to workaround this #448 which might have many other, better and simplier solutions. Even though workaround for time being is better than being blocked, but those extra duplication in blocks on edges does not matter IMO. I would rather do proper time range parititioning as we intilially talked about and spent time and effort on #448 (as it is important!) as there might be huge wins by just optimizing, streaming more, limiting cardinality etc. Just to let to know Store memory consumption is known thing from very 0.1.0 and it's hitting (and costs) us all a lot, so we are actively seeking & improving it, but more hands on this would be preferable.

BTW have you tried to shard functionally instead of time based? Like store things in different buckets? What if we would add matchers for external labels that store would be exposing only (hypothetically) ?

Copy link
Member Author

@povilasv povilasv May 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all we should name it properly if we choose block min time and block max time. block-min-time and same for max maybe?

I can name it this way and add block-max-time, not a problem.

(all blocks that overlaps are synced and only exact time range is announced and returned).
If both max and min are inclusive or both exclusive you use case will not work in all cases. How you will compose chain of store gateways then?

Overlaps are not a problem for Thanos, Thanos Querier will merge responses and dedup, so you just make overlaping chain of --block-min-time & --block-max-time.

Now if you want to spread blocks evenly among store gateway this gets quite tricky.

This is PR is not about spreading the blocks evenly. I don't really care about that and I don't really see the usecase of spreading blocks evenly.

How you will partition time when compactor is compacting blocks a lot? When those block time ranges are super dynamic?

Not an issue the sync job already adds new blocks which fit the time range and removes the old ones. So some times it will go into one Thanos Store, sometimes into another.

I find it extremely not intuitive to work on startTime only.

We can add 4 flags then: --block-start-time-min-time, --block-start-time-max-time
, --block-end-time-min-time, --block-end-time-max-time

I think @claytono can achieve block paritioning by pure time based partitioning as well -> just make sure time range is not overlapping with not wanted blocks AFAIK

I don't follow, why do we actually care that time range is not overlapping?

The idea is that we need to scale Thanos Store, it's impossible to keep low response times if we don't partition Thanos Store gateway. I.e. what happens to our LRU cache if someone queries for short time period <1w and then a few people on long >3mo, etc. When you shard you can optimize for that type of a deal.

Also in general it is more approachable instead of having one big store gateway we have many little ones, which have advantage of easier scheduling (don't need a seperate node for store gw anymore), reducing query latency (every store performs smaller amount of work), etc and many more clients againts Object store, which as experience shows is faster than having a single client against bucket.

IMO block juggling is fine. Maybe I need to improve docs more about overlapping case, so that people don't try to do perfect ranges.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @povilasv. For us this is about horizontal scaling. We're already partitioning by time by having multiple buckets and creating new ones as needed. This PR will allow us to reduce a lot of manual work involved in that and automate the rest.

While there may be better solutions to #448, I think we're going to always have a need for horizontal scaling and we don't have a solution to #448 today. I definitely want to see the memory efficiency improved, but horizontally scaling has allowed us to ignore the problem for the time being. This PR gives us breathing room while that's worked out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and many more clients againts Object store, which as experience shows is faster than having a single client against bucket.

Where you get that from? (:

Ok @claytono @povilasv so maybe from different position. Let's say we have just min-time max-time ranges? Nothing block based, but purely time wise. Why is that insufficient?

My main concern is that tieing some user expierience API (store gw flag) to some hidden format that is eventually synced will cause only pain. I would like to avoid this? Having simple time range flags would be both simple, understandable and solve you use case, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where you get that from? (:

For example https://care.qumulo.com/hc/en-us/articles/360015703713-Performance-Results-of-MinIO-with-Qumulo

Take a look at any object store benchmarks, I haven't seen a benchmark which tells that if you use 1 client it's going to be have faster reads then multi client. But you can prove me wrong.

Minio case singe client was: 643 MB/s
Multi-Stream Read: 1.1-1.7GB/s

S3:
Single: 390.8MB/sec
Multi-Stream Read: 1-1.5GB/s

Nothing block based, but purely time wise. Why is that insufficient?

Again time duration based is just a simple way to shard thanos store for large deployments.

My main concern is that tieing some user expierience API (store gw flag) to some hidden format that is eventually synced will cause only pain.

Yes that's the only drawback of this solution, is that people will need to be educated on doing overlapping time ranges, or not doing it at all.

S3 has eventual consistency anyway so in any solution we would have to take similar measures.

go.mod Outdated Show resolved Hide resolved
pkg/model/timeduration.go Outdated Show resolved Hide resolved
@povilasv povilasv requested a review from bwplotka May 24, 2019 11:43
@povilasv
Copy link
Member Author

povilasv commented May 24, 2019

I've renamed the flags and test it in dev environment. There is quay.io/utilitywarehouse/thanos:time-partioning-2019-05-24-00cf917a with docker image.

This is how it looks:
Old:
sidecar
2019-05-24-142347

New with additional Thanos Store:
sidecar
short-store
2019-05-24-145226

New Config:
2x stores with:

      containers:
      - args:
        - store
        - --log.level=info
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/etc/thanos/config.yaml
        - --index-cache-size=12GB
        - --chunk-pool-size=6GB
        - --max-block-start-time=-40h # This will skip blocks, which start time < now-40h, as they are served by sidecar
        - --min-block-start-time=-5w # This will serve blocks up to now-5w

2 Store's with:

      containers:
      - name: thanos-store
        image: quay.io/utilitywarehouse/thanos:time-partioning-2019-05-24-00cf917a
        args:
        - store
        - --log.level=info
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/etc/thanos/config.yaml
        - --index-cache-size=12GB
        - --chunk-pool-size=6GB
        - --max-block-start-time=-5w

The idea behind this config is that:

thanos sidecar responds data with < 40h
thanos sidecar + thanos store responds to data with >40h && <= 48h
thanos store repsonds with >48h < -5w
another thanos store responds with >5w

As stated before overlaping blocks are fine, Thanos Querier merges them.
In default config we are constantly merging responses from Sidecar and Thanos Store Gateway, as there is no time partitioning going on.

@povilasv
Copy link
Member Author

povilasv commented May 24, 2019

I did some stupid simple "benchmarking", opened Thanos Store grafana dashboard loaded and did 3 cases :
last 2d, interval 30min
Last 7d interval 30min
Last 30d interval 30min

Current stadard config Thanos Sidecar + 2 Thanos Store, no time partioning

2d

old 2d 30min interval

7d

old 7d 30min interval

30d

old 30d 30min interval

New Config 2 Thanos Store's with short term data, 2 Thanos Store's with long term data, only Sidecar's serve < 40h.

2d
new2d 30min

7d
new 7d 30min

30d

new 30d 30min

Note that first screenshots are missing response times for some queries, sry about that

@povilasv
Copy link
Member Author

povilasv commented May 24, 2019

Lastly, IMO time-sharding is a must. Imagine if you have 10 years of data or 100 years of data, we will need mainframes to handle 100year type of queries. Sharding does solve this.

@bwplotka if you are worried about this, maybe we should mark it as experimental and do hidden flags?
Then users might try it on their own if they want it, but we wont provide guarantees around that. WDYT?

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this 👍 I can definitely see what problem this solves. I'm not sure but maybe it would be useful to add some logging at the debug level around SyncBlocks() when a block gets dropped due to these restrictions to reduce potential confusion. WDYT?

@povilasv
Copy link
Member Author

I'm continuing work on a fresh PR #1408

@povilasv povilasv closed this Aug 13, 2019
@povilasv povilasv deleted the time-partioning branch August 13, 2019 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants