
Thanos planner component proposal #4458

Closed

Conversation

andrejbranch

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

@jeromeinsf

Looks related to cortexproject/cortex#4272.
Might be worth aligning the two.

Member

@bwplotka bwplotka left a comment


Amazing work! Thanks for this; sorry for the late review - I was on PTO.

@thanos-io/thanos-maintainers PTAL 🤗

Some initial comments, otherwise it looks good. Let's indeed check the Cortex plan - we are touching the same part of the codebase, and it looks like we want to solve the same problem.


## Goals

* To separate planner into its own component and have compactors communicate with the planner to get the next available plan.
Member


I think there are bigger goals behind this one. It would be nice to outline our motivation here. I think it's a prerequisite for scaling the compactor horizontally, especially within the same tenant/external labels... no? 🤗 - which essentially means breaking away from the singleton model.

CompactionPlan = 1
}
```

Member


Can we mention some alternatives?

I can think of one: coordination-free, gossip-based planning and compaction.

## Proposal
![where](../img/thanos_proposal_planner_component.png)

### Separate planner into its own component.
Member


Suggested change:
- ### Separate planner into its own component.
+ ### Step 1: Separate planner into its own component.

### Separate planner into its own component.
For the initial implementation, a reasonable amount of the current planner code could be reused. One difference is that the new planner will need to produce plans for all tenants. The planner should maintain a priority queue of available plans with fair ordering between tenants. It should also be aware of which plans are currently running and be able to handle the failure of a compaction plan gracefully. After the completion of this proposal, the planner can be updated to improve single-tenant compaction performance, following the goals of the [compaction group concurrency pull request](https://github.com/thanos-io/thanos/pull/3807).

### GRPC pubsub (bidirectional streaming)
Member


Suggested change:
- ### GRPC pubsub (bidirectional streaming)
+ ### Step 2: GRPC pubsub (bidirectional streaming)
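For illustration, here is a minimal Go sketch of what the compactor side of such a bidirectional plan stream could look like. The `Plan`, `PlanStatus`, and `PlanStream` types are hypothetical stand-ins for the proto-generated messages and stream client, not the API defined in this proposal:

```go
package compactor

// Plan and PlanStatus are illustrative stand-ins for the messages the
// planner and compactor would exchange over the stream.
type Plan struct {
	ID     string
	Tenant string
	Blocks []string
}

type PlanStatus struct {
	PlanID string
	Done   bool
	Err    string
}

// PlanStream mimics the Send/Recv pair that a generated gRPC
// bidirectional-streaming client exposes.
type PlanStream interface {
	Recv() (*Plan, error)
	Send(*PlanStatus) error
}

// Run pulls plans off the stream, executes them, and reports results back,
// so the planner always knows which plans are in flight and which failed.
func Run(stream PlanStream, compact func(Plan) error) error {
	for {
		plan, err := stream.Recv()
		if err != nil {
			return err // Stream closed or planner unreachable.
		}
		status := &PlanStatus{PlanID: plan.ID, Done: true}
		if err := compact(*plan); err != nil {
			status.Done = false
			status.Err = err.Error()
		}
		if err := stream.Send(status); err != nil {
			return err
		}
	}
}
```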


* Horizontally scaling compaction for a single tenant will be addressed following the completion of this proposal

## Proposal
Member


Can we have a ### Risks section where we explain what we do to ensure the Planner is not a single point of failure? (HA / scalability)

Contributor

@bill3tt bill3tt left a comment


Nice proposal 👍 As a community, agreeing upon an approach to refactoring and improving the compactor is long overdue.

Personally, I don't think implementing this as a separate Thanos component is the right approach. Critically, we would break existing users when they upgraded. Preferably, we would follow the pattern set by receive.

By default it will run as a single instance, but it can also be configured in separate routing and ingesting modes for advanced users. Scaling up the compactor would be an advanced feature, so we could reasonably expect the user to understand the differences and be able to configure it accordingly.

The default behaviour would be to run this work in a configurable number of local worker goroutines, so vertical scaling would bring performance benefits; the advanced feature would be to offload work to remote workers, so users could horizontally scale workloads with an HPA.

With this in mind, I think there are two sub-problems we should solve in order:

  1. Make the compactor scale vertically on one node by parallelizing compactions (see the sketch below).
  2. Make the compactor offload compaction tasks to workers.
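As a rough illustration of sub-problem 1, here is a minimal sketch of running independent compaction groups across a bounded pool of local worker goroutines. It uses golang.org/x/sync/errgroup; the `compactGroup` callback and the assumption that groups are independent are illustrative, not the current Thanos compactor code:

```go
package compactor

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// compactAllGroups runs one compaction per group, with at most `concurrency`
// compactions in flight at a time. It assumes groups are independent
// (disjoint external label sets / time ranges), which is what makes running
// them in parallel safe.
func compactAllGroups(ctx context.Context, groups []string, concurrency int, compactGroup func(context.Context, string) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(concurrency) // Bound the number of local worker goroutines.

	for _, group := range groups {
		group := group // Capture the loop variable for the goroutine below.
		g.Go(func() error {
			return compactGroup(ctx, group)
		})
	}
	// Wait returns the first error (if any) once all compactions finish.
	return g.Wait()
}
```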


## Goals

* To separate planner into its own component and have compactors communicate with the planner to get the next available plan.
Contributor


IMO this is not a goal of the proposal per se but how you propose satisfying the goal - reading this, it sounds like the goal is to make the compactor run faster? Or perhaps to increase block compaction throughput?

![where](../img/thanos_proposal_planner_component.png)

### Separate planner into its own component.
For the initial implementation, a reasonable amount of the current planner code could be reused. One difference is that the new planner will need to produce plans for all tenants. The planner should maintain a priority queue of available plans with fair ordering between tenants. It should also be aware of which plans are currently running and be able to handle the failure of a compaction plan gracefully. After the completion of this proposal, the planner can be updated to improve single-tenant compaction performance, following the goals of the [compaction group concurrency pull request](https://github.com/thanos-io/thanos/pull/3807).
Contributor


How would the planner prioritise plans? What would it be optimising for?

Member


I think we can do that in later stages - there are many ideas we want to pursue (:
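To make the queueing idea in the quoted paragraph concrete: one simple way to get fair ordering between tenants is round-robin over per-tenant FIFO queues. The sketch below assumes that model; the `Plan` type and method names are illustrative, not the proposed planner API, and prioritisation within a tenant is omitted:

```go
package planner

// Plan is a unit of compaction work for a single tenant; fields are illustrative.
type Plan struct {
	Tenant string
	Blocks []string
}

// fairQueue hands out plans round-robin across tenants so one busy tenant
// cannot starve the others.
type fairQueue struct {
	order   []string          // Tenants in round-robin order.
	pending map[string][]Plan // Per-tenant FIFO of plans not yet handed out.
	next    int               // Index of the tenant to serve next.
}

func newFairQueue() *fairQueue {
	return &fairQueue{pending: map[string][]Plan{}}
}

// Push enqueues a plan for its tenant, registering the tenant if it is new.
func (q *fairQueue) Push(p Plan) {
	if _, ok := q.pending[p.Tenant]; !ok {
		q.order = append(q.order, p.Tenant)
	}
	q.pending[p.Tenant] = append(q.pending[p.Tenant], p)
}

// Pop returns the next plan, cycling through tenants; ok is false when empty.
func (q *fairQueue) Pop() (Plan, bool) {
	for range q.order {
		tenant := q.order[q.next%len(q.order)]
		q.next++
		if plans := q.pending[tenant]; len(plans) > 0 {
			q.pending[tenant] = plans[1:]
			return plans[0], true
		}
	}
	return Plan{}, false
}
```

A real planner would additionally track which plans are currently running and requeue a plan whose compaction failed, as the proposal paragraph describes.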

@stale

stale bot commented Oct 2, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Oct 2, 2021
@stale stale bot closed this Oct 11, 2021
@yeya24 yeya24 reopened this Nov 9, 2021
@yeya24
Contributor

yeya24 commented Nov 22, 2021

It would be nice to revisit this one someday.
This is very similar to https://rockset.com/blog/remote-compactions-in-rocksdb-cloud/.
We definitely want to make the planner a separate component. And there are multiple ways to implement the workers:

  1. Have separate compaction workers listen on some compaction plan RPCs
  2. For k8s-specific environments, workers can be k8s Jobs that download blocks, compact them, and upload the final block (see the sketch below)
  3. ...
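For option 2, the worker's lifecycle is essentially download → compact → upload. A minimal sketch of that flow follows; the `BlockStore` interface and `compactFn` callback are hypothetical stubs standing in for the Thanos objstore client and TSDB compaction, not real APIs:

```go
package worker

import (
	"context"
	"fmt"
)

// BlockStore abstracts the object storage operations the worker needs;
// in Thanos this role is played by the objstore bucket client.
type BlockStore interface {
	Download(ctx context.Context, blockID, dir string) error
	Upload(ctx context.Context, dir string) error
}

// runPlan executes a single compaction plan end to end: fetch the source
// blocks, compact them into one output block, and push the result back.
// compactFn stands in for the actual TSDB compaction call.
func runPlan(ctx context.Context, store BlockStore, blockIDs []string, workDir string, compactFn func(srcDir string) (outDir string, err error)) error {
	for _, id := range blockIDs {
		if err := store.Download(ctx, id, workDir); err != nil {
			return fmt.Errorf("download block %s: %w", id, err)
		}
	}
	outDir, err := compactFn(workDir)
	if err != nil {
		return fmt.Errorf("compact: %w", err)
	}
	return store.Upload(ctx, outDir)
}
```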

@roystchiang
Contributor

Is it possible for me to continue this work?

I would like the planner part to be compatible with Cortex, so that both systems can enjoy the benefits it brings. While we flesh out whether a dedicated planner component is required, I believe both projects can already enjoy some improvement if the compactor is able to produce parallelizable plans.

@bwplotka
Member

Go for it, @roystchiang! Looks like it went stale. Feel free to take @andrejbranch's work and open your own PR; we can close this one 🤗 Thanks! We definitely need help on this!

@stale

stale bot commented Apr 16, 2022

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Apr 16, 2022
@stale stale bot closed this Apr 25, 2022