Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: implement the circuit breaker pattern for asynchronous set operations in the cache client #7010

Merged
merged 6 commits into from
Feb 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#6887](https://github.com/thanos-io/thanos/pull/6887) Query Frontend: *breaking :warning:* Add tenant label to relevant exported metrics. Note that this change may cause some pre-existing custom dashboard queries to be incorrect due to the added label.
- [#7028](https://github.com/thanos-io/thanos/pull/7028) Query|Query Frontend: Add new `--query-frontend.enable-x-functions` flag to enable experimental extended functions.
- [#6884](https://github.com/thanos-io/thanos/pull/6884) Tools: Add upload-block command to upload blocks to object storage.
- [#7010](https://github.com/thanos-io/thanos/pull/7010) Cache: Added `set_async_circuit_breaker_*` to utilize the circuit breaker pattern for dynamically thresholding asynchronous set operations.

### Changed

Expand Down
14 changes: 14 additions & 0 deletions docs/components/query-frontend.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,13 @@ config:
max_get_multi_batch_size: 0
dns_provider_update_interval: 0s
auto_discovery: false
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 0
open_duration: 0s
min_requests: 0
consecutive_failures: 0
failure_percent: 0
expiration: 0s
```

Expand Down Expand Up @@ -132,6 +139,13 @@ config:
master_name: ""
max_async_buffer_size: 10000
max_async_concurrency: 20
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 10
open_duration: 5s
min_requests: 50
consecutive_failures: 5
failure_percent: 0.05
expiration: 24h0m0s
```

Expand Down
21 changes: 21 additions & 0 deletions docs/components/store.md
Original file line number Diff line number Diff line change
Expand Up @@ -325,6 +325,13 @@ config:
max_get_multi_batch_size: 0
dns_provider_update_interval: 0s
auto_discovery: false
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 0
open_duration: 0s
min_requests: 0
consecutive_failures: 0
failure_percent: 0
enabled_items: []
ttl: 0s
```
Expand All @@ -344,6 +351,13 @@ While the remaining settings are **optional**:
- `max_item_size`: maximum size of an item to be stored in memcached. This option should be set to the same value of memcached `-I` flag (defaults to 1MB) in order to avoid wasting network round trips to store items larger than the max item size allowed in memcached. If set to `0`, the item size is unlimited.
- `dns_provider_update_interval`: the DNS discovery update interval.
- `auto_discovery`: whether to use the auto-discovery mechanism for memcached.
- `set_async_circuit_breaker_config`: the configuration for the circuit breaker for asynchronous set operations.
- `enabled`: `true` to enable circuite breaker for asynchronous operations. The circuit breaker consists of three states: closed, half-open, and open. It begins in the closed state. When the total requests exceed `min_requests`, and either consecutive failures occur or the failure percentage is excessively high according to the configured values, the circuit breaker transitions to the open state. This results in the rejection of all asynchronous operations. After `open_duration`, the circuit breaker transitions to the half-open state, where it allows `half_open_max_requests` asynchronous operations to be processed in order to test if the conditions have improved. If they have not, the state transitions back to open; if they have, it transitions to the closed state. Following each 10 seconds interval in the closed state, the circuit breaker resets its metrics and repeats this cycle.
- `half_open_max_requests`: maximum number of requests allowed to pass through when the circuit breaker is half-open. If set to 0, the circuit breaker allows only 1 request.
- `open_duration`: the period of the open state after which the state of the circuit breaker becomes half-open. If set to 0, the circuit breaker utilizes the default value of 60 seconds.
- `min_requests`: minimal requests to trigger the circuit breaker, 0 signifies no requirements.
- `consecutive_failures`: consecutive failures based on `min_requests` to determine if the circuit breaker should open.
- `failure_percent`: the failure percentage, which is based on `min_requests`, to determine if the circuit breaker should open.
- `enabled_items`: selectively choose what types of items to cache. Supported values are `Postings`, `Series` and `ExpandedPostings`. By default, all items are cached.
- `ttl`: ttl to store index cache items in memcached.

Expand Down Expand Up @@ -376,6 +390,13 @@ config:
master_name: ""
max_async_buffer_size: 10000
max_async_concurrency: 20
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 10
open_duration: 5s
min_requests: 50
consecutive_failures: 5
failure_percent: 0.05
enabled_items: []
ttl: 0s
```
Expand Down
87 changes: 87 additions & 0 deletions pkg/cacheutil/cacheutil.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,29 @@ package cacheutil

import (
"context"
"time"

"github.com/pkg/errors"
"github.com/sony/gobreaker"
"golang.org/x/sync/errgroup"

"github.com/thanos-io/thanos/pkg/gate"
)

var (
errCircuitBreakerConsecutiveFailuresNotPositive = errors.New("circuit breaker: consecutive failures must be greater than 0")
errCircuitBreakerFailurePercentInvalid = errors.New("circuit breaker: failure percent must be in range (0,1]")

defaultCircuitBreakerConfig = CircuitBreakerConfig{
Enabled: false,
HalfOpenMaxRequests: 10,
OpenDuration: 5 * time.Second,
MinRequests: 50,
ConsecutiveFailures: 5,
FailurePercent: 0.05,
}
)

// doWithBatch do func with batch and gate. batchSize==0 means one batch. gate==nil means no gate.
func doWithBatch(ctx context.Context, totalSize int, batchSize int, ga gate.Gate, f func(startIndex, endIndex int) error) error {
if totalSize == 0 {
Expand Down Expand Up @@ -40,3 +57,73 @@ func doWithBatch(ctx context.Context, totalSize int, batchSize int, ga gate.Gate
}
return g.Wait()
}

// CircuitBreaker implements the circuit breaker pattern https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern.
type CircuitBreaker interface {
Execute(func() error) error
}

// CircuitBreakerConfig is the config for the circuite breaker.
type CircuitBreakerConfig struct {
// Enabled enables circuite breaker.
Enabled bool `yaml:"enabled"`

// HalfOpenMaxRequests is the maximum number of requests allowed to pass through
// when the circuit breaker is half-open.
// If set to 0, the circuit breaker allows only 1 request.
HalfOpenMaxRequests uint32 `yaml:"half_open_max_requests"`
// OpenDuration is the period of the open state after which the state of the circuit breaker becomes half-open.
// If set to 0, the circuit breaker utilizes the default value of 60 seconds.
OpenDuration time.Duration `yaml:"open_duration"`
// MinRequests is minimal requests to trigger the circuit breaker.
MinRequests uint32 `yaml:"min_requests"`
// ConsecutiveFailures represents consecutive failures based on CircuitBreakerMinRequests to determine if the circuit breaker should open.
ConsecutiveFailures uint32 `yaml:"consecutive_failures"`
// FailurePercent represents the failure percentage, which is based on CircuitBreakerMinRequests, to determine if the circuit breaker should open.
FailurePercent float64 `yaml:"failure_percent"`
}

func (c CircuitBreakerConfig) validate() error {
if !c.Enabled {
return nil
}
if c.ConsecutiveFailures == 0 {
return errCircuitBreakerConsecutiveFailuresNotPositive
}
if c.FailurePercent <= 0 || c.FailurePercent > 1 {
return errCircuitBreakerFailurePercentInvalid
}
return nil
}

type noopCircuitBreaker struct{}

func (noopCircuitBreaker) Execute(f func() error) error { return f() }

type gobreakerCircuitBreaker struct {
*gobreaker.CircuitBreaker
}

func (cb gobreakerCircuitBreaker) Execute(f func() error) error {
_, err := cb.CircuitBreaker.Execute(func() (any, error) {
return nil, f()
})
return err
}

func newCircuitBreaker(name string, config CircuitBreakerConfig) CircuitBreaker {
if !config.Enabled {
return noopCircuitBreaker{}
}
return gobreakerCircuitBreaker{gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: name,
MaxRequests: config.HalfOpenMaxRequests,
Interval: 10 * time.Second,
Timeout: config.OpenDuration,
ReadyToTrip: func(counts gobreaker.Counts) bool {
return counts.Requests >= config.MinRequests &&
(counts.ConsecutiveFailures >= uint32(config.ConsecutiveFailures) ||
float64(counts.TotalFailures)/float64(counts.Requests) >= config.FailurePercent)
},
})}
}
51 changes: 36 additions & 15 deletions pkg/cacheutil/memcached_client.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ import (
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/sony/gobreaker"
"gopkg.in/yaml.v2"

"github.com/thanos-io/thanos/pkg/discovery/dns"
Expand Down Expand Up @@ -54,6 +55,8 @@ var (
MaxGetMultiBatchSize: 0,
DNSProviderUpdateInterval: 10 * time.Second,
AutoDiscovery: false,

SetAsyncCircuitBreaker: defaultCircuitBreakerConfig,
}
)

Expand Down Expand Up @@ -141,6 +144,9 @@ type MemcachedClientConfig struct {

// AutoDiscovery configures memached client to perform auto-discovery instead of DNS resolution
AutoDiscovery bool `yaml:"auto_discovery"`

// SetAsyncCircuitBreaker configures the circuit breaker for SetAsync operations.
SetAsyncCircuitBreaker CircuitBreakerConfig `yaml:"set_async_circuit_breaker_config"`
}

func (c *MemcachedClientConfig) validate() error {
Expand All @@ -158,6 +164,9 @@ func (c *MemcachedClientConfig) validate() error {
return errMemcachedMaxAsyncConcurrencyNotPositive
}

if err := c.SetAsyncCircuitBreaker.validate(); err != nil {
return err
}
damnever marked this conversation as resolved.
Show resolved Hide resolved
return nil
}

Expand Down Expand Up @@ -195,6 +204,8 @@ type memcachedClient struct {
dataSize *prometheus.HistogramVec

p *AsyncOperationProcessor

setAsyncCircuitBreaker CircuitBreaker
}

// AddressProvider performs node address resolution given a list of clusters.
Expand Down Expand Up @@ -277,7 +288,8 @@ func newMemcachedClient(
config.MaxGetMultiConcurrency,
gate.Gets,
),
p: NewAsyncOperationProcessor(config.MaxAsyncBufferSize, config.MaxAsyncConcurrency),
p: NewAsyncOperationProcessor(config.MaxAsyncBufferSize, config.MaxAsyncConcurrency),
setAsyncCircuitBreaker: newCircuitBreaker("memcached-set-async", config.SetAsyncCircuitBreaker),
}

c.clientInfo = promauto.With(reg).NewGaugeFunc(prometheus.GaugeOpts{
Expand Down Expand Up @@ -375,22 +387,31 @@ func (c *memcachedClient) SetAsync(key string, value []byte, ttl time.Duration)
start := time.Now()
c.operations.WithLabelValues(opSet).Inc()

err := c.client.Set(&memcache.Item{
Key: key,
Value: value,
Expiration: int32(time.Now().Add(ttl).Unix()),
err := c.setAsyncCircuitBreaker.Execute(func() error {
return c.client.Set(&memcache.Item{
Key: key,
Value: value,
Expiration: int32(time.Now().Add(ttl).Unix()),
})
})
if err != nil {
// If the PickServer will fail for any reason the server address will be nil
// and so missing in the logs. We're OK with that (it's a best effort).
serverAddr, _ := c.selector.PickServer(key)
level.Debug(c.logger).Log(
"msg", "failed to store item to memcached",
"key", key,
"sizeBytes", len(value),
"server", serverAddr,
"err", err,
)
if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
level.Warn(c.logger).Log(
"msg", "circuit breaker disallows storing item in memcached",
"key", key,
"err", err)
} else {
// If the PickServer will fail for any reason the server address will be nil
// and so missing in the logs. We're OK with that (it's a best effort).
serverAddr, _ := c.selector.PickServer(key)
level.Debug(c.logger).Log(
"msg", "failed to store item to memcached",
"key", key,
"sizeBytes", len(value),
"server", serverAddr,
"err", err,
)
}
c.trackError(opSet, err)
return
}
Expand Down
Loading
Loading