Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a metric to track usage of inflight request limit. #58342

Merged
merged 1 commit into from
Jan 26, 2018

Conversation

gmarek
Copy link
Contributor

@gmarek gmarek commented Jan 16, 2018

This one is tricky. The goal is to know how 'loaded' given apiserver is before we start dropping the load, to so we need to somehow expose 'fullness' of channels.

Sadly this metric is pretty volatile so it's not clear how to do this correctly. I decided to do pre-aggregation to smoothen the metric a bit. In the current implementation the metric publishes maximum "usage" of the inflight is previous second.

If you have any ideas please share.
@smarterclayton @lavalamp @wojtek-t @liggitt @deads2k @caesarxuchao @sttts @crassirostris @hulkholden

Add apiserver metric for current inflight-request usage.

@gmarek gmarek added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 16, 2018
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 16, 2018
@gmarek gmarek added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 16, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 16, 2018
@gmarek gmarek removed the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Jan 16, 2018
@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 16, 2018
)
currentMutatingInflightRequests = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "apiserver_current_mutating_inflight_requests",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can mutating be a metric label?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#58340 (comment) - It might for cleanness, but I'm not sure if we're not paying for that with a bit of lock contention.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG, thanks! Maybe add a comment here to avoid further questions?

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 19, 2018
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 20, 2018
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 23, 2018
@gmarek gmarek assigned wojtek-t and unassigned dims and jimmidyson Jan 23, 2018
@gmarek gmarek removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2018
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 23, 2018
@gmarek gmarek force-pushed the inflight branch 2 times, most recently from 523a30a to 4205590 Compare January 23, 2018 13:01
@@ -40,6 +50,40 @@ func handleError(w http.ResponseWriter, r *http.Request, err error) {
glog.Errorf(err.Error())
}

// requestWatermark is used to trak maximal usage of inflight requests.
type requestWatermark struct {
sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a convention to not inheirt directly from sync.Mutex. So let's change to:
lock sync.Mutex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I thought that at some point we had one that said that we should do it for simple things like this.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe I'm not aware of something..
But in any case, I think what you have now is definitely good too.

@wojtek-t
Copy link
Member

/approve no-issue

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2018
@wojtek-t
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2018
@@ -93,6 +138,7 @@ func WithMaxInFlightLimit(
select {
case c <- true:
defer func() { <-c }()
watermark.record(len(nonMutatingChan), len(mutatingChan))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really want to add a blocking lock here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I update a PR. Take a look if you like it better.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2018
@@ -92,7 +137,12 @@ func WithMaxInFlightLimit(

select {
case c <- true:
defer func() { <-c }()
nonMutatingLen := len(nonMutatingChan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably only update the length of the channel we're using in this request

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


func (w *requestWatermark) record(readOnlyVal, mutatingVal int) {
w.lock.Lock()
defer w.lock.Unlock()
Copy link
Member

@liggitt liggitt Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still a blocking lock. might be a good use for non-locking updates (would want to benchmark with contention to be sure)

oldVal := atomic.LoadInt64(&w.readOnlyWatermark)
while oldVal < newVal && !atomic.CompareAndSwapInt64(&w.readOnlyWatermark, oldVal, newVal) {
  oldVal = atomic.LoadInt64(&w.readOnlyWatermark)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that 3 atomic operations are worse than lock, but I agree benchmark would be needed to be sure, but even that may change in the future versions of go. Which is why I'd want to proceed with the current version. Locking is now done after processing the request and releasing the "inflight" token, which should prevent this impacting the critical path in the apiserver. Does that sound good to you?

Copy link
Member

@liggitt liggitt Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that 3 atomic operations are worse than lock

that would only happen on requests that bumped the high water mark... the majority would be a single atomic read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is why I'd want to proceed with the current version. Locking is now done after processing the request and releasing the "inflight" token, which should prevent this impacting the critical path in the apiserver.

I don't know the http stack well enough to know if this would still block request completion. It seems bad to contend on a single write lock in the handler path for all requests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will moving this to separate go routine fix that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would only happen on requests that bumped the high water mark... the majority would be a single atomic read.

missed that this gets reset when the metric is reported. yeah... this approach wouldn't be great.

wait.Forever(func() {
watermark.lock.Lock()
defer watermark.lock.Unlock()
metrics.UpdateInflightRequestMetrics(watermark.readOnlyWatermark, watermark.mutatingWatermark)
Copy link
Member

@liggitt liggitt Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're holding a lock hanging all requests at this point... either read into local vars or use atomic.ReadInt64/atomic.StoreInt64 to avoid a critical section that has callouts to prometheus here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd need to both atomically read and write those values. I'm not sure it's worth it.

Copy link
Member

@liggitt liggitt Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just read them locally to keep the critical section small before calling Prometheus

lock
read to local vars
reset
unlock
pass local vars to Prometheus

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I was going to do this yesterday, but life attacked me before I was able to.

@gmarek gmarek force-pushed the inflight branch 2 times, most recently from f6c82c9 to 6be50a5 Compare January 25, 2018 10:55
@gmarek
Copy link
Contributor Author

gmarek commented Jan 25, 2018

Addressed all comments. @liggitt PTAL.

@gmarek
Copy link
Contributor Author

gmarek commented Jan 25, 2018

/retest

}
}

var watermark requestWatermark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since sync.Mutex is not supposed to be copied, passing this to anything would result in a copy of the lock. would be safer as var watermark = &requestWatermark{}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@liggitt
Copy link
Member

liggitt commented Jan 26, 2018

/lgtm
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gmarek, liggitt, wojtek-t

Associated issue requirement bypassed by: wojtek-t

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 55792, 58342). If you want to cherry-pick this change to another branch, please follow the instructions here.

prometheus.MustRegister(currentInflightRequests)
}

func UpdateInflightRequestMetrics(nonmutating, mutating int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric seems like it belongs in the filter package, not here. We shouldn't have to register metrics in this package. This is a weird coupling.

k8s-github-robot pushed a commit that referenced this pull request Jan 29, 2018
…-#58340-#58342-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #58340: Add apiserver metric for number of requests dropped by #58342: Add a metric to track usage of inflight request limit.

Cherry pick of #58340 #58342 on release-1.8.

#58340: Add apiserver metric for number of requests dropped by
#58342: Add a metric to track usage of inflight request limit.

```release-note
Add apiserver metric for current inflight-request usage and number of requests dropped because of inflight limit.
```
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 29, 2018
k8s-github-robot pushed a commit that referenced this pull request Jan 30, 2018
…58342-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #58340: Add apiserver metric for number of requests dropped by #58342: Add a metric to track usage of inflight request limit.

Cherry pick of #58340 #58342 on release-1.9.

#58340: Add apiserver metric for number of requests dropped by
#58342: Add a metric to track usage of inflight request limit.

```release-note
Add apiserver metric for current inflight-request usage and number of requests dropped because of inflight limit.
```
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants