
atomicInt64Limiter WithoutSlack doesn't block #90

Closed
twelsh-aw opened this issue Jun 9, 2022 · 18 comments

@twelsh-aw

First off, thanks for the library :)

Saw that a new rate limiter was introduced that benchmarked a lot better and pulled it down to try it out.

Noticed that when running with WithoutSlack, it just allows everything through instead of waiting, because all subsequent Take() calls fall into the case `now-timeOfNextPermissionIssue > int64(t.maxSlack)`.
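
To make the failure mode concrete, here is a minimal sketch of that decision logic — a paraphrase with hypothetical names (buggyNextPermission, timeOfNext), not the library's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// buggyNextPermission paraphrases the decision described above with
// hypothetical names; it is not the library's actual code. timeOfNext is
// the stored "time of next permission issue" in nanoseconds.
func buggyNextPermission(now, timeOfNext, perRequest, maxSlack int64) (newTimeOfNext, sleep int64) {
	switch {
	case timeOfNext == 0:
		// First ever call: allow immediately.
		return now, 0
	case now-timeOfNext > maxSlack:
		// The problematic case: with WithoutSlack, maxSlack == 0, so any
		// real clock advance between calls lands here and resets the
		// schedule to "now" with zero sleep.
		return now - maxSlack, 0
	default:
		// Steady state: space permissions perRequest apart.
		newTimeOfNext = timeOfNext + perRequest
		return newTimeOfNext, newTimeOfNext - now
	}
}

func main() {
	perRequest := int64(10 * time.Millisecond) // 100 rps
	var state int64
	for i := 1; i <= 3; i++ {
		now := time.Now().UnixNano()
		next, sleep := buggyNextPermission(now, state, perRequest, 0 /* WithoutSlack */)
		state = next
		// A real limiter would sleep here; with the bug, sleep is always ~0.
		fmt.Printf("take %d: sleep %v\n", i, time.Duration(sleep))
	}
}
```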

Easiest way to repro is using your example_test.go:

  • Using rl := ratelimit.New(100) (slack=10):

```
$ go test -run Example -count=1
=== RUN   Example
--- PASS: Example (0.09s)
PASS
ok      command-line-arguments  0.207s
```

  • Using rl := ratelimit.New(100, ratelimit.WithoutSlack):

```
$ go test -run Example -count=1
--- FAIL: Example (0.01s)
got:
1 10ms
2 775µs
3 3µs
4 2µs
5 10µs
6 2µs
7 2µs
8 2µs
9 2µs
want:
1 10ms
2 10ms
3 10ms
4 10ms
5 10ms
6 10ms
7 10ms
8 10ms
9 10ms
FAIL
FAIL    command-line-arguments  0.126s
FAIL
```

  • Using the 0.2.0 rl := newAtomicBased(100, WithoutSlack):

```
$ go test -run Example -count=1
PASS
ok      go.uber.org/ratelimit   0.323s
```
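
For reference, the Example in example_test.go is presumably shaped like the sketch below (reconstructed from the got/want output above; details are assumptions, not the repo's exact code):

```go
package ratelimit_test

import (
	"fmt"
	"time"

	"go.uber.org/ratelimit"
)

// Reconstructed sketch of the Example, inferred from the got/want output
// above. At 100 rps, each Take() after the first should block for ~10ms.
func Example() {
	rl := ratelimit.New(100) // 100 requests per second

	prev := time.Now()
	for i := 0; i < 10; i++ {
		now := rl.Take()
		if i > 0 {
			fmt.Println(i, now.Sub(prev))
		}
		prev = now
	}

	// Output:
	// 1 10ms
	// 2 10ms
	// 3 10ms
	// 4 10ms
	// 5 10ms
	// 6 10ms
	// 7 10ms
	// 8 10ms
	// 9 10ms
}
```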

I am not 100% sure why the other unit tests with mocked clocks are passing, but your example test and my application tests fail consistently with this new limiter. On darwin, if that helps.

@storozhukBM
Contributor

@twelsh-aw
Yes, the issue is definitely there. I have a quick fix for it, but I also want to spend a bit more time figuring out why our tests don't catch it.

@rabbbit
Contributor

rabbbit commented Jun 9, 2022

Reverting - @storozhukBM, we can redo the whole PR (possibly adding some tests earlier?)

@storozhukBM
Contributor

@rabbbit
I figured out that our tests continue to pass even if I break the currently existing implementation.
For example, if you change ratelimit_test.go:108 to `t.clock.Sleep(interval - interval)`, the tests will still pass.

@storozhukBM
Contributor

@rabbbit
Further findings. The tests are not able to detect issues because, with the mock clock, time literally freezes between Take() calls, so when we check `now.Sub(oldState.last)` in atomicLimiter or `now - timeOfNextPermissionIssue` in atomicInt64Limiter, it always returns 0.
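
A minimal illustration of the frozen mock clock (using github.com/benbjohnson/clock directly; this is not the repo's test code):

```go
package main

import (
	"fmt"

	"github.com/benbjohnson/clock"
)

func main() {
	mock := clock.NewMock()

	// The mock clock never advances on its own: without an explicit
	// Add(), two consecutive reads return the same instant.
	t1 := mock.Now()
	t2 := mock.Now()
	fmt.Println(t2.Sub(t1)) // 0s

	// So between Take() calls in the tests, now - timeOfNextPermissionIssue
	// is always 0, and the faulty "> maxSlack" branch is never exercised.
}
```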

@storozhukBM
Contributor

@rabbbit
If we add `r.clock.Add(time.Microsecond)` between takes in `(r *runnerImpl) startTaking`, the tests start to detect the issue.
The problem is that we call startTaking from different goroutines, and "github.com/benbjohnson/clock" is not concurrency-safe, so the race detector fails when it finds races inside the mock clock state.

So we either need to find a concurrency-safe mock clock implementation, fix "github.com/benbjohnson/clock", or write our own.
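
As a sketch of the first option (purely hypothetical — lockedMock is a name invented here, not something the repo adopted), the mock could be wrapped so that concurrent advancement is serialized:

```go
package main

import (
	"sync"
	"time"

	"github.com/benbjohnson/clock"
)

// lockedMock is a hypothetical wrapper that serializes access to the
// underlying mock clock, so tests can advance it from multiple
// goroutines without tripping the race detector.
// (A real test clock would also need Sleep/Timer methods; omitted here.)
type lockedMock struct {
	mu   sync.Mutex
	mock *clock.Mock
}

func (l *lockedMock) Now() time.Time {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.mock.Now()
}

func (l *lockedMock) Add(d time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.mock.Add(d)
}

func main() {
	lm := &lockedMock{mock: clock.NewMock()}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			lm.Add(time.Microsecond) // safe to call from many goroutines
			_ = lm.Now()
		}()
	}
	wg.Wait()
}
```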

@rabbbit
Contributor

rabbbit commented Jun 9, 2022

I'm sorry, I won't be able to look at this in detail for probably the next ~10 days (limited laptop access).

In the meantime, I can revert the clock implementation change if you think that's better/more stable - the tests were (mostly?) stable there. I'm opposed to implementing a clock as part of this package :)

Looks like we'll be stuck with the old implementation for a while - a "code freeze" until we fix the tests.

@storozhukBM
Contributor

@rabbbit agree, I'll look at possible options for time mocking and will fix our tests in a follow-up PR

@storozhukBM
Contributor

@rabbbit I created a new PR as discussed: #93

rabbbit pushed a commit that referenced this issue Jul 2, 2022
This limiter was introduced and merged in PR #85.
Later, @twelsh-aw found an issue with this implementation (#90),
so @rabbbit reverted the change in #91.

Our tests did not detect this issue, so we have a separate PR, #93, that enhances our testing approach to detect potential errors better.
With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that #93 will detect the bug.
Right after that, we'll open a subsequent PR to fix this bug.
storozhukBM added a commit to storozhukBM/ratelimit that referenced this issue Jul 2, 2022
@storozhukBM
Contributor

@rabbbit

PR with the fix for this issue: #95

Also, you can see on the testing-approach PR (#93) that it is able to detect this bug.

@rabbbit
Contributor

rabbbit commented Jul 5, 2022

Ack. This is pretty high on my priority list, will try to get to this soon. Again, sorry for the delay.

rabbbit added a commit that referenced this issue Jul 6, 2022
In #90, @twelsh-aw found a bug in the new implementation. This turned out
to be caused by us mocking time.

Since time mocking will always carry some risk, this diff proposes
to expand the `examples` we're using - since they use "real" time,
they should be good enough to cover most basic cases like #90.

In particular:
- we add a "withoutSlack" test so that the exact case reported in #90
  can't recur. The no-slack option seems common enough to warrant it.
- we update the "withSlack" example to actually show how slack operates.
  Due to non-even execution times, I'm forced to round the times a bit.

Possible issues:
- test stability: I re-ran the test 1000 times without issues - the
  timing seems to be stable.
- test duration: in total we're extending the examples by 5ms, which
  shouldn't be human-noticeable.
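
A hypothetical sketch of what such a "withoutSlack" example could look like (the rounding mirrors the note above; this is not the exact code from the PR):

```go
package ratelimit_test

import (
	"fmt"
	"time"

	"go.uber.org/ratelimit"
)

// Example_withoutSlack is a hypothetical sketch of a real-clock example:
// at 100 rps with WithoutSlack, every Take() after the first must block
// for a full 10ms, so a limiter that fails to block breaks the Output.
func Example_withoutSlack() {
	rl := ratelimit.New(100, ratelimit.WithoutSlack)

	prev := rl.Take()
	for i := 1; i <= 4; i++ {
		now := rl.Take()
		// Round to the nearest millisecond so scheduling jitter doesn't
		// make the example flaky.
		fmt.Println(i, now.Sub(prev).Round(time.Millisecond))
		prev = now
	}

	// Output:
	// 1 10ms
	// 2 10ms
	// 3 10ms
	// 4 10ms
}
```
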
rabbbit added a commit that referenced this issue Jul 7, 2022
rabbbit pushed a commit that referenced this issue Jul 13, 2022
This PR fixes the issue found by @twelsh-aw with the int64-based implementation (#90).

Our tests did not detect this issue, so we have a separate PR, #93, that enhances our testing approach to detect potential errors better.
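
The shape of the fix can be sketched against the buggy branch shown earlier. This is an assumption-laden paraphrase (see #95 for the real patch): the slack branch is gated on maxSlack > 0, and the no-slack idle case resets the schedule explicitly:

```go
package main

import (
	"fmt"
	"time"
)

// fixedNextPermission mirrors the earlier buggyNextPermission sketch with
// the assumed shape of the fix; hypothetical names, not the verbatim patch.
func fixedNextPermission(now, timeOfNext, perRequest, maxSlack int64) (newTimeOfNext, sleep int64) {
	switch {
	case timeOfNext == 0:
		// First ever call: allow immediately.
		return now, 0
	case maxSlack == 0 && now-timeOfNext > perRequest:
		// WithoutSlack after an idle period: reset the schedule to now
		// instead of permitting a catch-up burst.
		return now, 0
	case maxSlack > 0 && now-timeOfNext > maxSlack:
		// With slack: cap the accumulated credit at maxSlack.
		return now - maxSlack, 0
	default:
		// Steady state: space permissions perRequest apart and sleep
		// until the issued timestamp.
		newTimeOfNext = timeOfNext + perRequest
		if d := newTimeOfNext - now; d > 0 {
			return newTimeOfNext, d
		}
		return newTimeOfNext, 0
	}
}

func main() {
	perRequest := int64(10 * time.Millisecond) // 100 rps
	var state int64
	for i := 1; i <= 3; i++ {
		now := time.Now().UnixNano()
		next, sleep := fixedNextPermission(now, state, perRequest, 0 /* WithoutSlack */)
		state = next
		time.Sleep(time.Duration(sleep)) // now actually blocks ~10ms after the first take
		fmt.Printf("take %d: slept %v\n", i, time.Duration(sleep))
	}
}
```
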
@storozhukBM
Contributor

@rabbbit We already have a fix in place. What do you think about making atomicInt64Limiter the default implementation again and letting people try it before cutting a new release tag? Or we can make this implementation public so that users can instantiate it themselves. I'd like the new testing approach to be merged first, but if you're unsure about it, we can skip it for now.

@rabbbit
Contributor

rabbbit commented Oct 30, 2022

The new atomic version seems to perform so much better that it feels like we should enable it.

However, after that, we might just have to declare a code freeze until someone else has more cycles for reviews. Sadly, this includes the new time-testing approach; I'd need to look at it in much more detail before merging it in.

@storozhukBM
Contributor

@rabbbit OK, so should I make a PR now?

@rabbbit
Contributor

rabbbit commented Oct 31, 2022

SGTM; also, thanks for pushing this through :)

@storozhukBM
Contributor

@rabbbit
Please take a look: #101

rabbbit pushed a commit that referenced this issue Oct 31, 2022
* Fix return-timestamp discrepancy between the regular atomic limiter and the int64-based one
* Make the int64-based atomic limiter the default

Long story: this was added in #85, but reverted in #91 due to #90. #95 fixed the issue, so we're moving forward with the new implementation.
@rabbbit
Contributor

rabbbit commented Oct 31, 2022

Closing as fixed - we reverted the bad change in #91, @storozhukBM fixed it in #95, but we were defaulting to the old implementation. #101 makes the new (faster) implementation the default.

rabbbit closed this as completed Oct 31, 2022
@storozhukBM
Contributor

storozhukBM commented Oct 31, 2022

@twelsh-aw this change is now merged; can you please try it again and give us your feedback?

@storozhukBM
Contributor

thank you @rabbbit

rabbbit added a commit that referenced this issue Jul 9, 2023
There were quite a few fixes in that repo, so let's pull them in.

We've had a few mock-clock bugs before (#90); I'm hoping that with the
new versions, #93 won't be necessary. I'll try reverting #95 on a branch
temporarily to see if it would have been caught.
rabbbit added a commit that referenced this issue Mar 4, 2024
rabbbit added a commit that referenced this issue May 1, 2024