query retries rarely cause retries because hostpoolpolicy aborts query execution #812

Open
Dieterbe opened this issue Oct 26, 2016 · 5 comments


@Dieterbe
Contributor

Dieterbe commented Oct 26, 2016

So I don't know if this works as designed or not. At least I found things counter-intuitive and spent quite some time troubleshooting what's going on.

In my client app I've implemented a config option to enable query retries, which just enables a SimpleRetryPolicy within gocql.
I then use toxiproxy to artificially slow down Cassandra, somewhat randomly, so that some queries aren't slowed down much while others are slower than the timeout. Note that in my testing I use just one Cassandra instance.
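
For reference, the setup boils down to something like this (rough sketch; the proxy address, port and timeout are just example values):

package main

import (
    "log"
    "time"

    "github.com/gocql/gocql"
)

func newSession() *gocql.Session {
    // Point the driver at the toxiproxy listener rather than Cassandra
    // directly, so the injected latency applies to every query.
    cluster := gocql.NewCluster("127.0.0.1")
    cluster.Port = 21000
    cluster.Timeout = 2 * time.Second
    // The config option in my app effectively just sets this:
    cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 5}

    session, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    return session
}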

As I've tested this with higher and higher query retry settings, I notice that only some queries are retried, and only a limited number of times, definitely not as many times as the SimpleRetryPolicy permits.
So they still return timeout errors, even though they could have been retried more often and returned a valid result instead.

The reason is that queryExecutor.executeQuery(qry ExecutableQuery) stops attempting a query when the hostpool Pick() function returns nil, instead of a valid host.
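
To illustrate the shape of the problem, here is a simplified, hypothetical sketch of that control flow (not the actual gocql code; all names below are made up):

package main

import (
    "errors"
    "fmt"
)

type host struct{ addr string }

// makePicker stands in for the host selection policy: it yields each
// available host once and then returns nil.
func makePicker(hosts []*host) func() *host {
    i := 0
    return func() *host {
        if i >= len(hosts) {
            return nil
        }
        h := hosts[i]
        i++
        return h
    }
}

// executeQuery mirrors the behaviour described above: the loop is bounded
// by the host iterator, so it gives up as soon as Pick() would return nil,
// even if the retry policy still allows more attempts.
func executeQuery(pick func() *host, numRetries int) error {
    for attempt := 0; attempt <= numRetries; attempt++ {
        h := pick()
        if h == nil {
            return errors.New("no more hosts: query aborted before retries were exhausted")
        }
        fmt.Println("attempt", attempt, "on", h.addr)
        // ...run the query here; on a timeout, continue the loop...
    }
    return errors.New("retries exhausted")
}

func main() {
    // With a single node the picker yields one host and then nil, so the
    // query is attempted only once no matter how large NumRetries is.
    pick := makePicker([]*host{{addr: "10.0.0.1:9042"}})
    fmt.Println(executeQuery(pick, 5))
}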

I thought I had read somewhere that when a hostpool becomes empty (e.g. all hosts timed out), all hosts would be re-added again, so that the query could be retried. But this does not seem to be the case.

I verified this with both the round-robin host selection policy and tokenAware with a simple hostpool backend. (I know tokenAware doesn't affect anything here because I have one node; presumably this is equivalent to using hostpool-simple directly.)

With NumRetries 3/4/5/6/... and round-robin, I saw that many queries were only tried twice (retried once) and timed out. With tokenAware + hostpool-simple, only two queries were tried twice, resulting in a timeout and "too many timeouts" errors; all queries afterwards were only tried once.

Is this normal?

@Zariel
Member

Zariel commented Oct 26, 2016

Yep, I realised whilst refactoring to add the queryExecutor that the retry policy interface is severely limiting; I had created #735.

Some ideas spring to mind:

  • Eagerly consume from the host iterator.
  • Improve the retry policy interface to return the next host for a given error + query.
  • Make the retry policy take a list of hosts or a HostIter, then have it return a query plan with methods like:
type RetryPolicy interface {
    QueryPlan(hosts []HostInfo) QueryPlan
}

type QueryPlan interface {
    NextHost(err error) (HostInfo, bool)
}

Then the query plan has a view of all the hosts it can return for a query; it can decide whether the query should be tried again on the same host, the same rack, the same DC, or cross-DC, or whether the error should not be retried at all.
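
As a rough, hypothetical sketch of how a plan could be built on those interfaces (HostInfo is stubbed out here so the example is self-contained, and all other names are made up):

package policy

// HostInfo is a stand-in for the driver's host metadata.
type HostInfo struct{ Addr string }

type RetryPolicy interface {
    QueryPlan(hosts []HostInfo) QueryPlan
}

type QueryPlan interface {
    NextHost(err error) (HostInfo, bool)
}

// simpleRetryPolicy retries up to max attempts, walking the hosts it was
// given in priority order and wrapping around if it runs out.
type simpleRetryPolicy struct{ max int }

func (p simpleRetryPolicy) QueryPlan(hosts []HostInfo) QueryPlan {
    return &simplePlan{hosts: hosts, max: p.max}
}

type simplePlan struct {
    hosts    []HostInfo
    next     int
    attempts int
    max      int
}

func (p *simplePlan) NextHost(err error) (HostInfo, bool) {
    // A real plan would inspect err here and refuse to retry errors that
    // can never succeed (e.g. an invalid query), or prefer the same host,
    // rack or DC depending on the error.
    if p.attempts >= p.max || len(p.hosts) == 0 {
        return HostInfo{}, false
    }
    h := p.hosts[p.next%len(p.hosts)]
    p.next++
    p.attempts++
    return h, true
}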

@Zariel
Member

Zariel commented Oct 29, 2016

Something I can see happening is host priority: the host selector returns a list of hosts in some priority order, and the retry policy needs to respect that as well.

@Dieterbe
Contributor Author

Isn't it a problem that by the time a query needs to be retried, the host priority list may be out of date? Hosts may have gone down, been marked slow, etc. Maybe it needs to refresh the list on every try? It seems like host selection and retry policy are very intertwined matters; maybe we should just merge them into one thing. E.g. the host policies could contain the code for how/when to retry queries.

@Zariel
Member

Zariel commented Oct 29, 2016

True, but that greatly complicates the policies and makes them much harder to implement. I don't think it's entirely a problem: if the retry policy is passed a []*HostInfo then it will know if a host is down, thanks to the shared HostInfo, but it won't get hosts that come up later. Potentially we could return downed hosts from the host selector, which can be passed through.

I don't entirely see it being a problem, as a query retry will take place in under 50ms.
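
e.g. something along these lines (a sketch only; it assumes the retry policy can check up/down state via something like gocql's HostInfo.IsUp()):

package policy

import "github.com/gocql/gocql"

// nextUpHost walks the candidate list the retry policy was handed and skips
// hosts that have since been marked down; because the *HostInfo values are
// shared with the driver, their up/down state stays current.
func nextUpHost(candidates []*gocql.HostInfo) *gocql.HostInfo {
    for _, h := range candidates {
        if h != nil && h.IsUp() {
            return h
        }
    }
    return nil // every candidate has gone down since the list was built
}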

@Dieterbe
Contributor Author

if the retry policy is passed a []*HostInfo then it will know if a host is down, thanks to the shared HostInfo, but it won't get hosts that come up later. Potentially we could return downed hosts from the host selector, which can be passed through.

But hostSelectionPolicies can be more fine-grained AFAIK and use a more nuanced criterion such as latency, rather than binary up/down. Isn't the epsilon-greedy hostpool an example of this? AFAIK it may adjust its preferences on every single query (with hundreds of qps that means every few ms).
Also, if a host was dead at first but then became live again and a connection was re-established, I would want to incorporate it into the retries.
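
For context, the tokenaware + hostpool setup I'm referring to is roughly this (a sketch based on the gocql and go-hostpool docs; constructor arguments may differ between versions):

package main

import (
    "github.com/gocql/gocql"
    "github.com/hailocab/go-hostpool"
)

func newCluster() *gocql.ClusterConfig {
    cluster := gocql.NewCluster("10.0.0.1")
    // Epsilon-greedy re-weights hosts by observed response time on every
    // result, so its preferences can shift between a query and its retry.
    cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
        gocql.HostPoolHostPolicy(
            hostpool.NewEpsilonGreedy(nil, 0, &hostpool.LinearEpsilonValueCalculator{}),
        ),
    )
    return cluster
}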

I don't entirely see it being a problem, as a query retry will take place in under 50ms

Maybe, if all your queries return within 50ms. We definitely see longer-running queries that take >1000 ms, and in fact the reason for retries in our environment is usually that we hit our timeout limit. We're still experimenting with the various settings, but at this point we have timeouts of 2000~3000ms and our queries are hitting that (unfortunately).
