query retries rarely cause retries because hostpoolpolicy aborts query execution #812
Yep, I realised this whilst refactoring to add the […]. Some ideas spring to mind:
```go
type RetryPolicy interface {
	QueryPlan(hosts []HostInfo) QueryPlan
}

type QueryPlan interface {
	NextHost(err error) (HostInfo, bool)
}
```

The query plan then has a view of all the hosts it can return for a query, so it can decide whether the query should be tried again on the same host, the same rack, the same DC, or cross-DC, or whether the error should not be retried at all.
Something I can see happening is host priority: the host selector returns a list of hosts in some priority order, and the retry policy needs to respect that as well.
Isn't it a problem that by the time a query needs to be retried, the host priority list may be out of date? Hosts may have gone down, been marked slow, etc. Maybe it needs to refresh the list on every try? It seems like host selection and retry policy are very intertwined matters; maybe we should just merge them into one thing, e.g. the host policies could contain the code for how/when to retry queries.
True, but that greatly complicates the policies and makes them much harder to implement. I don't entirely see it being a problem, as the query retry will be taking place in sub-50ms.
But hostSelectionPolicies can be more fine-grained AFAIK, and use a more nuanced criterion such as latency, rather than binary up/down. Isn't the epsilon-greedy hostpool an example of this? AFAIK it may adjust its preferences at every single query (with hundreds of qps, that means every few ms).
Maybe, if all your queries return within 50ms. We definitely see longer-running queries that take >1000 ms, and in fact the reason for retries in our environment is usually hitting our timeout limit. We're still experimenting with the various settings, but at this point we have timeouts of 2000~3000 ms and our queries are hitting that (unfortunately).
So I don't know whether this works as designed or not. At least I found things counter-intuitive and spent quite some time troubleshooting what's going on.
In my client app I've implemented a config option to enable query retries, which simply enables a SimpleRetryPolicy within gocql.
I then use toxiproxy to artificially slow down Cassandra, somewhat randomly, so that some queries aren't slowed down much, while others are slower than the timeout. Note that in my testing I use just one Cassandra instance.
As I've tested this with higher and higher query retry settings, I notice that only some queries are retried, and only a limited number of times, definitely not up to the amount permitted by the SimpleRetryPolicy. So they still return timeout errors, even though they could have been retried more often and returned a valid result instead.
The reason is that `queryExecutor.executeQuery(qry ExecutableQuery)` stops attempting a query when the hostpool `Pick()` function returns nil instead of a valid host. I thought I had read somewhere that when a hostpool becomes empty (e.g. all hosts timed out), the hosts would all be re-added, so that the query could be retried again. But this does not seem to be the case.
I verified this with both the round-robin host selection policy and tokenAware with a simple hostpool backend. (I know tokenAware doesn't affect anything here because I have one node; presumably this is equivalent to using hostpool-simple directly.)
With NumRetries 3/4/5/6/... and round-robin, I saw that many queries were only tried twice (retried once) and timed out. With tokenAware + hostpool-simple, only two queries were tried twice, resulting in a timeout and "too many timeouts" errors; all queries afterwards were only tried once.
Is this normal?