query retries rarely cause retries because hostpoolpolicy aborts query execution #812

Open
Dieterbe opened this issue Oct 26, 2016 · 5 comments


@Dieterbe
Contributor

Dieterbe commented Oct 26, 2016

So I don't know if this works as designed or not. At least I found things counter-intuitive and spent quite some time troubleshooting what's going on.

In my client app I've implemented a config option to enable query retries, which just enables a SimpleRetryPolicy within gocql.
I then use toxiproxy to artificially slow down Cassandra, somewhat randomly, so that some queries aren't slowed down much while others are slower than the timeout. Note that in my testing I use just one Cassandra instance.
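
For reference, the setup boils down to something like this (rough sketch; the proxy address, port and timeout are just example values):

package main

import (
    "log"
    "time"

    "github.com/gocql/gocql"
)

func newSession() *gocql.Session {
    // Point the driver at the toxiproxy listener rather than Cassandra
    // directly, so the injected latency applies to every query.
    cluster := gocql.NewCluster("127.0.0.1")
    cluster.Port = 21000
    cluster.Timeout = 2 * time.Second
    // The config option in my app effectively just sets this:
    cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 5}

    session, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    return session
}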

As I've tested this with higher and higher query retry settings, I notice that only some queries are retried, and only a limited number of times, definitely not as many times as the SimpleRetryPolicy permits.
So they still return timeout errors, even though they could have been retried more often and returned a valid result instead.

The reason is that queryExecutor.executeQuery(qry ExecutableQuery) stops attempting a query when the hostpool Pick() function returns nil, instead of a valid host.
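
To illustrate the shape of the problem, here is a simplified, hypothetical sketch of that control flow (not the actual gocql code; all names below are made up):

package main

import (
    "errors"
    "fmt"
)

type host struct{ addr string }

// makePicker stands in for the host selection policy: it yields each
// available host once and then returns nil.
func makePicker(hosts []*host) func() *host {
    i := 0
    return func() *host {
        if i >= len(hosts) {
            return nil
        }
        h := hosts[i]
        i++
        return h
    }
}

// executeQuery mirrors the behaviour described above: the loop is bounded
// by the host iterator, so it gives up as soon as Pick() would return nil,
// even if the retry policy still allows more attempts.
func executeQuery(pick func() *host, numRetries int) error {
    for attempt := 0; attempt <= numRetries; attempt++ {
        h := pick()
        if h == nil {
            return errors.New("no more hosts: query aborted before retries were exhausted")
        }
        fmt.Println("attempt", attempt, "on", h.addr)
        // ...run the query here; on a timeout, continue the loop...
    }
    return errors.New("retries exhausted")
}

func main() {
    // With a single node the picker yields one host and then nil, so the
    // query is attempted only once no matter how large NumRetries is.
    pick := makePicker([]*host{{addr: "10.0.0.1:9042"}})
    fmt.Println(executeQuery(pick, 5))
}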

I thought I had read somewhere that when a hostpool becomes empty (e.g. all hosts timed out), all hosts would be re-added again, so that the query could be retried. But this does not seem to be the case.

I verified this with both the round-robin host selection policy and tokenAware with a simple hostpool backend. (I know tokenAware doesn't affect anything here because I have one node; presumably this is equivalent to using hostpool-simple directly.)

With NumRetries 3/4/5/6/... and round-robin, I saw that many queries were only tried twice (retried once) and timed out. With tokenAware + hostpool-simple, only two queries were tried twice, resulting in a timeout and "too many timeouts" errors; all queries afterwards were only tried once.

Is this normal?

@Zariel
Member

Zariel commented Oct 26, 2016

Yep, I realised whilst refactoring to add the queryExecutor that the retry policy interface is severely limiting; I had created #735.

Some ideas spring to mind:

  • Eagerly consume from the host iterator.
  • Improve the retry policy interface to return the next host for a given error + query.
  • Make the retry policy take a list of hosts or a HostIter, then have it return a query plan with methods like:
type RetryPolicy interface {
    QueryPlan(hosts []HostInfo) QueryPlan
}

type QueryPlan interface {
    NextHost(err error) (HostInfo, bool)
}

Then the query plan has a view of all the hosts it can return for a query; it can decide whether the query should be tried again on the same host, the same rack, the same DC, or cross-DC, or whether the error should not be retried at all.
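
As a rough, hypothetical sketch of how a plan could be built on those interfaces (HostInfo is stubbed out here so the example is self-contained, and all other names are made up):

package policy

// HostInfo is a stand-in for the driver's host metadata.
type HostInfo struct{ Addr string }

type RetryPolicy interface {
    QueryPlan(hosts []HostInfo) QueryPlan
}

type QueryPlan interface {
    NextHost(err error) (HostInfo, bool)
}

// simpleRetryPolicy retries up to max attempts, walking the hosts it was
// given in priority order and wrapping around if it runs out.
type simpleRetryPolicy struct{ max int }

func (p simpleRetryPolicy) QueryPlan(hosts []HostInfo) QueryPlan {
    return &simplePlan{hosts: hosts, max: p.max}
}

type simplePlan struct {
    hosts    []HostInfo
    next     int
    attempts int
    max      int
}

func (p *simplePlan) NextHost(err error) (HostInfo, bool) {
    // A real plan would inspect err here and refuse to retry errors that
    // can never succeed (e.g. an invalid query), or prefer the same host,
    // rack or DC depending on the error.
    if p.attempts >= p.max || len(p.hosts) == 0 {
        return HostInfo{}, false
    }
    h := p.hosts[p.next%len(p.hosts)]
    p.next++
    p.attempts++
    return h, true
}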

@Zariel
Member

Zariel commented Oct 29, 2016

Something I can see happening is host priority: the host selector returns a list of hosts in some priority order, and the retry policy needs to respect that as well.

@Dieterbe
Contributor Author

Isn't it a problem that by the time a query needs to be retried, the host priority list may be out of date? Hosts may have gone down, been marked slow, etc. Maybe it needs to refresh the list on every try? It seems like host selection and retry policy are very intertwined matters; maybe we should just merge them into one thing. E.g. the host policies could contain the code for how/when to retry queries.

@Zariel
Member

Zariel commented Oct 29, 2016

True, but that greatly complicates the policies and makes them much harder to implement. I don't think it's entirely a problem: if the retry policy is passed a []*HostInfo then it will know if a host is down, thanks to the shared HostInfo, but it won't get hosts that come up later. Potentially we could return downed hosts from the host selector, which can be passed through.

I don't entirely see it being a problem, as a query retry will take place in under 50ms.
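
e.g. something along these lines (a sketch only; it assumes the retry policy can check up/down state via something like gocql's HostInfo.IsUp()):

package policy

import "github.com/gocql/gocql"

// nextUpHost walks the candidate list the retry policy was handed and skips
// hosts that have since been marked down; because the *HostInfo values are
// shared with the driver, their up/down state stays current.
func nextUpHost(candidates []*gocql.HostInfo) *gocql.HostInfo {
    for _, h := range candidates {
        if h != nil && h.IsUp() {
            return h
        }
    }
    return nil // every candidate has gone down since the list was built
}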

@Dieterbe
Contributor Author

if the retry policy is passed a []*HostInfo then it will know if a host is down, thanks to the shared HostInfo, but it won't get hosts that come up later. Potentially we could return downed hosts from the host selector, which can be passed through.

But hostSelectionPolicies can be more fine-grained AFAIK and use a more nuanced criterion such as latency, rather than binary up/down. Isn't the epsilon-greedy hostpool an example of this? AFAIK it may adjust its preferences on every single query (with hundreds of qps that means every few ms).
Also, if a host was dead at first but then became live again and a connection was re-established, I would want to incorporate it into the retries.
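
For context, the tokenaware + hostpool setup I'm referring to is roughly this (a sketch based on the gocql and go-hostpool docs; constructor arguments may differ between versions):

package main

import (
    "github.com/gocql/gocql"
    "github.com/hailocab/go-hostpool"
)

func newCluster() *gocql.ClusterConfig {
    cluster := gocql.NewCluster("10.0.0.1")
    // Epsilon-greedy re-weights hosts by observed response time on every
    // result, so its preferences can shift between a query and its retry.
    cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
        gocql.HostPoolHostPolicy(
            hostpool.NewEpsilonGreedy(nil, 0, &hostpool.LinearEpsilonValueCalculator{}),
        ),
    )
    return cluster
}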

I don't entirely see it being a problem, as a query retry will take place in under 50ms

Maybe, if all your queries return within 50ms. We definitely see longer-running queries that take >1000 ms, and in fact the reason for retries in our environment is usually that we hit our timeout limit. We're still experimenting with the various settings, but at this point we have timeouts of 2000~3000ms and our queries are hitting that (unfortunately).
