Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an idle timeout for the server #4760

Merged
merged 15 commits into from
Jun 16, 2018
Merged

Add an idle timeout for the server #4760

merged 15 commits into from
Jun 16, 2018

Conversation

jefferai
Copy link
Member

Because tidy operations can be long-running, this also changes all tidy
operations to behave the same operationally (kick off the process, get a
warning back, log errors to server log) and makes them all run in a
goroutine.

This could mean a sort of hard stop if Vault gets sealed because the
function won't have the read lock. This should generally be okay
(running tidy again should pick back up where it left off), but future
work could use cleanup funcs to trigger the functions to stop.

@jefferai jefferai added this to the 0.10.3 milestone Jun 13, 2018
@jefferai
Copy link
Member Author

A note for reviewers: the main test of tidy functions is in the expiration manager, but the way it's fixed up shows that the CAS stuff is working appropriately (and the way it failed earlier made it clear that both goroutines were exiting immediately). Given that they all now share the same control structure it should be applicable across the changed functions.

Another item: 10 minutes could be too long; should we make it 5? What operations take even that long with tidy out of the picture?

@kalafut
Copy link
Contributor

kalafut commented Jun 14, 2018

My main comment is whether the context.Background() additions should be something like context.WithTimeout(context.Background(), 30*time.Minute) along with checking ctx.Done() in the tidy operation. The concern is that a problematic tidy op or storage backend could just get stuck running with no external way to stop it. I don't know how long tidy normally runs (if there even is a "normally"), but a suitably larger WithTimeout() value could be a useful guard against spinning forever. (This duration could possibly be an optional API parameter too.)

@jefferai
Copy link
Member Author

Maybe, the problem is picking a number that isn't arbitrary. At what point would you want to cut it off? If Vault is functioning normally, why kill tidy after 30 minutes, leaving things in a not-fully-cleaned-up state?

@kalafut
Copy link
Contributor

kalafut commented Jun 14, 2018

Yes, if tidy durations have a very large range then context.WithTimeout will not help.

@kalafut
Copy link
Contributor

kalafut commented Jun 14, 2018

Logging "Tidy operation {x} completed successfully" would be good if we're not doing that already somewhere (I didn't notice it in the diff but maybe it's higher up). Prior to this commit, the API call returning was effectively that message. If there are ever questions about the operation it would be good to grep the logs for started/completed pairs.

@@ -935,7 +935,8 @@ CLUSTER_SYNTHESIS_COMPLETE:
}

server := &http.Server{
Handler: handler,
Handler: handler,
IdleTimeout: 10 * time.Minute,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to check that this is having the desired effect but haven't been successful. Using a build with a very low value (5s) and throwing a long sleep into some operation, the connection wasn't closed. I was even able to nc 127.0.0.1 8200 and the connection would stay open indefinitely.

I wrote a little Go server to experiment and found that ReadTimeout will close my nc test at the right time, but just IdleTimeout won't. I'm not sure what IdleTimeout is really doing. But replacing IdleTimeout with ReadTimeout in Vault didn't close my connection either.

Because tidy operations can be long-running, this also changes all tidy
operations to behave the same operationally (kick off the process, get a
warning back, log errors to server log) and makes them all run in a
goroutine.

This could mean a sort of hard stop if Vault gets sealed because the
function won't have the read lock. This should generally be okay
(running tidy again should pick back up where it left off), but future
work could use cleanup funcs to trigger the functions to stop.
kalafut
kalafut previously approved these changes Jun 14, 2018
// that don't successfully auth to be kicked out quickly.
// Cluster connections should be reliable so being marginally
// aggressive here is fine.
err = tlsConn.SetDeadline(time.Now().Add(10 * time.Second))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replication also uses this connection and that potentially could be going over a less reliable network. Maybe it should be bumped a little bit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. 30 seconds? If it disconnects it will reconnect again...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me!

jefferai and others added 8 commits June 14, 2018 18:36
@@ -285,7 +285,7 @@ func prepareDynamoDBTestContainer(t *testing.T) (cleanup func(), retAddress stri
t.Fatalf("Failed to connect to docker: %s", err)
}

resource, err := pool.Run("deangiberson/aws-dynamodb-local", "latest", []string{})
resource, err := pool.Run("cnadiminti/dynamodb-local", "latest", []string{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason for this switch? If it is significant, can we have a comment as to why this is better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was failing only on Jeff's laptop... no issues for other devs or even Travis. The error didn't make a ton of sense either, since basically reading back from the container what was just written didn't work. Unclear what the true root cause was, but we assumed docker-related. Our previous container hasn't been updated by the author in 15 months, and this new one is current and well-used. When we swapped it in all platforms started passing.

Copy link
Member Author

@jefferai jefferai Jun 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's merged in from a branch Jim put up, but basically: for some reason on my machine some dynamodb tests were failing in ways that others could not reproduce, and while my container was supposedly up to date, the other container in general has not been kept up to date, whereas the new one has.

err = tlsConn.SetDeadline(time.Time{})
if err != nil {
if c.logger.IsDebug() {
c.logger.Debug("error setting deadline for cluster connection", "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to distinguish this error from the error above?

vishalnayak
vishalnayak previously approved these changes Jun 15, 2018
@jefferai jefferai merged commit f493d24 into master Jun 16, 2018
@jefferai jefferai deleted the idle-timeout branch June 16, 2018 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants