Add an idle timeout for the server #4760

jefferai · 2018-06-13T21:46:25Z

Because tidy operations can be long-running, this also changes all tidy
operations to behave the same operationally (kick off the process, get a
warning back, log errors to server log) and makes them all run in a
goroutine.

This could mean a sort of hard stop if Vault gets sealed because the
function won't have the read lock. This should generally be okay
(running tidy again should pick back up where it left off), but future
work could use cleanup funcs to trigger the functions to stop.

jefferai · 2018-06-14T01:26:10Z

A note for reviewers: the main test of tidy functions is in the expiration manager, but the way it's fixed up shows that the CAS stuff is working appropriately (and the way it failed earlier made it clear that both goroutines were exiting immediately). Given that they all now share the same control structure it should be applicable across the changed functions.

Another item: 10 minutes could be too long; should we make it 5? What operations take even that long with tidy out of the picture?

kalafut · 2018-06-14T05:09:58Z

My main comment is whether the context.Background() additions should be something like context.WithTimeout(context.Background(), 30*time.Minute) along with checking ctx.Done() in the tidy operation. The concern is that a problematic tidy op or storage backend could just get stuck running with no external way to stop it. I don't know how long tidy normally runs (if there even is a "normally"), but a suitably larger WithTimeout() value could be a useful guard against spinning forever. (This duration could possibly be an optional API parameter too.)

jefferai · 2018-06-14T12:39:47Z

Maybe, but the problem is picking a number that isn't arbitrary. At what point would you want to cut it off? If Vault is functioning normally, why kill tidy after 30 minutes, leaving things in a not-fully-cleaned-up state?

kalafut · 2018-06-14T14:20:48Z

Yes, if tidy durations have a very large range then context.WithTimeout will not help.

kalafut · 2018-06-14T14:25:44Z

Logging "Tidy operation {x} completed successfully" would be good if we're not doing that already somewhere (I didn't notice it in the diff but maybe it's higher up). Prior to this commit, the API call returning was effectively that message. If there are ever questions about the operation it would be good to grep the logs for started/completed pairs.

kalafut · 2018-06-14T16:10:54Z

command/server.go

@@ -935,7 +935,8 @@ CLUSTER_SYNTHESIS_COMPLETE:
 		}

 		server := &http.Server{
-			Handler: handler,
+			Handler:     handler,
+			IdleTimeout: 10 * time.Minute,


I tried to check that this is having the desired effect but haven't been successful. Using a build with a very low value (5s) and throwing a long sleep into some operation, the connection wasn't closed. I was even able to nc 127.0.0.1 8200 and the connection would stay open indefinitely.

I wrote a little Go server to experiment and found that ReadTimeout will close my nc test at the right time, but just IdleTimeout won't. I'm not sure what IdleTimeout is really doing. But replacing IdleTimeout with ReadTimeout in Vault didn't close my connection either.


briankassouf · 2018-06-14T22:07:53Z

vault/request_forwarding.go

+				// that don't successfully auth to be kicked out quickly.
+				// Cluster connections should be reliable so being marginally
+				// aggressive here is fine.
+				err = tlsConn.SetDeadline(time.Now().Add(10 * time.Second))


Replication also uses this connection and that potentially could be going over a less reliable network. Maybe it should be bumped a little bit?

jefferai · 2018-06-14

That's a good point. 30 seconds? If it disconnects it will reconnect again...

briankassouf · 2018-06-14

Works for me!

…nster because that passes more often than less monster timeouts, but really it's just flaky for reasons that likely have to do with logical.Framework testing stuff because there's no evidence of any problem other than it just not having run in time.

vishalnayak · 2018-06-15T22:14:54Z

physical/dynamodb/dynamodb_test.go

@@ -285,7 +285,7 @@ func prepareDynamoDBTestContainer(t *testing.T) (cleanup func(), retAddress stri
 		t.Fatalf("Failed to connect to docker: %s", err)
 	}

-	resource, err := pool.Run("deangiberson/aws-dynamodb-local", "latest", []string{})
+	resource, err := pool.Run("cnadiminti/dynamodb-local", "latest", []string{})


Was there a reason for this switch? If it is significant, can we have a comment as to why this is better?

kalafut · 2018-06-15

This test was failing only on Jeff's laptop... no issues for other devs or even Travis. The error didn't make a ton of sense either, since basically reading back from the container what was just written didn't work. Unclear what the true root cause was, but we assumed docker-related. Our previous container hasn't been updated by the author in 15 months, and this new one is current and well-used. When we swapped it in all platforms started passing.

jefferai · 2018-06-15

It's merged in from a branch Jim put up, but basically: for some reason on my machine some dynamodb tests were failing in ways that others could not reproduce, and while my container was supposedly up to date, the other container in general has not been kept up to date, whereas the new one has.

vishalnayak · 2018-06-15T22:32:38Z

vault/request_forwarding.go

+				err = tlsConn.SetDeadline(time.Time{})
+				if err != nil {
+					if c.logger.IsDebug() {
+						c.logger.Debug("error setting deadline for cluster connection", "error", err)


Do we want to distinguish this error from the error above?

jefferai added this to the 0.10.3 milestone Jun 13, 2018

jefferai requested review from vishalnayak and briankassouf June 14, 2018 01:26

kalafut reviewed Jun 14, 2018

jefferai added 2 commits June 14, 2018 12:14

Fix up tidy test

99c1244

jefferai force-pushed the idle-timeout branch from 7a89724 to 99c1244 Compare June 14, 2018 16:15

jefferai and others added 3 commits June 14, 2018 13:44

Merge branch 'master' into idle-timeout

69f84d0

Add deadline to cluster connections and an idle timeout to the cluster server, plus add readheader/read timeout to api server

fc1a813

Add proxy header timeout when wrapping in proxy proto

2e2e1c6

kalafut previously approved these changes Jun 14, 2018

Fix approle build

d7a5df3

jefferai dismissed kalafut’s stale review via d7a5df3 June 14, 2018 20:00

briankassouf reviewed Jun 14, 2018

jefferai and others added 8 commits June 14, 2018 18:36

Update request_forwarding.go

c44018d

Capture req locally since tests need it

0e22fa1

Add sleeps for tidy since it's now async

2fa4514

Merge branch 'master-oss' into idle-timeout

db715fb

Use deep.Equal for a dynamo test

7632e1d

Merge branch 'master-oss' into idle-timeout

6a976df

Update to a newer dynamodb-local container

5c7cf69

vishalnayak reviewed Jun 15, 2018

vishalnayak previously approved these changes Jun 15, 2018

Modernize CA testing steps

712cc87

jefferai dismissed vishalnayak’s stale review via 712cc87 June 16, 2018 19:58

jefferai merged commit f493d24 into master Jun 16, 2018

jefferai deleted the idle-timeout branch June 16, 2018 22:21
