Ability to Scale a Teleport Cluster to support 10k IoT nodes. #3227

Closed
benarent opened this issue Dec 19, 2019 · 9 comments · Fixed by #3320

Comments

@benarent
Contributor

What happened:
When we launched Teleport 4.0, we added a number of scalability improvements and defined system requirements: https://gravitational.com/teleport/docs/faq/#whats-teleport-scalability-and-hardware-recommendations

We currently only recommend 2,000 nodes connecting through a NAT (running in IoT mode). We have a customer who will have around 10,000 nodes.

This ticket is to track the work required to support Teleport at this scale.

@JonGilmore

@benarent any traction on this issue? We're currently sitting at around 2,400 nodes and are extremely hesitant to scale further until Gravitational can confidently say Teleport can handle 10k nodes.

cc @dmart

@klizhentas
Contributor

@JonGilmore we will benchmark ASAP and get back to you

@JonGilmore

@klizhentas happy new year! Have you been able to perform any benchmarking?

@benarent
Contributor Author

benarent commented Jan 7, 2020

Following up from today's call: along with the scaling question, the team is seeing a lot of these issues in the proxy logs.

ERRO [DISCOVERY] Disconnecting connection to REMOVED:15898: discovery channel overflow at 10. reversetunnel/conn.go:169

WARN [PROXY:1]  Proxy transport failed: read tcp REMOVED:58696->REMOVED:3025: use of closed network connection *net.OpError. reversetunnel/transport.go:305

@JonGilmore

@benarent any updates on the Gravitational end on these errors?

@JonGilmore

@benarent any updates?

@benarent
Contributor Author

Hey Jon, we are still working on it internally. We'll keep you updated in Slack.

@klizhentas
Contributor

klizhentas commented Jan 24, 2020

Description

These are some benchmark results for Teleport 4.2.0 with IoT-connected nodes, using a managed AWS deployment.

Setup:

  • 2 availability zones with cross-zone load balancing enabled for proxy and auth
  • 2x m4.4xlarge auth servers
  • 2x m4.4xlarge proxy servers

Some DynamoDB metrics:

[iot-dynamo metrics screenshot]

Both auth servers and proxies have the following connection_limits:

teleport:
  connection_limits:
    max_connections: 65000
    max_users: 10000

Socket limits:

cat /proc/$(pidof teleport)/limits
Limit                     Soft Limit           Hard Limit           Units     
Max open files            65536                65536                files     
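
For reference, beyond the connection_limits in teleport.yaml, the process's own file-descriptor limit must be high enough for tens of thousands of concurrent tunnels. A minimal sketch (not Teleport code, Linux only) of checking and raising the soft limit programmatically:

```go
// Illustrative only: verify and raise this process's open-file limit up to
// the hard limit, mirroring the /proc/<pid>/limits check above.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// Raise the soft limit to the hard limit so a proxy holding tens of
	// thousands of reverse tunnels does not run out of file descriptors.
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
}
```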

Results

The system becomes unstable in edge cases, for example when all nodes restart at once.

The SQLite-backed caches lock up the proxy during the reconnect surge from the nodes:

SLOW TRANSACTION: 1.207012132s, goroutine 4948412 [running]:
runtime/debug.Stack(0xbf82aa7dfa5691cc, 0x5ba076017c, 0x371e280)
	/opt/go/src/runtime/debug/stack.go:24 +0x9d
github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).inTransaction.func1(0xbf82aa7dadffd10b, 0x5b58847658, 0x371e280, 0xc0006ec8c0)
	/gopath/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:819 +0x98
github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).inTransaction(0xc0006ec8c0, 0x235e5c0, 0xc0000ec008, 0xc0945730f8, 0x0, 0x0)
	/gopath/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:871 +0x1e5
github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).Get(0xc0006ec8c0, 0x235e5c0, 0xc0000ec008, 0xc099b5cd60, 0x15, 0x20, 0xaa6167, 0xc00065b5c0, 0xc000563c00)
	/gopath/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:571 +0xed
github.com/gravitational/teleport/lib/backend.(*Reporter).Get(0xc0008b9ce0, 0x235e5c0, 0xc0000ec008, 0xc099b5cd60, 0x15, 0x20, 0x20, 0x15, 0xc099b5cd60)
	/gopath/src/github.com/gravitational/teleport/lib/backend/report.go:130 +0x24e
github.com/gravitational/teleport/lib/backend.(*Wrapper).Get(0xc00065b5c0, 0x235e5c0, 0xc0000ec008, 0xc099b5cd60, 0x15, 0x20, 0x15, 0x3, 0x10)
	/gopath/src/github.com/gravitational/teleport/lib/backend/wrap.go:95 +0xfd
github.com/gravitational/teleport/lib/services/local.(*CA).GetCertAuthority(0xc000400450, 0x1fc046e, 0x4, 0xc04a09b860, 0x3, 0x0, 0xc03782a030, 0x2, 0x2, 0x0, ...)
	/gopath/src/github.com/gravitational/teleport/lib/services/local/trust.go:207 +0x246
github.com/gravitational/teleport/lib/cache.(*Cache).GetCertAuthority(0xc0008da000, 0x1fc046e, 0x4, 0xc04a09b860, 0x3, 0xc094573400, 0xc09440e518, 0x1, 0x1, 0x8, ...)
	/gopath/src/github.com/gravitational/teleport/lib/cache/cache.go:551 +0x1ea
github.com/gravitational/teleport/lib/reversetunnel.(*server).getTrustedCAKeysByID(0xc00029c600, 0x1fc046e, 0x4, 0xc04a09b860, 0x3, 0x6, 0xc061c72768, 0x1c11bc0, 0x1dfc740, 0xc152101980)
	/gopath/src/github.com/gravitational/teleport/lib/reversetunnel/srv.go:691 +0xc9
github.com/gravitational/teleport/lib/reversetunnel.(*server).checkHostCert(0xc00029c600, 0xc0df4a6240, 0xc10e946ab0, 0x28, 0xc04a09b860, 0x3, 0xc0825febb0, 0x7, 0x10b)
	/gopath/src/github.com/gravitational/teleport/lib/reversetunnel/srv.go:767 +0x136
github.com/gravitational/teleport/lib/reversetunnel.(*server).keyAuth(0xc00029c600, 0x236f480, 0xc0c8649b00, 0x2346800, 0xc0825febb0, 0x44a, 0x44a, 0xc1726d0091)
	/gopath/src/github.com/gravitational/teleport/lib/reversetunnel/srv.go:740 +0x5c6
github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh.(*connection).serverAuthenticate(0xc0c8649b00, 0xc0d935f380, 0x11, 0x40, 0x0)
	/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/server.go:448 +0x1bec
github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh.(*connection).serverHandshake(0xc0c8649b00, 0xc0d935f380, 0xc15284aed0, 0x2336900, 0xc0116f1cf0)
	/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/server.go:251 +0x59f
github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh.NewServerConn(0x237b420, 0xc153318b10, 0xc0000f9c30, 0x0, 0x0, 0x4, 0x0, 0x0)
	/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/server.go:182 +0xd6
github.com/gravitational/teleport/lib/sshutils.(*Server).HandleConnection(0xc0000f9ba0, 0x237b3c0, 0xc15284aed0)
	/gopath/src/github.com/gravitational/teleport/lib/sshutils/server.go:407 +0x321
created by github.com/gravitational/teleport/lib/sshutils.(*Server).acceptConnections
	/gopath/src/github.com/gravitational/teleport/lib/sshutils/server.go:364 +0x1c3

On a full proxy restart this situation holds for about 2-3 minutes, causing all 16 cores of the proxy to spike and lock up. It is largely caused by the lack of a good randomized backoff on node reconnects, so the nodes all surge-connect at once.
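
As an illustration of the missing piece, a jittered exponential backoff along these lines (a hypothetical helper, not Teleport's actual reconnect logic; names and constants are made up) would spread the redials out instead of letting every node reconnect at the same moment:

```go
// Hypothetical sketch of randomized (jittered) exponential backoff for
// node reconnects; names and constants are illustrative only.
package main

import (
	"math/rand"
	"time"
)

// reconnectDelay returns the wait before reconnect attempt n: an
// exponentially growing base capped at maxDelay, plus up to 50% random
// jitter so that thousands of nodes do not redial in lockstep.
func reconnectDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base << uint(attempt) // base, 2x, 4x, ...
	if d <= 0 || d > maxDelay {
		d = maxDelay
	}
	return d + time.Duration(rand.Int63n(int64(d)/2))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		time.Sleep(reconnectDelay(attempt, time.Second, 30*time.Second))
		// a real client would dial the proxy here and stop on success
	}
}
```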

Eventually the system stabilizes and works OK; however, this raises usability and reliability concerns, since a full restart of the proxies can happen during day-to-day operations.

@JonGilmore

@klizhentas thank you for the reply. Currently we're scaled to (6) c5.xl proxy nodes and (3) c5.2xl auth nodes and are still seeing sporadic behavior (disconnects, and not all nodes reporting when we run a tsh ls). We occasionally see SLOW TRANSACTION entries similar to the one you've pointed out above, but sometimes taking up to 20 seconds. Pinged you on Slack to hopefully set up a conversation soon.

klizhentas added a commit that referenced this issue Feb 2, 2020
This commit resolves #3227

In IoT mode, 10K nodes connect back to the proxies, putting
a lot of pressure on the proxy cache.

Before this commit, the proxy's only cache option was a persistent
SQLite-backed cache. The advantage of this cache is that proxies
can continue working after reboots even when auth servers are unavailable.

The disadvantage is that the SQLite backend breaks down under many
concurrent reads due to performance issues.

This commit introduces the new cache configuration option, 'in-memory':

```yaml
teleport:
  cache:
    # default value sqlite,
    # the only supported values are sqlite or in-memory
    type: in-memory
```

This cache mode allows two m4.4xlarge proxies to handle 10K IoT-mode connected
nodes with no issues.

The second part of the commit disables the timer-based cache reload that
caused inconsistent results when listing the 10K nodes, with servers
disappearing from the view.

The third part of the commit increases the buffering of the discovery
request channels 10x. The channels were overflowing at 10K nodes and nodes
were being disconnected. The logic no longer treats channel overflow as a
reason to close the connection. This is possible due to changes in the
discovery protocol that allow target nodes to handle missing entries,
duplicate entries, or conflicting values.
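
Roughly, the overflow handling described in the third part can be pictured as below; this is a hedged sketch with made-up names and buffer size, not the actual Teleport code:

```go
// Hypothetical sketch: queue a discovery request on a buffered channel and
// drop it when the buffer is full instead of closing the connection.
// discoveryRequest and the buffer size are illustrative only.
package main

import "log"

type discoveryRequest struct {
	Proxies []string
}

func main() {
	// Buffered roughly 10x deeper than before so bursts from many nodes fit.
	discoveryC := make(chan discoveryRequest, 100)

	send := func(r discoveryRequest) {
		select {
		case discoveryC <- r:
			// queued for delivery to the node
		default:
			// Channel is full: log and drop. The discovery protocol
			// tolerates missing or duplicate entries, so dropping is
			// safe and the connection stays open.
			log.Printf("discovery channel overflow, dropping request")
		}
	}

	send(discoveryRequest{Proxies: []string{"proxy-1"}})
	log.Printf("queued: %v", (<-discoveryC).Proxies)
}
```
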
klizhentas added a commit that referenced this issue Feb 5, 2020
klizhentas added a commit that referenced this issue Feb 6, 2020
klizhentas added a commit that referenced this issue Feb 6, 2020
klizhentas added a commit that referenced this issue Feb 6, 2020