Ability to Scale a Teleport Cluster to support 10k IoT nodes. #3227

Closed
benarent opened this issue Dec 19, 2019 · 9 comments · Fixed by #3320

Comments

@benarent
Contributor

What happened:
When we launched Teleport 4.0, we added a number of scalability improvements and defined system requirements: https://gravitational.com/teleport/docs/faq/#whats-teleport-scalability-and-hardware-recommendations

We currently only recommend 2,000 nodes connecting through a NAT (running in IoT mode). We have a customer who will have around 10,000 nodes.

This ticket is to track the work required to support Teleport at this scale.

@JonGilmore

@benarent any traction on this issue? We're currently sitting at around 2,400 nodes and are extremely hesitant to scale further until Gravitational can confidently say Teleport can handle 10k nodes.

cc @dmart

@klizhentas
Contributor

@JonGilmore we will benchmark ASAP and get back to you

@JonGilmore

@klizhentas happy new year! Have you been able to perform any benchmarking?

@benarent
Contributor Author

benarent commented Jan 7, 2020

Following up from today's call: along with the scaling question, the team is seeing a lot of these issues in the proxy logs.

ERRO [DISCOVERY] Disconnecting connection to REMOVED:15898: discovery channel overflow at 10. reversetunnel/conn.go:169

WARN [PROXY:1]  Proxy transport failed: read tcp REMOVED:58696->REMOVED:3025: use of closed network connection *net.OpError. reversetunnel/transport.go:305

@JonGilmore

@benarent any updates on the Gravitational end on these errors?

@JonGilmore

@benarent any updates?

@benarent
Contributor Author

Hey Jon, we are still working on it internally. We'll keep you updated in Slack.

@klizhentas
Contributor

klizhentas commented Jan 24, 2020

Description

These are some benchmark results for Teleport 4.2.0 with IoT-connected nodes, using a managed AWS deployment.

Setup:

  • 2 availability zones with cross-zone load balancing enabled for proxy and auth
  • 2x m4.4xlarge auth servers
  • 2x m4.4xlarge proxy servers

Some DynamoDB metrics:

[iot-dynamo metrics screenshot]

Both auth servers and proxies have the following connection_limits:

teleport:
  connection_limits:
    max_connections: 65000
    max_users: 10000

Socket limits:

cat /proc/$(pidof teleport)/limits
Limit                     Soft Limit           Hard Limit           Units     
Max open files            65536                65536                files     
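
For reference, beyond the connection_limits in teleport.yaml, the process's own file-descriptor limit must be high enough for tens of thousands of concurrent tunnels. A minimal sketch (not Teleport code, Linux only) of checking and raising the soft limit programmatically:

```go
// Illustrative only: verify and raise this process's open-file limit up to
// the hard limit, mirroring the /proc/<pid>/limits check above.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// Raise the soft limit to the hard limit so a proxy holding tens of
	// thousands of reverse tunnels does not run out of file descriptors.
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
}
```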

Results

The system becomes unstable in edge cases, for example when all nodes restart at once.

The SQLite-backed caches lock up the proxy during the reconnect surge from the nodes:

SLOW TRANSACTION: 1.207012132s, goroutine 4948412 [running]:
runtime/debug.Stack(0xbf82aa7dfa5691cc, 0x5ba076017c, 0x371e280)
	/opt/go/src/runtime/debug/stack.go:24 +0x9d
github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).inTransaction.func1(0xbf82aa7dadffd10b, 0x5b58847658, 0x371e280, 0xc0006ec8c0)
	/gopath/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:819 +0x98
github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).inTransaction(0xc0006ec8c0, 0x235e5c0, 0xc0000ec008, 0xc0945730f8, 0x0, 0x0)
	/gopath/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:871 +0x1e5
github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).Get(0xc0006ec8c0, 0x235e5c0, 0xc0000ec008, 0xc099b5cd60, 0x15, 0x20, 0xaa6167, 0xc00065b5c0, 0xc000563c00)
	/gopath/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:571 +0xed
github.com/gravitational/teleport/lib/backend.(*Reporter).Get(0xc0008b9ce0, 0x235e5c0, 0xc0000ec008, 0xc099b5cd60, 0x15, 0x20, 0x20, 0x15, 0xc099b5cd60)
	/gopath/src/github.com/gravitational/teleport/lib/backend/report.go:130 +0x24e
github.com/gravitational/teleport/lib/backend.(*Wrapper).Get(0xc00065b5c0, 0x235e5c0, 0xc0000ec008, 0xc099b5cd60, 0x15, 0x20, 0x15, 0x3, 0x10)
	/gopath/src/github.com/gravitational/teleport/lib/backend/wrap.go:95 +0xfd
github.com/gravitational/teleport/lib/services/local.(*CA).GetCertAuthority(0xc000400450, 0x1fc046e, 0x4, 0xc04a09b860, 0x3, 0x0, 0xc03782a030, 0x2, 0x2, 0x0, ...)
	/gopath/src/github.com/gravitational/teleport/lib/services/local/trust.go:207 +0x246
github.com/gravitational/teleport/lib/cache.(*Cache).GetCertAuthority(0xc0008da000, 0x1fc046e, 0x4, 0xc04a09b860, 0x3, 0xc094573400, 0xc09440e518, 0x1, 0x1, 0x8, ...)
	/gopath/src/github.com/gravitational/teleport/lib/cache/cache.go:551 +0x1ea
github.com/gravitational/teleport/lib/reversetunnel.(*server).getTrustedCAKeysByID(0xc00029c600, 0x1fc046e, 0x4, 0xc04a09b860, 0x3, 0x6, 0xc061c72768, 0x1c11bc0, 0x1dfc740, 0xc152101980)
	/gopath/src/github.com/gravitational/teleport/lib/reversetunnel/srv.go:691 +0xc9
github.com/gravitational/teleport/lib/reversetunnel.(*server).checkHostCert(0xc00029c600, 0xc0df4a6240, 0xc10e946ab0, 0x28, 0xc04a09b860, 0x3, 0xc0825febb0, 0x7, 0x10b)
	/gopath/src/github.com/gravitational/teleport/lib/reversetunnel/srv.go:767 +0x136
github.com/gravitational/teleport/lib/reversetunnel.(*server).keyAuth(0xc00029c600, 0x236f480, 0xc0c8649b00, 0x2346800, 0xc0825febb0, 0x44a, 0x44a, 0xc1726d0091)
	/gopath/src/github.com/gravitational/teleport/lib/reversetunnel/srv.go:740 +0x5c6
github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh.(*connection).serverAuthenticate(0xc0c8649b00, 0xc0d935f380, 0x11, 0x40, 0x0)
	/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/server.go:448 +0x1bec
github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh.(*connection).serverHandshake(0xc0c8649b00, 0xc0d935f380, 0xc15284aed0, 0x2336900, 0xc0116f1cf0)
	/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/server.go:251 +0x59f
github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh.NewServerConn(0x237b420, 0xc153318b10, 0xc0000f9c30, 0x0, 0x0, 0x4, 0x0, 0x0)
	/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/server.go:182 +0xd6
github.com/gravitational/teleport/lib/sshutils.(*Server).HandleConnection(0xc0000f9ba0, 0x237b3c0, 0xc15284aed0)
	/gopath/src/github.com/gravitational/teleport/lib/sshutils/server.go:407 +0x321
created by github.com/gravitational/teleport/lib/sshutils.(*Server).acceptConnections
	/gopath/src/github.com/gravitational/teleport/lib/sshutils/server.go:364 +0x1c3

On a full proxy restart this situation holds for about 2-3 minutes, causing all 16 cores of the proxy to spike and lock up. It is largely caused by the lack of a good randomized backoff on node reconnects, so the nodes all surge-connect at once.
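
As an illustration of the missing piece, a jittered exponential backoff along these lines (a hypothetical helper, not Teleport's actual reconnect logic; names and constants are made up) would spread the redials out instead of letting every node reconnect at the same moment:

```go
// Hypothetical sketch of randomized (jittered) exponential backoff for
// node reconnects; names and constants are illustrative only.
package main

import (
	"math/rand"
	"time"
)

// reconnectDelay returns the wait before reconnect attempt n: an
// exponentially growing base capped at maxDelay, plus up to 50% random
// jitter so that thousands of nodes do not redial in lockstep.
func reconnectDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base << uint(attempt) // base, 2x, 4x, ...
	if d <= 0 || d > maxDelay {
		d = maxDelay
	}
	return d + time.Duration(rand.Int63n(int64(d)/2))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		time.Sleep(reconnectDelay(attempt, time.Second, 30*time.Second))
		// a real client would dial the proxy here and stop on success
	}
}
```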

Eventually the system stabilizes and works OK; however, this raises usability and reliability concerns, since a full restart of the proxies can happen during day-to-day operations.

@JonGilmore

@klizhentas thank you for the reply. Currently we're scaled to (6) c5.xl proxy nodes and (3) c5.2xl auth nodes and are still seeing sporadic behavior (disconnects, and not all nodes reporting when we run a tsh ls). We occasionally see SLOW TRANSACTION entries similar to the one you've pointed out above, but sometimes taking up to 20 seconds. Pinged you on Slack to hopefully set up a conversation soon.

klizhentas added a commit that referenced this issue Feb 2, 2020
This commit resolves #3227

In IoT mode, 10K nodes connect back to the proxies, putting
a lot of pressure on the proxy cache.

Before this commit, the proxy's only cache option was a persistent
SQLite-backed cache. The advantage of this cache is that proxies
can continue working after reboots even when auth servers are unavailable.

The disadvantage is that the SQLite backend breaks down under many
concurrent reads due to performance issues.

This commit introduces the new cache configuration option, 'in-memory':

```yaml
teleport:
  cache:
    # default value sqlite,
    # the only supported values are sqlite or in-memory
    type: in-memory
```

This cache mode allows two m4.4xlarge proxies to handle 10K IoT-mode connected
nodes with no issues.

The second part of the commit disables the timer-based cache reload that
caused inconsistent results when listing the 10K nodes, with servers
disappearing from the view.

The third part of the commit increases the buffering of the discovery
request channels 10x. The channels were overflowing at 10K nodes and nodes
were being disconnected. The logic no longer treats channel overflow as a
reason to close the connection. This is possible due to changes in the
discovery protocol that allow target nodes to handle missing entries,
duplicate entries, or conflicting values.
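
Roughly, the overflow handling described in the third part can be pictured as below; this is a hedged sketch with made-up names and buffer size, not the actual Teleport code:

```go
// Hypothetical sketch: queue a discovery request on a buffered channel and
// drop it when the buffer is full instead of closing the connection.
// discoveryRequest and the buffer size are illustrative only.
package main

import "log"

type discoveryRequest struct {
	Proxies []string
}

func main() {
	// Buffered roughly 10x deeper than before so bursts from many nodes fit.
	discoveryC := make(chan discoveryRequest, 100)

	send := func(r discoveryRequest) {
		select {
		case discoveryC <- r:
			// queued for delivery to the node
		default:
			// Channel is full: log and drop. The discovery protocol
			// tolerates missing or duplicate entries, so dropping is
			// safe and the connection stays open.
			log.Printf("discovery channel overflow, dropping request")
		}
	}

	send(discoveryRequest{Proxies: []string{"proxy-1"}})
	log.Printf("queued: %v", (<-discoveryC).Proxies)
}
```
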
klizhentas added a commit that referenced this issue Feb 5, 2020
klizhentas added a commit that referenced this issue Feb 6, 2020
klizhentas added a commit that referenced this issue Feb 6, 2020
klizhentas added a commit that referenced this issue Feb 6, 2020