
DigitalOcean www server #3424

Open
richardlau opened this issue Jul 14, 2023 · 9 comments

@richardlau
Member

It doesn't look like we have a tracking issue for this, although much has been discussed, spread across several Slack threads. See also nodejs/TSC#1416.

Summary

Our DigitalOcean-hosted droplet for our www server (one of two servers behind a Cloudflare load balancer) has become very unreliable this year, and seemingly worse: over the last few weeks it "works" for about a day (or less) and then runs out of file descriptors (error messages are visible in the nginx error logs). Cloudflare then believes the server to be unhealthy (= it cannot reach the /traffic-manager endpoint?) and switches over to the other server (called Joyent, but now residing on Equinix Metal).

Prior to the last few weeks the "unhealthy" state was temporary -- eventually CF would decide the server was healthy again and switch back to the DO server. Now, however, the DO server remains unhealthy in CF until we restart nginx.
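For context, fd exhaustion typically shows up in the nginx error log as EMFILE ("Too many open files") failures on accept(). A quick way to count the occurrences -- the log path (/var/log/nginx/error.log) and the sample log lines below are illustrative, not copied from the actual server:

```shell
# Sketch: count fd-exhaustion errors in an nginx error log.
# The real log path is an assumption; we demo against a sample file
# so the snippet is self-contained.
log=$(mktemp)
cat > "$log" <<'EOF'
2023/07/14 10:01:02 [crit] 1234#0: accept4() failed (24: Too many open files)
2023/07/14 10:01:03 [crit] 1234#0: accept4() failed (24: Too many open files)
EOF
count=$(grep -c 'Too many open files' "$log")
echo "fd-exhaustion errors: $count"
rm -f "$log"
```

On the real server the same grep over the actual error log would show how often the droplet is hitting the limit between restarts.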

@richardlau
Member Author

Over the last few weeks, due to the DO server not automatically recovering without intervention, we've been predominantly serving from Equinix Metal (Joyent).
[image attachment]

AFAICT the Equinix Metal server is not suffering from the same issues as the DO server. While we do occasionally get load balancer alert emails from CF through to the build alias, it's nowhere near the frequency at which we were getting them for the DO server.

I don't think we've mirrored all of the nginx tweaks made on the DO server over to the Equinix Metal one, so it might be worth looking at the differences there. In particular, I think the connection limits are lower/not set on the Equinix Metal server. Another difference between the two servers is that nightly/v8-canary/release builds are pushed (from our release machines via scp) to the DO server, while an rsyncmirror.service on the Equinix Metal server periodically pulls things from the DO one.
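For comparison purposes, the kind of per-client connection limiting and fd-ceiling tuning we'd want consistent across both servers might look like this in nginx. The zone name and all numeric values here are illustrative assumptions, not the settings actually deployed on either server:

```nginx
# Illustrative values only -- not the actual DO/Equinix configuration.
worker_rlimit_nofile 65536;          # raise the per-worker open-file ceiling

events {
    worker_connections 8192;         # keep well below worker_rlimit_nofile
}

http {
    # Track concurrent connections per client IP (10 MB zone)
    limit_conn_zone $binary_remote_addr zone=peraddr:10m;

    server {
        limit_conn peraddr 20;       # cap concurrent connections per IP
    }
}
```

Diffing the two servers' nginx.conf for these directives (and for worker_rlimit_nofile in particular) would show whether the DO tweaks ever made it to Equinix Metal.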

@richardlau
Member Author

Oh, and while I have no evidence suggesting it would solve/address any of the current issues, we really should plan how we're going to update the server to a later OS (and probably nginx too, as I assume the one in the apt repository is old). It may be worth weighing creating a replacement server from scratch against a risky in-place upgrade of the existing server(s).

@targos
Member

targos commented Jul 14, 2023

It may be worth considering creating a replacement server from scratch vs a risky upgrade of the existing server(s).

Absolutely agree.

@ovflowd
Member

ovflowd commented Jul 15, 2023

(=cannot reach the /traffic-manager endpoint?) and switches over to the other server)

Which is even worse, because that endpoint is a pure HTTP response with no file access -- if the server can't even handle that...

@ovflowd
Member

ovflowd commented Jul 15, 2023

It may be worth considering creating a replacement server from scratch vs a risky upgrade of the existing server(s).

Big +1

@MoLow
Member

MoLow commented Jul 16, 2023

Adding this to the build agenda so we can discuss how to proceed.

@targos
Member

targos commented Jul 17, 2023

(=cannot reach the /traffic-manager endpoint?) and switches over to the other server)

Which is even worse because that endpoint is a pure HTTP response with no file access, and for not being able to handle that...

As I understand it, the problem is that nginx reaches the maximum open files limit and cannot accept new connections (including those that come from the CF load balancer health checks).
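A quick way to sanity-check the limits in play: nginx workers are subject to the ordinary RLIMIT_NOFILE mechanism unless worker_rlimit_nofile overrides it in nginx.conf, so the shell's ulimit values are a reasonable first look (this inspects the current shell, not a running worker; that's an assumption about the droplet's setup):

```shell
# Inspect the soft/hard open-file limits for the current shell.
# nginx workers inherit the same RLIMIT_NOFILE mechanism unless
# worker_rlimit_nofile overrides it.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft fd limit: $soft"
echo "hard fd limit: $hard"
# For a running worker, the effective limit can be read from the standard
# Linux procfs file /proc/<worker-pid>/limits.
```

If the soft limit is low (e.g. the common default of 1024), the droplet would exhaust it quickly under load balancer traffic, which matches the observed failure mode.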

@richardlau
Member Author

Just on the point re. creating a new server -- our existing server was created five years ago and is on the basic plan (perhaps that was all that was available then?). Theoretically it has a 2 Gbps maximum network throughput but I don't think I've seen the droplet hit that, even when we raised the open file limit on the droplet.

"CPU-Optimized Droplets with Premium CPUs" have a higher throughput limit of 10 Gbps but will cost more. I don't have access to (nor do I particularly want access to) billing for our DO account, so I don't know what our current droplet costs versus our credits. I don't know what our credits on the DO account are either, but I do know that we've run out in the past. If we decide to go with a larger droplet then we should loop in the OpenJS Foundation.

@targos
Member

targos commented Nov 29, 2023

I forgot to mention somewhere that when we upgraded the DO server (#3564), we also bumped it to Premium Intel CPUs.
