
Orka Upgrade Thurs Jan 19th 2023 6pm - 9pm GMT #3112

Closed
3 tasks
AshCripps opened this issue Dec 13, 2022 · 13 comments

Comments

@AshCripps
Member

I've scheduled our Orka upgrade for 6pm - 9pm GMT on 19th Jan.

I booked it this far out because we need to do some things beforehand:

  • Make sure our images are up to date (save the state they are currently in, to save effort when bringing them back).
  • Decide what numbers of machines we want to bring back (a good time to bring in new OSes).
  • Afterwards each machine will be reset to the base image we create so all machines will need ansibling.

Notes from MacStadium:

We require a minimum 3-hour maintenance window to perform the upgrade. We have time slots (EST) to choose from Monday-Thursday.
The API will be unreachable during the maintenance window. Please pause all CI/CD functions during the maintenance window.
Prior to starting the maintenance window, if you want to preserve the changes done to a VM, make sure to save them to an image (save or commit).
Prior to the start of the maintenance window, please save/shut down any VMs running on M1/Intel nodes. Cached images on M1 nodes may be removed as part of the upgrade.
If you have any Kubernetes data in your sandbox namespace, you will need to back that data up prior to the start of the window. If you have any non-ephemeral VMs, please let us know in advance.
For upgrades on environments moving to version 2.3.0, all tokens will be purged as part of the new database performance optimization. If you are already on 2.1.0 and above this action has been completed. You may need to generate new tokens for your Orka API connections.
Orka 2.3.0 and above now offer better optimization for logs. These involve a new configuration to be set regarding retention rates. The default values are: Expiration policy is set to two weeks (336 hours). This is fully customizable. Please plan your log retention strategy to best meet your needs and use cases. Be sure to inform us of necessary configuration changes prior to the upgrade. The old log system will be deprecated in future releases.
There is no impact on your images, VM configs, or ISOs performing this upgrade.
Please upgrade your Jenkins plugin to the latest version: https://plugins.jenkins.io/macstadium-orka/ (This doesn't apply to us, as we don't use the ephemeral version. Should we?)
We will notify you as soon as the upgrade is completed!
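
As a quick sanity check on the numbers above, here is a minimal sketch that converts the booked 6pm - 9pm GMT window into EST and confirms both the 3-hour minimum and the "two weeks (336 hours)" retention default. It assumes GMT can be treated as UTC and EST as a fixed UTC-5 offset (no daylight saving in January); the variable and function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: GMT is treated as UTC, and EST is a fixed UTC-5 offset
# (no daylight saving time applies in January).
GMT = timezone.utc
EST = timezone(timedelta(hours=-5), "EST")

# Maintenance window from the issue title: Thu Jan 19th 2023, 6pm - 9pm GMT.
window_start = datetime(2023, 1, 19, 18, 0, tzinfo=GMT)
window_end = datetime(2023, 1, 19, 21, 0, tzinfo=GMT)


def to_est(moment):
    """Convert a timezone-aware datetime to the fixed EST offset."""
    return moment.astimezone(EST)


# Length of the window, to compare against the 3-hour minimum.
window_hours = (window_end - window_start) / timedelta(hours=1)

# Default log expiration policy: two weeks, expressed in hours.
retention_hours = timedelta(weeks=2) / timedelta(hours=1)

print(to_est(window_start).strftime("%a %H:%M %Z"), "to",
      to_est(window_end).strftime("%H:%M %Z"))
print("window length in hours:", window_hours)
print("default log retention in hours:", retention_hours)
```

Under those assumptions the window falls on Thursday 13:00 to 16:00 EST, which is inside the Monday-Thursday range MacStadium offers.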

I have an email notification, but it should also have gone to all members of the build infra email chain.

cc @nodejs/build

@anonrig
Member

anonrig commented Dec 15, 2022

I believe this issue can be closed due to #3116

@UlisesGascon
Member

UlisesGascon commented Dec 16, 2022

I made some notes in Slack while I was working on it, but I will consolidate them below.

TL;DR 😁

  • Confirmed that the VPN issues due to SSL are solved: Orka connectivity issues to the VPN (SSL) #3101
  • Completed redeployment of all the VMs on the corresponding nodes in Orka.
  • Check that the VMs are running at least the expected macOS version
  • Check that the VMs are accessible via SSH and that the deployment follows the inventory. This also solved: Orka macOS 10.15 are offline #3083
  • Deploy the release VM macos10.15-x64-1
  • Ensure the VMs are in good shape (versions, Ansible, etc.)
  • Ensure the VMs are available as Jenkins agents
  • Back up the VMs to simplify restoring/replacing them in the future

Extra information 📖

I deployed the testing machines on the nodes expecting them to follow the NAT config by default, but it was not clear to me how that magic works in detail. One VM appeared to be test-orka-macos10.14-x64-1 but was in reality a different machine (macos11). After a long debugging period, I assume that the current NAT setup is:

Node      Internal IP     External IP
macpro-4  10.221.188.14   199.7.167.100
macpro-5  10.221.188.15   199.7.167.101
macpro-6  10.221.188.16   199.7.167.102

So I distributed the VMs according to the expected external IPs and the proper port bindings (8822, 8823...).
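
The assumed NAT setup can be written down as a small lookup table. This is only a sketch: the node names and IPs come from the table above and the `administrator` user from the session logs, while the `OrkaNode` class, the `ssh_target` helper, and the pairing of a given port with a given node are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrkaNode:
    """One Orka node with its internal and NAT-mapped external address."""
    name: str
    internal_ip: str
    external_ip: str


# Values taken from the assumed NAT table in this comment.
NODES = {
    "macpro-4": OrkaNode("macpro-4", "10.221.188.14", "199.7.167.100"),
    "macpro-5": OrkaNode("macpro-5", "10.221.188.15", "199.7.167.101"),
    "macpro-6": OrkaNode("macpro-6", "10.221.188.16", "199.7.167.102"),
}


def ssh_target(node_name, port, user="administrator"):
    """Build the ssh command line for a VM exposed on a node's external IP.

    Which port belongs to which VM is an assumption here; check the
    inventory file for the real bindings.
    """
    node = NODES[node_name]
    return f"ssh -p {port} {user}@{node.external_ip}"


# Hypothetical example: a VM bound to port 8822 on macpro-5.
print(ssh_target("macpro-5", 8822))
```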

Here are some logs checking SSH access and the OS version per machine:

test-orka-macos10.14-x64-1
 test-orka-macos10:~ administrator$ sw_vers
 ProductName:	Mac OS X
 ProductVersion:	10.14.4
 BuildVersion:	18E2034
test-orka-macos10.14-x64-2
 test-orka-macos10:~ administrator$ sw_vers
 ProductName:	Mac OS X
 ProductVersion:	10.14.4
 BuildVersion:	18E2034
test-orka-macos10.14-x64-3
 test-orka-macos10:~ administrator$ sw_vers
 ProductName:	Mac OS X
 ProductVersion:	10.14.4
 BuildVersion:	18E2034
test-orka-macos10.15-x64-1
 administrator@test-orka-macos10 ~ % sw_vers
 ProductName:	Mac OS X
 ProductVersion:	10.15.4
 BuildVersion:	19E266
test-orka-macos10.15-x64-2
 administrator@test-orka-macos10 ~ % sw_vers
 ProductName:	Mac OS X
 ProductVersion:	10.15.4
 BuildVersion:	19E266
test-orka-macos11-x64-1
 administrator@test-orka-macos11-x64-1 ~ % sw_vers
 ProductName:	macOS
 ProductVersion:	11.6
 BuildVersion:	20G165
test-orka-macos11-x64-2
 administrator@test-orka-macos11-x64-1 ~ % sw_vers
 ProductName:	macOS
 ProductVersion:	11.6
 BuildVersion:	20G165
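
The "at least the expected version" check on these logs could be automated along the following lines. This is a minimal sketch rather than what was actually run: the helper names are made up, and the sample text is copied from the `sw_vers` output above.

```python
def parse_sw_vers(output):
    """Turn `sw_vers` output lines ("Key:<tab>Value") into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def version_tuple(version):
    """Convert a dotted version such as "10.14.4" into (10, 14, 4)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(output, minimum):
    """Return True if the reported ProductVersion is at least `minimum`."""
    reported = parse_sw_vers(output)["ProductVersion"]
    return version_tuple(reported) >= version_tuple(minimum)


# Output lines copied from the test-orka-macos10.14-x64-1 log above.
SAMPLE = "ProductName:\tMac OS X\nProductVersion:\t10.14.4\nBuildVersion:\t18E2034\n"

print(meets_minimum(SAMPLE, "10.14"))  # the VM is expected to run 10.14 or newer
```

Comparing integer tuples avoids the string-comparison trap where "10.9" would sort after "10.14".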

Opportunity 🦾

Checking the inventory, I discovered that the 199.7.167.100 node was not used at all. That gives us the opportunity to easily add more testing machines. It would also be great to discuss with @RafaelGSS and @anonrig, in a separate issue, using some of this extra computing capacity to run VMs dedicated to the @nodejs/performance experiments and benchmarking in Jenkins.

@richardlau
Member

@UlisesGascon As @AshCripps mentioned in the issue description,

  • Afterwards each machine will be reset to the base image we create so all machines will need ansibling.

This is the likely reason that you're seeing inconsistencies with what the machine is and its hostname -- the hostname will be whatever is in the base image until we rerun the Ansible ansible/playbooks/jenkins/worker/create.yml playbook.
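
One way to spot such mismatches from the session logs is to compare the host shown in the shell prompt with the inventory name. The sketch below assumes the stock macOS bash and zsh prompt layouts seen in the logs above; the regular expressions and function names are illustrative. Both shells show the hostname only up to its first dot, so a prompt host of `test-orka-macos10` cannot distinguish the 10.14 and 10.15 machines, which is why the check is a prefix match.

```python
import re

# Assumed prompt layouts, matching the logs in this thread:
#   bash: "host:dir user$ command"
#   zsh:  "user@host dir % command"
BASH_PROMPT = re.compile(r"^\s*(?P<host>[^\s:@]+):\S* \S+\$")
ZSH_PROMPT = re.compile(r"^\s*\S+@(?P<host>\S+) \S+ %")


def prompt_host(prompt_line):
    """Extract the host part of a shell prompt line, or None if unrecognised."""
    for pattern in (ZSH_PROMPT, BASH_PROMPT):
        match = pattern.match(prompt_line)
        if match:
            return match.group("host")
    return None


def hostname_consistent(inventory_name, prompt_line):
    """True if the prompt host could belong to the given inventory name.

    The prompt only shows the hostname up to the first dot, so this is a
    prefix check and can only prove a mismatch, never a full match.
    """
    host = prompt_host(prompt_line)
    return host is not None and inventory_name.startswith(host)


# The test-orka-macos11-x64-2 log above shows the base image's hostname.
print(hostname_consistent(
    "test-orka-macos11-x64-2",
    "administrator@test-orka-macos11-x64-1 ~ % sw_vers",
))
```

For the example shown, the check reports a mismatch, consistent with the hostname still being whatever was in the base image.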

@AshCripps
Member Author

Oh, I knew I was forgetting something: when you deploy a machine it takes the next available IP, so you have to either change the inventory file or deploy in IP order from the file.
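
A rough sketch of the "deploy in IP order" option: sort the inventory entries by address (and then by port, which is an added assumption) and deploy in that order. The inventory rows below are hypothetical placeholders, not the real file contents.

```python
import ipaddress

# Hypothetical inventory rows: (VM name, external IP, SSH port).
# The real assignments live in the Ansible inventory file.
INVENTORY = [
    ("test-orka-macos11-x64-1", "199.7.167.102", 8822),
    ("test-orka-macos10.14-x64-1", "199.7.167.101", 8822),
    ("test-orka-macos10.15-x64-1", "199.7.167.101", 8823),
]


def deploy_order(inventory):
    """Return VM names sorted by IP address, then by port.

    Sorting on ipaddress objects compares addresses numerically, so
    "199.7.167.9" would sort before "199.7.167.100".
    """
    ordered = sorted(
        inventory,
        key=lambda entry: (ipaddress.ip_address(entry[1]), entry[2]),
    )
    return [name for name, _ip, _port in ordered]


for vm_name in deploy_order(INVENTORY):
    print(vm_name)
```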

@UlisesGascon
Member

UlisesGascon commented Dec 17, 2022

Current status:

All the test machines are now available in Jenkins. I did some manual patching inside the 10.x machines (Jenkins tokens, known_hosts, etc.) as I faced some Ansible challenges (#3119).

Regarding the backups, I suggest waiting until we are sure that the machines are working fine and the pipelines are passing before doing a final backup 👍

Screenshot 2022-12-17 at 13 11 59

@UlisesGascon self-assigned this Dec 17, 2022

@UlisesGascon
Member

Quick update since 2022:

I think we are good with the current situation in the CI for the test-orka-macos* machines in terms of stability. Should I run the backup update so we can restore the VM images from this point in the future?

Also, we need to provision (I can do it) and re-Ansible the release machine macos10.15-x64-1, but my SSH permissions are limited to testing only. Who can re-Ansible the machine? With the fix in #3130 it should be easy.

@richardlau
Copy link
Member

richardlau commented Jan 4, 2023

I think we are good with the current situation in the CI for the test-orka-macos* machines in terms of stability. Should I run the backup update so we can restore the VM images from this point in the future?

The macOS 10.14 and 10.15 VMs still have an issue building Node.js 14 (#3131). Should be fixed by running through the manual steps in https://github.com/nodejs/build/blob/main/ansible/MANUAL_STEPS.md#install-command-line-tools-for-xcode.

Also, we need to provision (I can do it) and re-Ansible the release machine macos10.15-x64-1, but my SSH permissions are limited to testing only. Who can re-Ansible the machine? With the fix in #3130 it should be easy.

I could try re-ansibling. If anyone else wants to try, the macOS release machines will also require the manual steps in https://github.com/nodejs/build/blob/main/ansible/MANUAL_STEPS.md#macos-release-machines to be run to set up full Xcode and the signing certificates.

@richardlau
Member

@UlisesGascon We should now be good w.r.t. Xcode on the test-orka-macos* machines. Feel free to snapshot the VMs in their current state.

My offer to re-ansible the release machine stands if you re-provision it 🙂.

@UlisesGascon
Member

@richardlau I re-provisioned the machine a few minutes ago. I will do the snapshots once the release one is ready too 👍

@richardlau
Member

I've reansibled release-orka-10.15-x64-1 and ran through the manual steps in https://github.com/nodejs/build/blob/main/ansible/MANUAL_STEPS.md#macos-release-machines. For now I've disabled release-nearform-macos10.15-x64-1 to force the next iojs+release job onto the orka machine so we can verify (via the next nightly or canary build) that it is working as expected.

@richardlau
Member

The orka macOS 10.15 release machine looks good. I've reenabled the nearform one.

@UlisesGascon
Member

I will start on the backup. Based on the documentation, I will use the strategy of creating a new image from a deployed machine for all the VMs.

@UlisesGascon
Member

I have finished the backup image process. If we need more space for new images in the future, we can potentially remove the ones from 2020.

Screenshot 2023-01-11 at 11 16 43

@richardlau, I am confident in the backups, but if you want I can delete and restore one VM just to check that the backups are working.

I believe this was the last step in order to close this issue 🤔
