Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switchover: hardening & refactor instrumentation #23

Merged
merged 22 commits into from
Jun 27, 2019

Conversation

mastercactapus
Copy link
Member

  • Identified the issue which this PR solves.
  • Read the CONTRIBUTING document.
  • Code builds clean without any errors or warnings.
  • Added appropriate tests for any new functionality.
  • All new and existing tests passed.
  • Added comments in the code, where necessary.
  • Ran make check to catch common errors. Fixed any that came up.

Description:
The switchover implementation had become out of date and not all tables had correct triggers to allow logical replication. This PR updates the switchover shell code to create triggers dynamically to prevent that from happening in the future.

  • Removed migration-created triggers in favor of a pragmatic approach
  • Fixed some cancelation and progress bar bugs
  • Added a README covering some of the basics of the feature
  • Changed how config validation works so different URLs can be used as long as they result in connections to the correct database(s)

"encoding/hex"
)

func newDBID() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind walking me through how this is leveraged? From switchover/dbsync/sync.go it looks like it's being used to create both oldDbID and newDbID, so I'm confused how it's leveraged in the context of config verification.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Yeah, I suppose that is a little confusing. The "new" in the context of this function is just referring to creation, like http.NewServeMux creates a new http.ServeMux.

Inside sync.go oldDB things refer to the --db-url database, the one being switched away from. newDB things refer to the --db-url-next, or the one we are switching to.

Before we call it an official/hardened feature, it might be worth coming up with a better naming scheme for tracking the two databases, rather than old and new.

As far as the purpose of this function, each time the shell starts it generates a random ID for each DB, and publishes that ID to a channel on only that DB. It serves as a functional test that what the shell thinks is the next/new DB is the same one as what each GoAlert instance thinks is the next/new DB. Same for the old/prev DB.

Does that help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - thank you!

@@ -112,7 +113,11 @@ func ensureTableQuery(ctx context.Context, db *sql.DB, fn func() error) error {
}
// 42P01 is undefined_table
// https://www.postgresql.org/docs/9.6/static/errcodes-appendix.html
if pErr, ok := err.(*pq.Error); !ok || pErr.Code != "42P01" {
if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "42P01" {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth making the pq error code a const? Only two references at this point, but just a thought

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Not a bad idea, maybe a package with Postgres errors that we use in-app to hold those.

pqerr.UndefinedTable or something like that?

We are using 2 different drivers (pgx for switchover and data generation, pq for everything else), but it seems like they both play well in keeping the code the same format/string.

I like the idea of something like pqerr.IsCode(pqerr.CodeUndefinedTable, err) or simply pqerr.IsCode("42P01", err) that could handle the two drivers transparently. There's a lot of stuff in maperror.go that could be more portable/cleaner that way.


I think that might be better as a PR on its own though -- there are quite a few places we are doing these checks, and I'd rather not introduce a new style in this particular spot without updating the rest to match.

@m17ch @arurao What do you guys think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do something similar in an API i'm working on - keeps it transparent and allows us to do any massaging of the errors we may want to do.

}

status, err := s.status(ctx)
if err != nil {
return err
}
sh.Println(status)
sh.Println("change_log enabled.")

sh.Println("change_log disabled")
return nil
},
})
sh.AddCmd(ctxCmd{
Name: "disable",
Help: "Enable change_log",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enable -> Disable?

if dur < maxSync {
break
}
sh.Println("Took longer than max sync time, re-syncing")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential to become an infinite loop here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, intentionally. It will keep re-syncing the DB until it catches up, or the user uses CTRL+C to stop the process.

It's hard to predict how many times it may take, and if it were to stop at an arbitrary number, a busy system could fall behind again.

Since it's safe to run multiple times and can be interrupted at any time, it will keep going as long as is needed this way, rather than having the user re-run execute and confirm over and over.

Copy link
Contributor

@m17ch m17ch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functionally validated 👍

@mastercactapus mastercactapus merged commit fcae6ce into master Jun 27, 2019
@mastercactapus mastercactapus deleted the switchover-refactor branch June 27, 2019 17:46
arurao added a commit that referenced this pull request Jul 10, 2019
http/db: Bound concurrent DB queries (#1)

* add per-auth concurrency-limiting middleware

alerts: address mobile alerts controls bottom margin (#3)

* increase toolbar margin on smaller screens

* adjust whitespace width to the left of the alerts checkbox controls

* slightly reduce alerts checkbox controls bottom margin on desktop

fix ability to navigate back to the active alerts tab (#2)

users: fix name validation for users search (#5)

services: auto select service in alert form (#4)

* rename ServiceOnCallDisplay to ServiceOnCallQuery

* Add empty list message for on call users

* Add space after EP name on a service

* Fix prop names for alerts list show favorites snackbar and alert form

* fix prop names in alert form and alert fab

* revert linting change

services: re-add route to view integration key docs (#6)

* re-add route for integration key api documentation

* use latest markdown component as used in details pages

* update readme title and move email section to bottom

* remove previously used markdown package

* update markdown file to use a table for api parameter options

* format markdown

mailgun: fix signature errors (#7)

* add mailgun config to harness

* Return 400 on form-parse errors, 406 for mailgun

* add mailgun smoketest

* increase max req. body size from 32KiB to 256KiB

auth: update default Auth.RefererURLs behavior (#8)

* Change default referer behavior to same-host

* don't overwrite config with empty object

* fixed a couple integration tests

* add flag to disable HTTPS redirect

Templates (#10)

* added issue templates

docs: Add guides, update READMEs (#9)

* Changes from old repo

* adding screenshot

* tweak to project description

* Simplify dev setup documentation

* resetdb: don't err if db doesn't exist

* Cleanup contributing guidelines a bit

* Update smoketest readme

* add Dockerfile

* getting started guide

* dev setup guide

* add mailgun section

* add note about API-only mode

* attempt to start postgres container on failure (fix dev issue)

* shorten getting started intro

* tweaks to postgres getting started info

* add link to effective Go in CONTRIBUTING.md

* add links to more info for tests

* add go unit test example

* update label query in README

Co-Authored-By: Mitch Cimenski <mitch@m17ch.com>
Co-Authored-By: Arundhati Rao <arao4@buffalo.edu>

chore: update Go dependencies (#12)

* update switchover code for pgx api changes

* github package api update

* use contrib package for jaeger tracing

* update module files

ui dependencies: webpack-dev-server upgrade to stable version (#11)

* Update webpack-dev-server package.json dependency to stable version

* revert lockfile changes that are unrelated to the webpack-dev-server

* upgrade to latest webpack-dev-server version

ui dependencies: upgrade material-ui and react (#13)

* update mui and react dependencies

* remove useNextVariants from the mui theme

* fix grid spacing breaking changes

* forward required refs to transition components

* fix theme unit spacing breaking changes

* update variant fab button to use fab component

* fix issue with route expecting a function to render

* fix material-select height issue with mui updates

* update material-select to use a valid color prop on typography

* update mui to the latest latest versions

* update react-dom to latest release

* update lockfile version of @hot-loader/react-dom

* user explicit version for @hot-loader/react-dom

* yarn install

ui dependencies: material-ui-pickers update to @material-ui/pickers v3.1.0 (#14)

* update to latest material-ui pickers package

* update imports

* use explicit version of dependency and update package name in webpack config

* yarn lock update

docker: fix default listen address in container (#17)

ui dependencies: non-breaking change package updates (#18)

* update babel and material ui

* upgrade mui lab

* package patch updates

* package minor updates

* webpack upgrade fix to fs

* update node props to allow css-loader to properly build when starting the app

* revert webpack config changes

* major package upates with no relevant breaking changes

* remove some unused packages

* update react-redux and connected-react-router

the latest updates of these two packages are dependent on eachother being up to date simultaneously because of Redux utilizing React's context API

* nonbreaking package upgrades from over the weekend

* fix chance's word props min/max not actually existing

* this.context.router no longer supported, use withRouter to get history prop

* update lockfile

* make check

contactmethod: allow UK (+44) numbers (#22)

* Add support for UK country code (+44)

* Add test cases for UK numbers

* Reference the UK for the Supported Country Code

ui-dependencies: organize package.json dependencies (#20)

* remove some unused packages and organize package.json

* save lockfile

* re-add required types packages as dev dependencies

* add graphql as required dev dependency

make fake slack creds look more fake (#32)

Replacing fake Slack credentials with zeros instead of random so they don't appear to be real ones.

services: favorites test code improvements (#31)

* add favorite support to `createService` mutation

* create new services as favorites

* create `favorites.ts` test suite

Co-authored-by: Anderson Day <dayander@msu.edu>
Co-authored-by: Katy <kmm5278@rit.edu>

switchover: hardening & refactor instrumentation (#23)

* fix migration template typo

* automatic trigger generation

* handle pgx migration errors

* make no-pause-api and extra sync default behavior

* proper db identification/validation

* add progress for long operations

* normalize table scanning

* cancelation fixes

* first pass at switchover docs

* include version in DB application_name

* status information improvements

alertlog: graphql2 endpoint (#16)

implementing recentEvents as a field for an alert's log entries.

ep step: allow user selection when Slack is disabled (#36)

* update cypress

* add failing test for ep user selection

* fix the bug

rotations: add user favorites support (#27)

* Added `QuerySetFavoriteButton` component for dynamic favoriting
* Added the ability to favorite rotations

alerts: add `createAlert` mutation (#37)

* gql2 createAlert mutation and smoketest

dev: easily run integration tests by cli (#39)

* Added `--log-errors-only` flag to the backend
* Added `make cy-wide-prod-run` and `make cy-mobile-prod-run` commands for running tests
* Added `make check-all` that runs all integration & smoketests, as well as `make check`
* Updated the runjson tool to support env-params
* Use chrome by default to run tests

allow 10-digit sections in slack tokens (#41)

schedules: add user favorites support (#26)

* Favorites icon for schedules

* New migration and schema changes

* Tests for setting/finding favorites; integration with shared favorites component

* Wizard favorites schedules created automatically and new schedules are favorited on creation. Wizard unit test updated accordingly

config: loosen key validation & guide updates  (#44)

* allow all external config strings to be ASCII

* update docs for new Mailgun UI

Search: update validation for search strings containing spaces (#47)

* new search validation function
* replacing text validation to search validation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants