Merge pull request #113 from Financial-Times/clean-up
Clean up & document
apaleslimghost committed Feb 25, 2019
2 parents 77dcfc9 + 97dacdb commit ce34059
Showing 21 changed files with 346 additions and 725 deletions.
3 changes: 0 additions & 3 deletions Makefile
Original file line number Diff line number Diff line change
@@ -11,7 +11,4 @@ IGNORE_A11Y = true
test-unit:
FT_GRAPHITE_KEY=123 HEROKU_AUTH_TOKEN=token mocha

test-int:
mocha int-tests/ -r loadvars.js

test: verify test-unit
174 changes: 115 additions & 59 deletions README.md
@@ -1,97 +1,153 @@
# n-health [![CircleCI](https://circleci.com/gh/Financial-Times/n-health.svg?style=svg)](https://circleci.com/gh/Financial-Times/n-health)

Makes it easy to add a variety of healthchecks to an app.
A collection of healthcheck classes to use in your Node.js application.

## Adding Health Checks
To add more health checks, create a new file in the `config` directory. It should be a .js file which exports an object. The object must have the following properties:
## Usage

* name: A name for the healthcheck - is supposed to match to a name in the CMDB, ideally
* description: Test description for the checks - for reference only
* checks: Array of checks - see below for check config
`n-health` exports a function that loads [healthcheck configuration](#healthcheck-configuration) files from a folder:

## Standard check options
```js
const nHealth = require('n-health');

* name, severity, businessImpact, technicalSummary and panicGuide are all required. See the [specification](https://docs.google.com/document/edit?id=1ftlkDj1SUXvKvKJGvoMoF1GnSUInCNPnNGomqTpJaFk) for details
* interval: time between checks in milliseconds or any string compatible with [ms](https://www.npmjs.com/package/ms) [default: 1minute]
* type: The type of check (see below)
* officeHoursOnly: [default: false] For queries that will probably fail out of hours (e.g. Internet Explorer usage, B2B stuff), set this to true and the check will pass on weekends and outside office hours. Use sparingly.
const healthChecks = nHealth(
'path/to/healthchecks' // by default, `/healthchecks` or `/config` in the root of your application
)
```

## Healthcheck types and options
It returns an object with an `asArray` method. If you're using `n-express`, pass the array that `asArray` returns as the `healthChecks` option:

### pingdom
```js
const nExpress = require('@financial-times/n-express')

nExpress({
healthChecks: healthChecks.asArray()
})
```

If you're not using n-express, you should create a `/__health` endpoint which returns the following JSON structure (see the [specification](https://docs.google.com/document/edit?id=1ftlkDj1SUXvKvKJGvoMoF1GnSUInCNPnNGomqTpJaFk) for details):

```json
{
"schemaVersion": 1,
"name": "app name",
"systemCode": "biz-ops system code",
"description": "human readable description",
"checks": []
}
```

`checks` should be an array of check status objects. You can get this by calling `getStatus` on each item in the array, for example with `healthChecks.asArray().map(check => check.getStatus())`.
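
As a rough sketch of the structure described above, the response body could be assembled as follows. The `buildHealthResponse` helper, the app details and the stub checks are hypothetical stand-ins so that the snippet runs without n-health installed; in a real app the array would come from `healthChecks.asArray()`.

```js
// Hypothetical helper (not part of n-health): wraps an array of checks
// in the /__health response structure shown above.
// Each check only needs a getStatus() method.
function buildHealthResponse(appInfo, checks) {
	return {
		schemaVersion: 1,
		name: appInfo.name,
		systemCode: appInfo.systemCode,
		description: appInfo.description,
		checks: checks.map(check => check.getStatus())
	};
}

// Stub checks standing in for healthChecks.asArray()
const stubChecks = [
	{ getStatus: () => ({ name: 'stub check one', ok: true }) },
	{ getStatus: () => ({ name: 'stub check two', ok: false }) }
];

const body = buildHealthResponse(
	{
		name: 'example app',
		systemCode: 'example-app',
		description: 'An example application'
	},
	stubChecks
);

console.log(JSON.stringify(body, null, 2));
```

Your `/__health` route handler would then send this object as JSON.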

### Custom healthchecks

If you require a healthcheck not provided by n-health, you can pass a second argument to `nHealth`, which should be a path to a folder of files exporting custom healthcheck classes. Each module should export a class that extends `n-health`'s `Check` class and implements the `tick` method, which is called periodically to update the check's `status`. It can also implement an `init` method to do setup work when the check is first started. Both of these methods can be `async` if you need to do something like make a request.

```js
const {Check, status} = require('n-health');

class RandomCheck extends Check {
tick() {
this.status = Math.random() < 0.5 ? status.PASSED : status.FAILED;
}
}

module.exports = RandomCheck;
```

See the [src/checks](src/checks) folder for some examples.

## Healthcheck configuration

A healthcheck config is a JavaScript file that exports an object with these properties:

* `name`: A name for the healthcheck; ideally this should match a name in biz-ops
* `description`: Test description for the checks - for reference only
* `checks`: Array of [check objects](#check-objects)

### Check objects

#### Common options

* `type`: The type of check, which should be one of the types below. That check type's options should also be included in the object as required.
* `name`, `severity`, `businessImpact`, `technicalSummary` and `panicGuide` are all required. See the [specification](https://docs.google.com/document/edit?id=1ftlkDj1SUXvKvKJGvoMoF1GnSUInCNPnNGomqTpJaFk) for details
* `interval`: time between checks in milliseconds or any string compatible with [ms](https://www.npmjs.com/package/ms) [default: 1minute]
* `officeHoursOnly`: [default: `false`] For queries that will probably fail out of hours (e.g. Internet Explorer usage, B2B stuff), set this to true and the check will pass on weekends and outside office hours (defined as 8am-6pm UTC). Use sparingly.
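
To picture the `officeHoursOnly` rule, here is a minimal sketch. It is not n-health's own `isOfficeHoursNow` implementation: the `isOfficeHours` and `effectiveOk` helpers are hypothetical, and the sketch assumes that office hours mean Monday to Friday, from 08:00 up to but not including 18:00 UTC.

```js
// Hypothetical helper: true when the given Date falls within office hours.
// Assumption: Monday to Friday, 08:00 (inclusive) to 18:00 (exclusive), UTC.
function isOfficeHours(date) {
	const day = date.getUTCDay(); // 0 = Sunday, 6 = Saturday
	const hour = date.getUTCHours();
	const isWeekday = day >= 1 && day <= 5;
	return isWeekday && hour >= 8 && hour < 18;
}

// Hypothetical helper: a check flagged officeHoursOnly reports ok
// whenever it is evaluated outside office hours, whatever its real result.
function effectiveOk(check, now) {
	if (check.officeHoursOnly && !isOfficeHours(now)) {
		return true;
	}
	return check.ok;
}

const failingCheck = { officeHoursOnly: true, ok: false };
const mondayMorning = new Date('2019-02-25T10:00:00Z');
const saturdayMorning = new Date('2019-02-23T10:00:00Z');

console.log(effectiveOk(failingCheck, mondayMorning));
console.log(effectiveOk(failingCheck, saturdayMorning));
```

Because a failing check is reported as passing outside these hours, real outages at night or at weekends will not alert, which is why the option should be used sparingly.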

#### `pingdom`
Will poll the pingdom API to get the status of a specific check

* checkId: The id of the check in pingdom
* `checkId`: The id of the check in pingdom

### responseCompare
#### `responseCompare`
Fetches from multiple urls and compares the responses. Useful to check that replication is working

* urls: An array of urls to call
* comparison: Type of comparison to apply to the responses (only "equal" so far)
* `urls`: An array of urls to call
* `comparison`: Type of comparison to apply to the responses:
- `'equal'` the check succeeds if all the responses have the same status

### json
#### `json`
Calls a url, gets some json and runs a callback to check its form

* url: url to call and get the json
* fetchOptions: Object to pass to fetch, see https://www.npmjs.com/package/node-fetch#options for more information.
* callback: A function to run on the response. Accepts the parsed json as an argument and should return true or false
* `url`: url to call and get the json
* `fetchOptions`: Object to pass to fetch, see https://www.npmjs.com/package/node-fetch#options for more information.
* `callback`: A function to run on the response. Accepts the parsed json as an argument and should return true or false
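
Putting the configuration structure, the common options and the `json` options together, a config file might look like the sketch below. Every value in it (names, URL, severity, impact text, and the `items` field the callback inspects) is hypothetical and only illustrates the shape.

```js
// Hypothetical healthcheck config file, e.g. healthchecks/example.js
const config = {
	name: 'example app healthchecks',
	description: 'Checks that the example API responds with well-formed JSON',
	checks: [
		{
			type: 'json',
			name: 'Example API returns items',
			severity: 2,
			businessImpact: 'Users may see empty pages',
			technicalSummary: 'Fetches the example API and checks that it returns a non-empty list of items',
			panicGuide: 'Check the status of the example API',
			interval: '1m', // any string understood by the ms package
			url: 'https://example.com/api/items',
			fetchOptions: { headers: { accept: 'application/json' } },
			// Receives the parsed JSON and returns true (pass) or false (fail)
			callback: json => Array.isArray(json.items) && json.items.length > 0
		}
	]
};

module.exports = config;
```

Keeping the callback a small pure function makes it easy to unit test without making any requests.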

### aggregate
Reports on the status of other checks. Useful if you have a multi-region service and, if one check fails it is not as bad as if ALL the checks fail.
#### `aggregate`
Reports on the status of other checks. Useful if you have a multi-region service where one failing check is not as bad as all of the checks failing.

* watch: Array of names of checks to aggregate
* mode: Aggregate mode. I think "atLeastOne" is the only valid option so far
* `watch`: Array of names of checks to aggregate
* `mode`: Aggregate mode:
- `'atLeastOne'` the check succeeds if at least one of its subchecks succeeds
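
The `'atLeastOne'` rule can be sketched as a plain function. The `atLeastOnePasses` name and the stub subchecks are hypothetical; the sketch also assumes that an aggregate with no matching subchecks is treated as failing.

```js
// Hypothetical helper: true if at least one subcheck currently reports ok.
// Assumption: an empty list of subchecks counts as a failure.
function atLeastOnePasses(subchecks) {
	const results = subchecks.map(check => check.getStatus().ok);
	return results.length > 0 && results.some(ok => ok);
}

// Stub subchecks standing in for the checks named in `watch`
const euCheck = { getStatus: () => ({ ok: false }) };
const usCheck = { getStatus: () => ({ ok: true }) };

console.log(atLeastOnePasses([euCheck, usCheck]));
console.log(atLeastOnePasses([euCheck]));
```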

### graphiteSpike
#### `graphiteSpike`
Compares current and historical graphite metrics to see if there is a spike

* numerator: [required] Name of main graphite metric to count (may contain wildcards)
* divisor: [optional] Name of graphite metric to divide by (may contain wildcards)
* normalize: [optional] Boolean indicating whether to normalize to adjust for difference in size between sample and baseline timescales. Default is `true` if no divisor specified, `false` otherwise.
* samplePeriod: [default: '10min'] Length of time to count metrics for a sample of current behaviour
* baselinePeriod: [default: '7d'] Length of time to count metrics for to establish baseline behaviour
* direction: [default: 'up'] Direction in which to look for spikes; 'up' = sharp increase in activity, 'down' = sharp decrease in activity
* threshold: [default: 3] Amount of difference between current and baseline activity which registers as a spike e.g. 5 means current activity must be 5 times greater/less than the baseline activity
* `numerator`: [required] Name of main graphite metric to count (may contain wildcards)
* `divisor`: [optional] Name of graphite metric to divide by (may contain wildcards)
* `normalize`: [optional] Boolean indicating whether to normalize to adjust for difference in size between sample and baseline timescales. Default is `true` if no divisor specified, `false` otherwise.
* `samplePeriod`: [default: `'10min'`] Length of time to count metrics for a sample of current behaviour
* `baselinePeriod`: [default: `'7d'`] Length of time to count metrics for to establish baseline behaviour
* `direction`: [default: `'up'`] Direction in which to look for spikes; 'up' = sharp increase in activity, 'down' = sharp decrease in activity
* `threshold`: [default: `3`] Amount of difference between current and baseline activity which registers as a spike e.g. 5 means current activity must be 5 times greater/less than the baseline activity
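
One way to read the `direction` and `threshold` options is as a test on the ratio between sample and baseline activity. The sketch below is an interpretation of the description above, not n-health's actual implementation: the `isSpike` helper is hypothetical, it ignores `divisor` and `normalize`, and it does not handle a baseline of zero.

```js
// Hypothetical helper: decides whether sample activity is a spike relative
// to baseline activity.
// Assumption: 'up' means sample is more than `threshold` times the baseline,
// 'down' means sample is less than 1/`threshold` of the baseline.
function isSpike(sample, baseline, { direction = 'up', threshold = 3 } = {}) {
	const ratio = sample / baseline;
	return direction === 'up'
		? ratio > threshold
		: ratio < 1 / threshold;
}

console.log(isSpike(40, 10));
console.log(isSpike(2, 10, { direction: 'down' }));
```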

### graphiteThreshold
#### `graphiteThreshold`
Checks whether the value of a graphite metric has crossed a threshold

* metric: [required] Name of graphite metric to count (may contain wildcards)
* threshold: [required] Value to check the metrics against
* samplePeriod: [default: '10min'] Length of time to count metrics for a sample of current behaviour
* direction: [default: 'above'] Direction on which to trigger the healthcheck;
- 'above' = alert if value goes above the threshold
- 'below' = alert if value goes below the threshold
* `metric`: [required] Name of graphite metric to count (may contain wildcards)
* `threshold`: [required] Value to check the metrics against
* `samplePeriod`: [default: `'10min'`] Length of time to count metrics for a sample of current behaviour
* `direction`: [default: `'above'`] Direction on which to trigger the healthcheck:
- `'above'` = alert if value goes above the threshold
- `'below'` = alert if value goes below the threshold
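
For illustration, a `graphiteThreshold` check object could look like the sketch below. The metric name, threshold and descriptive text are all hypothetical; only the option names come from the lists above.

```js
// Hypothetical check object for the `checks` array of a healthcheck config
const errorRateCheck = {
	type: 'graphiteThreshold',
	name: 'Example app 500 responses are low',
	severity: 2,
	businessImpact: 'Users may see error pages',
	technicalSummary: 'Alerts if the count of 500 responses rises above the threshold',
	panicGuide: 'Check the application logs for errors',
	metric: 'next.heroku.example-app.*.responses.500.count', // hypothetical metric name
	threshold: 100,
	samplePeriod: '10min',
	direction: 'above' // alert if the value goes above the threshold
};

console.log(errorRateCheck.type);
```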

### graphiteWorking
#### `graphiteWorking`

Checks if the value of a graphite metric has received data recently.

* metric: [required] Name of graphite metric to count (may contain wildcards)
* `metric`: [required] Name of graphite metric to count (may contain wildcards)
- Use `summarize` if the metric receives data infrequently, e.g. `summarize(next.heroku.next-article.some-infrequent-periodic-metric, '30mins', 'sum', true)`
* time: [default: '-5minutes'] Length of time to count metrics
* `time`: [default: `'-5minutes'`] Length of time to count metrics

### cloudWatchThreshold
#### `cloudWatchThreshold`
Checks whether the value of a CloudWatch metric has crossed a threshold

_Note: this assumes that `AWS_ACCESS_KEY` & `AWS_SECRET_ACCESS_KEY` are implicitly available as environment variables on process.env_


* cloudWatchRegion = [default 'eu-west-1'] AWS region the metrics are stored
* cloudWatchMetricName = [required] Name of the CloudWatch metric to count
* cloudWatchNamespace = [required] Namespace the metric resides in
* cloudWatchStatistic = [default 'Sum'] Data aggregation type to return
* cloudWatchDimensions = Optional array of metric data to query
* samplePeriod: [default: 300] Length of time in seconds to count metrics for a sample of current behaviour
* threshold: [required] Value to check the metrics against
* direction: [default: 'above'] Direction on which to trigger the healthcheck;
- 'above' = alert if value goes above the threshold
- 'below' = alert if value goes below the threshold

### cloudWatchAlarm
* `cloudWatchRegion`: [default: `'eu-west-1'`] AWS region in which the metrics are stored
* `cloudWatchMetricName`: [required] Name of the CloudWatch metric to count
* `cloudWatchNamespace`: [required] Namespace the metric resides in
* `cloudWatchStatistic`: [default: `'Sum'`] Data aggregation type to return
* `cloudWatchDimensions`: [optional] Array of metric data to query
* `samplePeriod`: [default: `300`] Length of time in seconds to count metrics for a sample of current behaviour
* `threshold`: [required] Value to check the metrics against
* `direction`: [default: `'above'`] Direction on which to trigger the healthcheck:
- `'above'` = alert if value goes above the threshold
- `'below'` = alert if value goes below the threshold

#### `cloudWatchAlarm`
Checks whether the state of a CloudWatch alarm is healthy

_Note: this assumes that `AWS_ACCESS_KEY` & `AWS_SECRET_ACCESS_KEY` are implicitly available as environment variables on process.env_

* cloudWatchRegion = [default 'eu-west-1'] AWS region the metrics are stored
* cloudWatchAlarmName = [required] Name of the CloudWatch alarm to check
* `cloudWatchRegion`: [default: `'eu-west-1'`] AWS region in which the alarm is stored
* `cloudWatchAlarmName`: [required] Name of the CloudWatch alarm to check
25 changes: 0 additions & 25 deletions int-tests/heroku.int.test.js

This file was deleted.

5 changes: 0 additions & 5 deletions loadvars.js

This file was deleted.

2 changes: 0 additions & 2 deletions src/checks/aggregate.check.js
@@ -35,15 +35,13 @@ class AggregateCheck extends Check {
init () {
let watchRegex = new RegExp(`(${this.watch.join('|')})`, 'i');
this.obserables = this.parent.checks.filter(check => watchRegex.test(check.name));
return Promise.resolve()
}

tick(){
let results = this.obserables.map(c => c.getStatus().ok);
if(this.mode === AggregateCheck.modes.AT_LEAST_ONE){
this.status = results.length && results.some(r => r) ? status.PASSED : status.FAILED;
}
return Promise.resolve();
}
}

71 changes: 36 additions & 35 deletions src/checks/check.js
@@ -12,17 +12,20 @@ const isOfficeHoursNow = () => {
};

class Check {
constructor(opts) {
[
'name',
'severity',
'businessImpact',
'panicGuide',
'technicalSummary'
].forEach(prop => {
if(!opts[prop]) {
throw new Error(`${prop} is required for every healthcheck`);
}
})

constructor (opts) {
'name,severity,businessImpact,panicGuide,technicalSummary'
.split(',')
.forEach(prop => {
if (!opts[prop]) {
throw new Error(`${prop} is required for every healthcheck`);
}
})

if (this.start !== Check.prototype.start || this._tick !== Check.prototype._tick) {
if(this.start !== Check.prototype.start || this._tick !== Check.prototype._tick) {
throw new Error(`Do not override native start and _tick methods of n-health checks.
They provide essential error handlers. If complex setup is required, define
an init method returning a Promise`)
@@ -38,37 +41,34 @@ an init method returning a Promise`)
this.status = status.PENDING;
this.lastUpdated = null;
}
init () {
return Promise.resolve();
}
start () {
this.init()
.then(() => {
this.int = setInterval(this._tick.bind(this), this.interval);
this._tick();
})

init() {}

async start() {
await this.init();

this.int = setInterval(this._tick.bind(this), this.interval);
this._tick();
}

_tick () {
async _tick() {
try {
await this.tick()
} catch(err){
logger.error({ event: 'FAILED_HEALTHCHECK_TICK', name: this.name }, err)
raven.captureError(err);
this.status = status.ERRORED;
this.checkOutput = 'Healthcheck failed to execute';
}

return Promise.resolve()
.then(() => this.tick())
.catch(err => {
logger.error({ event: 'FAILED_HEALTHCHECK_TICK', name: this.name }, err)
raven.captureError(err);
this.status = status.ERRORED;
this.checkOutput = 'Healthcheck failed to execute';
})
.then(() => {
this.lastUpdated = new Date();
});
this.lastUpdated = new Date();
}

stop () {
stop() {
clearInterval(this.int);
}

getStatus () {
getStatus() {
const output = {
name: this.name,
ok: this.status === status.PASSED,
@@ -81,17 +81,18 @@ an init method returning a Promise`)
checkOutput: this.status === status.ERRORED ? 'Healthcheck failed to execute' : this.checkOutput
};

if (this.officeHoursOnly && !isOfficeHoursNow()) {
if(this.officeHoursOnly && !isOfficeHoursNow()) {
output.ok = true;
output.checkOutput = 'This check is not set to run outside of office hours';
} else if (this.lastUpdated) {
} else if(this.lastUpdated) {
output.lastUpdated = this.lastUpdated.toISOString();
let shouldHaveRun = Date.now() - (this.interval + 1000);
if(this.lastUpdated.getTime() < shouldHaveRun){
output.ok = false;
output.checkOutput = 'Check has not run recently';
}
}

return output;
}
}