fix(cpu-pressure): skip expected errors (#730)
* feat(local): improve local run

* fix(network): track podResourceVersion

* refactor: client as field / remove useless perms

* fix(controller): use baseLog to avoid panic

* fix(informer): filter non chaos resources

* feat(datadog): install dd agent in local cluster

* chore: go 1.20.5

* fix(cpu-pressure): skip expected errors
luphaz committed Jun 29, 2023
1 parent 3e1ecd8 commit 8057e50
Showing 82 changed files with 3,107 additions and 1,627 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -82,7 +82,7 @@ executors:
#- image: datadog/chaos-controller-runner-circle:<< pipeline.parameters.CURRENT_CI_IMAGE >>
# This is circle ci images, provides default tool installed (like docker) to ease step definition and avoid apt-get/update things
# https://circleci.com/docs/circleci-images/#next-gen-language-images
- image: cimg/go:1.20
- image: cimg/go:1.20.5
resource_class: 2xlarge
python:
<<: *working_directory
3 changes: 2 additions & 1 deletion .gitignore
@@ -59,9 +59,10 @@ packer_cache/
crash.log
*.json-e

# local certificates
# local certificates & config
*.crt
*.key
.local.yaml

# eBPF
ebpf/builds/
8 changes: 4 additions & 4 deletions .gitlab-ci.yml
@@ -46,7 +46,7 @@ build:make:
stage: build
when: always
variables:
GO_FILENAME: go1.20.4.linux-amd64.tar.gz
GO_FILENAME: go1.20.5.linux-amd64.tar.gz
script:
- apt-get update
- apt-get -y install build-essential git
@@ -184,7 +184,7 @@ slack-notifier-build.on-failure:

.slack-notifier-release-staging: &slack-notifier-release-staging
image: registry.ddbuild.io/slack-notifier:sdm
tags: [ "runner:main" ]
tags: ["runner:main"]
stage: notify
only:
- /^.*-staging$/
@@ -195,10 +195,10 @@ slack-notifier-release-staging.on-success:
<<: *slack-notifier-release-staging
when: on_success
variables:
MESSAGE: ':check: | Staging Image Build Complete | [ $CI_PROJECT_NAME ][ $CI_COMMIT_REF_NAME ][ $CI_COMMIT_SHA ]'
MESSAGE: ":check: | Staging Image Build Complete | [ $CI_PROJECT_NAME ][ $CI_COMMIT_REF_NAME ][ $CI_COMMIT_SHA ]"

slack-notifier-release-staging.on-failure:
<<: *slack-notifier-release-staging
when: on_failure
variables:
MESSAGE: ':siren: | Staging Image Build Failed | [ $CI_PROJECT_NAME ][ $CI_COMMIT_REF_NAME ][ $CI_COMMIT_SHA ] :siren:'
MESSAGE: ":siren: | Staging Image Build Failed | [ $CI_PROJECT_NAME ][ $CI_COMMIT_REF_NAME ][ $CI_COMMIT_SHA ] :siren:"
52 changes: 51 additions & 1 deletion CONTRIBUTING.md
@@ -50,6 +50,56 @@ Once you have installed the above requirements, run the `make lima-all` command

Once the instance is started, you can log into it using either the `lima` or its longer form `limactl shell default` commands.

#### Change default lima instance

We are not using `default` as our instance name on lima anymore.

The alias `lima` can still be used as before to quickly jump into your instance shell (e.g. `lima uname -a`).

More specifically:

```text
Usage: lima [COMMAND...]
lima is an alias for "limactl shell default".
The instance name ("default") can be changed by specifying $LIMA_INSTANCE.
The shell and initial workdir inside the instance can be specified via $LIMA_SHELL
and $LIMA_WORKDIR.
See `limactl shell --help` for further information.
```

Just define `export LIMA_INSTANCE=$(whoami | tr "." "-")` in your `.zshrc` or similar.
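
The derivation above can be sketched as a small shell helper. This is only an illustration: `lima_instance_name` is a hypothetical function name, not something the repository defines, and the fallback to an already exported `LIMA_INSTANCE` is an assumption about how you may want your shell profile to behave.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): derive a lima instance
# name from a user name by replacing every "." with "-", as the
# `whoami | tr "." "-"` pipeline does.
lima_instance_name() {
  # Use the first argument when given, otherwise the current user name.
  printf '%s\n' "${1:-$(whoami)}" | tr "." "-"
}

# Keep an already exported LIMA_INSTANCE, otherwise use the derived name.
LIMA_INSTANCE="${LIMA_INSTANCE:-$(lima_instance_name)}"
export LIMA_INSTANCE
echo "lima instance: ${LIMA_INSTANCE}"
```

With this in place, `lima uname -a` and the Makefile targets should both resolve to the same instance name.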

### Installing Datadog Agent in local cluster

In case you have a Datadog account and want to install the Datadog Agent into your local cluster to retrieve logs/metrics/traces from it:

- Create an API key [here](https://app.datadoghq.com/organization-settings/api-keys)
- Create an APP key [here](https://app.datadoghq.com/organization-settings/application-keys)
- Store them securely and add them to your `.zshrc`:

> NB: it is recommended to properly tag/isolate your local workload from your PROD workload. Check with your Datadog account admin how to adapt tagging accordingly and confirm which configuration should be applied to your Datadog Agent.
```bash
security add-generic-password -a ${USER} -s staging_datadog_api_key -w
security add-generic-password -a ${USER} -s staging_datadog_app_key -w
# security delete-generic-password -a ${USER} -s staging_datadog_api_key
# security delete-generic-password -a ${USER} -s staging_datadog_app_key
# in your .zshrc or similar you can then do:
export STAGING_DATADOG_API_KEY=$(security find-generic-password -a ${USER} -s staging_datadog_api_key -w)
export STAGING_DATADOG_APP_KEY=$(security find-generic-password -a ${USER} -s staging_datadog_app_key -w)
# choose appropriate site here: https://docs.datadoghq.com/getting_started/site/
# only for `make open-dd`
export STAGING_DD_SITE=https://app.datadoghq.com

```

- Run `make lima-install-datadog-agent`
- Run `make open-dd` to open your default browser to the infrastructure page of your host
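
Before running these targets, it may help to check that both keys are visible to `make`. The sketch below mirrors the `ifdef STAGING_DATADOG_API_KEY` / `ifdef STAGING_DATADOG_APP_KEY` guard from this commit's Makefile; `datadog_agent_enabled` is a hypothetical helper name used only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): succeeds only when both
# Datadog keys are set to non-empty values, which is the condition under
# which the Makefile sets INSTALL_DATADOG_AGENT to true.
datadog_agent_enabled() {
  [ -n "${STAGING_DATADOG_API_KEY:-}" ] && [ -n "${STAGING_DATADOG_APP_KEY:-}" ]
}

if datadog_agent_enabled; then
  echo "both keys found: INSTALL_DATADOG_AGENT would be true"
else
  echo "missing key(s): INSTALL_DATADOG_AGENT would stay false"
fi
```

The check deliberately never prints the key values, so it is safe to run in a shared terminal.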

### Deploying local changes to Lima: `make lima-redeploy`

To deploy changes made to the controller code or chart, run the `make lima-redeploy` command that will run the following targets:
@@ -154,7 +204,7 @@ The end-to-end tests will create a set of dummy pods in the `default` namespace
In case you have a Datadog account and want to push the tests results to it, you can do the following:

- Create an API key [here](https://app.datadoghq.com/organization-settings/api-keys)
- Store it securily and add it to your `.zshrc`:
- Store it securely and add it to your `.zshrc`:

```bash
security add-generic-password -a ${USER} -s datadog_api_key -w
1 change: 0 additions & 1 deletion LICENSE-3rdparty.csv
@@ -541,7 +541,6 @@ gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/internal/telemet
gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/internal/traceprof,Apache-2.0
gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/internal/version,Apache-2.0
gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/profiler,Apache-2.0
gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/profiler/internal,Apache-2.0
gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/fastdelta,Apache-2.0
gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/immutable,Apache-2.0
gopkg.in/DataDog/dd-trace-go.v1,gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/pproflite,Apache-2.0
90 changes: 75 additions & 15 deletions Makefile
@@ -1,4 +1,4 @@
.PHONY: manager injector handler release generate generate-mocks clean-mocks all lima-push-all lima-redeploy lima-all e2e-test test lima-install manifests lima-restart install-controller-gen
.PHONY: *
.SILENT: release

GOOS = $(shell go env GOOS)
@@ -10,6 +10,19 @@ ifeq (,$(GOBIN))
GOBIN = $(shell go env GOPATH)/bin
endif

INSTALL_DATADOG_AGENT = false
LIMA_INSTALL_SINK = noop
ifdef STAGING_DATADOG_API_KEY
ifdef STAGING_DATADOG_APP_KEY
INSTALL_DATADOG_AGENT = true
LIMA_INSTALL_SINK = datadog
endif
endif

ifndef CONTROLLER_APP_VERSION
CONTROLLER_APP_VERSION = $(shell git rev-parse HEAD)$(shell git diff --quiet || echo '-dirty')
endif

# Lima requires to have images built on a specific namespace to be shared to the Kubernetes cluster when using containerd runtime
# https://github.com/abiosoft/colima#interacting-with-image-registry
CONTAINERD_REGISTRY_PREFIX ?= k8s.io
@@ -21,7 +34,10 @@ HANDLER_IMAGE ?= ${CONTAINERD_REGISTRY_PREFIX}/chaos-handler:latest

LIMA_PROFILE ?= lima
LIMA_CONFIG ?= lima
KUBECTL ?= limactl shell default sudo kubectl
# default instance name will be connected user name
LIMA_INSTANCE ?= $(shell whoami | tr "." "-")

KUBECTL ?= limactl shell $(LIMA_INSTANCE) sudo kubectl
PROTOC_VERSION = 3.17.3
PROTOC_OS ?= osx
PROTOC_ZIP = protoc-${PROTOC_VERSION}-${PROTOC_OS}-x86_64.zip
@@ -94,7 +110,10 @@ _injector:;
_handler:;
_manager: generate

_docker-build-injector: docker-build-ebpf
_docker-build-injector:
ifneq (true,$(SKIP_EBPF))
$(MAKE) docker-build-ebpf
endif
_docker-build-handler:;
_docker-build-manager:;

@@ -116,8 +135,8 @@ docker-build-$(1): _docker-build-$(1) $(1)
docker save $$(IMAGE_TAG) -o ./bin/$(1)/$(1).tar.gz

lima-push-$(1): docker-build-$(1)
limactl copy ./bin/$(1)/$(1).tar.gz default:/tmp/
limactl shell default -- sudo k3s ctr i import /tmp/$(1).tar.gz
limactl copy ./bin/$(1)/$(1).tar.gz $(LIMA_INSTANCE):/tmp/
limactl shell $(LIMA_INSTANCE) -- sudo k3s ctr i import /tmp/$(1).tar.gz

minikube-load-$(1):
# let's fail if the file does not exist so we know, mk load is not failing
@@ -255,7 +274,7 @@ generate: install-controller-gen

# Lima actions
## Create a new lima cluster and deploy the chaos-controller into it
lima-all: lima-start lima-install-cert-manager lima-push-all lima-install
lima-all: lima-start lima-install-datadog-agent lima-install-cert-manager lima-push-all lima-install
kubens chaos-engineering

## Rebuild the chaos-controller images, re-install the chart and restart the chaos-controller pods
@@ -278,6 +297,9 @@ lima-install-demo:
## we override images for all of our components to the expected namespace
lima-install: manifests
helm template \
--set=controller.version=$(CONTROLLER_APP_VERSION) \
--set=controller.metricsSink=$(LIMA_INSTALL_SINK) \
--set=controller.profilerSink=$(LIMA_INSTALL_SINK) \
--values ./chart/values/$(HELM_VALUES) \
./chart | $(KUBECTL) apply -f -
ifneq (local.yaml,$(HELM_VALUES)) # we can only wait for a controller if it exists, local.yaml does not deploy the controller
@@ -286,7 +308,7 @@ endif

## Uninstall CRDs and controller from a lima k3s cluster
lima-uninstall:
helm template --values ./chart/values/$(HELM_VALUES) ./chart | $(KUBECTL) delete -f -
helm template --set=skipNamespace=true --values ./chart/values/$(HELM_VALUES) ./chart | $(KUBECTL) delete -f -

## Restart the chaos-controller pod
lima-restart:
@@ -303,7 +325,7 @@ lima-kubectx-clean:
kubectl config unset current-context

lima-kubectx:
limactl shell default sudo sed 's/default/lima/g' /etc/rancher/k3s/k3s.yaml >> ~/.kube/config_lima
limactl shell $(LIMA_INSTANCE) sudo sed 's/default/lima/g' /etc/rancher/k3s/k3s.yaml > ~/.kube/config_lima
KUBECONFIG=${KUBECONFIG}:~/.kube/config:~/.kube/config_lima kubectl config view --flatten > /tmp/config
rm ~/.kube/config_lima
mv /tmp/config ~/.kube/config
@@ -312,13 +334,13 @@ lima-kubectx:

## Stop and delete the lima cluster
lima-stop:
limactl stop -f default
limactl delete default
limactl stop -f $(LIMA_INSTANCE)
limactl delete $(LIMA_INSTANCE)
$(MAKE) lima-kubectx-clean

## Start the lima cluster, pre-cleaning kubectl config
lima-start: lima-kubectx-clean
LIMA_CGROUPS=${LIMA_CGROUPS} LIMA_CONFIG=${LIMA_CONFIG} ./scripts/lima_start.sh
LIMA_CGROUPS=${LIMA_CGROUPS} LIMA_CONFIG=${LIMA_CONFIG} LIMA_INSTANCE=${LIMA_INSTANCE} ./scripts/lima_start.sh
$(MAKE) lima-kubectx

# Longhorn is used as an alternative StorageClass in order to enable "reliable" disk throttling across various local setup
@@ -374,16 +396,26 @@ generate-mocks: clean-mocks install-mockery
release:
VERSION=$(VERSION) ./tasks/release.sh

lima-install-local:
_pre_local: generate manifests
@$(shell $(KUBECTL) get deploy chaos-controller 2> /dev/null)
ifeq (0,$(.SHELLSTATUS))
# uninstall using a non local value to ensure deployment is deleted
-$(MAKE) lima-uninstall HELM_VALUES=dev.yaml
$(MAKE) lima-install HELM_VALUES=local.yaml
$(KUBECTL) -n chaos-engineering get cm chaos-controller -oyaml | yq '.data["config.yaml"]' > .local.yaml
yq -i '.controller.webhook.certDir = "chart/certs"' .local.yaml
else
@echo "Chaos controller is not installed, skipped!"
endif

pre-debug: generate manifests lima-install-local
debug: _pre_local
@echo "now you can launch through vs-code or your favorite IDE a controller in debug with appropriate configuration (--config=chart/values/local.yaml + CONTROLLER_NODE_NAME=local)"

local: generate manifests lima-install-local
CONTROLLER_NODE_NAME=local go run main.go --config=chart/values/local.yaml
run:
CONTROLLER_NODE_NAME=local go run . --config=.local.yaml

watch: _pre_local install-watchexec
watchexec make SKIP_EBPF=true lima-push-injector run

install-protobuf:
curl -sSLo /tmp/${PROTOC_ZIP} https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/${PROTOC_ZIP}
@@ -456,3 +488,31 @@ ifeq (,$(wildcard $(GOBIN)/yamlfmt))
tar -xvzf /tmp/yamlfmt.tar.gz --directory=$(GOBIN) yamlfmt
rm /tmp/yamlfmt.tar.gz
endif

install-watchexec:
ifeq (,$(wildcard $(GOBIN)/gow))
$(info installing watchexec...)
brew install watchexec
endif

EXISTING_NAMESPACE = $(shell $(KUBECTL) get ns datadog-agent -oname || echo "")

lima-install-datadog-agent:
ifeq (true,$(INSTALL_DATADOG_AGENT))
ifeq (,$(EXISTING_NAMESPACE))
$(KUBECTL) create ns datadog-agent
helm repo add --force-update datadoghq https://helm.datadoghq.com
helm install -n datadog-agent my-datadog-operator datadoghq/datadog-operator
$(KUBECTL) create secret -n datadog-agent generic datadog-secret --from-literal api-key=${STAGING_DATADOG_API_KEY} --from-literal app-key=${STAGING_DATADOG_APP_KEY}
endif
endif
$(KUBECTL) apply -f - < examples/datadog-agent.yaml

open-dd:
ifeq (true,$(INSTALL_DATADOG_AGENT))
ifdef STAGING_DD_SITE
open "${STAGING_DD_SITE}/infrastructure?host=lima-$(LIMA_INSTANCE)&tab=details"
else
@echo "You need to define STAGING_DD_SITE in your .zshrc or similar to use this feature"
endif
endif
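
The `CONTROLLER_APP_VERSION` default introduced in the Makefile hunks above (`git rev-parse HEAD` plus a `-dirty` suffix when `git diff --quiet` reports changes) could be mirrored in plain shell roughly as follows. This is a sketch for experimenting outside `make`: `controller_app_version` is a hypothetical function name, and the `unknown` fallback is an assumption added here, not behaviour of the Makefile.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the current commit
# SHA, suffixed with "-dirty" when the working tree differs from the index.
controller_app_version() {
  local sha
  sha="$(git rev-parse HEAD)" || return 1
  if git diff --quiet; then
    printf '%s\n' "${sha}"
  else
    printf '%s-dirty\n' "${sha}"
  fi
}

# Like the Makefile's `ifndef`, keep a value that is already set.
# "unknown" is an assumed placeholder for use outside a git checkout.
CONTROLLER_APP_VERSION="${CONTROLLER_APP_VERSION:-$(controller_app_version 2>/dev/null || echo unknown)}"
echo "controller version: ${CONTROLLER_APP_VERSION}"
```

Note that `git diff --quiet` without arguments compares the working tree to the index, so changes that are only staged do not add the `-dirty` suffix.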
35 changes: 34 additions & 1 deletion chart/templates/deployment.yaml
@@ -8,6 +8,13 @@ kind: Deployment
metadata:
name: chaos-controller
namespace: {{ .Values.chaosNamespace }}
labels:
app: chaos-controller
chart_name: "{{ .Chart.Name }}"
chart_version: "{{ .Chart.Version }}"
tags.datadoghq.com/env: dev
tags.datadoghq.com/service: chaos-controller
tags.datadoghq.com/version: {{ .Values.controller.version | default .Values.controller.image.tag }}
spec:
replicas: 1
selector:
@@ -17,10 +24,16 @@ spec:
metadata:
labels:
app: chaos-controller
chart_name: "{{ .Chart.Name }}"
chart_version: "{{ .Chart.Version }}"
admission.datadoghq.com/enabled: "true"
tags.datadoghq.com/env: dev
tags.datadoghq.com/service: chaos-controller
tags.datadoghq.com/version: {{ .Values.controller.version | default .Values.controller.image.tag }}
annotations:
kubectl.kubernetes.io/default-container: manager
spec:
serviceAccount: chaos-controller
serviceAccountName: chaos-controller
containers:
- name: kube-rbac-proxy
image: {{ template "chaos-controller.format-image" .Values.proxy.image }}
@@ -33,6 +46,13 @@ spec:
ports:
- containerPort: 8443
name: https
resources:
limits:
cpu: 50m
memory: 64Mi
requests:
cpu: 50m
memory: 64Mi
- name: manager
image: {{ template "chaos-controller.format-image" deepCopy .Values.global.chaos.defaultImage | merge .Values.global.oci | merge .Values.controller.image }}
imagePullPolicy: IfNotPresent
@@ -41,10 +61,21 @@ spec:
args:
- --config=/etc/chaos-controller/config.yaml
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: CONTROLLER_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: HOST_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
- name: TRACE_AGENT_URL
value: $(HOST_IP):8126
ports:
- containerPort: {{ .Values.controller.webhook.port }}
name: webhook-server
@@ -63,6 +94,8 @@ spec:
- mountPath: /etc/chaos-controller
name: config
readOnly: true
securityContext:
runAsUser: 0
{{- if .Values.controller.image.pullSecrets }}
imagePullSecrets:
- name: {{ .Values.controller.image.pullSecrets }}
1 change: 1 addition & 0 deletions chart/templates/generated/role.yaml
@@ -71,5 +71,6 @@ rules:
resources:
- services
verbs:
- get
- list
- watch
2 changes: 2 additions & 0 deletions chart/templates/namespace.yaml
@@ -2,7 +2,9 @@
# under the Apache License Version 2.0.
# This product includes software developed at Datadog (https://www.datadoghq.com/).
# Copyright 2023 Datadog, Inc.
{{- if not .Values.skipNamespace }}
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .Values.chaosNamespace }}"
{{- end }}