Added E2E tests framework and test cases #137

Merged: 1 commit, Mar 18, 2020

Conversation

@M00nF1sh (Contributor) commented Mar 12, 2020:

Added E2E tests framework and test cases.

The test cases are structured as follows:

  • /test/e2e/fishapp: pods are composed of two containers: colorApp and httpProxy. The colorApp container echoes an environment variable; the httpProxy lets us check a pod's connectivity to other components within the cluster. fishapp containers can be used as both client and server, so they can be structured to form either a symmetrical or an asymmetrical mesh. (fishapp = app by m00nf1sh 😄 )
  • /test/e2e/fishapp/dynamic_stack: a dynamically configurable symmetrical mesh based on fishapp (see the struct sketch after this list):
    • $VirtualNodesCount VirtualNodes.
    • $VirtualServicesCount VirtualServices.
    • each VirtualService has $RoutesCountPerVirtualRouter paths: path-0, path-1, ...
    • each path above has $TargetsCountPerRoute targets with equal weight.
    • each VirtualNode is allowed to talk to $BackendsCountPerVirtualNode services.
    • each VirtualNode has $ReplicasPerVirtualNode pods.
    • each pod has two container ports: one is colorApp, intercepted by Envoy and used as the VirtualService backend; the other is an httpProxy not intercepted by Envoy (we use it to test the pod's connectivity to services by doing port-forwards xD).
    • after the stack deploys successfully, for each pod we find its corresponding node's allowed backend services and request each path in those services $ConnectivityCheckPerURL times.
  • test/e2e/fishapp/<other_stacks>: extension point for other stacks based on fishapp.
  • test/e2e/<other_app>/<other_scenario>: extension point for other applications and scenarios.
  • test/e2e/framework: contains support and utility code.
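
For reference, the dynamic stack's dimensions map onto a Go struct roughly like the sketch below. The field names are taken from the test configuration quoted later in this review; the int types and the comments are my additions:

// Sketch of the dynamic stack's configurable dimensions. Field names match
// the DynamicStack configuration shown later in this review; comments added.
type DynamicStack struct {
	ServiceDiscoveryType        shared.ServiceDiscoveryType // DNS or Cloud Map
	VirtualServicesCount        int // VirtualServices in the mesh
	VirtualNodesCount           int // VirtualNodes (each backed by a Deployment)
	RoutesCountPerVirtualRouter int // paths per VirtualService: path-0, path-1, ...
	TargetsCountPerRoute        int // equally weighted targets per path
	BackendsCountPerVirtualNode int // services each VirtualNode may call
	ReplicasPerVirtualNode      int // pods per VirtualNode
	ConnectivityCheckPerURL     int // requests per (pod, service, path) tuple
}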

Current test workflow based on dynamic_stack, for both Cloud Map & DNS (a Ginkgo sketch follows the steps):

  1. Reset the controller/injector to the default ones from the latest helm chart. (yes, it's an e2e test against our helm charts as well xD)
  2. Install a "dynamic stack" called stackOld.
  3. Verify stackOld's behavior (for each pod in this symmetrical mesh, check its connectivity to the other services it is allowed to access).
  4. Upgrade the controller/injector to newly built ones from code using helm (to check that upgrading via helm works fine).
  5. Verify stackOld's behavior again (to check that the new controller doesn't break existing applications).
  6. Install a new "dynamic stack" called stackNew.
  7. Verify stackNew's behavior.
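
A minimal Ginkgo sketch of this workflow; the helpers helmInstallLatest, helmUpgradeFromCode, Deploy, and Check are hypothetical stand-ins, not the PR's actual functions:

// Hypothetical sketch of the upgrade workflow above; helper names are illustrative.
It("keeps an existing mesh working across a helm upgrade", func() {
	helmInstallLatest(f)    // 1. controller/injector from the latest helm chart
	stackOld.Deploy(ctx, f) // 2. install stackOld
	stackOld.Check(ctx, f)  // 3. verify stackOld's behavior
	helmUpgradeFromCode(f)  // 4. upgrade to the newly built controller/injector
	stackOld.Check(ctx, f)  // 5. stackOld must still behave
	stackNew.Deploy(ctx, f) // 6. install stackNew
	stackNew.Check(ctx, f)  // 7. verify stackNew's behavior
})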

How to run tests manually

  1. Run the tests against your cluster with the controller/injector installed: ginkgo -r -v test/e2e/ -- --kubeconfig=<kubeconfig> --cluster-name=<cluster-name> --aws-region=us-west-2 --aws-vpc-id=<vpc-id>
  2. Run the tests against your cluster with the controller/injector installed, plus the upgrade test against a new controller image: ginkgo -r -v test/e2e/ -- --kubeconfig=<kubeconfig> --cluster-name=<cluster-name> --aws-region=us-west-2 --aws-vpc-id=<vpc-id> --controller-image=<new-controller-image>
  3. Run the tests against your cluster with the controller/injector installed, plus the upgrade test against a new injector image: ginkgo -r -v test/e2e/ -- --kubeconfig=<kubeconfig> --cluster-name=<cluster-name> --aws-region=us-west-2 --aws-vpc-id=<vpc-id> --injector-image=<new-injector-image>

CircleCI run:

Here is the plan for running the tests in CI for the controller:

  1. Compile the controller and upload it to ECR as controller: xxxxxx.
  2. Create a cluster with the controller installed with image (controller: xxxxxx) and the latest injector from the helm chart (with the help of PR aws/aws-k8s-tester#81, which allows installing the controller & injector from the latest helm chart while overriding the controller/injector image).
  3. Launch the test suite against the cluster.

Here is the plan for running the tests in CI for the injector:

  1. Compile the injector and upload it to ECR as injector: xxxxxx.
  2. Create a cluster with the injector installed with image (injector: xxxxxx) and the latest controller from the helm chart (with the help of the same PR: aws/aws-k8s-tester#81).
  3. Launch the test suite against the cluster.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@kiranmeduri (Collaborator) left a comment:

Cool! Thanks for introducing these tests.

test/e2e/fishapp/shared/manifest_builder.go: 3 resolved review threads (1 outdated)
"net/http"
)

// NewPortForwarder return a new port forwarder
Collaborator:

Can you expand on this comment, explaining how this is used by the operator?

Contributor Author:

E.g., to test whether pod node-0-xxxx can access service-1 from our code (which runs on a Mac or on CircleCI):

first, we set up a port-forward like localhost:8899 -> node-0-xxxx:8899 (httpProxy runs on 8899)
then we do curl -x "http://localhost:8899" "http://service-1:/path-0"
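
In Go terms, the check amounts to something like the following minimal sketch using only the standard library (not the PR's code; the service port is illustrative):

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// The port-forwarded httpProxy sidecar, reachable on localhost.
	proxyURL, _ := url.Parse("http://localhost:8899")
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
	}
	// The in-cluster service name resolves on the far side of the proxy.
	resp, err := client.Get("http://service-1:8080/path-0")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}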

Contributor:

I think what Kiran meant is to add a little of that explanation to the comment above, which I think is a good idea.

test/e2e/fishapp/shared/dynamic_stack.go: 2 resolved review threads (outdated)
}
cancel()
Expect(<-errChan).To(BeNil())
return urlRespCount
Contributor:

Where are we asserting that a particular percentage of weighted traffic was routed to a specific backend?

Contributor Author:

Added a test based on a chi-square goodness-of-fit check at a p-value threshold of 0.001 :D
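
For readers unfamiliar with the technique: such a check computes the chi-square statistic of the observed per-target response counts against the expected (equal) weights and fails if it exceeds the critical value for the given degrees of freedom. A self-contained sketch with illustrative numbers, not the PR's code:

package main

import "fmt"

// chiSquareStat computes sum((observed-expected)^2/expected) for observed
// counts against expected proportions that sum to 1.
func chiSquareStat(observed []int, expected []float64) float64 {
	total := 0
	for _, o := range observed {
		total += o
	}
	stat := 0.0
	for i, o := range observed {
		e := expected[i] * float64(total)
		d := float64(o) - e
		stat += d * d / e
	}
	return stat
}

func main() {
	// 4 equally weighted targets on one route, so df = 3.
	observed := []int{27, 22, 26, 25}
	expected := []float64{0.25, 0.25, 0.25, 0.25}
	const critical = 16.27 // chi-square critical value for df=3 at p=0.001
	stat := chiSquareStat(observed, expected)
	fmt.Printf("chi2=%.2f, reject equal-weight routing: %v\n", stat, stat > critical)
}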

// =======virtual nodes =========
// vn1 -> vs1,vs2,vs3,vs4
// vn2 -> vs5,vs1,vs2,vs3
// vn3 -> vs4,vs5,vs1,vs2
Contributor:

I have some concerns with structuring test fixtures in this way.

First, the mesh does not represent anything corresponding to a real-world use case. Because of this, it is difficult for readers of the test code to reason about. Having a real-world scenario described in a test fixture enables readers to reasonably understand and explain test assertions.

Secondly, you are always creating a symmetrical mesh. In other words, the "dynamic" stack is designed so that it always produces a mesh where each virtual service is constructed in an identical manner. This will inevitably lead to a greater chance of false-positive test results, because it tends to exclude the edge/corner cases that materialize in scenarios that are "uneven" (i.e. more real-worldish ;) ).

As an example of how this works out, let me take this assertion function example from this patch:

// expects connectivity and routing works correctly
func (s *DynamicStack) ExpectBehavesCorrectly(ctx context.Context, f *framework.Framework) {
	time.Sleep(3 * time.Minute)
	vsByName := make(map[string]*appmeshv1beta1.VirtualService)
	for i := 0; i != s.VirtualServicesCount; i++ {
		vs := s.createdServiceVSs[i]
		vsByName[vs.Name] = vs
	}

	podsConnectivityCheckResult := make(map[string]map[string]map[string]int)
	for i := 0; i != s.VirtualNodesCount; i++ {
		dp := s.createdNodeDPs[i]
		vn := s.createdNodeVNs[i]

		var vsList []*appmeshv1beta1.VirtualService
		for _, backend := range vn.Spec.Backends {
			vsList = append(vsList, vsByName[backend.VirtualService.VirtualServiceName])
		}
		ret := s.checkDeploymentVirtualServiceConnectivity(ctx, f, dp, vsList)
		for k, v := range ret {
			podsConnectivityCheckResult[k] = v
		}
	}
	prettyJSON, err := json.MarshalIndent(podsConnectivityCheckResult, "", "    ")
	Expect(err).NotTo(HaveOccurred())
	utils.Logf("%v", string(prettyJSON))
}

Now, when I read the above, I don't get any semantic or scenario-specific context at all. I don't really know why a particular virtual node or virtual service is expected to exist or not exist. It's just too abstracted from reality.

Instead of defining a generic always-symmetric mesh with abstract node and service names, imagine creating named real-world mesh examples with realistic names:

type MeshFixture interface {
    Name() string
    Deploy() error
    Check() error
    Cleanup() error
}

// A blue-green deployment scenario, implements MeshFixture
type BlueGreenScenario struct {
    webapp *corev1.Deployment
    service *corev1.Service
    virtService *appmeshv1beta1.VirtualService
    routeConfig shared.RouteToWeightedVirtualNodes
    blueNode *appmeshv1beta1.VirtualNode
    greenNode *appmeshv1beta1.VirtualNode
}

func (d *BlueGreenScenario) Deploy() error {
    // Create the Deployment, Service, VirtualService, Mesh and VirtualNode objects
    // NOTE: do NOT create the VirtualRouter yet
}

func (d *BlueGreenScenario) Check() error {
    // Assert that the Mesh, VirtualNodes, VirtualService, Deployment and Service objects exist properly
    ...
    // Create the VirtualRouter which establishes the weighted routing of traffic to
    // the Green virtual node
    ...
    // Check that traffic is flowing to the Green virtual node now
   ...
}

The individual assertions within the Check() method will, by nature of this setup, be much more descriptive. Instead of failure messages looking like this:

	Expect(len(pods.Items)).NotTo(BeZero())
check connectivity of pod 1-a2ka to service 1
Expected 0 to not be 0

to instead be something like this:

	Expect(len(d.webapp.Pods)).NotTo(BeZero())
        checking connectivity of pod green-a2ka to service web-app

Contributor Author (@M00nF1sh), Mar 17, 2020:

Hi Jay, I agree this dynamic test case is not enough.

But the purpose of this PR is to create a test framework and a test case to cover the basic functionality of App Mesh (multiple services / multiple nodes / multiple routes / multiple targets / multiple backends). It's not intended to cover all E2E scenarios given the timeline I have.
The dynamic stack is designed in a way that makes it reusable for scale tests as well, allowing scale tests with configurable dimensions. (I found an App Mesh dataplane bug with it 😄)

There are more test cases yet to be covered, e.g. add/remove route, change weights, http2, grpc, tls, retry, etc.

The test framework is structured in a way that makes adding new test cases easy.
e.g.

  • the folder test/e2e/fishapp is specific to this colorApp+httpProxy container; it can be used to test any HTTP-based App Mesh scenario.
  • the file test/e2e/fishapp/dynamic_stack is specific to this symmetric mesh based on "fishapp".
  • [extension point] more tests based on this "fishapp" container (which can be used as client, server, or both at the same time), e.g. test/e2e/fishapp/bluegreen_stack.
  • [extension point] other test scenarios that don't rely on "fishapp" can be added, like test/e2e/color-app, test/e2e/bluegreen-app, etc.

BTW, I didn't design a MeshFixture-like interface because these tests tend to be highly use-case dependent; some scenario-based tests may have deploy & test steps interleaved, so there is no value added in having such an interface.

Contributor:

> But the purpose of this PR is to create a test framework and a test case to cover the basic functionality of App Mesh (multiple services / multiple nodes / multiple routes / multiple targets / multiple backends). It's not intended to cover all E2E scenarios given the timeline I have.

I didn't say that I expected you to cover all possible e2e scenarios :)

> The dynamic stack is designed in a way that makes it reusable for scale tests as well, allowing scale tests with configurable dimensions. (I found an App Mesh dataplane bug with it)

Yes, I understand that, and the dynamic stack fixture you've created here is a good way to do fuzz, edge/negative, and stress testing. No problem with that.

What I would like to see in a followup PR is an approach that covers the standard use cases/scenarios that App Mesh is designed to cover (blue/green, A/B, etc.) in a manner that makes it easier for readers of the test cases to figure out expectations. Again, it's cool what you've done here. I just think a future enhancement to simplify the scenarios would be useful.

> There are more test cases yet to be covered, e.g. add/remove route, change weights, http2, grpc, tls, retry, etc.

Yes, agreed completely.

> The test framework is structured in a way that makes adding new test cases easy (the fishapp folder, the dynamic_stack file, and the extension points listed above).

Yep, no disagreement on this.

> BTW, I didn't design a MeshFixture-like interface because these tests tend to be highly use-case dependent; some scenario-based tests may have deploy & test steps interleaved, so there is no value added in having such an interface.

My suggestion is: always code to an interface, not a concrete struct. It makes things more extensible and easier to reason about in the long run. :)

Contributor Author:

😄 TBH, my original version of the E2E test used a blue-green deployment color-app. But it couldn't cover the case of multiple paths and multiple virtual services, so I took this dynamic symmetrical mesh approach instead. (I was also asked to script my scale tests so they could be reused xD)

Yeah, we can add more real-world e2e tests to this in the future :D

test/e2e/fishapp/dynamic_stack.go: resolved review thread
dynamicStack.ExpectBehavesCorrectly(ctx, f)
var ans int
fmt.Println("press any thing to continue cleanup")
_, _ = fmt.Scan(&ans)
Contributor:

Is there a reason to add this manual step here?

Contributor Author:

nice catch, it was there to hold the test cleanup during my debugging for troubleshooting; removed :D
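
If a pause like this is ever needed again, one option is to gate it behind a flag so CI never blocks on stdin. The flag name and helper below are hypothetical, not part of the PR:

import (
	"flag"
	"fmt"
)

// Hypothetical: pause before cleanup only when explicitly requested.
var holdBeforeCleanup = flag.Bool("hold-before-cleanup", false,
	"pause before cleanup so the deployed stack can be inspected")

func maybeHoldForDebug() {
	if *holdBeforeCleanup {
		fmt.Println("press enter to continue cleanup")
		_, _ = fmt.Scanln()
	}
}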

if errs := s.deleteMeshAndNamespace(ctx, f); len(errs) != 0 {
deletionErrors = append(deletionErrors, errs...)
}
Expect(len(deletionErrors)).To(BeZero())
Contributor:

What happens if the cleanup step fails? Do resources need to be manually removed, or can you retry cleanup only?

Contributor Author:

The cleanup steps are independent of each other (failed steps will report errors).
And yes, if resources are left over, we need to remove them manually. I don't think adding retries helps before we've seen resource-leak scenarios (like apiserver crashes, etc.).
We'll monitor test failures and use a dedicated account to run the tests. If we see resource leaks, we'll add handling for them accordingly.

Comment on lines 49 to 56
ServiceDiscoveryType: shared.CloudMapServiceDiscovery,
VirtualServicesCount: 5,
VirtualNodesCount: 5,
RoutesCountPerVirtualRouter: 2,
TargetsCountPerRoute: 4,
BackendsCountPerVirtualNode: 2,
ReplicasPerVirtualNode: 3,
ConnectivityCheckPerURL: 100,
Contributor:

Non-blocking: I think it would be nice to be able to define these via a template or another argument format, so they can be modified quickly and the tests run with different configurations.

Contributor Author:

Thanks, I think it's a valuable idea.
However, since the tests are highly dynamic, I think code itself is better than a template :D (e.g. we can have multiple tests with different dimension combinations specified; currently I've only specified one).
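
One lightweight way to get quick overrides without a template is reading the dimensions from environment variables. The variable name and helper below are illustrative, not part of the PR:

import (
	"os"
	"strconv"
)

// intFromEnv returns key's integer value, or def when unset or invalid.
func intFromEnv(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// Usage: stack.VirtualNodesCount = intFromEnv("E2E_VIRTUAL_NODES_COUNT", 5)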

Comment on lines 1 to 17
package utils

import (
"fmt"
"time"

"github.com/onsi/ginkgo"
)

func nowStamp() string {
return time.Now().Format(time.StampMilli)
}

func log(level string, format string, args ...interface{}) {
fmt.Fprintf(ginkgo.GinkgoWriter, nowStamp()+": "+level+": "+format+"\n", args...)
}

func Logf(format string, args ...interface{}) {
log("INFO", format, args...)
}

func Failf(format string, args ...interface{}) {
msg := fmt.Sprintf(format, args...)
ginkgo.Fail(msg)
}
Contributor:

Possible to just use something like logrus to eliminate needing this?

Contributor Author:

hmm, do we need structured logging in test cases? This could be an improvement :D

Contributor:

Structured as in JSON: no, probably not, but logrus's default format is highly similar to what you've defined here.

Contributor Author:

switched to using zap xD
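
For reference, a minimal sketch of pointing zap at Ginkgo's writer so log lines stay interleaved with the running spec's output (not necessarily the PR's exact code):

package utils

import (
	"github.com/onsi/ginkgo"
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// NewGinkgoLogger builds a zap logger that writes to ginkgo.GinkgoWriter.
func NewGinkgoLogger() *zap.Logger {
	encoder := zapcore.NewConsoleEncoder(zap.NewDevelopmentEncoderConfig())
	core := zapcore.NewCore(encoder, zapcore.AddSync(ginkgo.GinkgoWriter), zapcore.DebugLevel)
	return zap.New(core)
}

// Usage: NewGinkgoLogger().Sugar().Infof("checked %d URLs", n)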

@M00nF1sh force-pushed the e2e branch 14 times, most recently from 4c85b7c to 4dfedc7 on March 18, 2020 03:17
test/e2e/framework/resource/deployment/manager.go: resolved review thread (outdated)
"net/http"
)

// NewPortForwarder return a new port forwarder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what Kiran meant is to add a little of the explanation to the comment above--which I think is a good idea.


for _, sdType := range []shared.ServiceDiscoveryType{shared.DNSServiceDiscovery, shared.CloudMapServiceDiscovery} {
func(sdType shared.ServiceDiscoveryType) {
It(fmt.Sprintf("should behaves correctly with service discovery type %v", sdType), func() {
Contributor:

"should behave correctly" is a bit vague, can you 1. split this into multiple tests and 2. improve the description to something like "should helm install and upgrade for controller and injector".

Contributor Author:

@nckturner what do you mean by splitting this into multiple tests?
There are two tests currently, DNS and CloudMap, and within each test unit the steps are sequential, so they are not really splittable:
install old controller => deploy stack => verify behavior =>
install new controller => verify behavior against old stack =>
deploy new stack => verify behavior against new stack

Also, for the behavior, the test just checks connectivity and result distribution.

@jaypipes (Contributor) left a comment:

Lots of code here, but everything looks good to me. I would love to see some non-dynamically-generated test scenarios in the future, but we can work on that in a future enhancement. Nice work, @M00nF1sh!

          name: go mod download
          command: go mod download
      - save_cache:
          key: go-mod-{{ checksum "go.sum" }}
Contributor:

nice work above with the go mod caching. 👍

Contributor Author (@M00nF1sh), Mar 18, 2020:

😄, the only problem is we don't get Docker layer caching for free..

func NewForConfig(c *rest.Config) (*Clientset, error) {
configShallowCopy := *c
if configShallowCopy.RateLimiter == nil && configShallowCopy.QPS > 0 {
if configShallowCopy.Burst <= 0 {
return nil, fmt.Errorf("Burst is required to be greater than 0 when RateLimiter is not set and QPS is set to greater than 0")
}
configShallowCopy.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(configShallowCopy.QPS, configShallowCopy.Burst)
Contributor:

Note that you could have put these changes into an entirely separate PR/commit, since the above are not directly related to the purpose of this PR.

Contributor Author:

nice catch :D. I reverted the version change to the controller-gen tools 😄

build_push_controller_image() {
declare -r image_repo="$1" image_tag="$2" image_name="$3"

ecr::ensure_repository "${image_repo}"
Contributor:

hmm, the above looks familiar :)

Contributor Author:

haha, I stole this ensure-repository function from the CNI repo :D

setup_cluster
test_controller_image "${image_name}"
}

Contributor:

Good stuff in this script file. Nice and clean. ++

Contributor Author:

😃

@M00nF1sh changed the title from "Add e2e tests" to "Added E2E tests framework and test cases." on Mar 18, 2020
@M00nF1sh changed the title from "Added E2E tests framework and test cases." to "Added E2E tests framework and test cases" on Mar 18, 2020
@M00nF1sh closed this on Mar 18, 2020
@M00nF1sh reopened this on Mar 18, 2020