Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement for flow-visibility e2e test #5770

Merged
merged 1 commit into from
Jan 8, 2024

Conversation

yuntanghsu
Copy link
Contributor

@yuntanghsu yuntanghsu commented Dec 5, 2023

In this PR, we do:

  1. Changed the order where we append expired records before exporting them from our exporter.
    For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. For the ingress case, there is no issue as we will receive the records from both nodes and these two records are both correlationRequired. For the egress case, the record from the conntrack table is correlationRequired and the record from PacketIn is not correlationRequired. If the record from the conntrack table arrive at the FA first, then the record will need to do correlation at FA. The ReadyToSend will be true forever as it keeps waiting to do correlation.

  2. Extend recordmetrics endpoint in FA API to check if Flow Aggregator can successfully connect to the "ClickHouse"/"IPFIX collector Pod"/"Antrea-Agent Pods" before sending traffic.

  3. Add labels to External subtest to filter useless logs from the IPFIX collector Pod.

  4. Confirm the correct addition of a label to a specific Pod after updating the Pod.

  5. Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse.

  6. Use new image from go-ipfix. We improve the IPFIX collector by:
    a. Disable printing records whenever we receive it. Instead, we store records in a string array.
    b. Add http listener and handler to receive request to return or reset records.
    In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records
    inside it.

Note:

As can be seen from the two pictures below, the time difference between the two attempts to fetch the ipfix log is > 3.5 seconds.
Screenshot 2023-12-05 at 11 34 00 PM
Screenshot 2023-12-05 at 10 49 11 PM

It is ~1.2s for single stack cluster:
image

Time for fetching the log from IPFIX collector is decreased to ~80ms. (vmware/go-ipfix#338)
image

Related PR:
vmware/go-ipfix#338

resolve #5730

@yuntanghsu yuntanghsu marked this pull request as draft December 5, 2023 00:40
@yuntanghsu yuntanghsu force-pushed the e2e_test branch 11 times, most recently from 430fd57 to 145f01b Compare December 6, 2023 22:24
@yuntanghsu yuntanghsu changed the title test Remove octetDeltaCount check in flow-visibility e2e test Dec 6, 2023
@yuntanghsu yuntanghsu force-pushed the e2e_test branch 2 times, most recently from ea415fd to 2cc3ebc Compare December 7, 2023 00:00
@yuntanghsu yuntanghsu marked this pull request as ready for review December 8, 2023 03:30
@yuntanghsu yuntanghsu force-pushed the e2e_test branch 2 times, most recently from 420021e to 9e72433 Compare December 8, 2023 09:37
@yuntanghsu yuntanghsu marked this pull request as draft December 8, 2023 09:51
@yuntanghsu yuntanghsu force-pushed the e2e_test branch 11 times, most recently from a204e48 to b05d69f Compare December 9, 2023 18:11
@yuntanghsu yuntanghsu force-pushed the e2e_test branch 2 times, most recently from 62ff0d1 to 2c3ef64 Compare January 4, 2024 21:29
antoninbas
antoninbas previously approved these changes Jan 4, 2024
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor

/test-all

Copy link
Contributor

@heanlan heanlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to update go-ipfix version in the go.mod file?

pkg/agent/flowexporter/exporter/exporter.go Outdated Show resolved Hide resolved
test/e2e/flowaggregator_test.go Show resolved Hide resolved
test/e2e/flowaggregator_test.go Show resolved Hide resolved
@yuntanghsu
Copy link
Contributor Author

We don't need to update go-ipfix version in the go.mod file?

We just need the updated collector image. But we can update the version as well.

In this commit, we do:

1. Changed the order where we append expired records before exporting them from our exporter.
For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. For the ingress case, there is no issue as we will receive the records from both nodes and these two records are both correlationRequired. For the egress case, the record from the conntrack table is correlationRequired and the record from PacketIn is not correlationRequired. If the record from the conntrack table arrive at the FA first, then the record will need to do correlation at FA. The ReadyToSend will be true forever as it keeps waiting to do correlation.

2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic.

3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic.

4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod.

5. Confirm the correct addition of a label to a specific Pod after updating the Pod.

6. Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse.

7. Use new image from go-ipfix. We improve the IPFIX collector by:
    a.  Disable printing records whenever we receive it. Instead, we store records in a string array.
    b. Add http listener and handler to receive request to return or reset records.
    In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records
    inside it.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
Copy link
Contributor

@heanlan heanlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor

/test-all

Comment on lines +1775 to +1785
err = wait.Poll(defaultInterval, timeout, func() (bool, error) {
pod, err := data.clientset.CoreV1().Pods(data.testNamespace).Get(context.TODO(), testPod.Name, metav1.GetOptions{})
if err != nil {
if errors.IsNotFound(err) {
return false, nil
}
return false, fmt.Errorf("error when getting Pod '%s': %w", pod.Name, err)
}
return pod.Labels["targetLabel"] == label, nil
})
require.NoErrorf(t, err, "Error when verifying the label on %s", testPod.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to merge the PR as is, but I don't think this is useful. It should not be possible to have a successful Update call, yet see the old object when calling Get afterwards.

@antoninbas antoninbas merged commit bc7dcdf into antrea-io:main Jan 8, 2024
47 of 52 checks passed
@yuntanghsu yuntanghsu deleted the e2e_test branch January 8, 2024 19:21
@antoninbas
Copy link
Contributor

@yuntanghsu I wonder if we should backport this to release-1.14. I believe visibility tests are also failing for that branch.
What do you think?

@yuntanghsu
Copy link
Contributor Author

yuntanghsu commented Jan 11, 2024

But this PR depends on a couple of PRs:

  1. Service information of PacketIn record is changed in this PR: Remove redundant log in fillPodInfo/fillServiceInfo and update deny connection store. #5592
  2. Labels check is added in this PR: Use labels to enhance records filtering in FA e2e tests #5731

If we want to resolve the e2e test issue and increase its stability, I think we should also backport all the e2e related PR.

#5592
#5704 (optional)
#5731
#5755 (optional)
#5770
#5802 (optional)

@antoninbas
Copy link
Contributor

hmm that's a lot. Do we know which PR specifically "broke" e2e visibility tests?

@yuntanghsu
Copy link
Contributor Author

Should be this one #5592

That is because we are trying to get the CORRECT records sent in each subtests. Previous e2e test had lots of false positive cases -> we were using records from the previous subtests to verify the information in the current subtest.

I think the error in release-1.14 is because of its instability instead of the reason I describe above.
It could be fixed by this PR. But I'm not sure how to exclude the label related part and only backport the remaining part?

@antoninbas
Copy link
Contributor

antoninbas commented Jan 16, 2024

If the issue is just with the e2e tests, and not functional, then we could consider disabling the workflow for the release-1.14 branch?

test-e2e-flow-visibility:

But I was also wondering: are there some parts of this PR which should be backported because they impact functionality? I'm thinking of this in particular:

Changed the order where we append expired records before exporting them from our exporter

@yuntanghsu
Copy link
Contributor Author

I think that one is the only one that impacts the functionality of our flow-exporter. Other changes like antctl/ipfix-image are mainly for e2e test only.

Do we want to backport the changes for connectionStore in this PR (#5592) as well?

@antoninbas
Copy link
Contributor

#5592 was marked as backport, so yes you should backport it now (we will release 1.14.2 at the end of the week)
I suggest that you backport all necessary PRs together in a single cherry-pick (it's supported by the script). This way we only need to review and test 1 PR, and they are all related.

yuntanghsu added a commit to yuntanghsu/antrea that referenced this pull request Jan 22, 2024
1. Changed the order in which we append expired records before exporting
   them from our exporter. For inter-Node traffic with egress/ingress NP
   with action drop, we will receive records from both PacketIn and the
   conntrack table. For the egress case, we need to ensure that the
   PacketIn record arrives at the FA first, or we will be stuck waiting
   for correlation.

2. Add checks to verify that records can successfully be sent by the
   Flow Exporter before sending traffic in e2e tests.

3. Add labels to External subtest to filter useless logs from the IPFIX
   collector Pod.

4. Confirm the correct addition of a label to a specific Pod after
   updating the Pod.

5. Remove the octetDeltaCount check and, instead, filter out all records
   with octetDeltaCount=0 when retrieving records from the IPFIX
   collector Pod and ClickHouse.

6. Use new version of go-ipfix test collector. The new collector no
    longer prints all received records. Instead records are saved
    in-memory, and can be queried over HTTP. The list of records can
    also be reset with an HTTP call. In this way, we can drastically
    reduce the time to retrieve records in tests.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
antoninbas pushed a commit that referenced this pull request Jan 23, 2024
1. Changed the order in which we append expired records before exporting
   them from our exporter. For inter-Node traffic with egress/ingress NP
   with action drop, we will receive records from both PacketIn and the
   conntrack table. For the egress case, we need to ensure that the
   PacketIn record arrives at the FA first, or we will be stuck waiting
   for correlation.

2. Add checks to verify that records can successfully be sent by the
   Flow Exporter before sending traffic in e2e tests.

3. Add labels to External subtest to filter useless logs from the IPFIX
   collector Pod.

4. Confirm the correct addition of a label to a specific Pod after
   updating the Pod.

5. Remove the octetDeltaCount check and, instead, filter out all records
   with octetDeltaCount=0 when retrieving records from the IPFIX
   collector Pod and ClickHouse.

6. Use new version of go-ipfix test collector. The new collector no
    longer prints all received records. Instead records are saved
    in-memory, and can be queried over HTTP. The list of records can
    also be reset with an HTTP call. In this way, we can drastically
    reduce the time to retrieve records in tests.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Test] "E2e tests on a Kind cluster on Linux for Flow Visibility" job keeps failing
4 participants