Improvement for flow-visibility e2e test #5770

yuntanghsu · 2023-12-05T00:40:14Z

In this PR, we do:

Changed the order where we append expired records before exporting them from our exporter.
For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. For the ingress case, there is no issue as we will receive the records from both nodes and these two records are both correlationRequired. For the egress case, the record from the conntrack table is correlationRequired and the record from PacketIn is not correlationRequired. If the record from the conntrack table arrive at the FA first, then the record will need to do correlation at FA. The ReadyToSend will be true forever as it keeps waiting to do correlation.
Extend recordmetrics endpoint in FA API to check if Flow Aggregator can successfully connect to the "ClickHouse"/"IPFIX collector Pod"/"Antrea-Agent Pods" before sending traffic.
Add labels to External subtest to filter useless logs from the IPFIX collector Pod.
Confirm the correct addition of a label to a specific Pod after updating the Pod.
Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse.
Use new image from go-ipfix. We improve the IPFIX collector by:
a. Disable printing records whenever we receive it. Instead, we store records in a string array.
b. Add http listener and handler to receive request to return or reset records.
In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records
inside it.

Note:

As can be seen from the two pictures below, the time difference between the two attempts to fetch the ipfix log is > 3.5 seconds.

It is ~1.2s for single stack cluster:

Time for fetching the log from IPFIX collector is decreased to ~80ms. (vmware/go-ipfix#338)

Related PR:
vmware/go-ipfix#338

resolve #5730

antoninbas

LGTM

antoninbas · 2024-01-04T21:47:15Z

/test-all

heanlan

We don't need to update go-ipfix version in the go.mod file?

pkg/agent/flowexporter/exporter/exporter.go

test/e2e/flowaggregator_test.go

yuntanghsu · 2024-01-04T22:21:57Z

We don't need to update go-ipfix version in the go.mod file?

We just need the updated collector image. But we can update the version as well.

In this commit, we do: 1. Changed the order where we append expired records before exporting them from our exporter. For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. For the ingress case, there is no issue as we will receive the records from both nodes and these two records are both correlationRequired. For the egress case, the record from the conntrack table is correlationRequired and the record from PacketIn is not correlationRequired. If the record from the conntrack table arrive at the FA first, then the record will need to do correlation at FA. The ReadyToSend will be true forever as it keeps waiting to do correlation. 2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic. 3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic. 4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 5. Confirm the correct addition of a label to a specific Pod after updating the Pod. 6. Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse. 7. Use new image from go-ipfix. We improve the IPFIX collector by: a. Disable printing records whenever we receive it. Instead, we store records in a string array. b. Add http listener and handler to receive request to return or reset records. In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records inside it. Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>

heanlan

LGTM

antoninbas · 2024-01-05T23:26:27Z

/test-all

antoninbas · 2024-01-08T18:03:38Z

test/e2e/flowaggregator_test.go

+		err = wait.Poll(defaultInterval, timeout, func() (bool, error) {
+			pod, err := data.clientset.CoreV1().Pods(data.testNamespace).Get(context.TODO(), testPod.Name, metav1.GetOptions{})
+			if err != nil {
+				if errors.IsNotFound(err) {
+					return false, nil
+				}
+				return false, fmt.Errorf("error when getting Pod '%s': %w", pod.Name, err)
+			}
+			return pod.Labels["targetLabel"] == label, nil
+		})
+		require.NoErrorf(t, err, "Error when verifying the label on %s", testPod.Name)


I'm going to merge the PR as is, but I don't think this is useful. It should not be possible to have a successful Update call, yet see the old object when calling Get afterwards.

antoninbas · 2024-01-11T16:59:54Z

@yuntanghsu I wonder if we should backport this to release-1.14. I believe visibility tests are also failing for that branch.
What do you think?

yuntanghsu · 2024-01-11T19:27:14Z

But this PR depends on a couple of PRs:

Service information of PacketIn record is changed in this PR: Remove redundant log in fillPodInfo/fillServiceInfo and update deny connection store. #5592
Labels check is added in this PR: Use labels to enhance records filtering in FA e2e tests #5731

If we want to resolve the e2e test issue and increase its stability, I think we should also backport all the e2e related PR.

#5592
#5704 (optional)
#5731
#5755 (optional)
#5770
#5802 (optional)

antoninbas · 2024-01-11T19:29:34Z

hmm that's a lot. Do we know which PR specifically "broke" e2e visibility tests?

yuntanghsu · 2024-01-11T19:36:46Z

Should be this one #5592

That is because we are trying to get the CORRECT records sent in each subtests. Previous e2e test had lots of false positive cases -> we were using records from the previous subtests to verify the information in the current subtest.

I think the error in release-1.14 is because of its instability instead of the reason I describe above.
It could be fixed by this PR. But I'm not sure how to exclude the label related part and only backport the remaining part?

antoninbas · 2024-01-16T18:19:12Z

If the issue is just with the e2e tests, and not functional, then we could consider disabling the workflow for the release-1.14 branch?

antrea/.github/workflows/kind.yml

Line 394 in e469bd9

test-e2e-flow-visibility:

But I was also wondering: are there some parts of this PR which should be backported because they impact functionality? I'm thinking of this in particular:

Changed the order where we append expired records before exporting them from our exporter

yuntanghsu · 2024-01-16T20:11:40Z

I think that one is the only one that impacts the functionality of our flow-exporter. Other changes like antctl/ipfix-image are mainly for e2e test only.

Do we want to backport the changes for connectionStore in this PR (#5592) as well?

antoninbas · 2024-01-16T21:12:03Z

#5592 was marked as backport, so yes you should backport it now (we will release 1.14.2 at the end of the week)
I suggest that you backport all necessary PRs together in a single cherry-pick (it's supported by the script). This way we only need to review and test 1 PR, and they are all related.

1. Changed the order in which we append expired records before exporting them from our exporter. For inter-Node traffic with egress/ingress NP with action drop, we will receive records from both PacketIn and the conntrack table. For the egress case, we need to ensure that the PacketIn record arrives at the FA first, or we will be stuck waiting for correlation. 2. Add checks to verify that records can successfully be sent by the Flow Exporter before sending traffic in e2e tests. 3. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 4. Confirm the correct addition of a label to a specific Pod after updating the Pod. 5. Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse. 6. Use new version of go-ipfix test collector. The new collector no longer prints all received records. Instead records are saved in-memory, and can be queried over HTTP. The list of records can also be reset with an HTTP call. In this way, we can drastically reduce the time to retrieve records in tests. Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>

yuntanghsu marked this pull request as draft December 5, 2023 00:40

yuntanghsu force-pushed the e2e_test branch 11 times, most recently from 430fd57 to 145f01b Compare December 6, 2023 22:24

yuntanghsu changed the title ~~test~~ Remove octetDeltaCount check in flow-visibility e2e test Dec 6, 2023

yuntanghsu force-pushed the e2e_test branch 2 times, most recently from ea415fd to 2cc3ebc Compare December 7, 2023 00:00

yuntanghsu marked this pull request as ready for review December 8, 2023 03:30

yuntanghsu force-pushed the e2e_test branch 2 times, most recently from 420021e to 9e72433 Compare December 8, 2023 09:37

yuntanghsu marked this pull request as draft December 8, 2023 09:51

yuntanghsu force-pushed the e2e_test branch 11 times, most recently from a204e48 to b05d69f Compare December 9, 2023 18:11

yuntanghsu force-pushed the e2e_test branch 2 times, most recently from 62ff0d1 to 2c3ef64 Compare January 4, 2024 21:29

antoninbas previously approved these changes Jan 4, 2024

View reviewed changes

heanlan reviewed Jan 4, 2024

View reviewed changes

pkg/agent/flowexporter/exporter/exporter.go Outdated Show resolved Hide resolved

test/e2e/flowaggregator_test.go Show resolved Hide resolved

test/e2e/flowaggregator_test.go Show resolved Hide resolved

yuntanghsu dismissed antoninbas’s stale review via ff74bbf January 4, 2024 22:23

yuntanghsu force-pushed the e2e_test branch 2 times, most recently from ff74bbf to 3bd8dce Compare January 4, 2024 22:29

yuntanghsu force-pushed the e2e_test branch from 3bd8dce to 1bfce29 Compare January 5, 2024 21:55

heanlan approved these changes Jan 5, 2024

View reviewed changes

antoninbas approved these changes Jan 5, 2024

View reviewed changes

antoninbas reviewed Jan 8, 2024

View reviewed changes

antoninbas merged commit bc7dcdf into antrea-io:main Jan 8, 2024
47 of 52 checks passed

yuntanghsu deleted the e2e_test branch January 8, 2024 19:21

yuntanghsu mentioned this pull request Jan 22, 2024

Cherry pick of #5592: Remove redundant log in fillPodInfo/fillServiceInfo, #5731: Use labels to enhance records filtering in FA e2e tests, #5770: Improve flow-visibility e2e test on upstream release 1.14 #5900

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvement for flow-visibility e2e test #5770

Improvement for flow-visibility e2e test #5770

yuntanghsu commented Dec 5, 2023 •

edited

Loading

antoninbas left a comment

antoninbas commented Jan 4, 2024

heanlan left a comment

yuntanghsu commented Jan 4, 2024

heanlan left a comment

antoninbas commented Jan 5, 2024

antoninbas Jan 8, 2024

antoninbas commented Jan 11, 2024

yuntanghsu commented Jan 11, 2024 •

edited

Loading

antoninbas commented Jan 11, 2024

yuntanghsu commented Jan 11, 2024

antoninbas commented Jan 16, 2024 •

edited

Loading

yuntanghsu commented Jan 16, 2024

antoninbas commented Jan 16, 2024

Improvement for flow-visibility e2e test #5770

Improvement for flow-visibility e2e test #5770

Conversation

yuntanghsu commented Dec 5, 2023 • edited Loading

antoninbas left a comment

Choose a reason for hiding this comment

antoninbas commented Jan 4, 2024

heanlan left a comment

Choose a reason for hiding this comment

yuntanghsu commented Jan 4, 2024

heanlan left a comment

Choose a reason for hiding this comment

antoninbas commented Jan 5, 2024

antoninbas Jan 8, 2024

Choose a reason for hiding this comment

antoninbas commented Jan 11, 2024

yuntanghsu commented Jan 11, 2024 • edited Loading

antoninbas commented Jan 11, 2024

yuntanghsu commented Jan 11, 2024

antoninbas commented Jan 16, 2024 • edited Loading

yuntanghsu commented Jan 16, 2024

antoninbas commented Jan 16, 2024

yuntanghsu commented Dec 5, 2023 •

edited

Loading

yuntanghsu commented Jan 11, 2024 •

edited

Loading

antoninbas commented Jan 16, 2024 •

edited

Loading