-
Notifications
You must be signed in to change notification settings - Fork 363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvement for flow-visibility e2e test #5770
Conversation
430fd57
to
145f01b
Compare
ea415fd
to
2cc3ebc
Compare
420021e
to
9e72433
Compare
a204e48
to
b05d69f
Compare
62ff0d1
to
2c3ef64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to update go-ipfix version in the go.mod file?
We just need the updated collector image. But we can update the version as well. |
ff74bbf
to
3bd8dce
Compare
In this commit, we do: 1. Changed the order where we append expired records before exporting them from our exporter. For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. For the ingress case, there is no issue as we will receive the records from both nodes and these two records are both correlationRequired. For the egress case, the record from the conntrack table is correlationRequired and the record from PacketIn is not correlationRequired. If the record from the conntrack table arrive at the FA first, then the record will need to do correlation at FA. The ReadyToSend will be true forever as it keeps waiting to do correlation. 2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic. 3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic. 4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 5. Confirm the correct addition of a label to a specific Pod after updating the Pod. 6. Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse. 7. Use new image from go-ipfix. We improve the IPFIX collector by: a. Disable printing records whenever we receive it. Instead, we store records in a string array. b. Add http listener and handler to receive request to return or reset records. In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records inside it. Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
err = wait.Poll(defaultInterval, timeout, func() (bool, error) { | ||
pod, err := data.clientset.CoreV1().Pods(data.testNamespace).Get(context.TODO(), testPod.Name, metav1.GetOptions{}) | ||
if err != nil { | ||
if errors.IsNotFound(err) { | ||
return false, nil | ||
} | ||
return false, fmt.Errorf("error when getting Pod '%s': %w", pod.Name, err) | ||
} | ||
return pod.Labels["targetLabel"] == label, nil | ||
}) | ||
require.NoErrorf(t, err, "Error when verifying the label on %s", testPod.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to merge the PR as is, but I don't think this is useful. It should not be possible to have a successful Update
call, yet see the old object when calling Get
afterwards.
@yuntanghsu I wonder if we should backport this to release-1.14. I believe visibility tests are also failing for that branch. |
But this PR depends on a couple of PRs:
If we want to resolve the e2e test issue and increase its stability, I think we should also backport all the e2e related PR. #5592 |
hmm that's a lot. Do we know which PR specifically "broke" e2e visibility tests? |
Should be this one #5592 That is because we are trying to get the CORRECT records sent in each subtests. Previous e2e test had lots of false positive cases -> we were using records from the previous subtests to verify the information in the current subtest. I think the error in release-1.14 is because of its instability instead of the reason I describe above. |
If the issue is just with the e2e tests, and not functional, then we could consider disabling the workflow for the release-1.14 branch? antrea/.github/workflows/kind.yml Line 394 in e469bd9
But I was also wondering: are there some parts of this PR which should be backported because they impact functionality? I'm thinking of this in particular:
|
I think that one is the only one that impacts the functionality of our flow-exporter. Other changes like antctl/ipfix-image are mainly for e2e test only. Do we want to backport the changes for connectionStore in this PR (#5592) as well? |
#5592 was marked as backport, so yes you should backport it now (we will release 1.14.2 at the end of the week) |
1. Changed the order in which we append expired records before exporting them from our exporter. For inter-Node traffic with egress/ingress NP with action drop, we will receive records from both PacketIn and the conntrack table. For the egress case, we need to ensure that the PacketIn record arrives at the FA first, or we will be stuck waiting for correlation. 2. Add checks to verify that records can successfully be sent by the Flow Exporter before sending traffic in e2e tests. 3. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 4. Confirm the correct addition of a label to a specific Pod after updating the Pod. 5. Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse. 6. Use new version of go-ipfix test collector. The new collector no longer prints all received records. Instead records are saved in-memory, and can be queried over HTTP. The list of records can also be reset with an HTTP call. In this way, we can drastically reduce the time to retrieve records in tests. Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
1. Changed the order in which we append expired records before exporting them from our exporter. For inter-Node traffic with egress/ingress NP with action drop, we will receive records from both PacketIn and the conntrack table. For the egress case, we need to ensure that the PacketIn record arrives at the FA first, or we will be stuck waiting for correlation. 2. Add checks to verify that records can successfully be sent by the Flow Exporter before sending traffic in e2e tests. 3. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 4. Confirm the correct addition of a label to a specific Pod after updating the Pod. 5. Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse. 6. Use new version of go-ipfix test collector. The new collector no longer prints all received records. Instead records are saved in-memory, and can be queried over HTTP. The list of records can also be reset with an HTTP call. In this way, we can drastically reduce the time to retrieve records in tests. Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
In this PR, we do:
Changed the order where we append expired records before exporting them from our exporter.
For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. For the ingress case, there is no issue as we will receive the records from both nodes and these two records are both correlationRequired. For the egress case, the record from the conntrack table is correlationRequired and the record from PacketIn is not correlationRequired. If the record from the conntrack table arrive at the FA first, then the record will need to do correlation at FA. The ReadyToSend will be true forever as it keeps waiting to do correlation.
Extend recordmetrics endpoint in FA API to check if Flow Aggregator can successfully connect to the "ClickHouse"/"IPFIX collector Pod"/"Antrea-Agent Pods" before sending traffic.
Add labels to External subtest to filter useless logs from the IPFIX collector Pod.
Confirm the correct addition of a label to a specific Pod after updating the Pod.
Remove the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod and ClickHouse.
Use new image from go-ipfix. We improve the IPFIX collector by:
a. Disable printing records whenever we receive it. Instead, we store records in a string array.
b. Add http listener and handler to receive request to return or reset records.
In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records
inside it.
Note:
As can be seen from the two pictures below, the time difference between the two attempts to fetch the ipfix log is > 3.5 seconds.
It is ~1.2s for single stack cluster:
Time for fetching the log from IPFIX collector is decreased to ~80ms. (vmware/go-ipfix#338)
Related PR:
vmware/go-ipfix#338
resolve #5730