Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve IPFIX collector #338

Merged
merged 1 commit into from
Dec 15, 2023
Merged

Improve IPFIX collector #338

merged 1 commit into from
Dec 15, 2023

Conversation

yuntanghsu
Copy link
Contributor

@yuntanghsu yuntanghsu commented Dec 12, 2023

In this PR:

  1. We disable printing records whenever we receive it. Instead, we store records in a string array.
  2. Add http listener and handler to receive request to return or reset records.

The reason we do this change is because we want to reduce the time we use to retrieve logs.
Previously, using kubectl logs will take lots of time to get the logs. And the time depends on the number of records inside the IPFIX collector Pod. The time range is ~0.8s to 4s. After above changes, the time is reduced to ~80ms.

Functionality has been tested in this PR: https://github.com/antrea-io/antrea/actions/runs/7201229562/job/19617317869?pr=5770

Copy link

codecov bot commented Dec 12, 2023

Codecov Report

Merging #338 (5a11b76) into main (092fb4b) will decrease coverage by 0.15%.
Report is 1 commits behind head on main.
The diff coverage is n/a.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #338      +/-   ##
==========================================
- Coverage   72.87%   72.73%   -0.15%     
==========================================
  Files          19       19              
  Lines        2853     2853              
==========================================
- Hits         2079     2075       -4     
- Misses        602      604       +2     
- Partials      172      174       +2     
Flag Coverage Δ
integration-tests 50.65% <ø> (-2.25%) ⬇️
unit-tests 71.74% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 1 file with indirect coverage changes

cmd/collector/collector.go Outdated Show resolved Hide resolved
cmd/collector/collector.go Show resolved Hide resolved
w.Write(jsonData)
}

func resetRecordHandler(w http.ResponseWriter, r *http.Request) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, but you should request the HTTP method to be POST in this handler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should this method be POST?
Do you think we still need this Handler. After disabling printing the log, the retrieve time is decreased from ~4s to ~80ms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas
I think this concern hasn't been solved?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I didn't see the question before
It should be POST because it does a mutation the server. That's an HTTP convention. You can lookup which HTTP verbs are most appropriate for different type of APIs.

Do you think we still need this Handler. After disabling printing the log, the retrieve time is decreased from ~4s to ~80ms.

I thought you wrote 0.8s in the PR description, which would be 800ms?

Regardless, let's add the reset method, it's easy to do and if we don't end up using it in Antrea tests, so be it.
It's easier than opening a new PR in this repo later and publishing a new collector image tag...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.8s to 4s will be the range when we use "kubectl logs" to get the logs.
Under the current changes, this time will be approximately 80ms regardless of the number of records inside the IPFIX collector.

cmd/collector/collector.go Outdated Show resolved Hide resolved
cmd/collector/collector.go Show resolved Hide resolved
cmd/collector/collector.go Outdated Show resolved Hide resolved
yuntanghsu added a commit to yuntanghsu/antrea that referenced this pull request Dec 13, 2023
In this commit, we do:

1. Changed the order where we append expired records before exporting them from our exporter.
For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. If the record from the conntrack table is exported first, then the record will need to do correlation at FA. From the egress case, there is no issue as we will receive the records from both nodes. But the record won't be sent to the collector as it keeps waiting to do correlation if both records come from the same node.
Similar approach is added in vmware/go-ipfix#338 as well.

2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic.

3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic.

4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod.

5. Confirm the correct addition of a label to a specific Pod after updating the Pod.

6. Adjust the flow-visibility end-to-end test by disabling the octetDeltaCount check. This modification is necessary because, when the dual-stack cluster is enabled, the time taken to retrieve logs from the IPFIX collector Pod is significantly longer (around 4 seconds). In the e2e test, we regularly checked the logs every 500 milliseconds to ensure that we didn't receive the last record (where octetDeltaCount is 0). However, due to the delay, the PollImmediately() function doesn't execute every 500 milliseconds. Therefore, we have removed the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod.

7. Use new image from go-ipfix PR ( vmware/go-ipfix#338).  We improve the IPFIX collector by:
    a.  Disable printing records whenever we receive it. Instead, we store records in a string array.
    b. Add http listener and handler to receive request to return or reset records.
    In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records
    inside it.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
@yuntanghsu yuntanghsu changed the title Add http handler to reset/get records Improve IPFIX collector Dec 13, 2023
yuntanghsu added a commit to yuntanghsu/antrea that referenced this pull request Dec 13, 2023
In this commit, we do:

1. Changed the order where we append expired records before exporting them from our exporter.
For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. If the record from the conntrack table is exported first, then the record will need to do correlation at FA. From the egress case, there is no issue as we will receive the records from both nodes. But the record won't be sent to the collector as it keeps waiting to do correlation if both records come from the same node.
Similar approach is added in vmware/go-ipfix#338 as well.

2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic.

3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic.

4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod.

5. Confirm the correct addition of a label to a specific Pod after updating the Pod.

6. Adjust the flow-visibility end-to-end test by disabling the octetDeltaCount check. This modification is necessary because, when the dual-stack cluster is enabled, the time taken to retrieve logs from the IPFIX collector Pod is significantly longer (around 4 seconds). In the e2e test, we regularly checked the logs every 500 milliseconds to ensure that we didn't receive the last record (where octetDeltaCount is 0). However, due to the delay, the PollImmediately() function doesn't execute every 500 milliseconds. Therefore, we have removed the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod.

7. Use new image from go-ipfix PR ( vmware/go-ipfix#338).  We improve the IPFIX collector by:
    a.  Disable printing records whenever we receive it. Instead, we store records in a string array.
    b. Add http listener and handler to receive request to return or reset records.
    In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records
    inside it.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
cmd/collector/collector.go Outdated Show resolved Hide resolved
cmd/collector/collector.go Outdated Show resolved Hide resolved
cmd/collector/collector.go Outdated Show resolved Hide resolved
yuntanghsu added a commit to yuntanghsu/antrea that referenced this pull request Dec 13, 2023
In this commit, we do:

1. Changed the order where we append expired records before exporting them from our exporter.
For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. If the record from the conntrack table is exported first, then the record will need to do correlation at FA. From the egress case, there is no issue as we will receive the records from both nodes. But the record won't be sent to the collector as it keeps waiting to do correlation if both records come from the same node.
Similar approach is added in vmware/go-ipfix#338 as well.

2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic.

3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic.

4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod.

5. Confirm the correct addition of a label to a specific Pod after updating the Pod.

6. Adjust the flow-visibility end-to-end test by disabling the octetDeltaCount check. This modification is necessary because, when the dual-stack cluster is enabled, the time taken to retrieve logs from the IPFIX collector Pod is significantly longer (around 4 seconds). In the e2e test, we regularly checked the logs every 500 milliseconds to ensure that we didn't receive the last record (where octetDeltaCount is 0). However, due to the delay, the PollImmediately() function doesn't execute every 500 milliseconds. Therefore, we have removed the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod.

7. Use new image from go-ipfix PR ( vmware/go-ipfix#338).  We improve the IPFIX collector by:
    a.  Disable printing records whenever we receive it. Instead, we store records in a string array.
    b. Add http listener and handler to receive request to return or reset records.
    In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records
    inside it.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
antoninbas
antoninbas previously approved these changes Dec 14, 2023
Copy link
Member

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Member

I don't remember how tagging works for the collector image. @heanlan @dreamtalen will we need a new release of go-ipfix?

@heanlan
Copy link
Contributor

heanlan commented Dec 14, 2023

will we need a new release of go-ipfix?

Yes, we need a new release

In this commit, we do:
1. We disable printing records whenever we receive it. Instead, we store records in a string array.
2. Add http listener and handler to receive request to return or reset records.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
@antoninbas antoninbas merged commit 837050b into vmware:main Dec 15, 2023
8 of 9 checks passed
yanjunz97 pushed a commit to yanjunz97/go-ipfix that referenced this pull request Dec 19, 2023
1. Disable printing records whenever we receive them. Instead, we store records in a string array.
2. Add http listener and handler to receive request to return or reset records.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
yanjunz97 pushed a commit that referenced this pull request Dec 19, 2023
1. Disable printing records whenever we receive them. Instead, we store records in a string array.
2. Add http listener and handler to receive request to return or reset records.

Signed-off-by: Yun-Tang Hsu <hsuy@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants