[awsemfexporter] Group exported metrics by labels #1891

Closed

Conversation

kohrapha
Contributor

Description

Currently, each incoming metric is pushed to CloudWatch Logs as a separate log event. However, many metrics share the same labels, so this results in a lot of duplicated data. To address this, this PR batches metrics by their labels so that metrics with the same set of labels are exported together.

Additionally, this PR fixes a long-standing bug where the incoming metric's timestamp was not used for the exported metric. We now use the incoming metric's timestamp and fall back to the current timestamp when it is not available.
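
As a rough illustration of that fallback (a minimal sketch only; the function name and the assumption that data point timestamps arrive as Unix nanoseconds are mine, not the exporter's actual API):

package sketch

import "time"

// resolveTimestampMs converts a data point timestamp from Unix nanoseconds to
// milliseconds, falling back to the current time when no timestamp is set.
func resolveTimestampMs(tsNanos uint64) int64 {
	if tsNanos == 0 {
		return time.Now().UnixNano() / int64(time.Millisecond)
	}
	return int64(tsNanos) / int64(time.Millisecond)
}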

Specifically, metrics are batched together if they have the same (see the sketch after this list):

  • label names + values
  • namespace
  • timestamp
  • log group name
  • log stream name
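
As a rough sketch of how such a group key could be formed (illustrative only; the helper name and string format are invented, not the PR's actual implementation):

package sketch

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey concatenates the grouping fields above into one deterministic
// string, so data points that share labels, namespace, timestamp, log group,
// and log stream land in the same group.
func groupKey(labels map[string]string, namespace string, timestampMs int64, logGroup, logStream string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic ordering for equal label sets
	var sb strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&sb, "%s=%s,", k, labels[k])
	}
	fmt.Fprintf(&sb, "ns=%s,ts=%d,lg=%s,ls=%s", namespace, timestampMs, logGroup, logStream)
	return sb.String()
}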

The batched metrics are further split up if metric_declarations are defined. Currently, the filtered metrics are split up by the metric declaration rules they match. Since the metrics in a group share the same labels, they will have the same dimensions if they match the same metric declaration rules.
Caveat: two groups of filtered metrics can still share the same dimension sets if their metric declarations resolve to the same dimension set. We currently don't perform this check to merge the two groups.

Implementation Details

Since this PR includes a lot of refactoring, here is an overview of how the new metric translation logic works. Given a list of ResourceMetrics via emfExporter.pushMetricsData (a sketch of the intermediate data structures follows this list):

  1. For each ResourceMetrics in the list, we add its metrics to groupedMetrics (a map of batched metrics).
  2. For each metric within the ResourceMetrics, we create a CWMetricMetadata, which holds the metadata (i.e. namespace, timestamp, log group, log stream, and instrumentation library name) associated with the given metric. This is added to groupedMetrics for later processing.
  3. We extract the DataPoints from each metric. For each DataPoint, we define its "group key" using its labels, namespace, timestamp, log group, and log stream. We use this group key to add the metric to its corresponding group in groupedMetrics.
  4. After translating all OTel metrics into groupedMetrics, we iterate through each group and translate it into a CWMetric. In this stage, we filter out metrics if metric declarations are defined, and we set the dimensions for exported metrics (with rolled-up dimensions).
  5. Finally, we translate each CWMetric into an EMF log and push it to CloudWatch using the log group and log stream recorded in the group's CWMetricMetadata.
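
To make the walkthrough concrete, here is a minimal sketch of what the intermediate structures could look like; the field names follow the description above but are illustrative, and the actual code may differ:

package sketch

// cwMetricMetadata mirrors the CWMetricMetadata described in step 2.
type cwMetricMetadata struct {
	namespace                  string
	timestampMs                int64
	logGroup                   string
	logStream                  string
	instrumentationLibraryName string
}

// metricInfo holds a single metric's value and unit within a group.
type metricInfo struct {
	value interface{}
	unit  string
}

// groupedMetric is one batch: a shared label set, the metrics that carry it,
// and the metadata used later to pick the log group and log stream (step 5).
type groupedMetric struct {
	labels   map[string]string
	metrics  map[string]metricInfo // metric name -> value and unit
	metadata cwMetricMetadata
}

// groupedMetrics is keyed by the group key from step 3 (labels + namespace +
// timestamp + log group + log stream).
type groupedMetrics map[string]*groupedMetric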

Testing:
Tests were added for new functions and tests for modified functions were updated. Additionally, this PR was tested in a sample environment using an NGINX server on EKS. Given the following config (same as in #2):

exporters:
  awsemf:
    log_group_name: 'awscollector-test'
    region: 'us-west-2'
    log_stream_name: metric-declarations
    dimension_rollup_option: 'NoDimensionRollup'
    metric_declarations:
    - dimensions: [['Service', 'Namespace'], ['pod_name', 'container_name']]
      metric_name_selectors:
      - '^go_memstats_alloc_bytes_total$'
    - dimensions: [['app_kubernetes_io_component', 'Namespace'], ['app_kubernetes_io_name'], ['Invalid', 'Namespace']]
      metric_name_selectors:
      - '^go_goroutines$'
    - dimensions: [['Namespace', 'app_kubernetes_io_component', 'Namespace']]
      metric_name_selectors:
      - '^go_.+$'

we get the following cases:

  • batch with matched metrics
{
    "Namespace": "eks-aoc",
    "Service": "my-nginx-ingress-nginx-controller-metrics",
    "_aws": {
        "CloudWatchMetrics": [
            {
                "Namespace": "kubernetes-service-endpoints",
                "Dimensions": [
                    [
                        "Namespace",
                        "app_kubernetes_io_component"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "go_memstats_heap_alloc_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_threads",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_alloc_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_gc_cpu_fraction",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_heap_released_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_mcache_inuse_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_objects",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_last_gc_time_seconds",
                        "Unit": "s"
                    },
                    {
                        "Name": "go_memstats_mcache_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_frees_total",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_stack_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_buck_hash_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_idle_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_lookups_total",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_mallocs_total",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_mspan_inuse_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_next_gc_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_other_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_gc_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_inuse_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_mspan_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_stack_inuse_bytes",
                        "Unit": "By"
                    }
                ]
            },
            {
                "Namespace": "kubernetes-service-endpoints",
                "Dimensions": [
                    [
                        "Namespace",
                        "app_kubernetes_io_component"
                    ],
                    [
                        "app_kubernetes_io_name"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "go_goroutines",
                        "Unit": ""
                    }
                ]
            },
            {
                "Namespace": "kubernetes-service-endpoints",
                "Dimensions": [
                    [
                        "Namespace",
                        "Service"
                    ],
                    [
                        "container_name",
                        "pod_name"
                    ],
                    [
                        "Namespace",
                        "app_kubernetes_io_component"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "go_memstats_alloc_bytes_total",
                        "Unit": ""
                    }
                ]
            }
        ],
        "Timestamp": 1606931694465
    },
    "app_kubernetes_io_component": "controller",
    "app_kubernetes_io_instance": "my-nginx",
    "app_kubernetes_io_managed_by": "Helm",
    "app_kubernetes_io_name": "ingress-nginx",
    "app_kubernetes_io_version": "0.40.2",
    "container_name": "controller",
    "go_goroutines": 89,
    "go_memstats_alloc_bytes": 8168512,
    "go_memstats_alloc_bytes_total": 78897.33333333333,
    "go_memstats_buck_hash_sys_bytes": 1504910,
    "go_memstats_frees_total": 939.7833333333333,
    "go_memstats_gc_cpu_fraction": 0.000016842131408600387,
    "go_memstats_gc_sys_bytes": 5698672,
    "go_memstats_heap_alloc_bytes": 8168512,
    "go_memstats_heap_idle_bytes": 54452224,
    "go_memstats_heap_inuse_bytes": 10690560,
    "go_memstats_heap_objects": 58592,
    "go_memstats_heap_released_bytes": 51896320,
    "go_memstats_heap_sys_bytes": 65142784,
    "go_memstats_last_gc_time_seconds": 1606931634.4573667,
    "go_memstats_lookups_total": 0,
    "go_memstats_mallocs_total": 866.4166666666666,
    "go_memstats_mcache_inuse_bytes": 3472,
    "go_memstats_mcache_sys_bytes": 16384,
    "go_memstats_mspan_inuse_bytes": 149192,
    "go_memstats_mspan_sys_bytes": 229376,
    "go_memstats_next_gc_bytes": 12224112,
    "go_memstats_other_sys_bytes": 760066,
    "go_memstats_stack_inuse_bytes": 1966080,
    "go_memstats_stack_sys_bytes": 1966080,
    "go_memstats_sys_bytes": 75318272,
    "go_threads": 15,
    "helm_sh_chart": "ingress-nginx-3.7.1",
    "kubernetes_node": "ip-192-168-46-33.us-west-2.compute.internal",
    "pod_name": "my-nginx-ingress-nginx-controller-77d5fd6977-ld9wg",
    "process_cpu_seconds_total": 0.0016666666666666757,
    "process_max_fds": 1048576,
    "process_open_fds": 38,
    "process_resident_memory_bytes": 46612480,
    "process_start_time_seconds": 1606928481.44,
    "process_virtual_memory_bytes": 761430016,
    "process_virtual_memory_max_bytes": -1,
    "promhttp_metric_handler_requests_in_flight": 1
}
  • batch with no matched metrics
{
    "Namespace": "eks-aoc",
    "Service": "my-nginx-ingress-nginx-controller-metrics",
    "app_kubernetes_io_component": "controller",
    "app_kubernetes_io_instance": "my-nginx",
    "app_kubernetes_io_managed_by": "Helm",
    "app_kubernetes_io_name": "ingress-nginx",
    "app_kubernetes_io_version": "0.40.2",
    "container_name": "controller",
    "controller_class": "nginx",
    "controller_namespace": "eks-aoc",
    "controller_pod": "my-nginx-ingress-nginx-controller-77d5fd6977-ld9wg",
    "helm_sh_chart": "ingress-nginx-3.7.1",
    "host": "a7710ecaa12b540be99c5bfd5ee07a1f-266546424.us-west-2.elb.amazonaws.com",
    "ingress": "ingress-nginx-demo",
    "kubernetes_node": "ip-192-168-46-33.us-west-2.compute.internal",
    "method": "GET",
    "namespace": "eks-traffic",
    "nginx_ingress_controller_bytes_sent": {
        "Max": 10000000,
        "Min": 10,
        "Count": 114,
        "Sum": 21888
    },
    "nginx_ingress_controller_request_duration_seconds": {
        "Max": 10,
        "Min": 0.005,
        "Count": 114,
        "Sum": 0.029000000000000026
    },
    "nginx_ingress_controller_request_size": {
        "Max": 100,
        "Min": 10,
        "Count": 114,
        "Sum": 15960
    },
    "nginx_ingress_controller_response_duration_seconds": {
        "Max": 10,
        "Min": 0.005,
        "Count": 114,
        "Sum": 0.020000000000000018
    },
    "nginx_ingress_controller_response_size": {
        "Max": 10,
        "Min": 0.005,
        "Count": 114,
        "Sum": 21888
    },
    "path": "/banana",
    "pod_name": "my-nginx-ingress-nginx-controller-77d5fd6977-ld9wg",
    "service": "banana-service",
    "status": "200"
}

@kohrapha
Contributor Author

cc: @hdj630, @mxiamxia

@codecov

codecov bot commented Dec 22, 2020

Codecov Report

Merging #1891 (eb1cde8) into master (a20a6f4) will increase coverage by 0.09%.
The diff coverage is 99.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1891      +/-   ##
==========================================
+ Coverage   89.83%   89.92%   +0.09%     
==========================================
  Files         378      380       +2     
  Lines       18213    18344     +131     
==========================================
+ Hits        16361    16496     +135     
+ Misses       1388     1386       -2     
+ Partials      464      462       -2     
Flag Coverage Δ
integration 69.77% <ø> (ø)
unit 88.63% <99.46%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
exporter/awsemfexporter/metric_translator.go 98.42% <98.23%> (+0.32%) ⬆️
exporter/awsemfexporter/datapoint.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/emf_exporter.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/groupedmetric.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/metric_declaration.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/util.go 100.00% <100.00%> (ø)
receiver/k8sclusterreceiver/watcher.go 97.64% <0.00%> (+2.35%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a20a6f4...eb1cde8.

@anuraaga
Contributor

@kohrapha Thanks for the change! It sounds nice but is pretty huge - is it possible to split into

  1. Timestamp bugfix
  2. Refactoring without change in behavior
  3. Add batching

?

@kohrapha
Contributor Author

@anuraaga Thanks for the suggestion! I can definitely split it up, but as my internship is ending on Thursday, I might not be able to follow up.

@mxiamxia @hdj630 any thoughts?

Member

@mxiamxia mxiamxia left a comment


Thanks for the clean code. Pls address the comments, otherwise everything LGTM!

@kohrapha kohrapha changed the title Kohrapha/integration [awsemfexporter] Group exported metrics by labels Dec 23, 2020
@jpkrohling
Member

I'm removing myself as assignee for this issue, as I'm unavailable until Jan 11.

@jpkrohling jpkrohling removed their assignment Dec 28, 2020
@gramidt
Member

gramidt commented Dec 28, 2020

Great work and exciting functionality, @kohrapha!

The code looks good. LGTM!

@mxiamxia
Member

mxiamxia commented Jan 5, 2021

Hi @anuraaga, @bogdandrutu, @kohrapha has finished his internship on this project. I have spent a good amount of time reviewing this PR, and I see @gramidt also helped with the review. Could we get an approval from you to merge the code? Thanks.

Contributor

@anuraaga anuraaga left a comment


@mxiamxia Sorry for the late review, though that is largely because of the size of the PR. I understand the tension between final PRs and intern deadlines, but since we can generally expect feedback to be required anyway, I don't think that's a good reason to lump multiple changes into a single huge PR.

I did just a quick skim so far and found a few issues. @mxiamxia will you be taking ownership of this PR, ideally splitting it up?

type DataPoint struct {
	Value     interface{}
	Labels    map[string]string
	Timestamp int64
Contributor


Why not uint64 or even Time?

Contributor


I guess not Time since it's millis. Please name the field TimestampMS.

Member


I missed this and agree with @anuraaga.

Member


Thanks. I'll work on it and do the PR split.

type rateCalculationMetadata struct {
	needsCalculateRate bool
	rateKeyParams      map[string]string
	timestamp          int64
Contributor


Ditto for all timestamps

func (dps IntDataPointSlice) At(i int) DataPoint {
	metric := dps.IntDataPointSlice.At(i)
	labels := createLabels(metric.LabelsMap(), dps.instrumentationLibraryName)
	timestamp := unixNanoToMilliseconds(metric.Timestamp())
Contributor


Suggested change:
- timestamp := unixNanoToMilliseconds(metric.Timestamp())
+ timestampMS := unixNanoToMilliseconds(metric.Timestamp())

}

// createMetricKey generates a hashed key from metric labels and additional parameters
func createMetricKey(labels map[string]string, parameters map[string]string) string {
Contributor


We should use label.Distinct as we did in the statsd receiver.

#1670 (comment)
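
For reference, a minimal sketch of that approach, assuming the go.opentelemetry.io/otel/label API available at the time (later renamed to attribute); this is illustrative, not the statsd receiver's exact code:

package sketch

import "go.opentelemetry.io/otel/label"

// distinctKey builds a comparable key from a label map that can be used
// directly as a Go map key, with no string concatenation or hashing.
func distinctKey(labels map[string]string) label.Distinct {
	kvs := make([]label.KeyValue, 0, len(labels))
	for k, v := range labels {
		kvs = append(kvs, label.String(k, v))
	}
	return label.NewSet(kvs...).Equivalent()
}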

return
}

rateKeyParams := map[string]string{
Contributor


It looks like this can be a struct instead of a map. There should just be one struct containing all of these fields along with a label.Distinct.
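
A hypothetical shape of what is being suggested (the field names are invented for illustration; the point is a single comparable struct rather than a map[string]string):

package sketch

import "go.opentelemetry.io/otel/label"

// rateKey is one comparable struct holding the rate-calculation parameters,
// with the label set carried as a label.Distinct instead of a map[string]string.
type rateKey struct {
	labels     label.Distinct
	namespace  string
	metricName string
	logGroup   string
	logStream  string
}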

@tigrannajaryan
Member

@anuraaga, since you already reviewed this, I am assigning the PR to you so that you can facilitate. Thanks.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 15, 2021
@bogdandrutu
Member

@kohrapha @anuraaga friendly ping on this

@github-actions github-actions bot removed the Stale label Jan 22, 2021
@anuraaga
Contributor

@mxiamxia Can we close this PR for now until we get time for it?

@gramidt
Member

gramidt commented Jan 22, 2021

@kohrapha @anuraaga - I'm happy to help out where needed or even take it to completion. Let me know your thoughts.

@mxiamxia
Member

mxiamxia commented Jan 23, 2021

@mxiamxia Can we close this PR for now until we get time for it?

Yes, please close this one and I'll split this PR into smaller ones next week.

@bogdandrutu
Member

Closing per @mxiamxia's request.

dyladan referenced this pull request in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
* Removed groupbytraceprocessor

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Removed link to the groupbytrace processor

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
tigrannajaryan pushed a commit that referenced this pull request Feb 8, 2021
We sent a large PR, #1891, to support batching metrics that share the same dimensions into one AWS EMF log request, reducing customers' billing cost and request volume. At the same time, there was a fairly large code refactor of the EMF exporter. For easier code review, I plan to split #1891 into 2 PRs. (This is PR #1.)

In this PR, we refactored the EMF exporter without introducing any new features. For each OTel metric data point, we added a `DataPoint` file that wraps `pdata.DataPointSlice` into custom structures for each type of metric data point. We also moved the metric data handling functions (data conversion and rate calculation) to `datapoint`.
It also fixes the metric `timestamp` bug.
pmatyjasek-sumo pushed a commit to pmatyjasek-sumo/opentelemetry-collector-contrib that referenced this pull request Apr 28, 2021
We sent a large PR, open-telemetry#1891, to support batching metrics that share the same dimensions into one AWS EMF log request, reducing customers' billing cost and request volume. At the same time, there was a fairly large code refactor of the EMF exporter. For easier code review, I plan to split open-telemetry#1891 into 2 PRs. (This is PR #1.)

In this PR, we refactored the EMF exporter without introducing any new features. For each OTel metric data point, we added a `DataPoint` file that wraps `pdata.DataPointSlice` into custom structures for each type of metric data point. We also moved the metric data handling functions (data conversion and rate calculation) to `datapoint`.
It also fixes the metric `timestamp` bug.
ljmsc referenced this pull request in ljmsc/opentelemetry-collector-contrib Feb 21, 2022
* Add semantic convention generator

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Update semantic conventions from generator

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Use existing internal/tools module

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Fix lint issues, more initialisms

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Update changelog

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* semconvgen: Faas->FaaS

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Fix a few more key names with replacements

* Update replacements from PR feedback

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* rename commonInitialisms to capitalizations, move some capitalizations there

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Regenerate semantic conventions with updated capitalizations and replacements

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Generate semantic conventions from spec v1.3.0

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Cleanup semconv generator util a bit

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* No need to put internal tooling additions in the CHANGELOG

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Fix HTTP semconv tests

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>

* Add semconv generation notes to RELEASING.md

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>