
[awsemfexporter] Group exported metrics by labels #2317

Merged: 1 commit merged into open-telemetry:main on Feb 22, 2021


mxiamxia
Member

This PR is the second part of splitting up #1891, which was originally authored by @kohrapha.
Currently, each incoming metric is pushed to CloudWatch Logs as a separate log event. However, many metrics share the same labels, so this produces a lot of duplicate data. To solve this, this PR batches metrics by their labels so that metrics with the same set of labels are exported together in a single log event.

Specifically, metrics are batched together if they have the same:

  • label names + values
  • namespace
  • timestamp
  • log group name
  • log stream name
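The grouping criteria above can be sketched as a comparable Go struct used as a map key. This is a minimal illustration, not the exporter's code: groupKey, newGroupKey, and the label encoding are hypothetical. Because Go map keys must be comparable, the label set is encoded as a canonical string with labels sorted by name, so two data points with the same labels produce the same key regardless of label order.

```go
package main

import (
	"sort"
	"strings"
)

// groupKey is a hypothetical, comparable key for batching data points.
// A map[string]string is not comparable, so labels are stored as a
// canonical string instead.
type groupKey struct {
	labels      string
	namespace   string
	timestampMs int64
	logGroup    string
	logStream   string
}

// newGroupKey builds the batching key; the order of labels does not matter.
func newGroupKey(labels map[string]string, namespace string, timestampMs int64, logGroup, logStream string) groupKey {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort first

	var b strings.Builder
	for _, name := range names {
		// ASCII unit/record separators delimit names and values; an
		// unambiguous encoding would be needed if labels could contain them.
		b.WriteString(name)
		b.WriteByte(0x1f)
		b.WriteString(labels[name])
		b.WriteByte(0x1e)
	}
	return groupKey{
		labels:      b.String(),
		namespace:   namespace,
		timestampMs: timestampMs,
		logGroup:    logGroup,
		logStream:   logStream,
	}
}
```

Data points whose keys compare equal would then be appended to the same batch, for example through a map[groupKey][]metric.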

The batched metrics are further split up if metric_declarations are defined: the filtered metrics in each batch are partitioned by the metric declaration rules they match. Because metrics in a batch share the same labels, metrics that match the same declaration rules also end up with the same dimensions.
Caveat: two groups of filtered metrics can still end up with the same dimension sets if their metric declarations produce identical dimensions. We currently don't check for this case, so such groups are not merged.

Implementation Details

Since this PR includes a lot of refactoring, here is an overview of how the new metric translation logic works. Given a list of ResourceMetrics passed to emfExporter.pushMetricsData:

  1. For each ResourceMetrics in the list, we add its metrics to groupedMetrics (a map of batched metrics).
  2. For each metric within the ResourceMetrics, we create a CWMetricMetadata containing the metadata associated with that metric (namespace, timestamp, log group, log stream, and instrumentation library name). This metadata is added to groupedMetrics for later processing.
  3. We extract the DataPoints from each metric. For each DataPoint, we build its "group key" from its labels, namespace, timestamp, log group, and log stream, and use this key to add the metric to its corresponding group in groupedMetrics.
  4. After translating all OpenTelemetry metrics into groupedMetrics, we iterate through each group and translate it into a CWMetric. At this stage, we filter out metrics if metric declarations are defined and set the dimensions for the exported metrics (with rolled-up dimensions).
  5. Finally, we translate each CWMetric into an EMF log and push it to CloudWatch using the log group and log stream stored in the group's CWMetricMetadata.
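The steps above can be sketched in Go roughly as follows. All type and function names here (dataPoint, metricMetadata, groupedMetric, addToGroups, toLogEvent) are hypothetical simplifications rather than the exporter's API, and the sketch leaves out metric declarations, dimension rollup, and the "_aws" metadata block that a matched batch carries.

```go
package main

import (
	"encoding/json"
	"sort"
	"strconv"
	"strings"
)

// Hypothetical, simplified stand-ins for the exporter's types.
type dataPoint struct {
	metricName  string
	labels      map[string]string
	value       float64
	timestampMs int64
}

type metricMetadata struct {
	namespace, logGroup, logStream string
	timestampMs                    int64
}

type groupedMetric struct {
	labels   map[string]string
	metrics  map[string]float64
	metadata metricMetadata
}

// labelKey canonically encodes labels so equal label sets produce equal
// strings (ambiguous if labels contain '=' or ','; fine for a sketch).
func labelKey(labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// addToGroups mirrors steps 2-3: each data point goes into the group
// identified by its labels, namespace, timestamp, log group and log stream.
func addToGroups(groups map[string]*groupedMetric, dps []dataPoint, md metricMetadata) {
	for _, dp := range dps {
		md.timestampMs = dp.timestampMs // md is a local copy; each point has its own timestamp
		key := strings.Join([]string{labelKey(dp.labels), md.namespace,
			strconv.FormatInt(md.timestampMs, 10), md.logGroup, md.logStream}, "|")
		g, ok := groups[key]
		if !ok {
			g = &groupedMetric{labels: dp.labels, metrics: map[string]float64{}, metadata: md}
			groups[key] = g
		}
		g.metrics[dp.metricName] = dp.value
	}
}

// toLogEvent mirrors a simplified step 5: labels and metric values of one
// group are flattened into a single JSON object (keys sorted by json.Marshal).
func toLogEvent(g *groupedMetric) (string, error) {
	fields := make(map[string]interface{}, len(g.labels)+len(g.metrics))
	for k, v := range g.labels {
		fields[k] = v
	}
	for name, v := range g.metrics {
		fields[name] = v
	}
	b, err := json.Marshal(fields)
	return string(b), err
}
```

Two data points with the same labels and timestamp, such as go_threads and go_goroutines from the same pod scrape, end up in one group and therefore in one log event instead of two.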

Testing:
Tests were added for new functions, and tests for modified functions were updated. Additionally, this PR was tested in a sample environment using an NGINX server on EKS with the following config (same as in #2):

exporters:
  awsemf:
    log_group_name: 'awscollector-test'
    region: 'us-west-2'
    log_stream_name: metric-declarations
    dimension_rollup_option: 'NoDimensionRollup'
    metric_declarations:
    - dimensions: [['Service', 'Namespace'], ['pod_name', 'container_name']]
      metric_name_selectors:
      - '^go_memstats_alloc_bytes_total$'
    - dimensions: [['app_kubernetes_io_component', 'Namespace'], ['app_kubernetes_io_name'], ['Invalid', 'Namespace']]
      metric_name_selectors:
      - '^go_goroutines$'
    - dimensions: [['Namespace', 'app_kubernetes_io_component', 'Namespace']]
      metric_name_selectors:
      - '^go_.+$'

we get the following cases:

  • batch with matched metrics
{
    "Namespace": "eks-aoc",
    "Service": "my-nginx-ingress-nginx-controller-metrics",
    "_aws": {
        "CloudWatchMetrics": [
            {
                "Namespace": "kubernetes-service-endpoints",
                "Dimensions": [
                    [
                        "Namespace",
                        "app_kubernetes_io_component"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "go_memstats_heap_alloc_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_threads",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_alloc_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_gc_cpu_fraction",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_heap_released_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_mcache_inuse_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_objects",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_last_gc_time_seconds",
                        "Unit": "s"
                    },
                    {
                        "Name": "go_memstats_mcache_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_frees_total",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_stack_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_buck_hash_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_idle_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_lookups_total",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_mallocs_total",
                        "Unit": ""
                    },
                    {
                        "Name": "go_memstats_mspan_inuse_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_next_gc_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_other_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_gc_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_heap_inuse_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_mspan_sys_bytes",
                        "Unit": "By"
                    },
                    {
                        "Name": "go_memstats_stack_inuse_bytes",
                        "Unit": "By"
                    }
                ]
            },
            {
                "Namespace": "kubernetes-service-endpoints",
                "Dimensions": [
                    [
                        "Namespace",
                        "app_kubernetes_io_component"
                    ],
                    [
                        "app_kubernetes_io_name"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "go_goroutines",
                        "Unit": ""
                    }
                ]
            },
            {
                "Namespace": "kubernetes-service-endpoints",
                "Dimensions": [
                    [
                        "Namespace",
                        "Service"
                    ],
                    [
                        "container_name",
                        "pod_name"
                    ],
                    [
                        "Namespace",
                        "app_kubernetes_io_component"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "go_memstats_alloc_bytes_total",
                        "Unit": ""
                    }
                ]
            }
        ],
        "Timestamp": 1606931694465
    },
    "app_kubernetes_io_component": "controller",
    "app_kubernetes_io_instance": "my-nginx",
    "app_kubernetes_io_managed_by": "Helm",
    "app_kubernetes_io_name": "ingress-nginx",
    "app_kubernetes_io_version": "0.40.2",
    "container_name": "controller",
    "go_goroutines": 89,
    "go_memstats_alloc_bytes": 8168512,
    "go_memstats_alloc_bytes_total": 78897.33333333333,
    "go_memstats_buck_hash_sys_bytes": 1504910,
    "go_memstats_frees_total": 939.7833333333333,
    "go_memstats_gc_cpu_fraction": 0.000016842131408600387,
    "go_memstats_gc_sys_bytes": 5698672,
    "go_memstats_heap_alloc_bytes": 8168512,
    "go_memstats_heap_idle_bytes": 54452224,
    "go_memstats_heap_inuse_bytes": 10690560,
    "go_memstats_heap_objects": 58592,
    "go_memstats_heap_released_bytes": 51896320,
    "go_memstats_heap_sys_bytes": 65142784,
    "go_memstats_last_gc_time_seconds": 1606931634.4573667,
    "go_memstats_lookups_total": 0,
    "go_memstats_mallocs_total": 866.4166666666666,
    "go_memstats_mcache_inuse_bytes": 3472,
    "go_memstats_mcache_sys_bytes": 16384,
    "go_memstats_mspan_inuse_bytes": 149192,
    "go_memstats_mspan_sys_bytes": 229376,
    "go_memstats_next_gc_bytes": 12224112,
    "go_memstats_other_sys_bytes": 760066,
    "go_memstats_stack_inuse_bytes": 1966080,
    "go_memstats_stack_sys_bytes": 1966080,
    "go_memstats_sys_bytes": 75318272,
    "go_threads": 15,
    "helm_sh_chart": "ingress-nginx-3.7.1",
    "kubernetes_node": "ip-192-168-46-33.us-west-2.compute.internal",
    "pod_name": "my-nginx-ingress-nginx-controller-77d5fd6977-ld9wg",
    "process_cpu_seconds_total": 0.0016666666666666757,
    "process_max_fds": 1048576,
    "process_open_fds": 38,
    "process_resident_memory_bytes": 46612480,
    "process_start_time_seconds": 1606928481.44,
    "process_virtual_memory_bytes": 761430016,
    "process_virtual_memory_max_bytes": -1,
    "promhttp_metric_handler_requests_in_flight": 1
}
  • batch with no matched metrics
{
    "Namespace": "eks-aoc",
    "Service": "my-nginx-ingress-nginx-controller-metrics",
    "app_kubernetes_io_component": "controller",
    "app_kubernetes_io_instance": "my-nginx",
    "app_kubernetes_io_managed_by": "Helm",
    "app_kubernetes_io_name": "ingress-nginx",
    "app_kubernetes_io_version": "0.40.2",
    "container_name": "controller",
    "controller_class": "nginx",
    "controller_namespace": "eks-aoc",
    "controller_pod": "my-nginx-ingress-nginx-controller-77d5fd6977-ld9wg",
    "helm_sh_chart": "ingress-nginx-3.7.1",
    "host": "a7710ecaa12b540be99c5bfd5ee07a1f-266546424.us-west-2.elb.amazonaws.com",
    "ingress": "ingress-nginx-demo",
    "kubernetes_node": "ip-192-168-46-33.us-west-2.compute.internal",
    "method": "GET",
    "namespace": "eks-traffic",
    "nginx_ingress_controller_bytes_sent": {
        "Max": 10000000,
        "Min": 10,
        "Count": 114,
        "Sum": 21888
    },
    "nginx_ingress_controller_request_duration_seconds": {
        "Max": 10,
        "Min": 0.005,
        "Count": 114,
        "Sum": 0.029000000000000026
    },
    "nginx_ingress_controller_request_size": {
        "Max": 100,
        "Min": 10,
        "Count": 114,
        "Sum": 15960
    },
    "nginx_ingress_controller_response_duration_seconds": {
        "Max": 10,
        "Min": 0.005,
        "Count": 114,
        "Sum": 0.020000000000000018
    },
    "nginx_ingress_controller_response_size": {
        "Max": 10,
        "Min": 0.005,
        "Count": 114,
        "Sum": 21888
    },
    "path": "/banana",
    "pod_name": "my-nginx-ingress-nginx-controller-77d5fd6977-ld9wg",
    "service": "banana-service",
    "status": "200"
}

codecov bot commented Feb 10, 2021

Codecov Report

Merging #2317 (868dc42) into main (e43c235) will increase coverage by 1.02%.
The diff coverage is 99.04%.


@@            Coverage Diff             @@
##             main    #2317      +/-   ##
==========================================
+ Coverage   72.72%   73.75%   +1.02%     
==========================================
  Files         410      412       +2     
  Lines       25355    25475     +120     
==========================================
+ Hits        18440    18789     +349     
+ Misses       6368     6133     -235     
- Partials      547      553       +6     
Flag Coverage Δ
integration 69.26% <ø> (?)
unit 72.77% <99.04%> (+0.05%) ⬆️


Impacted Files Coverage Δ
exporter/awsemfexporter/metric_translator.go 97.70% <98.19%> (-0.79%) ⬇️
exporter/awsemfexporter/datapoint.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/emf_exporter.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/grouped_metric.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/metric_declaration.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/util.go 100.00% <100.00%> (+5.55%) ⬆️
internal/common/testing/container/container.go 73.68% <0.00%> (ø)
... and 8 more


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e43c235...868dc42.

mxiamxia force-pushed the batch_metrics branch 7 times, most recently from 780abbb to 32be859 on February 15, 2021 21:51
Contributor

shaochengwang left a comment


Thanks

mxiamxia force-pushed the batch_metrics branch 2 times, most recently from e229569 to ce8e948 on February 16, 2021 01:16
@mxiamxia
Member Author

@bogdandrutu Kind ping for review and merge. Thanks.

@tigrannajaryan
Member

@mxiamxia please resolve comments that are addressed and resolve merge conflicts.

@bogdandrutu
Member

@mxiamxia this needs a rebase

mxiamxia force-pushed the batch_metrics branch 7 times, most recently from 60f0eb4 to a38f82c on February 21, 2021 05:30
@mxiamxia
Member Author

Rebased the commits and resolved the conflicts.

@bogdandrutu bogdandrutu merged commit 0761ee3 into open-telemetry:main Feb 22, 2021
pmatyjasek-sumo referenced this pull request in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021
5 participants