
[Metricbeat] module:kubernetes / metricset:event fails to gather FailedScheduling events on k8s 1.20.5 #28923

Closed
MnrGreg opened this issue Nov 10, 2021 · 13 comments
Labels
Team:Integrations Label for the Integrations team

MnrGreg commented Nov 10, 2021

After upgrading from Kubernetes 1.19.1 to 1.20.5, metricset:event no longer gathers k8s events with .reason:FailedScheduling. Could this be related to a change in the k8s event structure?

See the 1.19.1 event JSON below:

{
            "apiVersion": "v1",
            "count": 4,
            "eventTime": null,
            "firstTimestamp": "2021-08-05T18:05:01Z",
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "netshoot-69c75755c-kzftr",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "143753331",
                "uid": "c5c0b600-cc55-4922-b921-8dad8a4289b3"
            },
            "kind": "Event",
            "lastTimestamp": "2021-08-05T18:06:18Z",
            "message": "0/39 nodes are available: 39 Insufficient cpu.",
            "metadata": {
                "creationTimestamp": "2021-08-05T18:05:01Z",
                "name": "netshoot-69c75755c-kzftr.16987b2dbf489625",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "143754124",
                "selfLink": "/api/v1/namespaces/tsengineering-netshoot-dev/events/netshoot-69c75755c-kzftr.16987b2dbf489625",
                "uid": "f8f0ae32-9230-4816-ae90-2e351b5bfa46"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "default-scheduler"
            },
            "type": "Warning"
        }

Compare with a similar event from 1.20.5 below:

{
            "action": "Scheduling",
            "apiVersion": "v1",
            "eventTime": "2021-08-05T16:53:44.592316Z",
            "firstTimestamp": null,
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "netshoot-5ccb8d65c7-8d5sr",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "373641912",
                "uid": "88e1db83-adfb-4e4f-8e1d-89ed98395465"
            },
            "kind": "Event",
            "lastTimestamp": null,
            "message": "0/51 nodes are available: 40 Insufficient cpu",
            "metadata": {
                "creationTimestamp": "2021-08-05T16:53:44Z",
                "name": "netshoot-5ccb8d65c7-8d5sr.16987749e75c0e0b",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "373807800",
                "uid": "aef136c3-0e8f-4dac-abf8-24d5a1d7243d"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "default-scheduler",
            "reportingInstance": "default-scheduler-ts-sharedplatform-rck-nonprod-cp-qzzjh",
            "series": {
                "count": 460,
                "lastObservedTime": "2021-08-05T19:22:48.831892Z"
            },
            "source": {},
            "type": "Warning"
        }

Note the differences:

  1. .metadata.selfLink is removed in 1.20
  2. .source is empty
  3. .eventTime is populated
  4. .firstTimestamp and .lastTimestamp are null

Is .lastTimestamp: null perhaps causing the event watch time comparison to fail? See here:

// skip events happened before watch
FilterFunc: func(obj interface{}) bool {
	eve := obj.(*kubernetes.Event)
	if kubernetes.Time(&eve.LastTimestamp).Before(now) {
		return false
	}
	return true
},
Running metricbeat in debug mode with -d "*" does not surface any errors; the events seem to be silently dropped.

Related issue:
kubernetes/kubernetes#90482

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 10, 2021
delfer commented Nov 19, 2021

We are also affected.

semyon1992 commented Nov 19, 2021

I have hit this issue on my cluster too.
Kubernetes version: v1.21.5
Metricbeat version: 7.15.2

kubectl get events -o yaml
returns null for firstTimestamp and lastTimestamp.

@los911evgen

The same issue.

@ChrsMark ChrsMark added the Team:Integrations Label for the Integrations team label Dec 2, 2021
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 2, 2021
bgaliev commented Dec 13, 2021

I have the same issue

@ChrsMark

Thanks for reporting this @MnrGreg and folks. It looks like the API's types have changed. We are already looking into this :).

@ChrsMark ChrsMark self-assigned this Dec 13, 2021
ChrsMark commented Dec 13, 2021

It seems that the API's types have changed, and we need to investigate further in order to support newer and older versions at the same time. In any case, we need to stop the filtering from dropping all events, so I've opened a PR introducing a new skip_older setting that makes the filter configurable for cases where we cannot rely on the LastTimestamp field. This can be released with 8.0 to solve problems like the one reported in this issue.

@ChrsMark

An update on this.

Based on kubernetes/kubernetes#90482 (comment), manual testing, and the provided sample events, it looks like the firstTimestamp, lastTimestamp and eventTime fields are all optional, so we cannot rely on any of them.

In addition, there is the new events API and a plan to migrate to it (link); in the long term, once all components (the API server most of all) have migrated, we will be able to switch to the new API's types.

Based on testing up to version 1.23.0 (kind), I see that the old API is still used:

        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2021-12-15T13:47:15Z",
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Endpoints",
                "name": "rancher.io-local-path",
                "namespace": "local-path-storage",
                "resourceVersion": "513",
                "uid": "c470b86c-1157-48b2-a15c-e2727c43bc1d"
            },
            "kind": "Event",
            "lastTimestamp": "2021-12-15T13:47:15Z",
            "message": "local-path-provisioner-5bb5788f44-t4gb6_8c4a9e90-9859-4079-a47d-1d2acb5e7ed1 became leader",
            "metadata": {
                "creationTimestamp": "2021-12-15T13:47:15Z",
                "managedFields": [
                    {
                        "apiVersion": "v1",
                        "fieldsType": "FieldsV1",
                        "fieldsV1": {
                            "f:count": {},
                            "f:firstTimestamp": {},
                            "f:involvedObject": {},
                            "f:lastTimestamp": {},
                            "f:message": {},
                            "f:reason": {},
                            "f:source": {
                                "f:component": {}
                            },
                            "f:type": {}
                        },
                        "manager": "local-path-provisioner",
                        "operation": "Update",
                        "time": "2021-12-15T13:47:15Z"
                    }
                ],
                "name": "rancher.io-local-path.16c0f1b7c1eb6823",
                "namespace": "local-path-storage",
                "resourceVersion": "514",
                "uid": "5fc99d35-787b-4d8f-806c-ee4be1ccc7a8"
            },
            "reason": "LeaderElection",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "rancher.io/local-path_local-path-provisioner-5bb5788f44-t4gb6_8c4a9e90-9859-4079-a47d-1d2acb5e7ed1"
            },
            "type": "Normal"
        }

Same results when testing on GKE (1.21.1).

In this regard I would go with updating our implementation to tolerate events that have no value in the lastTimestamp field: in the generic case events are still filtered when there is a valid timestamp, but when there is no valid timestamp we don't skip them. @MichaelKatsoulis any thoughts here?

@MichaelKatsoulis

The approach I was thinking of is that, after the introduction of the skip_older setting, we should set its default value to false.
That way, by default, we don't skip older or non-timestamped events. By default, the events TTL is 1 hour (link).
If the user wants to set it to true, we could add a note that events with no timestamp will also be skipped.
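For reference, a module configuration using the proposed setting might look like the hypothetical snippet below; the skip_older name comes from the PR mentioned above, while the surrounding layout is the standard metricbeat kubernetes module config:

```yaml
- module: kubernetes
  metricsets: ["event"]
  period: 10s
  # Hypothetical: when false, do not drop events that predate the watch
  # start (or that carry no usable timestamp at all).
  skip_older: false
```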

@ChrsMark

I think the default should remain skip_older: true to avoid possible breaking changes. We can of course still avoid filtering events that have no valid timestamp, in all cases.

ChrsMark commented Dec 16, 2021

Ok, after thinking this through with @MichaelKatsoulis, the way to go is to change the condition to also take eventTime into account with OR logic; if both of these fields are missing we skip the event, and to disable this skipping one can use the skip_older: false setting.

The change needed is at:

if m.skipOlder && kubernetes.Time(&eve.LastTimestamp).Before(now) {

and would become something like:

if m.skipOlder && (kubernetes.Time(&eve.LastTimestamp).Before(now) || kubernetes.Time(&eve.EventTime).Before(now)) {

@ChrsMark

Hi @MnrGreg, since the temporary workaround has been merged and is scheduled for v8.0.0, would you be able to test with docker.elastic.co/beats/metricbeat:8.0.0-SNAPSHOT and verify whether the patch solves your issue? Thank you :)

@ChrsMark

Closing for now. We can re-open if things don't work as expected.
