
[Metricbeat] module:kubernetes / metricset:event fails to gather FailedScheduling events on k8s 1.20.5 #28923

Closed
MnrGreg opened this issue Nov 10, 2021 · 13 comments
Labels
Team:Integrations Label for the Integrations team

MnrGreg commented Nov 10, 2021

After upgrading from Kubernetes 1.19.1 to 1.20.5, metricset:event no longer gathers k8s events with .reason:FailedScheduling. Could this be related to a change in the k8s event structure?

See the 1.19.1 event JSON below:

{
            "apiVersion": "v1",
            "count": 4,
            "eventTime": null,
            "firstTimestamp": "2021-08-05T18:05:01Z",
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "netshoot-69c75755c-kzftr",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "143753331",
                "uid": "c5c0b600-cc55-4922-b921-8dad8a4289b3"
            },
            "kind": "Event",
            "lastTimestamp": "2021-08-05T18:06:18Z",
            "message": "0/39 nodes are available: 39 Insufficient cpu.",
            "metadata": {
                "creationTimestamp": "2021-08-05T18:05:01Z",
                "name": "netshoot-69c75755c-kzftr.16987b2dbf489625",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "143754124",
                "selfLink": "/api/v1/namespaces/tsengineering-netshoot-dev/events/netshoot-69c75755c-kzftr.16987b2dbf489625",
                "uid": "f8f0ae32-9230-4816-ae90-2e351b5bfa46"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "default-scheduler"
            },
            "type": "Warning"
        }

Compare with a similar event from 1.20.5 below:

{
            "action": "Scheduling",
            "apiVersion": "v1",
            "eventTime": "2021-08-05T16:53:44.592316Z",
            "firstTimestamp": null,
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "netshoot-5ccb8d65c7-8d5sr",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "373641912",
                "uid": "88e1db83-adfb-4e4f-8e1d-89ed98395465"
            },
            "kind": "Event",
            "lastTimestamp": null,
            "message": "0/51 nodes are available: 40 Insufficient cpu",
            "metadata": {
                "creationTimestamp": "2021-08-05T16:53:44Z",
                "name": "netshoot-5ccb8d65c7-8d5sr.16987749e75c0e0b",
                "namespace": "tsengineering-netshoot-dev",
                "resourceVersion": "373807800",
                "uid": "aef136c3-0e8f-4dac-abf8-24d5a1d7243d"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "default-scheduler",
            "reportingInstance": "default-scheduler-ts-sharedplatform-rck-nonprod-cp-qzzjh",
            "series": {
                "count": 460,
                "lastObservedTime": "2021-08-05T19:22:48.831892Z"
            },
            "source": {},
            "type": "Warning"
        }

Note the differences:

  1. .metadata.selfLink is removed in 1.20
  2. .source is empty
  3. .eventTime is populated
  4. .firstTimestamp and .lastTimestamp are null

Is .lastTimestamp: null perhaps causing the event watch time comparison to fail? See here:

// skip events happened before watch
FilterFunc: func(obj interface{}) bool {
	eve := obj.(*kubernetes.Event)
	if kubernetes.Time(&eve.LastTimestamp).Before(now) {
		return false
	}
	return true
},
Running metricbeat in debug mode with -d "*" does not surface any errors; the events seem to be silently dropped.

Related issue:
kubernetes/kubernetes#90482

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 10, 2021
delfer commented Nov 19, 2021

We are also affected.

semyon1992 commented Nov 19, 2021

I have hit this issue on my cluster too.
Kubernetes version: v1.21.5
Metricbeat version: 7.15.2

kubectl get events -o yaml
returns null for firstTimestamp and lastTimestamp.

@los911evgen

The same issue.

@ChrsMark ChrsMark added the Team:Integrations Label for the Integrations team label Dec 2, 2021
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 2, 2021
bgaliev commented Dec 13, 2021

I have the same issue

@ChrsMark

Thanks for reporting this @MnrGreg and folks. It looks like the API's types have changed. We are already looking into this :).

@ChrsMark ChrsMark self-assigned this Dec 13, 2021
ChrsMark commented Dec 13, 2021

It seems that the API's types have changed, and we need to investigate further in order to support newer and older versions at the same time. In any case, we need to stop the filtering from dropping all events, so I've opened a PR introducing a new skip_older setting that makes the filter configurable for cases where we cannot rely on the LastTimestamp field. This can be released with 8.0 to solve problems like the one reported in this issue.

@ChrsMark

An update on this.

Based on kubernetes/kubernetes#90482 (comment), manual testing, and the provided sample events, it looks like the firstTimestamp, lastTimestamp and eventTime fields are all optional, so we cannot rely on any of them.

In addition, there is the new events API and a plan to migrate to it (link); in the long term, once all components (the API server most of all) have migrated, we will be able to switch to the new API's types.

Based on testing up to version 1.23.0 (kind), I see that the old API is still used:

        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2021-12-15T13:47:15Z",
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Endpoints",
                "name": "rancher.io-local-path",
                "namespace": "local-path-storage",
                "resourceVersion": "513",
                "uid": "c470b86c-1157-48b2-a15c-e2727c43bc1d"
            },
            "kind": "Event",
            "lastTimestamp": "2021-12-15T13:47:15Z",
            "message": "local-path-provisioner-5bb5788f44-t4gb6_8c4a9e90-9859-4079-a47d-1d2acb5e7ed1 became leader",
            "metadata": {
                "creationTimestamp": "2021-12-15T13:47:15Z",
                "managedFields": [
                    {
                        "apiVersion": "v1",
                        "fieldsType": "FieldsV1",
                        "fieldsV1": {
                            "f:count": {},
                            "f:firstTimestamp": {},
                            "f:involvedObject": {},
                            "f:lastTimestamp": {},
                            "f:message": {},
                            "f:reason": {},
                            "f:source": {
                                "f:component": {}
                            },
                            "f:type": {}
                        },
                        "manager": "local-path-provisioner",
                        "operation": "Update",
                        "time": "2021-12-15T13:47:15Z"
                    }
                ],
                "name": "rancher.io-local-path.16c0f1b7c1eb6823",
                "namespace": "local-path-storage",
                "resourceVersion": "514",
                "uid": "5fc99d35-787b-4d8f-806c-ee4be1ccc7a8"
            },
            "reason": "LeaderElection",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "rancher.io/local-path_local-path-provisioner-5bb5788f44-t4gb6_8c4a9e90-9859-4079-a47d-1d2acb5e7ed1"
            },
            "type": "Normal"
        }

Same results when testing on GKE (1.21.1).

In this regard I would go with updating our implementation to tolerate events that have no value in the lastTimestamp field: in the generic case events are still filtered when there is a valid timestamp, but when there is no valid timestamp we don't skip them. @MichaelKatsoulis any thoughts here?

@MichaelKatsoulis

The approach I was thinking of is that, after the introduction of the skip_older setting, we should set its default value to false.
That way, by default, we don't skip older or non-timestamped events. By default, the events TTL is 1 hour (link).
If the user wants to set it to true, we could add a note that events with no timestamp will also be skipped.
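For reference, a module configuration using the proposed setting might look like the hypothetical snippet below; the skip_older name comes from the PR mentioned above, while the surrounding layout is the standard metricbeat kubernetes module config:

```yaml
- module: kubernetes
  metricsets: ["event"]
  period: 10s
  # Hypothetical: when false, do not drop events that predate the watch
  # start (or that carry no usable timestamp at all).
  skip_older: false
```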

@ChrsMark

I think the default should remain skip_older: true to avoid possible breaking changes. We can of course still avoid filtering events that have no valid timestamp, in all cases.

ChrsMark commented Dec 16, 2021

Ok, after thinking this through with @MichaelKatsoulis, the way to go is to change the condition to also take eventTime into account with OR logic; if both of these fields are missing we skip the event, and to disable this skipping one can use the skip_older: false setting.

The change needed is at:

if m.skipOlder && kubernetes.Time(&eve.LastTimestamp).Before(now) {

and would become something like:

if m.skipOlder && (kubernetes.Time(&eve.LastTimestamp).Before(now) || kubernetes.Time(&eve.EventTime).Before(now)) {

@ChrsMark

Hi @MnrGreg, since the temporary workaround has been merged and is scheduled for v8.0.0, would you be able to test with docker.elastic.co/beats/metricbeat:8.0.0-SNAPSHOT and verify whether the patch solves your issue? Thank you :)

@ChrsMark

Closing for now. We can re-open if things don't work as expected.
