
WIP: Ensure that PipelineRuns are marked as timed out if a task timed out due to the PR timeout #5133

Closed

Conversation

abayer
Contributor

@abayer abayer commented Jul 13, 2022

Changes

Fixes #5127

We've been seeing sporadic flaky failures in a number of e2e tests for a while now, such as `TestPipelineRunTimeout` and sidecar-related tests. I recently dug into exactly what differed between a success and a failure, specifically for `TestPipelineRunTimeout`, the most frequent of those flakes. I was able to determine that sometimes the `TaskRun` would be timed out due to the `PipelineRun`-level timeout, but `pr.HasTimedOut` would not return true on the next reconciliation of the `PipelineRun`. This strongly suggests that the `TaskRun` timeout was calculated to end slightly before the `PipelineRun` timeout, and that the `PipelineRun` reconciliation happened in the very brief window (milliseconds at most) between the `TaskRun` completing as timed out and the `PipelineRun` timeout being reached.

It's not possible for us to make the end of the generated `TaskRun` timeout exactly match the end of the specified `PipelineRun` timeout, since the `TaskRun`'s `StartTime` won't be set until the `TaskRun` has actually been created, so as best as I can tell there will always be some difference there. So I decided to work around this inherent limitation by instead tracking the cases where we've set the `TaskRun` timeout based on `PipelineRun.Status.StartTime + PipelineRun.PipelineTimeout(ctx)`, i.e., where the `TaskRun` timeout is the time remaining between the `PipelineRun`'s elapsed time and the point at which the `PipelineRun` itself would time out.

Then, if all tasks in a `PipelineRun` have completed, and at least one of them has timed out and had its timeout set based on that remaining time, we know that the `PipelineRun` has timed out, even if `pr.HasTimedOut` returns false because we haven't quite reached the end of the `PipelineRun`'s timeout duration.
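For illustration, here's a minimal Go sketch of that approach. The helper names and signatures are hypothetical, not the actual reconciler code:

```go
package main

import (
	"fmt"
	"time"
)

// childTaskRunTimeout returns the timeout to apply to a child TaskRun so that it
// cannot outlive the PipelineRun's own timeout, plus a flag recording that the
// value was derived from the PipelineRun timeout rather than a task-level one.
// (Hypothetical helper for illustration only.)
func childTaskRunTimeout(prStart time.Time, prTimeout time.Duration, now time.Time) (time.Duration, bool) {
	remaining := prTimeout - now.Sub(prStart)
	if remaining < 0 {
		remaining = 0
	}
	return remaining, true // true == "this timeout came from the PipelineRun timeout"
}

// pipelineRunTimedOut reports whether the PipelineRun should be treated as timed
// out: either its own clock has run out, or every task has finished and at least
// one of them timed out with a timeout derived from the PipelineRun.
func pipelineRunTimedOut(prHasTimedOut, allTasksDone, anyTaskTimedOutViaPRTimeout bool) bool {
	return prHasTimedOut || (allTasksDone && anyTaskTimedOutViaPRTimeout)
}

func main() {
	// A PipelineRun that started 50 minutes ago with a 1h timeout leaves ~10m
	// for a TaskRun created now.
	start := time.Now().Add(-50 * time.Minute)
	timeout, fromPRTimeout := childTaskRunTimeout(start, time.Hour, time.Now())
	fmt.Println(timeout.Round(time.Minute), fromPRTimeout) // ~10m0s true

	// Even if pr.HasTimedOut is still false for a few milliseconds, a finished
	// pipeline with a task that hit the PR-derived timeout counts as timed out.
	fmt.Println(pipelineRunTimedOut(false, true, true)) // true
}
```

The key point is the marker on the derived timeout: once all tasks are done and one of them timed out carrying that marker, the `PipelineRun` can be marked as timed out without waiting for its own clock to expire.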

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

PipelineRuns will always be marked as timed out if any of their tasks timed out due to the timeout set on the PipelineRun itself.

…due to the PR timeout

Fixes tektoncd#5127

Signed-off-by: Andrew Bayer <andrew.bayer@gmail.com>
@tekton-robot tekton-robot added the kind/bug, release-note, do-not-merge/work-in-progress, and kind/flake labels Jul 13, 2022
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from abayer after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/L label Jul 13, 2022
@abayer
Contributor Author

abayer commented Jul 13, 2022

I've marked this as WIP because I'm 100% sure that more unit test coverage is needed but wanted to start running e2e tests over and over to see if any of the flakes ever show up.

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/reconciler/pipelinerun/pipelinerun.go | 86.3% | 86.4% | 0.1
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go | 94.3% | 91.0% | -3.2
pkg/reconciler/pipelinerun/resources/pipelinerunstate.go | 97.4% | 97.4% | 0.0

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test check-pr-has-kind-label

@tekton-robot
Collaborator

@abayer: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-tekton-pipeline-alpha-integration-tests
  • /test pull-tekton-pipeline-build-tests
  • /test pull-tekton-pipeline-integration-tests
  • /test tekton-pipeline-unit-tests

The following commands are available to trigger optional jobs:

  • /test pull-tekton-pipeline-go-coverage
  • /test pull-tekton-pipeline-kind-alpha-integration-tests
  • /test pull-tekton-pipeline-kind-alpha-yaml-tests
  • /test pull-tekton-pipeline-kind-integration-tests
  • /test pull-tekton-pipeline-kind-yaml-tests

Use /test all to run the following jobs that were automatically triggered:

  • pull-tekton-pipeline-alpha-integration-tests
  • pull-tekton-pipeline-build-tests
  • pull-tekton-pipeline-go-coverage
  • pull-tekton-pipeline-integration-tests
  • pull-tekton-pipeline-unit-tests

In response to this:

/test check-pr-has-kind-label

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abayer abayer removed the kind/flake label Jul 13, 2022
@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-pipeline-kind-k8s-v1-21-e2e

@tekton-robot
Collaborator

@abayer: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-tekton-pipeline-alpha-integration-tests
  • /test pull-tekton-pipeline-build-tests
  • /test pull-tekton-pipeline-integration-tests
  • /test tekton-pipeline-unit-tests

The following commands are available to trigger optional jobs:

  • /test pull-tekton-pipeline-go-coverage
  • /test pull-tekton-pipeline-kind-alpha-integration-tests
  • /test pull-tekton-pipeline-kind-alpha-yaml-tests
  • /test pull-tekton-pipeline-kind-integration-tests
  • /test pull-tekton-pipeline-kind-yaml-tests

Use /test all to run the following jobs that were automatically triggered:

  • pull-tekton-pipeline-alpha-integration-tests
  • pull-tekton-pipeline-build-tests
  • pull-tekton-pipeline-go-coverage
  • pull-tekton-pipeline-integration-tests
  • pull-tekton-pipeline-unit-tests

In response to this:

/test pull-pipeline-kind-k8s-v1-21-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-pipeline-kind-k8s-v1-21-e2e

@tekton-robot
Collaborator

@abayer: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-tekton-pipeline-alpha-integration-tests
  • /test pull-tekton-pipeline-build-tests
  • /test pull-tekton-pipeline-integration-tests
  • /test tekton-pipeline-unit-tests

The following commands are available to trigger optional jobs:

  • /test pull-tekton-pipeline-go-coverage
  • /test pull-tekton-pipeline-kind-alpha-integration-tests
  • /test pull-tekton-pipeline-kind-alpha-yaml-tests
  • /test pull-tekton-pipeline-kind-integration-tests
  • /test pull-tekton-pipeline-kind-yaml-tests

Use /test all to run the following jobs that were automatically triggered:

  • pull-tekton-pipeline-alpha-integration-tests
  • pull-tekton-pipeline-build-tests
  • pull-tekton-pipeline-go-coverage
  • pull-tekton-pipeline-integration-tests
  • pull-tekton-pipeline-unit-tests

In response to this:

/test pull-pipeline-kind-k8s-v1-21-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

@abayer
Contributor Author

abayer commented Jul 13, 2022

Boo - I hit TestPipelineRunTimeout failing once in ten local runs of the full e2e test suite, so I'm not sure this actually works. Also @jerop had a waaaaaay better idea which I'm working on now. =)

EDIT: Ah, my local test was screwed up and still using the v0.37.2 images. Sigh. Well, @jerop's idea is still better.

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

@abayer
Contributor Author

abayer commented Jul 13, 2022

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

@tekton-robot
Collaborator

@abayer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
pull-tekton-pipeline-integration-tests | aca1517 | link | true | /test pull-tekton-pipeline-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abayer
Contributor Author

abayer commented Jul 14, 2022

Closing in favor of #5134.

@abayer abayer closed this Jul 14, 2022
Labels
  • do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
  • kind/bug Categorizes issue or PR as related to a bug.
  • release-note Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PipelineRuns using timeout or timeouts fields sometimes are marked as failed rather than timed out
2 participants