Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TaskRun timeout is always set to 1h0m0s after v0.40 #6520

Closed
syan-tibco opened this issue Apr 11, 2023 · 9 comments
Closed

TaskRun timeout is always set to 1h0m0s after v0.40 #6520

syan-tibco opened this issue Apr 11, 2023 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@syan-tibco
Copy link

Expected Behavior

The timeout in TaskRun should honor the timeout.tasks in pipelinerun

Actual Behavior

It works before v0.39. After v0.40 the timeout in TaskRun is always set to 1h0m0s.

Steps to Reproduce the Problem

  1. Create a PipelineRune with the following timeouts
  timeouts:
    pipeline: "40s"
    tasks: "30s"
    finally: "10s"
  1. Make sure we only have one timeout set on the chain.

PipelineRun --> pipeline --> task --> TaskRun

  1. kubeclt apply the PipelineRun to k8s and checks the timeout on generated TaskRun.

I intentionally change the timeout in config-defaults as default-timeout-minutes: "30". Just to make sure I can see different values in TaskRun

I can see the timeout is set correctly for

  • TEKTON_PIPELINE_RELEASE=v0.39.0
kind: TaskRun
...
timeout: 29.99155678s

I can see the timeout is always 1h for

  • TEKTON_PIPELINE_RELEASE=v0.40.0
  • TEKTON_PIPELINE_RELEASE=v0.41.2
  • TEKTON_PIPELINE_RELEASE=v0.44.2
kind: TaskRun
...
timeout: 1h0m0s

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: v1.26.3
Kustomize Version: v4.5.7
Server Version: v1.25.4
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.41.2

If this is indeed a bug; we want this to have a patch on 0.41.x. We have some GUI dependency on the old tekton status. (new tekton move task in status to different CR.)

@syan-tibco syan-tibco added the kind/bug Categorizes issue or PR as related to a bug. label Apr 11, 2023
@lbernick
Copy link
Member

I believe this is a duplicate of #6137. @syan-tibco can you take a look at that issue and lmk if this answers your question? In the example above, tekton should cancel the taskruns after 30s even if their timeout is 1h. You can use pipeline.spec.tasks[].timeout to set timeouts for child taskruns.

@syan-tibco
Copy link
Author

syan-tibco commented Apr 12, 2023

hi @lbernick thanks for the response. It is similar issue but this change breaks our use case. Yes in the sample that I give tekton will cancel the taskrun after 30s. But when we have longer taskruns; the task will also be canceled in 1 hour. Here is the details

use case

We have some pipelines that need to run 6 ~ 7 hours. The task inside that pipeline might need to run 3 ~ 4 hours. It covers cases like:

  • bring up eks, create AWS resource and install all tools (1+ hours)
  • upgrade EKS and restart all nodes in a nice way (3+ hours)
  • AMI upgrade for hundreds of nodes (4+ hours)

problem

The pipelinerun that we submit

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
...
  timeouts:
    finally: 5m0s
    pipeline: 2h0m0s
    tasks: 1h55m0s

This will cover 90% of cases. The longest pipeline timeout is 7h.

Here the taskrun that I get

apiVersion: tekton.dev/v1beta1
kind: TaskRun
...
  timeout: 1h0m0s

Then after 1 hour I will get error like

        conditions:
        - lastTransitionTime: "2023-04-11T02:05:55Z"
          message: TaskRun "generic-runner-910716586980-1681175155534-generic-runner"
            failed to finish within "1h0m0s"
          reason: TaskRunTimeout
          status: "False"
          type: Succeeded

And it will delete the pods thus we will lost all logs.

workaround

No workaround for us; we have to revert our production setup to v0.39.0. Here is the sample taskrun for v0.39.0

apiVersion: tekton.dev/v1beta1
kind: TaskRun
...
  timeout: 6h59m59.943281705s

here is the pipelinerun setting

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
....
  timeouts:
    pipeline: 7h0m0s

@lbernick
Copy link
Member

@syan-tibco have you tried setting pipeline.spec.tasks[].timeout? another option is to change the default timeout from 1h to something longer than 7h

@syan-tibco
Copy link
Author

yeah I tried all of them. I setup timeout on all the possible place that I can set

  • pipelinerun
  • pipeline
  • task

I also change the default value to default-timeout-minutes: "30" as I mentioned in the beginning. I just want to see a different timeout value in taskrun. But I always get this fixed 1h0m0s

That is why I name the topic.

In production, before I revert to v0.39.0; I tried to set default-timeout-minutes: "120". But still timeout in 1 hour.

@chengjoey
Copy link
Member

chengjoey commented Apr 13, 2023

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: timeout
spec:
  timeouts:
    // this timeout always not effect
    tasks: 30s
  pipelineSpec:
    tasks:
    - name: task1
      // this timeout is in effect
      timeout: 50s
      taskSpec:
        steps:
        - image: busybox
          script: |
            echo "hello"
            sleep 1200
            echo "goodbye"

This situation may have started from this PR.

In v0.39.0, spec.timeouts.tasks could be set to taskrun by getTaskRunTimeout

// getTaskRunTimeout returns the timeout to set when creating the ResolvedPipelineTask.
// If there is no timeout for the TaskRun, returns 0.
// If pipeline level timeouts have already been exceeded, returns 1 second.
func getTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rpt *resources.ResolvedPipelineTask, c clock.PassiveClock) *metav1.Duration {
var timeout time.Duration
if pr.TasksTimeout() != nil {
timeout = pr.TasksTimeout().Duration
} else {
timeout = pr.PipelineTimeout(ctx)
}
return calculateTaskRunTimeout(timeout, pr, rpt, c)
}

after v0.39.0 taskrun timeout looks like it can only be received from pipelinetask.Timeout

if rpt.PipelineTask.Timeout != nil {
tr.Spec.Timeout = rpt.PipelineTask.Timeout
}

is this as expected? may be I'm missing something

cc @lbernick @abayer

@lbernick
Copy link
Member

yeah I tried all of them. I setup timeout on all the possible place that I can set

  • pipelinerun
  • pipeline
  • task

I also change the default value to default-timeout-minutes: "30" as I mentioned in the beginning. I just want to see a different timeout value in taskrun. But I always get this fixed 1h0m0s

That is why I name the topic.

In production, before I revert to v0.39.0; I tried to set default-timeout-minutes: "120". But still timeout in 1 hour.

I just tried to reproduce this on main and was unable to.

Using the following PipelineRun:

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: timeout
spec:
  timeouts:
    tasks: 2h
  pipelineSpec:
    tasks:
    - name: task1
      timeout: 90m
      taskSpec:
        steps:
        - image: busybox
          script: |
            echo "hello"
            sleep 1200
            echo "goodbye"

a child TaskRun was created with timeout 90m (expected behavior).

I also tried setting the default timeout to 10 minutes. Using the following PipelineRun:

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: timeout2
spec:
  timeouts:
    tasks: 2h
  pipelineSpec:
    tasks:
    - name: task1
      taskSpec:
        steps:
        - image: busybox
          script: |
            echo "hello"
            sleep 1200
            echo "goodbye"

a child TaskRun was created with timeout 10 minutes (also expected behavior).

pipelineRun.spec.timeouts.tasks applies to the cumulative time taken by all child TaskRuns. When that timeout elapses, all children will be canceled. It's no longer used to calculate the timeout applied directly to each child TaskRun. This had the unfortunate unintended consequence of being a breaking change for PipelineRuns with spec.timeouts.tasks greater than the default timeout; however, you can still achieve your intended behavior with spec.tasks[].timeout or config defaults.

I'm confused why in your initial bug you set default-timeout-minutes: "30" but didn't see this reflected on the taskrun. What version of tekton were you using when you observed this behavior, and what's the full yaml of your config-defaults configmap?

@syan-tibco
Copy link
Author

Hi @lbernick thanks for the test case. I played for a while and finally understood the logic. Previously I set the default setting wrongly. Now I make the default value work.

here is my test case:

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: timeout
spec:
  timeouts:
    pipeline: 2h
    tasks: 10s
  pipelineSpec:
    tasks:
    - name: task1
      timeout: 90m
      taskSpec:
        steps:
        - image: busybox
          timeout: "20s"
          script: |
            echo "hello"
            sleep 30
            echo "goodbye"

Here is my finding. We can set timeout in 4 places for a pipelinerun:

  1. spec.timeouts.tasks: this is a hidden minimum timeout. The task will be canceled if this timeout arrives first.
  2. spec.pipelineSpec.tasks.timeout: this is the key value! this value will be end on taskrun. This is the maximum timeout value. If this value is not set; then tekton will use default value
  3. spec.pipelineSpec.tasks.taskSpec.steps.timeout: this is minimum timeout value. The task will be canceled if this timeout arrives first.
  4. default-timeout-minutes in config-defaults configmap. This will only take effect when there is no value for point 2. This is still the maximum timeout.

The surprising part is that we have 2 hidden minimum timeout (#1 and #3) and 1 maximum timeout (#2 and #4). In this sample, the timeout on taskrun will be set as 1h30m0s but will fail after 10 seconds because of the value in #1.

In my actual use case I have 3 files

---
# file 1
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: task1
spec:
  steps:
  - image: busybox
    # timeout: "10m"
    script: |
      echo "hello"
      sleep 1200
      echo "goodbye"

---
# file 2
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: pipeline1
spec:
  tasks:
  - name: task1
    taskRef:
      name: task1
    # timeout: "20m"

---
# file 3
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: pipelinetimeout
spec:
  timeouts:
    pipeline: "2h"
    tasks: "1h55m0s"
    finally: "5m0s"
  taskRunTemplate:
    serviceAccountName: pipeline-cluster-admin
  pipelineRef:
    name: pipeline1

I guess if I set timeout on file 2; I should have correct maximum timeout. I will give a try and update the issue.

The testing I am doing is under v0.46.0. I will need to test on v0.41.2.

@syan-tibco
Copy link
Author

I tested in v0.41.2 and added timeout on file 2. It works.

The confusing part is that when we have layered yaml references; normally the top layer will overwrite values in the inner layer.

When I set this spec.timeouts.tasks; I will expect the inner layer task and taskrun honor this value. The default value should not play any role if any of the layers set timeout.

The timeout value in taskrun which comes from pipeline.spec.task.timeout make it harder to understand.

And finally, we have two timers running in parallel to countdown: one in pipelinerun and the other in taskrun. Make it hard to know which one is the source of the truce.

Thanks for the help. I now have a workaround. Just share the feedback about timeout. We can close this issue if nothing needs to be enhanced.

@lbernick
Copy link
Member

Thanks @syan-tibco, if there are any aspects of the documentation that can be improved here, please feel free to reopen the issue or open a pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

3 participants