TaskRun timeout is always set to 1h0m0s after v0.40 #6520

syan-tibco · 2023-04-11T15:08:34Z

Expected Behavior

The timeout in TaskRun should honor the timeout.tasks in pipelinerun

Actual Behavior

It works before v0.39. After v0.40 the timeout in TaskRun is always set to 1h0m0s.

Steps to Reproduce the Problem

Create a PipelineRune with the following timeouts

  timeouts:
    pipeline: "40s"
    tasks: "30s"
    finally: "10s"

Make sure we only have one timeout set on the chain.

PipelineRun --> pipeline --> task --> TaskRun

kubeclt apply the PipelineRun to k8s and checks the timeout on generated TaskRun.

I intentionally change the timeout in config-defaults as default-timeout-minutes: "30". Just to make sure I can see different values in TaskRun

I can see the timeout is set correctly for

TEKTON_PIPELINE_RELEASE=v0.39.0

kind: TaskRun
...
timeout: 29.99155678s

I can see the timeout is always 1h for

TEKTON_PIPELINE_RELEASE=v0.40.0
TEKTON_PIPELINE_RELEASE=v0.41.2
TEKTON_PIPELINE_RELEASE=v0.44.2

kind: TaskRun
...
timeout: 1h0m0s

Additional Info

Kubernetes version:

Output of kubectl version:

Client Version: v1.26.3
Kustomize Version: v4.5.7
Server Version: v1.25.4

Tekton Pipeline version:

Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.41.2

If this is indeed a bug; we want this to have a patch on 0.41.x. We have some GUI dependency on the old tekton status. (new tekton move task in status to different CR.)

The text was updated successfully, but these errors were encountered:

lbernick · 2023-04-12T18:48:43Z

I believe this is a duplicate of #6137. @syan-tibco can you take a look at that issue and lmk if this answers your question? In the example above, tekton should cancel the taskruns after 30s even if their timeout is 1h. You can use pipeline.spec.tasks[].timeout to set timeouts for child taskruns.

syan-tibco · 2023-04-12T19:47:09Z

hi @lbernick thanks for the response. It is similar issue but this change breaks our use case. Yes in the sample that I give tekton will cancel the taskrun after 30s. But when we have longer taskruns; the task will also be canceled in 1 hour. Here is the details

use case

We have some pipelines that need to run 6 ~ 7 hours. The task inside that pipeline might need to run 3 ~ 4 hours. It covers cases like:

bring up eks, create AWS resource and install all tools (1+ hours)
upgrade EKS and restart all nodes in a nice way (3+ hours)
AMI upgrade for hundreds of nodes (4+ hours)

problem

The pipelinerun that we submit

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
...
  timeouts:
    finally: 5m0s
    pipeline: 2h0m0s
    tasks: 1h55m0s

This will cover 90% of cases. The longest pipeline timeout is 7h.

Here the taskrun that I get

apiVersion: tekton.dev/v1beta1
kind: TaskRun
...
  timeout: 1h0m0s

Then after 1 hour I will get error like

        conditions:
        - lastTransitionTime: "2023-04-11T02:05:55Z"
          message: TaskRun "generic-runner-910716586980-1681175155534-generic-runner"
            failed to finish within "1h0m0s"
          reason: TaskRunTimeout
          status: "False"
          type: Succeeded

And it will delete the pods thus we will lost all logs.

workaround

No workaround for us; we have to revert our production setup to v0.39.0. Here is the sample taskrun for v0.39.0

apiVersion: tekton.dev/v1beta1
kind: TaskRun
...
  timeout: 6h59m59.943281705s

here is the pipelinerun setting

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
....
  timeouts:
    pipeline: 7h0m0s

lbernick · 2023-04-12T19:57:17Z

@syan-tibco have you tried setting pipeline.spec.tasks[].timeout? another option is to change the default timeout from 1h to something longer than 7h

syan-tibco · 2023-04-12T20:01:05Z

yeah I tried all of them. I setup timeout on all the possible place that I can set

pipelinerun
pipeline
task

I also change the default value to default-timeout-minutes: "30" as I mentioned in the beginning. I just want to see a different timeout value in taskrun. But I always get this fixed 1h0m0s

That is why I name the topic.

In production, before I revert to v0.39.0; I tried to set default-timeout-minutes: "120". But still timeout in 1 hour.

chengjoey · 2023-04-13T03:52:36Z

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: timeout
spec:
  timeouts:
    // this timeout always not effect
    tasks: 30s
  pipelineSpec:
    tasks:
    - name: task1
      // this timeout is in effect
      timeout: 50s
      taskSpec:
        steps:
        - image: busybox
          script: |
            echo "hello"
            sleep 1200
            echo "goodbye"

This situation may have started from this PR.

In v0.39.0, spec.timeouts.tasks could be set to taskrun by getTaskRunTimeout

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 1144 to 1155 in 6a789d4

    
           // getTaskRunTimeout returns the timeout to set when creating the ResolvedPipelineTask. 
        
           // If there is no timeout for the TaskRun, returns 0. 
        
           // If pipeline level timeouts have already been exceeded, returns 1 second. 
        
           func getTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rpt *resources.ResolvedPipelineTask, c clock.PassiveClock) *metav1.Duration { 
        
           	var timeout time.Duration 
        
           	if pr.TasksTimeout() != nil { 
        
           		timeout = pr.TasksTimeout().Duration 
        
           	} else { 
        
           		timeout = pr.PipelineTimeout(ctx) 
        
           	} 
        
           	return calculateTaskRunTimeout(timeout, pr, rpt, c) 
        
           }

after v0.39.0 taskrun timeout looks like it can only be received from pipelinetask.Timeout

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 842 to 844 in abd1849

    
           if rpt.PipelineTask.Timeout != nil { 
        
           	tr.Spec.Timeout = rpt.PipelineTask.Timeout 
        
           }

is this as expected? may be I'm missing something

cc @lbernick @abayer

lbernick · 2023-04-14T19:01:03Z

yeah I tried all of them. I setup timeout on all the possible place that I can set

pipelinerun

pipeline

task

I also change the default value to default-timeout-minutes: "30" as I mentioned in the beginning. I just want to see a different timeout value in taskrun. But I always get this fixed 1h0m0s

That is why I name the topic.

In production, before I revert to v0.39.0; I tried to set default-timeout-minutes: "120". But still timeout in 1 hour.

I just tried to reproduce this on main and was unable to.

Using the following PipelineRun:

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: timeout
spec:
  timeouts:
    tasks: 2h
  pipelineSpec:
    tasks:
    - name: task1
      timeout: 90m
      taskSpec:
        steps:
        - image: busybox
          script: |
            echo "hello"
            sleep 1200
            echo "goodbye"

a child TaskRun was created with timeout 90m (expected behavior).

I also tried setting the default timeout to 10 minutes. Using the following PipelineRun:

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: timeout2
spec:
  timeouts:
    tasks: 2h
  pipelineSpec:
    tasks:
    - name: task1
      taskSpec:
        steps:
        - image: busybox
          script: |
            echo "hello"
            sleep 1200
            echo "goodbye"

a child TaskRun was created with timeout 10 minutes (also expected behavior).

pipelineRun.spec.timeouts.tasks applies to the cumulative time taken by all child TaskRuns. When that timeout elapses, all children will be canceled. It's no longer used to calculate the timeout applied directly to each child TaskRun. This had the unfortunate unintended consequence of being a breaking change for PipelineRuns with spec.timeouts.tasks greater than the default timeout; however, you can still achieve your intended behavior with spec.tasks[].timeout or config defaults.

I'm confused why in your initial bug you set default-timeout-minutes: "30" but didn't see this reflected on the taskrun. What version of tekton were you using when you observed this behavior, and what's the full yaml of your config-defaults configmap?

syan-tibco · 2023-04-14T22:09:53Z

Hi @lbernick thanks for the test case. I played for a while and finally understood the logic. Previously I set the default setting wrongly. Now I make the default value work.

here is my test case:

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: timeout
spec:
  timeouts:
    pipeline: 2h
    tasks: 10s
  pipelineSpec:
    tasks:
    - name: task1
      timeout: 90m
      taskSpec:
        steps:
        - image: busybox
          timeout: "20s"
          script: |
            echo "hello"
            sleep 30
            echo "goodbye"

Here is my finding. We can set timeout in 4 places for a pipelinerun:

spec.timeouts.tasks: this is a hidden minimum timeout. The task will be canceled if this timeout arrives first.
spec.pipelineSpec.tasks.timeout: this is the key value! this value will be end on taskrun. This is the maximum timeout value. If this value is not set; then tekton will use default value
spec.pipelineSpec.tasks.taskSpec.steps.timeout: this is minimum timeout value. The task will be canceled if this timeout arrives first.
default-timeout-minutes in config-defaults configmap. This will only take effect when there is no value for point 2. This is still the maximum timeout.

The surprising part is that we have 2 hidden minimum timeout (#1 and #3) and 1 maximum timeout (#2 and #4). In this sample, the timeout on taskrun will be set as 1h30m0s but will fail after 10 seconds because of the value in #1.

In my actual use case I have 3 files

---
# file 1
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: task1
spec:
  steps:
  - image: busybox
    # timeout: "10m"
    script: |
      echo "hello"
      sleep 1200
      echo "goodbye"

---
# file 2
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: pipeline1
spec:
  tasks:
  - name: task1
    taskRef:
      name: task1
    # timeout: "20m"

---
# file 3
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: pipelinetimeout
spec:
  timeouts:
    pipeline: "2h"
    tasks: "1h55m0s"
    finally: "5m0s"
  taskRunTemplate:
    serviceAccountName: pipeline-cluster-admin
  pipelineRef:
    name: pipeline1

I guess if I set timeout on file 2; I should have correct maximum timeout. I will give a try and update the issue.

The testing I am doing is under v0.46.0. I will need to test on v0.41.2.

syan-tibco · 2023-04-15T02:15:06Z

I tested in v0.41.2 and added timeout on file 2. It works.

The confusing part is that when we have layered yaml references; normally the top layer will overwrite values in the inner layer.

When I set this spec.timeouts.tasks; I will expect the inner layer task and taskrun honor this value. The default value should not play any role if any of the layers set timeout.

The timeout value in taskrun which comes from pipeline.spec.task.timeout make it harder to understand.

And finally, we have two timers running in parallel to countdown: one in pipelinerun and the other in taskrun. Make it hard to know which one is the source of the truce.

Thanks for the help. I now have a workaround. Just share the feedback about timeout. We can close this issue if nothing needs to be enhanced.

lbernick · 2023-04-17T13:30:22Z

Thanks @syan-tibco, if there are any aspects of the documentation that can be improved here, please feel free to reopen the issue or open a pull request!

syan-tibco added the kind/bug Categorizes issue or PR as related to a bug. label Apr 11, 2023

lbernick closed this as completed Apr 17, 2023

vdemeester mentioned this issue Mar 22, 2024

Timeout not working on Pipeline run #7793

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TaskRun timeout is always set to 1h0m0s after v0.40 #6520

TaskRun timeout is always set to 1h0m0s after v0.40 #6520

syan-tibco commented Apr 11, 2023

lbernick commented Apr 12, 2023

syan-tibco commented Apr 12, 2023 •

edited

Loading

lbernick commented Apr 12, 2023

syan-tibco commented Apr 12, 2023

chengjoey commented Apr 13, 2023 •

edited

Loading

lbernick commented Apr 14, 2023

syan-tibco commented Apr 14, 2023

syan-tibco commented Apr 15, 2023

lbernick commented Apr 17, 2023

TaskRun timeout is always set to 1h0m0s after v0.40 #6520

TaskRun timeout is always set to 1h0m0s after v0.40 #6520

Comments

syan-tibco commented Apr 11, 2023

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Additional Info

lbernick commented Apr 12, 2023

syan-tibco commented Apr 12, 2023 • edited Loading

use case

problem

workaround

lbernick commented Apr 12, 2023

syan-tibco commented Apr 12, 2023

chengjoey commented Apr 13, 2023 • edited Loading

lbernick commented Apr 14, 2023

syan-tibco commented Apr 14, 2023

syan-tibco commented Apr 15, 2023

lbernick commented Apr 17, 2023

syan-tibco commented Apr 12, 2023 •

edited

Loading

chengjoey commented Apr 13, 2023 •

edited

Loading