
Fix recursion issue on Skip #3524

Merged

merged 2 commits into from Nov 17, 2020
Conversation

afrittoli
Member

@afrittoli afrittoli commented Nov 15, 2020

Changes

Fix recursion issue on Skip

The pipelinerun state is made of resolved pipelinerun tasks (rprt),
which are built from the actual status of the associated taskruns.

It is computationally easy to know whether a taskrun started, or completed
successfully or unsuccessfully; however, determining whether a taskrun
has been skipped or will be skipped in the pipeline run execution
requires evaluating the entire pipeline run status and associated dag.

The Skip method used to apply to a single rprt: it evaluated the entire
pipeline run status and dag, returned whether the specific rprt was
going to be skipped, and threw away the rest.

We used to invoke the Skip method on every rprt in the pipeline state
to calculate candidate tasks for execution. To make matters worse,
we also invoked the "Skip" method as part of "isDone", defined as
a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once,
incrementally, by caching the results of each invocation of 'Skip'.
We store the results in a map in the pipelinerun facts, along with the
pipelinerun state and associated dags. We introduce a new method on the
pipelinerun facts called "IsTaskSkipped".

This solution manages to hide some of the details of the skip logic from
the core reconciler logic, but it still requires the cache to be
manually reset in a couple of places. I believe further refactoring could
help, but I wanted to keep this PR as small as possible.
I will further pursue this work by reviving #2821
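The caching scheme described above can be sketched in Go roughly as follows. This is a simplified illustration, not Tekton's actual code: `ResolvedTask`, the trivial `skip` predicate, and the field names are hypothetical stand-ins for the real resolved-pipelinerun-task types.

```go
package main

import "fmt"

// ResolvedTask is a simplified stand-in for a resolved pipelinerun task (rprt).
type ResolvedTask struct {
	Name    string
	started bool
}

// PipelineRunFacts is a simplified stand-in for the pipelinerun facts, holding
// the state alongside the skip cache.
type PipelineRunFacts struct {
	State     []*ResolvedTask
	SkipCache map[string]bool
}

// skip is a placeholder for the (expensive) per-task skip computation that
// walks the pipeline run status and dag; here it is a trivial predicate.
func (t *ResolvedTask) skip(facts *PipelineRunFacts) bool {
	return !t.started
}

// IsTaskSkipped memoizes skip results so each task is evaluated at most once
// per reconcile, instead of re-walking the whole dag on every invocation.
func (facts *PipelineRunFacts) IsTaskSkipped(t *ResolvedTask) bool {
	if facts.SkipCache == nil {
		facts.SkipCache = make(map[string]bool)
	}
	if _, cached := facts.SkipCache[t.Name]; !cached {
		facts.SkipCache[t.Name] = t.skip(facts)
	}
	return facts.SkipCache[t.Name]
}

// ResetSkippedCache forces re-computation after the state changes.
func (facts *PipelineRunFacts) ResetSkippedCache() {
	facts.SkipCache = nil
}

func main() {
	a := &ResolvedTask{Name: "a", started: true}
	b := &ResolvedTask{Name: "b"}
	facts := &PipelineRunFacts{State: []*ResolvedTask{a, b}}
	fmt.Println(facts.IsTaskSkipped(a), facts.IsTaskSkipped(b)) // prints: false true
}
```

The key design point is that the cache lives on the facts object (whose lifetime is one reconcile), so it must be explicitly reset whenever the underlying state is mutated.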

This change adds a unit test that reproduces the issue in #3521, which
used to fail (with a 30s timeout) and now succeeds for pipelines roughly
up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links
takes longer than 30s, so I left the unit test at 80 to avoid
introducing a flaky test in CI. There is still work to do to improve
this further, some profiling / tracing work might help.

Breaking large pipelines into logical groups (branches or pipelines in
pipelines) would help reduce the complexity and computational cost for
very large pipelines.

Fixes #3521

Co-authored-by: Scott sbws@google.com

Signed-off-by: Andrea Frittoli andrea.frittoli@uk.ibm.com

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.


Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Fix an issue in the pipeline state resolution which led to very long or failed start times and high controller CPU usage for pipelines with a large number of dependencies between tasks (40+).

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Nov 15, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 15, 2020
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 15, 2020
@afrittoli afrittoli changed the title WIP Fix recursion issue on Skip Fix recursion issue on Skip Nov 15, 2020
@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 15, 2020
Member

@imjasonh imjasonh left a comment

/lgtm

Just a couple style questions, in case they help debugging this code in the future. Feel free to ignore them.

Thanks for tracking this down!

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2020
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2020
@afrittoli afrittoli added this to the Pipelines v0.18 milestone Nov 15, 2020
@afrittoli
Member Author

/cc @vdemeester @imjasonh @skaegi @jerop @pritidesai

I added this to the v0.18 milestone, I think it's probably worth a minor release.
I'm not sure we need this on v0.17.0.

@vdemeester
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2020
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2020
@pritidesai
Member

/cc @jerop

@ghost ghost left a comment

Awesome, thanks for tackling this Andrea!

facts.SkipCache = make(map[string]bool)
}
if _, cached := facts.SkipCache[t.PipelineTask.Name]; !cached {
facts.SkipCache[t.PipelineTask.Name] = t.skip(facts) // t.skip() is same as our existing t.Skip()

Suggest removing the comment now that the original t.Skip is renamed t.skip

Member Author

fixed

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2020
@ghost

ghost commented Nov 16, 2020

Oops, one more thing: IsTaskSkipped is mentioned in the commit message but no longer appears in the diff.

// needed, via the `Skip` method in pipelinerunresolution.go
// The skip data is sensitive to changes in the state. The ResetSkippedCache method
// can be used to clean the cache and force re-computation when needed.
SkipCache map[string]bool
Member

We might be able to instantiate this cache based on pr.Status.SkippedTasks, which holds the list of skippedTasks.

I will not block this PR for skippedTasks, we can revisit this in next iteration. @jerop?

@afrittoli
Member Author

It looks like CI is slower than my laptop:

=== RUN   TestPipelineRunState_SuccessfulOrSkippedDAGTasks/large_deps,_not_started
panic: test timed out after 10m0s

goroutine 274 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:1628 +0x11f
created by time.goFunc
	/usr/local/go/src/time/sleep.go:167 +0x52

I'll reduce the size of the pipeline.

@pritidesai pritidesai added the needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch label Nov 16, 2020
@ghost

ghost commented Nov 16, 2020

(Copying my comment thread from slack for posterity):

I've checked out the branch and I am noticing some slightly odd behaviour. Every test run varies dramatically in the amount of time large_deps,_not_started takes. Some runs are 7 seconds, some are blowing through the 30s timeout cap.

I'm poking around with pprof to see if I can figure out where all the time is being spent when the test run gets slow. When I hit the 30s timeout I am seeing the following top from pprof:

     3.83s 11.92% 11.92%     23.86s 74.28%  github.com/tektoncd/pipeline/pkg/reconciler/pipeline/dag.visit
     2.97s  9.25% 21.17%      7.73s 24.07%  runtime.concatstrings
     2.80s  8.72% 29.89%      2.80s  8.72%  runtime.duffcopy
     2.62s  8.16% 38.04%      2.62s  8.16%  aeshashbody
     2.02s  6.29% 44.33%      3.75s 11.67%  runtime.mapassign_faststr
     1.90s  5.92% 50.25%      3.93s 12.24%  runtime.mapaccess2_faststr
     1.73s  5.39% 55.64%      1.73s  5.39%  runtime.kevent
     1.68s  5.23% 60.87%      3.92s 12.20%  runtime.mallocgc
     1.61s  5.01% 65.88%      1.61s  5.01%  runtime.memmove
     1.60s  4.98% 70.86%      1.60s  4.98%  runtime.pthread_kill

So all the time is being spent in dag.visit. Digging a bit further I notice that it times out during the DagFromState function in the test, which happens before the successfulOrSkippedDAGTasks function is exercised. So somewhere in building the graph for the test we are slowing waaaaaay down sporadically.

Edit: here are the commands I'm running.

$ go test -cpuprofile cpu.prof -memprofile mem.prof -count 1 -v -timeout 30s ./pkg/reconciler/pipelinerun/resources/...

$ go tool pprof cpu.prof

(pprof) top

@pritidesai
Member

@sbwsg I ran the commands you pasted 🆙 and it's the same dag.visit showing at the top (I have checked out this branch and am working with it):

      flat  flat%   sum%        cum   cum%
     3.89s 10.92% 10.92%      3.89s 10.92%  runtime.duffcopy
     3.67s 10.31% 21.23%     25.30s 71.05%  github.com/tektoncd/pipeline/pkg/reconciler/pipeline/dag.visit
     2.80s  7.86% 29.09%      8.11s 22.77%  runtime.concatstrings
     2.29s  6.43% 35.52%      5.12s 14.38%  runtime.mallocgc
     2.23s  6.26% 41.79%      2.23s  6.26%  aeshashbody
     2.13s  5.98% 47.77%      2.13s  5.98%  runtime.pthread_kill
     1.79s  5.03% 52.79%      3.87s 10.87%  runtime.mapassign_faststr
     1.77s  4.97% 57.76%      1.77s  4.97%  runtime.pthread_cond_wait
     1.69s  4.75% 62.51%      3.10s  8.71%  runtime.mapaccess2_faststr
     1.68s  4.72% 67.23%      1.68s  4.72%  runtime.memmove

@pritidesai
Member

@afrittoli I ran the unit test you added here on master and, profiling it, got the following results:

     5.90s 11.31% 11.31%      5.90s 11.31%  runtime.pthread_cond_wait
     4.99s  9.57% 20.88%      4.99s  9.57%  runtime.kevent
     4.82s  9.24% 30.12%      4.82s  9.24%  runtime.pthread_kill
     4.62s  8.86% 38.98%     21.95s 42.09%  runtime.mapassign_faststr
     3.90s  7.48% 46.46%      8.19s 15.70%  runtime.evacuate_faststr
     2.88s  5.52% 51.98%      2.88s  5.52%  aeshashbody
        2s  3.84% 55.82%      3.33s  6.39%  runtime.scanobject
     1.90s  3.64% 59.46%      1.90s  3.64%  runtime.memclrNoHeapPointers
     1.61s  3.09% 62.55%      1.85s  3.55%  runtime.heapBitsSetType
     1.24s  2.38% 64.93%      1.24s  2.38%  runtime.memmove

I am wondering why these changes have so much time spent in dag.visit

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipeline_types.go 69.4% 72.7% 3.3
pkg/reconciler/pipeline/dag/dag.go 98.8% 98.9% 0.1

Only consider the link B -> A once. If A depends on B in N different
ways, treat that as a single link in the dag. Considering them
distinct helps neither cycle detection nor task scheduling.

In the cycle detection logic, use a string builder instead of '+'
to concatenate strings, since it is more efficient.
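The two optimizations this commit message describes can be sketched as follows. `uniqueDeps` and `visitKey` are hypothetical helpers illustrating the idea, not the actual dag.go code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// uniqueDeps collapses duplicate dependency links: if A depends on B via
// several mechanisms (e.g. resources, results, runAfter), only one edge is
// kept, so dag.visit traverses each link once.
func uniqueDeps(deps []string) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, d := range deps {
		if !seen[d] {
			seen[d] = true
			out = append(out, d)
		}
	}
	sort.Strings(out)
	return out
}

// visitKey builds the visited-map key with a strings.Builder rather than
// repeated "+" concatenation, avoiding the intermediate string allocations
// that showed up as runtime.concatstrings in the profiles above.
func visitKey(path []string, node string) string {
	var sb strings.Builder
	for _, p := range path {
		sb.WriteString(p)
		sb.WriteString(".")
	}
	sb.WriteString(node)
	return sb.String()
}

func main() {
	fmt.Println(uniqueDeps([]string{"b", "b", "a", "b"})) // duplicates collapsed
	fmt.Println(visitKey([]string{"a", "b"}, "c"))        // dotted path key
}
```

Deduplicating edges shrinks the fan-out of the recursive visit, which is why the number of dag.visit calls drops so sharply on densely linked pipelines.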
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipeline_types.go 69.4% 70.4% 1.0
pkg/reconciler/pipeline/dag/dag.go 98.8% 98.9% 0.1


}

return uniqueDeps.List()
Member

@pritidesai pritidesai Nov 17, 2020

this significantly reduces the calls to dag.visit and makes it possible for the unit test in this PR to complete on average in under five seconds. The mysterious dot notation in dag.visit is still unsolved though:

visited[currentName+"."+n.Task.HashKey()] = true

It's creating a link between two nodes which are indirectly connected (C.A in A -> B -> C), but where/how this data is being utilized is a mystery.

@pritidesai
Member

/lgtm

@skaegi let us know if you still run into scaling issues; the examples you specified should be fixed by these changes.

Thanks @afrittoli and @sbwsg for fixes and help troubleshoot this 🙏

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2020
@tekton-robot tekton-robot merged commit 719e304 into tektoncd:master Nov 17, 2020
@pritidesai
Member

Cherry picked with PR #3534

Development

Successfully merging this pull request may close these issues.

Pinning CPU with a pipeline with 50 tasks and use of params/results chain
5 participants