
Fix recursion issue on Skip #3524

Merged

merged 2 commits into from Nov 17, 2020
Conversation

afrittoli
Member

@afrittoli afrittoli commented Nov 15, 2020

Changes

Fix recursion issue on Skip

The pipelinerun state is made of resolved pipelinerun tasks (rprt),
which are built from the actual status of the associated taskruns.

It is computationally easy to know whether a taskrun started, or completed
successfully or unsuccessfully; however, determining whether a taskrun
has been skipped or will be skipped in the pipeline run execution
requires evaluating the entire pipeline run status and associated dag.

The Skip method used to apply to a single rprt: it evaluated the entire
pipeline run status and dag, returned whether the specific rprt was
going to be skipped, and threw away the rest.

We used to invoke the Skip method on every rprt in the pipeline state
to calculate candidate tasks for execution. To make matters worse,
we also invoked the "Skip" method as part of "isDone", defined as
a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once,
incrementally, by caching the results of each invocation of 'Skip'.
We store the results in a map in the pipelinerun facts, along with the
pipelinerun state and associated dags. We introduce a new method on the
pipelinerun facts called "IsTaskSkipped".

This solution manages to hide some of the details of the skip logic from
the core reconciler logic, but it still requires the cache to be
manually reset in a couple of places. I believe further refactoring could
help, but I wanted to keep this PR as small as possible.
I will further pursue this work by reviving #2821
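The caching scheme described above can be sketched in Go roughly as follows. This is a simplified illustration, not Tekton's actual code: `ResolvedTask`, the trivial `skip` predicate, and the field names are hypothetical stand-ins for the real resolved-pipelinerun-task types.

```go
package main

import "fmt"

// ResolvedTask is a simplified stand-in for a resolved pipelinerun task (rprt).
type ResolvedTask struct {
	Name    string
	started bool
}

// PipelineRunFacts is a simplified stand-in for the pipelinerun facts, holding
// the state alongside the skip cache.
type PipelineRunFacts struct {
	State     []*ResolvedTask
	SkipCache map[string]bool
}

// skip is a placeholder for the (expensive) per-task skip computation that
// walks the pipeline run status and dag; here it is a trivial predicate.
func (t *ResolvedTask) skip(facts *PipelineRunFacts) bool {
	return !t.started
}

// IsTaskSkipped memoizes skip results so each task is evaluated at most once
// per reconcile, instead of re-walking the whole dag on every invocation.
func (facts *PipelineRunFacts) IsTaskSkipped(t *ResolvedTask) bool {
	if facts.SkipCache == nil {
		facts.SkipCache = make(map[string]bool)
	}
	if _, cached := facts.SkipCache[t.Name]; !cached {
		facts.SkipCache[t.Name] = t.skip(facts)
	}
	return facts.SkipCache[t.Name]
}

// ResetSkippedCache forces re-computation after the state changes.
func (facts *PipelineRunFacts) ResetSkippedCache() {
	facts.SkipCache = nil
}

func main() {
	a := &ResolvedTask{Name: "a", started: true}
	b := &ResolvedTask{Name: "b"}
	facts := &PipelineRunFacts{State: []*ResolvedTask{a, b}}
	fmt.Println(facts.IsTaskSkipped(a), facts.IsTaskSkipped(b)) // prints: false true
}
```

The key design point is that the cache lives on the facts object (whose lifetime is one reconcile), so it must be explicitly reset whenever the underlying state is mutated.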

This change adds a unit test that reproduces the issue in #3521, which
used to fail (with a 30s timeout) and now succeeds for pipelines roughly
up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links
takes longer than 30s, so I left the unit test at 80 to avoid
introducing a flaky test in CI. There is still work to do to improve
this further, some profiling / tracing work might help.

Breaking large pipelines into logical groups (branches or pipelines in
pipelines) would help reduce the complexity and computational cost for
very large pipelines.

Fixes #3521

Co-authored-by: Scott sbws@google.com

Signed-off-by: Andrea Frittoli andrea.frittoli@uk.ibm.com

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.


Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Fix an issue in the pipeline state resolution which led to very long or failed start times and high controller CPU usage for pipelines with a large number of dependencies between tasks (40+).

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Nov 15, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 15, 2020
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 15, 2020
@afrittoli afrittoli changed the title WIP Fix recursion issue on Skip Fix recursion issue on Skip Nov 15, 2020
@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 15, 2020
Member

@imjasonh imjasonh left a comment

/lgtm

Just a couple style questions, in case they help debugging this code in the future. Feel free to ignore them.

Thanks for tracking this down!

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2020
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2020
@afrittoli afrittoli added this to the Pipelines v0.18 milestone Nov 15, 2020
@afrittoli
Member Author

/cc @vdemeester @imjasonh @skaegi @jerop @pritidesai

I added this to the v0.18 milestone, I think it's probably worth a minor release.
I'm not sure we need this on v0.17.0.

@vdemeester
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2020
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2020
@pritidesai
Member

/cc @jerop

@ghost ghost left a comment

Awesome, thanks for tackling this Andrea!

facts.SkipCache = make(map[string]bool)
}
if _, cached := facts.SkipCache[t.PipelineTask.Name]; !cached {
facts.SkipCache[t.PipelineTask.Name] = t.skip(facts) // t.skip() is same as our existing t.Skip()

Suggest removing the comment now that the original t.Skip is renamed t.skip

Member Author

fixed

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2020
@ghost

ghost commented Nov 16, 2020

Oops, one more thing: IsTaskSkipped is mentioned in the commit message but no longer appears in the diff.

// needed, via the `Skip` method in pipelinerunresolution.go
// The skip data is sensitive to changes in the state. The ResetSkippedCache method
// can be used to clean the cache and force re-computation when needed.
SkipCache map[string]bool
Member

We might be able to instantiate this cache based on pr.Status.SkippedTasks, which holds the list of skippedTasks.

I will not block this PR for skippedTasks, we can revisit this in next iteration. @jerop?

@afrittoli
Member Author

It looks like CI is slower than my laptop:

=== RUN   TestPipelineRunState_SuccessfulOrSkippedDAGTasks/large_deps,_not_started
panic: test timed out after 10m0s

goroutine 274 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:1628 +0x11f
created by time.goFunc
	/usr/local/go/src/time/sleep.go:167 +0x52

I'll reduce the size of the pipeline.

@pritidesai pritidesai added the needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch label Nov 16, 2020
@ghost

ghost commented Nov 16, 2020

(Copying my comment thread from slack for posterity):

I've checked out the branch and I am noticing some slightly odd behaviour. Every test run varies dramatically in the amount of time large_deps,_not_started takes. Some runs are 7 seconds, some are blowing through the 30s timeout cap.

I'm poking around with pprof to see if I can figure out where all the time is being spent when the test run gets slow. When I hit the 30s timeout I am seeing the following top from pprof:

     3.83s 11.92% 11.92%     23.86s 74.28%  github.com/tektoncd/pipeline/pkg/reconciler/pipeline/dag.visit
     2.97s  9.25% 21.17%      7.73s 24.07%  runtime.concatstrings
     2.80s  8.72% 29.89%      2.80s  8.72%  runtime.duffcopy
     2.62s  8.16% 38.04%      2.62s  8.16%  aeshashbody
     2.02s  6.29% 44.33%      3.75s 11.67%  runtime.mapassign_faststr
     1.90s  5.92% 50.25%      3.93s 12.24%  runtime.mapaccess2_faststr
     1.73s  5.39% 55.64%      1.73s  5.39%  runtime.kevent
     1.68s  5.23% 60.87%      3.92s 12.20%  runtime.mallocgc
     1.61s  5.01% 65.88%      1.61s  5.01%  runtime.memmove
     1.60s  4.98% 70.86%      1.60s  4.98%  runtime.pthread_kill

So all the time is being spent in dag.visit. Digging a bit further I notice that it times out during the DagFromState function in the test, which happens before the successfulOrSkippedDAGTasks function is exercised. So somewhere in building the graph for the test we are slowing waaaaaay down sporadically.

Edit: here are the commands I'm running.

$ go test -cpuprofile cpu.prof -memprofile mem.prof -count 1 -v -timeout 30s ./pkg/reconciler/pipelinerun/resources/...

$ go tool pprof cpu.prof

(pprof) top

@pritidesai
Member

@sbwsg I ran the commands you pasted 🆙 and it's the same dag.visit showing at the top (I have checked out this branch and am working with it):

      flat  flat%   sum%        cum   cum%
     3.89s 10.92% 10.92%      3.89s 10.92%  runtime.duffcopy
     3.67s 10.31% 21.23%     25.30s 71.05%  github.com/tektoncd/pipeline/pkg/reconciler/pipeline/dag.visit
     2.80s  7.86% 29.09%      8.11s 22.77%  runtime.concatstrings
     2.29s  6.43% 35.52%      5.12s 14.38%  runtime.mallocgc
     2.23s  6.26% 41.79%      2.23s  6.26%  aeshashbody
     2.13s  5.98% 47.77%      2.13s  5.98%  runtime.pthread_kill
     1.79s  5.03% 52.79%      3.87s 10.87%  runtime.mapassign_faststr
     1.77s  4.97% 57.76%      1.77s  4.97%  runtime.pthread_cond_wait
     1.69s  4.75% 62.51%      3.10s  8.71%  runtime.mapaccess2_faststr
     1.68s  4.72% 67.23%      1.68s  4.72%  runtime.memmove

@pritidesai
Member

@afrittoli I ran the unit test you added here on master and, profiling it, got the following results:

     5.90s 11.31% 11.31%      5.90s 11.31%  runtime.pthread_cond_wait
     4.99s  9.57% 20.88%      4.99s  9.57%  runtime.kevent
     4.82s  9.24% 30.12%      4.82s  9.24%  runtime.pthread_kill
     4.62s  8.86% 38.98%     21.95s 42.09%  runtime.mapassign_faststr
     3.90s  7.48% 46.46%      8.19s 15.70%  runtime.evacuate_faststr
     2.88s  5.52% 51.98%      2.88s  5.52%  aeshashbody
        2s  3.84% 55.82%      3.33s  6.39%  runtime.scanobject
     1.90s  3.64% 59.46%      1.90s  3.64%  runtime.memclrNoHeapPointers
     1.61s  3.09% 62.55%      1.85s  3.55%  runtime.heapBitsSetType
     1.24s  2.38% 64.93%      1.24s  2.38%  runtime.memmove

I am wondering why these changes have so much time spent in dag.visit

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipeline_types.go 69.4% 72.7% 3.3
pkg/reconciler/pipeline/dag/dag.go 98.8% 98.9% 0.1

Only consider the link B -> A once. If A depends on B in N different
ways, treat that as a single link in the dag. Considering them
distinct helps neither cycle detection nor task scheduling.

In the cycle detection logic, use a string builder instead of '+'
to concatenate strings, since it is more efficient.
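The two optimizations this commit message describes can be sketched as follows. `uniqueDeps` and `visitKey` are hypothetical helpers illustrating the idea, not the actual dag.go code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// uniqueDeps collapses duplicate dependency links: if A depends on B via
// several mechanisms (e.g. resources, results, runAfter), only one edge is
// kept, so dag.visit traverses each link once.
func uniqueDeps(deps []string) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, d := range deps {
		if !seen[d] {
			seen[d] = true
			out = append(out, d)
		}
	}
	sort.Strings(out)
	return out
}

// visitKey builds the visited-map key with a strings.Builder rather than
// repeated "+" concatenation, avoiding the intermediate string allocations
// that showed up as runtime.concatstrings in the profiles above.
func visitKey(path []string, node string) string {
	var sb strings.Builder
	for _, p := range path {
		sb.WriteString(p)
		sb.WriteString(".")
	}
	sb.WriteString(node)
	return sb.String()
}

func main() {
	fmt.Println(uniqueDeps([]string{"b", "b", "a", "b"})) // duplicates collapsed
	fmt.Println(visitKey([]string{"a", "b"}, "c"))        // dotted path key
}
```

Deduplicating edges shrinks the fan-out of the recursive visit, which is why the number of dag.visit calls drops so sharply on densely linked pipelines.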
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipeline_types.go 69.4% 70.4% 1.0
pkg/reconciler/pipeline/dag/dag.go 98.8% 98.9% 0.1


}

return uniqueDeps.List()
Member

@pritidesai pritidesai Nov 17, 2020

this significantly reduces the calls to dag.visit and makes it possible for the unit test in this PR to complete on average in under five seconds. The mysterious dot notation in dag.visit is still unsolved though:

visited[currentName+"."+n.Task.HashKey()] = true

It's creating a link between two nodes which are indirectly connected (C.A in A -> B -> C), but where/how this data is being utilized is a mystery.

@pritidesai
Member

/lgtm

@skaegi let us know if you still run into scaling issues; the examples you specified should be fixed by these changes.

Thanks @afrittoli and @sbwsg for fixes and help troubleshoot this 🙏

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2020
@tekton-robot tekton-robot merged commit 719e304 into tektoncd:master Nov 17, 2020
@pritidesai
Member

Cherry picked with PR #3534

Development

Successfully merging this pull request may close these issues.

Pinning CPU with a pipeline with 50 tasks and use of params/results chain
5 participants