From bf871c51386e3740de42b582237bb56312f300cb Mon Sep 17 00:00:00 2001
From: Yawen Luo
Date: Tue, 3 May 2022 11:57:31 -0400
Subject: [PATCH] Initial draft of TEP 0109: Better structured provenance retrieval in Tekton Chains

---
 ...d-provenance-retrieval-in-tekton-chains.md | 537 ++++++++++++++++++
 teps/README.md                                |   1 +
 2 files changed, 538 insertions(+)
 create mode 100644 teps/0109-better-structured-provenance-retrieval-in-tekton-chains.md

diff --git a/teps/0109-better-structured-provenance-retrieval-in-tekton-chains.md b/teps/0109-better-structured-provenance-retrieval-in-tekton-chains.md
new file mode 100644
index 000000000..003c934d7
--- /dev/null
+++ b/teps/0109-better-structured-provenance-retrieval-in-tekton-chains.md
@@ -0,0 +1,537 @@
+---
+status: implementable
+title: Better structured provenance retrieval in Tekton Chains
+creation-date: "2022-04-29"
+last-updated: "2022-05-02"
+authors:
+  - "@ywluogg"
+---
+
+# TEP-0109: Better structured provenance retrieval in Tekton Chains
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases](#use-cases)
+  - [Requirements](#requirements)
+- [Proposal](#proposal)
+  - [Notes and Caveats](#notes-and-caveats)
+- [Design Details](#design-details)
+- [Design Evaluation](#design-evaluation)
+  - [Reusability](#reusability)
+  - [Simplicity](#simplicity)
+  - [Flexibility](#flexibility)
+  - [User Experience](#user-experience)
+  - [Performance](#performance)
+  - [Risks and Mitigations](#risks-and-mitigations)
+  - [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Implementation Plan](#implementation-plan)
+  - [Test Plan](#test-plan)
+  - [Infrastructure Needed](#infrastructure-needed)
+  - [Upgrade and Migration Strategy](#upgrade-and-migration-strategy)
+  - [Implementation Pull Requests](#implementation-pull-requests)
+- [References](#references)
+
+## Summary
+
+_Recommendation_: read [TEP-0075 (object/dictionary param and result types)](https://github.com/tektoncd/community/blob/main/teps/0075-object-param-and-result-types.md) and
+[TEP-0076 (array support in results and indexing syntax)](https://github.com/tektoncd/community/blob/main/teps/0076-array-result-types.md)
+before this TEP, because this TEP builds on both of those Tekton Pipelines proposals.
+
+This TEP proposes expanding how Tekton Chains retrieves provenance metadata from Tekton Pipelines TaskRuns. The expansion improves metadata retrieval for various kinds of signable objects: it adds support for retrieving results of type object and array, and it covers artifact types that are not currently supported.
+
+## Motivation
+
+With SLSA being established, there is rising demand for richer provenance within attestations. [In-toto](https://github.com/in-toto/attestation) provides a way to carry richer provenance inside Predicates. Tekton Chains currently supports multiple kinds of attestations, and the in-toto attestation format is one of the most popular in the open source community. Tekton Chains currently uses [Results](https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md#emitting-results) and [Params](https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md#specifying-parameters) to generate this type of attestation. It relies on type hinting to capture the provenance info of a CI/CD pipeline's inputs and outputs in string formats. The concepts of `inputs` and `outputs` come from [SLSA provenance v0.2](https://slsa.dev/provenance/v0.2). As Chains plans to support more complex provenance info and structures, the current string-formatted type hinting will not provide the scalability and integrity it needs: Pipeline did not provide structured provenance info generation in TaskRuns, so Chains captures unstructured provenance info from TaskRuns in scattered ways. With [TEP 0075](https://github.com/tektoncd/community/blob/main/teps/0075-object-param-and-result-types.md) and [TEP 0076](https://github.com/tektoncd/community/blob/main/teps/0076-array-result-types.md), TaskRuns can provide structured results and params.
+
+### Goals
+
+- Add structured support for retrieving provenance from Tekton Pipeline TaskRuns in Tekton Chains
+- Leave room for later support of [nested objects in arrays](https://github.com/tektoncd/community/blob/main/teps/0075-object-param-and-result-types.md#more-alternatives) in Results in Tekton Chains
+- Be able to tell which signable artifacts are sources / inputs, and which are artifacts / outputs in TaskRuns
+- Support extended sets of signable artifacts
+- Keep the design easy to extend to Pipeline level provenance
+
+### Non-Goals
+
+- Discuss use cases and specific designs for supporting Pipeline level provenance
+- Support "Dependencies complete" in SLSA
+- Support nested objects in arrays in Results in Tekton Chains
+- Support Trusted Tasks immediately in the implementation for this TEP
+
+### Use Cases
+
+Before diving into the concrete use cases, this section first summarizes why the use cases shown below should be considered in scope. The use cases are organized around different types of inputs and outputs.
+
+Ultimately, we want to support as many input and output formats as possible, especially those that in-toto provenance supports. Here we focus only on a few commonly used examples.
+
+For `inputs`, looking at the SLSA requirements for sources, only SLSA L4 requires [Dependencies complete](https://slsa.dev/spec/v0.1/requirements#dependencies-complete). For now, if we only consider the requirements below SLSA L4, support that covers VCS and OCI images is a good start. In in-toto provenance, all inputs should be stored in the [materials](https://github.com/in-toto/attestation/tree/main/spec#predicate-conventions) field. The required format is:
+
+```
+{
+  "uri": "",
+  "digest": { /* DigestSet */ }
+}
+```
+
+For `outputs`, we are considering the following artifact types, as these are the ones supported in Artifact Registries: Python, Maven, Go, NodeJS and OCI images. OCI images are the only ones currently supported. The list is expected to grow as this document prompts discussion in the community about which types should be prioritized.
+
+#### Concrete Use Cases
+
+All the examples below generate in-toto attestations.
+
+1. Use Git commits as sources, and generate in-toto provenance for a TaskRun that builds an image.
+
+    A Task example **without** structured results and params support can look like:
+    ```yaml
+    apiVersion: tekton.dev/v1beta1
+    kind: Task
+    spec:
+      params:
+        - name: CHAINS-GIT_COMMIT
+          type: string
+          description: git commit sha
+        - name: CHAINS-GIT_URL
+          type: string
+          description: git commit url
+        - name: FOO_IMAGE
+          type: string
+          description: artifact image
+      ...
+      results:
+        - name: IMAGE_DIGEST
+          description: digest of image
+        - name: IMAGE_URL
+          description: url of image
+    ```
+    A Task example **with** structured results support will look like:
+    ```yaml
+    apiVersion: tekton.dev/v1beta1
+    kind: Task
+    spec:
+      ...
+      results:
+        - name: ARTIFACT_INPUTS
+          type: array
+          description: |
+            Stores result names that should be captured as signable artifacts as "materials" in in-toto provenances in Tekton Chains.
+        - name: ARTIFACT_OUTPUTS
+          type: array
+          description: |
+            Stores result names that should be captured as signable artifacts as "subjects" in in-toto provenances in Tekton Chains.
+        - name: git-vcs
+          type: object
+          description: |
+            The source distribution
+            * uri: resource uri of the artifact.
+            * digest: revision digest in form of algorithm:digest.
+          properties:
+            uri:
+              type: string
+            digest:
+              type: string
+        - name: oci-image
+          type: object
+          description: N/A
+          properties:
+            uri:
+              type: string
+            digest:
+              type: string
+    ```
+    And its corresponding TaskRun example will look like:
+    ```yaml
+    TaskRun:
+    ...
+      results:
+        - name: ARTIFACT_INPUTS
+          value: ["git-vcs"]
+        - name: ARTIFACT_OUTPUTS
+          value: ["oci-image"]
+        - name: git-vcs
+          value:
+            uri: git+https://github.com/foo/bar.git
+            digest: sha1:abc
+        - name: oci-image
+          value:
+            uri: gcr.io/somerepo/someimage
+            digest: sha256:abc
+    ```
+
+The generated in-toto provenance will contain the following:
+
+```
+subjects: [{"name": "gcr.io/somerepo/someimage", "digest": {"sha256": "abc"}}]
+...
+materials: [{
+  "uri": "git+https://github.com/foo/bar.git",
+  "digest": {"sha1": "abc"}
+}]
+```
+
+2. Use Perforce and images as sources, and generate in-toto provenance for a TaskRun that builds a Maven package.
+
+```yaml
+results:
+  - name: ARTIFACT_INPUTS
+    value: ["perforce-vcs"]
+  - name: ARTIFACT_OUTPUTS
+    value: ["maven-pkg", "maven-pom", "maven-src-pkg"]
+  - name: perforce-vcs
+    value:
+      uri: http://myp4web:8080/depot/main/atlas/
+      digest: sha256:abc
+  - name: maven-pkg
+    value:
+      uri: us-west4-maven.pkg.dev/test-project/test-repo/com/google/guava/guava/31.0/guava-31.0.jar
+      digest: sha256:abc
+  - name: maven-pom
+    value:
+      uri: us-west4-maven.pkg.dev/test-project/test-repo/com/google/guava/guava/31.0/guava-31.0.pom
+      digest: sha256:def
+  - name: maven-src-pkg
+    value:
+      uri: us-west4-maven.pkg.dev/test-project/test-repo/com/google/guava/guava/31.0/guava-31.0-sources.jar
+      digest: sha256:xyz
+```
+
+The generated fields for these targets will be:
+
+```
+subjects: [{"name": "us-west4-maven.pkg.dev/test-project/test-repo/com/google/guava/guava/31.0/guava-31.0.jar", "digest": {"sha256": "abc"}},
+{"name": "us-west4-maven.pkg.dev/test-project/test-repo/com/google/guava/guava/31.0/guava-31.0.pom", "digest": {"sha256": "def"}},
+{"name": "us-west4-maven.pkg.dev/test-project/test-repo/com/google/guava/guava/31.0/guava-31.0-sources.jar", "digest": {"sha256": "xyz"}}]
+...
+materials: [{
+  // The Perforce depot used as the build source.
+  "uri": "http://myp4web:8080/depot/main/atlas/",
+  // The digest identifying the version of the depot used
+  // for this build.
+  "digest": {"sha256": "abc"}
+}]
+```
+
+### Requirements
+
+- Chains is able to retrieve structured provenance info from a TaskRun and produce the currently supported in-toto provenance formats once the TaskRun is finished.
+- Per the SLSA L2 requirements, the provenance MUST provide the location of the source code in version control and MUST identify the output artifact via at least one cryptographic hash.
+- If a TaskRun is within a PipelineRun, the TaskRun's provenance can be produced once that TaskRun is finished.
+
+## Proposal
+
+Currently we use type hinting to retrieve the needed provenance info from a TaskRun's results. This proposal aims to leverage the structured results described in [TEP 0075](https://github.com/tektoncd/community/blob/main/teps/0075-object-param-and-result-types.md) and [TEP 0076](https://github.com/tektoncd/community/blob/main/teps/0076-array-result-types.md) to support better structured provenance retrieval for inputs and artifacts, making the current type hinting easier to use. The proposal focuses on the use cases for generating in-toto attestations, as other types of attestations don't need the TaskRun results.
+
+The workflow for the targeted results in Chains is as follows:
+
+```
+(TaskRun results) --> (Scanned by Chains controller, targets are collected) --> (Targets are parsed, validated and matched against the defined result and param structures) --> (Chains controller signs the signable artifacts and puts them into in-toto attestations)
+```
+
+Tekton Chains will follow [TEP 0075](https://github.com/tektoncd/community/blob/main/teps/0075-object-param-and-result-types.md) and [TEP 0076](https://github.com/tektoncd/community/blob/main/teps/0076-array-result-types.md) and support the sets of JSON schemas described in those documents. This proposal also defines the expected formats of results.
+
+As mentioned above, the signable artifacts can be split into inputs (sources) and outputs (artifacts). It is important that Chains can determine which signable artifacts are inputs and which are outputs. The result names of all signable targets are listed in one of two results, `ARTIFACT_INPUTS` or `ARTIFACT_OUTPUTS`: two arrays storing the names of the results that Tekton Chains should capture to build in-toto provenance. Results named in `ARTIFACT_INPUTS` are captured in `materials`, and results named in `ARTIFACT_OUTPUTS` are captured in `subjects`.
+
+It is also beneficial for the schemas defined in Chains to follow the field requirements of in-toto attestations. Currently, the type hinting follows formats that are more readable for users, such as the results populated in the [upload-pypi](https://github.com/tektoncd/catalog/blob/f4708d478ee8fac6b5b68347cde087cb7c1d1b1c/task/upload-pypi/0.1/upload-pypi.yaml) and [jib-maven](https://github.com/tektoncd/catalog/blob/6a6f3543fa14d7d840fd13d19ba4452e5e319830/task/jib-maven/0.4/jib-maven.yaml) Catalog Tasks, but this makes it difficult to standardize signable target schemas, as each signable target's metadata can be fairly different from the others. For example, a Git source usually needs a git commit and revision, which is very different from what an OCI image needs: the image URL and its digest. In-toto attestations provide a way to unify the identifiers for multiple types of signable artifacts, which is described in the [Use Cases](#use-cases).
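
As a rough illustration of the mapping described above, the sketch below turns a list of results into `materials` and `subjects` entries. It is not Chains code (Chains itself is written in Go): the helper names `to_digest_set` and `build_provenance_fields`, and the representation of results as plain dictionaries, are assumptions made for this example, and the input is assumed to be well formed.

```python
from typing import Any, Dict, List


def to_digest_set(digest: str) -> Dict[str, str]:
    """Turn an "algorithm:digest" string into an in-toto style DigestSet."""
    algorithm, _, value = digest.partition(":")
    return {algorithm: value}


def build_provenance_fields(results: List[Dict[str, Any]]) -> Dict[str, list]:
    """Map ARTIFACT_INPUTS to materials and ARTIFACT_OUTPUTS to subjects."""
    by_name = {r["name"]: r["value"] for r in results}
    # Inputs keep their uri under "uri"; outputs expose it as the subject "name".
    materials = [
        {"uri": by_name[n]["uri"], "digest": to_digest_set(by_name[n]["digest"])}
        for n in by_name.get("ARTIFACT_INPUTS", [])
    ]
    subjects = [
        {"name": by_name[n]["uri"], "digest": to_digest_set(by_name[n]["digest"])}
        for n in by_name.get("ARTIFACT_OUTPUTS", [])
    ]
    return {"materials": materials, "subjects": subjects}


# Hypothetical results, shaped like the TaskRun example in the Use Cases.
example_results = [
    {"name": "ARTIFACT_INPUTS", "value": ["git-vcs"]},
    {"name": "ARTIFACT_OUTPUTS", "value": ["oci-image"]},
    {"name": "git-vcs",
     "value": {"uri": "git+https://github.com/foo/bar.git", "digest": "sha1:abc"}},
    {"name": "oci-image",
     "value": {"uri": "gcr.io/somerepo/someimage", "digest": "sha256:abc"}},
]
fields = build_provenance_fields(example_results)
```

Because both kinds of artifact share the same `uri` / `digest` shape, the only difference between the two branches is the key under which the URI is emitted.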
+
+Objects that satisfy these schemas can be generated from human-readable provenance metadata by having Steps in Tasks write the provenance data into these schema structures in TaskRuns and populate them into PipelineRuns' Results. There is also a discussion around adding a new `provenance` field to PipelineRun and TaskRun, in which only [Trusted Resources](https://github.com/tektoncd/community/blob/2413ad70a742a6e9103c531e1c5788b2b392a7eb/teps/0091-verified-remote-resources.md#requirements) would be allowed to generate provenance info. Chains should be able to accommodate what TaskRun and PipelineRun results are currently capable of, as well as potential new fields dedicated to provenance.
+
+### Notes and Caveats
+
+As these well-defined structures grow in pursuit of richer provenance, their size can eventually exceed the limit of the container termination message. [TEP 0086](https://github.com/tektoncd/community/pull/521) is addressing this issue.
+
+## Design Details
+
+### Signable Artifacts
+
+All signable artifacts should be provided in the structure defined in this proposal. If any field is missing, the signable artifact will be skipped and an error will be raised. Users need to follow the naming pattern so that the result objects can be captured.
+
+As described above, users of Tekton Chains need to provide two results, `ARTIFACT_INPUTS` and `ARTIFACT_OUTPUTS`, which are arrays of strings storing the signable artifacts' result names:
+
+```yaml
+results:
+  - name: ARTIFACT_INPUTS
+    value: ["ARTIFACT-INPUT-NAME-1", "ARTIFACT-INPUT-NAME-2"]
+  - name: ARTIFACT_OUTPUTS
+    value: ["ARTIFACT-OUTPUT-NAME-1"]
+```
+
+The signable target results need to follow the schema below:
+
+```yaml
+results:
+  - name: ARTIFACT-NAME
+    type: object
+    description: |
+      * uri: resource uri of the artifact. It can uniquely identify the artifact.
+      * digest: revision digest in form algorithm:digest.
+    properties:
+      uri:
+        type: string
+      digest:
+        type: string
+```
+
+This way, whenever a new type of artifact needs to be supported, users can form in-toto provenance from this schema format without requiring significant changes to Tekton Chains. In general, Tekton Chains itself should not need to distinguish between artifact types when generating attestations. The naming of these artifact results also becomes much more flexible.
+
+#### Optional Alternatives for Schemas
+
+It would be more flexible if users could define the provenance following the defined schemas without necessarily following the naming patterns. Users could provide an array result whose name matches the kind of signable target and whose value lists the result names that fall under that category. For example:
+
+```yaml
+results:
+  - name: SIGNABLE-VCS
+    value: ["FOO", "BAR"]
+  - name: FOO
+    value:
+      src_type: git
+      uri: git+https://github.com/foo/bar.git
+      digest_algo: sha256
+      digest: abc
+  - name: SIGNABLE-OCI_IMAGE
+    value: ["IMAGE1"]
+  - name: IMAGE1
+    value:
+      name: gcr.io/somerepo/someimage
+      digest: abc
+      alg: sha256
+```
+
+The results that contain the provenance would still need to align with the schemas defined in Chains. This way, users would not need to follow a naming pattern for the results.
+This alternative was not chosen because Chains would need to be aware of the differences between artifact types.
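
Returning to the schema chosen under Signable Artifacts, the rule that a result with a missing field is skipped and reported could look roughly like the sketch below. `validate_artifact` and `REQUIRED_FIELDS` are hypothetical names introduced for this example, and the exact error wording is an assumption rather than anything Chains defines.

```python
from typing import Any, List

# The two fields every signable artifact result must carry, per the schema above.
REQUIRED_FIELDS = ("uri", "digest")


def validate_artifact(name: str, value: Any) -> List[str]:
    """Return a list of problems; an empty list means the result is usable."""
    if not isinstance(value, dict):
        return [f"{name}: expected an object result"]
    problems = []
    for field in REQUIRED_FIELDS:
        # A field counts as missing if it is absent, not a string, or empty.
        if not isinstance(value.get(field), str) or not value[field]:
            problems.append(f"{name}: missing or empty field '{field}'")
    digest = value.get("digest")
    if isinstance(digest, str) and digest:
        algorithm, sep, hex_value = digest.partition(":")
        if not (sep and algorithm and hex_value):
            problems.append(f"{name}: digest is not in the form algorithm:digest")
    return problems
```

A caller would skip any artifact for which the returned list is non-empty and surface the messages as the error. Note that the check is the same for every artifact type, which is the property the chosen schema is meant to preserve.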
+
+## Design Evaluation
+
+### Reusability
+
+- Pro: the defined schema can be used for all targets that can be either subjects or materials in in-toto provenance
+- Con: users need to follow the exact schema for Chains to capture the signable artifacts
+
+### Simplicity
+
+- Pro: the design supports the same sets of JSON schemas described in TEP 0075 and TEP 0076
+- Pro: the provenance can be retrieved much more easily than with the current type hinting method
+- Con: the schema can potentially grow much richer, and the size limit can become a problem
+
+### Flexibility
+
+- Pro: when supporting new types of signable artifacts, the schema can be easily added
+- Con: for each new type of signable artifact, the schema needs to be created individually
+
+### User Experience
+
+Providing the above results could be challenging for Pipeline and Task authors. An [`oci-image-artifact-registry`](https://github.com/ywluogg/oci-image-structured-results) Task and TaskRun are provided, with steps that build and push an OCI image and then produce the above provenance info.
+
+### Performance
+
+### Risks and Mitigations
+
+### Drawbacks
+
+## Alternatives
+
+### Inputs / Outputs Distinction
+
+We could also collect input provenance from params and output provenance from TaskRun results separately. Signable artifacts found in params would be the inputs, and those found in TaskRun results would be the outputs. However, params are not reliable, as users rarely specify the less human-readable provenance metadata formats required by in-toto attestations in either PipelineRun or TaskRun params.
+
+### Using Run Status to generate Provenance Metadata
+
+Results are not an ideal place to populate provenance metadata for the artifacts, since Results are provided by Pipeline authors. When people look for provenance, the metadata should ideally be trustworthy and immutable after generation. How can we guarantee that the provenance metadata can be trusted? Assuming that the installed Tekton Pipeline and the environment where the Runs operate are trustworthy, one missing piece for complying with SLSA L3 is that the submitted Run YAML must be a trusted config, which can be fulfilled by [TEP 0091: Trusted Resources](https://github.com/tektoncd/community/pull/739). The other missing piece is making sure the generated provenance data cannot be changed by any untrusted party. The `status` field of TaskRun and PipelineRun satisfies these needs, as only the Pipeline controller is able to write to it. To extend this trusted setup, we can allow Trusted Tasks to modify the field as well: when Trusted Tasks are used, the Pipeline controller can verify them in a Run and let them write provenance metadata into a new field under `status`.
+
+This approach would require changes in Pipeline and also the completion of [Trusted Resources](https://github.com/tektoncd/community/pull/739), so the detailed design can be scoped out in a future TEP. However, the schemas for the provenance metadata should follow those defined in this TEP.
+
+## Implementation Plan
+
+The implementation plan has two main components:
+
+1. Upgrade the Tekton Pipeline version used by Tekton Chains to v0.38, in which TEP 0075 and TEP 0076 will be supported behind the alpha feature flag.
+
+2. Iterate over TaskRun / PipelineRun results to scan for the targeted results.
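
The scanning step in item 2 could be sketched as follows. This is an illustration only: it assumes a TaskRun is available as a plain dictionary and that its results live under `status.taskResults` (the v1beta1 field name is taken here as an assumption), and `collect_targets` is a hypothetical helper, not an existing Chains function.

```python
from typing import Any, Dict, List, Tuple

# The two index results that list the names of signable artifact results.
INDEX_RESULTS = ("ARTIFACT_INPUTS", "ARTIFACT_OUTPUTS")


def collect_targets(taskrun: Dict[str, Any]) -> Tuple[Dict[str, List[Any]], List[str]]:
    """Resolve the names listed in the index results to their result values.

    Returns (targets, missing): targets maps each index result to the values
    it references; missing lists referenced names with no matching result.
    """
    # Assumption: results are stored under status.taskResults.
    results = taskrun.get("status", {}).get("taskResults", [])
    by_name = {r["name"]: r["value"] for r in results}
    targets: Dict[str, List[Any]] = {index: [] for index in INDEX_RESULTS}
    missing: List[str] = []
    for index in INDEX_RESULTS:
        seen = set()
        for name in by_name.get(index, []):
            if name in seen:
                continue  # ignore a name that is listed twice
            seen.add(name)
            if name in by_name:
                targets[index].append(by_name[name])
            else:
                missing.append(name)
    return targets, missing
```

Reporting dangling names separately makes simple authoring mistakes visible, for example an index entry `oci-image` when the result itself was emitted as `oci_image`.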
+
+### Test Plan
+
+### Infrastructure Needed
+
+### Upgrade and Migration Strategy
+
+### Implementation Pull Requests
+
+## References
+
diff --git a/teps/README.md b/teps/README.md
index 6f4e4def6..23cd9472f 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -257,6 +257,7 @@ This is the complete list of Tekton teps:
 |[TEP-0106](0106-support-specifying-metadata-per-task-in-runtime.md) | Support Specifying Metadata per Task in Runtime | implemented | 2022-05-27 |
 |[TEP-0107](0107-propagating-parameters.md) | Propagating Parameters | implemented | 2022-05-26 |
 |[TEP-0108](0108-mapping-workspaces.md) | Mapping Workspaces | implemented | 2022-05-26 |
+|[TEP-0109](0109-better-structured-provenance-retrieval-in-tekton-chains.md) | Better structured provenance retrieval in Tekton Chains | implementable | 2022-05-02 |
 |[TEP-0110](0110-decouple-catalog-organization-and-reference.md) | Decouple Catalog Organization and Resource Reference | implemented | 2022-06-29 |
 |[TEP-0111](0111-propagating-workspaces.md) | Propagating Workspaces | implementable | 2022-06-03 |
 |[TEP-0112](0112-replace-volumes-with-workspaces.md) | Replace Volumes with Workspaces | proposed | 2022-06-02 |