Add more rust data #2948

Open
wants to merge 66 commits into base: main

Conversation

C0D3-M4513R

@C0D3-M4513R commented Jun 11, 2024

This pull request attempts to add more Rust information.
Currently this adds:

Things to Note:

  • Support for the "local-registry" source kind is untested
  • Missing support for the "path" source kind
  • Missing support for the "git" source kind
  • Missing support for the "directory" source kind
  • Is support for those source kinds really necessary? Cargo internally treats a registry as a specialization of a source: "registry", "local-registry" and "sparse" are all registries, while "path", "git" and "directory" are only sources.
    • "git": occurs when you override a dependency (or patch it?)
    • "directory" and "path": I've not seen these and was unable to produce them; I just ended up with a Cargo.lock where those dependencies had no source attribute.
  • Does not currently look at license-file information (no SPDX licenses), only license (which should conform to SPDX) (related to Classify licenses based on file contents #656)

Follow-Up Ideas:

  • What about PURLs in conjunction with alternate Rust registries? (There might be collisions between PURLs with the current implementation.)

Also, if more testing is required, I would appreciate help with what should be tested and what the priorities are.

Please also note that this is my very first time writing Go.
My code might make suboptimal use of Go's language constructs.

@spiffcs
Contributor

spiffcs commented Jun 11, 2024

Thanks so much @C0D3-M4513R for the PR!

I see you still have it in draft. Is there anything else you're looking to add before we give it a look and add comments?
Do you want help adding testing or features for any of the additional notes you listed?

I did have one question while we wait for the validations to run.

Here is the documentation for the "sources" syft supports:
https://github.com/anchore/syft/wiki/supported-sources

Is this what you're referring to when you mention that support for a source is missing?

We also have a community meeting that's every other week. I've linked the calendar invite below if you'd like to come and talk about this change or have any suggestions around how we can make the tool better for rust after this PR lands:
calendar

@C0D3-M4513R
Author

Here's a list of things I'd also like to get in:

  • a proper Relationship graph, with the goal of getting filesAnalyzed to be true
  • Licenses

Also, thanks for pointing me to the "sources" syft supports natively, but Rust's registries don't seem to be covered there.

Commit note: with go-toml v1, a file containing a line like the following would fail to parse:
`[target."cfg(any(target_os = \"linux\", target_os = \"dragonfly\", target_os = \"freebsd\", target_os = \"openbsd\", target_os = \"netbsd\"))".dependencies.accesskit_unix]`
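For illustration, a minimal Go sketch (not the PR's code) showing that a quoted Cargo.toml table key of this shape parses with pelletier/go-toml v2; treating v2 as the replacement parser is an assumption here, since the commit only states that v1 failed:

package main

import (
	"fmt"

	toml "github.com/pelletier/go-toml/v2"
)

func main() {
	// A table header with escaped quotes inside a quoted key, similar to
	// the line above that go-toml v1 failed to parse.
	data := []byte(`
[target."cfg(any(target_os = \"linux\", target_os = \"freebsd\"))".dependencies.accesskit_unix]
version = "0.1"
`)

	var doc map[string]interface{}
	if err := toml.Unmarshal(data, &doc); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", doc["target"])
}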

@C0D3-M4513R marked this pull request as ready for review June 11, 2024 19:08
@C0D3-M4513R
Author

C0D3-M4513R commented Jun 11, 2024

I am done with what I wanted to add.

Some of it definitely got a bit ugly (looking at GetGeneratedInformation for CargoLockEntry).

Also, I have no idea whether different instances of the PackageId struct result in different cache keys. I just persisted one copy in CargoLockEntry to be safe for now.

I only need to fix tests now (since I added information, some cargo tests fail now).

Edit: it also seems like make fails on NixOS due to disallowed licenses on a lot of Go standard library packages.
It also seems like running all the tests takes a very long time.

@C0D3-M4513R
Author

C0D3-M4513R commented Jun 20, 2024

I have now addressed the config, caching and relationship ordering.
There is still some git repository caching in here that might need addressing, but I'm not sure how nicely the caching implementation plays with that.

(Outdated info) Regardless, I've hit an unexpected roadblock with the tests. I'm suddenly getting the following on a couple of cataloger tests. I do not know where it is coming from, so help regarding this would be appreciated.
panic: recursive set of Transformers detected:
	Transformer(cmpopts.SortSlices, cmpopts.sliceSorter.sort): interface {} => interface {}
consider using cmpopts.AcyclicTransformer [recovered]
	panic: recursive set of Transformers detected:
	Transformer(cmpopts.SortSlices, cmpopts.sliceSorter.sort): interface {} => interface {}
consider using cmpopts.AcyclicTransformer

goroutine 5 [running]:
testing.tRunner.func1.2({0x17f6e40, 0xc0e3d04570})
	/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/src/testing/testing.go:1631 +0x24a
testing.tRunner.func1()
	/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/src/testing/testing.go:1634 +0x377
panic({0x17f6e40?, 0xc0e3d04570?})
	/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/src/runtime/panic.go:770 +0x132
github.com/google/go-cmp/cmp.(*recChecker).Check(0x1c62110?, {0xc1445c2000, 0x10000, 0xfffe?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:632 +0x459
github.com/google/go-cmp/cmp.(*state).compareAny(0xc0c202b040, {0x1c4bf80, 0xc0c3f0fcc0})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:246 +0x22f
github.com/google/go-cmp/cmp.(*transformer).apply(0xc0bd22bbc0, 0xc0c202b040, {0x17db340?, 0xc0fdd5ded8?, 0xc1446c1fd0?}, {0x17db340?, 0xc0fd363848?, 0x1c62110?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/options.go:328 +0x218
github.com/google/go-cmp/cmp.(*state).tryOptions(0xc0c202b040, {0x1c62110, 0x17db340}, {0x17db340?, 0xc0fdd5ded8?, 0x0?}, {0x17db340?, 0xc0fd363848?, 0x0?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:306 +0x109
github.com/google/go-cmp/cmp.(*state).compareAny(0xc0c202b040, {0x1c4bf50, 0xc0d1e3e640})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:261 +0x4c7
------------- Removed a ton of repetitions here ----------
github.com/google/go-cmp/cmp.(*state).compareInterface(0xc0c202b040, {0x1c62110?, 0x1854ea0?}, {0x1854ea0?, 0xc0c298aef0?, 0x20002?}, {0x1854ea0?, 0xc0c2bb7c00?, 0xc0c2322810?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:579 +0x305
github.com/google/go-cmp/cmp.(*state).compareAny(0xc0c202b040, {0x1c4bf80, 0xc0c1e03630})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:297 +0xadd
github.com/google/go-cmp/cmp.(*state).statelessCompare(0xc0c202b040, {0x1c4bf80?, 0xc0c1e03630?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:232 +0x7a
github.com/google/go-cmp/cmp.(*state).callTRFunc(0xc0c202b040, {0x181afa0?, 0xc0c0792570?, 0x17db340?}, {0x17db340?, 0xc0c2527e60?, 0x7f82261a5818?}, {0xc0c2527ea8?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:336 +0x2af
github.com/google/go-cmp/cmp.(*transformer).apply(0xc0bd22bbc0, 0xc0c202b040, {0x17db340?, 0xc0c2527e48?, 0x7f82261a5818?}, {0x17db340?, 0xc0c2527e60?, 0x1c62110?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/options.go:326 +0x186
github.com/google/go-cmp/cmp.(*state).tryOptions(0xc0c202b040, {0x1c62110, 0x17db340}, {0x17db340?, 0xc0c2527e48?, 0xc000469e70?}, {0x17db340?, 0xc0c2527e60?, 0x469e90?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:306 +0x109
github.com/google/go-cmp/cmp.(*state).compareAny(0xc0c202b040, {0x1c4bfb0, 0xc0c2325140})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:261 +0x4c7
github.com/google/go-cmp/cmp.Diff({0x17db340, 0xc0c2527e48}, {0x17db340, 0xc0c2527e60}, {0xc0bfc5d500?, 0xa94fa5?, 0xc0164fb308?})
	/home/user/go/pkg/mod/github.com/google/go-cmp@v0.6.0/cmp/compare.go:131 +0x165
github.com/anchore/syft/syft/pkg/cataloger/internal/pkgtest.(*CatalogTester).assertPkgs(0xc00046a300, 0xc00017f860, {0xc0001b3088, 0x8, 0x8}, {0xc0bf8fd008, 0x92, 0xbf})
	/home/user/Documents/syft/syft/pkg/cataloger/internal/pkgtest/test_generic_parser.go:300 +0x927
github.com/anchore/syft/syft/pkg/cataloger/internal/pkgtest.(*CatalogTester).TestParser(0xc00046a300, 0xc00017f860, 0xc0000892d0)
	/home/user/Documents/syft/syft/pkg/cataloger/internal/pkgtest/test_generic_parser.go:230 +0x19a
github.com/anchore/syft/syft/pkg/cataloger/internal/pkgtest.TestFileParser(0xc00017f860, {0x1a81599, 0x18}, 0xc0000892d0, {0xc0002d9708, 0x8, 0x8}, {0xc000408008, 0x92, 0xff})
	/home/user/Documents/syft/syft/pkg/cataloger/internal/pkgtest/test_generic_parser.go:310 +0x254
github.com/anchore/syft/syft/pkg/cataloger/rust.TestParseCargoLock(0xc00017f860)
	/home/user/Documents/syft/syft/pkg/cataloger/rust/parse_cargo_lock_test.go:451 +0x3ea5
testing.tRunner(0xc00017f860, 0x1ade048)
	/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/src/testing/testing.go:1689 +0xfb
created by testing.(*T).Run in goroutine 1
	/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/src/testing/testing.go:1742 +0x390
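For reference, a minimal sketch (not from the PR) of the option the panic message points at: cmpopts.AcyclicTransformer wraps a transformation with a filter so go-cmp will not re-apply it to its own output, which is how this "recursive set of Transformers" condition is normally avoided. The sorter below is only an example, not the PR's comparison options.

package main

import (
	"fmt"
	"sort"

	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
)

func main() {
	a := []interface{}{"b", "a"}
	b := []interface{}{"a", "b"}

	// Sort []interface{} values before comparing; the AcyclicTransformer
	// wrapper keeps the sorter from being re-applied to its own output.
	sortAny := cmpopts.AcyclicTransformer("sortAny", func(in []interface{}) []interface{} {
		out := append([]interface{}(nil), in...)
		sort.Slice(out, func(i, j int) bool {
			return fmt.Sprint(out[i]) < fmt.Sprint(out[j])
		})
		return out
	})

	fmt.Println(cmp.Equal(a, b, sortAny)) // true
}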

@C0D3-M4513R
Author

C0D3-M4513R commented Jun 21, 2024

Both lints and tests should all pass now.
(TestNewFromFile_WithArchive/path_detected and TestNewFromFile_WithArchive/use_first_entry_for_duplicate_paths have begun failing for me, but we'll see)
I'd really appreciate a review on this, since I otherwise have no clue what else needs to be done.

Additionally, I've revisited the Package Verification Code and added the safety check for the crate checksum that we've talked about. I still believe that this conforms to the SPDX spec, because:

  1. sha2-256 is still considered cryptographically secure
  2. A crate at one version in one registry should never change; the default cargo registry does ensure that. If it does change, the checksum would be different and that would be detected.
  3. If a registry's config changes, the registry's contents should still stay the same. If not, see 1 and 2.

An external party can also easily reproduce the Package Verification Code for a publicly accessible registry (see the sketch after this list):

  1. Take a dependency from the Cargo.lock file
  2. Get the registry config (dependent on the registry source kind)
  3. Download/get the crate
  4. Make sure the sha2-256 sum of the download matches the checksum from Cargo.lock
  5. Un-gzip and un-tar the download (this step is needed to eventually also support the path, dir and git source kinds, where the crate is stored as raw source files instead)
  6. Produce the package verification code according to the SPDX spec
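
For illustration, a hypothetical Go sketch of steps 3 to 6 (not the PR's code; the exact concatenation of per-file digests follows one common reading of the SPDX 2.x pseudo-code, and error handling is minimal):

package cratehash

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"sort"
	"strings"
)

// packageVerificationCode downloads a .crate archive, checks it against the
// sha2-256 checksum recorded in Cargo.lock, and computes an SPDX 2.x package
// verification code over the extracted source files.
func packageVerificationCode(downloadURL, lockChecksum string) (string, error) {
	// Step 3: download the crate (a gzipped tarball of the source files).
	resp, err := http.Get(downloadURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}

	// Step 4: the archive must match the checksum from Cargo.lock.
	sum := sha256.Sum256(body)
	if hex.EncodeToString(sum[:]) != lockChecksum {
		return "", fmt.Errorf("crate checksum mismatch")
	}

	// Step 5: un-gzip and un-tar the archive.
	gz, err := gzip.NewReader(bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)

	// Step 6: SHA-1 every regular file, sort the hex digests, concatenate
	// them, and SHA-1 the result (per the SPDX pseudo-code).
	var fileHashes []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		h := sha1.New()
		if _, err := io.Copy(h, tr); err != nil {
			return "", err
		}
		fileHashes = append(fileHashes, hex.EncodeToString(h.Sum(nil)))
	}
	sort.Strings(fileHashes)
	code := sha1.Sum([]byte(strings.Join(fileHashes, "")))
	return hex.EncodeToString(code[:]), nil
}

The download URL in step 3 would come from the registry config obtained in step 2 (e.g. the "dl" template of a sparse registry).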

I can't find anything stated in the SPDX documentation for the Package Verification Code that would indicate this is against the SPDX spec when using publicly accessible registries.

I'm open to discussion about whether registries needing authentication, local registries and local crates should be included, since for those it is a little more ambiguous whether one can independently reproduce the package verification code.

The core questions here are:

  • Does "independently reproducible" include auth or not?
  • Does "independently reproducible" include local files?
    • remember that we could have:
      • a local-registry source kind
      • local crates in the directory we are currently scanning (those are not yet detected, but might be in the future)
        • those local crates might or might not be part of a git repo
        • that git repo might or might not be publicly accessible
        • syft doesn't know if the git repo is actually publicly accessible or not

Another thing to note is that registries could theoretically use a transparent authentication system, such as whitelisting certain IPs, and then mislabel themselves as not needing auth, which would be undetectable by syft.

@kzantow
Contributor

kzantow commented Jun 21, 2024

Hi @C0D3-M4513R -- I'd like to follow up from our discussion about Package Verification Code.

From the SPDX 2.3 docs:

This field provides an independently reproducible mechanism identifying specific contents of a package based on the actual files (except the SPDX document itself, if it is included in the package) that make up each package and that correlates to the data in this SPDX document.
...
Algorithm

verificationcode = 0
filelist = templist = ""
for all files in the package {

... this description mentions "all files in the package". I have always read this to mean each file is an SPDX File entry, but the description doesn't seem to say that explicitly, so calculating this field based on files not included in the SBOM could be acceptable. There seems to be a section in an unpublished version of the SPDX 2.3 spec with some general guidance that uses the wording: "If the files bound by the Package are described in the document...", which seems to support my original impression that the files used to calculate this should also be included.

But this is still a bit unclear, so I've posed this question for clarification on the SPDX Slack / will follow up on the mailing list to try to get a definitive answer. To me, this is the crux of the question about whether we can implement this particular aspect of the feature, since we wouldn't include the files in the file section, as they were not scanned directly on the source.

I would also like to point out that in SPDX 3, using the Package Verification Code is apparently discouraged in favor of the content identifier, which looks like it could be a much more useful git hash for the revision, or possibly some other more useful hash code.

All that said, since we're already downloading the package contents, I don't see any issue calculating this for the Syft internal data, regardless of whether we can include it in SPDX, provided that it doesn't end up being particularly expensive for no benefit.

Sorry for being a bit long-winded here, I just want to make sure we aren't shoehorning something in to our data just for SPDX that isn't really supposed to be there.

... now, on a completely different topic:

Missing support for "git" source kind

This looks like something we might be able to generalize as an internal utility, as the Go license search is able to fetch modules from git to a degree. I would also say this is definitely not required as part of this PR and could come later if we choose to implement it. From my very brief experience with Rust, I've seen git being used in a number of places and I was under the impression it was fairly common. Or is that not really true, and cargo generally fetches things from the registries?

@C0D3-M4513R
Author

C0D3-M4513R commented Jun 21, 2024

Missing support for "git" source kind

Your take on that is correct, and I agree that it could/should come later (I see that as not being that simple).
The "git" source kind is primarily used when you manually make an edit to a dependency and want to use it in your code.

This looks like something we might be able to generalize as an internal utility, as the Go license search is able to fetch modules from git to a degree

IMO it might be very hard to do that. An internal git repo cache might prove useful, but beyond that I don't see there being much common code. As a matter of fact, I already use git for the "repository" source kind.

which seems to support my original impression that the files used to calculate this should also be included.

I'm reading this as: I should include the files I use to calculate the package verification code. But I am including the files in the contains relationship? Or is there a separate thing for declaring all the files of a package?

Using a Content-Identifier instead of Package-Verification-Code

I see that as difficult, because finding a good content-identifier might be hard.
Ideally I'd want to include: crate name, crate version, registry config download specifier (or registry source, or download URL) and checksum.
Finding a good content-identifier for local non-git projects is impossible.
Finding a good content-identifier for a git project is simpler: just include the crate name, crate version and commit SHA, if the repo is not dirty (all changes committed). If the repo is dirty, treat it the same as if there were no git project.

I'd even say that the current PURL might be actively harmful, since we don't specify any repository information and just assume it's the default cargo one.
This could thus lead to ambiguity among PURLs.
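
As a hypothetical illustration of that ambiguity (not the PR's code), the purl spec defines a repository_url qualifier that an alternate registry could be carried in; the registry URL below is made up:

package main

import (
	"fmt"

	"github.com/package-url/packageurl-go"
)

func main() {
	// What effectively gets produced today: the registry is implied to be crates.io.
	ambiguous := packageurl.NewPackageURL(packageurl.TypeCargo, "", "serde", "1.0.0", nil, "")

	// One possible disambiguation: carry the registry in a qualifier.
	qualified := packageurl.NewPackageURL(packageurl.TypeCargo, "", "serde", "1.0.0",
		packageurl.QualifiersFromMap(map[string]string{
			// hypothetical alternate registry index URL
			"repository_url": "https://crates.example.internal/index/",
		}), "")

	fmt.Println(ambiguous.ToString())
	fmt.Println(qualified.ToString())
}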

@C0D3-M4513R
Author

C0D3-M4513R commented Jun 22, 2024

I'm reading this as: I should include the files I use to calculate the package verification code. But I am including the files in the contains relationship? Or is there a separate thing for declaring all the files of a package?

I just dug into this, and it seems that SPDX has a File tag that is used for this.
The File tag gets populated from the SBOM.Artifacts information, which eventually gets populated by syft/internal/task/file_tasks.go:NewFileMetadataCatalogerTask, which gets all the Contains relations and their Coordinates (those are populated for Rust now, but with an HTTP download link to a .tar.gz-type file in the FileSystemID; see cataloger/rust/parse_cargo_lock_test.go).
You then directly access the file coordinate's RealPath using the specified resolver without checking the FileSystemID (see syft/file/cataloger/filemetadata/cataloger.go:Catalog:L35). Thus I currently do not see an elegant way to add the files to the SBOM with the current architecture.

I would love to be wrong on this, so please correct me, if I am.

One idea to solve this is to also allow packages to specify additional FileMetadata, FileDigests, FileContents, FileLicenses or Executables, which are then added onto the SBOM.Artifacts. This might solve the architectural constraint?

Also if you are concerned about the correctness of the files, you should look at the current output. All hashes in the files section are currently wrong:

[
  {
    "fileName": "/.github/workflows/rust.yml",
    "SPDXID": "SPDXRef-File-.github-workflows-rust.yml-5218e54acecbaea1",
    "checksums": [
      {
        "algorithm": "SHA1",
        "checksumValue": "0000000000000000000000000000000000000000"
      }
    ],
    "licenseConcluded": "NOASSERTION",
    "licenseInfoInFiles": ["NOASSERTION"],
    "copyrightText": ""
  },
  {
    "fileName": "/Cargo.lock",
    "SPDXID": "SPDXRef-File-Cargo.lock-c6bea2c24af05bc1",
    "checksums": [
      {
        "algorithm": "SHA1",
        "checksumValue": "0000000000000000000000000000000000000000"
      }
    ],
    "licenseConcluded": "NOASSERTION",
    "licenseInfoInFiles": ["NOASSERTION"],
    "copyrightText": ""
  }
]

(from https://github.com/C0D3-M4513R/time/actions/runs/9516743702/artifacts/1602138196)

@kzantow
Contributor

kzantow commented Jun 22, 2024

@C0D3-M4513R there are some ways to add files if we determine that's the right thing to do... but the conundrum I have here is that we probably should not include the files because they were not part of the source that was requested to be scanned. So, if we don't add them, is it acceptable to use them to calculate the Package Verification Code? I don't know the right answer; I posed the question to the SPDX mailing list: https://lists.spdx.org/g/spdx/message/1869, so let's wait to see what the guidance is.

@C0D3-M4513R
Author

C0D3-M4513R commented Jun 29, 2024

I'd really like to be done with this PR sometime soon. Can we just get on with the review and ignore the Package Verification Code discussion for now?

Also, I have seen the error in the static analysis, but I just don't know how to fix it properly (all attempts have resulted in a cyclic import error).

And by "ignore", I specifically mean not adding any SHA-1 hashes of the files, so the Package Verification Code doesn't get generated. (I've not been able to interpret the responses on the mailing list.)

@wagoodman
Contributor

@C0D3-M4513R I can help with the static analysis failure, but that is pulling on another thread -- introducing the rust package within pkg is a new pattern that I don't think is needed here. I think a lot of the cyclic imports you're running into are an artifact of this. Ultimately, pkg should contain only descriptions of data in the form of structs and should minimize any behavior (methods) on those structs. This is especially true when it comes to parsing or transforming; that logic really belongs down in pkg/cataloger/rust.

This will ultimately be a larger refactor of the PR, but I'm happy to make code changes and push up on your branch.

In terms of the Package Verification Code, it's essential that we don't include values that we know are not valid relative to these SBOM specs, and we've gotten confirmation that using the Package Verification Code in this way isn't correct. I think if we remove the Package Verification Code logic from this PR, it'll be functionally valid to include.

@wagoodman
Contributor

Just a heads up -- I'll be pushing up a set of changes today to this branch that will help it along 👍

@C0D3-M4513R
Author

C0D3-M4513R commented Jul 3, 2024

Just a heads up -- I'll be pushing up a set of changes today to this branch that will help it along 👍

Go ahead. I've debated removing the Package Verification Code from this PR myself, but other things came along. I assume that's one of the things you are going to remove.

Feel free to push to my branch anytime.

@wagoodman
Contributor

wagoodman commented Jul 9, 2024

I'm pushing up my changes now, but I realize I've left things in a broken state -- I'll still be working on that today, but wanted to at least push up what I have.

Here's the list of changes I made at a high level:

  • Moved the package metadata definitions purely into syft/pkg (shared definitions) and kept parsing-level concerns within the syft/pkg/cataloger/rust/... packages (the rust domain package). By ensuring that we're always importing shared definitions from within domain packages (and not the other way around), this eliminates the risk of package import cycles.
  • Migrated much of the parsing and fetching logic / data structures into an internal package within the rust domain package so it can't be imported elsewhere (really only from within the rust domain package).
  • Removed the package-verification-specific code on the SPDX processing side.
  • Migrated package-to-package relationship creation logic to a cataloger post-processor, using the existing dependency.Specification approach we have in place elsewhere.

What's still needed:

  • Right now unit tests are reaching out to crates.io and leveraging a local cache. We should try to remove these side effects with mocking during tests.
  • Adding more unit tests.
  • Capturing relationships from hosted archives of dependencies to extracted files with checksums needs a little more thought.

Here's what I mean by that last point... this is what is being added today in your PR, relationship-wise, for these hosted archives:

// Identifiable(Package)  ----- contains(digest-1) ----> Identifiable(File.Coordinate(archive-file-1, URL))
//                         \--- contains(digest-2) ----> Identifiable(File.Coordinate(archive-file-2, URL))
//                         \--- contains(digest-3) ----> ...
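
To make that shape concrete, here is a hypothetical sketch in terms of syft's types (field names as I understand the codebase; the values, and the use of Relationship.Data for the per-file digest, are assumptions rather than the PR's final code):

package main

import (
	"fmt"

	"github.com/anchore/syft/syft/artifact"
	"github.com/anchore/syft/syft/file"
	"github.com/anchore/syft/syft/pkg"
)

func main() {
	p := pkg.Package{
		Name:    "some-crate", // hypothetical crate
		Version: "1.0.0",
		Type:    pkg.RustPkg,
	}
	p.SetID()

	// The download URL rides in the FileSystemID component of the coordinate.
	coord := file.Coordinates{
		RealPath:     "some-crate-1.0.0/src/lib.rs",
		FileSystemID: "https://static.crates.io/crates/some-crate/some-crate-1.0.0.crate",
	}

	rel := artifact.Relationship{
		From: p,
		To:   coord,
		Type: artifact.ContainsRelationship,
		// the per-file digest attached to the edge (one of the points discussed below)
		Data: file.Digest{Algorithm: "sha1", Value: "<file digest>"},
	}
	fmt.Printf("%s -[%s]-> %s\n", p.Name, rel.Type, coord.RealPath)
}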

But there are a few issues with this:

  • The file coordinate is being used with a URL reference; however, today this represents files found in the artifact. I think using the FileSystemID to carry the URL instead of a filesystem ID is a clever way to do this, but I think it's important to be able to clearly distinguish remote resources from local resources. I need to think on this some.
  • The digests are attached to the graph edges, not the nodes themselves... but the digests are clearly properties of the nodes and not the edges. The digests being placed in the edge seems to be more of a convenience than anything else.

We might be able to do something like:

// Identifiable(Package) ---- hosted-at --> Identifiable(URL)  ---- contains ----> Identifiable(ResourcePath(archive-file-1))
//                                                             \--- contains ----> Identifiable(ResourcePath(archive-file-2))
//                                                             \--- contains ----> ...

where I'm handwaving at "hosted-at" as an edge type, at where the digests would go, and at what ResourcePath even is... but the point is that on the left side of this graph you'd have things that are in the artifact, and we'd have edges to resources that are not in the artifact. The graph would be strongly separable into what exists in the artifact and what does not. It might be that we can go the first way and consumers would need to know to look for https:// in the filesystem component of the coordinate, but I'd like to explore ways to make this more obvious.

I'm curious about one thing though: what's the use case for showing source level digests (for each source file) when the package could simply store the sha256 of the source archive itself? Do you have a specific use case / need for having source level digests for all dependencies @C0D3-M4513R ?

@C0D3-M4513R
Author

C0D3-M4513R commented Jul 9, 2024

Right now unit tests are reaching out to crates.io and leveraging a local cache. We should try to remove these side effects with mocking during tests.

This itself also tests that connecting to a repository works and that dependencies can be downloaded.
If you can mock out crates.io whilst still testing the different repository kinds, go ahead.
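
For what it's worth, a hypothetical sketch of the kind of mocking meant here (not the PR's code; the handler path and fixture are made up): net/http/httptest can stand in for a registry's download endpoint so the cataloger never reaches the real crates.io.

package rust_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// newFakeRegistry serves a canned .crate tarball from a local test server so
// tests can exercise the download-and-verify path without network side effects.
func newFakeRegistry(t *testing.T, crateTarball []byte) *httptest.Server {
	t.Helper()
	mux := http.NewServeMux()
	mux.HandleFunc("/crates/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/octet-stream")
		_, _ = w.Write(crateTarball) // fixture bytes, e.g. loaded from testdata
	})
	srv := httptest.NewServer(mux)
	t.Cleanup(srv.Close)
	return srv
}

The test configuration would then need a way to point the cataloger's registry/download URL at srv.URL, which is presumably the harder part of covering the different repository kinds.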

The digests being placed in the edge seems to be more of a convenience than anything else.

It was definitely a convenience decision, made to get the package verification code to work without much additional code.

I'm curious about one thing though: what's the use case for showing source level digests (for each source file) when the package could simply store the sha256 of the source archive itself? Do you have a specific use case / need for having source level digests for all dependencies @C0D3-M4513R ?

The decision to use source-level digests came as a logical conclusion from the fact that source archives are not always available. This loops back to the missing "source kinds" that I mentioned earlier. Sometimes you actually do have local dependencies (e.g. multiple crates in a single repository linked together with a cargo workspace). In order not to break anything later, I chose to compute the digests at the source level, because that is more universal.

@wagoodman self-assigned this Jul 11, 2024
@jdubs11

jdubs11 commented Jul 15, 2024

"licenseDeclared": "NOASSERTION",
"licenseConcluded": "NOASSERTION",

Led me here. Commenting here so I can follow PR updates. I have a need for this functionality also.

Labels: none yet
Projects: Status: Stalled
5 participants