RFC: Design improved relationships for Package, Dependencies, Embeds and Requirements #1066

Closed
15 of 16 tasks
pombredanne opened this issue Jan 31, 2024 · 12 comments

@pombredanne
Contributor

pombredanne commented Jan 31, 2024

We need to design improved models for Package, Dependencies, Embeds and Requirements.

Some background:

  • What we routinely call "dependencies" are not really dependencies: more often these are requirements that are "resolved" to actual concrete dependencies. To really "depend" on a software package, you need to know the exact resolved version. On the other hand, a requirement expresses the need for some range of versions (even if it's a range of a single version).

  • We should distinguish a "require" relationship between packages vs. an actual "resolved" dependency to a concrete version, and track if the resolved dependency is present in the codebase or not.

  • Dependencies (or requirements) may be only documented in a package manifest (like pom.xml) or a lockfile (like package-lock.json), or also be present in the codebase.

  • We need to track also when a package contains another package (or for that matter when an app contains or embeds packages).

Therefore we may have three levels of tree-like data we need to track for packages:

  • "potential" requirements (we call these dependencies today). This is logically a flat list of PURLs with VERS attached to a package instance.
  • "resolved" dependencies (we call these dependencies today too, but sometimes also packages). This is logically a tree (or graph) of PURLs with a resolved version. Each such dependency is derived from and satisfies one or more "requirements".
  • "embeds" are packages embedded in other packages, e.g., they are physically copied, vendored and/or built into a product codebase. These could be the result of provisioning/fetching resolved dependencies, or could be a straight copy. These are not really tracked as such today.
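
The three levels above can be pictured with a few plain data classes. This is only a minimal sketch to make the vocabulary concrete: the class names, the `constraint` field, the `found_in_codebase` flag and the `pkg:generic/` PURL type are all hypothetical and are not the ScanCode.io models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Requirement:
    """A "potential" dependency: a name plus a version constraint (hypothetical)."""
    name: str
    constraint: str  # e.g. "> 2.0"; stands in for a VERS range


@dataclass
class Package:
    """A concrete package instance with an exact version (hypothetical)."""
    name: str
    version: str
    requirements: List[Requirement] = field(default_factory=list)
    embeds: List["Package"] = field(default_factory=list)
    found_in_codebase: bool = False

    @property
    def purl(self) -> str:
        # Illustrative only: a real PURL uses the ecosystem type, not "generic".
        return f"pkg:generic/{self.name}@{self.version}"


@dataclass
class ResolvedDependency:
    """Links one requirement to the concrete package that satisfies it."""
    requirement: Requirement
    resolved_to: Package


app = Package("app", "1.0", requirements=[Requirement("bar", "> 2.0")])
bar = Package("bar", "3.0")
dep = ResolvedDependency(app.requirements[0], bar)
# An embedded copy is a package physically present inside another package.
app.embeds.append(Package("vendored-lib", "0.9", found_in_codebase=True))
print(dep.resolved_to.purl)
```

A requirement never carries an exact version here, while a resolved dependency always points at a concrete package, which is the distinction the background notes draw.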

We have agreed on some key model updates, based on discussions here and in our weekly calls:

The https://github.com/nexB/dependency-inspector tickets to create lockfiles are:

Then for the effective resolution of dependencies as a tree from manifests and lockfiles:

And beyond:

Here are also some related issues that are not directly needed to complete this feature:

@pombredanne
Contributor Author

pombredanne commented Jan 31, 2024

Note that this applies as much to ScanCode.io as to ScanCode Toolkit, DejaCode and PurlDB.

@pombredanne pombredanne changed the title RFC: Design improved models for Package, Dependencies, Embeds and Requirements RFC: Design improved relationships for Package, Dependencies, Embeds and Requirements Jan 31, 2024
@DennisClark DennisClark self-assigned this Jan 31, 2024
@AyanSinhaMahapatra
Contributor

AyanSinhaMahapatra commented Feb 15, 2024

Summarizing the discussion we had on the community call on this issue:

We need to represent dependencies as packages too, and not as the
separate model we have today (skinny purl-only package data with optionally
resolved packages nested inside). Dependencies would then be preserved as
relationships between these packages, and the link from a primary package
to its secondary packages (either dependencies or embedded packages)
would also be preserved by creating relationships between them.

Here we have some types of packages:

  1. Primary package instances
  2. Resolved package dependency with package data/files
  3. Resolved package dependency without package data/files (just PURL)
  4. Secondary/Embedded package (files embedded in primary package files)
  5. Package requirements (package + version range)

This approach has some benefits:

  • We don't have to massively change our package models, only add or remove some attributes,
    as we are changing how we report the data within those models, and the output format
    remains essentially the same.
  • Resolved packages won't be nested, everything will be there as package instances.
  • This is straightforward, as dependency is really a relationship and not a separate model.
  • This is similar to what users expect and what other tools provide. It is also essentially
    what we get when we scan a package whose pre-resolved dependencies are present in the code tree:
    these are detected as package instances, so we already have them as packages.

For example, in SCIO we could have filters for primary packages, secondary/embedded packages,
so that these are differentiated and looked at separately as required, and the dependency tab
could have the relations between them.

Two possible implementations of this:

a) We have everything (packages, resolved/embedded dependencies, dependency purls and even requirements)
returned as Packages.

  • This would be less confusing and more straightforward, with everything stored similarly.
  • We can resolve requirements to the oldest version available within
    the required versions, or something similar, to have a proper package instance.

b) We have everything else but requirements (packages, resolved/embedded dependencies, dependency purls)
as packages. Requirements do not have an exact version and therefore do not point to a package, so they stay separate.

  • Requirements do not have a version, so these aren't really package instances as
    they are not concrete packages which can be identified in a unique way. So these
    should not be confused with packages.
  • Even for dependency purls we can get package metadata/download the packages and scan them, but we
    cannot do the same for dependency requirements, as they are not package instances.
  • If we try to resolve these dependency requirements naively (to the oldest version available within
    the required versions), the result would be incorrect and misleading for any compliance work: package licenses
    can change between versions, and without the correct versions we cannot check vulnerabilities. So these
    are essentially false positives.
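
To illustrate why a requirement is not a package instance, here is a minimal sketch with a toy version comparison. It assumes plain dotted integer versions and a hypothetical list of available versions. It is not how univers or any package manager compares versions, since each ecosystem has its own semantics.

```python
import operator

# Two-character operators are listed first so ">=" is not mistaken for ">".
OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_version(text):
    """Turn "2.1" into (2, 1); only handles dotted integers."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, constraint):
    """Return True if the version satisfies a single constraint like "> 2.0"."""
    for symbol, compare in OPERATORS:
        if constraint.startswith(symbol):
            wanted = constraint[len(symbol):]
            return compare(parse_version(version), parse_version(wanted))
    raise ValueError(f"Unsupported constraint: {constraint!r}")


# Hypothetical versions published for some package "bar".
available = ["1.9", "2.0", "2.1", "3.0"]
candidates = [v for v in available if satisfies(v, "> 2.0")]
print(candidates)
```

More than one version qualifies, so picking the oldest (or the latest) is a guess about what an actual install would select, and license or vulnerability data attached to that guess may not apply.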

cc @pombredanne @tdruez

@pombredanne
Contributor Author

pombredanne commented Feb 26, 2024

The latest design is to go in this direction:

  • We are transforming the DiscoveredDependency model to add a foreign key to the corresponding resolved DiscoveredPackage, something like a "resolved_to" relationship.
  • We will have a dependency resolution process in a pipeline where we look at the dependencies and resolve them to actual found packages. This process can also use a resolver, even a very crude one.
  • We may also track if a package is physically found on disk in a codebase
  • This can stay optional
  • We will need to ensure that we are doing the right thing too on the scancode toolkit side
  • The same solution will be then propagated to PurlDB and possibly to DejaCode

See also attached:

[Image: deps-resolution-in-scancode-1]

To clarify, we would go with option 1:

  1. We do not create a "resolved to" until we "run" a resolution

  2. We resolve as below possibly in a dumb way. We are aware that looking for the latest version of a package is a somewhat dumb "mock resolution" and may not have a lot of value

  3. When we have a lockfile (that contains pinned dependencies) we could create the "resolved to" Package at the same time

  4. When we already have existing Packages and Dependencies, we should be able to relate the existing Packages to the Dependencies as a resolution (a sort of mock resolution).
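
Points 1, 3 and 4 above could be sketched as follows. This is a simplified illustration, not the ScanCode.io pipeline code: the `Dep` class, its `pinned_version` field and the `resolve_pinned` helper are hypothetical names, and packages are represented by bare PURL strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Dep:
    """Hypothetical stand-in for a dependency record."""
    name: str
    pinned_version: Optional[str] = None  # set when the dependency comes from a lockfile
    resolved_to: Optional[str] = None     # stays None until a resolution is run


def resolve_pinned(deps: List[Dep], known: Dict[Tuple[str, str], str]) -> List[Dep]:
    """Relate pinned dependencies to already known packages.

    Returns the dependencies that could not be related to any package.
    """
    unresolved = []
    for dep in deps:
        key = (dep.name, dep.pinned_version)
        if dep.pinned_version and key in known:
            dep.resolved_to = known[key]
        else:
            unresolved.append(dep)
    return unresolved


# Packages already present, keyed by (name, version).
known_packages = {("bar", "3.0"): "pkg:npm/bar@3.0"}
deps = [Dep("bar", pinned_version="3.0"), Dep("baz")]
left_over = resolve_pinned(deps, known_packages)
print(deps[0].resolved_to, [d.name for d in left_over])
```

The unpinned `baz` entry is left untouched, which matches the idea that no "resolved to" link exists until some resolution step, however crude, has actually run.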

[Image: deps-resolution-in-scancode-2]

@pombredanne
Contributor Author

@AyanSinhaMahapatra I had some other notes too... did you capture some?

@pombredanne
Contributor Author

We have a mostly working PR with #896 by @Hritik14

tdruez added a commit that referenced this issue May 17, 2024
And refine existing fields of the model

Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue May 17, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue May 17, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
@tdruez
Contributor

tdruez commented May 17, 2024

I would recommend an implementation in multiple stages:

  1. Add new resolved_to field on DiscoveredDependency #1066 #1240
  2. Add the resolution processing to create packages and feed the new resolved_to field. That is also when we can add the new DiscoveredPackage fields that will help to qualify the package "level" in the dependency tree.
  3. Add the methods to collect and use the dependency tree within a project and include the tree in the outputs such as CycloneDX.

tdruez added a commit that referenced this issue May 17, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue May 17, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
@tdruez
Contributor

tdruez commented May 17, 2024

@AyanSinhaMahapatra The new resolved_to field is now merged.

Could you start a PR on the resolution part? #1066 (comment)
As a side note we want to keep the resolution tree, meaning a package defined as resolved_to can also be defined as a for_package. The model should support this approach fine.

@pombredanne
Contributor Author

Here are some notes on how I understand this model will work:

We are updating these existing relationships (and their reverses):

  • DiscoveredDependency > for_package > DiscoveredPackage
  • DiscoveredPackage > declared_dependencies > DiscoveredDependency

We are creating these new relationships (and their reverses):

  • DiscoveredDependency > resolved_to > DiscoveredPackage
  • DiscoveredPackage > resolved_dependencies > DiscoveredDependency

A resolution would start with this:

  1. Starting "universe":
foo@1.2
 - direct deps: bar > 2.0
 
bar@3.0
 - direct deps: baz >= 1.0

baz@1.0
 - direct deps: None
  2. If we resolve for foo@1.2 then the tree is:
foo@1.2
 - direct deps: bar > 2.0
   - resolved_to: bar@3.0
     - direct deps: baz >= 1.0
       - resolved_to: baz@1.0
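
The walk from the starting "universe" to the tree can be sketched in a few lines. This is a toy resolver under stated assumptions: versions are dotted integers, each requirement is a single constraint, the first matching package wins, and there is no cycle detection. The function names are hypothetical.

```python
import operator

# ">=" is checked before ">" so the longer operator wins.
OPERATORS = [(">=", operator.ge), (">", operator.gt), ("==", operator.eq)]


def as_tuple(version):
    return tuple(int(part) for part in version.strip().split("."))


def matches(version, constraint):
    for symbol, compare in OPERATORS:
        if constraint.startswith(symbol):
            return compare(as_tuple(version), as_tuple(constraint[len(symbol):]))
    raise ValueError(f"Unsupported constraint: {constraint!r}")


def resolve_tree(package, universe):
    """Return a nested dict of requirements and the packages they resolve to."""
    name, version = package
    tree = {"package": f"{name}@{version}", "deps": []}
    for dep_name, constraint in universe[package]:
        resolved = next(
            (c for c in universe if c[0] == dep_name and matches(c[1], constraint)),
            None,
        )
        tree["deps"].append({
            "requirement": f"{dep_name} {constraint}",
            "resolved_to": resolve_tree(resolved, universe) if resolved else None,
        })
    return tree


# The starting "universe" from the notes: package -> direct requirements.
universe = {
    ("foo", "1.2"): [("bar", "> 2.0")],
    ("bar", "3.0"): [("baz", ">= 1.0")],
    ("baz", "1.0"): [],
}
tree = resolve_tree(("foo", "1.2"), universe)
print(tree)
```

A requirement with no matching package keeps `resolved_to` as `None`, so the unresolved requirement stays visible in the tree instead of being dropped.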

@Hritik14
Contributor

As per above and #896, ResolvedDependency is no longer a concrete model of its own: it now acts as an associative table that connects any two DiscoveredPackage in a many-to-many fashion.
This looks like the following:

[Image: scio-transitive-1 drawio]

Now, adding the above suggestion of direct deps or requirement might look something like:

[Image: scio-transitive-2 drawio]

A package has requirements, a requirement resolves to a dependency, and a dependency maps to a package. Packages retain their many-to-many relationship via the dependency associative table, but every dependency also has a one-to-one relationship with a requirement.

@tdruez
#1240 looks good to me, although should we rename resolved_dependencies to resolved_parents or just parents? Having both declared_dependencies and resolved_dependencies on DiscoveredPackage does not make sense.

@pombredanne
Contributor Author

pombredanne commented May 21, 2024

@Hritik14 If I expand on the topic of #1066 (comment) then we have these records and relationships (I have added a few extra packages/deps)

Before resolution:

  • DiscoveredPackage > declared_dependencies > DiscoveredDependency
  foo@1.2                 >             bar > 2.0
  foo@1.2                 >             biz > 3.0
  bar@3.0                 >             baz >= 1.0
  biz@4.0                 >             baz >= 0.1
  baz@1.0
  • DiscoveredDependency > for_package > DiscoveredPackage
  bar > 2.0               >             foo@1.2
  biz > 3.0               >             foo@1.2
  baz >= 1.0              >             bar@3.0
  baz >= 0.1              >             biz@4.0

After resolution:

  • DiscoveredPackage > resolved_dependencies > DiscoveredDependency
  bar@3.0                 >             bar > 2.0
  baz@1.0                 >             baz >= 1.0
  baz@1.0                 >             baz >= 0.1
  biz@4.0                 >             biz > 3.0
  • DiscoveredDependency > resolved_to > DiscoveredPackage
  bar > 2.0               >             bar@3.0
  baz >= 1.0              >             baz@1.0
  baz >= 0.1              >             baz@1.0
  biz > 3.0               >             biz@4.0

What does having an intermediate DiscoveredRequirement bring that we cannot do with the newly updated model? (The idea here is to avoid fundamentally changing the model structure.)

With that said, resolved_dependencies could be renamed to resolved_from_dependencies, as multiple dependency "requirements" can be resolved to a single version, as seen here.
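
The reverse relationships in the listing above follow mechanically from the forward resolved_to links. A minimal sketch, using plain strings in place of model instances (the dictionary layout and variable names are assumptions for this example):

```python
from collections import defaultdict

# Forward links after resolution: (for_package, requirement) -> resolved package.
resolved_to = {
    ("foo@1.2", "bar > 2.0"): "bar@3.0",
    ("foo@1.2", "biz > 3.0"): "biz@4.0",
    ("bar@3.0", "baz >= 1.0"): "baz@1.0",
    ("biz@4.0", "baz >= 0.1"): "baz@1.0",
}

# Reverse view: which requirements were resolved to each package,
# and which packages declared those requirements.
resolved_from = defaultdict(list)
parents = defaultdict(list)
for (for_package, requirement), package in resolved_to.items():
    resolved_from[package].append(requirement)
    parents[package].append(for_package)

print(resolved_from["baz@1.0"])
print(parents["baz@1.0"])
```

Here baz@1.0 is reached from two different requirements declared by two different packages, which is why the reverse side has to be a collection and why a name like resolved_from_dependencies describes it well.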

AyanSinhaMahapatra added a commit that referenced this issue May 22, 2024
Reference: #1237
Reference: #1066
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Contributor

AyanSinhaMahapatra commented May 22, 2024

Could you start a PR on the resolution part? #1066 (comment)
As a side note we want to keep the resolution tree, meaning a package defined as resolved_to can also be defined as a for_package. The model should support this approach fine.

@tdruez @pombredanne I've added an initial implementation for the resolution process. It is now:

  • populating the newly added resolved_to attribute in the dependency relationships
  • creating new resolved packages from the resolved_package attribute we already have in the SCTK output. This is stored in dependencies for PackageData objects, for the corresponding lockfile resource.
  • available as an optional group in both the resolve_dependencies and inspect_packages pipelines, which use the same function for this. (I need feedback on the UI/pipelines flow and the group name: is this okay, or should we do an addon pipeline instead?)

Things remaining [WIP]:

Questions:
I'm not sure on the attribute names for the following:

add the new DiscoveredPackage fields that will help to qualify the package "level" regarding the dependency tree.

Can we implement the above with #880,
where we can have some indication that a Package was created by a static resolver?

Also, should we use https://github.com/nexB/univers more in SCTK and make sure is_resolved is correctly populated in all cases?

tdruez added a commit that referenced this issue Jun 3, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue Jun 3, 2024
tdruez added a commit that referenced this issue Jun 3, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue Jun 3, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue Jun 3, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue Jun 4, 2024
Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this issue Jun 4, 2024
AyanSinhaMahapatra added a commit that referenced this issue Jun 13, 2024
Reference: #1237
Reference: #1066
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
tdruez pushed a commit that referenced this issue Jul 1, 2024
* Resolve dependencies from lockfiles #1237

Reference: #1237
Reference: #1066
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Address feedback and add improvements

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Improve dependency resolving from lockfiles #1237

Resolves dependency for cases where multiple requirements
are resolved by one package and all the version requirements
are joined for that package.

Reference: #1237
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Update scancode-toolkit and fix tests

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Bump scancode-toolkit to v32.2.0

Reference: https://github.com/nexB/scancode-toolkit/releases/tag/v32.2.0
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Regenerate test fixtures and expectations

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Improve dependency resolver for lockfiles

Handle various lockfile cases where:
* Same package/dependencies are present in different lockfiles
* Independent lockfiles without a manifest and root package
* Ecosystems which have only a single version of package in
  their environment
* Dependency graphs where a resolved package can have many
  parent packages.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Address feedback and refactor code

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Fix bugs for resolving python packages

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Add unit tests and refactor code

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Address comments and add CHANGELOG entries

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

---------

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Contributor

This is now in place with 08c54b1. Support has been added for pypi, poetry, swift, nuget, cocoapods, npm and yarn, with additional help from deplock at https://github.com/nexB/dependency-inspector to generate lockfiles when they are not present, and then parse, store and resolve dependencies to form a graph.
Follow-up issues exist for scancode-toolkit, dejacode and purldb.
Closing!

AyanSinhaMahapatra added a commit to aboutcode-org/scancode-toolkit that referenced this issue Aug 7, 2024
Renaming the dependency attribute is_resolved to is_pinned to
represent this attribute more accurately.
This is more relevant after the changes in aboutcode-org/scancode.io#1066

Reference: aboutcode-org/scancode.io#1066
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

5 participants