Classify licenses based on file contents #656

wagoodman · 2021-12-07T16:32:26Z

What would you like to be added:
The ability to read entire file contents (or just the top X bytes of the file) and classify the contents as a particular license (e.g. MIT, Apache 2.0, etc). This is a larger addition than #565 (which just covers the SPDX identifiers) but should be thought about together. License content discovered could be persisted optionally in the final SBOM (supported in SPDX).

Why is this needed:
Keeping a curated list of licenses for your dependencies is a common use case for SBOMs.

Additional context:
Consider using https://github.com/google/licenseclassifier for the heavy lifting.

As a start this could key off of file extensions to filter down to source files (.py, .go, .c, etc.) or off of filenames (e.g. "license", "LICENSE", "license.*", etc.) to keep the search scope reasonable.

This could be implemented as its own cataloger that is only responsible for finding licenses in files. This would make the configuration easily accessible, for example:

license:
  cataloger:
    enabled: true
    scope: "squashed"

  # keep the license content in the final SBOM
  capture-content: true

  # only search in the following files (by glob)
  # (globs are quoted because a leading * would otherwise be parsed as a YAML alias)
  globs:
    - "license*"
    - "License*"
    - "*.c"
    - "*.go"
    - "*.py"
    - "*.ts"
    - "*.tsx"
    # ...
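As a rough illustration of how a `globs` list like this might be applied, the sketch below matches each glob against a file's base name with Go's standard `path.Match`. The function name `matchesAny` and the choice to match on the base name only are assumptions for illustration. Syft's actual glob handling may differ.

```go
package main

import (
	"fmt"
	"path"
)

// matchesAny is a hypothetical helper: it reports whether the base name of
// filePath matches at least one of the given globs.
func matchesAny(globs []string, filePath string) (bool, error) {
	base := path.Base(filePath)
	for _, g := range globs {
		ok, err := path.Match(g, base)
		if err != nil {
			// path.Match only fails on a malformed pattern.
			return false, fmt.Errorf("bad glob %q: %w", g, err)
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// The globs listed in the example configuration.
	globs := []string{"license*", "License*", "*.c", "*.go", "*.py", "*.ts", "*.tsx"}

	paths := []string{"/src/app/main.go", "/usr/share/doc/pkg/License.txt", "/etc/hosts"}
	for _, p := range paths {
		ok, err := matchesAny(globs, p)
		if err != nil {
			panic(err)
		}
		fmt.Println(p, ok)
	}
}
```

Note that `path.Match` is case-sensitive, so with only these globs a file named `LICENSE` would not be selected. A real implementation would likely need either an extra `LICENSE*` entry or case-insensitive matching.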

More thought is needed as to how this is organized in the Syft JSON output. That is, does this show up as snippets under packages? Snippets under files? Maybe they get their own section? How does this relate to the licenses field under a package? (will it change? relate to another field? or something else?).

sknick · 2021-12-08T21:14:14Z

I was literally just looking at Syft for the first time today and thought to myself how I wish it had license scanning.

wagoodman · 2024-09-09T22:02:47Z

Just a heads up on this issue -- we are adding a JVM cataloger in #3188, which could leverage this feature to catalog the <JVMDIR>/legal/**/LICENSE and attach results to the package directly.

wagoodman added the enhancement New feature or request label Dec 7, 2021

wagoodman added the license relating to software licensing label Apr 28, 2022

kzantow mentioned this issue May 9, 2024: Capture licenses for all packages #2861 (Open, 44 tasks)

wagoodman mentioned this issue Jun 3, 2024: python cataloger: adding a support additionally to classify licenses by License-File field in metadata file #2923 (Open)

C0D3-M4513R mentioned this issue Jun 13, 2024: Add more rust data #2948 (Open)

mmarseu mentioned this issue Jun 17, 2024: License field in Python package metadata could be name or full text #2969 (Open)
