`License` field in Python package metadata could be name or full text #2969

mmarseu · 2024-06-17T09:25:18Z

What would you like to be added:

The python-installed-package-cataloger could employ a heuristic to determine whether the License field in package metadata contains a license descriptor or the full license text.
For example, if a certain number of newlines and text length are exceeded, the value could be considered the full text.
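A minimal sketch of such a heuristic, written in Python for illustration only (syft itself is not written in Python). The function name and both thresholds are hypothetical, and treating either threshold alone as sufficient is an assumption, not something settled in this issue:

```python
def looks_like_full_license_text(
    value: str,
    max_newlines: int = 3,   # hypothetical threshold, not taken from syft
    max_length: int = 200,   # hypothetical threshold, not taken from syft
) -> bool:
    """Guess whether a License metadata value is full text rather than a name or ID."""
    stripped = value.strip()
    # Assumption: exceeding either threshold is enough to call it full text.
    return stripped.count("\n") > max_newlines or len(stripped) > max_length


if __name__ == "__main__":
    print(looks_like_full_license_text("MIT"))
    print(looks_like_full_license_text("BSD 3-Clause License\n\nCopyright (c) ...\n\nRedistribution and use ...\n\n..."))
```

Short names and SPDX expressions stay well below both limits, so the exact threshold values probably matter less than having them at all.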

When it's determined to be the full text, it should be added as such to the SBOM. In CycloneDX, that means creating a license object such as:

"license": {
  "name": "Found in <path>",
  "text": {
    "content": "<full text>"
  }
}
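As a rough illustration, the object above could be assembled like this. The helper name and the example path are hypothetical; only the field layout is taken from the snippet above:

```python
import json


def cyclonedx_license_entry(full_text: str, source_path: str) -> dict:
    """Build a license entry shaped like the snippet above (hypothetical helper)."""
    return {
        "license": {
            "name": f"Found in {source_path}",
            "text": {"content": full_text},
        }
    }


if __name__ == "__main__":
    # The path below is a made-up example, not a real syft location.
    entry = cyclonedx_license_entry("BSD 3-Clause License\n\n...", "example.dist-info/METADATA")
    print(json.dumps(entry, indent=2))
```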

Why is this needed:

The License field isn't clearly defined. In my experience, most packages just put down a license name or even an SPDX ID, but it is not uncommon to find the full text in there.
For example, pandas uses it this way.

Additional context:

This would fit well with #656. If a full text is identified, it could immediately be classified.

The License field might be deprecated if PEP 639 gets approved. Even then, I believe this issue will stay relevant for years to come.

Joerki · 2024-06-22T05:29:02Z

I experienced this also with pandas and scipy.

Regarding the definition: in pyproject.toml (https://packaging.python.org/en/latest/guides/writing-pyproject-toml/), license information can be given either as text (which should be an identifier) or as file (a path to a license text file).

This is basically not a good definition, since there should be a clear distinction between IDs and full text.

It gets even worse when this file includes not just the project's main license but also the licenses of software bundled with the package. I assume there is no safe method to tell those licenses apart in a text that has no clear separation between them (similar to non-machine-readable Debian copyright files).

I suspect that this multi-licensing is the reason that we get the full text here, and not just the ID.

Do you have another, working idea already?

mmarseu · 2024-06-24T06:52:32Z

@Joerki I believe that problem is beyond the scope of my issue but very much in scope of #656. It has been suggested there to use https://github.com/google/licenseclassifier, which attempts to deal with these kinds of aggregated license texts.

As for this issue, I'd already be happy if syft would insert the full text it finds as a single license text, even if it really contains multiple licenses.

mmarseu added the "enhancement" (New feature or request) label · Jun 17, 2024

wagoodman added the "license" (relating to software licensing) label · Jul 2, 2024
