License field in Python package metadata could be name or full text #2969

Open
mmarseu opened this issue Jun 17, 2024 · 2 comments
Labels
enhancement New feature or request license relating to software licensing

Comments

@mmarseu

mmarseu commented Jun 17, 2024

What would you like to be added:

The python-installed-package-cataloger could employ a heuristic to determine whether the License field in package metadata contains a license descriptor or the full license text.
For example, if a certain number of newlines and text length are exceeded, the value could be considered the full text.
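A minimal sketch of such a heuristic, written in Python for brevity (syft itself is written in Go). The function name and both thresholds are hypothetical placeholders, not values taken from syft, and would need tuning against real packages:

```python
def looks_like_full_license_text(
    value: str, max_newlines: int = 3, max_length: int = 200
) -> bool:
    """Guess whether a License metadata value holds the full license text.

    Assumption: the value counts as full text only when it exceeds BOTH
    the newline threshold and the length threshold (both are made up here).
    """
    stripped = value.strip()
    return stripped.count("\n") > max_newlines and len(stripped) > max_length


if __name__ == "__main__":
    # Short descriptors stay below both thresholds.
    print(looks_like_full_license_text("BSD-3-Clause"))
    # A multi-line blob of prose exceeds both thresholds.
    blob = "\n".join(["Redistribution and use in source and binary forms."] * 10)
    print(looks_like_full_license_text(blob))
```

Requiring both conditions keeps a long single-line license expression from being misread as full text, at the cost of missing very short license texts.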

When it's determined to be the full text, it should be added as such to the SBOM. In CycloneDX, that means creating a license object such as:

"license": {
  "name": "Found in <path>",
  "text": {
    "content": "<full text>"
  }
}
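As a rough illustration, the object above could be assembled like this. The helper name and the example path are hypothetical; only the `name`, `text`, and `content` keys come from the snippet above:

```python
import json


def full_text_license_entry(metadata_path: str, full_text: str) -> dict:
    """Build a license entry carrying the full text, shaped like the snippet above."""
    return {
        "license": {
            "name": f"Found in {metadata_path}",
            "text": {"content": full_text},
        }
    }


if __name__ == "__main__":
    # Hypothetical path and text, for illustration only.
    entry = full_text_license_entry(
        "site-packages/example-1.0.dist-info/METADATA",
        "BSD 3-Clause License\n\nCopyright (c) Example Authors",
    )
    print(json.dumps(entry, indent=2))
```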

Why is this needed:

The License field isn't clearly defined. In my experience, most packages just put down a license name or even an SPDX ID, but it is not uncommon to find the full text in there.
For example, pandas uses it this way.
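To make the situation concrete, here is a small sketch of reading a multi-line License field from an installed package's METADATA file with Python's standard `email` parser. The sample metadata is invented for illustration and is not copied from pandas; the helper name is hypothetical:

```python
from email import message_from_string
from typing import Optional

# Invented sample: the License value continues on indented lines.
SAMPLE_METADATA = (
    "Metadata-Version: 2.1\n"
    "Name: example-package\n"
    "Version: 1.0\n"
    "License: BSD 3-Clause License\n"
    "        Copyright (c) Example Authors\n"
    "        All rights reserved.\n"
    "\n"
    "Package description goes here.\n"
)


def read_license_field(metadata_text: str) -> Optional[str]:
    """Return the License header with continuation indentation removed."""
    message = message_from_string(metadata_text)
    raw = message["License"]
    if raw is None:
        return None
    # Continuation lines keep their leading whitespace, so strip each line.
    return "\n".join(line.strip() for line in raw.splitlines())


if __name__ == "__main__":
    print(read_license_field(SAMPLE_METADATA))
```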

Additional context:

This would fit well with #656. If a full text is identified, it could immediately be classified.

The License field might be deprecated if PEP 639 gets approved. Even then, I believe this issue will stay relevant for years to come.

@mmarseu mmarseu added the enhancement New feature or request label Jun 17, 2024
@Joerki

Joerki commented Jun 22, 2024

I experienced this also with pandas and scipy.

Regarding the definition: In pyproject.toml (https://packaging.python.org/en/latest/guides/writing-pyproject-toml/), the license key can specify either text (which should be an identifier) or file (a license text file) to include license information.
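For illustration, the two forms can be told apart once the `[project]` table has been parsed into a dictionary (for example with `tomllib` on Python 3.11+). This sketch works on an already-parsed dict; the function name and sample values are hypothetical:

```python
def license_source(project_table: dict):
    """Report which form the pyproject.toml license table uses.

    Returns ("file", path), ("text", value), or ("unknown", None).
    """
    license_value = project_table.get("license")
    if isinstance(license_value, dict):
        if "file" in license_value:
            return ("file", license_value["file"])
        if "text" in license_value:
            return ("text", license_value["text"])
    return ("unknown", None)


if __name__ == "__main__":
    # Equivalent to: license = {text = "MIT"}
    print(license_source({"name": "example", "license": {"text": "MIT"}}))
    # Equivalent to: license = {file = "LICENSE"}
    print(license_source({"name": "example", "license": {"file": "LICENSE"}}))
```

Note that the `text` form is free-form, so nothing stops a project from putting the full license text there instead of an identifier.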

This is basically not a good definition, since there should be a clear distinction between IDs and full text.

It gets even worse when this file includes not just the project's main license but also the licenses of software bundled with the package. I assume there is no safe method to tell those licenses apart when the text has no clear separation between them (similar to non-machine-readable Debian copyright files).

I suspect that this multi-licensing is the reason that we get the full text here, and not just the ID.

Do you have another, working idea already?

@mmarseu

mmarseu commented Jun 24, 2024

@Joerki I believe that problem is beyond the scope of my issue but very much in scope of #656. It has been suggested there to use https://github.com/google/licenseclassifier, which attempts to deal with these kinds of aggregated license texts.

As for this issue, I'd already be happy if syft would insert the full text it finds as a single license text, even if it really contains multiple licenses.

@wagoodman wagoodman added the license relating to software licensing label Jul 2, 2024

3 participants