
Project Ideas Improve License Detection Accuracy

Philippe Ombredanne edited this page Mar 4, 2020 · 1 revision

Improve ScanCode License detection accuracy

ScanCode license detection combines several techniques to detect licenses: automatons, inverted indexes and multiple sequence alignments. Even so, the detection is not always accurate enough.

The goal of this project is to improve the accuracy of license detection. Some of the cases where it could be improved include:

  • when multiple licenses are detected with a low score and some detections are incorrect.
  • when some unknown licenses may not be detected correctly.
  • when license references such as "see license in file LICENSE.txt" are reported as unknown license references.
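As an illustration, the three cases above could be flagged in scan results with a small filter like the one below. This is only a sketch under assumptions: it supposes each file record is a dict with a `path` and a `licenses` list whose entries carry a `key` and a `score` from 0 to 100, and that unknown detections use the keys `unknown` and `unknown-license-reference`. The function name `flag_suspicious` and the threshold of 80 are hypothetical, and the field names should be checked against the ScanCode version in use.

```python
from typing import Dict, List, Tuple

# Assumed license keys for unknown detections; verify against your ScanCode version.
UNKNOWN_KEY = "unknown"
UNKNOWN_REFERENCE_KEY = "unknown-license-reference"


def flag_suspicious(files: List[Dict], min_score: float = 80.0) -> List[Tuple[str, str]]:
    """Return (path, reason) pairs for file records that deserve a review.

    `files` is assumed to be a list of dicts shaped like
    {"path": str, "licenses": [{"key": str, "score": float}, ...]}.
    """
    flagged = []
    for record in files:
        path = record.get("path", "")
        matches = record.get("licenses") or []

        # Case 1: several detections in one file, at least one with a low score.
        has_low_score = any(m.get("score", 0.0) < min_score for m in matches)
        if len(matches) > 1 and has_low_score:
            flagged.append((path, "multiple-low-score"))

        keys = {m.get("key") for m in matches}
        # Case 2: text reported as an unknown license.
        if UNKNOWN_KEY in keys:
            flagged.append((path, "unknown-license"))
        # Case 3: a reference such as "see license in file LICENSE.txt".
        if UNKNOWN_REFERENCE_KEY in keys:
            flagged.append((path, "unknown-reference"))
    return flagged
```

The threshold is a plain parameter so that a reviewer can tighten or loosen it depending on how many flagged files they are able to inspect.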

To support this effort, this project can leverage the ClearlyDefined data set. ScanCode license detection is used in the ClearlyDefined.io project to scan millions of packages.

One possible outcome of this project is to write tools and create models that analyze the accuracy of license detection at scale and identify areas where the accuracy could be improved. These tools and models should then be reusable to assist in the semi-automated review of scan results.
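One simple form such an analysis tool could take is a ranking of detection rules by how often they produce low-score matches across many scanned files. The sketch below is an assumption-laden starting point, not a proposed design: it supposes each license match carries a `score` and a `matched_rule` dict with an `identifier`, and the name `rank_rules` and the threshold of 80 are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_rules(files: List[Dict], min_score: float = 80.0) -> List[Tuple[str, int, int]]:
    """Rank rules by their share of low-score matches.

    Returns (rule_identifier, low_score_count, total_count) tuples, with the
    rules that have the highest fraction of low-score matches first. Rules
    that never produced a low-score match are left out.
    """
    total = Counter()
    low = Counter()
    for record in files:
        for match in record.get("licenses") or []:
            # `matched_rule.identifier` is an assumed field name.
            rule = (match.get("matched_rule") or {}).get("identifier", "<no-rule>")
            total[rule] += 1
            if match.get("score", 0.0) < min_score:
                low[rule] += 1

    ranked = [(rule, low[rule], total[rule]) for rule in total if low[rule]]
    # Highest low-score fraction first, then most frequent, then by name.
    ranked.sort(key=lambda r: (-r[1] / r[2], -r[2], r[0]))
    return ranked
```

Rules at the top of such a ranking would be natural candidates for manual review, or for the creation of new, more specific detection rules.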

A bonus would be to also suggest the semi-automatic creation of new license detection rules to fix the detected anomalies.
