Improved license recognition

VersionEye is monitoring now more than 1.2 Million open source projects and collecting all kind of meta information to this projects. One kind of the meta information is the corresponding license. Currently the VersionEye database contains licenses to more than 8 Million artefacts.

However, sometimes the maintainers of a project didn’t provide the license information that way that it’s easy to read and recognise for machines. That is specially the case for Python and the .NET platform. Take for example the gpkit Python library. In the license field they provided the full license text, not just the license name.

python-gpkit-00

That doesn’t match very well with the license whitelist in VersionEye 😉 That’s why we improved it now!

All together we have 11335 Python licenses like that in our database and they belong to 1989 unique projects on PIP. Our new license crawler could match 9933 licenses to SPDX identifiers. That are 1799 unique projects on PIP we could assign a SPDX identifier to, which didn’t had one before. For more than 90% of this projects we could recognise and identify an SPDX identifier. And now the same library on VersionEye looks like that:

python-gpkit-01

You see clear that it’s MIT license and now this works well together with our license whitelist 🙂

In Nuget, the package manager for the .NET platform, many license names look like this.

csharp-00.png

“Nuget unknown”. That is the case if the maintainers provided a license link but didn’t provide a license name. Our new license crawler is now following this links and with machine learning it tries to identify a known license. If the similarity to a known license text is higher that 90% we assign the corresponding SPDX identifier to the software library in our database. And now the same package looks like this and you can see immediately that it’s the MIT license!

csharp-01.png

That also helps to use these packages together with our license whitelist. For the .NET packages our recognition system was not quiet as good as for Python. For .NET we could only identify 65% of the licenses of packages with a link but without a license name. Stil not bad and much better than before 😉

This are just the first results. Of course this is also work in progress and the recognition will become better in future.

Your feedback is very welcome.

2 thoughts on “Improved license recognition

  1. Do you have any numbers for npm dependencies? Since the SPDX identifier was introduced back in May 2015, but many packages are older than that or not following the license identifier. See e.g. https://npm1k.org/ for a list of popular dependencies which have invalid information

Leave a comment