Improved license recognition

VersionEye now monitors more than 1.2 million open source projects and collects all kinds of meta information about these projects. One kind of meta information is the corresponding license. Currently the VersionEye database contains licenses for more than 8 million artefacts.

However, sometimes the maintainers of a project didn’t provide the license information in a way that is easy for machines to read and recognise. That is especially the case for Python and the .NET platform. Take for example the gpkit Python library. In the license field the maintainers provided the full license text, not just the license name.

python-gpkit-00

That doesn’t match very well with the license whitelist in VersionEye 😉 That’s why we improved it now!

Altogether we have 11335 Python licenses like that in our database and they belong to 1989 unique projects on PIP. Our new license crawler could match 9933 of these licenses to SPDX identifiers. That covers 1799 unique projects on PIP to which we could assign an SPDX identifier and which didn’t have one before. In other words, we could recognise and identify an SPDX identifier for more than 90% of these projects. And now the same library on VersionEye looks like this:

python-gpkit-01

You can see clearly that it’s the MIT license, and now this works well together with our license whitelist 🙂

In NuGet, the package manager for the .NET platform, many license names look like this.

csharp-00.png

“Nuget unknown”. That is the case if the maintainers provided a license link but didn’t provide a license name. Our new license crawler now follows these links and uses machine learning to try to identify a known license. If the similarity to a known license text is higher than 90%, we assign the corresponding SPDX identifier to the software library in our database. And now the same package looks like this and you can see immediately that it’s the MIT license!

csharp-01.png

That also helps when using these packages together with our license whitelist. For the .NET packages our recognition system was not quite as good as for Python. For .NET we could only identify 65% of the licenses of packages with a link but without a license name. Still not bad and much better than before 😉

These are just the first results. Of course this is also work in progress and the recognition will become better in the future.
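To make the 90% rule more concrete, here is a minimal sketch of such a similarity check. It is not the crawler's actual code: the real system uses machine learning, while this sketch swaps in Python's `difflib.SequenceMatcher` as a simple stand-in. The `KNOWN_LICENSE_TEXTS` table (with shortened reference texts), the function names and the threshold constant are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference corpus: SPDX identifier -> (shortened) license text.
KNOWN_LICENSE_TEXTS = {
    "MIT": (
        "Permission is hereby granted, free of charge, to any person obtaining "
        "a copy of this software and associated documentation files, to deal in "
        "the Software without restriction"
    ),
    "Apache-2.0": (
        "Licensed under the Apache License, Version 2.0; you may not use this "
        "file except in compliance with the License"
    ),
}

# Assumed threshold, taken from the 90% figure mentioned in the post.
SIMILARITY_THRESHOLD = 0.9


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return re.sub(r"\s+", " ", text).strip().lower()


def identify_license(text, threshold=SIMILARITY_THRESHOLD):
    """Return the best-matching SPDX identifier, or None if nothing is similar enough."""
    candidate = normalize(text)
    best_id, best_score = None, 0.0
    for spdx_id, reference in KNOWN_LICENSE_TEXTS.items():
        # autojunk=False keeps the comparison reliable for longer texts.
        matcher = SequenceMatcher(None, candidate, normalize(reference), autojunk=False)
        score = matcher.ratio()
        if score > best_score:
            best_id, best_score = spdx_id, score
    return best_id if best_score > threshold else None
```

A text fetched from a license link would be passed to `identify_license`; if no known text scores above the threshold, the package simply keeps its unknown license.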

Your feedback is very welcome.

Suggest new licenses

VersionEye now monitors more than 1.2 million open source projects and collects all kinds of meta information about these projects. One kind of meta information is the corresponding license. Currently the VersionEye database contains licenses for more than 8 million artefacts.

However, it is not always possible to fetch the license automatically. Sometimes things go wrong and sometimes the license is not available through a repository API. Sometimes human interaction is required to find the license for an artefact. Now everybody from the VersionEye community can suggest a license for an artefact. If you are on a VersionEye product page with an unknown license, you will now see a new “Suggest a license” link.

screen-shot-2017-01-18-at-19-59-14

Clicking on the link takes you to a new page where you can suggest a license for the corresponding artefact. The form is already pre-filled for you.

screen-shot-2017-01-18-at-20-01-36

When you submit the form above, the VersionEye team receives an email notification with the new license suggestion. After the submission has been reviewed and approved, the license will show up on the page.

I hope that many of you will use this new feature! 🙂

New license crawler finds licenses in git repositories

VersionEye crawls a whole bunch of package managers plus some GitHub repositories. Many open source libraries provide their license directly within the package manager. On search.maven.org the license information is even mandatory. If somebody submits a new package and the license information is not included in the pom.xml file, the package will be declined and not published. That’s a good thing! Unfortunately, at most other package managers the license is not mandatory, and so we have many open source libraries in our database without a license 😦

But many projects on GitHub provide a LICENSE file. Other projects copy & paste the license text directly into their README page. That’s why I developed a new crawler which looks for a LICENSE file in the root of a git repository. If some kind of LICENSE file is found, the crawler tries to recognize the license text. Is it an MIT text, a GPL 3 text or some other license text? If the crawler recognizes the license text, it saves the information into the VersionEye database. Otherwise it saves nothing.
If there is no LICENSE file, the crawler tries to find some license text directly in the README file. If the crawler can’t find anything, it skips the repo and goes on to the next one.
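
The lookup order described above can be sketched roughly like this. It is only an illustration under stated assumptions, not the crawler's real code: it works on a local checkout of the repository, the file name lists and the `LICENSE_MARKERS` phrases are hypothetical, and recognition is reduced to a simple phrase search.

```python
from pathlib import Path

# Hypothetical file names to look for in the repository root.
LICENSE_FILE_NAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")
README_FILE_NAMES = ("README", "README.md", "README.rst", "README.txt")

# Hypothetical marker phrases; a real recognizer would compare full texts.
# "GPL-3.0" is the older SPDX short form for the GPL 3 text.
LICENSE_MARKERS = {
    "MIT": "permission is hereby granted, free of charge",
    "GPL-3.0": "gnu general public license version 3",
    "Apache-2.0": "apache license version 2.0",
}


def recognize(text):
    """Return an SPDX identifier if a known marker phrase occurs in the text."""
    normalized = " ".join(text.lower().split())
    for spdx_id, marker in LICENSE_MARKERS.items():
        if marker in normalized:
            return spdx_id
    return None


def read_first_existing(repo_root, names):
    """Return the content of the first file from `names` found in the repo root."""
    for name in names:
        path = repo_root / name
        if path.is_file():
            return path.read_text(encoding="utf-8", errors="replace")
    return None


def find_license(repo_root):
    """LICENSE file first; README only if there is no LICENSE file at all."""
    repo_root = Path(repo_root)
    license_text = read_first_existing(repo_root, LICENSE_FILE_NAMES)
    if license_text is not None:
        # Unrecognized LICENSE text yields None, so nothing would be saved.
        return recognize(license_text)
    readme_text = read_first_existing(repo_root, README_FILE_NAMES)
    if readme_text is not None:
        return recognize(readme_text)
    return None  # nothing found: skip this repo
```

Calling `find_license` on a checkout returns an identifier that could be stored in the database, or `None` when the repo should be skipped.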

The license crawler doesn’t crawl the whole of GitHub. It crawls only git repositories of open source packages which are already in the VersionEye database and have a corresponding link to GitHub. It basically completes the meta data of open source packages that already exist in the database. And right now there are 463,873 open source libraries in the VersionEye database.

Results

In the past couple of days the license crawler was able to find license information for 22,694 open source libraries. 5,684 of these open source libraries are JavaScript libraries registered on Bower without any license information. One popular example is angular-mocks. The repository provides a bower.json file without a license! The license text of angular-mocks is directly embedded into the README.md file. VersionEye found & recognized the license as MIT and now displays it on the corresponding VersionEye page. There are many more examples like this one.

Next Step

This already works pretty well. But the next logical step is to write a crawler which recognizes licenses in source code. That requires a bit more computing power and a bit more time to write. Nothing for 2014. But a good goal for 2015 😉