New license crawler finds licenses in git repositories

VersionEye is crawling a whole bunch of package managers + some GitHub Repositories. Many open source libraries provide their license directly within the package manager. On search.maven.org the license information is even mandatory. If somebody submits a new package and the license information is not included in the pom.xml file the package will be declined and not published. That’s a good thing! Unfortunately at most other package managers the license is not mandatory and so it comes that we have many open source libraries in our database without a license 😦

opensource_logo

But many projects on GitHub provide a LICENSE file. Other projects copy & paste the license text directly into their README page. That’s why I developed a new crawler which is looking for a LICENSE file in the root of a git repository. If some kind of LICENSE file was found the crawler tries to recognize the license text. Is it an MIT text, a GPL 3 text or some other license text? If the crawler recognizes the license text it saves the information into the VersionEye database. Otherwise not.
If there is no LICENSE file the crawler tries to find some license text directly in the REAMDE file. If the crawler couldn’t find anything it skips the repo and goes on to the next one.

The license crawler doesn’t crawl whole GitHub. It crawls only git repositories of open source packages which are already in the VersionEye database and have a corresponding link to GitHub. It basically completes meta data to already existing open source packages in the database. And right now there are 463.873 open source libraries in the VersionEye database.

Results

In the past couple days the license crawler was able to find license information for 22.694 open source libraries. 5.684 of this open source libraries are JavaScript libraries registered on Bower, without any license information. One popular example is angular-mocks. The repository provides a bower.json file without a license! The license text of angular-mocks is directly embedded into the README.md file. VersionEye found & recognized the license as MIT and displays it now on the corresponding VersionEye page. There are many more examples like this one.

Next Step

These works already pretty good. But the next logical step is to write a crawler which recognizes licenses in source code. That requires a bit more computing power and a bit more time to write it. Nothing for 2014. But a good goal for 2015 😉

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s