VersionEye is crawling a whole bunch of package managers + some GitHub Repositories. Many open source libraries provide their license directly within the package manager. On search.maven.org the license information is even mandatory. If somebody submits a new package and the license information is not included in the pom.xml file the package will be declined and not published. That’s a good thing! Unfortunately at most other package managers the license is not mandatory and so it comes that we have many open source libraries in our database without a license 😦
But many projects on GitHub provide a LICENSE file. Other projects copy & paste the license text directly into their README page. That’s why I developed a new crawler which is looking for a LICENSE file in the root of a git repository. If some kind of LICENSE file was found the crawler tries to recognize the license text. Is it an MIT text, a GPL 3 text or some other license text? If the crawler recognizes the license text it saves the information into the VersionEye database. Otherwise not.
If there is no LICENSE file the crawler tries to find some license text directly in the REAMDE file. If the crawler couldn’t find anything it skips the repo and goes on to the next one.
The license crawler doesn’t crawl whole GitHub. It crawls only git repositories of open source packages which are already in the VersionEye database and have a corresponding link to GitHub. It basically completes meta data to already existing open source packages in the database. And right now there are 463.873 open source libraries in the VersionEye database.
Results
In the past couple days the license crawler was able to find license information for 22.694 open source libraries. 5.684 of this open source libraries are JavaScript libraries registered on Bower, without any license information. One popular example is angular-mocks. The repository provides a bower.json file without a license! The license text of angular-mocks is directly embedded into the README.md file. VersionEye found & recognized the license as MIT and displays it now on the corresponding VersionEye page. There are many more examples like this one.
Next Step
These works already pretty good. But the next logical step is to write a crawler which recognizes licenses in source code. That requires a bit more computing power and a bit more time to write it. Nothing for 2014. But a good goal for 2015 😉