VersionEye is crawling a whole bunch of package managers + some GitHub Repositories. Many open source libraries provide their license directly within the package manager. On search.maven.org the license information is even mandatory. If somebody submits a new package and the license information is not included in the pom.xml file the package will be declined and not published. That’s a good thing! Unfortunately at most other package managers the license is not mandatory and so it comes that we have many open source libraries in our database without a license 😦
But many projects on GitHub provide a LICENSE file. Other projects copy & paste the license text directly into their README page. That’s why I developed a new crawler which is looking for a LICENSE file in the root of a git repository. If some kind of LICENSE file was found the crawler tries to recognize the license text. Is it an MIT text, a GPL 3 text or some other license text? If the crawler recognizes the license text it saves the information into the VersionEye database. Otherwise not.
If there is no LICENSE file the crawler tries to find some license text directly in the REAMDE file. If the crawler couldn’t find anything it skips the repo and goes on to the next one.
The license crawler doesn’t crawl whole GitHub. It crawls only git repositories of open source packages which are already in the VersionEye database and have a corresponding link to GitHub. It basically completes meta data to already existing open source packages in the database. And right now there are 463.873 open source libraries in the VersionEye database.
These works already pretty good. But the next logical step is to write a crawler which recognizes licenses in source code. That requires a bit more computing power and a bit more time to write it. Nothing for 2014. But a good goal for 2015 😉