At Zalando we’re a bit late to the party, but we are now putting more effort into open source software. Until recently there was no official agenda or clear process on “how to open source”, but this has changed. From now on, everything we do that can be open source should be open source.

As we’re a publicly listed company, we also need to make sure we get the legal side of it right. Basically, this comes down to two questions:

  • Under what license do we open source our software and
  • which licenses do we allow for the dependencies of our open source software?

Regarding the latter, some licenses are clearly acceptable or clearly unacceptable, whereas others fall into a gray area. We have to make a decision regardless. One way to do this is to check whether tools we really want to use are published under the license in question. If they are, the license would be permitted.

Luckily, GitHub recently introduced the License API. It does not allow searching repositories by license though (e.g. “the 100 most-starred using the MIT license”), but this is quite easy to work around.

I wrote a simple data collector that searches for all repositories created before March 1st, 2015 (an arbitrarily chosen date), sorted by stars. It writes the name, stars and license of every repository to a CSV file. [1]
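The collector described above can be sketched roughly as follows. This is not the original script, just a minimal Python sketch. It assumes the repositories endpoint of the Search API with the `created:<DATE>` qualifier, and that each returned repository object carries `full_name`, `stargazers_count` and a `license` object (or null). The function names and the CSV header are made up for illustration, and the HTTP call itself is left out so the example runs offline.

```python
import csv
import io
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(created_before, page, per_page=100):
    """Build a search URL for repositories created before a date, most-starred first."""
    params = {
        "q": f"created:<{created_before}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def to_row(item):
    """Reduce one repository object to (name, stars, license)."""
    # Assumption: "license" is either null or an object with a "key" field.
    license_info = item.get("license") or {}
    return (item["full_name"], item["stargazers_count"], license_info.get("key", "none"))


def write_csv(items, out):
    """Write a header plus one row per repository to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["name", "stars", "license"])  # hypothetical header
    for item in items:
        writer.writerow(to_row(item))


# Made-up sample items standing in for one page of search results.
sample_items = [
    {"full_name": "example/one", "stargazers_count": 5, "license": {"key": "mit"}},
    {"full_name": "example/two", "stargazers_count": 3, "license": None},
]
buffer = io.StringIO()
write_csv(sample_items, buffer)
print(build_search_url("2015-03-01", page=1))
print(buffer.getvalue())
```

A real run would request `build_search_url(...)` for each page in turn and feed the `items` array of every response into `write_csv`.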

Stop talking already

Fine!

So, as already stated on the GitHub blog, the most popular license is clearly MIT, which is used by 42 % of the projects. [2]

Licenses used

But maybe those MIT-licensed projects are all lesser known? Nope. MIT also got 44 % of all stars in the data set.

Licenses by million stars

Do you want to know the highest-starred project per license? Here they are:

You didn’t answer my question

Well, whatever. You know what? Help yourself. Here is the CSV containing all 1050 repositories I got. Import it into a database, run awk on it or use the excellent q like I did. Then write a blog post about it, because I would probably be interested too.
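If you would rather stay in Python, a starting point could look like the sketch below. It computes each license's share of repositories and of stars, which is the kind of number quoted above. The column names `name`, `stars` and `license` are an assumption about the CSV header, and the sample rows are made up, not taken from the real data.

```python
import csv
import io
from collections import Counter


def license_shares(csv_file):
    """Return (share of repositories, share of stars) per license as fractions."""
    repos = Counter()
    stars = Counter()
    for row in csv.DictReader(csv_file):
        repos[row["license"]] += 1
        stars[row["license"]] += int(row["stars"])
    total_repos = sum(repos.values())
    total_stars = sum(stars.values())
    # Guard against empty input or a data set without any stars.
    by_repos = {k: v / total_repos for k, v in repos.items()} if total_repos else {}
    by_stars = {k: v / total_stars for k, v in stars.items()} if total_stars else {}
    return by_repos, by_stars


# Made-up rows for illustration only.
sample = io.StringIO(
    "name,stars,license\n"
    "example/one,300,mit\n"
    "example/two,100,apache-2.0\n"
    "example/three,100,mit\n"
    "example/four,500,none\n"
)
by_repos, by_stars = license_shares(sample)
for key in sorted(by_repos):
    print(f"{key}: {by_repos[key]:.0%} of repositories, {by_stars[key]:.0%} of stars")
```

For the real file, pass an open file handle instead of the `io.StringIO` sample.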

Gotchas

There are some things you need to consider when interpreting the data.

For one, GitHub is constantly gaining users and adding features that make projects more visible. Think of the news feed and the weekly trending emails. This could skew star counts in favor of more recent projects.

Also, what you get from the License API is what GitHub thinks the license is after applying some heuristics. There could be a popular MIT project that wasn’t detected as such because the LICENSE file is called TRWYDDED (“license” translated to Welsh). One example is Font Awesome, which is listed under none although it actually has multiple licenses.
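To make the TRWYDDED case concrete, here is a toy filename check. It is purely illustrative: how GitHub actually detects licenses is not covered here, and the pattern below is a hypothetical stand-in that only knows a few common English file names.

```python
import re

# Hypothetical pattern, not GitHub's: a few common English license file names.
LICENSE_FILE = re.compile(r"(licen[sc]e|copying|unlicense)(\.(md|txt|rst))?", re.IGNORECASE)


def looks_like_license_file(filename):
    """Return True if the whole file name matches the toy pattern."""
    return LICENSE_FILE.fullmatch(filename) is not None


for name in ["LICENSE", "license.md", "COPYING", "TRWYDDED", "README.md"]:
    print(name, looks_like_license_file(name))
```

Any name-based heuristic of this kind misses a license file with an unexpected name, no matter what the file contains.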

Finally, not all of the projects are software, as shown by Font Awesome and the list of free programming books.

  1. Fun fact: I originally wanted to fetch the first 10K results or so, but according to the error message the GitHub Search API is limited to the first 1000 results. Somehow I still ended up with 1050 results ¯\_(ツ)_/¯

  2. GitHub reports 44 % for MIT because they exclude unlicensed projects.