The Cathedral and the Bazaar for Data Science

The title, “The Cathedral and the Bazaar”, captures the essence of the open source movement: do not build software with an exclusive group of highly compensated, highly qualified, specially appointed programmers (as specialist craftsmen would build a cathedral). Instead, build it by including everyone who is interested and motivated, relying on their love of the work rather than on compensation (as people at a bazaar interact and “volunteer” their interest freely among each other). “Necessity is the mother of invention” is a well-known saying. Put another way, every notable piece of software starts by scratching a developer’s personal itch. This is how Linus Torvalds started the Linux operating system. During his student days, he needed an affordable, high-quality operating system similar to the high-priced UNIX operating system.

Instead of starting from scratch, he recognized that it is almost always easier to start from a good partial solution than from nothing at all. He adopted the principles already present in UNIX and began by reusing code and ideas from Minix, a small Unix-like teaching operating system, to construct what would become Linux. One of his distinctive approaches was to “build to think” rather than “think to build”. Instead of planning thoroughly for six months and only then recruiting volunteer help, he started building and was willing to throw away attempts that did not work. He embraced the fact that you don’t really understand a problem until you have implemented a solution the first time. Maybe you know enough to do it right the second time. Be willing to start over at least once.

Torvalds’s real stroke of genius came when he realized the value of the users of his software. It was obvious to him that he stood to gain a great deal more by giving away his source code and treating his users as co-developers. To him this was the natural route to rapid code improvement and effective debugging. A principle he lived by was to release early and release often. He made a point of listening to his users and continually weighed their feedback. Linus kept his users/co-developers stimulated (by giving them a satisfying piece of the action) and rewarded (by letting them see the daily improvement in their work). Usually, software developers toil away their days to get paid for writing programs they neither need nor love. In Torvalds’s “team”, the compensation was worth much more than $50 an hour: contributors had a real sense of doing important work, received regular recognition by having their contributions incorporated, and really loved what they were doing.

Linus also discovered how parallelizable debugging is. The more users/co-developers he had, the more ways of stressing the code he obtained, and the more bugs were found as a consequence. With a large enough group, most problems will be discovered, and chances are that at least one user will find a fix for each of them. As the essay puts it: given enough eyeballs, all bugs are shallow. Torvalds’s big insight was that his users/co-developers were his most valuable resource.

Finally, the principles and practices of the open source movement already are, and will continue to be, most beneficial to the data science community. Data science is still a maturing discipline, and the high level of efficiency of the Linux culture is much needed. Open source alternatives to proprietary data science tools like MATLAB already have a substantial history (for example, the open source GNU Octave). Julia is a more recent (2012) open source initiative from MIT. Python is one of the most popular data science environments, and it is entirely open source. Even Microsoft, the traditional mainstay of proprietary software, came to grips with the benefits of open source data science software when it purchased the R-based company Revolution Analytics recently. It even rewrote parts of R to become the Microsoft R Server. In general, the use of proprietary data science environments like SPSS and SAS is declining in favor of open source (and free) alternatives like R, Python, Apache Hadoop, and Julia. It is interesting how “neglected” proprietary tools are in this recent blog: https://blog.appliedai.com/data-science-tools/