Free software is supposedly developed by a loosely organised "community" of programmers. Until now, however, it has been largely unknown who, apart from a few well-known celebrities, belongs to this community and, more importantly, how contributions are distributed. The authors present a first survey of free software authorship, with the emphasis not on building a census or even a "hall of fame", but on identifying patterns of concentration and distribution of contribution. The code sample is not necessarily representative, and the automated, large-scale task of identifying and crediting authors introduces several errors. Nevertheless, comprehensive data is collated for the first time, and can be scrutinised in detail on the survey Web site.
Scope and Method
The free software (or open source) "community" is much talked about, though little hard data on this community and its activities is available. Here, for the first time, Orbiten Research provides a body of empirical data and analysis to explain and describe this community.
Simple facts, such as the number of developers contributing to free software projects, the number of such projects, and their size, have been until now unknown. The Orbiten Free Software Survey answers some of these questions, and aims to provide a foundation for empirical research on the free software community.
Building on the release of CODD over a year ago, the Survey measures and tracks over time several aspects of the free software economy. Factors analyzed include the concentration (or diversity) of contributions and contributors; the degree of intersection between projects and sharing of code; the participation of developers in different projects; and the volatility of changes to the code base and the developer base.
Important basic statistics and data were gained during the survey process, such as total size of free software available, amount of free software being released and/or modified each month, and a compendium of developers.
The survey over time will become more comprehensive, providing an important source of information for academic researchers, free software users, and developers.
The primary findings of OFSS01 were basic: the number of developers authoring projects included in the survey (12,706), the size of the free software code base (1.04 Gigabytes, or roughly 25 million lines), and the number of identifiable free software projects (3,149). Given the total lack of data on the free software economy, rough indicators of its size (limited by the initial scope of the survey) are, we believe, a good start.
Secondary findings relate to the degree of contribution to the code base by individual authors. For the purposes of this survey, an author is defined as the smallest identifiable grouping claiming credit for development of a software project. Unsurprisingly, the Free Software Foundation came out far ahead of anyone else, credited with 11% (124 Mb) of the entire surveyed code base and involved in 17% (546) of all identifiable projects. However, as with some other well-known (and highly ranked in the survey) Unix authors, such as Sun Microsystems and the Regents of the University of California, the FSF's position in our charts stems largely from the lack of credit given to individual programmers. A list of the top few contributors sorted by code and involvement in projects is given below (see Data).
Further findings relate to the distribution of authors among projects, and code base contribution. The top 1,271 authors, 10% of the total, accounted for 72.3% of the total code base. The top 10 authors alone (0.08% of the total) are credited for 19.8% of the code base. Free software development may be distributed, but it is most certainly very top heavy.
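The concentration figures above can be reproduced from a list of per-author byte counts. The following is a minimal sketch, not the survey's own code: the function name `top_share`, its parameters, and the toy data are all hypothetical. By default it measures shares against the credited total; passing `total` lets the share be taken against a larger base, such as a code base that includes uncredited code.

```python
def top_share(byte_counts, fraction, total=None):
    """Share of bytes credited to the top `fraction` of authors.

    byte_counts: bytes credited to each author (any order).
    fraction:    e.g. 0.10 for the top decile.
    total:       optional denominator; defaults to the sum of byte_counts.
    """
    ranked = sorted(byte_counts, reverse=True)
    # Always include at least one author, even for very small fractions.
    k = max(1, round(len(ranked) * fraction))
    denominator = sum(ranked) if total is None else total
    return sum(ranked[:k]) / denominator


# Hypothetical toy data (not survey data): bytes credited to ten authors.
toy_bytes = [500, 200, 100, 60, 40, 30, 25, 20, 15, 10]

print(f"top 10% of authors: {top_share(toy_bytes, 0.10):.1%}")
print(f"top 20% of authors: {top_share(toy_bytes, 0.20):.1%}")
```

With the toy data, a single author holds half of the credited bytes, which is the kind of top-heavy pattern the survey reports at a much larger scale.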
What goes for lines of code written goes for involvement in projects too. Only the top 25 authors (0.19% of the total) were credited with participation in more than 25 projects. The top 250 authors were credited with participation in over five projects, and the vast majority (over 77%) of authors were only involved in a single project. Our conclusion: Free software development is less a bazaar of several developers involved in several projects, more a collation of projects developed single-mindedly by a large number of authors.
Data
Number of identifiable authors        12706
Uncredited/unidentifiable authors     790
% of code base uncredited             8.37%
Size of code base                     1116500467 bytes, or 1067 Mb
Number of identifiable projects       3149
Table 1: Top 10 authors ranked by contribution of code
Author                                       % of total
free software foundation                     11.231
sun microsystems                             1.848
regents of the university of california      1.359
gordon matzigkeit                            1.216
paul houle                                   1.042
thomas g. lane                               0.782
massachusetts institute of technology        0.762
ulrich drepper                               0.559
lyle johnson                                 0.528
peter miller                                 0.525
Table 2: Author contribution by decile
Authors                 % of total
top 10 authors          19.854
top decile (1271)       72.320
2nd decile              8.928
3rd decile              4.062
4th decile              2.384
5th decile              1.515
6th decile              1.008
7th decile              0.672
8th decile              0.440
9th decile              0.239
10th decile             0.060
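As a quick arithmetic check on Table 2, the ten decile shares can be summed and compared with the uncredited share of the code base reported in the summary figures (8.37%). The snippet below is only a sketch of that check, with illustrative variable names.

```python
# Decile shares from Table 2, in percent of the total code base.
decile_shares = [72.320, 8.928, 4.062, 2.384, 1.515,
                 1.008, 0.672, 0.440, 0.239, 0.060]

# Uncredited share of the code base from the summary figures, in percent.
uncredited = 8.37

credited = sum(decile_shares)
print(f"credited by identifiable authors: {credited:.3f}%")
print(f"credited + uncredited:            {credited + uncredited:.2f}%")
```

The deciles sum to about 91.63%, which together with the uncredited 8.37% accounts for essentially the whole code base. This suggests that the percentages in the tables are taken relative to the entire code base, uncredited code included.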
Table 3: Top 10 authors ranked by participation in projects
Author                                       Projects
free software foundation                     546
gordon matzigkeit                            267
regents of the university of california      156
ulrich drepper                               142
roland mcgrath                               99
sun microsystems                             66
rsa data security                            59
martijn pieterse                             50
eric young                                   48
login-vern                                   47
Table 4: Author participation in projects
Projects     Authors
> 25         25
6 - 24       211
3 - 5        928
Only 2       1924
Only 1       9617
Note: 211 authors participated in 6 to 24 projects, etc.
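The counts in Table 4 can be turned into shares of all authors with a few lines of arithmetic. This is a minimal sketch using only the numbers printed in the table; the dictionary and variable names are illustrative.

```python
# Author counts per participation bucket, copied from Table 4.
participation = {
    "> 25": 25,
    "6 - 24": 211,
    "3 - 5": 928,
    "Only 2": 1924,
    "Only 1": 9617,
}

total_authors = sum(participation.values())
print(f"authors in table: {total_authors}")

for bucket, count in participation.items():
    print(f"{bucket:>7}: {count:5d} authors ({count / total_authors:.1%})")
```

On these counts the table covers 12,705 authors, and those credited in only one project make up about 75.7% of them.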
Scope and Method
The first Orbiten Free Software Survey has been prepared based on over 18 months of work in identifying, tracking, and modeling interaction in the free software economy. Clearly this was not enough time, and the scope and methodology of the first survey are far from ideal.
The technical task of identifying credits in poorly documented source code was complex, especially given the vast and changing nature of the code base. Credits are often unavailable and rarely follow a set format, so various heuristics were applied and "policy" decisions made on, for example, how to divide credit among multiple listed authors. Details can be found in the documentation for CODD.
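To illustrate what a "policy" decision on shared credit might look like, the sketch below divides each file's bytes equally among its listed authors and books files with no listed author as uncredited. The equal split is an assumption made purely for illustration and is not necessarily the rule CODD applies; the function name `split_credit` and the sample data are hypothetical.

```python
from collections import defaultdict


def split_credit(files):
    """Divide each file's size equally among its listed authors.

    files: iterable of (size_in_bytes, [author, ...]) pairs.
    Returns a dict mapping author -> credited bytes.
    Assumption: equal split; files without authors go to "(uncredited)".
    """
    credit = defaultdict(float)
    for size, authors in files:
        if not authors:
            credit["(uncredited)"] += size
            continue
        share = size / len(authors)
        for author in authors:
            credit[author] += share
    return dict(credit)


# Hypothetical sample: two credited files and one without any credit.
sample_files = [
    (1200, ["alice", "bob"]),
    (800, ["alice"]),
    (500, []),
]
result = split_credit(sample_files)
print(result)
```

Other policies are equally plausible, such as crediting only the first listed author or weighting by the order of names, and each would shift the resulting rankings.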
The code base itself was limited. Although it is far from being a complete set of all code ever released without payment on the Internet - our ideal, eventual goal - we believe it is a fairly representative sample of software projects (released under the GNU General Public License and its variants) developed in recent years.
The source code base for OFSS01 is:
- RedHat Linux v6.1 source rpms. [http://www.redhat.com]
- Linux kernel sources version 2.2.14.
- Munitions cryptography/security archive as of January 11, 2000 [http://munitions.vipul.net]
- Approximately 50% of source code available through Freshmeat as of January 5, 2000. Explanation: source code is not easily available for all projects on Freshmeat, at least when accessed through an automated script with simple intelligence. [http://freshmeat.net]
For each module or package analysed, source code is broken into projects identified according to the package distribution. Source code and some documentation files are scanned for authorship, credit or copyright information, from which author names are identified. Data collected includes, for each identified author, the number of bytes of code authored and the number and names of projects authored. From this, the degree of contribution in terms of bytes of code can be calculated for any given project. Project data is collated so that it can be examined at several levels.
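The scanning step described above can be sketched roughly as follows. This is a deliberately crude illustration, not CODD itself: the regular expression, the name normalisation, and the function names `extract_authors` and `scan_project` are all assumptions, and a real credit scanner would need many more heuristics to cope with the variety of formats found in practice.

```python
import re

# Hypothetical pattern for lines such as
#   "Copyright (C) 1999, 2000 Free Software Foundation, Inc."
#   "Copyright 1998 by Jane Doe <jane@example.org>"
CREDIT_RE = re.compile(
    r"copyright[ \t]*(?:\(c\))?[ \t]*[\d \t,\-]*(?:by[ \t]+)?"
    r"(?P<name>[A-Za-z][^<\n]*)",
    re.IGNORECASE,
)


def extract_authors(text):
    """Return normalised author names found in copyright lines of `text`."""
    authors = []
    for match in CREDIT_RE.finditer(text):
        # Trim comment delimiters and trailing punctuation, then lowercase.
        name = match.group("name").strip(" \t.*/").lower()
        if name and name not in authors:
            authors.append(name)
    return authors


def scan_project(sources):
    """sources: mapping of file name -> file text.

    Returns a list of (file name, size in bytes, [authors]) records.
    """
    return [
        (name, len(text.encode("utf-8")), extract_authors(text))
        for name, text in sources.items()
    ]


# Hypothetical project with two credited files and one uncredited file.
project = {
    "hello.c": "/* Copyright (C) 1999, 2000 Free Software Foundation, Inc. */\n"
               "int main(void) { return 0; }\n",
    "util.pl": "# Copyright 1998 by Jane Doe <jane@example.org>\nprint 1;\n",
    "README": "No credits here.\n",
}
records = scan_project(project)
for record in records:
    print(record)
```

Each record pairs a file's size with the names credited in it, which is the raw material from which per-author byte counts and project lists can then be tallied.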
In this survey, very basic analysis has been performed. The next survey will broaden the scope of analysis to include features such as the degree of cross-participation between projects and groups of authors.
The next survey - planned for June, 2000 - will also use a bigger code base. At the very least the code base will expand to include Sourceforge [http://sourceforge.net], OpenBSD [http://openbsd.org] and Perl CPAN libraries [http://cpan.org].
As the survey continues and becomes more frequent, we plan to track changes in the code base over time (including historical perspectives using older versions of, say, the Linux kernel) and monitor movement between projects and groups.
About the Authors
Rishab Aiyer Ghosh is Co-Programme Leader of the e-Basics Research Unit at the International Institute of Infonomics, a venture of Maastricht University supported by the European Commission. He has been programme leader at the Institute since its founding in January 2000, though he is currently still working out of New Delhi, India.
In 1994 Rishab developed the "Cooking-Pot Market" model of Internet Economics, a system of non-formal, transaction-less barter. He was invited by the European Commission to speak on information economics at the first Information Society Conference under their IST programme in 1998. Since January 1999, as an advisor to the EC Brussels headquarters for the Universal Information Ecosystems/Future Emerging Technologies programme, he has provided inputs on their policy for promoting research in European companies.
Vipul Ved Prakash is a freelance hacker and network consultant. He has several years of experience developing and deploying mission-critical networked applications on the Unix platform. Vipul developed remote administration software for Silicon Graphics' failsafe servers, a high-availability Internet server solution. In 1996, Vipul developed Sense/NET, a proxy sockets meta-ISP, to provide non-censorable TCP/IP access over unclean telnet links. Presently, Vipul is heading the development of an e-commerce engine at sixcones.com, an online hypermall startup funded by Vinod Khosla.
Vipul is a network security expert and cryptography enthusiast. He has implemented various cryptographic primitives and algorithms as open-source Perl software. He publishes and maintains munitions, a comprehensive archive of cryptography software for the Linux operating system, which is redundantly distributed over more than 10 servers around the world.
In 1998, Vipul developed ricochet, an automated agent for tracking and reporting email spam. Ricochet is used by thousands of people and organizations to combat spam throughout the net. Working towards a proactive solution against spam, Vipul has created Razor, a distributed, collaborative, spam detection and filtering network designed to exploit the broadcast characteristic of spam distribution to throttle its propagation. Razor will be released in May 2000.
Apart from fighting spam, Vipul is interested in more general issues of network ostracism, collaborative filtering, and content rating. He is fascinated by emergent value systems and self-organizing dynamics of networked communities.
1. Orbiten. CODD documentation.
2. Rishab Aiyer Ghosh, 1998. "Cooking Pot Markets: An Economic Model for the Trade in Free Goods and Services on the Internet," First Monday, volume 3, number 3 (March), at http://www.firstmonday.org/issues/issue3_3/ghosh/index.html
3. Rishab Aiyer Ghosh. "Identifying, tracking and measuring activity in cooking-pot networks."
Paper received 1 May 2000; accepted 10 May 2000.
Copyright ©2000, First Monday
The Orbiten Free Software Survey by Rishab Aiyer Ghosh and Vipul Ved Prakash
First Monday, volume 5, number 7 (July 2000),