Open Source software engineering - The state of research

The following commentary is part of First Monday's Special Issue #2: Open Source.

Introduction

In recent years, considerable research effort has been devoted to free/libre/open source software. Of special interest to scholars worldwide has been whether its development process and the resulting systems differ significantly from software developed in a commercial, corporate environment, for which methods and tools aimed at improving speed, quality and overall efficiency have been developed over decades. The advent of open source software has led to considerable discussion on whether this body of knowledge is called into question, or whether its main ideas remain valid in the new setting (e.g. Fuggetta, 2004). The very first work on open source software engineering, Raymond's seminal paper 'The Cathedral and the Bazaar' (1998), already approached the subject by means of such a comparison. We will present and discuss some of the available findings and remaining questions on open source software engineering and its relation to 'traditional' software development and software engineering research. The discussion is broadly divided between software process and software product: we first detail questions pertaining to the elements of the software process, including the development team, and then turn to the resulting product to explore whether it differs from the output of other forms of development. Before that, we introduce one of the key reasons for many of the recent advances in the understanding of open source projects: the public availability of large amounts of data about them.

This paper does not intend to be exhaustive or to cover every current advance. Our aim is only to provide a glimpse of those we find most relevant. Any other researcher in the field would surely have chosen a different view of the landscape.

The importance of the public availability of information

One of the key characteristics of open source software development is that it is usually done in the open. This means that huge quantities of information about the processes and the products are publicly available on the Internet. In addition, large parts of that information are organized so that they can be retrieved and analyzed in a fairly automated way. This fact, which at first sight might seem to have little to do with the state of research in the area, is on the contrary fundamental, and is perhaps the single characteristic with the greatest influence on research on open source software development. The public availability of information makes possible both more complete, in-depth empirical research and the reproducibility of results.

For almost any open source software project, not only is all the source code available (usually from a version control repository, which also tracks the interactions of developers with the source code); there are also mailing lists (where many decisions are taken in the open, feedback is sent and technical discussions are common), bug reporting systems, forums, etc. With such an amount of information, the development process can be tracked in great detail, and comparative studies covering many projects are possible. From the research perspective, obtaining complete data sets with historical information about the development of a large number of projects is possible and even (comparatively) easy. Since the source data is available to anyone, any group can also reproduce a previous study, perhaps later in time and with a larger data set, to validate or improve it, which greatly strengthens the validity of the results and the quality of the findings. The potential of this availability of information is something that we are still learning to use, and it is surely at the root of many of the advances mentioned in this paper.

Software Process

Most prior studies have found a certain organisation and division of work within open source software project teams. While more people are involved than in traditional organisational forms, there is a relatively small 'inner circle' of programmers responsible for most of the output (e.g. Mockus et al., 2002; Koch and Schneider, 2002; Ghosh and Prakash, 2000; Koch, 2004). This small inner core of programmers is surrounded and assisted by a much larger number of people who contribute by participating in discussions, maintaining web sites or performing other tasks. This organisation of work seems to be similar to the 'chief programmer team organisation' or 'surgical team' proposed decades ago (Mills, 1971; Brooks, 1995). Research is currently also under way on the evolution of these volunteer teams and their structure (Robles et al., 2005).
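The concentration of output in a small inner circle can be quantified from a version control log. The following is a minimal sketch, not the method of any of the cited studies: it assumes the log has already been reduced to one author name per commit, and the names `top_share`, `gini` and the sample data are hypothetical.

```python
from collections import Counter


def top_share(commit_authors, fraction=0.1):
    """Share of all commits made by the most active `fraction` of authors."""
    counts = sorted(Counter(commit_authors).values(), reverse=True)
    if not counts:
        raise ValueError("no commits given")
    # At least one author is always counted as part of the core.
    core_size = max(1, round(len(counts) * fraction))
    return sum(counts[:core_size]) / sum(counts)


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted formula for data sorted in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical commit log: one author name per commit (illustrative data only).
log = ["ana"] * 60 + ["ben"] * 25 + ["cy"] * 5 + ["dee"] * 4 + ["eli"] * 3 + ["fay"] * 3

print("share of top 20% of authors:", top_share(log, fraction=0.2))
print("Gini of commits per author:", round(gini(Counter(log).values()), 3))
```

A high top share together with a high Gini coefficient would correspond to the 'inner circle' pattern described above. Commits are only one possible measure of output; lines of code changed could be substituted without altering the functions.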

Considerable software engineering research has also focused on the manpower distribution in software projects (Norden, 1960; Putnam, 1978). For the GNOME project, it has been shown that the number of active programmers follows this model surprisingly closely (Koch and Schneider, 2002), at least until the time of the first release. After the first release, the model's assumption of a finite set of problems to be solved might prove problematic. First results seem to confirm that the standard model, which does not account for the addition of new features during the life cycle, performs badly and is surpassed by adapted models.
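The manpower model of Norden and Putnam is usually written as a Rayleigh curve, m(t) = 2 K a t exp(-a t^2), where K is the total effort and a determines when staffing peaks. The sketch below is an illustration under that standard formulation; the function names and the numbers passed to them are hypothetical and are not taken from the GNOME study.

```python
import math


def rayleigh_staffing(t, total_effort, peak_time):
    """Manpower at time t under the Norden-Rayleigh curve."""
    # Shape parameter chosen so that staffing peaks exactly at peak_time.
    a = 1.0 / (2.0 * peak_time ** 2)
    return 2.0 * total_effort * a * t * math.exp(-a * t * t)


def cumulative_effort(t, total_effort, peak_time):
    """Effort expended up to time t; approaches total_effort as t grows."""
    a = 1.0 / (2.0 * peak_time ** 2)
    return total_effort * (1.0 - math.exp(-a * t * t))


# Hypothetical project: 100 person-months in total, staffing peak at month 12.
for month in (6, 12, 24, 48):
    staff = rayleigh_staffing(month, 100.0, 12.0)
    spent = cumulative_effort(month, 100.0, 12.0)
    print(f"month {month:2d}: staff {staff:5.2f}, cumulative effort {spent:6.2f}")
```

Because the cumulative effort converges to the fixed total K, the curve encodes the assumption of a finite set of problems. A project that keeps adding features after its first release has no such fixed total, which is one way to read the poor fit of the standard model mentioned above.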

A major argument against open source software development is that, according to Brooks's Law (Brooks, 1995), an increasing number of participants will decrease productivity because communication costs grow much faster than the team itself. Surprisingly, Koch (2004) has shown that in the SourceForge.net data set, the correlation between the output per person and the number of active programmers is next to non-existent. This leads to the interesting conclusion that Brooks's Law possibly does not apply. There are several possible explanations for this, including very strict modularization, which increases the possible division of labour while reducing the need for communication.
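Brooks's argument is often illustrated by counting pairwise communication paths, n(n-1)/2 for a team of n people, and the empirical check described above amounts to correlating team size with output per person. The following sketch shows both calculations; it is not the analysis of Koch (2004), and the project figures are invented purely for illustration.

```python
import math


def communication_paths(n):
    """Number of pairwise communication channels in a team of n people."""
    return n * (n - 1) // 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


# Hypothetical projects: (active programmers, total lines of code produced).
projects = [(2, 9000), (5, 21000), (8, 39000), (15, 66000), (30, 140000)]

team_sizes = [size for size, _ in projects]
output_per_person = [loc / size for size, loc in projects]

for size in team_sizes:
    print(size, "programmers ->", communication_paths(size), "communication paths")
print("correlation of team size and output per person:",
      round(pearson(team_sizes, output_per_person), 3))
```

If Brooks's Law held strongly, one would expect a clearly negative correlation; a value near zero would match the finding reported above.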

The main indicator of how the open source software development model compares with the traditional, proprietary one is the effort spent. As not even project leaders know how much time their participants spend, this effort needs to be estimated. For this task, software engineering research has developed several methods, including the well-known COCOMO (Boehm, 1981). Using this model, GNU/Linux distributions in particular have been estimated at nearly 8,000 person-years (Red Hat 7.1) or even 60,000 person-years (Debian 3.1) by Amor et al. (2005). Koch (2004) applied several different models and concluded that models based on output metrics (e.g. lines of code), like COCOMO, yield distinctly higher effort estimates than participation-based models. Either the development is indeed more efficient, due to self-selection for tasks, absence of management overhead etc., or the difference in effort, which amounts to about 95 percent, is expended by people discussing on mailing lists, reporting bugs, maintaining web sites and the like.
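An output-based estimate of this kind can be sketched with the basic COCOMO equation, effort = a * KLOC^b person-months, using the coefficients Boehm (1981) gives for the three basic project modes. This is only an illustration: the text does not state which COCOMO variant or parameters the cited studies used, and the code size below is hypothetical.

```python
# Basic COCOMO coefficients (a, b) from Boehm (1981), effort in person-months.
COCOMO_BASIC = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def cocomo_effort_person_months(sloc, mode="organic"):
    """Basic COCOMO effort estimate for `sloc` source lines of code."""
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    a, b = COCOMO_BASIC[mode]
    kloc = sloc / 1000.0
    return a * kloc ** b


def person_years(person_months):
    """Convert person-months to person-years (12 months per year)."""
    return person_months / 12.0


# Hypothetical code base of 50,000 source lines (not a figure from the text).
for mode in COCOMO_BASIC:
    months = cocomo_effort_person_months(50_000, mode)
    print(f"{mode:13s}: {months:7.1f} person-months, {person_years(months):5.1f} person-years")
```

Because the exponent b is greater than one, the estimate grows faster than the code size, so applying the equation to a whole distribution at once gives a different result from summing estimates for its individual packages. How a distribution is partitioned is therefore an assumption that should be stated alongside any such figure.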

Software Product

One of the most controversial topics regarding open source software products is the quality they can reach. Samoladas et al. (2004) presented an analysis of five open source projects and found that code quality appears to be at least equal to, and sometimes better than, that of closed-source projects, but that it also seems to suffer from the same problems of maintainability deterioration. Many projects do apply considerable mechanisms for quality assurance and defect handling (Zhao and Elbaum, 2003; Koru and Tian, 2004).

The evolution of software systems has long drawn the attention of software engineering researchers (Belady and Lehman, 1976), who formulated the laws of software evolution. These laws entail a continual need to adapt a system, which increases its complexity and therefore leads to a declining average incremental growth. The first comparable studies on open source software products have been inconclusive: Godfrey and Tu (2000) found super-linear growth in the Linux kernel, while Paulson et al. (2004) did not find any differences between open and closed-source software projects. Koch (2005) found that, while the mean growth rate decreases over time in accordance with the laws of software evolution, larger projects in particular, with a higher number of participants and higher inequality in the distribution of work, might more often be able to sustain super-linear growth.
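One simple way to tell declining from super-linear growth is to fit a power law, size ~ k * t^p, to a series of size measurements and inspect the exponent p. The sketch below does this with a least-squares fit on logarithms. The power-law form is an assumption chosen here for illustration and is not necessarily the model used in the studies cited above; the sample series are synthetic.

```python
import math


def growth_exponent(times, sizes):
    """Least-squares slope of log(size) on log(time): the p in size ~ k * t**p."""
    if len(times) != len(sizes) or len(times) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    if min(times) <= 0 or min(sizes) <= 0:
        raise ValueError("times and sizes must be positive")
    xs = [math.log(t) for t in times]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all time points are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify(p, tolerance=0.05):
    """Label the growth pattern implied by exponent p."""
    if p > 1 + tolerance:
        return "super-linear"
    if p < 1 - tolerance:
        return "sub-linear (declining incremental growth)"
    return "approximately linear"


# Synthetic size series, measured at releases 1..10 (illustrative only).
releases = list(range(1, 11))
slowing = [1000 * math.sqrt(t) for t in releases]
accelerating = [1000 * t ** 1.5 for t in releases]

for name, series in (("slowing", slowing), ("accelerating", accelerating)):
    p = growth_exponent(releases, series)
    print(f"{name}: p = {p:.2f} -> {classify(p)}")
```

An exponent below one corresponds to the declining incremental growth expected under the laws of software evolution, while an exponent above one corresponds to the super-linear growth reported for some projects.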

Conclusion

As shown, the results on open source software engineering are not yet conclusive in many areas. One cause for this might be the huge variation among projects, which range from very large to very small, and from quite plan-oriented, with documented processes (most often for release management), to extremely bazaar-like in nature.

In the near future, researchers still have many intriguing questions to explore: Does Brooks's Law apply to open source software development, under which circumstances, or why not? What are the implications of this? Is open source software development a way of producing better software more efficiently, or is an enormous effort just invisibly expended? How can we estimate the effort actually spent in an open source project, and how can we measure the impact of the community surrounding the core development team? Which differences between open source projects can be found regarding organisation, processes and products, and how do they relate to the projects' success? Considering that software engineering has studied software development for decades, and that many projects still experience problems up to and including complete failure, open source software engineering will remain an interesting topic for many more years to come.

About the Authors

Stefan Koch is Assistant Professor of Information Business at the Vienna University of Economics and Business Administration. His research interests include cost estimation for software projects, the open source development model and the evaluation of benefits from information systems.

Jesus M. Gonzalez-Barahona works as an Associate Professor at Universidad Rey Juan Carlos (Madrid, Spain). His research interests include quantitative analysis of software projects, libre software development models and implications of the use and production of libre software.

References

J.J. Amor, J.M. Gonzalez-Barahona, G. Robles, and I. Herraiz, 2005. "Measuring Libre Software using Debian 3.1 (Sarge) as a Case Study: preliminary results," Upgrade, volume VI, number 3, pp. 13-16, at http://www.upgrade-cepis.org/issues/2005/3/up6-3Amor.pdf, accessed 29 July 2005.

L.A. Belady and M.M. Lehman, 1976. "A model of large program development," IBM Systems Journal, volume 15, number 3, pp. 225-252.

B.W. Boehm, 1981. Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall.

F.P. Brooks Jr., 1995. The Mythical Man-Month: Essays on Software Engineering. Anniversary ed., Reading, Mass.: Addison-Wesley.

A. Fuggetta, 2004. "Open Source and Free Software: a New Model for the Software Development Process?" Upgrade, volume V, number 5, pp. 22-26, at http://www.upgrade-cepis.org/issues/2004/5/up5-5Fuggetta.pdf, accessed 29 July 2005.

R. Ghosh and V.V. Prakash, 2000. "The Orbiten Free Software Survey," First Monday, volume 5, number 7 (July), at http://www.firstmonday.org/issues/issue5_7/ghosh/, accessed 25 July 2005.

M.W. Godfrey and Q. Tu, 2000. "Evolution in Open Source software: A case study," Proceedings of the International Conference on Software Maintenance (ICSM 2000), pp. 131-142.

S. Koch, 2004. "Profiling an Open Source Project Ecology and Its Programmers," Electronic Markets, volume 14, number 2, pp. 77-88.

S. Koch, 2005. "Evolution of Open Source Software Systems - A Large-Scale Investigation," Proceedings of the First International Conference on Open Source Systems, pp. 148-153.

S. Koch and G. Schneider, 2002. "Effort, Cooperation and Coordination in an Open Source Software Project: Gnome," Information Systems Journal, volume 12, number 1, pp. 27-42.

A.G. Koru and J. Tian, 2004. "Defect Handling in Medium and Large Open Source Projects," IEEE Software, volume 21, number 4 (July/August), pp. 54-61.

A. Mockus, R. Fielding, and J. Herbsleb, 2002. "Two case studies of open source software development: Apache and Mozilla," ACM Transactions on Software Engineering and Methodology, volume 11, number 3, pp. 309-346.

H.D. Mills, 1971. "Chief Programmer Teams: Principles and Procedures," Report FSC 71-5108, IBM Federal Systems Division, Gaithersburg, Maryland.

P.V. Norden, 1960. "On the anatomy of development projects," IRE Transactions on Engineering Management, volume 7, number 1, pp. 34-42.

J.W. Paulson, G. Succi, and A. Eberlein, 2004. "An empirical study of open-source and closed-source software products," IEEE Transactions on Software Engineering, volume 30, number 4, pp. 246-256.

L.H. Putnam, 1978. "A general empirical solution to the macro software sizing and estimating problem," IEEE Transactions on Software Engineering, volume 4, number 4, pp. 345-361.

E. Raymond, 1998. "The Cathedral and the Bazaar," First Monday, volume 3, number 3 (March), at http://firstmonday.org/issues/issue3_3/raymond/, accessed 22 July 2005.

G. Robles, J.M. Gonzalez-Barahona, and M. Michlmayr, 2005. "Evolution of Volunteer Participation in Libre Software Projects: Evidence from Debian," Proceedings of the First International Conference on Open Source Systems, pp. 100-107.

I. Samoladas, I. Stamelos, L. Angelis, and A. Oikonomou, 2004. "Open source software development should strive for even greater code maintainability," Communications of the ACM, volume 47, number 10 (October), pp. 83-87.

L. Zhao and S. Elbaum, 2003. "Quality assurance under the open source development model," The Journal of Systems and Software, volume 66, pp. 65-75.

Copyright ©2005, First Monday

Copyright ©2005, by Stefan Koch and Jesus M. Gonzalez-Barahona

Open Source software engineering: The state of research
First Monday, Special Issue #2: Open Source — 3 October 2005
https://firstmonday.org/ojs/index.php/fm/article/download/1466/1381