First Monday

Measuring the development and communication of open design communities: The case of the OpenAg Initiative by Rodrigo Argenton Freire and Evandro Ziggiatti Monteiro

Open collaborative development and transparent design processes are often associated with the concept of open design (OD). Studies of remote collaborative processes are still recent, and many aspects of OD remain unclear. This study explores an OD project by mining data in collaboration platforms. As our research object, we selected the Open Agriculture Initiative. Data were mined from its online forum and from Github, a development platform. Social network analysis (SNA) and topic modeling techniques were used to explore four research questions. We comment on these questions highlighting differences between both platforms, stakeholder participation and personal interests, community changes over time, activity volume and latent topics. Finally, we conclude by indicating possible pathways to investigate OD as an emergent phenomenon by using data mining techniques.


1. Introduction
2. Methods and tools
3. Results
4. Discussion
5. Conclusion



1. Introduction

Open design (OD) enthusiasts and researchers often define OD as a collaborative development process in which outcomes are publicly shared for anyone to produce/use, study, modify and distribute (Aitamurto, et al., 2015; Boisseau, et al., 2018). It is also accepted that openness, as a metric, is gradual and multifactorial. That is, OD projects can be more or less open, depending on how collaborative and accessible the development process is, how robust and available the outcomes (source documentation) are, and how replicable the project is (Balka, et al., 2014; Bonvoisin and Mies, 2018). Collaboration, therefore, is one of the most critical aspects of achieving fully open projects. Previous studies have mapped different online collaboration processes in open source software (OSS) development (Lerner and Tirole, 2003; von Hippel and Krogh, 2003; Osterloh and Rota, 2007) and Wiki communities (Aaltonen and Seiler, 2016) in order to understand them. More recently, studies in open source hardware and OD have explored the structure of these communities and development processes by using both quantitative and qualitative approaches, such as interviews (Malinen, et al., 2010; Ferdinand, 2018), participant observations (Macul and Rozenfeld, 2015) and data mining of online platforms, such as Github (Menichinelli, 2017; Bonvoisin, et al., 2018).

Until now, studies using data mining techniques have focused mainly on online repositories. A well-known example of this type of repository is Github. Github enables users to perform commits (revision/contribution) to project files as well as to track versions. These studies have provided interesting information about interactions between users, the influence and importance of actors, and activity volume (Menichinelli, 2017). However, one hurdle of this approach is that Github “does not capture all product development activity happening in a project” (Bonvoisin, et al., 2018).

From our perspective, it also does not provide evidence of an important aspect of these communities — communication. We consider that Github has a limited structure for users to communicate, being limited to reports and descriptions of commits. In this sense, it is possible to identify OD projects that adopt both Github for the development process and a different type of platform, such as forums, for communication between users. Some examples of these projects are: (i) RepRap, an OS 3D printer; (ii) OpenAgriculture Foundation, aimed at the development of personal food computers (PFCs); (iii) OpenROV, a remote-operated underwater robot; and (iv) Maslow, a large CNC cutting machine. We consider that in the study of OD phenomena, it is important to understand not only the actual development process, e.g., revisioning files, but also the communication processes within a community.

In this study, we aimed at two particular outputs. First, we wanted to understand what kind of information we can obtain by mining data from the communication platforms (forums) of a particular project. We elected to examine the OpenAgriculture Foundation (OpenAg) as our object of analysis. Besides having user activity on both types of platforms (Github and the Forum), the reasons for choosing this particular case are related to a bigger research project which aims to investigate whether OD can help address global challenges with local implications. From our perspective, OpenAg is directly linked to food supply, a topic that we wanted to investigate. A secondary output refers to what kind of assumptions we can make about this particular project, based on the information from data mining processes. There are four questions guiding our study:

RQ1: Do OpenAg and Github perform equally in terms of collaboration and decision-making processes?

RQ2: Do users’ importance and network structure change over time? Can we identify the most important reasons for users joining the project?

RQ3: Why do single-time users participate in the community and how do they affect activity volume?

RQ4: Can we identify important discussion topics based on topic modeling tools?

The following sections provide a general overview of two particular tools that we adopted in this study: Social network analysis (SNA) and natural language processing (NLP). In Section 2, we present methods that we adopted to secure data and perform analyses. Section 3 presents results for each RQ which are then discussed in Section 4. We also highlight the limiting factors of our study and propose new research questions for further investigation (4.2). The main outcomes of this article are related to the possibilities that mining techniques present for social network analysis. By addressing one particular forum, we introduce a new perspective to understand communication processes outside project development platforms (such as Github). The results confirm, for example, that the type of communication that the forum enables in turn enhances democratization of the project by enabling users with different levels of knowledge and experience to participate, exchange ideas and solve specific issues. From a practical perspective, the results offer valuable information for project “owners”, i.e., those who initiate a particular OD project, to identify the health of their community, track topics and possibly, increase user participation.

1.1. Social network analysis (SNA)

Although SNA has gained attention in the last few years given the rise of information and communication technologies (ICTs), its use can be traced back to the 1930s in social psychology, urban sociology and mathematics (Fredericks and Durland, 2005). The sociogram, created by the social psychologist Moreno, was built based on topological notions from graph theory (Barnes, 1969) and consisted of a method for representing social relationships as points and lines. More recently, SNA has been applied to a wide range of topics, such as political polarization in social media/networks (Gruzd and Roy, 2014), consumer behavior (Sitko-Lutek, et al., 2010) and group behavior in sports (Lusher, et al., 2010). It has been adopted to understand user interactions in OS platforms (Shen and Monge, 2011), investigate the evolution of networks and their relations to product development (Le and Panchal, 2012), investigate transparency and activity volume (Bonvoisin, et al., 2018) and map the geographical distribution of users (Heller, et al., 2011).

Early studies have also identified basic structural characteristics of networks such as density, centrality and isolation (Fredericks and Durland, 2005). These metrics come from graph theory and are chosen according to the phenomenon under study. They fall into two categories. First, global measures indicate the global properties of a network and are, therefore, represented by a single value. Second, nodal measures refer to the properties of nodes and have individual values for each node (Mijalkov, et al., 2017). In the study proposed by Bonvoisin, et al. (2018), for instance, the authors computed global centrality indexes (the variation in the relative importance of all nodes in a graph) and clustering indexes (the degree to which nodes tend to cluster together). Other measures indicate (i) the importance of each node in the community, e.g., eigenvector centrality; (ii) the extent to which a graph can be divided into clear categories, e.g., modularity; and (iii) the number of connections each node has, e.g., degree.
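To make the distinction between nodal and global measures concrete, the following is a minimal sketch in plain Python. It is not the tooling used in this study (the analysis relied on Gephi); the function names are hypothetical, the eigenvector centrality is a simple power iteration scaled so that the top node scores 1.0, and the modularity is the standard Newman formulation evaluated for a partition supplied by the caller.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return an undirected adjacency map {node: set(neighbours)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def degree(adj):
    """Nodal measure: number of connections of each node."""
    return {node: len(neigh) for node, neigh in adj.items()}


def eigenvector_centrality(adj, iterations=200, tol=1e-9):
    """Nodal measure: power iteration, scaled so the top node scores 1.0."""
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # Iterating with A + I (own score plus neighbours' scores) avoids
        # oscillation on bipartite graphs and keeps the same eigenvectors.
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        top = max(new.values())
        new = {n: s / top for n, s in new.items()}
        if max(abs(new[n] - score[n]) for n in adj) < tol:
            return new
        score = new
    return score


def modularity(adj, community_of):
    """Global measure: Newman modularity of a given partition."""
    m = sum(len(neigh) for neigh in adj.values()) / 2  # number of edges
    inside = defaultdict(float)  # edges with both ends in the community
    total = defaultdict(float)   # summed degree of the community
    for u, neigh in adj.items():
        total[community_of[u]] += len(neigh)
        for v in neigh:
            if community_of[u] == community_of[v]:
                inside[community_of[u]] += 0.5  # each edge is visited twice
    return sum(inside[c] / m - (total[c] / (2 * m)) ** 2 for c in total)
```

For two triangles joined by a single edge, the two bridging nodes obtain the highest eigenvector centrality and a degree of three, while the whole graph receives one modularity value for the two-triangle partition. This mirrors the nodal/global split described above.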

1.2. Natural language processing (NLP)

The history of NLP can be traced to computer translation experiments during WWII. It refers to the use of “(...) computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis” (Liddy, 2001). One of the approaches used in NLP is topic modeling, a statistical method for identifying latent topics across a set of documents. Different techniques have been developed to perform topic modeling of documents, especially after the emergence of electronic documents, e.g., books, e-mail messages and reports. Dumais, et al. (1988), for instance, introduced the latent semantic indexing (LSI) approach with the aim of improving information retrieval by automatically organizing textual documents into latent topics. Blei, et al. (2003) proposed latent Dirichlet allocation (LDA) to improve on previous techniques, such as LSI. One of the major benefits of the LDA approach is that it is based on probabilistic modeling, which means that textual documents can be represented (probabilistically) by different latent topics. Other examples of techniques are probabilistic LSI (pLSI) (Hofmann, 1999) and the hierarchical Dirichlet process (HDP) (Teh, et al., 2005).

The applications of such methods are vast. Authors have used topic modeling to investigate differences between similar research concepts (D’Amato, et al., 2017) and identify research trends (Sugimoto, et al., 2011). Others applied it to the text of newspapers to identify major topics over time (Nelson, 2010; Yang, et al., 2011) as well as to study discussion forums (Ezen-Can, et al., 2015). In OS studies, topic modeling has been applied mainly to categorize bug reports (Somasundaram and Murphy, 2012), identify duplicate reports (Hindle, et al., 2016), identify informal project requirements (Vlas and Robinson, 2012) and classify user requests (Li, et al., 2018).



2. Methods and tools

The OpenAgriculture Initiative (OpenAg) was an open source community initiated at MIT’s Media Lab in January 2015. The project aimed at “building an ecosystem of food technologies to create healthier, more engaging and more inventive food systems”. The Initiative had different projects related to the optimization of crops in controlled environments and the design of food growing platforms. The personal food computer (PFC) is an example of a platform. It consisted of “a tabletop-sized, controlled environment agriculture technology platform that uses robotic systems to control and monitor climate, energy, and plant growth inside of a specialized growing chamber.” Since 2015, five versions of the PFC were developed by the community, with the last release (version PFC 3.0) made in October 2018. The Initiative also developed an educational version of the PFC.

2.1. Data extraction

Data extraction was performed from two platforms: Github and the OpenAg Forum. Github was mostly used for active modifications in source documentation of the project while the Forum was used by users to share ideas and information, present their own work and post issues and concerns. For both platforms, we adopted mining techniques using either existing Python scripts or self-developed scripts.

2.1.1. Github data extraction

Raw data extraction was performed using GitHub’s API queried through Python scripts developed by Bonvoisin (2018) and released under an OSS license. The scripts extracted metadata related to the change history of all repositories of the project and all corresponding forks. The metadata provided information about changes in a repository file (commit), who performed the changes (committer), when the file was updated and which previous commit the change was related to. An example of a data extraction structure is presented in Figure 1. Each project may present a number (n) of different repositories. The repositories contained the files (n) and stored the revision history of each file. In the example we provide, files 1 and 2 were changed by three users and a diverging branch was created for file 1, indicating that two users (C and D) performed different modifications after B.
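As an illustration of the kind of metadata described above, the sketch below flattens items shaped like the JSON returned by GitHub's "list commits" REST endpoint into plain records and detects diverging history. It is not the script by Bonvoisin (2018); the function names are hypothetical and the example works on data already downloaded, so no network access or authentication is shown.

```python
def commit_record(item):
    """Flatten one item of a GitHub 'list commits' response into a plain dict."""
    account = item.get("committer") or {}    # linked GitHub account, may be null
    signature = item["commit"]["committer"]  # name and date stored in the commit
    return {
        "sha": item["sha"],
        "committer": account.get("login") or signature["name"],
        "date": signature["date"],
        "parents": [p["sha"] for p in item.get("parents", [])],
    }


def find_branch_points(records):
    """Return commits that are the parent of more than one commit."""
    children = {}
    for rec in records:
        for parent in rec["parents"]:
            children.setdefault(parent, []).append(rec["sha"])
    return {sha: kids for sha, kids in children.items() if len(kids) > 1}


def make_item(sha, login, when, parents):
    """Build a minimal, hypothetical API item for demonstration purposes."""
    return {
        "sha": sha,
        "commit": {"committer": {"name": login, "date": when}},
        "committer": {"login": login},
        "parents": [{"sha": p} for p in parents],
    }


# Hypothetical history mirroring Figure 1: C and D both modify the state left by B.
sample = [
    make_item("a1", "userA", "2016-04-01T10:00:00Z", []),
    make_item("b1", "userB", "2016-04-02T10:00:00Z", ["a1"]),
    make_item("c1", "userC", "2016-04-03T10:00:00Z", ["b1"]),
    make_item("d1", "userD", "2016-04-03T12:00:00Z", ["b1"]),
]
records = [commit_record(item) for item in sample]
branch_points = find_branch_points(records)
```

Here `branch_points` maps commit `b1` to its two children, which corresponds to the diverging branch after user B in the example of Figure 1.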


Figure 1: Illustration of the information provided by mining the GitHub API (based on Bonvoisin, et al., 2018).


For the OpenAg project we mined 44 existing repositories (23 archived and 21 open by the time of data extraction). We did not limit the repositories to those that were related to hardware components because we understood that both software and hardware were important for the correct functioning of OpenAg platforms. Users may contribute to different repositories as well.

2.1.2. OpenAg Forum Data extraction

As for the OpenAg Forum, data extraction was performed using Web scraping techniques. We developed four different scripts based on scrapy [1], a Python module for extracting data. Scripts were released under an OS license in an online repository (Freire, 2019). First, we collected all existing topics from April 2016 to September 2019 and their corresponding links. Second, for each topic, we collected data related to (1) the topic creator; (2) repliers; (3) textual comments; (4) reply dates; and (5) the topic category, e.g., hardware or help. Third and fourth, we obtained user data to explore (1) their affiliation; and (2) their activity on the page (first and last appearance). In the case of the Forum, collaboration between users was less evident than in Github. For that reason, we considered two possibilities for defining collaboration (Figure 2). In the first case (left), we considered that interaction occurred when one user’s comment followed someone else’s comment within the same topic (A→B, B→C, C→D). A possible limitation of that scenario was that comments might not necessarily be related to each other and that the same user might comment more than once. In the second case, we considered that interaction occurred only between the creator of a topic and a user who commented on that topic (A→B, A→C, A→D). In this case, a possible limitation is that the number of replies does not necessarily relate to the importance of the topic. For example, the “Twitter/Instagram (add yours)” topic had 57 replies and the “16 years old kid from Czech Republic is trying to build Food Computer” topic had seven replies at the moment that data was extracted. Given the content of the topics, we considered that limiting interaction to the number of replies for a specific topic could provide misleading results. We also found that, after some replies, comments were not necessarily linked to the topic itself but to other comments in the thread. Finally, after a comparison between the two cases, we selected the first to proceed with our analysis.
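The two ways of structuring Forum interactions can be sketched as follows. This is an illustrative reconstruction rather than the released scripts: the function names are hypothetical, a topic is represented simply as the ordered list of comment authors (creator first), and skipping a user's reply to themselves is an assumption made here to avoid self-loops.

```python
def sequential_edges(authors):
    """Case 1: link each comment's author to the author of the next comment."""
    edges = []
    for prev, curr in zip(authors, authors[1:]):
        if prev != curr:  # assumption: consecutive comments by one user add no edge
            edges.append((prev, curr))
    return edges


def creator_edges(authors):
    """Case 2: link the topic creator to every other user who replied."""
    creator = authors[0]
    return [(creator, a) for a in authors[1:] if a != creator]


topic = ["A", "B", "C", "D"]          # creator A, then replies by B, C and D
case_1 = sequential_edges(topic)      # [("A", "B"), ("B", "C"), ("C", "D")]
case_2 = creator_edges(topic)         # [("A", "B"), ("A", "C"), ("A", "D")]
```

Repeated pairs are kept on purpose, so that a user who comments several times contributes several edges; they can be aggregated into edge weights before a network is built.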


Figure 2: Two possible cases for structuring the data from the Forum.


As of 21 August 2019, the Forum had 1,859 subscribed users. However, only 936 users participated at least once in the Forum. We calculated the eigenvector centrality (EC) for each user and extra manual work was performed to identify the affiliations of users with a high centrality. The EC measured the importance of a node in a network. Its definition is better explained in another section (Section 2.2.1), which describes classification measurements that we adopted.

2.2. Network analysis and metrics

We adopted the Open Graph Viz Platform (Gephi [2]) for network visualization and social network analysis (SNA). We structured the data based on platform requirements, identifying users as source and target, along with their IDs and dates. Interaction between users was defined as either a subsequent reply in the same forum topic or the editing of the same file in a GitHub repository.

For the OpenAg Forum, we structured the data in four-month periods according to the timeframes presented in Table 1. Next, we developed undirected graphs for the first period and the following ones using cumulative frequency, e.g., the third graph corresponds to the sum of the first, second and third periods. Yifan Hu’s layout algorithm was used to represent the time lapse of the network for periods 1–10. The algorithm is force-directed, i.e., it uses attraction and repulsion forces acting between the bodies of a system (Hu, 2005), which allows some (but limited) inferences from the visual results. For that reason, we compared the final visual representation of the network using Yifan Hu’s layout with other graph generation techniques.
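The cumulative construction of the period graphs can be sketched as below. The exact period boundaries are those of Table 1 and are not reproduced here; the sketch assumes, for illustration only, consecutive blocks of four calendar months counted from a chosen start date, and the function names are hypothetical.

```python
from datetime import date


def period_index(day, start, months_per_period=4):
    """Index of the period a date falls in, counting whole calendar months."""
    months = (day.year - start.year) * 12 + (day.month - start.month)
    return months // months_per_period


def cumulative_edge_sets(interactions, start, n_periods):
    """Return one undirected edge set per period, each including all earlier ones."""
    per_period = [set() for _ in range(n_periods)]
    for source, target, day in interactions:
        idx = period_index(day, start)
        if 0 <= idx < n_periods:
            per_period[idx].add(frozenset((source, target)))  # undirected edge
    snapshots, running = [], set()
    for edges in per_period:
        running = running | edges  # new set, so earlier snapshots stay unchanged
        snapshots.append(running)
    return snapshots


# Hypothetical interactions: (source user, target user, date of the reply).
interactions = [
    ("A", "B", date(2016, 4, 10)),
    ("B", "A", date(2016, 5, 1)),   # same undirected edge as A-B
    ("B", "C", date(2016, 9, 1)),
    ("C", "D", date(2017, 1, 15)),
]
snapshots = cumulative_edge_sets(interactions, start=date(2016, 4, 1), n_periods=3)
```

Each snapshot can then be exported as an edge list for a visualization tool, with the third snapshot containing every edge observed in the first three periods.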


Table 1: Analysis timeframes for OpenAg data.


In addition to network visualization, other metrics can be used to understand and classify network evolution during a given timeframe. We calculated two topological indicators for each timeframe:

We also generated the Github network based on Yifan Hu’s algorithm and calculated the EC and modularity values in order to compare them with the results from the OpenAg Forum. We generated a single network based on the activity period from April 2016 to September 2019.

2.2.1. Activity volume and users contribution

The activity volume for the OpenAg Forum and Github was calculated using the number of replies and the number of file changes, respectively, as reference units. Although the two units are not numerically comparable, such calculations enabled us to observe whether higher activity volumes on one platform were reflected on the other platform and whether activity volume tended to increase or decrease during a specific period of analysis.
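Because replies and file changes are different units, one way to place them on a common scale is to express each month's count as a percentage of that platform's own total, which is how Figure 3 is presented. The sketch below assumes ISO-formatted date strings and a monthly grouping; both the function name and the granularity are illustrative assumptions.

```python
from collections import Counter


def volume_share(timestamps):
    """Map 'YYYY-MM' to the percentage of all events that fall in that month."""
    counts = Counter(ts[:7] for ts in timestamps)  # first 7 chars: year and month
    total = sum(counts.values())
    return {month: 100.0 * n / total for month, n in sorted(counts.items())}


# Hypothetical event dates for the two platforms.
forum_replies = ["2016-04-02", "2016-04-20", "2016-05-01", "2016-05-03"]
github_commits = ["2016-04-11", "2016-05-02", "2016-05-09", "2016-05-30"]

forum_share = volume_share(forum_replies)
github_share = volume_share(github_commits)
```

With these hypothetical dates the Forum is split evenly between the two months, while three quarters of the commits fall in May, so the two curves can be compared even though their raw counts measure different things.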

Regarding user contributions, we explored both (1) their affiliation; and (2) whether user activity in one platform was reflected in activity in the other platform. In order to identify user affiliation, we first collected information from their profiles and defined a set of representative groups, as follows:

This first round enabled us to classify 56 users out of the 70 most important users (according to the EC) in the Forum. Based on this classification, we adopted the following keywords for searches in user posts: “work”, “study”, “student”, “school”, “teach(er)”, “company”, “startup”, “hobby”, “I am”. Related posts were read in their totality and, when they provided sufficient information, an affiliation was assigned to a given post creator. At this stage, 82 Forum users were classified, totaling 138 users (out of 936). For Github, we classified 12 users (out of the 39 most active) based on their profiles. Concerning user mutual activity in both platforms, we first searched for similar usernames and matched those which belonged to the same users. Finally, we manually checked user profiles in Github and matched them to their corresponding users in the Forum community.
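The keyword search used to shortlist posts for manual reading can be approximated as follows. The keywords are the ones listed above; matching whole words case-insensitively is an assumption of this sketch, since the matching rule is not specified, and the function name is hypothetical. The affiliation itself was still assigned by reading the posts.

```python
import re

# Keywords from the text; "teach(er)" is expressed as an optional suffix.
KEYWORDS = ["work", "study", "student", "school", r"teach(?:er)?",
            "company", "startup", "hobby", "I am"]
PATTERN = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def candidate_posts(posts):
    """Return the (author, text) pairs whose text mentions at least one keyword."""
    return [(author, text) for author, text in posts if PATTERN.search(text)]


# Hypothetical posts.
posts = [
    ("u1", "I am a teacher in a public school."),
    ("u2", "My pump leaks after two days."),
    ("u3", "We run a small startup and would like to build a PFC."),
]
shortlist = candidate_posts(posts)
```

Only the posts by `u1` and `u3` are shortlisted here. Whole-word matching keeps a word such as "homework" from triggering the "work" keyword, at the cost of missing inflected forms such as "working".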

2.2.2. Topic modeling

Latent Dirichlet allocation (LDA) is an effective method for classifying and clustering textual data (topic modeling) for a large number of documents. It identifies underlying topics in text (documents), describes them as a distribution over terms and calculates the probabilities that a document belongs to different topics, i.e., each document might be associated with one or more topics. LDA has been successfully applied to large texts, such as in bibliometric analysis (D’Amato, et al., 2017); however, it has also proved effective for shorter texts such as those found on Twitter (Hong and Davison, 2010). In order to run an LDA analysis, one must specify the desired number of topics. Given that the number of topics affects the quality of the model, it is important to evaluate the quality for different possibilities. By measuring topic coherence for different numbers of topics (range=2–40), we settled on eight topics for our model (Table 2).
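The selection of the number of topics amounts to a scan over candidate values. The sketch below shows only that scan: `coherence_for` is a hypothetical callable standing in for "train an LDA model with k topics and return its coherence score", and the toy scoring function used in the example is invented purely to exercise the code.

```python
def best_topic_count(coherence_for, candidates=range(2, 41)):
    """Score every candidate number of topics and return the best one with all scores."""
    scores = {k: coherence_for(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def toy_coherence(k):
    """Invented stand-in for a real coherence measurement, peaking at k = 8."""
    return 0.45 - 0.01 * abs(k - 8)


best_k, all_scores = best_topic_count(toy_coherence)
```

In practice the highest score is not always chosen blindly; inspecting the returned scores helps to judge whether a smaller, more interpretable number of topics performs almost as well.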

We applied LDA to text (5,832 posts) mined from the OpenAg Forum. Several analyses were performed before we could achieve a coherence score higher than the average that we found previously (0.3969). Textual processing methods were applied to (a) segment each reply content to a list of words (tokenization); (b) group together the inflected forms of a word (lemmatization); (c) remove punctuation and irrelevant words (stop words); and (d) associate each reply content to a document. After performing LDA, we evaluated topic distribution for each document and compared results to the corresponding textual content of the Forum (replies).
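Steps (a) to (d) can be illustrated with a small, dependency-free sketch. It is deliberately simplified: the stop-word list is a tiny illustrative sample, and `naive_lemma` is a crude suffix stripper standing in for a real lemmatizer, so its output will differ from that of a linguistic tool. All names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "are", "i", "my", "it"}


def naive_lemma(token):
    """Rough stand-in for lemmatization: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(reply):
    """Turn one reply into a list of normalized tokens (one document)."""
    tokens = re.findall(r"[a-z]+", reply.lower())  # tokenization, drops punctuation
    kept = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_lemma(t) for t in kept]              # crude lemmatization


replies = ["The sensors are reading humidity.", "My lights and fans work!"]
documents = [preprocess(r) for r in replies]  # each reply becomes one document
```

The first reply is reduced to the tokens `sensor`, `read` and `humidity`, which is the kind of token list a topic model takes as a document.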


Table 2: Optimal number of topics for LDA analysis of Forum content.




3. Results

RQ1: Do OpenAg and Github perform equally in terms of collaboration and decision-making processes?

Figure 3 shows the percentage (of the total) of comments in the OpenAg Forum and the percentage (of the total) of commits in GitHub as a function of time. It indicates a tendency for the activity volume of both platforms to decrease over time, from April 2016 to August 2019. However, we could not confirm that the activity levels of the two platforms were correlated within the same period, e.g., from August 2017 to January 2018.


Figure 3: Activity volume in GitHub (red) and OpenAg Forum (black) from April 2016 to September 2019 as a percentage of the total volume of commits and comments.


Of a total of 1,859 registered users, we considered only those who had posted in the Forum, totaling 936. Of these active users, 471 (50.3 percent) had either one or two posts in the community and were responsible for 897 posts (15.4 percent of 5,832). On the other hand, as Figure 4 illustrates, 70 users (8.45 percent) were responsible for 2,922 posts (50.12 percent). Regarding Github, 78 users had performed “commits” to the project. Tracing user affiliations there was more complicated than in the OpenAg Forum, and we could identify only 12. However, we found that 39 users (50 percent of the total) were responsible for 99 percent of the commits in the project and that six OpenAg members accounted for 69.4 percent of the contributions to the project. This contrasts with what we found for the Forum, showing that official members were more likely than other types of users to make changes to the project.
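Concentration figures of this kind (how many of the most active users account for a given share of posts or commits) can be derived from per-user counts as sketched below. The function name and the sample counts are hypothetical and do not reproduce the OpenAg data.

```python
def users_covering(counts, share=0.5):
    """Smallest number of most-active users whose activity reaches `share` of the total."""
    total = sum(counts)
    running = 0
    for rank, n in enumerate(sorted(counts, reverse=True), start=1):
        running += n
        if running >= share * total:
            return rank
    return len(counts)


# Hypothetical number of posts per user.
posts_per_user = [50, 20, 10, 10, 5, 5]
half = users_covering(posts_per_user, share=0.5)
most = users_covering(posts_per_user, share=0.8)
```

With these sample counts, one user already covers half of the posts and three users cover 80 percent, which is the same reading that the running totals in Figure 4 support for the real data.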


Figure 4: Running percentage of total posts in the OpenAg Forum (top) and of total commits in the OpenAg Github page (bottom) for each user. The view excludes users with no posts or commits.


We were able to trace the affiliations of 56 of the top 70 active users in OpenAg, of whom five were associated with the educational sector, 10 were part of the OpenAg team, 21 were enthusiasts and 20 were entrepreneurs. If we consider that the OpenAg team is also part of the educational sector (since it was hosted at MIT), the numbers indicate a similar distribution across the types of users in the community. Besides the 56 users (out of 70), we also traced the affiliation of 82 users (totaling 138 users), either because they presented this information in their profile or because it was explicit in posts. Figure 5 shows the distribution of those users according to the affiliation type that we assigned. Of the total (n=138), we classified 29 users as education (EDU), 17 as OpenAg members (OAM), 52 as enthusiasts (ENTH) and 40 users as entrepreneurs (ENTR). These users were responsible for 2,966 posts, representing 16.17 percent of the total users and 50.8 percent of the posts that we mined (2,962 of 5,832). Individually, each group represented 4.3 percent (EDU), 8.50 percent (OAM), 18.35 percent (ENTH) and 19.70 percent (ENTR) of the total number of posts.


Figure 5: Total of comments of users with mapped affiliation as a percentage of the total.


The Github network showed a well-defined cluster containing the most important nodes in terms of connectivity (EC > 0.40, Figure 6(A)). The OpenAg members were also attracted to the main core, an expected result given their activity level on Github. On the other hand, the OpenAg Forum network (Figure 7) indicated a less centralized structure, with weaker attraction forces between the most important nodes (EC > 0.40). The diversity of the community was expressed by the distribution of important nodes, which included representatives of all types (EDU, OAM, ENTH, ENTR). The two most important nodes belonged to the entrepreneur (EC = 1.0) and enthusiast (EC = 0.61) groups.


Figure 6: Network structure of the GitHub community indicating the Eigenvector values (A) and the user’s affiliation (B), which are represented as light blue (OpenAg Members), green (Education), magenta (Enthusiasts).



Figure 7: Network structure of the OpenAg Forum community indicating the Eigenvector values (A) and the user’s affiliation (B), which are represented as light blue (OpenAg Members), green (Education), magenta (Enthusiasts) and orange (Entrepreneurs).


RQ2: Do users’ importance and network structure change over time? Can we identify the most important reasons for users joining the project?

Figure 8 and Figure 9 present the network evolution of the OpenAg Forum based on the time series described in Table 1. The sequence illustrates the evolution of the network and the changes in EC values for each user (node size). The initial time frames (ts=1, ts=2, ts=3) indicated that the community started with a single strong core including the most important nodes. Over time (ts=4, ts=5) the initial structure became less compact, gaining new cores, and other users became more important, changing the distribution of EC values. Finally, the following time frames (ts=6 to ts=10) indicated that the process observed in ts=4 and ts=5 stabilized, i.e., importance continued to migrate away from the initial users as the network became even less compact. The less compact structure reflected the consolidation of new clusters. These changes are highlighted in Figure 10, which shows the distribution of EC values for all periods in comparison to the values found at ts=1. Although we cannot confirm changes for all users, it is possible to see that those with very high EC values at the beginning of the analysis lost importance over time, whilst others gained it.


Figure 8: Network evolution of the OpenAg Forum. Node sizes are defined based on individual eigenvector values for each ts from ts1→ts6. Node colors are defined based on modularity for ts10.



Figure 9: Network evolution of the OpenAg Forum. Node sizes are defined based on individual eigenvector values for each ts from ts7→ts10. Node colors are defined based on modularity for ts10.


It is also important to highlight that the optimum number of communities decreased over the different periods, from 632 communities in ts=1 (modularity = 0.582) to 73 communities in ts=10 (modularity = 0.514). Although the modularity values decreased, we considered the decrease to be insignificant given the variation in the optimum number of communities, indicating a tendency of the Forum community to be divided into clearly separated groups over time. Back to Figure 8, in ts=10, the colors indicate the modularity class of each node. Although the model resulted in 73 communities, only seven accounted for 69.27 percent of all users, including those with higher EC values. These classes are represented in orange (22.13 percent of users), magenta (14.68 percent of users), light green (11.01 percent), red (10.09 percent), purple (8.49 percent), pink (7.11 percent) and dark green (5.85 percent).


Figure 10: Eigenvector variance between ts=1 (black) and ts=10 (red).


Finally, Figure 11 shows users’ permanence based on their first activity in the Forum, e.g., signing up, and their last activity. A large share of users (34.36 percent) interacted for only one day in the community (a permanence of 0.0 percent), and 36.90 percent interacted for between 0.01 and 10.00 percent of the possible number of days. On the other hand, users with higher permanence values (above 70 percent) represented only 4.17 percent of the total. Amongst the 10 users with the highest EC values, seven had a permanence value above 70 percent. It is important, however, to note that high permanence does not necessarily mean high consistency or constant posting.
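The permanence value can be read as the span between a user's first and last activity, expressed as a percentage of the days available from the first activity to the end of the observation window. The formula below is inferred from the description of Figure 11 and should be taken as an assumption; the function name and the dates are illustrative.

```python
from datetime import date


def permanence(first, last, end):
    """Percentage of the days available since first activity covered up to the last activity."""
    possible = (end - first).days
    if possible <= 0:
        return 0.0  # user first appeared on the final day of the window
    return 100.0 * (last - first).days / possible


window_end = date(2019, 9, 30)  # assumed end of the observation window

single_day = permanence(date(2019, 1, 1), date(2019, 1, 1), window_end)
whole_span = permanence(date(2016, 4, 1), window_end, window_end)
ten_of_hundred = permanence(date(2018, 1, 1), date(2018, 1, 11), date(2018, 4, 11))
```

A user active on a single day obtains 0 percent, a user active from the first comment to the end of the window obtains 100 percent, and the last example (10 of 100 possible days) yields 10 percent. As noted above, the measure says nothing about how often the user posted in between.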


Figure 11: Permanence (in percentage) of users in the community considering first and last activities (comments). One hundred percent permanence means that a user participated in the community from their first comment until September 2019.


RQ3: Why do single-time users participate in the community and how do they affect activity volume?

We considered single-time users to be those who participated in the community at most twice, either by starting a topic thread or by replying to someone’s post. As mentioned earlier, these users represented 50.3 percent (471) of all users. After identification and categorization of keywords, we identified five main topics preferred by single-time users in the community. These are presented in Table 3 with some excerpts from messages.

First, the majority of users were involved in school or research projects, including educators and students from primary and secondary schools as well as undergraduate and graduate students. In general, the comments were not always linked to a particular question about technicalities of the project. It was a way for users to communicate their experiences and express how project outcomes benefited the learning environment. The second and third groups were related to users either building a PFC or interested in building one. The second group consisted of users with a higher experience level and with more specific interests. For example, users with a background, or interest, in aeroponics and hydroponics might be interested in particular aspects of the project. Others would ask more specific questions and not receive a reply. The third group consisted of users motivated to build a PFC but with very general questions regarding the costs involved, the types of plants that could be grown, the materials and components required, and the dates of new versions. Fourth, users also showed particular interest in introducing themselves and indicating where they lived. Some comments were also related to the local availability of components and possible alternatives for those which were not available. Others mentioned their interest in building PFCs or exchanging information with other users in the same geographic area. Finally, a smaller group consisted of users interested in presenting a business idea or contacting others with the aim of selling or buying products. Some users offered to buy assembled PFCs or pay for help, while others tried to sell either components or completed PFCs.


Table 3: Comment excerpts of single-time users based on the type of comment that we mapped. Typos and grammatical errors were kept as in the original source.


RQ4: Can we identify important discussion topics based on topic modeling tools?

We performed LDA analysis to highlight eight topics, a number chosen based on the average coherence scores shown in Table 2. The results in Table 4 display the 10 most salient words associated with each topic and the key terms that we defined for them. Topics 2, 3 and 4 were mostly related to aspects involving environmental conditions for plant growth. In general, Topic 2 seemed more closely related to lighting and its influence on growth and other factors, while Topic 3 was associated with irrigation processes and nutrient supply. Finally, Topic 4 seemed to be related to more generic aspects of environmental conditions, including temperature, water and humidity.

Topics 5, 6 and 7 focused on technical aspects of the system including hardware and software. Topic 5 included words such as light, BOM (bill of materials), cut, fan, led and box, which were associated with hardware assembly and the BOM. Topics 6 and 7 were very similar and both referred to the configuration of system boards, e.g., Arduino and Raspberry Pi. Topic 6 was about running different software, configuring the database and accessing data from sensors. It included words such as data, file, database, code and software. As for Topic 7, it included the words sensor, Arduino, code, connect and pin, being related to calibrating Arduino and accessing data from sensors.

Finally, Topics 1 and 8 were unrelated to technical aspects of the project but linked to educational initiatives and user participation in developing and testing design alternatives. Topic 1 included the words “OpenAg, interested, project, great and think”. In this sense, it seemed more related to comments of users interested in building PFCs. It also included some contributions to the project. Finally, Topic 8 was defined by the words “food computer, student, group and project”. It was strongly associated with school projects and the development of PFCs by high school and undergraduate students.
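The selection of the number of topics from average coherence scores can be sketched as follows. This is a minimal illustration that assumes several LDA runs per candidate number of topics and picks the candidate with the highest mean coherence. The scores below are invented for the example and are not the values of Table 2.

```python
from statistics import mean


def best_num_topics(coherence_runs):
    """Pick the number of topics with the highest average coherence.

    `coherence_runs` maps a candidate number of topics to the coherence
    scores obtained from repeated runs with that setting.
    """
    averages = {k: mean(scores) for k, scores in coherence_runs.items()}
    best_k = max(averages, key=averages.get)
    return best_k, averages


# Hypothetical scores for illustration only.
coherence_runs = {6: [0.48, 0.50], 8: [0.52, 0.53], 10: [0.49, 0.51]}
best_k, averages = best_num_topics(coherence_runs)
print(best_k)
```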


Table 4: Latent topics and respective keywords resulting from the LDA analysis. The coherence score of the analysis is 0.5206.


The distribution of topics is presented in Table 4. The results indicate which topics tended to be more discussed in the community based on the average scores for each comment. Figure 12 presents the distribution of comments based on their probability for each topic. Topic 6, related to software configuration (database and sensors), is the most probable topic, with an average probability value of 30.2 percent and 171 comments with a probability value above 90.0 percent. It is followed by Topic 8, which has an average probability value of 28.0 percent and 65 comments above 90.0 percent. The remaining topics are, in order, Topic 1 (23.6 percent and 36 comments); Topic 4 (22.9 percent and 40 comments); Topic 5 (19.9 percent and 23 comments); Topic 7 (16.2 percent and 21 comments); Topic 3 (10.6 percent and 24 comments); and Topic 2 (10.2 percent and four comments).
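The two figures reported per topic, an average probability and a count of comments above 90 percent, could be derived from a comment-by-topic probability matrix along these lines. The sketch assumes that the average is a plain arithmetic mean over comments, which the text does not state, and the matrix is a small made-up example.

```python
def topic_summary(doc_topics, threshold=0.9):
    """Summarize a comment-by-topic probability matrix.

    `doc_topics` is a list of rows, one per comment, each row holding one
    probability per topic. Returns (topic number, mean probability,
    number of comments above `threshold`), most probable topic first.
    """
    n_docs = len(doc_topics)
    n_topics = len(doc_topics[0])
    summary = []
    for k in range(n_topics):
        column = [row[k] for row in doc_topics]
        mean_prob = sum(column) / n_docs
        n_above = sum(1 for p in column if p > threshold)
        summary.append((k + 1, mean_prob, n_above))
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Hypothetical probabilities for four comments and three topics.
doc_topics = [
    [0.95, 0.03, 0.02],
    [0.10, 0.85, 0.05],
    [0.92, 0.05, 0.03],
    [0.20, 0.30, 0.50],
]
for topic, mean_prob, n_above in topic_summary(doc_topics):
    print(topic, round(mean_prob, 3), n_above)
```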


Figure 12: Running total of comments based on their probability to belong to a specific topic.




4. Discussion

We applied data mining techniques to explore whether the information collected would enable us to understand the structure of a specific open design project. For that purpose, we retrieved metadata from the Open Agriculture Foundation project, available at their Community Forum page and Github. Surveying the structure of OD projects and their corresponding communities is of much importance to our understanding of this emergent phenomenon and possible factors that make a given community effective.

4.1. Some observations and open questions

Regarding RQ1, although there was a high heterogeneity of users with high activity in the Forum (70 users responsible for 50.12 percent of comments), Github commits were possibly limited to official OpenAg members (six users responsible for 69.3 percent of the commits). These differences between the Forum and Github were also noticed in the network analysis, which identified two different network structures. While the Forum presented a less centralized and more diverse network, in Github we noted high centralization and lower diversity. In comparative terms, the Forum and Github networks were closely related to what Bonvoisin, et al. (2018) classified as “closely connected decentral networks” and “highly centralized projects”, respectively.
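Concentration figures of the kind quoted above (a small number of users accounting for a large share of comments or commits) can be computed with a short sketch like the one below. The function name and the commit list are hypothetical.

```python
from collections import Counter


def top_k_share(authors, k):
    """Share of all contributions made by the k most active authors.

    `authors` holds one author name per contribution (comment or commit).
    """
    counts = Counter(authors)
    total = sum(counts.values())
    top = counts.most_common(k)
    return sum(n for _, n in top) / total


# Hypothetical commit log: one entry per commit.
commits = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
print(top_k_share(commits, 1), top_k_share(commits, 2))
```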

The results for Github indicated limitations in decision-making processes, contributing to the debate regarding the extent of accessibility in OD projects, i.e., the degree to which any person can participate in a given development process (Balka, et al., 2010). At the same time, it was not possible to confirm whether the lack of accessibility was intentional, i.e., limited by OpenAg members, or reflected unfamiliarity with the Github platform. Distinguishing between these possibilities would require further analysis, such as interviewing members of the community. However, limitations for inexperienced users in taking advantage of Github’s existing features have been reported in other studies as well (Feliciano, et al., 2016). As for the Forum, the most active users were entrepreneurs, enthusiasts and OpenAg members. From our understanding, the Forum is a more intuitive platform for collaboration between users. It benefits replicability by enabling users to report issues and get feedback from the community.

The role of entrepreneurs in sustaining OS communities is well reported in the literature, but mostly restricted to software development (Yetis-Larsson, et al., 2015). Although not directly involved in the decision-making process, these users are key actors for community health. Our results indicated a possible “mutualistic” relationship between entrepreneurs and the OD community. On the one hand, these users take advantage of the open content developed to foster their product innovation processes and, on the other hand, the community benefits from the entrepreneurs’ collaboration, whether by reporting/fixing errors, making suggestions or helping other users. Another example of the links between OS and entrepreneurs is RepRap, an OS project for self-replicating 3D printers. Started in 2005, the project led to the creation of companies selling printer assembly kits and components, e.g., Bits from Bytes [3], and also 3D printer suppliers based on RepRap designs, e.g., MakerBot Industries.

Users with educational interests were also well represented in the community. From our perspective, they provided valuable contributions to OpenAg. A considerable number of topics and comments, by students and instructors, were related to the development of school projects, i.e., technology and food growing. We wondered whether these users continued collaborating with the community after completing their specific projects. However, the high number of single-time users with education-related inquiries indicated that they were a low-permanence group.

As for RQ2, the evolution of the Forum network indicated both changes in the importance of users and activities as well as increases in the diversity of topics. It started as a “highly centralized” network, with OpenAg members associated with the most important nodes. This was an expected result if we consider that the Forum was created by OpenAg members. An important remark is the fact that the initial structure of the Forum (2016) was similar to the structure found in Github. However, the subsequent network structure of the Forum became more diverse and less centralized, indicating that external users who joined later were important actors. We also observed a decrease in the importance of OpenAg members, which, from an open source perspective, is desirable. It possibly indicated a stage where the community depended less on the original project developers. However, we cannot safely conclude that this is the current stage of the community. Given the recent nature of the project, continuous analysis of the community would enable us to verify, for instance, whether the structure of the Forum maintained the decentralization/diversification process that we have observed. It would also be possible to confirm the centralized nature of the Github community.
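One way to quantify the shift from a centralized to a decentralized network across yearly snapshots is Freeman's degree centralization, which equals 1 for a star network and 0 when every node has the same degree. The sketch below is an assumption for illustration: the text does not state which centralization measure was used, and the snapshots and node names are invented.

```python
from collections import defaultdict


def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph.

    `edges` is a list of (user, user) pairs. Self-loops and duplicate
    edges are ignored, and users without any edge are not counted.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined for fewer than three nodes
    degrees = [len(adjacent) for adjacent in neighbors.values()]
    d_max = max(degrees)
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical snapshots: a star around a staff account, then a ring.
snapshots = {
    2016: [("staff", "u1"), ("staff", "u2"), ("staff", "u3"), ("staff", "u4")],
    2018: [("staff", "u1"), ("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "staff")],
}
by_year = {year: degree_centralization(e) for year, e in snapshots.items()}
print(by_year)
```

A value that falls from year to year would be consistent with the decentralization described above, although a single index cannot capture changes in who occupies the central positions.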

In RQ3, we investigated the reasons that single-time users participated in the Forum community. Although the results do not allow us to draw precise conclusions about who these users were, some observations need to be highlighted. First, the wide variety of topics discussed by these users reflected the diversity that we found in the overall community. These users commented upon topics related to commercial usage of the project and educational ideas, inquired about the PFC building process and searched for users from related localities. Second, in OSS studies, authors identified that new users also joined the communities because they needed to solve specific problems (Shah, 2005; Bagozzi and Dholakia, 2006). This was also true for a number of users that we identified. However, the analysis of the Forum showed that users were motivated by different reasons, e.g., an interest in agricultural techniques or in finding a new hobby. In fact, by tracing back comments from the most active users (non-OpenAg members), we managed to verify that they were particularly interested in plant growing systems prior to the release of the project. Finally, the large number of single-time users (50.3 percent) indicated that the majority of potential collaborators stopped contributing right after their first interactions with the community. As we also pointed out, over 50 percent of users had a low participation rate (less than 10 percent) considering the total time (from registration to the time data was acquired). We understood this feature to be a natural aspect of the OS ecosystem and, possibly, a positive one. It indicated some sort of enthusiasm or inspiration, even if temporary, and at the same time it increased product replicability by enabling users to ask questions about the building process and secure support from the community.

RQ4 considered the adoption of topic modeling tools to check whether it was possible to identify major topics discussed by the community. We evaluated modeling output based on our familiarity with forum posts, users and previous analysis. We noted the existence of some cohesion between topics generated from LDA analysis and the community profile, i.e., the type of users, motivation and development processes. Two of the topics referred to comments about education, user participation and similar ideas. These correlated to profiles of users that we identified, especially students and teachers. The other six topics were linked to hardware and software aspects of the project varying from more general to more specific. Another possible contribution of LDA analysis was the classification of issues into more detailed categories than those available on the Forum. Although posts were already divided in the Forum into 17 specific categories, many fell into different ones while others were posted in non-related categories. For example, posts related to temperature and humidity were observed in categories such as hardware, software, electronics and help. We wondered whether forums could benefit from applying topic modeling processes to organize content created by users and help project curators identify trends in topics. Tagging systems based on LDA, for instance, have been proposed in other studies (Krestel, et al., 2009).
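The idea of organizing posts by latent topic can be illustrated with a minimal sketch that assigns a post the label of its dominant topic when that topic is probable enough. This is a simplification for illustration and not the tag recommendation method of Krestel, et al. (2009). The labels, the probabilities and the 0.5 cut-off are all assumptions.

```python
def suggest_tag(topic_probs, labels, min_prob=0.5):
    """Return the label of the most probable topic, or None if it is too weak.

    `topic_probs` holds one probability per topic for a single post and
    `labels` holds the human-assigned name of each topic, in the same order.
    """
    best = max(range(len(topic_probs)), key=lambda k: topic_probs[k])
    if topic_probs[best] < min_prob:
        return None  # no topic is dominant enough to justify a tag
    return labels[best]


# Hypothetical topic labels and per-post probabilities.
labels = ["education", "lighting", "irrigation"]
print(suggest_tag([0.1, 0.7, 0.2], labels))
print(suggest_tag([0.4, 0.3, 0.3], labels))
```

Returning no tag for posts without a dominant topic leaves ambiguous posts, such as those that span hardware and software, for a curator to label manually.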

4.2. Limitations of the study

The limitations of this study were related both to the amount and quality of information accessible and to some of the results which were based on our own interpretations, such as classifying user affiliations (education, work etc.). First, our analysis consisted of a single case which only allowed us to draw some possible but limited conclusions about the structure of all OD projects. Therefore, future studies will need to explore a wider range of projects and datasets, which could confirm or reject our preliminary results. The data mining techniques that we adopted, based on previous studies or proposed by us, could be valuable for such studies. Second, although our initial tests demonstrated the consistency of the approach, the methods adopted in the evaluation of user-to-user collaboration on the Forum platform were still limited. One possible alternative to investigate would be to consider only comments with direct mentions of other users. However, as we noticed, this would lead to a considerable decrease in the overall number of comments to be evaluated. Third, activity volume in both platforms showed a clear tendency to decrease over time. We wondered whether activity volume was influenced by events that were external to the collaboration platform and to product development per se. We noted, for instance, that some users joined the Forum community after watching a TED talk by the project creator. Fourth, user affiliations were assigned based on profile descriptions and keyword searching. This approach did not identify changes in user interests, e.g., a student member who became an entrepreneur, and demanded extra manual work in order to filter users. For larger datasets, automated classification processes would be needed to guarantee analysis capacity. Finally, topic modeling has shown some potential for identifying latent topics in communities but its limitations need to be highlighted. These include (i) defining the optimum number of topics, which may vary significantly in short periods of time; (ii) identifying similar topics between comments in different languages; and (iii) minimizing the possibility for overlapping topics.
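The keyword-based assignment of user affiliations mentioned among the limitations can be sketched as follows. The keyword lists and category names are hypothetical and are not the ones used in the study.

```python
# Hypothetical keyword lists, chosen only for illustration.
AFFILIATION_KEYWORDS = {
    "education": ("student", "teacher", "school", "university"),
    "entrepreneur": ("startup", "founder", "company", "business"),
}


def classify_affiliation(profile_text, keywords=AFFILIATION_KEYWORDS):
    """Assign the affiliation whose keywords occur most often in a profile.

    Matching is case-insensitive and substring-based, so "students" counts
    as a hit for "student". Profiles with no hit are labelled "unknown".
    """
    text = profile_text.lower()
    hits = {
        label: sum(text.count(word) for word in words)
        for label, words in keywords.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "unknown"


print(classify_affiliation("High school teacher building a food computer with students"))
print(classify_affiliation("Founder of a hydroponics startup"))
print(classify_affiliation("Hobby gardener"))
```

Because the rule reads a single static profile description, it cannot detect a user whose role changes over time, which is the limitation noted above.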



5. Conclusion

In this study, we investigated the extent of knowledge that can be created by mining data from the communication platforms of an OD project, namely Github and the Forum. We defined OpenAg as our research object and developed four RQs. Based on the data we mined, we also drew some possible conclusions about project structure which could be applied to other projects in general. The results indicated the high potential that data mining has for understanding the dynamics of OD projects.

Previous studies have demonstrated the potential of data mining for investigations in the field of OS. With that in mind, we understand that our results and methodological approaches contribute to future research on the social structure of OSH and OD projects. They enabled us to identify, for instance, (i) some of the key users and how/whether they change over time; (ii) their interests in participating in the community; and (iii) the possible role of educational projects in fostering collaboration in OD. From a practical perspective, they helped our understanding of the current status and limitations of remote collaborative design processes, especially related to user participation, activity volume and content. Considering that replicability is one of the major aspects of OD, the optimization of such collaboration platforms could improve the way in which users with different expertise interact as well as make it easier for newcomers to participate and find help, e.g., by assigning tags to posts and comments.

In future work, we therefore aim to confirm or reject our initial assumptions with a larger body of OD projects. In general, we consider that each one of our RQs could be further explored in future research. We find it particularly interesting to continue the development of data mining techniques and to increase the number of platforms from which we can directly retrieve less noisy data.


About the authors

Rodrigo Argenton Freire is an architect, urban designer and an architecture professor. His PhD discusses the potential of the makers/open source culture to shape new forms of architecture practice, and the societal role of architects in developing regions. His recent works include facilitating participatory workshops for the development of a senior cohousing and the Sustainable Campus Initiative at the University of Campinas.
ORCID: 0000-0002-0075-0763
Direct comments to: Rodrigo [at] odlab [dot] cc

Evandro Ziggiatti Monteiro is an architect, urban designer and professor at the University of Campinas, Brazil. He currently investigates urban morphology, its dynamics, frictions and impacts to the urban landscape. Since 2010, he has been an associate coordinator of Fluxus, a research group on Sustainability and Urban Environmental Technical Networks.
ORCID: 0000-0002-6304-1614
E-mail: evandrozig [at] fec [dot] unicamp [dot] br



Acknowledgments

This study was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brasil (CAPES), Finance Code 01–P–04375–2015. We would also like to thank the creators from the Noun Project who provided some of the icons used in this article. Creators are: Figure 1: H Alberto Gongora, SBTS, ibrandify and Nathan David Smith; Figure 2: amantaka and Nathan David Smith.





Notes

3. Bits From Bytes was a provider of affordable 3D printers and 3D printer kits. It was bought by 3D Systems Corporation in 2010.



References

Aleksi Aaltonen and Stephan Seiler, 2016. “Cumulative growth in user-generated content production: Evidence from Wikipedia,” Management Science, volume 62, number 7, pp. 2,054–2,069.
doi:, accessed 6 August 2020.

Tanja Aitamurto, Dónal Holland and Sofia Hussain, 2015. “The open paradigm in design research,” Design Issues, volume 31, number 4, pp. 17–29.
doi:, accessed 6 August 2020.

Richard P. Bagozzi and Utpal M. Dholakia, 2006. “Open source software user communities: A study of participation in Linux user groups,” Management Science, volume 52, number 7, pp. 1,099–1,115.
doi:, accessed 6 August 2020.

Kerstin Balka, Christina Raasch and Cornelius Herstatt, 2014. “The effect of selective openness on value creation in user innovation communities,” Journal of Product Innovation Management, volume 31, number 2, pp. 392–407.
doi:, accessed 6 August 2020.

Kerstin Balka, Christina Raasch and Cornelius Herstatt, 2010. “How open is open source? — Software and beyond,” Creativity and Innovation Management, volume 19, number 3, pp. 248–256.
doi:, accessed 6 August 2020.

J.A. Barnes, 1969. “Graph theory and social networks: A technical comment on connectedness and connectivity,” Sociology, volume 3, number 2, pp. 215–232.
doi:, accessed 6 August 2020.

David M. Blei, Andrew Y. Ng and Michael I. Jordan, 2003. “Latent Dirichlet allocation,” Journal of Machine Learning Research, volume 3, pp. 993–1,022, and at, accessed 6 August 2020.

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre, 2008. “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, volume 2008.
doi:, accessed 6 August 2020.

Étienne Boisseau, Jean-François Omhover and Carole Bouchard, 2018. “Open-design: A state of the art review,” Design Science, volume 4, e3.
doi:, accessed 6 August 2020.

Jérémy Bonvoisin, 2018. “jbon/github-mining: For Design Science Journal publication,” version v0.1 (27 March), Zenodo.
doi:, accessed 6 August 2020.

Jérémy Bonvoisin and Robert Mies, 2018. “Measuring openness in open source hardware with the Open-o-Meter,” Procedia CIRP, volume 78, pp. 388–393.
doi:, accessed 6 August 2020.

Jérémy Bonvoisin, Tom Buchert, Maurice Preidel and Rainer G. Stark, 2018. “How participative is open source hardware? Insights from online repository mining,” Design Science, volume 4, e19.
doi:, accessed 6 August 2020.

D. D’Amato, N. Droste, B. Allen, M. Kettunen, K. Lähtinen, J. Korhonen, P. Leskinen, B.D. Matthies and A. Toppinen, 2017. “Green, circular, bio economy: A comparative analysis of sustainability avenues,” Journal of Cleaner Production, volume 168, pp. 716–734.
doi:, accessed 6 August 2020.

Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester and Richard Harshman, 1988. “Using latent semantic analysis to improve access to textual information,” CHI ’88: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 281–285.
doi:, accessed 6 August 2020.

Aysu Ezen-Can, Kristy Elizabeth Boyer, Shaun Kellogg and Sherry Booth, 2015. “Unsupervised modeling for understanding MOOC discussion forums,” LAK ’15: Proceedings of the Fifth International Conference on Learning Analytics And Knowledge, pp. 146–150.
doi:, accessed 6 August 2020.

Rodrigo A. Freire, 2019. “Scraping scripts for mining Discourse based Forums,” Zenodo (18 December).
doi:, accessed 6 August 2020.

Joseph Feliciano, Margaret-Anne Storey and Alexey Zagalsky, 2016. “Student experiences using GitHub in software engineering courses,” ICSE ’16: Proceedings of the 38th International Conference on Software Engineering Companion, pp. 422–431.
doi:, accessed 6 August 2020.

Jan-Peter Ferdinand, 2018. Entrepreneurship in innovation communities: Insights from 3D printing startups and the dilemma of open source hardware. Cham, Switzerland: Springer International.
doi:, accessed 6 August 2020.

Kimberly A. Fredericks and Maryann M. Durland, 2005. “The historical evolution and basic concepts of social network analysis,” New Directions for Evaluation, volume 2005, number 107, pp. 15–23.
doi:, accessed 6 August 2020.

Anatoliy Gruzd and Jeffrey Roy, 2014. “Investigating political polarization on Twitter: A Canadian perspective,” Policy & Internet, volume 6, number 1, pp. 28–45.
doi:, accessed 6 August 2020.

Brandon Heller, Eli Marschner, Evan Rosenfeld and Jeffrey Heer, 2011. “Visualizing collaboration and influence in the open-source software community,” MSR ’11: Proceedings of the Eighth Working Conference on Mining Software Repositories, pp. 223–226.
doi:, accessed 6 August 2020.

Abram Hindle, Anahita Alipour and Eleni Stroulia, 2016. “A contextual approach towards more accurate duplicate bug report detection and ranking,” Empirical Software Engineering, volume 21, number 2, pp. 368–410.
doi:, accessed 6 August 2020.

Eric von Hippel and Georg von Krogh, 2003. “Open source software and the ‘private-collective’ innovation model: Issues for organization science,” Organization Science, volume 14, number 2, pp. 209–223.
doi:, accessed 6 August 2020.

Thomas Hofmann, 1999. “Probabilistic latent semantic indexing,” SIGIR ’99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57.
doi:, accessed 6 August 2020.

Liangjie Hong and Brian D. Davison, 2010. “Empirical study of topic modeling in Twitter,” SOMA ’10: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88.
doi:, accessed 6 August 2020.

Yifan Hu, 2005. “Efficient, high-quality force-directed graph drawing,” Mathematica Journal, volume 10, number 1, pp. 37–71.

Ralf Krestel, Peter Fankhauser and Wolfgang Nejdl, 2009. “Latent Dirichlet allocation for tag recommendation,” RecSys ’09: Proceedings of the Third ACM Conference on Recommender Systems, pp. 61–68.
doi:, accessed 6 August 2020.

Qize Le and Jitesh H. Panchal, 2012. “Analysis of the interdependent co-evolution of product structures and community structures using dependency modelling techniques,” Journal of Engineering Design, volume 23, numbers 10–11, pp. 804–825.
doi:, accessed 6 August 2020.

Josh Lerner and Jean Tirole, 2003. “Some simple economics of open source,” Journal of Industrial Economics, volume 50, number 2, pp. 197–234.
doi:, accessed 6 August 2020.

Chuanyi Li, Liguo Huang, Jidong Ge, Bin Luo and Vincent Ng, 2018. “Automatically classifying user requests in crowdsourcing requirements engineering,” Journal of Systems and Software, volume 138, pp. 108–123.
doi:, accessed 6 August 2020.

Elizabeth D. Liddy, 2001. “Natural language processing,” In: Allen Kent and Harold Lancour (editors). Encyclopedia of library and information science. Second edition. New York: Marcel Dekker.

Dean Lusher, Garry Robins and Peter Kremer, 2010. “The application of social network analysis to team sports,” Measurement in Physical Education and Exercise Science, volume 14, number 4, pp. 211–224.
doi:, accessed 6 August 2020.

Víctor Macul and Henrique Rozenfeld, 2015. “How an open source design community works: The case of open source ecology,” DS 80-3: Proceedings of the 20th International Conference on Engineering Design, volume 3, pp. 359–366.

Tiina Malinen, Teemu Mikkonen, Vesa Tienvieri and Tere Vadén, 2010. “Open source hardware through volunteer community: A case study of eCars — now!,” MindTrek '10: Proceedings of the 14th International Academic MindTrek Conference: Envisioning Future Media Environments, pp. 65–68.
doi:, accessed 6 August 2020.

Massimo Menichinelli, 2017. “A data-driven approach for understanding Open Design. Mapping social interactions in collaborative processes on GitHub,” Design Journal, volume 20, supplement 1, pp. S3,643–S3,658.
doi:, accessed 6 August 2020.

Mite Mijalkov, Ehsan Kakaei, Joana B. Pereira, Eric Westman and Giovanni Volpe, 2017. “BRAPH: A graph theory software for the analysis of brain connectivity,” PLoS ONE, volume 12, number 8, e0178798.
doi:, accessed 6 August 2020.

Robert K. Nelson, 2010. “Mining the Dispatch,” Digital Scholarship Lab, University of Richmond, at, accessed 15 December 2019.

Margit Osterloh and Sandra Rota, 2007. “Open source software development — Just another case of collective invention?” Research Policy, volume 36, number 2, pp. 157–171.
doi:, accessed 6 August 2020.

Sonali K. Shah, 2005. “Open beyond software,” In: Danese Cooper, Chris DiBona and Mark Stone (editors). Open sources 2.0. Sebastopol, Calif.: O’Reilly, pp. 339–360.

Cuihua Shen and Peter Monge, 2011. “Who connects with whom? A social network analysis of an online open source software community,” First Monday, volume 16, number 6, at, accessed 6 August 2020.
doi:, accessed 6 August 2020.

Agnieszka Sitko-Lutek, Supakij Chuancharoen, Arkhom Sukpitikul and Kongkiti Phusavat, 2010. “Applying social network analysis on customer complaint handling,” Industrial Management & Data Systems, volume 110, number 9, pp. 1,402–1,419.
doi:, accessed 6 August 2020.

Kalyanasundaram Somasundaram and Gail C. Murphy, 2012. “Automatic categorization of bug reports using latent Dirichlet allocation,” ISEC ’12: Proceedings of the 5th India Software Engineering Conference, pp. 125–130.
doi:, accessed 6 August 2020.

Cassidy R. Sugimoto, Daifeng Li, Terrell G. Russell, S. Craig Finlay and Ying Ding, 2011. “The shifting sands of disciplinary development: Analyzing North American library and information science dissertations using latent Dirichlet allocation,” Journal of the American Society for Information Science and Technology, volume 62, number 1, pp. 185–204.
doi:, accessed 6 August 2020.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David M. Blei, 2005. “Sharing clusters among related groups: Hierarchical dirichlet processes,” Advances in Neural Information Processing Systems, pp. 1,385–1,392; version at, accessed 6 August 2020.

Radu E. Vlas and William N. Robinson, 2012. “Two rule-based natural language strategies for requirements discovery and classification in open source software development projects,” Journal of Management Information Systems, volume 28, number 4, pp. 11–38.
doi:, accessed 6 August 2020.

Tze-i Yang, Andrew J. Torget and Rada Mihalcea, 2011. “Topic modeling on historical newspapers,” LaTeCH ’11: Proceedings of the Fifth ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 96–104.

Zeynep Yetis-Larsson, Robin Teigland and Olga Dovbysh, 2015. “Networked entrepreneurs: How entrepreneurs leverage open source software communities,” American Behavioral Scientist, volume 59, number 4, pp. 475–491.
doi:, accessed 6 August 2020.


Editorial history

Received 7 February 2020; revised 19 April 2020; accepted 16 June 2020.

Creative Commons License
This paper is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Measuring the development and communication of open design communities: The case of the OpenAg Initiative
by Rodrigo Argenton Freire and Evandro Ziggiatti Monteiro.
First Monday, Volume 25, Number 9 - 7 September 2020