One of the methodological and logistic problems of network research is the challenge of big data. Dynamics and network qualities are not as easy to extrapolate as averages and distributions. The numbers are huge, and traditional sampling doesn’t solve the problem. What if, by using a small sub–group of the members of a population, we could understand the nature of the network connecting them? For instance, what if we could draw network analysis conclusions, such as predicting the outbreak and evolution of an epidemic, without measuring the entire network of individuals? Christakis and Fowler (2010, cited in Wilson, 2010) found a unique group of users that could predict an epidemic days before its peak in the relevant population.
This study continues their work. Instead of exploring millions of online social activities, we suggest investigating the active users (as we define them) in their community and using their activity logs to build a partial network. This network of intensive users can depict the dynamics of a huge social network, in our case Yahoo! Answers intensive activities.
Barabási, et al. (2002) explored the connection between topology and network size on real–life networks. Twelve years later, our online Q&A social network study reached the same findings and conclusion: the partial network has several basic topological parameters that correlate with activity parameters of the entire social network and, hence, make it suitable for depicting the dynamic parameters of the huge network.
Since exploring online social lives is so interesting and time consuming, we believe that our findings can help the investigation of huge social networks. We call for further investigation of these findings and their implications.
Social networks, big data, and information overflow are among the hottest topics in research today. Online social networks (OSN) keep accumulating users, and the amount of metadata (the data regarding these networks) is growing fast. Using these data to learn the dynamics of the size and volume of huge social networks is an extremely interesting but challenging mission. Harvesting, extracting, cleansing, aggregating, analyzing, and understanding millions of user transactions is not an easy task. Accordingly, we suggest a shortcut to analyzing huge social networks that might save time and research effort.
Online social networks (OSN) such as Facebook, Twitter, and Instagram are tools for sharing, organizing, and searching relevant content. This study focuses on a segment of OSN, the social questions and answers (SQA) sites. Sites such as Ask, Cha–Cha, Answers.com, Quora, and Yahoo! Answers are Web–based information–seeking services, on which questions are asked and answers provided by the users. These SQA sites attract active and consistent users (Shah, et al., 2008), and the answers provided have a higher quality than those of specialists, according to some researchers (Harper, et al., 2008).
The present study explores Yahoo! Answers, the world’s largest question–answer system, which acts as both a community (Harper, et al., 2008) and an “online social network” (Agichtein, et al., 2007). We investigate the continuous and consistent users on Yahoo! Answers, those who respond to questions, create value, personify site norms, earn social capital, and have a strong influence on answer–quality assessment (Gazan, 2011).
In particular, we study two empirical questions regarding these “active users” (a precise definition of which is given later in this paper in Section 5):
- Existence and possible importance. Do “active users” exist in all of Yahoo! Answers’ social communities? What is the relationship (if any) between their activity and the network’s overall activity?
- Modeling. Can the analysis of an active users’ network help in understanding the dynamics of the topological parameters of the overall network?
The paper comprises eight sections. Following this introduction, Section 2 reviews related work. Then, in Section 3, we present the active users’ network. Section 4 details our two hypotheses. The research methodology and the research results are reported in Sections 5 and 6, respectively. The results are discussed in Section 7. Section 8 concludes the paper with the study’s limitations and recommendations for future work.
2. Related work
2.1. Yahoo! Answers
Launched on 5 July 2005, Yahoo! Answers enables participants to ask and answer questions on any topic. It provides more than 20 million answers per month and serves many needs: answering questions, providing and receiving support, and requesting everyday advice (Adamic, et al., 2008). Yahoo! Answers has more than twenty top categories and more than 1,600 sub–categories. Some categories are huge, with more than 100,000 users per month, while others have only a few users per month. The content of this SQA site includes informational questions and conversational discussions (Harper, et al., 2009), which are composed of opinion–type questions, evaluations, and points of view (Kim, et al., 2008). Good answers are provided for conversational questions (Liu and Agichtein, 2008) and for categories whose answerers are active on specific topics only (Adamic, et al., 2008).
Yahoo! Answers uses crowds to obtain data and information. Unlike other social platforms, such as Twitter, on which users must direct their questions to a certain user to receive results (Nichols and Kang, 2012; Paul, et al., 2011), posts on Yahoo! Answers are archived, and therefore comments and views can be accumulated over a long period of time (much longer than Twitter transactions).
The Yahoo! Answers process is quite straightforward. An asker places a question on Yahoo! Answers by selecting a category, entering the question subject (title), and, optionally, giving details (description). Questions are asked in an “open” state, meaning that answers are received from all users who answer. Once the asker is satisfied with any of the answers, he or she can choose it as a Best Answer (BA) and provide feedback (e.g., in the form of stars or textual feedback). If the asker does not indicate a BA within four to eight days, the BA is chosen by the community in a voting process. When a BA is finally chosen, the question is said to be “resolved”; however, comments can still be added. As part of the Yahoo! Answers process, users can tag questions (e.g., award stars for quality), as well as answers (thumbs up or thumbs down).
2.2. Active users on Q&A sites
“At the heart of small cliques are a few strong relationships, and as long as these persist, the community around them is stable.” (Palla, et al., 2007)
Online groups are most successful when a leader sets the agenda and the quality of the answer (Kerr, 1986). On Yahoo! Answers, there is no official leader or coordinator; those who contribute to Yahoo! Answers’ success are the site’s continuous and consistent users (Shah, et al., 2008).
The importance of the loyalty of the users of a community has been a long–studied issue. Morgan, et al. (1997) suggested that at the center of any network are its stable “core” members, whereas “latent” vertices are peripheral and less stable. The core might comprise as little as 10 percent of the largest number of high–density vertices (Zinoviev, 2008). It can determine the group’s nature (Backstrom, et al., 2008) and the intensity with which members participate (Koh, et al., 2007). However, once the core is removed, the social network loses its ability to function as a whole and disintegrates (Mislove, et al., 2007).
In the present study, “active users” are defined as those who ask and answer consistently over a six–month time period. The definition includes a mixture of the number of contributions, their quality (tagged as “Best Answers”), and the users’ persistence (Lapas and Terzi, 2008).
Our definition resembles that of the “Answer People” (Turner, et al., 2005; Welser, et al., 2007; Wellman, 2009) and “Most Active Users” (Yang, et al., 2010). The active users comprise the core of an online social network. Their participation and stability are critical, since participation and activity in social networks are rarely stable: 80 percent of the nodes appear in fewer than two snapshots of a social network (Bouguessa, et al., 2008). The active users may indeed be relatively few in number, but they record a high level of participation and account for the majority of the action (Lakhani and von Hippel, 2003; Soroka and Rafaeli, 2006; Murata and Moriyasu, 2007; Brandtzæg and Heim, 2008; Nazir, et al., 2008; Chen and Nayak, 2012). These users, furthermore, can be as much as 23 percent more influential in a social network, such as Flickr (Papagelis, et al., 2011).
Active users might be considered either opinion leaders or influential. Weimann (1994) suggests that opinion leaders are those who spread information or advice in the hope of shaping opinions. Influential individuals were defined as a “minority of individuals who influence an exceptional number of their peers” (Watts and Dodds, 2007; Sakamoto, et al., 2008). These users are also highly relevant to understanding the diffusion of topics on the public agenda (Romero, et al., 2011). It seems that the active users on Yahoo! Answers have the potential of being opinion leaders. The longer a user functions as an active user, asking many questions and providing numerous good answers, the greater are the chances that this person can set the category’s agenda and become “influential” in the specific category. Hence, it is not surprising that the best answers on Yahoo! Answers are correlated with consistent participation (Nam, et al., 2009) and that the highest–ranking users (“Yahoo!’s Best Contributors”) are more contributors than they are consumers (Shah, et al., 2008). However, Yahoo! Answers’ active users are not necessarily influential users (see the case of active bloggers in Agarwal, et al., 2008).
Several measures have been suggested for topologically identifying the core members of a network: In–degree centrality, Out–degree centrality, Closeness centrality, Betweenness centrality (Wasserman and Faust, 1994), HITS (Jurczyk and Agichtein, 2007), and PageRank (Sie, et al., 2008). In addition, the node’s position on the network affects its ability to create collective action (Marwell, et al., 1998; Chwe, 1999; Kitsak, et al., 2010). And yet, no correlation between the users’ inner ranking and their position on the social network has been found (Ganley and Lampe, 2009). An opinion leader’s links to the neighboring vertices were found to be more important than his or her overall location on the network (Valente and Davis, 1999), and the dominant users were not those with an official role on the network (Ravid and Rafaeli, 2004).
The fact that active users have the potential to become opinion leaders and the uncertainty regarding their topological place on the network supplied the main motivations for an exploration of their existence and importance on social networks. The next section will describe the network of active users, which was drawn from the Yahoo! Answers’ overall activities.
3. The active users’ network
Social networks constantly change (Tang, et al., 2009). This phenomenon is particularly true with SQA sites, where visitors normally seek a single piece of information (Gazan, 2011); therefore, it seems that there is no use in referring to average parameters or topology in this case (Hill, 2009). In general, there is growing concern about the role and representativeness of measures of central tendency in a world of networks (Christakis, Edge 2014). The preferred alternative would be the construction and gauging of network metrics.
Consequently, we chose to explore the implicit network of the active users. Based on studies by Shi, et al. (2008) and Guillaume and Latapy (2006), we built a network of active users to represent Yahoo! Answers activities. Following several researchers (Morgan, et al., 1997; Kossinets and Watts, 2006; Viswanath, et al., 2009), we tried to identify the correlation between a change in the topology of the active users’ network and changes in the overall activity of the entire Yahoo! Answers.
Following Rodrigues and Milic–Frayling (2009) and Jurczyk and Agichtein (2007), we first built a database that included the 20 most active askers and the 20 most active “best answerers” from each category in each month of the 19–month study. Next, we removed all users with fewer than 30 contributions per month (Shi, et al., 2007), and of those we chose only users who had been active for at least six months. The active users and the content categories supplied the two types of nodes. The activities of the active users in the 1,600 content categories represent the edges.
We explored the parameters of this simplified picture of Yahoo! Answers, its dynamics over time, and its topology. A few studies explored the relationship between the topological parameters of local networks and the whole network’s topology (Barabási, et al., 2002; Kossinets and Watts, 2006). However, to the best of our knowledge, the idea of exploring and using a change in the topological parameters of the active users’ network as a substitute for exploring the dynamics of a huge network is novel. This is what we propose to do here.
First, we considered the importance of active users. Is there a connection between the presence of active users in a content category and the volume of activity of this category? If a connection is found, then the next question can be investigated: Can the activity and the network of the active users be used to depict the entire social network? In other words, can the active users’ network model a huge social network?
4.1. The importance of active users — Active users’ presence and volume of activity
SQA sites provide a platform for synthetic, collaborative work with several active users who participate regularly. We inquired whether content categories with active users necessarily have an overall higher volume of activity. Intuitively, active users might attract more users to a specific content category; on the other hand, the presence of many users and a high volume of activity can encourage the creation of “active users” in a specific category. Since we cannot ascertain the direction of causality between the presence of major players and category activity, we looked only for a correlation. Our first hypothesis is as follows:
H1: There will be a positive correlation between the presence of major players in a content category and the overall activity of this category. In other words, there will be a significant difference in activity level between categories with and without active users.
4.2. The ability of an active users’ network — Modeling a huge social network with an active users’ network
Can the network of the active users alone model the overall dynamics of a huge social network? To address this issue, we looked for a correlation between the change in several topological parameters of the active users’ network and a change in the activity volume of Yahoo! Answers. Finding such a correlation would provide empirical support for our theoretical assumption that an active users’ network can represent the entire (and much larger) social network. The second hypothesis was the following:
H2: A correlation exists between a change in specific topological characteristics of the active users’ network and a change in the activity level on Yahoo! Answers.
The data reported here consist of all the activities on Yahoo! Answers between 1 January 2009 and 31 August 2010, excluding July 2009 (for which data were missing).
To choose the active users, we defined and extracted the 20 most active askers and the 20 most active “best answerers.” The overall number of records amounted to approximately 840,000. Next, we defined a user as an active user if and only if (1) the user had an average of at least one activity (asking or best answering) per day; (2) the user had been nominated in the active users records for at least six months of activity (in the course of the 19 months examined). We created a final list of more than 1,000 active users.
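The two selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual procedure: the record layout `(user_id, category, month, count)`, the function name `select_active_users`, and the choice to sum a user's activity across categories within a month are all hypothetical. The threshold of 30 activities per month stands in for "at least one activity per day".

```python
from collections import defaultdict

MIN_MONTHLY_ACTIVITY = 30  # roughly one activity (question or Best Answer) per day
MIN_ACTIVE_MONTHS = 6      # months in which the user must meet the threshold


def select_active_users(records):
    """Return the set of users that satisfy both selection rules.

    records: iterable of (user_id, category, month, count) tuples taken from
    the monthly top-20 lists (hypothetical layout, assumed for illustration).
    """
    # Assumption: a user's activity is summed over all categories in a month.
    monthly_total = defaultdict(int)
    for user_id, _category, month, count in records:
        monthly_total[(user_id, month)] += count

    # Count the months in which each user reaches the activity threshold.
    qualifying_months = defaultdict(int)
    for (user_id, _month), total in monthly_total.items():
        if total >= MIN_MONTHLY_ACTIVITY:
            qualifying_months[user_id] += 1

    return {
        user_id
        for user_id, months in qualifying_months.items()
        if months >= MIN_ACTIVE_MONTHS
    }
```

Keeping both thresholds as named constants makes it easy to test how sensitive the final list is to the definition of an active user.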
Since active users and content categories constitute totally different kinds of nodes, we had to build a bipartite network. In real life, the active users might or might not reply to one another; thus, the only way to place them all in one graph is to build a bipartite network, in which the users and the categories are the nodes. The links between the nodes represent the actual participation of an active user in a specific category, the same as actors and films in a “Kevin Bacon number” (Barabási, 2003). This network will depict the monthly connections among Yahoo! Answers’ categories, using the activities of the active users.
In this network, an active user is connected to another active user only through a mutual content category in a specific month. The same holds for categories: any two (or more) categories are connected through one or more active users who participate in both (or all) of them.
Rather than using an optional normalized weight, the weight of a connection between a category and an active user is represented here by the number of times an active user participated in a category.
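As a minimal sketch of the construction described above, the monthly bipartite network can be represented as a mapping from (user, category) pairs to raw participation counts. The input layout (one `(user_id, category)` pair per question or Best Answer) and both function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_bipartite_edges(events):
    """Map each (user_id, category) pair to its raw participation count.

    events: iterable of (user_id, category) pairs, one per question asked or
    Best Answer given during one month (hypothetical input layout).
    """
    return Counter(events)


def categories_linked_by_users(edges):
    """Return the category pairs that share at least one active user."""
    categories_of = defaultdict(set)
    for user_id, category in edges:
        categories_of[user_id].add(category)

    linked = set()
    for categories in categories_of.values():
        # Every pair of categories visited by the same user is connected.
        for first, second in combinations(sorted(categories), 2):
            linked.add((first, second))
    return linked
```

For example, a user with two activities in one category and one activity in another yields two edges with weights 2 and 1, and that user alone is enough to connect the two categories.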
This study searched for two correlations, as expressed by the two hypotheses, respectively: (1) the correlation between the presence of major players in a content category and the overall activity of this category; and, (2) the correlation between the change in specific topological characteristics of the active users’ network and the change in activity level on Yahoo! Answers. The explanatory power of the correlations was evaluated by using Pearson correlations, and the significance threshold selected was (2–tailed) Alpha <0.001.
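For readers who want to reproduce this kind of test, the following sketch computes a Pearson coefficient and the t statistic that underlies its two-tailed significance test. It uses only the Python standard library; the function names are illustrative and this is not the statistical package used in the study. The two-tailed p-value is obtained by comparing the statistic with a Student t distribution with n - 2 degrees of freedom, a step left out here because the standard library has no t distribution.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

With a large sample, even a moderate coefficient produces a large t statistic, so the size of the coefficient should be read together with its significance.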
This section presents results concerning three issues: (1) the basic Yahoo! Answers’ activity data; (2) data regarding the differences between categories with and without active users; and, (3) data regarding the active users’ network and the correlation between the dynamics of this network and the change in the activities of Yahoo! Answers.
6.1. General data
1. The number of categories in which each active user participated is presented in Table 1.
Table 1: Distribution of categories receiving contributions per active user on Yahoo! Answers.

| Number of categories in which each active user participated | Number of active users | Percentage |
| --- | --- | --- |
| 1 | 2,997 | 83.9% |
| 2 | 425 | 11.9% |
| 3 | 98 | 2.7% |
| 4 | 26 | 0.7% |
| 5 | 15 | 0.4% |
| 6–10 | 10 | 0.2% |
| Total | 3,571 | 100% |
These results are in line with Chen and Nayak’s (2008, 2012) findings, that most answerers prefer to participate in a single (topic) category.
2. More than half (56 percent) of the active users were nominated as active users for fewer than nine months; only 14 percent were active users for most of the relevant time period. The distribution of active users’ duration is presented in Table 2. The detailed distribution is presented in Table 11 in the Appendix.
Table 2: Distribution of duration of active users.

| Duration, in months | Active users | Percentage |
| --- | --- | --- |
| 6–9 | 1,999 | 56.0% |
| 9–12 | 664 | 18.5% |
| 12–15 | 409 | 11.5% |
| 15–19 | 499 | 14.0% |
| Total | 3,571 | 100% |
A significant weak negative correlation was found between the duration of the contributions of active users nominated as “Best Answerer” and (1) the number of all “Best Answers” in the category (-0.18) and, (2) the number of users in the category (-0.23). These results might indicate that the length of time a user remains an active user and, hence, the potential of that user’s “Best Answers” to influence the category, weaken as the category grows (with more users and more “Best Answers”).
6.2. Activity differences between categories with and without active users
1. Active users’ activities and size and volume of their categories
A significant high positive correlation (0.759, N=429) was found between the number of “Best Answers” given by the active users and (1) the number of “Best Answers” and, (2) the number of questions in the category.
A significant medium positive correlation (0.6, N=429) was found between the number of “Best Answers” given by the active users and the number of answers in the category.
A significant medium positive correlation (0.52, N=139) was found between the number of questions asked by the active users and the volume of questions that the category generated.
A significant strong positive correlation (0.71, N=139) was found between the number of questions asked by the active users and the number of answers offered in the category.
Table 3 presents the correlation results between the active users’ activities and the size of categories.
Table 3: Correlations between number of active users and the number of users in the relevant categories.

| Variable | Statistic | Number of best answers | Number of askers | Total users in category |
| --- | --- | --- | --- | --- |
| Number of best answers | Pearson correlation | 1 | .266** | .401** |
| | Sig. (2–tailed) | | .002 | .000 |
| | N | 429 | 139 | 429 |
| Number of askers | Pearson correlation | .266** | 1 | -.011 |
| | Sig. (2–tailed) | .002 | | .901 |
| | N | 139 | 139 | 139 |
| Total users in category | Pearson correlation | .401** | -.011 | 1 |
| | Sig. (2–tailed) | .000 | .901 | |
| | N | 429 | 139 | 429 |

Note: ** Significant at the 0.01 level (2–tailed).
2. The proportional–to–size effect of the active users
A correlation test was performed between the relative share of the active users’ activities and the category’s activity volume. Two types of activities were tested: supplying “Best Answers” and asking questions.
A significant weak negative correlation was found between the relative share of the “Best Answers” of the active users and the total number of “Best Answers” in the category (-0.21), the number of users in the category (-0.264), and the number of questions in the category (-0.209). This result might indicate that active users may be a victim of their own success: the contribution and, hence, the importance of the “Best Answers” to the category diminishes as category size increases.
Next, we compared the size of categories with active users and those without active users. In the 429 categories where active users were found: (1) the average number of users was much higher (320k users versus 2.1k); (2) the number of questions asked was much higher (160k versus 0.69k); (3) the total number of answers was much higher (799k on average per category versus 2.8k); and, (4) the total number of “Best Answers” was much higher (140k versus 0.7k). The same results are apparent in the median and maximum figures.
The results suggest that a correlation exists between the category’s size and volume and the presence of active users. The main results are presented in Table 7, Differences in Categories with and without Active Users, in the Appendix.
6.3. The active users’ network topology
We built a monthly network consisting of categories and their active users as nodes and the activities as edges. The network describes how Yahoo! Answers is composed of many categories connected by active users. A path can exist only between an active user and a category, meaning that a user was active, i.e., asked or answered in this content category during the specific month. Theoretically, if each user asks and answers in one specific content category, the network would have no links. If Yahoo! Answers’ users are active in many content categories, the network would be well connected and dense.
The tables to follow present the analyses of the weighted, non–directional, bipartite active users’ network in the course of 19 activity months. As mentioned, our interest lay in the change in the topological parameters of the active users’ network and its correlation with the change in the size and volume of Yahoo! Answers.
The selection of each parameter and its ability to predict a change in Yahoo! Answers’ activity are rooted in the work of Barabási, et al. (2002). NWB software was used to graphically analyze the network parameters. The main parameters of the active users’ network are as follows:
- Diameter — The shortest path between the furthest connected nodes.
- Number of nodes — The number of categories and their active users each month.
- Number of edges — The number of active users’ activities (asking and answering) each month.
- Maximum weight — The highest number of “Best Answers” or questions that were answered or asked by an active user in a category in a specific month.
- Average weight — The mean number of “Best Answers” or questions that were answered or asked by an active user in a category in a specific month.
- Average degree — The average number of categories in which the active user participated, or the average number of active users in each category in a specific month.
- GCC — The highest number of connected categories (even temporarily) in a specific month.
- Density — The number of active users who participated in a category divided by the maximum participation potential.
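The parameters listed above can be illustrated with a short, self-contained Python sketch. It is not the NWB computation used in the study: the input layout (a mapping from `(user_id, category)` to a participation count) and the function name are hypothetical, the diameter is assumed to be measured inside the largest connected component, and the density is assumed to be the ordinary undirected density 2E / (N(N - 1)).

```python
from collections import defaultdict, deque


def network_parameters(weighted_edges):
    """Compute basic parameters of one monthly user-category network.

    weighted_edges maps (user_id, category) -> participation count
    (a hypothetical input layout, not the authors' data format).
    """
    if not weighted_edges:
        raise ValueError("at least one edge is required")

    # Prefix node labels so a user and a category can never collide.
    adjacency = defaultdict(set)
    for user_id, category in weighted_edges:
        user_node, category_node = ("user", user_id), ("category", category)
        adjacency[user_node].add(category_node)
        adjacency[category_node].add(user_node)

    def hop_distances(source):
        """Breadth-first search: hop count from source to each reachable node."""
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        return distance

    # Find the largest connected component (GCC).
    visited, giant = set(), set()
    for node in adjacency:
        if node not in visited:
            component = set(hop_distances(node))
            visited |= component
            if len(component) > len(giant):
                giant = component

    n_nodes, n_edges = len(adjacency), len(weighted_edges)
    weights = list(weighted_edges.values())
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "max_weight": max(weights),
        "average_weight": sum(weights) / n_edges,
        "average_degree": 2 * n_edges / n_nodes,
        "gcc_size": len(giant),
        # Assumption: diameter is measured inside the largest component only.
        "diameter": max(max(hop_distances(n).values()) for n in giant),
        # Assumption: ordinary undirected density, 2E / (N(N - 1)).
        "density": 2 * n_edges / (n_nodes * (n_nodes - 1)),
    }
```

The all-pairs breadth-first search is quadratic in the component size, which is acceptable for a network of a few thousand nodes but would need a faster method for the full site.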
Table 4 presents the descriptive statistics of these eight parameters of the active users’ network during the study’s 19 months. Table 8 in the Appendix presents the network analysis for each activity month.
The degree (k) and the density of participation were quite static during all 19 months of activity. The explanation for this phenomenon is quite simple. Since we dictated the number of users — 20 askers and 20 best answerers — it may be said that as long as they were active at least 30 times per month for six months, we almost dictated the number of categories in which they were involved; therefore, the degree (through all 19 months) was between 2.07 and 2.14. Regarding a stable density, we suggested earlier that the fact that almost 84 percent of the active users were involved in only one category should lead to a relatively sparse network. Figure 1 presents graphically the behavior over time of the main topological parameters.
Figure 1: Topological parameters of active users’ behavior over 19 months of activity (see Table 8 in the Appendix).
Since the average degree (k) and the density were quite static during all 19 months of activity, these parameters cannot explain any changes in the size or volume of the Yahoo! Answers network. However, other parameters varied during the activity months, and these might be correlated with Yahoo! Answers’ activity changes. See Table 8 in the Appendix for full details.
Table 4: Active users’ topological parameters, descriptive statistics for 19 months of activity.

| Parameter | Minimum | Maximum | Mean | Std. deviation |
| --- | --- | --- | --- | --- |
| Diameter | 18 | 27 | 22.68 | 2.35 |
| Nodes | 4,514 | 5,064 | 4,796 | 149.9 |
| Edges | 4,725 | 5,330 | 5,030 | 171.5 |
| Highest link weight (number of interactions) | 1,271 | 2,904 | 1,997 | 419.4 |
| Average link weight (number of interactions) | 89 | 99.9 | 94.91 | 2.75 |
| Average degree (number of connecting users, k) | 2.07 | 2.14 | 2.09 | .016 |
| GCC (number of connected users and categories) | 3,714 | 4,316 | 4,042 | 178.82 |
| Density | .00042 | .00046 | .000437 | .00001 |
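As a rough consistency check on Table 4, the mean numbers of nodes and edges can be plugged into the standard formulas for an undirected graph, average degree 2E / N and density 2E / (N(N - 1)). That the reported values follow these formulas is an assumption, and a ratio of means is only an approximation of a mean of monthly ratios. The function name is illustrative.

```python
def degree_and_density(n_nodes, n_edges):
    """Average degree and density of an undirected graph with N nodes, E edges."""
    average_degree = 2 * n_edges / n_nodes
    density = 2 * n_edges / (n_nodes * (n_nodes - 1))
    return average_degree, density


# Mean monthly values of nodes and edges taken from Table 4.
average_degree, density = degree_and_density(4796, 5030)
```

Under these assumptions the average degree comes out at about 2.10 and the density at about 0.000437, close to the means reported in the table.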
6.4. Correlation of the active users’ network with Yahoo! Answers’ overall activities and the ability to depict the site
We explored several parameters of Yahoo! Answers: the Total number of Questions, Answers, “Best Answers,” users who asked and answered, users who only asked, users who only answered. The full correlation test results are presented in Table 10 in the Appendix. The main results (all with alpha <0.01) follow:
A. The number of categories in the active users’ network is positively correlated with (1) the total number of questions asked and (2) the total number of users asking questions.
B. The number of active users in the active users’ network is positively correlated with (1) the total number of questions asked and (2) the total number of users asking questions.
C. The contributions of the active user to the content categories in the active users’ network are positively correlated with (1) the total number of answers; (2) the total number of users answering; (3) the total number of users both asking and answering; and, (4) the total number of users on Yahoo! Answers.
D. The average number of content categories in which the active users were active is positively correlated with (1) the number of answers; (2) the number of users asking; (3) the number of users both asking and answering; and, (4) the total number of users on Yahoo! Answers.
E. The number of categories that were visited by the same users (the GCC of the users’ network) is positively correlated with (1) the total number of questions; (2) the total number of “Best Answers”; and, (3) the number of askers on Yahoo! Answers.
F. The actual activity of active users in relation to the potential of their activity (i.e., the density of the active users’ network) is negatively correlated with all size and volume parameters of Yahoo! Answers. This means that as more and more people participate in Yahoo! Answers’ activities, the interconnections among active users become sparser.
7. Discussion and conclusions
Active users play an important role in Yahoo! Answers’ processes. A user can post a question to a specific category and, soon enough, answers will arrive. The active users are stable and active, and their activity correlates positively with the activity of the whole network. This is why we conclude that an examination of the topology of the active users’ network might depict the dynamics of all activities on Yahoo! Answers.
7.1. H1 findings
The data support the H1 assumption. There are several major differences between categories with active users and categories without active users. The former had more users, questions, answers, and Best Answers. Moreover, the total number of questions per user, the total number of answers per user, and the total of Best Answers per user were dramatically different (0.32 versus 0.49, 1.3 versus 2.4, and 0.32 versus 0.42, respectively).
These results confirm that a connection exists between the presence of active users and overall activity and that exploring the active users is worthwhile. This finding might also suggest two attendant alternative explanations:
Active users influence the parameters of categories. It might be sufficient for an active user to be moderately active (once a day for six out of nineteen months) in order to influence a category’s size parameters. However, a higher level of contribution or longer periods as an active user do not increase the volume of activity in the category. Thus, it seems that even if active users do influence a category’s activity, that influence is quite limited and has an upper bound.
Categories create their active users. Categories with a high volume of activity and many participants create the needed social capital enabling users to deliver more inputs daily, mostly as “Best Answers,” and gradually to become active users.
7.2. H2 findings
Since the active users are dominant nodes in the network, it can be beneficial to study their activities and their network in order to learn more about the entire network. In H2, we were looking for a correlation between changes in the parameters of the active users’ network and changes in Yahoo! Answers’ activities. Such a correlation might imply that we can obtain valuable insights into the entire network from just learning about the active users’ activities and their network’s topology.
Indeed, such correlations were found, and therefore the active users’ network can reflect the activity on the Yahoo! Answers network. Our findings follow the work of Barabási, et al. (2002), according to which network growth over the years actually increased the average degree and the GCC and decreased the diameter and the clustering coefficient (CC). This is in contrast to the assumption of conventional models that the average distance should increase slowly as the network grows (like O(log n)). Although Barabási, et al.’s (2002) study explored real–life networks and not online Q&A–site–based social networks, the present study’s findings express the same dynamics. While the entire Yahoo! Answers’ network experienced an increase in number of users and number of activities, the diameter and the GCC of its active users’ activity grew as well. At the same time, the CC of the active users became smaller.
It is worth mentioning that the only parameter in our study that did not follow Barabási, et al.’s (2002) findings is the diameter. The explanation for this is that in the active users’ network, the diameter has no real meaning. Since the users act according to their changing needs, they enter and leave the SQA site as they please; there is no actual “path” between the users and, hence, no real diameter. We could measure a diameter in the active users’ network, but its value would be meaningless.
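Component-based parameters of the kind discussed above, such as the number of weakly connected components (WCC) and the number of nodes in the GCC (read here as the giant connected component), can be computed directly from a list of user–category links. The following is a minimal sketch under stated assumptions: the network is treated as an undirected graph whose nodes are users and categories, identifiers are unique across both groups, and the edge list is an invented toy example rather than data from the study.

```python
from collections import Counter


def component_sizes(edges):
    """Return the sizes of all connected components, largest first."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        # Path halving keeps the union-find trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for user, category in edges:
        root_user, root_category = find(user), find(category)
        if root_user != root_category:
            parent[root_user] = root_category

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)


# Hypothetical (user, category) links; not taken from the Yahoo! Answers data.
edges = [
    ("u1", "Cars"), ("u2", "Cars"), ("u2", "Pets"),
    ("u3", "Pets"), ("u4", "Travel"), ("u5", "Travel"),
    ("u6", "Cooking"),
]

sizes = component_sizes(edges)
print("WCC count:", len(sizes))      # WCC count: 3
print("Nodes in GCC:", sizes[0])     # Nodes in GCC: 5
```

Union-find is used here only because it needs no external library; a graph package would give the same counts.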
For the most part, our findings are applicable to researchers who wish to investigate dynamic changes, such as growth or decay in huge social networks. Analyzing huge networks with millions of nodes and links is a complicated and sometimes even impossible task in real life. Exploring a small subset of the entire population and applying the results to understand the entire network can save considerable time and money.
We suggested that in order to understand the change in size and growth dynamics of a huge network such as Yahoo! Answers (20M monthly interactions), one could explore the active users’ topology (50k monthly interactions). Since the two differ by more than two orders of magnitude (a factor of roughly 400), exploring the active users’ network should save time and money.
8. Study limitations and future work
Our study was based on Yahoo! Answers data only; consequently, the generalization of the correlations should be weighed carefully. Might these findings be relevant only to this platform or to huge Q&A sites alone? Future research should examine active users in other social networks, such as Facebook or LinkedIn, to explore the external validity of the conclusions reached here.
Since we did not develop a conceptual model that included causality to understand the direction of the mutual correlations in our hypotheses, we tested only correlations. Two regression models might be applied here in the future: the first to present the causality and the connections between the active users’ activities and categories’ size and volume; the second to uncover the causality between the active users’ topological parameters and the size and volume parameters of the social network.
We are aware that choosing an activity threshold inevitably causes dramatically different active users’ network structures (De Choudhury, et al., 2010). We defined an active user as one who (1) acted at least on a daily basis and (2) appeared at least six months out of 19 months of activity (the duration of the study). These definitions generate successful results, and yet a future study might suggest alternative definitions and put forth other actions as preconditions to be considered. We believe that other insights might arise from alternative thresholds in determining active users and that a sensitivity test regarding the volume of the contributions of the active users can be administered.
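The two-part definition of an active user, and the sensitivity test suggested above, can be sketched as a small filter with adjustable thresholds. This is an illustration only. The record layout, the function name `find_active_users`, and the rule that a month qualifies when the user averages at least one action per day are all assumptions, not the procedure used in the study.

```python
from calendar import monthrange
from collections import defaultdict


def find_active_users(log, min_months=6, min_daily_rate=1.0):
    """Return users who qualify as active in at least `min_months` months.

    `log` holds (user_id, year, month, actions_in_month) records.
    Assumption: a month qualifies when actions / days_in_month is at
    least `min_daily_rate`; the study's exact rule may differ.
    """
    qualifying_months = defaultdict(int)
    for user_id, year, month, actions in log:
        days_in_month = monthrange(year, month)[1]
        if actions / days_in_month >= min_daily_rate:
            qualifying_months[user_id] += 1
    return {user for user, count in qualifying_months.items()
            if count >= min_months}


# Hypothetical activity log; the users and counts are invented.
log = (
    [("alice", 2009, m, 40) for m in range(1, 8)]      # 7 qualifying months
    + [("bob", 2009, m, 40) for m in range(1, 4)]      # 3 qualifying months
    + [("carol", 2009, m, 10) for m in range(1, 13)]   # 12 months, too few actions
)

print(sorted(find_active_users(log)))                # ['alice']
print(sorted(find_active_users(log, min_months=3)))  # ['alice', 'bob']
```

Re-running the filter with different values of `min_months` and `min_daily_rate` is one simple way to see how strongly the resulting set of active users depends on the chosen threshold.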
About the authors
Amit Rechavi is Lecturer in the Graduate School of Management at the University of Haifa, Israel.
E–mail: Amit [dot] rechavi [at] gmail [dot] com
Sheizaf Rafaeli is Professor and Director of the Center for Internet Research at the University of Haifa, Israel.
E–mail: Sheizaf [at] rafaeli [dot] net
Acknowledgments
We wish to thank Ricardo Baeza–Yates of Yahoo! Research Barcelona and Yoelle Maarek and her team from Yahoo! Labs Haifa for sharing their knowledge and for their support. We also acknowledge with thanks the financial support of the Center for Internet Research at the University of Haifa, Israel.
Notes
1. NWB Team, 2006. “Network Workbench Tool: Indiana University, Northeastern University, and University of Michigan,” at http://nwb.slis.indiana.edu.
References
L.A. Adamic, J. Zhang, E. Bakshy, and M.S. Ackerman, 2008. “Knowledge sharing and Yahoo Answers: Everyone knows something,” WWW ’08: Proceedings of the 17th International Conference on World Wide Web, pp. 665–674.
doi: http://dx.doi.org/10.1145/1367497.1367587, accessed 27 July 2014.
N. Agarwal, H. Liu, L. Tang, and P.S. Yu, 2008. “Identifying the influential bloggers in a community,” WSDM ’08: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 207–218.
doi: http://dx.doi.org/10.1145/1341531.1341559, accessed 27 July 2014.
E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, 2007. “Finding high–quality content in social media with an application to community–based question answering,” Yahoo! Research Report, number YR–2007–005 (25 September), at http://labs.yahoo.com/files/2007-005_Agichtein.pdf, accessed 27 July 2014.
L. Backstrom, R. Kumar, C. Marlow, J. Novak, and A. Tomkins, 2008. “Preferential behavior in online groups,” WSDM ’08: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 117–128.
doi: http://dx.doi.org/10.1145/1341531.1341549, accessed 27 July 2014.
A.L. Barabási, 2003. Linked: How everything is connected to everything else and what it means for business, science, and everyday life. New York: Plume.
A.L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek, 2002. “Evolution of the social network of scientific collaborations,” Physica A: Statistical Mechanics and its Applications, volume 311, numbers 3–4, pp. 590–614.
doi: http://dx.doi.org/10.1016/S0378-4371(02)00736-7, accessed 27 July 2014.
M. Bouguessa, B. Dumoulin, and S. Wang, 2008. “Identifying authoritative actors in question–answering forums: The case of Yahoo! Answers,” KDD ’08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 866–874.
doi: http://dx.doi.org/10.1145/1401890.1401994, accessed 27 July 2014.
P.B. Brandtzæg and J. Heim, 2008. “User loyalty and online communities: Why members of online communities are not faithful,” INTETAIN ’08: Proceedings of the Second international Conference on INtelligent TEchnologies for interactive enterTAINment, article number 11.
L. Chen and R. Nayak, 2012. “Leveraging the network information for evaluating answer quality in a collaborative question answering portal,” Social Network Analysis and Mining, volume 2, number 3, pp. 197–215.
doi: http://dx.doi.org/10.1007/s13278-011-0046-4, accessed 27 July 2014.
L. Chen and R. Nayak, 2008. “Expertise analysis in a question answer portal for author ranking,” WI–IAT ’08: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, volume 1, pp. 134–140.
doi: http://dx.doi.org/10.1109/WIIAT.2008.12, accessed 27 July 2014.
N.A. Christakis and J.H. Fowler, 2010. “Social network sensors for early detection of contagious outbreaks,” PLoS ONE, volume 5, number 9, e12948.
doi: http://dx.doi.org/10.1371/journal.pone.0012948, accessed 27 July 2014.
M. S–Y. Chwe, 1999. “Structure and strategy in collective action,” American Journal of Sociology, volume 105, number 1, pp. 128–156.
M. De Choudhury, W.A. Mason, J.M. Hofman, and D.J. Watts, 2010. “Inferring relevant social networks from interpersonal communication,” WWW ’10: Proceedings of the 19th International Conference on World Wide Web, pp. 301–310.
doi: http://dx.doi.org/10.1145/1772690.1772722, accessed 27 July 2014.
D. Ganley and C. Lampe, 2009. “The ties that bind: Social network principles in online communities,” Decision Support Systems, volume 47, number 3, pp. 266–274.
doi: http://dx.doi.org/10.1016/j.dss.2009.02.013, accessed 27 July 2014.
R. Gazan, 2011. “Social Q&A,” Journal of the American Society for Information Science and Technology, volume 62, number 12, pp. 2,301–2,312.
doi: http://dx.doi.org/10.1002/asi.21562, accessed 27 July 2014.
J–L. Guillaume and M. Latapy, 2006. “Bipartite graphs as models of complex networks,” Physica A: Statistical Mechanics and its Applications, volume 371, number 2, pp. 795–813.
doi: http://dx.doi.org/10.1016/j.physa.2006.04.047, accessed 27 July 2014.
F.M. Harper, D. Moy, and J.A. Konstan, 2009. “Facts or friends? Distinguishing informational and conversational questions in social Q&A sites,” CHI ’09: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 759–768.
doi: http://dx.doi.org/10.1145/1518701.1518819, accessed 27 July 2014.
F.M. Harper, D. Raban, S. Rafaeli, and J.A. Konstan, 2008. “Predictors of answer quality in online Q&A sites,” CHI ’08: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 865–874.
doi: http://dx.doi.org/10.1145/1357054.1357191, accessed 27 July 2014.
P. Jurczyk and E. Agichtein, 2007. “Discovering authorities in question answer communities by using link analysis,” CIKM ’07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 919–922.
doi: http://dx.doi.org/10.1145/1321440.1321575, accessed 27 July 2014.
E. Kerr, 1986. “Electronic leadership: A guide to moderating online conferences,” IEEE Transactions on Professional Communications, volume PC–29, number 1, pp. 12–18.
doi: http://dx.doi.org/10.1109/TPC.1986.6449009, accessed 27 July 2014.
S. Kim, J.S. Oh, and S. Oh, 2007. “Best–answer selection criteria in a social Q&A site from the user–oriented relevance perspective,” Proceedings of the American Society for Information Science and Technology, volume 44, number 1, pp. 1–15.
doi: http://dx.doi.org/10.1002/meet.1450440256, accessed 27 July 2014.
N. Kitsak, L.K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H.E. Stanley, and H.A. Makse, 2010. “Identification of influential spreaders in complex networks,” Nature Physics, volume 6, number 11, pp. 888–893.
doi: http://dx.doi.org/10.1038/nphys1746, accessed 27 July 2014.
J. Koh, Y.–G. Kim, B. Butler, and G.–W. Bock, 2007. “Encouraging participation in virtual communities,” Communications of the ACM, volume 50, number 2, pp. 68–73.
doi: http://dx.doi.org/10.1145/1216016.1216023, accessed 27 July 2014.
G. Kossinets and D.J. Watts, 2006. “Empirical analysis of an evolving social network,” Science, volume 311, number 5757 (6 January), pp. 88–90.
doi: http://dx.doi.org/10.1126/science.1116869, accessed 27 July 2014.
K.R. Lakhani and E. von Hippel, 2003. “How open source software works: ‘Free’ user–to–user assistance,” Research Policy, volume 32, number 6, pp. 923–943.
doi: http://dx.doi.org/10.1016/S0048-7333(02)00095-1, accessed 27 July 2014.
Y. Liu and E. Agichtein, 2008. “You’ve got answers: Towards personalized models for predicting success in community question answering,” Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 97–100; version at http://www.mathcs.emory.edu/~eugene/papers/acl08s_cqa-personalization-prelim.pdf, accessed 27 July 2014.
G. Marwell, P.E. Oliver, and R. Prahl, 1988. “Social networks and collective action: A theory of the critical mass. III,” American Journal of Sociology, volume 94, number 3, pp. 502–534.
A. Mislove, M. Marcon, K.P. Gummadi, P. Druschel, and B. Bhattacharjee, 2007. “Measurement and analysis of online social networks,” IMC ’07: Proceedings of the Seventh ACM SIGCOMM Conference on Internet Measurement, pp. 29–42.
doi: http://dx.doi.org/10.1145/1298306.1298311, accessed 27 July 2014.
D.L. Morgan, M.B. Neal, and P. Carder, 1997. “The stability of core and peripheral networks over time,” Social Networks, volume 19, number 1, pp. 9–25.
doi: http://dx.doi.org/10.1016/S0378-8733(96)00288-2, accessed 27 July 2014.
T. Murata and S. Moriyasu, 2007. “Link prediction of social networks based on weighted proximity measures,” WI ’07: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 85–88.
K.K. Nam, M.S. Ackerman, and L.A. Adamic, 2009. “Questions in, knowledge in? A study of Naver’s question–answering community,” CHI ’09: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 779–788.
doi: http://dx.doi.org/10.1145/1518701.1518821, accessed 27 July 2014.
A. Nazir, S. Raza, and C.N. Chuah, 2008. “Unveiling Facebook: A measurement study of social network–based applications,” IMC ’08: Proceedings of the Eighth ACM SIGCOMM Conference on Internet Measurement, pp. 43–56.
doi: http://dx.doi.org/10.1145/1452520.1452527, accessed 27 July 2014.
J. Nichols and L.–H. Kang, 2012. “Asking questions of targeted strangers on social networks,” CSCW ’12: Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, pp. 999–1,002.
doi: http://dx.doi.org/10.1145/2145204.2145352, accessed 27 July 2014.
G. Palla, A.L. Barabási, and T. Vicsek, 2007. “Community dynamics in social networks,” Noise and stochastics in complex systems and finance: 21–24 May 2007, Florence, Italy, Proceedings of SPIE, volume 6601.
doi: http://dx.doi.org/10.1117/12.724517, accessed 27 July 2014.
M. Papagelis, V. Murdock, and R. van Zwol, 2011. “Individual behavior and social influence in online social systems,” HT ’11: Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, pp. 241–250.
doi: http://dx.doi.org/10.1145/1995966.1995998, accessed 27 July 2014.
S.A. Paul, L. Hong, and E.H. Chi, 2011. “Is Twitter a good place for asking questions? A characterization study,” Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, at https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2813/3225, accessed 27 July 2014.
G. Ravid and S. Rafaeli, 2004. “Asynchronous discussion groups as small world and scale free networks,” First Monday, volume 9, number 9, at http://firstmonday.org/article/view/1170/1090, accessed 27 July 2014.
M.E. Rodrigues and N. Milic–Frayling, 2009. “Socializing or knowledge sharing? Characterizing social intent in community question answering,” CIKM ’09: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1,127–1,136.
doi: http://dx.doi.org/10.1145/1645953.1646096, accessed 27 July 2014.
D. Romero, W. Galuba, S. Asur, and B. Huberman, 2011. “Influence and passivity in social media,” In: D. Gunopulos, T. Hofmann, D. Malerba, and M. Vazirgiannis (editors). Machine learning and knowledge discovery in databases. Lecture Notes in Computer Science, volume 6913, pp. 18–33.
doi: http://dx.doi.org/10.1007/978-3-642-23808-6_2, accessed 27 July 2014.
C. Shah, J.S. Oh, and S. Oh, 2008. “Exploring characteristics and effects of user participation in online social Q&A sites,” First Monday, volume 13, number 9, at http://firstmonday.org/article/view/2182/2028, accessed 27 July 2014.
X. Shi, L.A. Adamic, and M.J. Strauss, 2007. “Networks of strong ties,” Physica A: Statistical Mechanics and its Applications, volume 378, number 1, pp. 33–47.
doi: http://dx.doi.org/10.1016/j.physa.2006.11.072, accessed 27 July 2014.
X. Shi, M. Bonner, L.A. Adamic, and A.C. Gilbert, 2008. “The very small world of the well–connected,” HT ’08: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, pp. 61–70.
doi: http://dx.doi.org/10.1145/1379092.1379108, accessed 27 July 2014.
V. Soroka and S. Rafaeli, 2006. “Invisible participants: How cultural capital relates to lurking behavior,” WWW ’06: Proceedings of the 15th international conference on World Wide Web, pp. 163–172.
doi: http://dx.doi.org/10.1145/1135777.1135806, accessed 27 July 2014.
T.C. Turner, M.A. Smith, D. Fisher, and H.T. Welser, 2005. “Picturing Usenet: Mapping computer–mediated collective action,” Journal of Computer–Mediated Communication, volume 10, number 4.
doi: http://dx.doi.org/10.1111/j.1083-6101.2005.tb00270.x, accessed 27 July 2014.
T.W. Valente and R.L. Davis, 1999. “Accelerating the diffusion of innovations using opinion leaders,” Annals of the American Academy of Political and Social Science, volume 566, number 1, pp. 55–67.
B. Viswanath, A. Mislove, M. Cha, and K.P. Gummadi, 2009. “On the evolution of user interaction in Facebook,” WOSN ’09: Proceedings of the 2nd ACM workshop on Online social networks, pp. 37–42.
doi: http://dx.doi.org/10.1145/1592665.1592675, accessed 27 July 2014.
S. Wasserman and K. Faust, 1994. Social network analysis: Methods and applications. Cambridge: Cambridge University Press.
D.J. Watts and P.S. Dodds, 2007. “Influentials, networks, and public opinion formation,” Journal of Consumer Research, volume 34, number 4, pp. 441–458.
doi: http://dx.doi.org/10.1086/518527, accessed 27 July 2014.
G. Weimann, 1994. The influentials: People who influence people. Albany: State University of New York Press.
H.T. Welser, E. Gleave, D. Fisher, and M. Smith, 2007. “Visualizing the signatures of social roles in online discussion groups,” Journal of Social Structure, volume 8, number 2, pp. 564–586, and at http://www.cmu.edu/joss/content/articles/volume8/Welser/, accessed 27 July 2014.
M. Wilson, 2010. “Using the friendship paradox to sample a social network,” Physics Today, volume 63, number 11, pp. 15–16.
doi: http://dx.doi.org/10.1063/1.3518199, accessed 27 July 2014.
J. Yang, X. Wei, M.S. Ackerman, and L.A. Adamic, 2010. “Activity lifespan: An analysis of user survival patterns in online knowledge sharing communities,” Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pp. 186–193.
D. Zinoviev, 2008. “Topology and geometry of online social networks,” WMSCI ’08: Proceedings of the 12th World Multi–Conference on Systemics, Cybernetics, and Informatics, volume 4, pp. 138–143; version at http://arxiv.org/abs/0807.3996, accessed 27 July 2014.
Table 5: Correlations between active users’ activity and relevant categories’ volume of activity.
Note: ** Significant at the 0.01 level (2–tailed).
| | | Mean contribution as best answerer | Mean contribution as asker | Total answers in category | Total Best Answers in category | Total questions in category |
|---|---|---|---|---|---|---|
| Mean contribution as Best Answer | Pearson correlation | 1 | .578** | .601** | .759** | .759** |
| | Sig. (2–tailed) | | .000 | .000 | .000 | .000 |
| | N | 429 | 139 | 429 | 429 | 429 |
| Mean contribution as asker | Pearson correlation | .578** | 1 | .710** | .585** | .527** |
| | Sig. (2–tailed) | .000 | | .000 | .000 | .000 |
| | N | 139 | 139 | 139 | 139 | 139 |
| Total answers in category | Pearson correlation | .601** | .710** | 1 | .900** | .855** |
| | Sig. (2–tailed) | .000 | .000 | | .000 | .000 |
| | N | 429 | 139 | 429 | 429 | 429 |
| Total Best Answers in the category | Pearson correlation | .759** | .585** | .900** | 1 | .994** |
| | Sig. (2–tailed) | .000 | .000 | .000 | | .000 |
| | N | 429 | 139 | 429 | 429 | 429 |
| Total questions in the category | Pearson correlation | .759** | .527** | .855** | .994** | 1 |
| | Sig. (2–tailed) | .000 | .000 | .000 | .000 | |
| | N | 429 | 139 | 429 | 429 | 429 |
Table 6: Correlations between active users’ share and the relevant categories’ parameters.
Note: ** Significant at the 0.01 level (2–tailed).
| | | Number of users | Share of best answerers | Number of Best Answers | Number of answers | Number of questions |
|---|---|---|---|---|---|---|
| Number of users in category | Pearson correlation | 1 | -.264** | .916** | .710** | .938** |
| | Sig. (2–tailed) | | .000 | .000 | .000 | .000 |
| | N | 429 | 429 | 429 | 429 | 429 |
| Share of best answerers in category | Pearson correlation | -.264** | 1 | -.210** | -.133** | -.209** |
| | Sig. (2–tailed) | .000 | | .000 | .006 | .000 |
| | N | 429 | 429 | 429 | 429 | 429 |
| Number of Best Answers | Pearson correlation | .916** | -.210** | 1 | .900** | .994** |
| | Sig. (2–tailed) | .000 | .000 | | .000 | .000 |
| | N | 429 | 429 | 429 | 429 | 429 |
| Number of answers in category | Pearson correlation | .710** | -.133** | .900** | 1 | .855** |
| | Sig. (2–tailed) | .000 | .006 | .000 | | .000 |
| | N | 429 | 429 | 429 | 429 | 429 |
| Number of questions in category | Pearson correlation | .938** | -.209** | .994** | .855** | 1 |
| | Sig. (2–tailed) | .000 | .000 | .000 | .000 | |
| | N | 429 | 429 | 429 | 429 | 429 |
Table 7: Correlations between active users’ share and the relevant categories’ parameters.
| Categories with and without active users | Statistic | Mean number of users | Mean number of questions | Mean number of answers | Mean number of Best Answers |
|---|---|---|---|---|---|
| With active user | Mean | 326,369 | 160,026 | 799,279 | 140,238 |
| | N | 429 | 429 | 429 | 429 |
| | Standard deviation | 544,514 | 342,814 | 2,748,595 | 297,814 |
| | Median | 117,310 | 47,504 | 185,969 | 43,102 |
| | Maximum | 5,217,921 | 3,659,713 | 44,841,364 | 3,575,315 |
| Without active user | Mean | 2,119 | 699 | 2,808 | 711 |
| | N | 1,221 | 1,222 | 1,217 | 1,217 |
| | Standard deviation | 6,650 | 1,811 | 8,951 | 1,753 |
| | Median | 454 | 127 | 416 | 132 |
| | Maximum | 149,672 | 28,057 | 203,634 | 25,681 |
Table 8: Active users’ network parameters over 19 activity months.
| Month–Year | Diameter | Nodes | Edges | Max. weight | Mean weight | Average degree | WCC count | No. of categories and users in GCC | Density |
|---|---|---|---|---|---|---|---|---|---|
| 01–2009 | 22 | 4,685 | 4,895 | 1,271 | 89 | 2.09 | 116 | 4,011 | 0.00045 |
| 02–2009 | 22 | 4,514 | 4,725 | 2,349 | 89 | 2.1 | 117 | 3,876 | 0.00046 |
| 03–2009 | 22 | 4,923 | 5,180 | 1,538 | 94 | 2.1 | 134 | 4,218 | 0.00043 |
| 04–2009 | 22 | 4,748 | 5,023 | 1,523 | 93.6 | 2.11 | 123 | 4,119 | 0.00045 |
| 05–2009 | 20 | 4,825 | 5,165 | 2,108 | 95.7 | 2.14 | 108 | 4,291 | 0.00044 |
| 06–2009 | 25 | 4,933 | 5,192 | 1,774 | 93.4 | 2.11 | 122 | 4,204 | 0.00043 |
| 08–2009 | 26 | 5,064 | 5,330 | 2,196 | 94.5 | 2.11 | 113 | 4,266 | 0.00042 |
| 09–2009 | 23 | 4,898 | 5,157 | 2,471 | 95.6 | 2.11 | 121 | 4,055 | 0.00043 |
| 10–2009 | 24 | 4,888 | 5,158 | 1,815 | 97.5 | 2.11 | 118 | 4,316 | 0.00043 |
| 11–2009 | 22 | 4,779 | 5,000 | 1,760 | 93 | 2.09 | 130 | 3,981 | 0.00044 |
| 12–2009 | 18 | 4,543 | 4,769 | 1,717 | 94.9 | 2.1 | 138 | 3,808 | 0.00046 |
| 01–2010 | 24 | 4,739 | 4,948 | 2,406 | 95.0 | 2.09 | 147 | 3,774 | 0.00044 |
| 02–2010 | 20 | 4,746 | 4,949 | 2,155 | 93.5 | 2.08 | 142 | 3,714 | 0.00044 |
| 03–2010 | 23 | 5,050 | 5,301 | 1,779 | 96.8 | 2.1 | 136 | 4,160 | 0.00042 |
| 04–2010 | 27 | 4,738 | 4,984 | 1,615 | 97.6 | 2.08 | 145 | 3,984 | 0.00044 |
| 05–2010 | 26 | 4,689 | 4,869 | 1,776 | 96.6 | 2.07 | 151 | 3,896 | 0.00044 |
| 06–2010 | 21 | 4,655 | 4,838 | 2,444 | 95.9 | 2.08 | 139 | 3,953 | 0.00045 |
| 07–2010 | 20 | 4,919 | 5,109 | 2,904 | 99.9 | 2.08 | 142 | 4,035 | 0.00042 |
| 08–2010 | 24 | 4,792 | 4,987 | 2,356 | 97.9 | 2.08 | 125 | 4,152 | 0.00043 |
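The average degree and density columns of Table 8 can be related to the node and edge counts in the same table. The sketch below assumes the standard definitions for an undirected simple graph, average degree 2E/N and density 2E/(N(N-1)); the paper does not state its formulas in this section, so this is a plausibility check for two sample months rather than a reconstruction of the authors’ computation.

```python
def average_degree(nodes, edges):
    """Mean number of edge endpoints per node in an undirected graph."""
    return 2 * edges / nodes


def density(nodes, edges):
    """Share of all possible undirected node pairs that are linked."""
    return 2 * edges / (nodes * (nodes - 1))


# (month, nodes, edges, reported average degree, reported density), from Table 8.
sample_rows = [
    ("01-2009", 4685, 4895, 2.09, 0.00045),
    ("07-2010", 4919, 5109, 2.08, 0.00042),
]

for month, n, e, reported_degree, reported_density in sample_rows:
    print(month,
          round(average_degree(n, e), 2), "vs reported", reported_degree,
          "|", round(density(n, e), 5), "vs reported", reported_density)
```

For these two months the computed values round to the reported ones; other months may differ in the last digit, depending on how the original figures were rounded.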
Table 9: Correlations of the active users’ network parameters over 19 activity months.
Pearson correlation matrix (rows and columns): Diameter; No. of nodes; No. of edges; Max. weight; Mean weight; Average degree; Nodes in the GCC; Density. [Cell values missing.]
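Pairwise Pearson correlations between monthly network parameters can be computed from the columns of Table 8. The following sketch uses the Nodes, Edges, and Density columns as transcribed from that table; the helper `pearson` is written out by hand so that the example needs only the standard library. It illustrates the calculation and is not claimed to reproduce the values originally reported in Table 9.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Monthly values in Table 8 order (01-2009 ... 08-2010, 19 months).
nodes = [4685, 4514, 4923, 4748, 4825, 4933, 5064, 4898, 4888, 4779,
         4543, 4739, 4746, 5050, 4738, 4689, 4655, 4919, 4792]
edges = [4895, 4725, 5180, 5023, 5165, 5192, 5330, 5157, 5158, 5000,
         4769, 4948, 4949, 5301, 4984, 4869, 4838, 5109, 4987]
density = [0.00045, 0.00046, 0.00043, 0.00045, 0.00044, 0.00043, 0.00042,
           0.00043, 0.00043, 0.00044, 0.00046, 0.00044, 0.00044, 0.00042,
           0.00044, 0.00044, 0.00045, 0.00042, 0.00043]

print("nodes vs edges:  ", round(pearson(nodes, edges), 3))
print("nodes vs density:", round(pearson(nodes, density), 3))
```

With a nearly constant average degree, more nodes go together with more edges and with a lower density, so the first correlation is strongly positive and the second strongly negative.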
Table 10: Correlation between change in active users’ network topology and change in Yahoo! Answers’ activity parameters.
Column headings are only partly legible in the source (“Q”, “Total Answers”, “Users both asking and answering”, “Users only”); the seven data columns are numbered here in their original order.
| Parameter | Statistic | (1) | (2) | (3) | (4) | (5) | (6) | (7) |
|---|---|---|---|---|---|---|---|---|
| Diameter | Pearson correlation | .233 | .032 | -.046 | -.018 | .196 | -.021 | .002 |
| | Sig. (2–tailed) | .336 | .895 | .852 | .940 | .422 | .932 | .995 |
| Nodes | Pearson correlation | .609 | .431 | .191 | .252 | .601 | .139 | .200 |
| | Sig. (2–tailed) | .006 | .066 | .433 | .297 | .006 | .570 | .412 |
| Edges | Pearson correlation | .660 | .502 | .284 | .340 | .674 | .229 | .289 |
| | Sig. (2–tailed) | .002 | .028 | .240 | .154 | .002 | .346 | .230 |
| Weight Max | Pearson correlation | -.246 | -.327 | -.387 | -.270 | -.292 | -.390 | -.378 |
| | Sig. (2–tailed) | .310 | .171 | .102 | .264 | .225 | .098 | .110 |
| Weight Mean | Pearson correlation | -.235 | -.426 | -.656 | -.587 | -.304 | -.679 | -.646 |
| | Sig. (2–tailed) | .333 | .069 | .002 | .008 | .206 | .001 | .003 |
| Average Degree | Pearson correlation | .566 | .566 | .593 | .623 | .670 | .561 | .589 |
| | Sig. (2–tailed) | .012 | .011 | .007 | .004 | .002 | .012 | .008 |
| GCC | Pearson correlation | .753 | .631 | .451 | .474 | .718 | .414 | .460 |
| | Sig. (2–tailed) | .000 | .004 | .052 | .040 | .001 | .078 | .048 |
| Density | Pearson correlation | -.682 | -.684 | -.643 | -.631 | -.708 | -.619 | -.642 |
| | Sig. (2–tailed) | .001 | .001 | .003 | .004 | .001 | .005 | .003 |
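The two-tailed significance levels in Table 10 can be checked against the correlation coefficients with the usual t transformation, t = r·sqrt(n − 2)/sqrt(1 − r²). The sketch below assumes n = 19 paired observations (one per activity month), which the table does not state explicitly, and uses the standard two-tailed critical values of Student’s t for 17 degrees of freedom.

```python
import math

N_MONTHS = 19       # assumption: one observation per activity month
T_CRIT_05 = 2.110   # two-tailed critical t, df = 17, alpha = 0.05
T_CRIT_01 = 2.898   # two-tailed critical t, df = 17, alpha = 0.01


def t_statistic(r, n=N_MONTHS):
    """t value for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def significance(r, n=N_MONTHS):
    """Classify a correlation by the critical values above (valid for n = 19)."""
    t = abs(t_statistic(r, n))
    if t >= T_CRIT_01:
        return "p < .01"
    if t >= T_CRIT_05:
        return "p < .05"
    return "n.s."


# Selected coefficients from Table 10 (first data column, plus Edges in the second).
examples = [("Diameter", 0.233), ("Nodes", 0.609), ("GCC", 0.753),
            ("Density", -0.682), ("Edges, column 2", 0.502)]

for name, r in examples:
    print(f"{name}: r = {r:+.3f}, t = {t_statistic(r):.2f}, {significance(r)}")
```

Under this assumption the classifications agree with the reported Sig. values for the coefficients shown, for example .006 for Nodes and .028 for Edges in the second column.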
Table 11: Distribution of active users in all categories in Yahoo! Answers (in months).
| Time period in months (number of months in a row active users were “active users”) | Frequency | Percentage |
|---|---|---|
| 6.00 | 767 | 21.5 |
| 7.00 | 497 | 13.9 |
| 8.00 | 365 | 10.2 |
| 9.00 | 273 | 7.6 |
| 10.00 | 228 | 6.4 |
| 11.00 | 205 | 5.7 |
| 12.00 | 140 | 3.9 |
| 13.00 | 153 | 4.3 |
| 14.00 | 113 | 3.2 |
| 15.00 | 87 | 2.4 |
| 16.00 | 91 | 2.5 |
| 17.00 | 79 | 2.2 |
| 18.00 | 75 | 2.1 |
| 19.00 | 127 | 3.6 |
| Total | 3,571 | 100.0 |
Received 4 February 2014; accepted 29 July 2014.
Copyright © 2014, First Monday.
Copyright © 2014, Amit Rechavi and Sheizaf Rafaeli.
Active players in a network tell the story: Parsimony in modeling huge networks
by Amit Rechavi and Sheizaf Rafaeli.
First Monday, Volume 19, Number 8 - 4 August 2014