First Monday

A scholarly divide: Social media, Big Data, and unattainable scholarship by Asta Zelenkauskaite and Erik P. Bucy

Recent decades have witnessed an increased growth in data generated by information, communication, and technological systems, giving birth to the ‘Big Data’ paradigm. Despite the profusion of raw data being captured by social media platforms, Big Data require specialized skills to parse and analyze — and even with the requisite skills, social media data are not readily available to download. Thus, the Big Data paradigm has not produced a coincidental explosion of research opportunities for the typical scholar. The promising world of unprecedented precision and predictive accuracy that Big Data conjure remains out of reach for most communication and technology researchers, a problem that traditional platforms, namely mass media, did not present. In this paper, we evaluate the system architecture that supports the storage and retrieval of big social data, distinguishing between overt and covert data types, and how both the cost and control of social media data limit opportunities for research. Ultimately, we illuminate a curious but growing ‘scholarly divide’ between researchers with the technical know-how, funding, or institutional connections to extract big social data and the mass of researchers who merely hear big social data invoked as the latest, exciting trend in unattainable scholarship.


A scholarly divide
Data considerations
Access denied
Social media scholarship
Critiquing the Big Data paradigm
Discussion and conclusion




Recent decades have witnessed an increased growth in data generated by information, communication, and technological systems, giving birth to the ‘Big Data’ paradigm across an array of disciplines. Big Data have been defined as “the data sets and analytical techniques in applications that are so large (from terabytes to exabytes) and complex (from sensor to social media data) that they require advanced and unique data storage, management, analysis, and visualization technologies” [1]. Semantically, the term Big Data derives from scalability issues in data management and analysis. The buzzphrase ‘big data analytics’ refers to the generation of key insights from patterns discerned in large-scale data runs that would not be visible otherwise (Hilbert, 2013). Although typically referenced in relation to size, Big Data “is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets” [2].

The Big Data paradigm in social media scholarship has already reached early critical mass, producing by 2012 more than 400 empirical studies on Facebook alone (Wilson, et al., 2012). Such rapid growth exemplifies how researchers with the tools, resources, and acumen to harvest social media flows enjoy publication advantages from the collection and analysis of digital traces of human behavior that are available online. Advantages of such trace data were recognized relatively early in the networked media era, when unobtrusive measures such as user-generated content or cookies that track and collect user browsing behaviors were compared with self-report or interview data in the interface design and testing process (Burton and Walther, 2001). Unlike self-reports, big social data are naturally occurring accumulations of user communication that are ‘found’ rather than ‘made’ during the research process (Taylor, 2013). With access to such data streams, researchers can identify patterns and extract meaning without the need to design stimuli and set up parameters a priori. Therefore, access becomes a valuable research asset — and an important discriminating factor in new technology research.

Although academic use of big social data represents a mere fraction of its commercial application [3], the existence of big social data has over-sized implications for academic research, particularly in the social sciences. Despite their idiosyncrasies, big social data are attractive to academic researchers for several reasons. The benefits of big social data have been expressed in terms of their potential to unify unstructured and fragmented data sources, as well as to enhance our understanding of social phenomena. Big social data have been gainfully employed by communication scholars to reconsider and advance longstanding theories (see Neuman, et al., 2014), illuminate relationships between second screen use and audience response to presidential debates (Shah, et al., 2015), and achieve a better understanding of political influence via social media (Jungherr, 2014), to mention just a few examples.

Yet with the emergence of the Big Data paradigm, disparities emerge among scholars, particularly between researchers who share a common interest in information and communication technology research. These disparities come into stark relief when comparing the skill set of those in computational areas, such as computer science and informatics, with those in the social sciences and humanities. Ideally, scholars from different traditions work together to bridge disciplinary divides and harness their unique contributions and expertise; in practice, however, that interdisciplinary ethos is often lacking and difficult to sustain where it does exist. As a result, we are on the cusp of a data-driven paradox — with more data available than ever before, only a select few researchers are in a position to truly reap the benefits of big social data analysis.

The implications of this paradox are twofold: on the one hand, big social data represent an important trend in communication technology research that challenges the social sciences conceptually and methodologically while calling for new collaborative initiatives (Ruppert, 2013). On the other, the promising world of unprecedented precision and predictive accuracy that big social data conjure remains out of reach for most communication and technology researchers, a problem that traditional mass media did not present.

The availability of big social data thus discriminates in any number of ways: between analysts who sift through reams of available data points to discover patterns of behavior in users and users who (largely unwittingly) generate data for analysis (e.g., Lerman, 2013); between marketers who conduct segmentation studies to more effectively target consumers and consumers who are targeted by an increasing number of tailored ads (Turow, et al., 2015); and, recently, between scholars who have access to big social data to perform novel analyses and open new methodological frontiers (see Lewis, et al., 2013; Menchen–Trevino, 2013) and conventional, i.e., noncomputational, researchers who lack the technological savvy to extract meaning from big social data. We should say at the outset that many social scientists are highly competent in statistical methods and analysis, and very sophisticated in the work that they do. But this doesn’t mean they are conversant with big social data mining and analytics, which require a skill set more compatible with computer science than most social sciences.

In earlier eras, it was quite acceptable (if not de rigueur) for one field of study to ignore developments in another, particularly if they were already at arm’s length. But times are changing, and fields are merging. Big Data contexts are particularly relevant to social scientists due to recent changes in the amount and types of social media data that are now available. Prior to digitalization, monetization in the mass media industries was based on the projection of audience size, typically through industry ratings systems or large-scale surveys (see Bruns, 2008). Now, with the growth in different data types, particularly user-generated content, Web log data, and other behavioral information generated by online platforms, individualized user preferences become a new criterion of interest to advertisers.

Ironically, even though more data streams are being generated from more platforms than ever before, the vast majority of these data are inaccessible to the typical researcher. Either posts are kept private, limited to a circle of friends or followers who are connected to a particular person, or companies severely restrict access to their full data stream of posts that are public. In this new sociotechnical configuration, users leave traces not as members of a media audience but as subscribers to, or registrants in, a private information or communication service (see Bordewijk and van Kaam, 1986). Even data controls ostensibly in the hands of individuals, e.g., in privacy settings that allow users to keep their Facebook postings or tweets hidden, do little to block the company or platform itself from retaining and accessing all user information.

This out-of-reach quality of big social data prompts consideration of the differences between new and old media systems and the socioeconomic contexts that amplify these differences. From this vantage point, it is worth asking who benefits from data acquisition and analysis, and who loses. Drawing on the academe as a place of creative innovation, are there existing models of success that might show how interdisciplinary collaboration could overcome some of the more exclusionary aspects of big social data mining and meaning extraction? Such questions merit consideration in an era of increasing data acquisitiveness.



A scholarly divide

Our contribution to this debate is situated in the context of computational social science research as an emerging research paradigm (Kuhn, 1996), now with its own scholarly journal (Big Data & Society), international conferences, and special forums devoted to big social data analysis in more general journals [4]. This emerging research paradigm co-exists with other traditions but places researchers on an uneven playing field due to the above-mentioned disparities in data access and analytics. Even while we refer to this scholarly divide as a looming concern, we acknowledge that Big Data approaches aren’t a futile enterprise. Quite the contrary, large-scale social data analyses can be leveraged for a variety of constructive purposes, including audience research, public opinion tracking, disaster preparedness, communication about public health threats, and national security, just to name a few. Indeed, we are both involved in a series of productive studies on social media, user empowerment, and civic involvement (see Shah, et al., 2015; Zelenkauskaite and Simões, 2015; Zelenkauskaite, 2016). However, this doesn’t blind us to the very real challenges confronting researchers in the area.

Criticism of this disparity grows out of the realization that big social data analysis not only represents a paradigm involving exponentially larger amounts of data for parsing, but also order of magnitude increases in the complexity of data storage and retrieval and the need for sophisticated statistical tools to analyze it (see Jacobs, 2009). There is no such thing as a small scale, qualitative analysis of big social data.

Almost invariably, discussions about disparities wind their way toward debates over the advantages of early adoption. Early embrace of a given innovation, including big social data analysis, can have direct consequences for reaping first-mover benefits, such as publication success, outside funding, or research notoriety. At this still-early stage of Big Data research in the social sciences, innovators and early adopters are beginning to enjoy the benefits and recognition of being among the first to explore and exploit this new analytical space. Academic departments are in a rush to hire computational social scientists, particularly in ‘nontraditional’ departments like communication and media studies that are open to hiring faculty from other disciplines. Yet, we might ask whether these first-movers also bear some responsibility for cultivating a more level playing field for all researchers with regard to big social data practices.

In the sections below we consider the implications of this curious but growing research gap constituting a scholarly divide between researchers with the technical know-how, funding, or institutional connections to extract and analyze proprietary big social data and the mass of researchers who merely hear big social data invoked as the latest, exciting trend in unattainable scholarship. Similar to the “disparity between developed and developing countries [that remain worlds apart] in their use of innovative computational research methods” [5], we see a divergence between the ‘deep’ analytics of a few and ‘surface’ knowledge among the mass of researchers exposed to the term, Big Data (see also Manovich, 2012). Applying the notion of disparity to the research process, even if physical access to data is available to scholars, computational skills and technical expertise become a limiting factor, thereby introducing a schism in scholarly opportunities that conventional social science and humanities traditions are not entirely prepared to deal with.

Salient challenges to harnessing insights from big social data include the messy nature and unwieldy variety of the data themselves and emergent solutions to overcome the opaqueness of data extraction and mining through application programming interfaces (APIs) and programming languages like Python. APIs are often a product of third-party data providers who are authorized to release a portion of social media data from a given private or commercial provider. Before concluding that this is a problem easily resolved by the development of the right algorithm, it is important to appreciate the investment of time and effort necessary to become conversant with an entirely new way of acquiring and parsing data. Rarely do academic researchers switch gears in midstream to embrace a new methodological approach or culture of research; what’s engrained in graduate school tends to stick and most departments do not support retraining of faculty. The tendency instead is to provide software solutions that are ostensibly more ‘user-friendly’ and thus accessible to social scientists who are not trained in computer science (Goggins, et al., 2013) [6]. But this by definition places would-be Big Data researchers in the ‘end user’ category, which is not a position of much analytical strength.

Developing programming expertise is problematic in the social sciences as well because data mining and computational techniques are not traditionally part of advanced research training. Workshops in big social data techniques are now offered periodically at conferences and coding camps but a mere introduction to big social data analysis, which is what these workshops offer, is not the same as true parsing ability. Each of these issues contributes to the inability to generate useful insights from big social data in scholarship.

We next examine the problem of data access by evaluating properties of the systems in which big social data are found, focusing on proprietary versus open system architecture. Here, structural constraints become salient, as system design decisions can either promote or hinder the collection of big social data through the designation of overt and covert data types. While illustrating our argument with reference to the two leading social media platforms, Facebook and Twitter, we address universal data architecture concerns as encoded repository-based issues that influence data collection as well as data ownership. Facebook and Twitter have dominated the social media sphere over the past decade and so are convenient to discuss, yet the same questions could be asked about social media platforms currently gaining popularity, such as Snapchat and Instagram.



Data considerations

The distinction between overt and covert data exemplifies how data typing restricts accessibility. Here, system architecture and design considerations encapsulate a power dimension that determines how raw data are shared and utilized. Overt data are openly available and nonexclusive but are typically collected manually and thus limited in quantity. Covert data are exclusive in the sense that they are either shrouded from public view or only available for extraction with a combination of programming skills and financial resources. But, they can be very large scale — more on this below.

In the early days of Web research, researchers mined Web logs from servers for continuously generated traffic data to generate large datasets (Burton and Walther, 2001). Web log data enable behavioral inferences through analysis of navigational and interface activity that leave digital traces of what users do online, including click throughs and log-in durations for a given Web site; demographic information regarding individual users, such as gender, age, and income; and, performance-related information, including the type of activities in which users engage (Burton and Walther, 2001). Web log data have been utilized by user experience designers and usability experts, among others, to improve interface design and optimize navigational flow on specific Web pages (Byrne, et al., 1999). Analysis of browsing history provides insights about the design optimization of a given page by tracking, among other things, download wait times, page views, click throughs, scrolling activity, and use of multiple browser windows.

Detailed metrics have long been in the hands of system operators and in-house analysts. At various times, academic researchers have found ways of extracting similar data. For example, Paolillo (1999) downloaded user interaction log files from Internet Relay Chat to study social network and language variation, while Jones (1997) and Rafaeli and Sudweeks (1997) examined USENET files to document interactivity and the use of news sources in computer-mediated communication. Such data provide large amounts of contextual information to facilitate analytical inference. But similar to big social data, they also require technical knowledge and proprietary access.

Because a large number (although very low percentage) of posts on social media platforms like Facebook and Twitter are readily visible, these user-generated sites provide the impression of openness. But beyond manual copying, the data they gather from users, which have become increasingly valuable to marketers, are not very accessible to researchers. The expectation of easy access stems perhaps from the early history of a relatively open Web and the availability — with a costly subscription — of digital news and other media transcripts through information providers like LexisNexis. While not free, LexisNexis is routinely provided by major campuses at which research takes place — the cost is absorbed by the institution. With the rise of social media and platforms populated with user-generated content, access to these data streams seemed to promise a new era of data abundance — again, a presumption that can be traced back to the early stages of online research when it was common to examine the small data of Web logs and surface content that could be manually scraped from Web pages.

Big social data marketers, who promote the notion that social media data are ubiquitous and there for the asking, invoke this history. Reinforcing the narrative of ubiquity is a continual expansion in the amount of archived data, from 4.4 zettabytes in 2013 to a projected 44 zettabytes by 2020 (EMC, 2014). Yet, despite this apparent abundance, there is a tight concentration of companies involved in the handling of data generated by social media and the software needed to extract and parse meaning from it (Turino and Kulik, 2014). SAP’s Hana and Oracle’s Analytics appliances, launched in 2010 and 2011, are two services that facilitate data hosting and acquisition. Such concentrations inevitably work against the small-scale researcher in terms of pricing and other barriers to entry. Thus, the proprietary nature of most social media data streams places restrictions on access.



Access denied

Social media platforms, which differ in the degree of data access they are willing to provide, simply do not allow researchers to freely gather unrestricted amounts of data generated by the users of these platforms. In April 2015, Twitter somewhat famously cut off ‘firehose’ access to publicly available tweets from third party vendors who had been reselling “the unfiltered, full stream of Tweets and all related metadata that goes along with them” (Lunden, 2015). In social media settings, restrictions on access begin with users themselves. Yet such restrictions and privacy settings really only apply to third parties who want access to private data, either for research purposes, for reselling, or some other interest. The social media companies, whether Facebook, Twitter, Snapchat, Instagram, or Google+, retain unfettered access to their entire user base of information.

Proprietary settings add another layer of exclusion, namely, that permission to access data must be granted by a commercial entity that either owns or, just as critically, controls content. Data providers often occupy a crucial gatekeeping position even if the technical expertise exists to effectively navigate at the system level. When data are made available, they are often limited by parameters the company or provider decides are acceptable. In some cases, third party providers grant access to a portion of the data, while companies maintain control over the complete corpus. Sporadic initiatives to provide enhanced data access have emanated from some social media services themselves, but under the familiar cloak of exclusivity. In a 2014 competition, Twitter made available full “access to our public and historical data” (Twitter, 2014a) — but only to a handful of research institutions with competitive proposals for #DataGrants [7]. In such cases, social media data are presented to researchers as a valuable asset, but limiting these awards to a small number of grant recipients only perpetuates the scholarly divide at the institutional level.

APIs: Pulling from proprietary systems

Data restrictions are not new to communication technology research. When business models developed in traditional mass media, private companies such as Nielsen and Arbitron began to compile strategic information on viewership, readership, and listenership without any intention of sharing the data freely with the academic community (although occasionally data from Nielsen, which acquired Arbitron in 2013, does show up in academic research). While Nielsen makes some ratings data available to academic researchers, cost issues would prohibit most investigators from accessing their data. For example, Nielsen media and audience measurement reports for trends in the U.S. start at around US$500 and run into the thousands, depending on the type of report (Nielsen, 2016) — enough to prevent such data from becoming a central outcome variable in applied media research, which arguably it should be.

Automated tools, which provide access to digital data troves, are being deployed to capture and account for increased data flows. Chief among these are APIs, which facilitate the gathering of publicly available social media data. The APIs for Twitter and Facebook are pioneering applications in the emerging marketplace of access to data streams and social media content. Although promising as a research tool, APIs are limiting as well. The key restriction is their predefined scope, which is embedded in the set parameters through which they are programmed to gather data. To enhance flexibility, APIs are often joined to create mashups that combine multiple APIs or other applications, generating new prototypes with broader search parameters. But here, too, another limitation of big social data analysis emerges — the comparability of datasets using different APIs. Considering the variability in APIs, differences in how they are manipulated by mashups, and the fluctuating timeframes in which data are gathered, the comparability of datasets using different APIs remains clouded at best.
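The comparability problem can be illustrated with a minimal sketch. The Python example below is hypothetical throughout: the `Collector` class, its `keyword` and `max_results` parameters, and the sample posts are invented for illustration and do not reflect any actual Twitter or Facebook API. It shows only that two collectors with different predefined scopes, run against the same stream, return different datasets.

```python
from dataclasses import dataclass

# Invented stream of posts; ids and text are placeholders, not real data.
STREAM = [
    {"id": 1, "text": "debate tonight"},
    {"id": 2, "text": "watching the debate"},
    {"id": 3, "text": "election news"},
    {"id": 4, "text": "debate recap"},
    {"id": 5, "text": "election debate analysis"},
]


@dataclass
class Collector:
    """Hypothetical stand-in for an API with a predefined scope."""
    keyword: str
    max_results: int  # assumed cap on how many matches are returned

    def collect(self, stream):
        # Keep matching post ids in stream order, then apply the cap.
        matches = [p["id"] for p in stream if self.keyword in p["text"]]
        return set(matches[: self.max_results])


def jaccard(a, b):
    """Overlap between two datasets: 0.0 means disjoint, 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Same query term, different caps: the resulting datasets diverge.
first = Collector(keyword="debate", max_results=3).collect(STREAM)
second = Collector(keyword="debate", max_results=10).collect(STREAM)
print(sorted(first), sorted(second), jaccard(first, second))
```

Even in this toy case the two collections share only three of four posts, although the search term is identical. With real platforms, differing time windows and mashup logic would add further divergence.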

Given the restrictions on big social data access due to the requisite programming expertise and third party or platform participation, social media research is developing a dependency on both computer science training and corporate cooperation. It’s a ‘world of data’ out there, yet the successful extraction of big social data depends on the technical skill of a given research team and provider permission to collect and analyze the data provided (not to mention human subjects protections overseen by Institutional Review Boards). System design, or architectural, constraints are another important consideration. Significantly, the complexity of mapping data in multiplatform environments increases proportionally to the number of platforms that are of interest.

Data architecture: Overt and covert elements

Data architecture, meaning the rules and standards that govern how data are collected and structured in information technology systems, predetermines the ways in which user-generated data can be assembled and the types of datasets that may be generated. We distinguish between different types of data based on their accessibility. As mentioned, overt data consist of publicly visible and available content, usually on the user interface side, that does not require advanced technical expertise or significant resources to access. In this sense, overt data are analogous to content from traditional mass media. By contrast, covert data consist largely of user-generated content collated by automated and systematic data-scraping mechanisms, such as APIs, and require special expertise or substantial resources to access. In this sense, covert data correspond to the big social data that are driving concerns about an emerging scholarly divide.

Overt data collection proceeds conventionally, with construction of a sampling frame and retrieval of content that is publicly available. The mechanics of overt data collection may also involve a combination of snowball (collection of data from a few members of a targeted population and expanding it by word of mouth) and purposive (designed to study a subset of a larger population based on specific criteria) sampling techniques useful in social media analyses that facilitate the construction of localized and context-specific datasets. Access is nevertheless limited because data gathering is restricted to what the individual researcher can see and download. Purposive or convenience sampling limits the generalizability of what can be reported, a problematic issue when considering the requirements of conventional content analysis. With overt data, one can never truly randomly sample cases; only purposive sampling is possible. With covert data, random sampling is possible but analysis can only include sampling points that are predetermined by the relevant API.
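The contrast between purposive and random sampling can be made concrete with a small sketch. The Python example below is hypothetical: the population of posts, the `visible` flag, and the function names are invented for illustration and correspond to no platform's data or API.

```python
import random

# Hypothetical population of 1,000 posts; every field here is invented.
# 'visible' marks the small share a researcher could see and copy by hand.
POSTS = [{"id": i, "visible": i % 20 == 0} for i in range(1000)]


def purposive_sample(posts, size):
    """Take the first `size` posts that the researcher can actually see."""
    return [p for p in posts if p["visible"]][:size]


def simple_random_sample(posts, size, seed=0):
    """Draw `size` distinct posts with equal probability from the full frame."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(posts, size)


manual = purposive_sample(POSTS, 25)
drawn = simple_random_sample(POSTS, 25)

# The manual sample contains only visible posts by construction; the random
# draw is taken from the whole frame, visible or not.
print(len(manual), all(p["visible"] for p in manual))
print(len(drawn))
```

The random draw presupposes that the full frame can be enumerated, which is what programmatic access supplies and manual collection lacks.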

In cases of participant observation, some big social data platforms are easier to navigate than others. The data from online forums, for example, are generally available in whole. Yet, on social networking platforms such as Facebook, researchers can freely access only those data that are available through friend lists or through searches of publicly available profile data. As an extreme form of convenience sampling, friend-list data severely limit the scope of analysis — and new knowledge that can be generated.

On account of these limitations, overt data collection presents problems in tracing the evolution of discussions and networked activities, and handicaps the attempt to gather data longitudinally over extended periods of time to increase robustness. If collected manually, overt data are challenging to download consistently and to replicate, especially since users can change their privacy settings and restrict the public availability of personal information at any time, whether by making deletions, hiding certain forms of information, unsubscribing, or simply discontinuing their activity.

Covert data, by residing in the inner layers of data depositories that are immune to user changes and controls, provide multiple advantages if they become retrievable, namely by allowing “more systematic forms of high-speed and high-volume data gleaning” [8]. Theoretically at least, covert data are ideally situated for analysis because they are unrestricted in scope or duration — assuming, of course, that they can be retrieved.

As the term implies, covert data are generated in the inner layers of a system’s architecture that are not readily visible. Covert data are available only to users who have permission to access these deeper system tiers, who possess the know-how necessary for automatic extraction of user data, or possess some combination of permission and expertise. Covert data layers are the building blocks of big social data repositories and can yield large volumes of information based on search query specifications. The advantages of covert data collection reside in the ease and systematic nature of sampling procedures once a programming language or algorithm is employed.

Implications of the overt/covert distinction

Overt and covert data gathering mechanisms create two different classes of data from a single platform or service and thus structure an uneven playing field for researchers. Overt data are more likely to be associated with the production of smaller datasets since they involve manual or semi-manual data collection. Covert data present a volume and delivery advantage in terms of the amount and type of metadata (information about the data) collected for analysis, along with the content present in a given data stream. Gathering of overt data, while rich in context, is slower, smaller, manually retrieved, and associated with qualitative or descriptive research. Covert data gathering, which can provide the same richness depending on the parameters through which datasets are constructed, is faster, (much) larger, automated, and associated with quantitative research and the use of inferential statistics.

Because it involves automated or programmed routines that can be implemented on a large scale, covert data gathering can easily accommodate longitudinal analysis over an extended period of time (weeks, months, or even years) and can be paired with other forms of data, for instance, biobehavioral analysis of media content, to reveal undiscovered relationships (e.g., Shah, et al., 2015). Similarly, covert data analyses can also look back in time, through retrievals of archival data. And because of the richness of metadata contained within covert datasets, access also enables analysis of background information like location, programming language, ownership, tools used to create the file, key words, and so on.

Although the long-term value of big social data analysis in the social sciences is still being negotiated, studies based on covert forms of data generally have more visibility and impact than those based on overt data sources because they uncover otherwise hidden patterns and relationships. Researchers who are limited to overt forms of social media data, and who must confine their analyses to small scale sampling of a given platform within a limited period of time, are analytically disadvantaged compared to those with access to covert data — and their work is likely to have less impact on society.



Social media scholarship

The development of APIs for social media platforms carries with it the promise for academics, especially social scientists, of incorporating the analysis of big social data into their research. Using EBSCO’s Communication Source database (a recently merged resource comprising Communication and Mass Media Complete plus Communication Abstracts, formerly published by Sage), we identified the peer-reviewed research articles that mentioned Twitter or Facebook in the title and ‘API’ anywhere in the text for the period January 2010 to January 2016, a timespan that encompasses the introduction of APIs to networked data collection [9]. Limiting our query to articles in Communication Source represents a somewhat conservative approach to article identification compared to a broad-based Google Scholar search since the database does not include work published in more technical research areas, such as computer science or software engineering.

Even though APIs were only recently introduced, we extended the search back to 2010 to assess the volume of research conducted on the two most popular social media platforms since their emergence as dominant media players. Our search was based on the following criteria. First, we filtered for peer-reviewed research articles. Next, we searched for ‘Twitter’ anywhere in the article, which produced 3,598 studies. We subsequently reduced the search to ‘Twitter’ in the title only (such delimitation conservatively assumes that the article’s focus was placed on Twitter and eliminates false positives where Twitter is mentioned tangentially). This process identified 349 studies. Finally, the filter was narrowed further to ‘Twitter’ in the title and ‘API’ anywhere in the text. Delimiting the search to these criteria identified 25 articles.

An identical analysis was performed using ‘Facebook’ as the search term. The search range was again January 2010 to January 2016. The search produced a total of 4,801 peer-reviewed articles with the word ‘Facebook’ in the text. Again, a much smaller number, 410, contained ‘Facebook’ in the title. Just four studies included ‘Facebook’ in the title and ‘API’ anywhere in the text (showing the relative difficulty of extracting big social data from Facebook compared to Twitter). While intended only for illustrative purposes, this analysis of EBSCO-indexed articles on Facebook and Twitter shows that despite the interest shown by communication technology scholars in social media as a site of research, only a very limited number of studies are utilizing APIs as data-gathering mechanisms as of yet, a trend we would say is indicative of the scholarly divide.
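The counts reported above can be expressed as proportions with a few lines of arithmetic. The following Python sketch is illustrative only: the figures are those reported in the text, while the dictionary layout and the helper name `share` are assumptions introduced here and are not part of the original search procedure.

```python
# Counts reported in the text for the Communication Source search
# (January 2010 to January 2016). The data layout is an illustrative assumption.
funnel = {
    "Twitter": {"anywhere": 3598, "title": 349, "title_and_api": 25},
    "Facebook": {"anywhere": 4801, "title": 410, "title_and_api": 4},
}


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for platform, counts in funnel.items():
    pct = share(counts["title_and_api"], counts["title"])
    print(f"{platform}: {pct}% of title-focused articles also mention 'API'")
```

Under these figures, roughly 7 percent of the articles with ‘Twitter’ in the title, and about 1 percent of those with ‘Facebook’ in the title, also mention an API.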



Critiquing the Big Data paradigm

Given disparities in skill sets and access to social media data among scholars who would be willing to work on larger datasets analyzing online behavior, we see the resulting critique of the Big Data paradigm as all but inevitable. While big social data analytics have already achieved impressive results in social science scholarship, the real potential of this emergent research paradigm won’t be realized until the barriers to analysis can be substantially eliminated. Despite the appeal of big social data as the next new thing in communication and technology scholarship, the approach has been rightfully criticized in terms of its sociotechnical consequences and cultural, technological, and scholarly implications for social science research (see boyd and Crawford, 2012; Bruns, 2013; Bucy and Zelenkauskaite, 2014; Langlois, et al., 2015; Manovich, 2012; Vis, 2013).

In the digital humanities, the promise of big social data, referred to as knowledge about many, has been challenged by questioning the paradigm’s analytical trajectory and inability thus far to provide much depth of insight or theory-driven concept development (Manovich, 2012). Even as researchers analyze an ever-growing number of tweets, Facebook photos, YouTube videos, and other social media content, none of these data provide a “transparent window” into the self or necessarily reveal the imaginations, intentions, motives, opinions, and ideas of users (Manovich, 2012). They are, moreover, carefully created and systematically managed outputs. Manovich (2012) contends that the information from Web servers — “the aggregated behavior and sentiments of users overall — should not be conflated with the emotions, motivations, and deliberative thoughts of individual users” [10]. Though scholars are for the most part careful to avoid this form of ecological fallacy, the critique serves as an important reminder to consider the complexity and context in which data, big or small, are situated.

Commenting on the ‘computational turn’ in research (Berry, 2011), in which fields that have traditionally dealt with more limited collections of evidence must now contend with vast amounts of it, Bruns (2013) argues that the big social data paradigm can be leveraged for scholarly advancement by situating academic work within a unified depository. Such a depository is seen as a vehicle for facilitating research and analysis by providing an information stockpile, in essence, that would reduce some of the troublesome barriers to data access, storage, and retrieval facing researchers today. Similarly, in light of the increased availability of data sources more generally, Lomborg and Bechmann (2014) call for enhanced integration of quantitative and qualitative as well as mixed methods research designs. Further efforts should be made to situate big social data outcomes as complementary to existing approaches rather than regarding them as analytical islands.

Others argue that consequences of big social data for research will deepen existing methodological divides or even result in “methodological and epistemological wars” (Peled, 2013). Scholars in the digital humanities have pondered whether all intellectual work in the big social data era will basically be transformed into ‘software studies,’ where technology occupies a privileged position in relation to the studied phenomenon [11]. Further issues are raised by questioning the benefits of digital approaches compared to traditional methodologies. Analysis of larger datasets does not automatically generate more insightful answers to the questions social science seeks to answer; rather, the insights derived from big social data must be carefully calibrated to the aggregate level where the analysis fits with the outcome measured.



Discussion and conclusion

Lacking detailed knowledge of the new sociotechnical reality, researchers in the ‘end user’ category (i.e., the vast majority of those studying communication and technology issues) are relegated either to manual data collection techniques or the use of automated tools that map relationships but don’t download raw data for intensive analysis. To reduce this developing scholarly divide, training in big social data should include the tools and techniques of covert data extraction and analysis — a means of data production now every bit as important as traditional methods of data collection. Otherwise, the future of big social data research seems to be one of increasing segregation, partitioned on the basis of available funding as well as subdisciplinary training and expertise. Such scenarios would seem to call for enhanced social science training in the understanding and use of APIs for constructing big social data repositories and the analytical tools necessary to derive meaning from them.

The overt and covert quality of data problematizes social media research in several ways. To a large extent, the unavailability of big social data derives from the automated way it is collected. Unlike traditional media contents, which are published or broadcast overtly and within public view, much user-generated content is gathered as a byproduct of system or software design decisions, which track users as they post on social media platforms, exchange messages which they assume are private, make purchases or simply show interest in products on e-commerce sites, express preferences in different communities, and otherwise navigate their way through networked systems. Because covert data are inaccessible without system-level access and understanding, the ability to retrieve big social data becomes an important discriminating factor in the availability of material to analyze — and opportunities to generate knowledge.

Big social data provide new opportunities to access user behavior in online environments in unobtrusive ways. Yet a divide between those who find themselves with access to big social data and those who lack access and expertise to parse it is reflected in the notion of a deepening digital divide (van Dijk and Hacker, 2003), which emphasizes the usage gaps that emerge between different users of technology. Now such concerns are centrally relevant to the scholarly enterprise itself. Diverse competencies that include the ability to pose meaningful and relevant questions for specific sociotechnical systems, access and retrieve covert data, and perform advanced data analytics become important, if not required, assets in big social data research.

In many ways, the access dimension of the scholarly divide we have described here is not new. Some of the key issues are related to the monetization and commercialization of digital data and research, where datasets (of limited size) can be purchased given available funding (Peled, 2013). Similarly, audience aggregation agencies such as Nielsen and comScore have long exemplified the unattainability of longitudinal media data for researchers even though they have been quite good about offering a ‘preview’ of their data through summaries or weekly box office releases (comScore, 2016; Nielsen, 2016). The commercial aspect of research in the networked era highlights how big social data has provided a new context for familiar problems in relation to data access, reconfigured in an emergent paradigm.

Beyond monetary issues, big social data collection through private social media companies poses challenges in terms of data validity and generalizability. Even if scholarly access to big social data could be enhanced through support grants and training, challenges would still linger. Most notably, the use of APIs presents a burden and dependency on third parties for data collection. Control of the data rests with the design of the application and system infrastructure architecture. Even if communication and technology scholars expressed an urgent need to gain a better understanding of covert data collection through APIs (see Burgess and Bruns, 2012; Langlois, et al., 2009), or began working in much closer collaboration with computer scientists, the issues surrounding full access seem difficult to surmount. At present, there seems to be an unhealthy dependency on the private companies that develop APIs and provide most big social data for research.

To begin meeting the demand created by the computational turn that social media analysis has taken, some private companies are stepping into the breach and various homegrown initiatives have sprouted, including coding camps (e.g., Data in a Day [12]), conference workshops in new programming languages and statistical techniques, and curated Web resources. The Twitter Collection and Analysis Toolkit (TCAT), a cloud-based software solution that allows researchers to gather publicly available tweets from the Stream API and process the data for network analysis and visualization in Gephi, offers a low-knowledge entrée into social media analysis (see Gaffney and Puschmann, 2014). Self-teaching in basic programming skills through open source programming languages like Python, which provide direct and simple access to social media APIs, is another tactic.

Open access tools for the R statistical environment have also been developed, including the SocialMediaLab package, introduced in an “Absolute beginner’s guide to the SocialMediaLab package in R” that comes with accompanying tutorials (Graham and Ackland, 2016). A plenitude of social media collection tools have been identified and meticulously organized by a few skillful scholars who strive to make big social data research more accessible. Deen Freelon (n.d.), for example, curates a wiki list titled ‘Social media data collection tools,’ which includes an extensive assortment of applications, modules, libraries, and other tools. Some require no programming skills but often carry a cost, while others require programming skills but are often free.

Social media data vendors play an important role in data access and visualization. Listed below are vendors of various types that provide data visualizations, metrics, or other data syntheses:

Another tool that has been developed to help researchers scrape Twitter data without knowledge of Python or APIs is an initiative called Twitter Zombie (Black, et al., 2012). The software is a search engine based on keywords. The program connects to a search API so the user does not have to do this herself. Twitter Zombie is simple to use and generates social media data. However, while such tools are promising, they are also limiting in both search capacity and archive depth; Twitter Zombie will only retrieve six days’ (or less) worth of content, and a query can be rejected if it is too complex. Moreover, each query generates a maximum of 1,500 cases, which in a big social data context can only be considered pilot or test data (Black, et al., 2012).
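The effect of such limits can be illustrated with a toy simulation. The Python sketch below is not Twitter Zombie’s implementation and calls no real API; the function name `capped_search`, the record layout, and the synthetic posts are all hypothetical. It simply applies the two constraints reported by Black, et al. (2012), a six-day retrieval window and a ceiling of 1,500 cases per query, to an in-memory list.

```python
from datetime import datetime, timedelta

MAX_CASES = 1500   # per-query ceiling reported in the text
WINDOW_DAYS = 6    # retrieval window reported in the text


def capped_search(posts, keyword, now, max_cases=MAX_CASES, window_days=WINDOW_DAYS):
    """Return matching posts inside the window, newest first, truncated at the cap.

    `posts` is a list of (timestamp, text) tuples; this layout is hypothetical.
    """
    cutoff = now - timedelta(days=window_days)
    matches = [
        (ts, text) for ts, text in posts
        if ts >= cutoff and keyword.lower() in text.lower()
    ]
    matches.sort(key=lambda item: item[0], reverse=True)
    return matches[:max_cases]


# Synthetic data (not real tweets): one post every two minutes for 30 days,
# all of which contain the keyword.
now = datetime(2016, 4, 17)
posts = [(now - timedelta(minutes=2 * i), f"debate post {i}") for i in range(30 * 24 * 30)]

sample = capped_search(posts, "debate", now)
in_window = sum(1 for ts, _ in posts if ts >= now - timedelta(days=WINDOW_DAYS))
print(len(posts), in_window, len(sample))
```

In this toy run, 21,600 synthetic posts exist, 4,321 fall inside the six-day window, and only 1,500 are returned, which shows how a window and a cap together shrink a month of activity to a small sample.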

Finally acknowledging the new reality, the newest editions of some quantitative methods textbooks in communication research now include sections or chapters on Big Data analysis. But instead of teaching the methods in detail, they merely provide an overview of big social data research (see Wrench, et al., 2015). This reinforces the notion that big social data research is still an individual enterprise rather than a pressing disciplinary concern.

The disparity in access to big social data inevitably has consequences for research. Inequities or even ‘methodological wars’ between social media researchers are one undesirable outcome. Data access, programming knowledge, and visualization ability also put some researchers at a distinct advantage. Ironically, the best undergraduate training in communication and technology studies in preparation for advanced research at the graduate level may now be a degree in computer science, informatics, or engineering. Such positioning could be described as an early adopter advantage, which may eventually subside. At present, reconciling the scholarly divide we have described seems dependent on the development of more automated tools for non-experts. If the skills for effective big social data analysis ever did become a dominant training focus among social scientists, or if data from media platforms became more readily available and access tools more transparent, conducting research with big social data could evolve as a routine epistemological choice of the researcher, rather than remain a much ballyhooed but unattainable form of scholarship.

Until then, the disparities in big social data scholarship will continue to require a combination of “ingenuity, clever data design, collaboration, humility, and humanity” [13] to overcome.


About the authors

Asta Zelenkauskaite is an assistant professor in the Department of Communication at Drexel University. Broadly, her research focuses on the interplay between user logic(s) and media logic(s) regarding social media. Her current research includes online influence and its implications for Big Data. She engages issues of dominant and emergent user practices, either facilitated or impeded by sociotechnical systems.
E-mail: az358 [at] drexel [dot] edu

Erik P. Bucy is the Marshall and Sharleen Formby Regents Professor of Strategic Communication in the College of Media and Communication at Texas Tech University. He is the author of Image bite politics: News and the visual framing of elections (with Maria Elizabeth Grabe, Oxford, 2009) and editor of the Sourcebook for political communication research: Methods, measures, and analytical techniques (with R. Lance Holbert, Routledge, 2013). His work on new technology, political communication, and media evaluation has been published in a wide variety of leading journals.
E-mail: erik [dot] bucy [at] ttu [dot] edu



1. Chen, et al., 2012, p. 1,166.

2. boyd and Crawford, 2012, p. 663.

3. In the commercial context, investment in Big Data analytics has been marketed as a worthwhile expenditure that delivers important market intelligence in real time (Manyika, et al., 2011).

4. In recent years, numerous journals have sponsored forums on Big Data, including First Monday (2013), Journal of Communication (2014), International Journal of Communication (2014), Annals of the American Academy of Political and Social Science (2015), Digital Journalism (2015), Journalism & Mass Communication Quarterly (2015), New Media & Society (2015), and Social Science Computer Review (2015), among others — and more special issues are planned, e.g., in American Behavioral Scientist and Big Data & Society.

5. Skoric, 2013, p. 175.

6. Granted, some enterprising individuals in the humanities and social sciences do teach themselves the programming and technical skills necessary to parse Big Data, or assemble teams of specialists to perform Big Data analysis. But this is the exception, not the rule.

7. In April of 2014 Twitter (2014b) reported to have received 1,300 proposals from more than 60 different countries, including grants awarded to the following campuses:

8. Rieder, 2013, pp. 346–355.

9. A data crawling API for Twitter was launched in October 2012 (Twitter, 2016). Other unverified sources claim that Twitter introduced an API as far back as September 2006; Facebook launched its development platform and API in August 2006 (API Evangelist, n.d.). Yet, it is not clear what types of information were accessible in the early days of either platform.

10. Manovich, 2012, p. 466.

11. Peled, 2013, p. 4.

12. Offered by Boston University’s Division of Emerging Media Studies in August 2015, ‘Data in a Day’ was described as a hands-on, one-day workshop “designed to equip attendees with the knowledge and skills to quickly analyze and visualize data from popular social media platforms.”

13. Peled, 2013, p. 17.



API Evangelist, n.d. “History of APIs,” accessed 17 April 2016.

David Berry, 2011. “The computational turn: Thinking about the digital humanities,” Culture Machine, volume 12, accessed 17 April 2016.

Alan Black, Christopher Mascaro, Michael Gallagher, and Sean P. Goggins, 2012. “Twitter zombie: Architecture for capturing, socially transforming and analyzing the Twittersphere,” GROUP ’12: Proceedings of the 17th ACM International Conference on Supporting Group Work, pp. 229–238.

Jan L. Bordewijk and Ben van Kaam, 1986. “Towards a new classification of tele-information services,” InterMedia, volume 14, number 1, pp. 16–21.

danah boyd and Kate Crawford, 2012. “Critical questions for Big Data: Provocations for a cultural, technological, and scholarly phenomenon,” Information, Communication & Society, volume 15, number 5, pp. 662–679.

Axel Bruns, 2013. “Faster than the speed of print: Reconciling ‘Big Data’ social media analysis and academic scholarship,” First Monday, volume 18, number 10, accessed 17 April 2016.

Axel Bruns, 2008. “Reconfiguring television for a networked, produsage context,” Media International Australia, volume 126, number 1, pp. 82–94.

Michael D. Byrne, Bonnie E. John, Neil S. Wehrle, and David C. Crow, 1999. “The tangled web we wove: A taskonomy of WWW use,” CHI ’99: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 544–551.

Erik P. Bucy and Asta Zelenkauskaite, 2014. “Big data and unattainable scholarship,” In: Seeta Peña Gangadharan with Virginia Eubanks and Solon Barocas (editors). Data and discrimination: Collected essays. Washington, D.C.: Open Technology Institute, New America Foundation, pp. 21–25, accessed 17 April 2016.

Jean Burgess and Axel Bruns, 2012. “Twitter archives and the challenges of ‘big social data’ for media and communication research,” M/C Journal, volume 15, number 5, accessed 17 April 2016.

Mary C. Burton and Joseph B. Walther, 2001. “The value of Web log data in use-based design and testing,” Journal of Computer-Mediated Communication, volume 6, number 3.

Hsinchun Chen, Roger H.L. Chiang, and Veda C. Storey, 2012. “Business intelligence and analytics: From Big Data to big impact,” MIS Quarterly, volume 36, number 4, pp. 1,165–1,188.

comScore, 2016. “Top entertainment rankings; worldwide box office (estimates),” accessed 17 April 2016.

EMC, 2014. “The digital universe of opportunities: Rich data and the increasing value of the Internet of things,” accessed 17 April 2016.

Deen Freelon, n.d. “Social media data collection tools,” accessed 17 April 2016.

Devin Gaffney and Cornelius Puschmann, 2014. “Data collection on Twitter,” In: Katrin Weller, Axel Bruns, Jean Burgess, Merja Mahrt, and Cornelius Puschmann (editors). Twitter and society. New York: Peter Lang, pp. 55–67.

Sean P. Goggins, Christopher Mascaro, Nora McDonald, Alan Black, and Giuseppe Valetto, 2013. “Big social data for social and information scientists,” iConference 2013 Proceedings, pp. 1,011–1,012, accessed 17 April 2016.

Tim Graham and Robert Ackland, 2016. “Absolute beginner’s guide to the SocialMediaLab package in R” (6 April), accessed 17 April 2016.

Martin Hilbert, 2013. “Big Data for development: From information- to knowledge societies,” Social Science Research Network (15 January), accessed 17 April 2016.

Adam Jacobs, 2009. “The pathologies of Big Data,” Communications of the ACM, volume 52, number 8, pp. 36–44.

Steve Jones, 1997. “Using the news: An examination of the value and use of news sources in CMC,” Journal of Computer-Mediated Communication, volume 2, number 4.

Andreas Jungherr, 2014. “The logic of political coverage on Twitter: Temporal dynamics and content,” Journal of Communication, volume 64, number 2, pp. 239–259.

Thomas Kuhn, 1996. The structure of scientific revolutions. Third edition. Chicago: University of Chicago Press.

Ganaele Langlois, Joanna Redden, and Greg Elmer (editors), 2015. Compromised data: From social media to Big Data. New York: Bloomsbury.

Ganaele Langlois, Greg Elmer, Fenwick McKelvey, and Zachary Devereaux, 2009. “Networked publics: The double articulation of code and politics on Facebook,” Canadian Journal of Communication, volume 34, number 3, pp. 415–434, accessed 17 April 2016.

Kristina Lerman, 2013. “Social informatics: Using Big Data to understand social behavior,” In: Pietro Michelucci (editor). Handbook of human computation. New York: Springer, pp. 751–759.

Seth C. Lewis, Rodrigo Zamith, and Alfred Hermida, 2013. “Content analysis in an era of Big Data: A hybrid approach to computational and manual methods,” Journal of Broadcasting & Electronic Media, volume 57, number 1, pp. 34–52.

Stine Lomborg and Anja Bechmann, 2014. “Using APIs for data collection on social media,” Information Society, volume 30, number 4, pp. 256–265.

Ingrid Lunden, 2015. “Twitter cuts off DataSift to step up its own Big Data business,” TechCrunch (11 April), accessed 17 April 2016.

Lev Manovich, 2012. “Trending: The promises and the challenges of big social data,” In: Matthew K. Gold (editor). Debates in the digital humanities. Minneapolis: University of Minnesota Press, pp. 460–475.

James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, 2011. “Big Data: The next frontier for innovation, competition, and productivity,” New York: McKinsey Global Institute, accessed 17 April 2016.

Ericka Menchen–Trevino, 2013. “Collecting vertical trace data: Big possibilities and big challenges for multi–method research,” Policy & Internet, volume 5, number 3, pp. 328–339.

Nielsen, 2016. “Welcome to the Nielsen store,” accessed 17 April 2016.

W. Russell Neuman, Lauren Guggenheim, S. Mo Jang, and Soo Young Bae, 2014. “The dynamics of public attention: Agenda-setting theory meets Big Data,” Journal of Communication, volume 64, number 2, pp. 193–214.

John Paolillo, 1999. “The virtual speech community: Social network and language variation on IRC,” Journal of Computer-Mediated Communication, volume 4, number 4.

Alon Peled, 2013. “The politics of Big Data: A three-level analysis,” paper presented at the European Consortium of Political Research (ECPR) general conference (Bordeaux, France), accessed 17 April 2016.

Sheizaf Rafaeli and Fay Sudweeks, 1997. “Networked interactivity,” Journal of Computer-Mediated Communication, volume 2, number 4.

Bernhard Rieder, 2013. “Studying Facebook via data extraction: The Netvizz application,” WebSci ’13: Proceedings of the Fifth Annual ACM Web Science Conference, pp. 346–355.

Evelyn Ruppert, 2013. “Rethinking empirical social sciences,” Dialogues in Human Geography, volume 3, number 3, pp. 268–273.

Dhavan V. Shah, Alex Hanna, Erik P. Bucy, Chris Wells, and Vidal Quevedo, 2015. “The power of television images in a social media age: Linking biobehavioral and computational approaches via the Second Screen,” Annals of the American Academy of Political and Social Science, volume 659, number 1, pp. 225–245.

Marko M. Skoric, 2013. “The implications of Big Data for developing and transitional economies: Extending the triple helix?” Scientometrics, volume 99, number 1, pp. 175–186.

Sean J. Taylor, 2013. “Real scientists make their own data” (25 January), accessed 17 April 2016.

James Turino and Hadrien Kulik, 2014. “Sector report: Business intelligence,” New York: Redwood Capital, accessed 17 April 2016.

Joseph Turow, Lee McGuigan, and Elena R. Maris, 2015. “Making data mining a natural part of life: Physical retailing, customer surveillance and the 21st century social imaginary,” European Journal of Cultural Studies, volume 18, numbers 4–5, pp. 464–478.

Twitter, 2016. “Calendar of API changes,” accessed 17 April 2016.

Twitter, 2014a. “Introducing Twitter data grants” (5 February), accessed 17 April 2016.

Twitter, 2014b. “Twitter #DataGrants selections” (17 April), accessed 17 April 2016.

Jan van Dijk and Kenneth Hacker, 2003. “The digital divide as a complex and dynamic phenomenon,” Information Society, volume 19, number 4, pp. 315–326.

Farida Vis, 2013. “A critical reflection on Big Data: Considering APIs, researchers and tools as data makers,” First Monday, volume 18, number 10, accessed 17 April 2016.

Robert E. Wilson, Samuel D. Gosling, and Lindsay T. Graham, 2012. “A review of Facebook research in the social sciences,” Perspectives on Psychological Science, volume 7, number 3, pp. 203–220.

Jason S. Wrench, Candice Thomas-Maddox, Virginia Peck Richmond, and James C. McCroskey, 2015. Quantitative research methods for communication: A hands-on approach. Third edition. New York: Oxford University Press.

Asta Zelenkauskaite, 2016. “Remediation, convergence, and Big Data: Conceptual limits of cross-platform social media,” Convergence: The International Journal of Research into New Media Technologies (17 February).

Asta Zelenkauskaite and Bruno Simões, 2015. “User interaction profiling on Facebook, Twitter, and Google+ across radio stations,” 2015 48th Hawaii International Conference on System Sciences (HICSS), pp. 1,657–1,666.


Editorial history

Received 16 December 2015; revised 18 April 2016; accepted 19 April 2016.

Copyright © 2016, First Monday.
Copyright © 2016, Asta Zelenkauskaite and Erik P. Bucy. All Rights Reserved.

A scholarly divide: Social media, Big Data, and unattainable scholarship
by Asta Zelenkauskaite and Erik P. Bucy.
First Monday, Volume 21, Number 5 - 2 May 2016