The promise of ‘big data’ has generated a great deal of interest in the development of new approaches to research in the humanities and social sciences, as well as a range of important critical interventions which warn of an unquestioned rush to ‘big data’. Drawing on the experiences gained in developing innovative ‘big data’ approaches to social media research, this paper examines some of the repercussions for the scholarly research and publication practices of those researchers who do pursue the path of ‘big data’–centric investigation in their work. As researchers import the tools and methods of highly quantitative, statistical analysis from the ‘hard’ sciences into computational, digital humanities research, must they also subscribe to the language and assumptions underlying such ‘scientificity’? If so, how does this affect the choices made in gathering, processing, analysing, and disseminating the outcomes of digital humanities research? In particular, is there a need to rethink the forms and formats of publishing scholarly work in order to enable the rigorous scrutiny and replicability of research outcomes?
‘Big data’ has become the latest buzzword in the scholarly as well as commercial research community. In the humanities and social sciences, which this paper focusses on, the rise of ‘big data’ is associated with, and indeed to some degree enables, what David Berry (2011) has described as the “computational turn”: the increased incorporation of advanced computational research methods that are able to draw on and process large datasets into fields which have traditionally dealt with considerably more limited collections of evidence, and which generated and processed such evidence mainly through manual methods ranging from ethnographic observation to the close reading of texts. This gradual turn towards computational methods, both augmenting and at times replacing more conventional research approaches, also contributes to the rise of a new field of “digital humanities” (Schreibman, et al., 2004), whose growing importance has been demonstrated not least by the increasing number of high–profile digital humanities conferences held in recent years.
‘Big data’ is a term which has risen to considerable prominence within a short space of time — also because of the significant commercial interests which now attach themselves to the generation, marketing, and utilisation of ‘big data’. The hype surrounding the term has led a substantial number of researchers to rebadge what they do as ‘big data research’ without necessarily engaging with the concept in any significant scholarly way. For this reason, the ‘big data’ trend in the humanities has met with important and often well–justified criticism; it shares this fate with other scholarly and popular buzzwords, such as ‘Web 2.0’. boyd and Crawford’s “Six provocations for big data” (2011; also see boyd and Crawford, 2012) have been widely recognised as a seminal moment in the scholarly critique of ‘big data’. Arguing against the blind acceptance of such large datasets as unproblematic research resources, they note the following critical questions:
- How do big data change the meaning of knowledge?
- How objective or accurate are these data?
- Are ‘big data’ always more suited to the research task than ‘small data’?
- Can ‘big data’ preserve the contexts of what they describe?
- Is the use of ‘big data’ ethically acceptable?
- Will ‘big data’ lead to a new digital divide between ‘data haves’ and ‘data have–nots’? (paraphrased from boyd and Crawford, 2012)
This intervention — and the discussion which has followed it — has been instrumental in moving scholarly engagement with ‘big data’ from an unquestioned acceptance of the data evangelism of interested parties towards a more critical, considered stance. It does not constitute a rejection of new computational research methods, or of the digital humanities as such; rather, it recognises that while ‘big data’ are useful and even necessary for a wide range of research projects and initiatives, the implications of their use must be carefully considered: ‘big data’ are no quick and easy shortcut towards new fields of research.
Building on boyd and Crawford’s critical remarks, then, and assuming that satisfactory responses justifying the careful and considered use of ‘big data’ resources in specific research projects and contexts can be found, a further set of questions arises. These centrally concern the potential for doing and disseminating ‘big data’ research without simply ignoring boyd and Crawford’s provocations: if their intervention is interpreted at its core as a call to break open the black box of ‘big data’ and to describe its inner workings (to document how the data were gathered, what their limitations are, how they were processed and interpreted, and generally to open these datasets to independent scrutiny), then the digital humanities scholars seeking to do so still have some substantial work ahead of them.
To be clear on this point: not all of the research drawing on ‘big data’ seeks to engage in modes of analysis which incorporate substantial quantitative, statistical, computational components. ‘Big data’ research in the humanities is rarely an end in itself, and does not replace or invalidate other approaches. It remains possible, and important, to continue to employ the well–established analytical repertoire of media, communication, and cultural studies (and related disciplines), and indeed to use an initial exploration of the larger datasets which are now available as a way to pinpoint the specific areas that will prove most fruitful for ‘small data’ and ‘deep data’ research that proceeds, for example, through close reading, direct engagement, or ethnographic observation. Such research continues to offer valid and often indispensable insights, and this article does not aim to offer a critique of these methods.
But alongside these well–established research approaches, new forms of ‘big data’ research in the digital humanities borrow increasingly from computer science and the natural sciences in their use of computational (and hence mathematical and statistical) methods. Given the very different objects of research, the translation of such methods to the digital humanities is far from trivial: researchers in this emerging field must address the question of whether they need to import the associated approaches to documenting and publishing the research process and its outcomes from these disciplines as well. Does an increased reliance on quantitative, statistical, ‘scientific’ data analytics necessitate a standardisation of methods in order to ensure the replicability of results? What is the effect of an increasing scientificity (in the narrow, ‘hard sciences’ sense) on how we, as humanities researchers, document and publish our work? What can we gain, what do we stand to lose by emulating the scholarly language and publication formats of fields from mathematics through physics to computer science?
This paper therefore considers the challenges ahead for scholarly publishing in the digital humanities. It does so by focussing especially on a particular branch of ‘big data’ research which can be considered to be at the forefront of many of these developments: social media research. Building on the substantial volume of public communication data which can now be accessed through the application programming interfaces (APIs) of platforms such as Facebook and, especially, Twitter (cf., Burgess and Bruns, 2012), a particularly large number of claims to ‘big data’ research have been made in this area. The ‘big social data’ retrieved (or at least theoretically retrievable) through such APIs promise a more comprehensive, real–time perspective on the vast range of everyday public communicative activities which the hundreds of millions of users of such spaces undertake.
However, where the processing and analysis of very large datasets does become an important or indeed the central focus of the research, and where methods for the computational analytics of patterns in the data are appropriated from other disciplines in order to carry out such analysis, digital humanities researchers face considerable challenges in adapting these methods to their own ends without taking potentially problematic shortcuts. One question in this context relates to the extent to which, alongside the methods and tools of computational analysis, the standard approaches to documenting and testing such analyses must also be imported into the digital humanities: should we spend considerably more time on describing in fine detail the provenance of our datasets, the steps taken, tools used, and assumptions employed in processing our data? And how can the results of such very large–scale social media analytics be communicated effectively, and with the scholarly rigour necessary to defend the work against critical scrutiny?
One of the dangers associated with the use of ‘big data’ in social media research is that it can give such work a veneer of scientificity which it does not necessarily always deserve. In order for ‘big data’–centric social media analytics in the humanities to be scientifically defensible, rather than merely to appear vaguely ‘sciencey’, a number of key conditions must be met:
2.1. Non–opportunistic data gathering
Social media (and here, especially Twitter) provide rapidly changeable, real–time communicative spaces, and a considerable subset of current social media research focusses specifically on the live dynamics of such spaces — for instance, in the context of acute events (Burgess and Crawford, 2011) such as natural disasters (e.g., Hughes and Palen, 2009; Mendoza, et al., 2010; Palen, et al., 2010; Bruns, et al., 2012) or political crises (e.g., Lotan, et al., 2011; Bruns, et al., 2013), but also of audience participation in shared televised experiences (Harrington, et al., 2012; Highfield, et al., 2013). However, the very fact that such live events are unfolding in real time and may take unexpected turns also makes it difficult to gather social media datasets which reliably represent the object of study. On Twitter, for example, the API makes it possible to capture specific hashtags which are related to a given event, but these constitute only the very lowest–hanging fruit which are available to researchers: for example, hashtag archives fail to capture topical conversations relating to an event which do not use a hashtag (or which use one of a number of alternative hashtags, by accident or deliberately), and generate an apparently well–delimited object of research which may not have been experienced in this form by any actual user.
Hashtag–based research of Twitter communication, in other words, abstracts from a considerably more complex reality — in which users may encounter hashtagged tweets through their network connections, but not follow the hashtag feed itself, and in which these hashtagged tweets exist alongside a range of material which may or may not relate to the same event or issue, and which originates from the same network of followers. Researchers focus on hashtag datasets in order to simplify data gathering and analysis processes, but in doing so they create and describe a new reality which does not necessarily represent the lived experience of any one user. Such research is opportunistic, and in the absence of more sophisticated or more comprehensive data gathering approaches which can transcend the limitations of hashtag work, that opportunism can be justified — but its limitations must be spelt out in any publications which are based on hashtag analyses, and attempts must be made to develop better research methods (and tools) which address these limitations.
One response to these limitations could lie in an embrace of even bigger data: the continuous tracking of all social media activity on a platform like Twitter, independent of specific events, would enable researchers to search these datasets for engagement with specific events and themes after the fact — by identifying relevant hashtags, but also other keywords, users, or shared resources (links, images, videos, ...) that were relevant to the topic at hand. (A focus on Twitter only would still not be able to address the problem that even a comprehensive dataset for this one, specific platform cannot be used to extrapolate to social media usage as such, of course.) The very large datasets required for such filtering a posteriori are rarely available to scholarly researchers, however: to capture, store, and maintain such (rapidly growing) data collections of everyday social media activity would require very substantial infrastructure investments. At present, only commercial third–party data access providers such as DataSift offer historical social media data services (at significant prices), while the much–anticipated comprehensive archive of all Twitter data since the inception of the service which was gifted to the Library of Congress in 2010 has yet to be made available, held up by significant concerns over the technical, legal, and ethical frameworks under which public access may be provided (see e.g., McMillan, 2013).
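The difference between hashtag-only capture and filtering a broader collection after the fact can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four tweets are invented, the hashtags and keyword are placeholders, and the helper functions (`hashtags`, `match_hashtag`, `match_keywords`) are hypothetical rather than part of any existing tracking tool.

```python
import re

# Invented sample of already-collected tweets (illustrative data only).
tweets = [
    {"id": 1, "user": "alice", "text": "Water rising fast near the river #qldfloods"},
    {"id": 2, "user": "bob", "text": "Flood water has reached our street, stay safe"},
    {"id": 3, "user": "carol", "text": "Roads closed due to flooding #bnefloods"},
    {"id": 4, "user": "dave", "text": "Great coffee this morning"},
]


def hashtags(text):
    """Return the set of lower-cased hashtags found in a tweet text."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}


def match_hashtag(tweet, tag):
    """True if the tweet carries exactly this hashtag (case-insensitive)."""
    return tag.lower() in hashtags(tweet["text"])


def match_keywords(tweet, keywords):
    """True if any keyword occurs anywhere in the lower-cased tweet text."""
    text = tweet["text"].lower()
    return any(keyword in text for keyword in keywords)


# What a single-hashtag archive would have captured.
hashtag_ids = {t["id"] for t in tweets if match_hashtag(t, "qldfloods")}

# What keyword filtering of the full collection finds after the fact.
keyword_ids = {t["id"] for t in tweets if match_keywords(t, ["flood"])}

# Topical tweets that the hashtag archive alone would have missed.
missed = keyword_ids - hashtag_ids
print(sorted(hashtag_ids), sorted(keyword_ids), sorted(missed))
```

In this toy collection the hashtag archive retains one tweet, while the keyword filter also recovers a tweet without any hashtag and one using an alternative hashtag. Simple substring matching is itself a methodological choice with its own false positives and negatives, and would need to be documented as such.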
Even some more limited, continuous social media data tracking exercises, focussing for example on a specific, pre–determined sample of all social media activity on a given platform (such as a bundle of relevant keywords, users within a defined geographical area, or a representative sample; cf., Gerlitz and Rieder, 2013, on the latter), would still require significant data storage and processing infrastructure if they are to deliver datasets that provide a more realistic picture of actual social media activity than Twitter hashtag datasets are able to offer. Further, it remains important to heed boyd and Crawford’s warning that “bigger data are not always better data”: the challenge here is to improve the quality rather than (or at least at the same time as) the quantity of data gathered. This, then, means that researchers must carefully consider just how they select the social media datasets or data streams they draw on, and what the limitations of these sources are, rather than (as is too often the case with simple hashtag datasets) using these datasets only because they were the easiest to capture while still appearing to contain meaningful data.
2.2. Full documentation of methods
The discussion of research choices cannot stop with data gathering approaches, of course. Just as importantly, the further steps in processing these data must also be documented in detail. This is especially important because of the relative novelty of social media research, and the substantial and continuing innovation in methods and tools which is taking place at present. However, those processes of innovation also complicate the documentation of methods, as many of the computational methods for processing large–scale social media data which researchers have developed manifest in the first place in home–made, idiosyncratic software solutions that are built on a wide variety of development platforms and supporting infrastructure and are far from ready to be shared publicly even if (as is not always the case) their developers do intend to eventually make their tools available under open source or similar licences.
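One lightweight way to document processing steps, even where the underlying tools are not yet ready to be shared, is to keep a machine-readable methods record alongside each dataset. The following is a minimal sketch under stated assumptions: the class names (`ProcessingStep`, `MethodsRecord`), the field names, and the example values are all hypothetical and do not correspond to any established standard in the field.

```python
import json
import platform
from dataclasses import asdict, dataclass, field


@dataclass
class ProcessingStep:
    """One documented step in the processing pipeline (hypothetical schema)."""
    name: str
    tool: str
    tool_version: str
    parameters: dict
    note: str = ""


@dataclass
class MethodsRecord:
    """Provenance and processing record for one dataset (hypothetical schema)."""
    dataset_description: str
    collection_source: str
    collection_terms: list
    known_limitations: list
    # Record the interpreter version used, as one small aid to replication.
    python_version: str = field(default_factory=platform.python_version)
    steps: list = field(default_factory=list)

    def add_step(self, step):
        self.steps.append(step)

    def to_json(self):
        # asdict() recurses into the nested ProcessingStep dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Example values are placeholders, not a description of any real study.
record = MethodsRecord(
    dataset_description="Tweets containing one tracked hashtag",
    collection_source="self-hosted tracking tool (placeholder)",
    collection_terms=["#examplehashtag"],
    known_limitations=["Tweets without the hashtag are not captured"],
)
record.add_step(ProcessingStep(
    name="deduplicate",
    tool="in-house script",
    tool_version="0.1",
    parameters={"key": "tweet_id"},
    note="Drops repeated deliveries of the same tweet.",
))
methods_json = record.to_json()
print(methods_json)
```

A record of this kind could be published as supplementary material even where the dataset itself cannot be shared, so that referees and readers can at least see which choices were made at each step.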
Alternatively, researchers may utilise one of a number of commercial social media analytics services or draw on various emerging standard tools for computer–aided content and network analysis. While this approach aids the standardisation and replicability of research methods, it is often forced to place blind trust in the assumptions (and their representation in algorithms and software code) about the data and the ‘correct’ approach to their analysis that underlie such tools. There is a substantial danger that social media analytics services and tools are treated by researchers as unproblematic black boxes which convert data into information at the click of a button, and that subsequent scholarly interpretation and discussion build on the results of the black box process without questioning its inner workings. (The same may well be true also for methods of statistical analysis which have by now become part of the fundamental toolkit that is taught at undergraduate level and outlined in introductory social science textbooks — here, too, the applicability of such methods in specific contexts is rarely questioned.)
This can be the case even for tools whose processing steps are comparatively well–documented, but often lie outside of the disciplinary expertise of (social) media and communication researchers: the various visualisation algorithms offered by the well–known, open source network analysis software Gephi, for example, are generally described in some detail in software guides and related literature, but relatively few of the scholarly publications which draw on Gephi to visualise the social networks they study insert any substantive discussion of the benefits or limitations of the particular Gephi network visualisation algorithms they have chosen, or of the specific visualisation settings which were used to direct the algorithm itself (cf., Markham and Lindgren, 2013, for a more detailed discussion of such issues, also using Gephi as an example). Neither, it should be noted, do the referees for articles on these topics usually request such methodological information: treatment of tools such as Gephi as black boxes whose interior operations can be ignored is not limited to scholarly authors alone.
There are clear practical reasons for the bypassing of such discussion, of course: given standard word or time limits for articles and presentations, a lengthy methodological discussion would limit the space available for actual data analysis, and authors and referees may not have the disciplinary background required to authoritatively discuss and assess the relative benefits of one algorithm or another, of one algorithmic setting or another. Such excuses are unsustainable in the longer term, however, if data-intensive social media research is to operate at levels of scholarly rigour which are comparable to other fields of research. Where scholars themselves lack the necessary grounding in network analysis, statistics, computer science, or other areas which are now becoming crucial to successfully conduct rigorous social media analytics work, they will need to develop their own knowledge or to enter into interdisciplinary collaborations with researchers from these fields; where available space does not permit the discussion of research approaches ab initio in each new article, it will be important to establish a range of standard tools and methods which are widely accepted in the field as it emerges and generate comparable and replicable results for each dataset; where researchers are developing their own tools for social media analysis, it is incumbent on them to document and share those tools openly so that other scholars can evaluate their functionality and use them for their own research projects. Such joint initiatives to establish a shared set of foundational, well–tested and standardised research methods and tools are common in other scientific disciplines; they are crucial for the further development of ‘big data’ social media analytics. 
The aim of such initiatives, it should be noted, is not to stifle methodological or technological innovation in research by prescribing a small number of standard methods which must be used, but rather to enable further methods and tools development by standardising and documenting those foundational, reliable methods which no longer need to be re–invented from scratch each time a new social media research project commences.
2.3. Replicability of results
A central goal in any of this is to enhance the replicability of research results, and thereby the evaluability of methods and theories. Replicability is a fundamental requirement of any scientific work, of course; unless it can be shown that a repeat of an existing research experiment under similar circumstances generates comparable results, the first experiment may have been affected by circumstantial factors which render its findings unreliable. ‘Big data’ social media work which aims to find patterns in how people use social media services must be replicable by definition: an observation of a single hashtag event on Twitter, for example, is unable to make meaningful statements about activity patterns beyond that event unless similar activity dynamics are observed for other hashtag events using the same methodology. Using a standard set of Twitter metrics as defined in Bruns and Stieglitz (2013), by contrast, the investigation of broader patterns that transcend individual events becomes possible: Bruns and Stieglitz (2012) show, for example, that hashtags which relate to crisis events all appear to exhibit a number of similar characteristics.
But replicability does not only relate to the translatability of research methods across datasets, but also to the repeatability, by different research teams, of analyses which use the same dataset. Such repetition of analysis is important in order to test the analytical assumptions and conceptual choices made by each researcher or team, and to identify potential alternatives which may affect and change the interpretation of the data. At present, in social media research such replicability is severely hindered, if not prevented entirely, by the idiosyncratic format and provenance of most datasets: drawing on the platforms’ application programming interfaces and using a range of often home-grown tools to gather and process incoming data, there is no guarantee that two teams of researchers attempting to gather the same data at the same time will end up with identical datasets; similarly, any further processing and analysis of the data may also lead to the generation of research datasets which are widely divergent from one another even if they draw on the same source.
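Where two teams are able to exchange at least the identifiers of the tweets they captured, the degree of divergence between their datasets can be quantified before any further analysis is compared. The sketch below is illustrative only: the function name `compare_datasets`, the invented identifier lists, and the choice of the Jaccard index as a summary measure are assumptions, not an established procedure in the field.

```python
def compare_datasets(ids_a, ids_b):
    """Summarise the overlap between two collections of tweet identifiers."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    union = a | b
    return {
        "only_a": len(a - b),   # captured by team A but not by team B
        "only_b": len(b - a),   # captured by team B but not by team A
        "shared": len(shared),  # captured by both teams
        # Jaccard index: 1.0 means identical datasets, 0.0 means no overlap.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Invented identifiers standing in for two teams' captures of the same event.
team_a = [101, 102, 103, 104, 105]
team_b = [102, 103, 104, 105, 106, 107]

report = compare_datasets(team_a, team_b)
print(report)
```

A low overlap would suggest that differences in findings may stem from data gathering rather than from analytical choices; a high overlap does not by itself guarantee that subsequent processing steps were equivalent.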
One solution which would enable researchers to test each other’s methods and results, then, would entail the open sharing of data across teams, perhaps especially alongside the published articles which draw on such data (or even as a precondition for their publication). However, in many cases such sharing of API–derived data will almost certainly violate the terms and conditions under which these data are made available to API users, including researchers, in the first place: the Twitter API’s “Developer rules of the road”, for example, state that “exporting Twitter Content to a datastore as a service or other cloud based service ... is not permitted” (Twitter, 2013; my emphasis), and any attempt by researchers to publicly share their datasets may break these rules (also cf., Puschmann and Burgess, 2013).
While the legality of these restrictions (especially where they clash with the requirements for the open publishing of data and results which may be placed on researchers by their funding institutions) has yet to be tested in most jurisdictions, a number of precedents exist in which the mere threat of legal action by a social media platform provider has led researchers or research support service providers to discontinue their open publishing of data; amongst the better known of these is the closure of Twapperkeeper, a popular Twitter keyword/hashtag archiving service (see Melanson, 2011). Twapperkeeper allowed its users to track their terms of interest, but also made the hashtag archives created in the process available to all other users of the service, thereby violating the API rules; after some time of toleration by Twitter, it was finally required to shut down in 2011. However, such censorship (or pre–emptive self–censorship) of open data publishing by researchers has resulted only in the reduplication of data gathering efforts and a widely acknowledged network of covert data sharing activities. An open source version of Twapperkeeper, yourTwapperkeeper, which remains within the “Rules of the road” by providing merely the functionality to gather data, but not the data sharing service itself, remains available for researchers to install and operate on their own servers; not all such installs are locked down to prevent access to their datasets by unauthorised users, and many of the datasets thus gathered are also shared between interested researchers through other means.
In reality, therefore, Twitter’s increasingly restrictive interpretation of its API access and data sharing rules has not had an appreciable impact on the circulation of Twitter data amongst social media researchers, but has led to a substantial reduplication of data gathering efforts (wasting time and resources), to significant uncertainty that the many installs of yourTwapperkeeper and similar gathering tools are operated and maintained with comparable levels of care and attention (raising the possibility that researchers may inadvertently base their analyses on datasets that are faulty and incomplete), and to a worrying number of Twitter researchers working in a legal and ethical grey area where their data handling and management practices may be in breach of Twitter’s terms and conditions and/or their university ethical review boards’ requirements. The only alternative to such ultimately unsustainable practices which appears to be available at this point is to draw on the commercial social media data resellers such as Gnip and DataSift, but their services are priced at a level which renders them unaffordable for the majority of social media researchers in universities. Social media analytics researchers in the academy face a dilemma between the unsustainable and the unaffordable, therefore; as long as that dilemma remains unresolved, this means that they must cede a considerable amount of ground in the field to commercial social media analytics research organisations which have very different standards and agendas.
These observations do not denigrate the quality and achievements of extant scholarly social media research; scholars in the academy have shown remarkable resilience and inventiveness in working around these restrictions and limitations, if at times by bending the rules of what social media APIs allow them to do. Their travails do not end once they have managed to find an arrangement of sorts with the data sources they draw on, however; a second set of challenges emerges when it comes to disseminating their ‘big social data’ analyses to their peers:
3.1. Spatial limitations
As noted above, the comparative novelty of social media research (in general, and of social media research which draws on ‘big data’ in particular) means that the field is still developing its research methods and tools, and that these need to be documented in much greater detail than has been the case to date. Such detailed documentation is almost impossible alongside the substantive discussion of research findings and results, however, due to the word limit of standard journal articles; similarly, for the same reasons a full discussion of the provenance and reliability of datasets can rarely be included in article drafts.
Space is also a significant limiting factor in the publication of datasets. While it may be possible to include ‘small data’ datasets in an appendix to a published article, this is inherently impossible for social media datasets that contain several hundreds of thousands or even millions of individual tweets, Facebook posts, or other interactions; again, the publication of such datasets, even in a non–programmatic format which means that they cannot be immediately reused in further analysis, may also violate the terms and conditions under which social media platform APIs make the data available in the first place.
Taken together, these limitations arising from the lack of available publication space result in an undesirable situation where the analysis of social media activities that is presented in scholarly work may be fascinating and potentially insightful, but where article referees and interested readers do not have sufficient background information on methods, tools, and the nature of the datasets underpinning the analysis to form a comprehensive independent view on the quality and validity of the work presented in an article. This is self–evidently problematic; it may be addressed in the first place by a greater emphasis on researchers’ documenting and publishing their methodological development initiatives (ranging from the conceptual establishment of frameworks for the understanding of social media activities to the incorporation of such frameworks in research software for the gathering, processing, and analysis of social media data at large scale), separate from and as companion pieces to the articles which present the data analyses conducted by using such concepts, methods, and tools. In part, this should also be seen as a call to greater interdisciplinary recognition and collaboration: simplifying the situation only slightly, it is currently the case that much methodological development in social media research is published in computer science and related journals and proceedings, while data analysis and interpretation is presented in humanities and social science works, but that these two strands of social media analytics remain largely disconnected from and unaware of each other.
3.2. Temporal limitations
A second major set of limitations concerns the speed of academic publishing. This is a problem which is by no means limited to the field of social media research: researchers have long criticised and even mocked a scholarly publishing industry in which journal articles and book chapters can sometimes take more than two years from submission to publication (and where even conference proceedings do not necessarily emerge at a substantially quicker speed). However, in much of its present work, social media analytics deals particularly frequently with current issues and events — for example, with elections, sporting and entertainment events, crises, or other momentary phenomena (e.g., Larsson and Moe, 2011; Highfield, 2013; Highfield, et al., 2013). Additionally, the fast–paced development of social media (a reminder is appropriate at this point that the two major global social media platforms, Facebook and Twitter, are both less than a decade old), as well as of social media scholarship alongside it, also means that the work of scholars is in danger of being outdated by the time it reaches its audience unless the process from submission to publication can be sped up considerably.
Recent initiatives by print journals to introduce “online first” platforms in which accepted articles are published well in advance of print publication, and the continuing growth of online–only, open access journals which eliminate the delays caused by the need to print and distribute hardcopies of articles, have mitigated this fundamental problem with scholarly publishing only to a limited extent; perceptions of the ephemerality of social media data, and consequently of the ephemerality of social media research, have led many researchers to seek even more rapid platforms for the dissemination of their work, even if it means bypassing conventional peer review processes and the official validation that they imply. A substantial component of social media research is therefore now published through a loose network of researchers’ and research groups’ blogs and other institutional Web sites, enabling these scholars to disseminate their findings in both preliminary and fully–formed shape considerably closer to the events which they pertain to.
Commercial social media research institutions — as well as platform providers such as Facebook and Twitter (the latter especially through its Twitter blog) — are considerably ahead of scholarly researchers in this context, and already publish a range of quick–fire analyses of social media phenomena; this is done not least also with a view to generating media exposure for these commercial entities. Concerns about the quality of such largely instrumental commercial research work can be raised, therefore; in turn, these also highlight the comparative absence of truly scholarly perspectives in much of the public discussion of social media events and phenomena to date. U.K. newspaper The Guardian’s collaboration with leading social media scholars in an analysis of social media activity during the 2011 U.K. riots, as part of its Reading the Riots initiative (Guardian Interactive Team, et al., 2011), is one notable and welcome exception which points out a possible approach to addressing this problem.
In principle, then, rapid dissemination initiatives for scholarly work should be encouraged and supported; however, the lack of formal peer review which necessarily results from such self–publication of results also means that (accidental or deliberate) errors or misinterpretations in data and analysis can slip through uncorrected much more easily. Much like unsubstantiated rumours on social media platforms themselves, insufficiently verified research on social media phenomena may thus widely circulate and be built upon in further research within the scholarly community, without such errors being noticed or corrected. An optimistic response to such concerns dates back at least to Clay Shirky’s (2002) “publish, then filter” mantra and holds that the community of scholars and other interested readers will eventually detect and address such errors even if no formal peer review was conducted: “the good is sorted from the mediocre after the fact”. Using a related approach, some scholarly journals now employ post–publication, open peer review to balance the needs for speed and verification — but such departures from conventional peer review approaches are not without their critics.
There can be little doubt that the speed of social media will continue to be substantially greater than the speed of peer–reviewed scholarly publishing, whether in print or online. Attempts to streamline the latter while maintaining accountability, which have emerged in a range of fields over past years, will increasingly play an important role in this field as well; post–publication peer review and ad hoc rather than volume– and issue–bound refereeing processes may offer some answers to the present situation. It should also be noted that in a range of fields in the natural and computer sciences, conferences (and their proceedings) are significantly more highly valued than they are in the humanities and social sciences; it may well be the case that a greater valorisation of such avenues for the more timely exchange of current research findings may be necessary in social media studies in the humanities as well (and several key conferences in the field, chiefly including the annual conference of the Association of Internet Researchers, already point to the utility of conferences over journal articles).
3.3. Format limitations
Finally, however, even if the means to increase the speed of publication can be found, the format of published work still requires further consideration. Conventional article formats as inherited from print journals still dominate, even in online journals where more innovative formats could be explored; especially in social media research the need for such exploration is increasingly strongly felt. The outcomes of ‘big data’ social media research often necessarily include complex data visualisations which need to draw on three or even four dimensions to present their findings in full detail; the strongly temporal nature of social media data (consisting as they do of a stream of individual utterances made one after the other) points especially to an exploration of dynamic data visualisations that show the development of communicative processes over time (see e.g., Bruns, 2011, for a discussion of such visualisation opportunities). This also implies that online publications would need to serve as the preferred venue for social media research, of course.
Additionally, the multifaceted nature of social media data predestines them for interactive visualisations that enable users of the research to explore various aspects of the dataset by adding or removing specific data layers (again, The Guardian’s “Data Blog” serves as a leading example from a non– or semi–scholarly publishing context here; see Guardian, 2013). Such visualisations would seek to reposition the scholarly audience as users and even co-researchers rather than just readers, and have the potential of communicating to users a greater understanding of the nature of the dataset and of the analytical processes involved in examining it than is possible with a small number of static graphs in a conventional paper. Interactive visualisations must be well–designed and still require substantial discussion to be associated with them, however; they should therefore by no means be seen as a shortcut to a presentation of research data and findings which bypasses the core work of the scholar, or as valuing visual over textual research outputs.
In spite of significant discussion about such new, more interactive models for scholarly online journals over the past decade, few well–established venues for such novel publication formats exist to date. Journals such as Vectors: Journal of Culture and Technology in a Dynamic Vernacular (http://vectors.usc.edu/) represent welcome initiatives that deserve stronger support from scholars working with complex datasets, but have yet to be widely accepted as mainstream publishing venues; Vectors’ limited output since its inception in 2005 (with only one issue published since 2007) points to the difficulties in publishing novel and innovative content as much as to the reluctance of scholarly authors in preparing it. Academic promotion and tenure systems which privilege journal articles and other standard outputs because they are easier to quantify than more unusual outputs must share some of the blame for this.
Much as the greater incorporation of dynamic and even interactive data visualisations in (online) scholarly publications can serve to enhance the presentation of data analysis and findings, the greater flexibility of online formats for the sharing and publication of the underlying datasets must also be explored further. Here, the significant obstructions resulting from API terms of service cannot be easily ignored, of course; further, even if these issues can be resolved, the ethical and privacy implications of sharing social media datasets that contain the posts and associated metadata of thousands or millions of users still serve as significant barriers to an open sharing of datasets which would enhance the reliability and replicability of published research. At least partial solutions to these necessary limitations may be found, however, by sharing data in carefully de–identified formats or restricting access only to accredited researchers who submit to strict guidelines for the ethical handling of social media data. Ultimately, it appears that this is an issue which must be addressed in the near future if the current situation, in which researchers are forced to put blind faith in the ‘black box’ datasets upon which their peers base their analyses, is to end.
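As one illustration of what sharing data in a de–identified format might involve, the following minimal Python sketch replaces account identifiers with keyed pseudonyms and withholds the message text. It is not drawn from any particular project: the record fields (user_id, timestamp, text), the sample records, and the helper names pseudonymise and deidentify are all hypothetical, and a real release would require a fuller ethical review, since metadata such as timestamps can still allow re–identification.

```python
import hashlib
import hmac


def pseudonymise(user_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for user_id using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Keep only fields needed for activity metrics; drop text and account names."""
    return {
        "author": pseudonymise(record["user_id"], secret_key),
        "timestamp": record["timestamp"],
        # Derived flag kept in place of the message text itself.
        "is_retweet": record["text"].startswith("RT @"),
    }


# Hypothetical sample records; the field names are assumptions for illustration.
SECRET_KEY = b"replace-with-a-key-kept-private-by-the-research-team"
tweets = [
    {"user_id": "12345", "timestamp": "2011-01-12T03:15:00Z", "text": "RT @example: road closed"},
    {"user_id": "12345", "timestamp": "2011-01-12T03:20:00Z", "text": "Stay safe everyone"},
    {"user_id": "67890", "timestamp": "2011-01-12T03:21:00Z", "text": "Power is out here"},
]
shared = [deidentify(t, SECRET_KEY) for t in tweets]
```

Because the same account always maps to the same pseudonym, per–user activity metrics can still be computed and compared by other researchers, while the key that links pseudonyms to accounts stays with the original research team.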
The current state of scholarly research (and research publication) in social media analytics is perhaps not unlike that of other emergent fields in recent times; however, this should not be seen as licence for the field to continue on its present course. At this stage, the body of scholarly social media analytics research — especially where it seeks to explore the new opportunities promised by the emergence of ‘big data’ — exists as a hybrid network of publications that stretches across refereed print and online articles, officially and unofficially published conference papers, and non–refereed blog posts and other research updates on the sites of research institutes and in the interested media. In combination, the disorganised nature of this body of work, the incompatibility of research concepts, methods, and tools, and the lack of comprehensive perspectives on current developments in social media research have led to a notable balkanisation of research efforts: the field is characterised by a substantial volume of research initiatives which are, however, often isolated from one another and produce work whose methodologies and outcomes are incompatible and incomparable with one another.
The substantial delays which apply to publishing in most peer–reviewed venues mean that post–graduate researchers and other emerging scholars who seek to gain an overview of the field are perhaps better advised at this point to explore a range of leading blogs and other research Web sites which regularly publish updates on their research groups’ activities than to follow the key journals, which often serve mainly as the final resting places for work which originated in the blogs and was subsequently extended and written up for full publication. The same lags between research and publication also mean that commercial researchers (for example at a number of market and media research institutes which now also cover social media trends), as well as a handful of enterprising scholars who have been especially adept at developing non–traditional platforms for their work, take a major share of the limelight for their work; the attention paid to such research is at times due to the speed of publication more than the quality of the insights presented, however.
Further, the creeping commercialisation of truly ‘big data’ as a result of the limitation of open API capabilities and the emergence of third–party social media data resellers like Gnip and DataSift means that scholars at conventional research institutions are likely to be further marginalised from this growing field, due to their inability to marshal the funds required to access the comprehensive or large–scale datasets which such resellers now offer. While boyd and Crawford’s warning that “bigger data are not always better data” still applies, the situation as it now emerges is such that the distinction between bigger and smaller datasets is no longer a considered choice that researchers are free to make, but a two–class system forced upon them by economic necessities. Increasingly, a handful of (commercial, or strongly commercially–supported) research institutes dominate the ‘big social media data’ field, while the majority of ‘regular’ scholars must scratch for crumbs and make do with the more and more limited data which the platforms’ open APIs still provide. This new digital divide between data ‘haves’ and ‘have–nots’ stands to have deleterious effects for the quality of social media research.
At this critical juncture for the further development of social media research, especially in the academy, a range of options must be explored. First, it appears necessary to find ways to address the tightening data access bottleneck. Non–API access methods (such as data scraping from the public Web pages of social media sites) are usually impractical, not least because they fail to retrieve some of the crucial underlying metadata which are not exposed to Web visitors and are only available through the API. At the same time, a change of heart amongst the social media platform providers which would see them open up their APIs to a greater extent again at least for publicly–funded, public–interest social media research — for example into the role of social media in democratic participation or during natural disasters — appears unlikely. Therefore, it seems necessary for university–based researchers to begin to pool their resources rather than to continue to waste time and effort by reinventing their data–gathering approaches in isolation from each other for each new research initiative.
One option worth exploring is the development of coordinated, federated data access consortia which would pool their resources in order to acquire ‘big data’ access to social media data from services like Gnip or DataSift and would make these data available to accredited researchers at all consortium member institutions. Such consortium initiatives may need to test — if necessary, in court — the applicability of terms of service restrictions relating to the on–sharing of datasets retrieved through provider or third–party reseller APIs, but in doing so may also highlight to providers the stifling effects which their data commercialisation efforts have on the conduct of legitimate public–interest research which often documents and demonstrates the utility of the platforms themselves. The considerably more permissive data access approaches of other platforms might also be highlighted in order to persuade providers to loosen their access restrictions for research purposes: Wikipedia, for example (though admittedly run by a non–profit organisation), provides unfettered API access and even offers its entire database for download.
Second, there is a need to expand and coordinate the opportunities for publishing social media research in both traditional and non–traditional formats. In the first place, this means broadening the range of potential outlets, especially with a view to redressing the balance between methods– and analysis–related work. While the latter can already find outlets in conventional media and communication journals, the former sits more uneasily in such spaces and is at times relegated to computer science publications or a handful of methods–focussed special issues in the humanities and social sciences. Analogous to the way that journalism research is served by sister journals Journalism Studies and Journalism Practice in order to cover both major facets of contemporary journalism scholarship, social media research may need twin outlets for Social Media Methods and Social Media Analytics in order to develop further as a field — or at least requires a greater appetite for methods-centric work in extant publications. Article referees should also demand a greater focus on methods discussion in the papers they review. The problem lies on both the supply and the demand side, however; humanities–based social media researchers themselves must also become more proactive in documenting and publishing their methodological frameworks than they have been to date.
This, then, would also encourage the further standardisation of social media research methods and tools — a process which has barely begun so far. Agreeing on a range of core tools and approaches for the fundamental tasks of social media research, and making these widely available throughout the scholarly community, would save considerable time and effort which could in turn be spent at the leading edge of further methodological innovation in social media research. It would enable a greater range of researchers to become active in the field as they would be able to draw on standard and well–documented approaches and processes, and it would substantially boost the compatibility and comparability of research activities and findings across diverse projects and institutions.
But changes to the scholarly publishing environment for social media research cannot lie only in the development of additional conventional publishing outlets. Rather, it is important especially for this field of research also to challenge and rethink the traditional processes and formats of scholarly publishing, and to explore alternative avenues. The key aims in this should be speed and flexibility: in this fast–lived field, the turnaround from analysis to publication must be shortened without sacrificing scholarly rigour and accountability, and the presentation of research findings must be improved in order to make them more accessible and interrogable. This inherently implies a focus on online publication, an incorporation of advanced and dynamic visualisations, and an exploration of alternative approaches to peer review (including open and post-publication refereeing); it may also mean a greater valorisation of conferences and conference publications over print and online journals in this branch of the humanities and social sciences.
Finally, such innovative publishing venues should also encourage the sharing of the underlying datasets wherever possible, even in the face of restrictive terms of service for the use of social media API data. As noted, access to data is often crucial to ensure the reliability and replicability of the analysis, and the social media providers’ terms of service have forced the field of social media research to be plagued by an abundance of black boxes: research publications whose analyses appear valid, but whose validity is impossible to test for readers and referees alike without gaining access to the data upon which they are built. However difficult in practice, for legal as well as ethical reasons, this is a crucial issue which must be addressed in order to ensure the future development of social media analytics as a scholarly and rigorous field of research.
About the author
Dr. Axel Bruns is an Associate Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia, and a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation (http://cci.edu.au/). He is the author of Blogs, Wikipedia, Second Life and beyond: From production to produsage (2008) and Gatewatching: Collaborative online news production (2005), and a co–editor of Twitter and society (2013), A companion to new media dynamics (2012) and Uses of blogs (2006). Bruns is an expert on the impact of user–led content creation, or produsage, and his current work focusses on the study of user participation in social media spaces such as Twitter, especially in the context of acute events. His research blog is at http://snurb.info/, and he tweets at @snurb_dot_info. See http://mappingonlinepublics.net/ for more details on his current social media research.
E–mail: a [dot] bruns [at] qut [dot] edu [dot] au
Notes
1. E.g., the Digital Humanities conference series, held annually since 1990 and growing steadily in size and stature, or the biennial Digital Humanities Australasia conference, inaugurated in 2012. An Alliance of Digital Humanities Organisations (ADHO, http://adho.org/) was established in 2005.
2. boyd and Crawford, 2012, p. 668.
3. It is difficult to state with any certainty whether researchers in the (digital) humanities are more reluctant to share their methods — and tools–in–progress — with their peers at an early stage of development. However, anecdotal evidence suggests that the greater prevalence of individualistic rather than (lab–style) group–based work in the humanities, combined with some degree of embarrassment over their proficiency at writing code, might mean that humanities researchers are less inclined to publicly share their code than their counterparts in computer science or in the computational natural sciences.
4. boyd and Crawford, 2012, p. 668.
References
David Berry, 2011. “The computational turn: Thinking about the digital humanities,” Culture Machine, volume 12, at http://www.culturemachine.net/index.php/cm/article/view/440/470, accessed 24 September 2013.
danah boyd and Kate Crawford, 2012. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon,” Information, Communication & Society, volume 15, number 5, pp. 662–679.
doi: http://dx.doi.org/10.1080/1369118X.2012.678878, accessed 24 September 2013.
danah boyd and Kate Crawford, 2011. “Six provocations for big data,” paper presented at “A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society” (Oxford).
doi: http://dx.doi.org/10.2139/ssrn.1926431, accessed 24 September 2013.
Axel Bruns, 2011. “How long is a tweet? Mapping dynamic conversation networks on Twitter using Gawk and Gephi,” Information, Communication & Society, volume 15, number 9, pp. 1,323–1,351.
doi: http://dx.doi.org/10.1080/1369118X.2011.635214, accessed 24 September 2013.
Axel Bruns and Stefan Stieglitz, 2013. “Towards more systematic Twitter analysis: Metrics for tweeting activities,” International Journal of Social Research Methodology, volume 16, number 2, pp. 91–108.
doi: http://dx.doi.org/10.1080/13645579.2012.756095, accessed 24 September 2013.
Axel Bruns and Stefan Stieglitz, 2012. “Quantitative approaches to comparing communication patterns on Twitter,” Journal of Technology in Human Services, volume 30, numbers 3–4, pp. 160–185.
doi: http://dx.doi.org/10.1080/15228835.2012.744249, accessed 24 September 2013.
Axel Bruns, Jean Burgess, and Tim Highfield, 2013. “The Arab Spring and social media audiences: English and Arabic Twitter users and their networks,” American Behavioral Scientist, volume 57, number 7, pp. 871–898.
doi: http://dx.doi.org/10.1177/0002764213479374, accessed 24 September 2013.
Axel Bruns, Jean Burgess, Kate Crawford, and Frances Shaw, 2012. “#qldfloods and @QPSMedia: Crisis communication on Twitter in the 2011 South East Queensland floods,” Brisbane: ARC Centre of Excellence for Creative Industries and Innovation, at http://cci.edu.au/floodsreport.pdf, accessed 14 September 2013.
Jean Burgess and Axel Bruns, 2012. “Twitter archives and the challenges of ‘big social data’ for media and communication research,” M/C Journal, volume 15, number 5, at http://journal.media-culture.org.au/index.php/mcjournal/article/view/561, accessed 14 September 2013.
Jean Burgess and Kate Crawford, 2011. “Social media and the theory of the acute event,” paper presented at Internet Research 12.0 — Performance and Participation (Seattle, October 2011).
Carolin Gerlitz and Bernhard Rieder, 2013. “Mining one percent of Twitter: Collections, baselines, sampling,” M/C Journal, volume 16, number 2, at http://journal.media-culture.org.au/index.php/mcjournal/article/viewArticle/620, accessed 14 September 2013.
The Guardian, 2013. “Data Blog: Facts are sacred,” at http://www.guardian.co.uk/news/datablog, accessed 14 September 2013.
Guardian Interactive Team, Rob Procter, Farida Vis, and Alex Voss, 2011. “How riot rumours spread on Twitter,” In: Reading the riots: Investigating England’s summer of disorder, The Guardian (8 December), at http://www.guardian.co.uk/uk/interactive/2011/dec/07/london-riots-twitter, accessed 14 September 2013.
Stephen Harrington, Tim Highfield, and Axel Bruns, 2012. “More than a backchannel: Twitter and television,” In: José M. Noguera (editor). Audience interactivity and participation. Brussels: COST Action Transforming Audiences, Transforming Societies, pp. 13–17.
Tim Highfield, 2013. “Following the yellow jersey: Tweeting the Tour de France,” In: Katrin Weller, Axel Bruns, Jean Burgess, Merja Mahrt, and Cornelius Puschmann (editors). Twitter and society. New York: Peter Lang.
Tim Highfield, Stephen Harrington, and Axel Bruns, 2013. “Twitter as a technology for audiencing and fandom: The #Eurovision phenomenon,” Information, Communication & Society, volume 16, number 3, pp. 315–339.
doi: http://dx.doi.org/10.1080/1369118X.2012.756053, accessed 24 September 2013.
Amanda L. Hughes and Leysia Palen, 2009. “Twitter adoption and use in mass convergence and emergency events,” International Journal of Emergency Management, volume 6, numbers 3–4, pp. 248–260.
doi: http://dx.doi.org/10.1504/IJEM.2009.031564, accessed 24 September 2013.
Anders O. Larsson and Hallvard Moe, 2011. “Studying political microblogging: Twitter users in the 2010 Swedish election campaign,” New Media & Society, volume 14, number 5, pp. 729–747.
doi: http://dx.doi.org/10.1177/1461444811422894, accessed 24 September 2013.
Gilad Lotan, Erhardt Graeff, Mike Ananny, Devin Gaffney, Ian Pearce, and danah boyd, 2011. “The revolutions were tweeted: Information flows during the 2011 Tunisian and Egyptian revolutions,” International Journal of Communication, volume 5, at http://ijoc.org/index.php/ijoc/article/view/1246, accessed 24 September 2013.
Annette Markham and Simon Lindgren, 2013. “From object to flow: Network sensibility, symbolic interactionism, and social media,” Studies in Symbolic Interaction; version at http://markham.internetinquiry.org/writing/MarkhamLindgrenOptimizedForBlog.pdf, accessed 24 September 2013.
Graeme McMillan, 2013. “Huzzah! Library of Congress’ useless Twitter archive is almost complete ... but you can’t read it yet,” Digital Trends (7 January), at http://www.digitaltrends.com/social-media/library-of-congress-useless-twitter-archive-is-almost-complete/, accessed 14 September 2013.
Mike Melanson, 2011. “Twitter puts the smack down on another popular app: Whither Twitter as a platform?” ReadWriteWeb (22 February), at http://readwrite.com/2011/02/22/twitter_puts_the_smack_down_on_another_popular_app, accessed 14 September 2013.
Marcelo Mendoza, Barbara Poblete, and Carlos Castillo, 2010. “Twitter under crisis: Can we trust what we RT?” SOMA ’10: Proceedings of the First Workshop on Social Media Analytics, pp. 71–79.
doi: http://dx.doi.org/10.1145/1964858.1964869, accessed 24 September 2013.
Leysia Palen, Kate Starbird, Sarah Vieweg, and Amanda Hughes, 2010. “Twitter–based information distribution during the 2009 Red River Valley flood threat,” Bulletin of the American Society for Information Science and Technology, volume 36, number 5, pp. 13–17.
doi: http://dx.doi.org/10.1002/bult.2010.1720360505, accessed 24 September 2013.
Cornelius Puschmann and Jean Burgess, 2013. “The politics of Twitter data,” In: Katrin Weller, Axel Bruns, Jean Burgess, Merja Mahrt, and Cornelius Puschmann (editors). Twitter and society. New York: Peter Lang.
Susan Schreibman, Ray Siemens, and John Unsworth (editors), 2004. A companion to digital humanities. Oxford: Blackwell, at http://www.digitalhumanities.org/companion/, accessed 14 September 2013.
Clay Shirky, 2002. “Broadcast institutions, community values,” Clay Shirky’s Writings about the Internet: Economics and Culture, Media and Community, Open Source (9 September), at http://shirky.com/writings/herecomeseverybody/broadcast_and_community.html, accessed 14 September 2013.
Twitter, 2013. “Developer rules of the road” (2 July), at https://dev.twitter.com/terms/api-terms, accessed 14 September 2013.
Received 16 September 2013; accepted 17 September 2013.
“Faster than the speed of print: Reconciling ‘big data’ social media analysis and academic scholarship” by Axel Bruns is licensed under a Creative Commons Attribution–NonCommercial–ShareAlike 3.0 Australia License.
Faster than the speed of print: Reconciling ‘big data’ social media analysis and academic scholarship
by Axel Bruns.
First Monday, Volume 18, Number 10 - 7 October 2013
© First Monday, 1995-2013.