In April 2010, the U.S. Library of Congress and the popular micro-blogging company Twitter announced that every public tweet, since Twitter’s inception in March 2006, will be archived digitally at the Library and made available to researchers. The Library of Congress’ planned digital archive of all public tweets holds great promise for the research community, yet, over five years since its announcement, the archive remains unavailable. This paper explores the challenges faced by the Library that have prevented the timely realization of this valuable archive, divided into two categories: challenges involving practice, such as how to organize the tweets, how to provide useful means of retrieval, how to physically store them; and challenges involving policy, such as the creation of access controls to the archive, whether any information should be censored or restricted, and the broader ethical considerations of the very existence of such an archive, especially privacy and user control.
Challenges of the Library of Congress Twitter Archive
In April 2010, the U.S. Library of Congress and the popular micro-blogging company Twitter announced an agreement providing the Library a digital archive of all public tweets — short Web messages of up to 140 characters — from March 2006 (when Twitter first launched) through April 2010. Additionally, Twitter agreed to provide the Library all future public tweets on an ongoing basis (Raymond, 2010a). At the time of the announcement, Twitter was processing more than 50 million tweets per day from people around the world, and the historical archive consisted of approximately 170 billion tweets.
The Library of Congress’ commitment to archiving all public Twitter traffic is a clear recognition of the historical and cultural importance of this new information and communication medium. By providing a simple platform for users to explain “what’s happening” in 140 characters or less, Twitter has become the Internet’s de facto public forum for sharing “pretty much anything [users] wanted, be it information, relationships, entertainment, citizen journalism, and beyond” (Dybwad, 2009). While some have been quick to characterize Twitter’s content as “pointless babble” (CNBC, 2009), others point to the social value in even the most mundane tweets (boyd, 2009; Miller, 2008). Furthermore, Twitter has become a preferred communication and information-sharing platform for a variety of contexts, including reporting breaking news, organizing political protests, facilitating emergency communications, managing organizational communication and public relations, and the shared experiencing of live sporting and media events. Twitter also represents a robust social network of over 284 million active users engaging in information exchange, displaying complex arrangements of strong and weak social ties, rising and falling influence of particular nodes, and the trending patterns of particular topics over time. As a result, researchers have been quick to recognize the value in studying Twitter users and activities to gain a better understanding of its users, uses, and impacts on society and culture from a variety of perspectives (boyd and Ellison, 2008; boyd, 2013; Weller, et al., 2013; Zimmer and Proferes, 2014).
The Library of Congress’ planned digital archive of all public tweets holds great promise for the research community, providing long-term curation and access to this valuable information resource. Yet, over five years since its announcement, the archive remains unavailable. Reasons for the lengthy delay are varied, but some of the blame rests on unique challenges faced by the Library from the perspective of library and information science (LIS). These can be organized into two categories: challenges involving practice, such as how to organize the tweets, how to provide useful means of retrieval, how to physically store them; and challenges involving policy, such as the creation of access controls to the archive, whether any information should be censored or restricted, and the broader ethical considerations of the very existence of such an archive, especially privacy and user control. This paper explores these challenges from an LIS perspective, showing that while the Library of Congress has started to address many of the challenges of practice, the policy challenges remain largely unanswered.
Growth and challenges of Twitter-based research
Since its launch in 2006, Twitter has rapidly gained worldwide popularity, with over 284 million registered users as of 2014, generating over 500 million tweets each day (Twitter, 2014a). Twitter’s 140-character, plain text messages are relatively easy to process and store, and access to this stream of data (and related user account metadata) has been provided through Twitter’s own application programming interfaces (APIs) and related third-party services. With fewer than 10 percent of users taking steps to gain privacy through restricting access to their accounts (Meeder, et al., 2010; Moore, 2009), Twitter has emerged as a valuable resource for researchers hoping to tap into the zeitgeist of the Internet and often beyond.
Researchers working with Twitter data at various levels of scale and complexity have already generated rich insights into the use, and users, of this social media platform. A recent analysis of published academic research utilizing Twitter data revealed over 380 publications from a wide range of disciplines, including computer and information science, communication, economics, social and behavioral sciences, and the humanities (Zimmer and Proferes, 2014). The focus of such studies included content and sentiment analysis of particular tweets or communities of users, mapping of social networks and the propagation of information, assessing the predictive value of Twitter content, and simply relying on Twitter data as a convenient corpus of text for linguistic, rhetorical, or statistical analysis. The majority of data was collected through Twitter’s application programming interface (API) or from the Web site directly, and the size of datasets analyzed ranged from only a handful of tweets to some numbered in the billions.
While researchers have been successful using existing tools to gain access to tweets and related Twitter data for analysis, limitations persist. Notably, Twitter made significant changes to its application programming interface (API) and terms of service in early 2011 (Melanson, 2011; Ramji, 2011) that limited researchers’ ability to access and share Twitter data, and effectively shut down popular services used by researchers to track and archive Twitter activity, such as TwapperKeeper and 140kit (Sample, 2011; Watters, 2011; Wisdom, 2013). With these changes, Twitter restricted how often researchers can request data through its APIs, and started to limit the share of tweets (ranging from one percent to 10 percent) available through such automated services. The method Twitter uses to apply this filtering of tweets is kept secret, and thus presents a considerable limitation to many research studies (boyd and Crawford, 2012).
One way to overcome the limitations of the APIs is to use the Twitter Firehose — a realtime feed provided by Twitter that allows access to 100 percent of all public tweets. Only a select number of organizations have been granted access to the Firehose, and thus a substantial drawback for researchers is the cost of purchasing access from these partners certified by Twitter (Gannes, 2010; Luckerson, 2013). Further, the amount of computing resources required to receive, filter, and process the Firehose data can be daunting, if not out of reach, for many scholars (Ingram, 2014). Consequently, researchers have been forced to decide between two imperfect means of accessing Twitter data: the freely available but limited streaming API, or the comprehensive but expensive Firehose.
In early 2014, Twitter announced a pilot project called Twitter Data Grants, allowing researchers to submit proposals in order to obtain free access to Twitter datasets (Krikorian, 2014a). While promising to help connect researchers with the data they need, only six of the over 1,300 proposals — less than 0.5 percent — were awarded free access to Twitter datasets in order to move forward with their research (Krikorian, 2014b). In the face of such odds, the announcement that Twitter is donating its entire digital archive of public tweets to the Library of Congress came as a potential boon for researchers (Landgraf, 2010; Stross, 2010), promising to overcome many of the barriers of engaging in Twitter-based research.
The Twitter Archive at the Library of Congress
The Library of Congress, a research library that officially serves the United States Congress, is the nation’s oldest federal cultural institution and, implicitly, the national library of the United States. It is also the largest library in the world, with more than 36 million books and printed materials, as well as more than 121 million maps, manuscripts, photographs, films, audio and video recordings, prints and drawings, and other special collections. Among the Library’s varied initiatives is the National Digital Information Infrastructure and Preservation Program, a national strategy to collect, preserve and make available significant digital content, especially information that is created in digital form only, for current and future generations. Since 2000, the Library of Congress has been creating collections of archived Web sites on such topics as U.S. politics and national elections, the Iraq War, Supreme Court nominations, and the events of 11 September 2001. As of March 2014, the Library has collected about 525 terabytes of Web archive data, growing at a rate of about five terabytes per month (Library of Congress, 2014). It is in this tradition of digital preservation that the Library of Congress recognized the need to archive and maintain stewardship over the billions of public tweets that have become part of the historical record of political, cultural, and social events and trends around the world.
On 14 April 2010, the Library of Congress and Twitter announced an agreement had been signed providing the Library the public tweets from the company’s inception through the date of the agreement, an archive of tweets from 2006 through April 2010. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms (Raymond, 2010a; Stone, 2010). The two-page gift agreement between Twitter and the Library of Congress (Twitter — Library of Congress, 2010) provides conditions under which the archive is to be made available:
- It includes only public tweets;
- The Library may display and otherwise make available public tweets only after a six-month delay;
- The Library will not provide a “substantial portion” of the archive on its public Web site in a format that could easily be subject to bulk download;
- Access should only be provided to “bona fide” researchers in accordance with “the policies of the custodial division of the Library responsible for the administration and service of materials of this nature,” and only if the researcher signs a notification prohibiting commercial use and redistribution of “all or a substantial part” of the archive.
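The first two conditions above can be read as a simple release rule, sketched here for illustration only. The function names are hypothetical, and the interpretation of "six months" as six calendar months from the posting date is an assumption; the gift agreement does not specify how the delay is computed.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the date `months` calendar months after `d`, clamping the day
    to the end of the month when needed (e.g. 31 Aug + 6 months -> 28 Feb)."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def is_releasable(posted, is_public, today, embargo_months=6):
    """Hypothetical check: only public tweets, and only after the delay.
    The six-calendar-month reading of the delay is an assumption."""
    return is_public and today >= add_months(posted, embargo_months)


# A public tweet posted on the day of the announcement.
posted = date(2010, 4, 14)
print(is_releasable(posted, True, date(2010, 10, 13)))   # one day too early
print(is_releasable(posted, True, date(2010, 10, 14)))   # delay satisfied
print(is_releasable(posted, False, date(2011, 1, 1)))    # protected tweet
```

The remaining conditions (no bulk download, access limited to "bona fide" researchers) are matters of institutional policy rather than per-tweet rules and are not modeled here.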
Additional details were provided by the Library in a blog post a few weeks after the announcement, noting that deleted tweets will not be included, and that “linked information such as pictures and Web sites is not part of the archive, and the Library has no plans to collect the linked sites” (Raymond, 2010b).
Since these initial announcements, the Library has provided few details regarding how the archive will be processed, how researchers will have access to actual Twitter data, or when the collection will be made available. In response to an information request from the author, the Library of Congress indicated in a 3 January 2012 letter that “the Library is still working on technical issues related to the implementation of the agreement, the material is still coming in, and the process of how to provide the material to researchers ... is still being worked out” (Nave, 2012a). Six months later, again in response to an information request from the author, the Library indicated it was still working on technical issues, and confirmed it had already received and stored over 80 terabytes of data containing over 120 billion tweets (Nave, 2012b).
In January 2013, the Library publicly provided a detailed update of the project, announcing it had received the full 2006–2010 archive of approximately 170 billion tweets totaling 133.2 terabytes, and had established a “secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day” (Library of Congress, 2013). The Library’s update provided detailed descriptions of its current process for receiving Twitter data — via an intermediate company named Gnip — and the challenges of processing and storing such a large volume of information, as the ongoing stream of public tweets to be processed had grown to nearly half a billion messages each day (Allen, 2013).
Researcher access had still not been provided, and the Library suggested public-private partnerships might be necessary to overcome the technical and infrastructural limitations that currently prevent the Library from providing researchers meaningful access to the data. As of the time of this writing (June 2015), over five years since the initial announcement, the Twitter Archive remains inaccessible.
Challenges of the Library of Congress Twitter Archive
As with any collection, the Twitter archive must be processed, organized, and catalogued in order to make it accessible and useful for researchers. While the Library of Congress has a long history working with digital content, the Twitter archive has posed a unique challenge:
The Twitter Archive represents a new type of collection. The Twitter collection is not only very large, it also is expanding daily, and at a rapidly increasing velocity. The variety of tweets is also high, considering distinctions between original tweets, re-tweets using the Twitter software, re-tweets that are manually designated as such, tweets with embedded links or pictures and other varieties. (Library of Congress, 2013)
It is not uncommon for a library to spend months processing large acquisitions; yet, the Twitter Archive remains unavailable more than five years after the initial announcement. Reasons for the long delay are numerous, and include the immense size of the initial archive, the growing size of the incremental updates to the collection, the complexities of the data itself (managing over 100 metadata fields associated with each tweet, processing embedded links and shortened URLs, and so on), the contractual agreement to delay access to tweets for six months over privacy concerns, and the need to develop appropriate access and usage policies. The Library has discussed many of these challenges openly, while some remain unaddressed, and others largely unrecognized.
The following sections outline these challenges in greater detail, organizing them into two categories: challenges involving practice, such as how to process and organize the tweets, how to physically store them, and how to provide useful means of access and retrieval; and challenges involving policy, such as the creation of appropriate access controls to the archive, whether any information should be censored or restricted, and the broader ethical considerations of the very existence of such an archive.
Challenges for practice
Size, complexity, and continuous growth
While the Library of Congress is quite adept with the preservation of large amounts of digital information — it has been archiving all congressional and presidential campaign Web sites since 2000, for example, and has collected over 525 terabytes of Web archive data — the Twitter Archive has posed unique technical challenges due to its size and complexity. The initial archive of all public tweets from 2006–2010 pledged to the Library consisted of 21 billion tweets, each with more than 50 accompanying metadata fields. The Library received this data in early 2012 — nearly two years after the original announcement — via Gnip, a social media data aggregation firm chosen as the delivery agent to migrate the data from Twitter to the Library in a usable format. This initial archive was delivered in three compressed files totaling 2.3 terabytes of data, which was uncompressed to a size of 20 terabytes. In December 2012, the Library received a second large batch of 150 billion additional tweets and corresponding metadata, increasing the size of the initial archive to 170 billion tweets totaling over 133 terabytes of data. Given that the Library held 167 terabytes of data in its digital collections at the time of the original announcement of the Twitter archive agreement (Raymond, 2010a), the receipt of the initial historical archive of Twitter data has nearly doubled the amount of data held within the Library’s infrastructure.
A contributing factor to the large size of the Twitter Archive is the amount of metadata that accompanies each tweet. More than just the 140-character plain text that a user types into the Twitter interface, each tweet contains 150 pieces of metadata, such as a unique numerical ID, a timestamp, a location stamp, IDs for any replies, favorites and retweets that the tweet gets, the language, the date the account was created, the URL of the author if a Web site is referenced, the number of followers, and numerous other technical specifications (Dwoskin, 2014).
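A back-of-the-envelope calculation makes the weight of this metadata concrete. The sketch below uses only the figures reported in this section; it assumes decimal terabytes (10^12 bytes), which the Library's reports do not specify, and the variable names are illustrative.

```python
TB = 10**12  # assumption: decimal terabytes; the source does not say

first_batch_compressed_tb = 2.3     # first delivery, compressed
first_batch_uncompressed_tb = 20    # same delivery, uncompressed
archive_tweets = 170e9              # full 2006-2010 archive
archive_tb = 133                    # "over 133 terabytes"
holdings_2010_tb = 167              # Library digital collections, April 2010

# How much the first delivery expanded when uncompressed (roughly 8.7x).
compression_ratio = first_batch_uncompressed_tb / first_batch_compressed_tb

# Average storage per tweet, metadata included (roughly 780 bytes,
# several times the 140 characters of the message itself).
bytes_per_tweet = archive_tb * TB / archive_tweets

# Size of the archive relative to the Library's prior holdings (about 0.8).
relative_growth = archive_tb / holdings_2010_tb

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"bytes per tweet:   {bytes_per_tweet:.0f}")
print(f"relative growth:   {relative_growth:.0%}")
```

On these assumptions, the message text accounts for well under a fifth of the stored bytes per tweet, with the remainder attributable to metadata and encoding overhead.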
Adding to the challenge of providing stable and sustainable digital storage for such a large dataset is the fact that the Twitter collection is not a static archive, but a continuous stream of new data generated daily by the platform’s millions of users. At the time of the initial announcement with the Library, Twitter was processing 50 million tweets daily (Raymond, 2010a). Now, over five years later, the daily output has jumped to over 500 million (Twitter, 2014a), with particular global events, such as the World Cup, natural disasters, or even the airing of a popular television program, generating remarkable spikes in Twitter activity. On 3 August 2013, for example, Japanese Twitter users watching the television program Castle in the Sky set a record for Twitter activity of 143,199 tweets per second (the average is about 5,700 tweets a second) (Krikorian, 2013). To handle this growing volume of activity, Twitter has invested heavily in its technical infrastructure, continually re-architecting how it processes, archives, and displays activity for its users. The Library of Congress, lacking the resources and workforce at Twitter’s disposal, has had difficulties addressing the technical size and complexity of the archive (Nave, 2012a, 2012b), and has proposed partnering with outside technologists to try to find a workable solution.
In sum, while the Library of Congress has experience and expertise with digital archiving and managing large databases, the scale of the Twitter Archive seems well beyond its typical digital collection, a practical challenge that has contributed to the multi-year delay in making the archive accessible and useful for researchers.
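The rates quoted above can be cross-checked with simple arithmetic. The following sketch uses only the figures cited in this section; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_2010 = 50_000_000            # tweets per day at the 2010 announcement
daily_2014 = 500_000_000           # tweets per day reported in 2014
peak_per_second = 143_199          # record set on 3 August 2013
quoted_average_per_second = 5_700  # average cited alongside the record

# Tenfold growth in daily volume since the agreement was signed.
growth_factor = daily_2014 / daily_2010

# 500 million tweets a day works out to roughly 5,800 per second,
# consistent with the quoted average of about 5,700.
implied_average = daily_2014 / SECONDS_PER_DAY

# The record spike was roughly 25 times the quoted average rate.
peak_vs_average = peak_per_second / quoted_average_per_second

print(f"growth factor:       {growth_factor:.0f}x")
print(f"implied average:     {implied_average:.0f} tweets/second")
print(f"peak versus average: {peak_vs_average:.1f}x")
```

Any ingest pipeline therefore has to be provisioned not for the average rate but for bursts well over an order of magnitude above it.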
Access and query processing
Once the practical challenges of receiving and processing the sheer volume of tweets in the Twitter Archive are addressed, the Library of Congress must also confront the challenge of providing access to such a unique archive. As detailed by Gaffney and Puschmann (2013), researchers have previously enjoyed (or, perhaps, been frustrated by) various means of access to Twitter data, such as the streaming API, the REST API, and the Search API. Each provides different levels of access, comprehensiveness, and means for filtering or targeting one’s query for Twitter data, and other independent tools have been launched to facilitate access using these APIs, such as 140kit or TwapperKeeper. While the vast majority of current research on Twitter has utilized one of these means of access (Gaffney and Puschmann, 2013; Zimmer and Proferes, 2014), it remains unclear whether the Library of Congress will provide similar direct access to the data elements within the Twitter Archive, or with any restrictions.
In its January 2013 update, the LOC indicated it had already received 400 inquiries from researchers but was not ready to provide access. It reported that a hypothetical query on the 2006–2010 history archive could take 24 hours, which it described as “an inadequate situation,” but also noted that the necessary investment into distributed and parallel computing resources to reduce this search time was “cost-prohibitive and impractical for a public institution” (Library of Congress, 2013). Noting a lack of sophisticated access tools available in the private sector, the Library indicated it is working on a “basic level of access” for researchers. More recently, in response to an early 2014 information request from the author, the Library indicated that the 2006–2010 Twitter collection is being indexed and undergoing further processing by “reference librarians together with technology experts” in preparation for a pilot access program targeted for mid-2014 (Yake, 2014). As of the time of this writing (November 2014), no pilot program has been announced.
It remains unknown what type of indexing and processing the archive has undergone by any information professionals or technology experts assigned to such a major task. Decisions made regarding how the archive is indexed will directly impact not only the speed of query processing (hopefully becoming faster than the reported 24 hours for a simple text search) but also the types of queries possible. The flattest architecture would provide for simple text searches against the entire database of tweeted content, which could also be expanded to include searches against metadata records, or for specific user accounts, specific hashtags, limited by date ranges, or the estimated location of the IP address used to initiate the tweet. Providing a more robust query processing environment will increase the research value of the Archive, but simultaneously pose additional practical challenges for the Library, which is already struggling to simply create a publicly available archive with a basic level of accessibility.
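To illustrate why indexing decisions shape both query speed and the kinds of queries possible, the following is a minimal sketch of an inverted index over tweets. It is a toy, in-memory example and not a description of the Library's system, whose design has not been disclosed; the `Tweet` record, its fields, and the `TweetIndex` class are hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tweet:
    # Hypothetical record; real tweets carry many more metadata fields.
    tweet_id: int
    user: str
    posted: date
    text: str


class TweetIndex:
    """Toy inverted index: maps tokens and users to sets of tweet IDs."""

    def __init__(self):
        self.tweets = {}
        self.by_token = defaultdict(set)
        self.by_user = defaultdict(set)

    def add(self, tweet):
        self.tweets[tweet.tweet_id] = tweet
        self.by_user[tweet.user.lower()].add(tweet.tweet_id)
        # Keep a leading '#' so hashtags are indexed apart from plain words.
        for token in re.findall(r"#?\w+", tweet.text.lower()):
            self.by_token[token].add(tweet.tweet_id)

    def search(self, terms=(), user=None, start=None, end=None):
        """Return tweets containing every term, optionally limited to one
        user account and an inclusive date range."""
        candidate_sets = [self.by_token.get(t.lower(), set()) for t in terms]
        if user is not None:
            candidate_sets.append(self.by_user.get(user.lower(), set()))
        # Intersect posting lists instead of scanning every tweet.
        ids = set.intersection(*candidate_sets) if candidate_sets else set(self.tweets)
        results = []
        for tweet_id in sorted(ids):
            tweet = self.tweets[tweet_id]
            if start is not None and tweet.posted < start:
                continue
            if end is not None and tweet.posted > end:
                continue
            results.append(tweet)
        return results


index = TweetIndex()
index.add(Tweet(1, "alice", date(2010, 6, 11), "Watching the #WorldCup opener"))
index.add(Tweet(2, "bob", date(2010, 7, 11), "What a #WorldCup final!"))
index.add(Tweet(3, "alice", date(2010, 7, 12), "Back to work after the final"))

hashtag_hits = index.search(["#worldcup"], start=date(2010, 7, 1))
user_hits = index.search(["final"], user="alice")
print([t.tweet_id for t in hashtag_hits])  # tweets tagged #worldcup from July on
print([t.tweet_id for t in user_hits])     # alice's tweets mentioning "final"
```

A query here touches only the posting lists for its terms rather than every stored tweet, which is the basic reason an indexed archive can answer in far less time than a full scan. Each additional query type (hashtag, account, date, location) requires its own index structure, and at the scale of the archive that translates directly into storage and processing cost.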
Challenges for policy
The practical challenges of receiving such a large volume of Twitter data, archiving it, and making it accessible and useful are sizeable, and undoubtedly the Library of Congress is putting forward great effort to resolve them quickly. But complementing these practical challenges is an equally imposing set of policy challenges, such as the creation of appropriate access policies, whether any information should be censored or restricted, and the broader ethical considerations of the very existence of such an archive, especially issues of privacy and user control.
The gift agreement between Twitter and the Library of Congress dictates various restrictions on access to the Twitter archive (Twitter — Library of Congress, 2010). First, tweets can only be made available six months after they were originally posted to Twitter. Second, once this time delay has been satisfied, tweets can only be made available to library staff and to “bona fide researchers” as determined by the Library, and who also must sign an agreement prohibiting “commercial use and redistribution” of the Archive. While not explicitly stated, the purpose of the six-month delay is most likely in response to privacy concerns of Twitter users (discussed below). And while restricting access to the archive to “bona fide” researchers appears to be a reasonable attempt to prevent the commercial use of the data, preventing open public access to materials can be a controversial archival practice (Cox, et al., 2009; Geselbracht, 1986; Greene, 1993). While the Twitter archive is meant to be an improvement over the existing means of access to tweets — through limited APIs or commercial resellers — any access restrictions put in place represent a policy dilemma for the Library in relation to how they will define “bona fide” research and determine who gets access to the collection of all public tweets.
The ethical codes and principles of librarians and related information professionals urge providing full access to information to satisfy the unique needs of all patrons and users (American Library Association, 2008; International Federation of Library Associations, 2011; Morgan, 2006). Such pledges of intellectual freedom maintain that materials should not be excluded because of the origin, background, or views of those contributing to their creation, and that any attempts at censorship should be challenged in the fulfillment of professional responsibilities to provide free and unfettered access to information. Numerous libraries have faced challenges to upholding this principle of intellectual freedom (Shaevel, et al., 2006), frequently fielding requests to remove sexually explicit or other controversial materials.
The Library of Congress has not been immune from controversies regarding restricting content. Most recently, the Library was criticized for blocking access to the Wikileaks Web site from its computer systems, including those used by patrons in the reading rooms (Lipton, 2010). And while Twitter itself engages in limited forms of content moderation to ensure users comply with its terms of service regarding appropriate content (Chen, 2014), it remains unknown whether the Library of Congress will similarly filter or restrict tweets based on the content. The gift agreement grants the Library the ability to “dispose” of material in the archive it considers “inappropriate for retention” (Twitter — Library of Congress, 2010), but it remains silent on how such a determination would be made, who is authorized to make it, and whether any public notification would be provided that such exclusions might occur. Since content posted to Twitter often includes pornographic, controversial, copyright-protected, confidential, and perhaps even illegal content, the Library might feel compelled to filter or remove certain tweets from the Archive. Such a move would conflict with the broader principles of intellectual freedom, and this constitutes a significant policy challenge as the archive continues to grow.
The initial announcement of the Twitter archive prompted immediate privacy concerns about creating a permanent archive of tweets, and whether such a proposal was properly aligned with users’ understanding of how the platform worked and their privacy expectations. For example, various comments on the Library of Congress’ Web version of the announcement of the archive contain surprised and frustrated sentiments about the seeming newfound permanence of tweets:
- So with no warning, every public tweet we’ve ever published is saved for all time? What the hell. That’s awful. (Shaun, in Raymond, 2010a)
- I can see a lot of political aspirations dashed by people pulling out old Tweets. I’ve always thought of the service as quite banal and narcissistic, but I’ve had a Twitter account to provide feedback to a college and a couple of vendors. I think I’ll close my account now. I don’t need to risk Tweeting something hurtful or stupid that will be around for all recorded time. (Joe Citizen, in Raymond, 2010a)
- Now future generations can bear witness to how utterly stupid and vain we were — 1. for creating this steaming mountain of pointless gibberings, and 2. for preserving it for posterity. LOC, you nimrods. (Zil Maddet, in Raymond, 2010a)
Even in broadcasting the news, the language Wired chose underscored the apparent transition from a fleeting existence for tweets to a newly instilled sense of permanence when it stated, “While the short form musings of a generation chronicled by Twitter might seem ephemeral, the Library of Congress wants to save them for posterity” (Singel, 2010).
In the wake of the Library of Congress announcement, increased debates over the appropriateness of archiving public Tweets for research purposes have arisen (see, for example, Vieweg, 2010; Zimmer, 2010b; Zimmer, 2010c), focusing largely on concerns over respecting the privacy expectations of Twitter users. Research has shown that between 40 percent and 50 percent of tweets included information about the author (Honeycutt and Herring, 2009; Naaman, et al., 2010), which might include contact data, other personally identifiable information, locational data, health information, and the like (see, for example, Mao, et al., 2011), posing potential privacy threats to users unaware of the fully public nature of their activity or its possible harvesting by researchers.
Similarly, the practice of retweeting represents a risk for the leakage of tweets that had been intended for a restricted audience, thereby generating a considerable privacy threat when archived by researchers. Users who have been granted access to restricted accounts can easily retweet private tweets by copying and pasting into their own, unprotected feed, violating the privacy protections enacted by the original author. In a study of over 80 million Twitter accounts, nearly 250,000 protected accounts had at least one restricted tweet retweeted by a public user (Meeder, et al., 2010). If such retweets of private tweets are included in research databases, the original author’s expectations of privacy might have been breached.
When asked about whether the Twitter Archive could threaten the privacy of users, a Library of Congress spokesperson noted that the Twitter messages that would be archived are already publicly published on the Web: “It’s not as if we’re after anything that’s not out there already,” and that “people who sign up for Twitter agree to the terms of service” (quoted in Lohr, 2010). This is the classic “but the information is already public” argument used to justify the widespread harvesting of social media content (Zimmer, 2010a), which, while technically true, presumes a false dichotomy that information is either strictly public or private, ignoring any contextual norms (Nissenbaum, 2009) that might have guided the initial release of information through Twitter or how a person expects that tweet to flow.
Concerns over these privacy implications of creating a repository of all public tweets could be addressed, at least in part, through ensuring users have sufficient levels of control over their data and overall inclusion in the archive. No Twitter user was asked to provide explicit consent to be included in the Library of Congress archive, and, as noted above, the Library takes a position that since “people who sign up for Twitter agree to the terms of service” (quoted in Lohr, 2010), additional consent is not required. As a result, short of making their entire Twitter account private, users are denied the ability to control whether they wish to have their public tweets archived and made available through the Library of Congress archive.
Further, Twitter provides the ability for users to delete individual tweets from their timeline (Twitter, 2014b), which removes the tweet from the user’s account, the timelines of any accounts that follow them, and also Twitter search results. Unaltered retweets of a tweet will also disappear from the platform when the original is deleted. Users’ ability to delete tweets provides them a considerable amount of control over their online activities and privacy (Almuhimedi, et al., 2013). However, deleting a tweet from the Twitter platform will not have a similar impact on the archive maintained at the Library of Congress, severely limiting users’ ability to control their information.
Overall, the Library of Congress does not appear ready to provide users any form of control or access to their own tweets archived within the large-scale repository. There will be no ability to opt out of the repository, and no means of deleting individual tweets if a user later wishes to remove certain utterances from the archive.
In the five years since the Library of Congress announced its agreement to archive all public Twitter activity and make it available to researchers, the Library has tackled numerous technical challenges in pursuing such an ambitious project. The most recent official update, from January 2013, outlined the progress the Library is making in addressing some of the practical challenges outlined above. Yet, despite this hopeful progress, the many policy challenges (of access, restrictions, privacy, and control) remain largely unresolved.
The library and information science (LIS) profession can provide some guidance to help the Library of Congress address these critical policy issues. The American Library Association’s core ethical documents, as well as those of the Society of American Archivists (SAA), suggest that the Library should enact policies that encourage open access to the digital archive while also protecting the privacy of those whose information is collected in the repository. Sufficiently addressing these policy concerns will, undoubtedly, result in further technical and practical challenges. The Library should, therefore, continue pursuing public-private partnerships to overcome the technical and infrastructural limitations that currently prevent it from providing researchers meaningful access to the data. These partnerships, however, must include not only technical experts in digital archives and information retrieval, but also those versed in information policy, research ethics, and privacy. With such an approach, hopefully, we will not need to wait another five years to make meaningful, and ethical, use of this important digital archive.
About the author
Michael Zimmer, Ph.D., is a privacy and Internet ethics scholar, most notable for his work in online privacy, the ethical dimensions of new media technologies, and Internet research ethics. Zimmer is an Associate Professor in the School of Information Studies at the University of Wisconsin-Milwaukee, where he also serves as Director of the Center for Information Policy Research.
E-mail: zimmerm [at] uwm [dot] edu
1. A summary of the evolution of Twitter’s technical architecture can be found at Krikorian (2013).
Erin Allen, 2013. “Update on the Twitter Archive at the Library of Congress,” Library of Congress Blog (4 January), at http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/, accessed 15 July 2013.
Hazim Almuhimedi, Shomir Wilson, Bin Liu, Norman Sadeh and Alessandro Acquisti, 2013. “Tweets are forever: A large-scale quantitative analysis of deleted tweets,” CSCW ’13: Proceedings of the 2013 Conference on Computer Supported Cooperative Work, pp. 897–908.
doi: http://dx.doi.org/10.1145/2441776.2441878, accessed 11 December 2014.
American Library Association, 2008. “Code of ethics of the American Library Association,” at http://www.ala.org/advocacy/proethics/codeofethics/codeethics, accessed 20 June 2015.
danah boyd, 2013. “Bibliography of research on Twitter & microblogging,” at http://www.danah.org/researchBibs/twitter.php, accessed 15 July 2013.
danah boyd, 2009. “Twitter: ‘Pointless babble’ or peripheral awareness + social grooming?” apophenia (16 August), at http://www.zephoria.org/thoughts/archives/2009/08/16/twitter_pointle.html, accessed 15 July 2013.
danah boyd and Kate Crawford, 2012. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon,” Information, Communication & Society, volume 15, number 5, pp. 662–679.
doi: http://dx.doi.org/10.1080/1369118X.2012.678878, accessed 20 June 2015.
danah boyd and Nicole Ellison, 2008. “Social network sites: Definition, history, and scholarship,” Journal of Computer-Mediated Communication, volume 13, number 1, pp. 210–230.
doi: http://dx.doi.org/10.1111/j.1083-6101.2007.00393.x, accessed 20 June 2015.
Adrian Chen, 2014. “The laborers who keep dick pics and beheadings out of your Facebook feed,” Wired (23 October), at http://www.wired.com/2014/10/content-moderation/, accessed 18 November 2014.
CNBC, 2009. “Twitter is 40% ‘pointless babble’: Report,” CNBC.com (17 August), at http://www.cnbc.com/id/32446935, accessed 15 July 2013.
Richard J. Cox, Abigail Middleton, Rachel Grove Rohrbaugh and Daniel Scholzen, 2009. “A different kind of archival security: Three cases,” Library & Archival Security, volume 22, number 1, pp. 33–60.
doi: http://dx.doi.org/10.1080/01960070802562826, accessed 20 June 2015.
Elizabeth Dwoskin, 2014. “In a single tweet, as many pieces of metadata as there are characters,” WSJ Blogs — Digits (6 June), at http://blogs.wsj.com/digits/2014/06/06/in-a-single-tweet-as-many-pieces-of-metadata-as-there-are-characters/, accessed 13 November 2014.
Barb Dybwad, 2009. “Twitter drops ‘What are you doing?’ now asks ‘What’s happening?’,” Mashable.com (19 November), at http://mashable.com/2009/11/19/twitter-whats-happening/, accessed 13 November 2014.
Devin Gaffney and Cornelius Puschmann, 2013. “Data collection on Twitter,” In: Axel Bruns, Katrin Weller, Jean Burgess, Merja Mahrt and Cornelius Puschmann (editors). Twitter and society. New York: Peter Lang, pp. 55–68.
Liz Gannes, 2010. “What is taking a sip from the Twitter Firehose going to cost?” Gigaom (1 March), at https://gigaom.com/2010/03/01/what-is-taking-a-sip-from-the-twitter-firehose-going-to-cost-you/, accessed 11 June 2014.
Raymond H. Geselbracht, 1986. “The origins of restrictions on access to personal papers at the Library of Congress and the National Archives,” American Archivist, volume 49, number 2, pp. 142–162.
Mark A. Greene, 1993. “Moderation in everything, access in nothing? Opinions about access restrictions on private papers,” Archival Issues, volume 18, number 1, pp. 31–41.
Courtenay Honeycutt and Susan C. Herring, 2009. “Beyond microblogging: Conversation and collaboration via Twitter,” HICSS ’09: 42nd Hawaii International Conference on System Sciences, 2009, pp. 1–10.
doi: http://dx.doi.org/10.1109/HICSS.2009.89, accessed 20 June 2015.
Mathew Ingram, 2014. “Drinking from the Twitter firehose: I love the stream, but I need more filters and bridges,” Gigaom (9 January), at https://gigaom.com/2014/01/09/drinking-from-the-twitter-firehose-i-love-the-stream-but-i-need-more-filters-and-bridges/, accessed 11 June 2014.
International Federation of Library Associations, 2011. “IFLA statement on libraries and intellectual freedom,” at http://www.ifla.org/publications/ifla-statement-on-libraries-and-intellectual-freedom, accessed 20 June 2015.
Raffi Krikorian, 2013. “New tweets per second record, and how!” Twitter Engineering Blog (16 August), at https://blog.twitter.com/2013/new-tweets-per-second-record-and-how, accessed 13 November 2014.
Raffi Krikorian, 2014a. “Introducing Twitter data grants,” Twitter Engineering Blog (5 February), at https://blog.twitter.com/2014/introducing-twitter-data-grants, accessed 6 June 2014.
Raffi Krikorian, 2014b. “Twitter #DataGrants selections,” Twitter Engineering Blog (17 April), at https://blog.twitter.com/2014/twitter-datagrants-selections, accessed 6 June 2014.
Greg Landgraf, 2010. “Historians await access to the Library of Congress’s Twitter Archive,” American Libraries (17 May), at http://americanlibrariesmagazine.org/2010/05/17/historians-await-access-to-the-library-of-congresss-twitter-archive/, accessed 11 June 2014.
Library of Congress, 2014. “Web archiving FAQs,” at http://www.loc.gov/webarchiving/faq.html, accessed 5 June 2014.
Library of Congress, 2013. “Update on the Twitter Archive at the Library of Congress” (January), at http://www.loc.gov/today/pr/2013/files/twitter_report_2013jan.pdf, accessed 14 July 2013.
Eric Lipton, 2010. “Don’t look, don’t read: Government warns its workers away from WikiLeaks documents,” New York Times (4 December), at http://www.nytimes.com/2010/12/05/world/05restrict.html, accessed 18 November 2014.
Steve Lohr, 2010. “Library of Congress will save tweets,” New York Times (14 April), at http://www.nytimes.com/2010/04/15/technology/15twitter.html, accessed 11 December 2014.
Victor Luckerson, 2013. “Twitter is selling access to your tweets for millions,” Time (8 October), at http://business.time.com/2013/10/08/twitter-is-selling-access-to-your-tweets-for-millions/, accessed 11 June 2014.
Huina Mao, Xin Shuai and Apu Kapadia, 2011. “Loose tweets: An analysis of privacy leaks on Twitter,” WPES ’11: Proceedings of the 10th Annual ACM Workshop on Privacy in the Electronic Society, pp. 1–12.
doi: http://dx.doi.org/10.1145/2046556.2046558, accessed 20 June 2015.
Brendan Meeder, Jennifer Tam, Patrick Gage Kelley and Lorrie Faith Cranor, 2010. “RT @IWantPrivacy: Widespread violation of privacy settings in the Twitter social network,” W2SP 2010: Web 2.0 Security & Privacy, pp. 28–48, and at http://www.cs.cmu.edu/~jdtam/Documents/Meeder-SNSP2010.pdf, accessed 20 June 2015.
Mike Melanson, 2011. “Twitter kills the API Whitelist: What it means for developers & innovation,” ReadWrite (11 February), at http://readwrite.com/2011/02/11/twitter_kills_the_api_whitelist_what_it_means_for, accessed 11 September 2013.
Vincent Miller, 2008. “New media, networking and phatic culture,” Convergence, volume 14, number 4, pp. 387–400.
doi: http://dx.doi.org/10.1177/1354856508094659, accessed 20 June 2015.
Robert Moore, 2009. “Twitter data analysis: An investor’s perspective,” TechCrunch (5 October), at http://techcrunch.com/2009/10/05/twitter-data-analysis-an-investors-perspective-2/, accessed 17 November 2012.
Candace D. Morgan, 2006. “Intellectual freedom: An enduring and all-embracing concept,” In: American Library Association. Office for Intellectual Freedom (compiler). Intellectual freedom manual. Seventh edition. Chicago: American Library Association, pp. 3–13.
Mor Naaman, Jeffrey Boase and Chih-Hui Lai, 2010. “Is it really about me? Message content in social awareness streams,” CSCW ’10: Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work, pp. 189–192.
doi: http://dx.doi.org/10.1145/1718918.1718953, accessed 20 June 2015.
John Nave, 2012a. “Response to e-mail, dated 14 December 2011.”
John Nave, 2012b. “Response to e-mail, dated 30 May 2012.”
Helen Nissenbaum, 2009. Privacy in context: Technology, policy, and the integrity of social life. Stanford, Calif.: Stanford University Press.
Sam Ramji, 2011. “With APIs it’s caveat structor — developer beware,” Gigaom (22 March), at http://gigaom.com/2011/03/22/with-apis-its-caveat-structor-%e2%80%93-developer-beware/, accessed 11 September 2013.
Matt Raymond, 2010a. “How tweet it is! Library acquires entire Twitter archive,” Library of Congress Blog (14 April), at http://blogs.loc.gov/loc/2010/04/how-tweet-it-is-library-acquires-entire-twitter-archive/, accessed 5 June 2014.
Matt Raymond, 2010b. “The Library and Twitter: An FAQ,” Library of Congress Blog (28 April), at http://blogs.loc.gov/loc/2010/04/the-library-and-twitter-an-faq/, accessed 6 June 2014.
Mark Sample, 2011. “The end of TwapperKeeper? (And what to do about it),” ProfHacker: Chronicle of Higher Education (8 March), at http://chronicle.com/blogs/profhacker/the-end-of-twapperkeeperand-what-to-do-about-it/31582, accessed 28 August 2013.
Evelyn Shaevel, Beverley Becker and Candace D. Morgan, 2006. “Challenges and issues today,” In: American Library Association. Office for Intellectual Freedom (compiler). Intellectual freedom manual. Seventh edition. Chicago: American Library Association, pp. 45–52.
Biz Stone, 2010. “Tweet preservation,” Twitter Blog (14 April), at https://blog.twitter.com/2010/tweet-preservation, accessed 6 June 2014.
Randall Stross, 2010. “A sea of history: Twitter at the Library of Congress,” New York Times (1 May), at http://www.nytimes.com/2010/05/02/business/02digi.html, accessed 11 June 2014.
Twitter, 2014a. “About Twitter, Inc. | About,” at https://about.twitter.com/company, accessed 30 October 2014.
Twitter, 2014b. “Deleting a tweet,” Twitter Help Center, at https://support.twitter.com/articles/18906-deleting-a-tweet, accessed 20 June 2015.
Twitter — Library of Congress, 2010. “Gift agreement,” at http://blogs.loc.gov/loc/files/2010/04/LOC-Twitter.pdf, accessed 5 June 2014.
Sarah Vieweg, 2010. “The ethics of Twitter research,” Revisiting Research Ethics in the Facebook Era: Challenges in Emerging CSCW Research.
Audrey Watters, 2011. “How recent changes to Twitter’s terms of service might hurt academic research,” ReadWrite (3 March), at http://readwrite.com/2011/03/03/how_recent_changes_to_twitters_terms_of_service_mi, accessed 28 August 2013.
Katrin Weller, Axel Bruns, Jean Burgess, Merja Mahrt and Cornelius Puschmann (editors), 2013. Twitter and society. New York: Peter Lang.
Dick Wisdom, 2013. “How Twitter gets in the way of knowledge,” BuzzFeed (4 January), at http://www.buzzfeed.com/nostrich/how-twitter-gets-in-the-way-of-research, accessed 11 June 2014.
Jeff Yake, 2014. “Response to e-mail, dated 27 March 2014.”
Michael Zimmer, 2010a. “‘But the data is already public’: On the ethics of research in Facebook,” Ethics and Information Technology, volume 12, number 4, pp. 313–325.
doi: http://dx.doi.org/10.1007/s10676-010-9227-5, accessed 20 June 2015.
Michael Zimmer, 2010b. “Is it ethical to harvest public Twitter accounts without consent?” MichaelZimmer.org (12 February), at http://www.michaelzimmer.org/2010/02/12/is-it-ethical-to-harvest-public-twitter-accounts-without-consent/, accessed 11 June 2014.
Michael Zimmer, 2010c. “Open questions about Library of Congress archiving Twitter streams,” MichaelZimmer.org (14 April), at http://www.michaelzimmer.org/2010/04/14/open-questions-about-library-of-congress-archiving-twitter-streams/, accessed 11 June 2014.
Michael Zimmer and Nicholas John Proferes, 2014. “A topology of Twitter research: Disciplines, methods, and ethics,” Aslib Journal of Information Management, volume 66, number 3, pp. 250–261.
doi: http://dx.doi.org/10.1108/AJIM-09-2013-0083, accessed 20 June 2015.
Received 14 December 2014; accepted 20 June 2015.
This paper is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The Twitter Archive at the Library of Congress: Challenges for information practice and information policy
by Michael Zimmer.
First Monday, Volume 20, Number 7 - 6 July 2015