First Monday

Stewardship in the Age of Algorithms by Clifford Lynch

This paper explores pragmatic approaches that might be employed to document the behavior of large, complex socio-technical systems (often today shorthanded as “algorithms”) that centrally involve some mixture of personalization, opaque rules, and machine learning components. Thinking rooted in traditional archival methodology — focusing on the preservation of physical and digital objects, and perhaps the accompanying preservation of their environments to permit subsequent interpretation or performance of the objects — has been a total failure for many reasons, and we must address this problem. The approaches presented here are clearly imperfect, unproven, labor-intensive, and sensitive to the often hidden factors that the target systems use for decision-making (including personalization of results, where relevant); but they are a place to begin, and their limitations are at least outlined. Numerous research questions must be explored before we can fully understand the strengths and limitations of what is proposed here. But it represents a way forward. This is essentially the first paper I am aware of which tries to effectively make progress on the stewardship challenges facing our society in the so-called “Age of Algorithms;” the paper concludes with some discussion of the failure to address these challenges to date, and the implications for the roles of archivists as opposed to other players in the broader enterprise of stewardship — that is, the capture of a record of the present and the transmission of this record, and the records bequeathed by the past, into the future. It may well be that we see the emergence of a new group of creators of documentation, perhaps predominantly social scientists and humanists, taking the front lines in dealing with the “Age of Algorithms,” with their materials then destined for our memory organizations to be cared for into the future.


The “Age of Algorithms”
Accountability, transparency, and documentation: Unease and discontents
Idealized documentation of systems in the Age of Algorithms
The impossibility of traditional comprehensive archiving
Documenting instead of archiving: Pragmatic approaches
A few initial thoughts on curation, population selection, and interpretation
Concluding comments




It has now been widely observed that we are living in the “Age of Algorithms,” as the popular press would have it. Many articles and reports seek to characterize, analyze, celebrate and critique these developments (Rainie and Anderson, 2017; Henke, et al., 2016; U.S. Executive Office of the President, 2016; Royal Society, 2017; Shneiderman, 2016; ACM U.S. Public Policy Council and ACM Europe Policy Committee, 2017). There’s both excitement and concern about what such developments mean. Various parties are calling for increased “transparency” and “accountability,” two over-used and poorly defined terms that many find somehow hard to argue against. Yet, there has been little consideration of how we actually document (contemporaneously or retrospectively) this transfigured new world for today and for the future; this challenge is related to, but distinctly different from, meeting the popular demands for accountability and transparency, and I will try to clarify the distinctions here. They are different problems. By “document” here, I mean to capture a comprehensive record, or at least a good approximation, of the present reality that can be consulted today and brought forward into the future.

Further, while some relatively “high-stakes” algorithms certainly justify the legal concerns emerging from regulatory agencies, researchers and social justice advocates, there are many other algorithms that help shape our more mundane day-to-day interactions with the digital world that do not rise to such levels. Nonetheless, we certainly want to capture the experiences these systems are creating both for interested people today and for future generations seeking to understand our time and our cultures. This paper tries to address the imperatives for stewardship (at least as I view them, and I recognize I take an expansive view) of this growing segment of our culture, and how we might seek to meet these imperatives, as well as the compromises we will have to make. While a few details are specific to digital archivists and stewards of the cultural record more broadly, I hope that the core arguments are widely accessible and relevant to interested readers across all disciplines and backgrounds. Some of the material is specific to the U.S. marketplace and legal environment, but the overall ideas should be generally applicable. Note also that I have mainly favored citations readily accessible to the interested reader rather than delving deeply into the (vast) technical literature in particular application areas.

This paper is intended to be exploratory and pragmatic rather than comprehensive or definitive; we desperately need good ideas and good scholarship in this endeavor. By pragmatic, I mean that it focuses on actions that could actually be taken today to begin to address the challenge, as opposed to theoretical analyses, or approaches predicated on unlikely (in my view) regulatory, policy, legal or other prerequisites or technical miracles. It begins by trying briefly to characterize and illustrate what various people mean by the shorthand “Age of Algorithms.” I then discuss some of the specific concerns and challenges to documenting or preserving this environment, and how these relate to, but are quite distinct from, the perspectives and agendas of the advocates of algorithmic transparency and accountability. After a short discussion of the impossibility of any sort of totally definitive or comprehensive archival solution in a very large number of cases, I next describe some specific high-level methods to pragmatically approach the documentation challenge. Genuine, practical documentation implies approaches very different from those that are typical in what might be thought of as “artifact-based” archival practices. Making these new approaches work would require novel collaborations between archivists and curators, and disciplines such as advertising, audience research, demographics, statistics, and sociology; success may also demand a broad-based engagement with what we may term “citizen archivists” (similar to the growing engagement of the public in “citizen science”). My hope is that this paper may lay some groundwork for a much broader discussion about how genuinely practical these ideas actually are, and what additional or alternative approaches should be considered.

In terms of scope, note that there is a closely related, additional, important, discussion emerging about the melding of algorithms and robotics (for example, autonomous vehicles of various types, or drones) and questions about accountability and transparency, as well as fundamental assumptions, in these systems, particularly when they are intended to harm humans, or are forced by events to choose which humans to harm. I will leave such questions as out of scope; the issues here are very complex and raise many new considerations. My focus here is on systems that simply process data and produce informational outputs as opposed to animating robotic devices; though they may have real-world consequences, these consequences are not instantaneous, and perhaps not as dire.



The “Age of Algorithms”

Algorithms have been, of course, familiar to mathematicians for centuries, indeed millennia. Euclid devised an algorithm for computing the greatest common divisor of two numbers in classical times (though he did not use the term); my understanding is that the term derives from medieval Arabic [1]. It has gone in and out of usage over the centuries since. Algorithms were not grandiose things, but simply methods for solving specific problems.

In the twentieth century, predecessor disciplines growing out of mathematics towards computer science (such as the theory of computability) adopted the concept, both in very theoretical work (Turing, Gödel, et al., though to the best of my knowledge they did not use the term “algorithms,” speaking instead about questions like “halting problems” and “computability” that are in fact essential parts of what constitute algorithms) and in more applied bodies of work such as that of John von Neumann. The great computer scientist Donald Knuth was perhaps most central (though certainly not the first) in making both the concept and the actual term widely adopted in applied computer science in his magisterial work The art of computer programming (Knuth, 1997); the first edition of the first volume of this work, Fundamental algorithms, was published in 1968. The first chapter of this work contains a wonderful brief history and discussion of the definitions of algorithms and cannot be more highly recommended to interested readers. Again, in Knuth’s work, algorithms were characterized as fairly modest things: well-specified computational procedures that could be proved to complete in a finite number of steps with a correct result [2].
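As an illustration of an algorithm in this modest, classical sense, the sketch below implements the Euclidean procedure for the greatest common divisor of two non-negative integers. The function name euclid_gcd is chosen here for illustration and is not taken from Knuth’s text. Termination is easy to argue: the second value strictly decreases on every pass and can never fall below zero.

```python
def euclid_gcd(a: int, b: int) -> int:
    """Return the greatest common divisor of two non-negative integers."""
    if a < 0 or b < 0:
        raise ValueError("inputs must be non-negative")
    # Each pass replaces (a, b) with (b, a mod b). Because a mod b < b,
    # b strictly decreases, so the loop ends after finitely many steps.
    while b != 0:
        a, b = b, a % b
    return a
```

Calling euclid_gcd(48, 18), for example, works through the pairs (48, 18), (18, 12), (12, 6), (6, 0) and returns 6.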

During the second decade of the twenty-first century, legal scholars, public policy experts and regulators, and social scientists appropriated the term “algorithms” to mean something different, much broader and more amorphous [3]. The term became a shorthand for what previously social scientists sometimes had called “complex socio-technical systems” being implemented at a very broad level, most commonly by very large corporations or governments [4]. Perhaps confusingly, it also is frequently used to cover a huge number of decision-making algorithms that are part of various consumer-oriented services: for example, various kinds of personalization and recommendation engines are now well established and increasingly ubiquitous in consumer-oriented news, social media, and shopping sites. More recently, we are seeing a plethora of computational language translation services, speech transcription, facial recognition and photo classification and tagging systems, and similar developments. Usually, these systems actually embed and orchestrate a large number of algorithms (as computer scientists and programmers think of them) that are constantly and asynchronously changing and evolving.

The second part of this discussion is related to the now very common incorporation of machine learning (ML) systems, where a system is “trained” on a collection of data [5], and the system creates an algorithm, typically incomprehensible to humans and most commonly embodied in some kind of a neural network, that closely mimics the desired behavior on the training data, and which can be applied to new data. Such algorithms have had great successes in areas ranging from strategies in games like Go (Mozur, 2017), various kinds of advising or recommending that I’ll discuss shortly, translation, transcription, and image classification/recognition [6]. Sometimes they stand alone, but most commonly they are services or sub-components of larger and more complex systems [7]. Basically, these ML systems recognize patterns and use this ability to categorize inputs in various ways. These mechanisms are definitely part of (and often viewed as one of the more threatening aspects of) this “Age of Algorithms” and are no better than (and usually fairly accurately reflect) the biases and limitations that are implicit both in their training data and in the processes that lead to the outcomes that were used to label the training data. To explain this last point a little more, consider data that records the outcomes of an existing, well-established human-driven social system that discriminates (perhaps subtly) against some group based on gender, for example: perhaps decisions to extend credit, or on whom to hire from a group of applicants. A new computer-based system developed using ML on the outcomes of the extant system will be using training data reflecting the outcomes of human bias, and will likely correlate gender to success or failure implicitly. The machine learning system will learn this correlation, and reflect it in the classifier that it develops, hence essentially encoding the bias of past practice (Royal Society, 2017; Osoba, 2017; boyd, 2017). 
Such biases are not the fault of the ML system but of the data used to train it, or of the people who failed to understand the limitations of their training data.
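The way a learner can encode the bias of past practice can be sketched with a deliberately tiny stand-in for a real ML system. Everything here is hypothetical: the HISTORY records are invented for illustration, and the “model” is only a lookup table of majority outcomes, not a neural network. The point is simply that a procedure fitted to biased outcomes reproduces them, even for equally qualified applicants.

```python
from collections import defaultdict

# Hypothetical historical decisions: (gender, qualified, approved).
# All applicants are equally qualified, but past approval rates differ.
HISTORY = (
    [("f", True, True)] * 4 + [("f", True, False)] * 6
    + [("m", True, True)] * 8 + [("m", True, False)] * 2
)

def train(records):
    """Learn the majority past outcome for each (gender, qualified) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [approved, total]
    for gender, qualified, approved in records:
        counts[(gender, qualified)][0] += int(approved)
        counts[(gender, qualified)][1] += 1
    # Predict approval when at least half of the past cases were approved.
    return {key: (yes / total) >= 0.5 for key, (yes, total) in counts.items()}

model = train(HISTORY)
```

With this invented data, the learned table approves qualified applicants marked "m" and rejects qualified applicants marked "f", which mirrors the historical pattern rather than any difference in qualification.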

An additional problem is the way algorithmic outputs are actually used (or misused): while the system may say that there is a higher probability of one outcome than another, this “prediction” is not and must not be equated to destiny, though it’s all too easy to do so. Making important decisions entirely by algorithm is a profound abdication of responsibility; there needs to be a way to also bring human judgment, intuition and empathy to the situation. And, further, correlation is not causation. Overtraining and/or false correlation detection are also problems (Stephens-Davidowitz, 2017; O’Neil, 2017). I’ll also note that there are now many adaptive ML systems where the classifiers evolve in operational use, though I won’t consider the additional complications that these introduce explicitly here.

A third issue here is the high level of path dependency (or dependency on history) in many of the systems that consumers interact with regularly. This dependency means that “algorithms” cannot stand alone: they operate in a very complex and extensive (and often proprietary, unrecordable or even un-reproducible and unknowable) context. If you think about how Facebook decides what to show a user, the choices depend on not only the history of the user’s interactions with Facebook in various dimensions (frequency, duration, what he or she did or didn’t like, what Facebook thinks it knows about that user [Nield, 2017]), but also the histories of many other users of Facebook, particularly those who are connected by “friendship” relationships to our user in question. If you change any of these inputs, the “feed” (including the advertising stream) offered to our user will be different. While the “feed” may be algorithmically generated [8], in a very real sense it’s not meaningfully reproducible. Recommender systems have somewhat similar processes; think about the way Amazon suggests books, considering your own previously expressed preferences (both through purchasing and browsing, and ratings if you’ve offered them), in addition to the ever-evolving histories of people who share common purchases with you. Indeed, construction of “feeds” can be thought of as an application and extension of these recommender systems (which pre-date the emergence of social media platforms) to new environments; in fact, early work on recommender systems used the terminology “collaborative filtering” prior to their commercial adoptions by corporations like Amazon, underscoring the reliance on community in addition to individual history. Then think about the extent to which travel sites that offer various flight and hotel options may incorporate similar algorithms both in the ranking of their suggestions and in the pricing [9].
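The path dependency described above can be made concrete with a toy collaborative filter. This is a minimal sketch under stated assumptions: the user names, item names, and the simple overlap-voting rule are all hypothetical and are not a description of how Amazon, Facebook, or any real service works.

```python
from collections import Counter

def recommend(target, histories):
    """Rank items the target has not seen by how many overlapping users have them."""
    seen = histories[target]
    votes = Counter()
    for user, items in histories.items():
        if user == target or not (items & seen):
            continue  # only users who share at least one item are counted
        votes.update(items - seen)
    # Sort by vote count (descending), then by name, for a deterministic order.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked]

histories = {
    "ann": {"book_a"},
    "bob": {"book_a", "book_b"},
    "cy": {"book_a", "book_c"},
}
before = recommend("ann", histories)

# Other users' histories change; ann herself does nothing.
histories["cy"] = histories["cy"] | {"book_d"}
histories["dee"] = {"book_a", "book_c"}
after = recommend("ann", histories)
```

In this sketch the list offered to the same user changes from book_b, book_c to book_c, book_b, book_d purely because of what other users did, so reproducing her recommendations at a given moment would require the histories of everyone else at that moment as well.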

Another way of thinking about how the world has changed is to consider the dichotomy between content and context. Had you asked me a decade ago about how we might preserve newspapers (a term that seems so old-fashioned now) in the digital age, I would have said that we need to have stewardship institutions ingest the content databases, including the revision histories of the content. Today, depending on “who” (individually and demographically) a news Web site thinks you are, and your history with the site, it will customize a set of featured articles; this is the end of a long, technologically enabled evolution that starts with early/late and regional editions. So context of the interaction is now critical. It often doesn’t matter much if some bit of content is in the database, but only a minuscule fraction of the site’s visitors see it, and perhaps only if they specifically search for it. It’s the highly visible materials that have the most social impact.

Here are a few popularly discussed (and far from exhaustive) examples of ways in which the pervasive “Age of Algorithms” is now upon us:

An additional area that’s slightly different in character from the examples above has moved into prominence in recent months: understanding of the role of information warfare in the 2016 election and in other elections around the world in recent years. Here, it’s clear that we need to better understand both the vulnerabilities of existing social media platforms (i.e., Facebook, Twitter, etc.) and the roles and functions of algorithmically driven aggressors (botnets) as well as networks of human actors in delivering messages to users of these platforms [11].

It’s important to recognize that many of these algorithmic outcomes do not rise to the level of regulatory or policy scrutiny (at least at the U.S. federal or, occasionally, state levels), or they do so only in very specific, blatant or extreme cases [12]. Some of the examples above are clearly very high stakes situations; others are not. Poor or skewed movie recommendations from Netflix, a bad book suggestion from Amazon, or an annoying entry in a Facebook feed probably aren’t very important when compared to systematic gender or racial bias in recommended trial sentences, loan approvals, or medical diagnoses, though we have learned that huge numbers of individually low-stakes decisions can add up to societally very serious results, as the 2016 U.S. presidential election has made clear [13]. Various forms of price discrimination, frequently driven by ever-more-detailed information about individual consumers, are now regarded as effective and laudable business practice (Shapiro and Varian, 1999; Useem, 2017) if they are conducted appropriately [14]. Nevertheless, understanding and documenting such personalized and targeted system behaviors might be of real interest to consumers and consumer advocacy groups in the present, and certainly to scholars in the future (even if there’s nothing legally very questionable about them) for the insight that they give into behaviors of the population and the way opinion is shaped and choices are made.



Accountability, transparency, and documentation: Unease and discontents

Accountability speaks to understanding the effects of algorithmically defined recommendations or determinations, and the ways in which they relate to other societal issues, such as economic (class), gender, or racial discrimination. Government agencies such as the U.S. Federal Trade Commission (2016) have been very interested in such issues in various contexts: among the examples enumerated earlier this includes the widely used judicial sentencing recommenders, credit recommenders for mortgages or similar loans, and college admission screeners. At a very pragmatic level, for an important class of systems, you don’t need to know much about the algorithms themselves to examine outcomes and pursue accountability for these outcomes: it’s sufficient to simply capture the most basic inputs and the outputs of the system, and then re-correlate (perhaps first de-anonymizing) the cases by re-associating demographic information (gender, race, class, etc.) with the subjects of the algorithm if necessary, and then statistically analyze outcomes to test for biases of concern. The contribution of such accountability analysis towards actually documenting the operation of these algorithms is quite limited, however, as is the insight into why the algorithms are actually doing what they are doing. Such analytic work is very much within the domain of the regulator or social justice advocate, not the archivist. In particular, those concerned with accountability almost explicitly don’t care about the totality of the specific inputs to the “algorithm” (which may be anything but clear) or the sensitivity to various combinations of inputs, but only with the algorithmic outcomes and how these may harm specific groups of interest. The concern is not much with the why or even the how but with the actual operational result. Bad results damn themselves. In contrast, from the stewardship perspective things are what they are; even if we cannot explain them, it’s important to document them.
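The outcome-only style of accountability analysis described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the case records and group labels are invented, the function names are hypothetical, and a two-proportion z-statistic is used here only as one common choice of test, not as the method any particular regulator applies.

```python
import math
from collections import defaultdict

def tally_outcomes(cases):
    """cases: iterable of (group, favorable) pairs -> {group: (favorable, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for group, favorable in cases:
        counts[group][0] += int(favorable)
        counts[group][1] += 1
    return {group: (fav, total) for group, (fav, total) in counts.items()}

def two_proportion_z(fav_a, n_a, fav_b, n_b):
    """z-statistic for the difference between two favorable-outcome rates."""
    pooled = (fav_a + fav_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (fav_a / n_a - fav_b / n_b) / std_err

# Hypothetical captured outputs, re-associated with a demographic attribute.
cases = (
    [("group_x", True)] * 30 + [("group_x", False)] * 70
    + [("group_y", True)] * 60 + [("group_y", False)] * 40
)
counts = tally_outcomes(cases)
z = two_proportion_z(*counts["group_x"], *counts["group_y"])
```

Nothing in this calculation looks inside the system that produced the decisions. It needs only the outputs and the re-associated demographic attribute, which is why it supports accountability while contributing little to documenting how or why the system behaves as it does.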

Note also that there are very large classes of systems and outcomes that do not lend themselves to this kind of accountability analysis but that may be of great interest, at least from the broader stewardship perspective; methods of ranking results in response to queries or creating news feeds would fit broadly in this category. A simple capture of the outputs and at least some of the inputs is really difficult in these cases, and it’s clear that path dependency means we cannot capture many of the key inputs in a sensible way.

Transparency speaks to understanding and documenting why the “algorithm” is doing what it is doing; this could be procedural or intentional at a higher level. There’s one line of argument that claims that simply opening up the code would be sufficient to address this question, though it’s hard to believe that most of the commercial algorithm-driven systems would do this on a continuous and ongoing basis (since their code is being constantly tweaked), and the sheer scale and complexity of these systems is also an issue. And the code is of course proprietary.

More and more large-scale, algorithmically driven systems incorporate sub-parts that are essentially the trained outcomes of machine learning algorithms. These are generally incomprehensible to humans. Though this problem is a target of considerable ongoing research interest [Gunning, n.d.; Knight, 2017; Lei, et al., 2016; Snow, 2017, to cite a few examples], the current reality is that the output of a machine learning process (an iterated deep learning neural network, for example) captures recognition and classification processes that are largely incomprehensible from the perspective of human analysis. The research work towards explicability is mainly focused on relatively simple cases, like image classifiers: What areas of the image, what pixels, are most important in classification decisions? This is a long way from understanding more abstract, higher-level pattern-recognizers and classifiers. There’s also interesting related work (Vincent, 2017) underway trying to understand how to fool such classifiers by understanding the limits of the training data, or to try to understand how brittle the classifier is based on the limitations of the scope of the training data, and indeed how to design inputs that will deceive a given classifier. Both of these areas of research provide additional insights into how these classifiers actually operate.

It’s interesting to note that the older (1980s), now out of fashion, rule-based “expert system” type artificial intelligence applications are much more hospitable to presenting comprehensible reasoning chains for inspection, though to date these have proven less capable, less flexible, more costly, and much slower to build than today’s machine learning breakthroughs (McCorduck, 2004).

For the present, the key materials to capture in the name of meaningful transparency probably would be the training data and, to a slightly lesser extent, the algorithms (or source code) that convert this training data to a recognizer, classifier, or similar tool. Note that if the classifier is embedded in a broader computational process that may modify or over-ride its outcomes under some circumstances, the situation becomes even more complicated and unclear [15]. As already discussed, bias in the training data (that is perhaps simply capturing and recording long-standing biases deeply embedded in social processes) will usually result in similarly biased algorithms, which will learn their bias from their training data, but understanding and analyzing these effects are often much clearer from the training data. Collections of algorithmic outputs contain and constitute evidence of their own failures and biases, but these won’t be obvious through purely algorithmic transparency.

Additionally, because of the path dependency properties discussed earlier, there’s an important class of systems where it’s almost impossible to really understand the impact of even the most transparent algorithms without also having the historical and contextual data upon which the algorithm operates. These systems aren’t yet autonomous, but they begin to feel more and more like they are, and incorporation of technologies like adaptive machine learning will take us farther along this path.

Finally, note that there are also non-technical dimensions to transparency and accountability: intent of the system designers is important (for example deliberate decisions to have the system discriminate in various ways); and disclosure is a significant part of transparency (for example, if you are selling highly targeted advertising to fronts for foreign intelligence services, maybe it should be revealed publicly in some fashion). Again, this is in contrast to the concerns of stewardship.

Accountability and transparency issues are now well established as an academic area of inquiry. There are now many conferences and substantial research projects that are addressing accountability issues [16].



Idealized documentation of systems in the Age of Algorithms

Actually documenting the “Age of Algorithms” ideally involves capturing and preserving the answers to two questions. The first is to record, given each specific set of inputs (which may include identity, history and context) the actual outputs of the algorithm at a given point in time. The second, and even more difficult but much more comprehensive question is to be able to capture the answer to the subjunctive form of the first question: given a hypothetical set of inputs, what would the algorithm’s outputs have been at a given point in time? This must take into account the history of the system and all its users to that time, and also a snapshot of the computational algorithms at that point in time. I’ll expand on these objectives briefly before discussing why they are often fundamentally impossible in the next section.

The first challenge is observational at some level, though, as already discussed, the full identification of the inputs used may be extremely problematic; they may be secret or proprietary, and at least partially internal to the system rather than externally supplied. They may also include path dependent factors (history). For many systems, the outputs are far from clear and concise: consider a news feed that evolves based on continual interaction with the system’s user. This is not transactional; the “algorithm” (in the computer science sense) is embedded in something much more complex and executed iteratively or repetitively over time. When the system is sensitive to the previous history of interactions, and the interactions of other users, matters become very problematic.

The second challenge is extraordinarily difficult for all the reasons already discussed, and meeting it amounts to creating or obtaining a replica of the actual system and its database at a given moment, and being able to run that snapshotted copy under controlled circumstances. Because of the huge difficulty (often practical impossibility) of doing this, we see many accountability efforts simply capturing outputs at scale over time, and moving from the properties of those outputs to hypotheses about the predicted behavior of the system in other cases, as already described.

Recognize the fundamental intractability of the problem here in the general case: in essence, one would like to capture a nearly infinite number of unique, individual, personalized performances (or, for the second challenge, possible rather than actual performances) from an algorithmic system at any given point in time. For many real-world systems, the sheer scale of this task (and the incomprehensibility of the results at this scale) makes the objective hopeless and utterly unrealistic. Here, the real stewardship goal, which I will explore in more detail shortly, must be to capture some meaningful sense of the system’s behavior and results for the present and the future. Comprehensiveness is quixotic.

For some specific systems, perhaps used much less frequently and with very high stakes (many of the governmental rather than commercial examples, such as judicial sentence recommendations), capturing all the outcomes and at least partial inputs for these outcomes is actually quite feasible, and indeed in some cases probably is, or should be, legally required as a matter of fundamental governmental record-keeping and accountability. Such comprehensive record keeping is easiest with systems that are fairly transactional and have little or no path dependency; these systems represent important and likely tractable special cases. Note that even this last modest proposal is not without controversy, as many of the algorithmic systems are made available to governments by secretive commercial corporations, who use a variety of contractual means to aggressively avoid having such records created or made public, arguing that this is necessary to protect their intellectual property, trade secrets, and competitive positions [17].

There is an excellent paper by Sandvig and his colleagues on methodologies for auditing “algorithms” (Sandvig, et al., 2014); parts of that work parallel much of my thinking, though with more colorful and memorable terminology. This work also provides a helpful point of departure for thinking about approaches to genuinely documenting “algorithms,” as opposed to auditing them, and the way in which these two activities are importantly different. I will return to explore what Sandvig’s approaches can offer the enterprise of documentation later in this paper.



The impossibility of traditional comprehensive archiving

A thought experiment: Imagine that Facebook CEO Mark Zuckerberg suddenly recognizes and totally embraces the idea that stewardship of some version of at least the public portion of Facebook (whatever that may be) is profoundly important, and that the company needs to support and enable it. Facebook then offers a comprehensive record to one or more stewardship institutions — perhaps the Library of Congress, Harvard University, ... — and perhaps even some funding (one can always dream). How many petabytes, and how many square miles of data centers are necessary to support and provide any form of meaningful access to this data? The Library of Congress experience with the Twitter archive is an important cautionary tale here [18]. Even if the lawyers were to permit transfer of such a gift (improbable due to privacy and liability concerns, both on the contributing and receiving sides of the transaction), this arrangement will not work.

Even if an institution could accept this offer, it would not be good enough. The underlying database has less and less to do with what various sectors of the public actually see when they interact with the sites. A decade ago most people concerned with digital preservation would have argued that the database was sufficient. I no longer believe this is true.

First off, the computational infrastructure to provide meaningful access to the data for scholars (and indeed members of the broader public) who wish to use it is beyond the economic reach of most stewardship organizations; provisioning such resources has been a major barrier to the Library of Congress making the Twitter archive available, over and beyond the legal hurdles. They simply cannot fund this computational infrastructure, even if they can pay for the storage. Note that there are some very interesting conversations emerging about making data available in various public infrastructure frameworks (Google, Amazon Web Services, Microsoft Azure) whereby stewards can finance data availability but the computational resources needed to actually exploit the data need to be paid for by users and create a profit center. The data storage serves as an armature for the computation-consuming user community they hope to attract.

The traditional models of digital archiving are twofold: format migration and emulation. Both, of course, assume a substrate of bit-level preservation by migration from one storage technology to the next as necessary [19]; this substrate is now relatively well understood and, assuming reasonably consistent and continuous funding, implemented with a fairly high degree of confidence. The first approach, format migration, is best suited to “document-like” objects: PDFs, Microsoft Word files, audio, video, XML, JPEG, TIFF, etc. Here the idea is that, as standards, or de facto standards, gradually evolve and the ecosystem for dealing with those types of files shifts, curators will migrate the file formats; but this strategy is not necessarily as simple as it seems. New file formats are often not isomorphic to older ones. Formats may be proprietary and/or undocumented, and even objects claiming to conform to well-known standards may not implement these standards correctly or may add proprietary extensions.

The second approach, emulation, goes back to the seminal work of Jeff Rothenberg in the 1990s (Rothenberg, 1995) but has only become genuinely practical, at least at a technical level, in the past few years (Rosenthal, 2015) with recent developments in emulation, virtualization, containerization and the like, though there are still very complex legal problems. The basic idea here is to emulate, in software running on current hardware, the hardware of the machine on which a piece of code (which perhaps rendered content, or melded content with interaction in complex ways) originally ran; if the bits representing the original software can be preserved and brought into the present, they can be re-executed today in this simulation environment. There is a myriad of technical details and complexities that I won’t delve into here (they are well covered in Rosenthal’s analysis); additionally, one needs to bring the entire software environment, including the original operating system, into the present, which is legally problematic. This approach is best suited to interactive objects, such as video games or specialized rendering or analysis software, and particularly those that run on a single machine with little or no interaction with the rest of the Internet; closely related techniques are gaining considerable uptake as a way of facilitating reproducibility of scientific data analyses. But the underlying idea is still to import software and data (along with the broader execution environment) into a stewardship organization and to subsequently save and curate this archived copy of the system as a set of digital objects to be preserved; it’s focused on the ingestion of materials into the controlled, archival environment and the subsequent maintenance of the ingested materials themselves. The additional component is a set of hardware emulators.

Traditional archiving of many large-scale social systems, or “algorithms” in the current popular parlance, won’t happen through either of these approaches, for many reasons. The software is proprietary and the owners won’t give it to you. The data that accompanies the software is even more proprietary and the owners or system operators won’t share it. And you cannot obtain access to a computational platform or storage capacity of the scale necessary even if all the other conditions were satisfied.



Documenting instead of archiving: Pragmatic approaches

If we abandon the idea of archiving in the traditional sense of preserving an artifact, it’s helpful to recall the stewardship goal to guide us: to capture the multiplicity of ways in which a given system behaves over the range of actual or potential users. We face several serious problems here, which I will return to: Who are these “users” (and how many of them are there)? How do we characterize them, and how do we characterize system behavior?

I should stress that while these challenges are largely alien to the archivists and digital preservationists that have tended to dominate much of the stewardship discussion in the digital age, they are deeply rooted in historical methods of anthropology, sociology, political science, ethnography and related humanistic and social science disciplines that seek to document behaviors that are essentially not captured in artifacts, and indeed to create such documentary artifacts [20].

There are a few exploratory examples that should be mentioned, which share some characteristics of both of the approaches we will explore here. One is the Wall Street Journal’s “Red Facebook/Blue Facebook”; another is “Burst Your Bubble” from the Guardian. Other related tools have been released to allow individuals to exit their filter bubbles, e.g., MIT’s Twitter feed swap tool, FlipFeed.

Let me suggest that there are two basic approaches, though they both share a common theme of managed observation of interactions with the system being documented:

1. “Robotic witnesses”

This is what Sandvig, et al. colorfully term “sock puppets” in the auditing context. The basic idea is to create a (probably large) population of software robots (agents) which the system in question believes are actual human users, and to which it hence assigns various attributes and demographics as a profile. The robots then offer continuing interactions consistent with those attributes and demographics, and capture the results of these streams of interaction.

Recognize how hard this is, and how curatorially intensive and subjective it must be in many situations. We are dealing with interaction streams rather than just executing independent pre-established queries in most cases. There’s a first-line question of bot detection by the target system (the system that is to be documented); a lot of work has been done on this [21]. There’s the challenge of gaining initial acceptance, and understanding the “profile” or attributes assigned to the robotic user by the system you’re trying to document: What if Facebook is trying to import user data from Acxiom about this new imaginary “person”? Can curators create an Acxiom profile for each software robot? This moves very rapidly to the research frontiers of establishing bot populations on social networks, for example [22].

There is an extensive literature on “synthetic populations” in the context of social simulations [23], but I have never seen any analysis of how these can connect up to the surveillance economy. I fear there is substantial knowledge about this, but it’s not clear that it’s in the research, as opposed to the hacker/intelligence/information operations communities that regularly marshal bot armies, or even in the commercial sector, where corporations in some areas regularly probe their competition (Dastin, 2017). Once the robot is successfully accepted by the target system, what should be the trajectory of the interactions that continue to provide insight that is consistent with the initial profile? Can these be successfully algorithmically scripted? Taken to its logical conclusion, it calls for establishing sizeable populations of synthetic identities across the commercial surveillance ecosystem (and perhaps requiring the active collaboration of some components of this ecosystem in order to deceive others, which is perhaps more easily done in the name of national security than stewardship, if it can be accomplished at all) and building robots to continually both represent and advance these fake identities, all the while gathering documentation on salient interactions [24].

It’s also important to note that the use of such software robots is strictly prohibited in the terms of use of many systems, and there is a growing focus on blocking them, partly in response to concerns about “fake news,” partly due to competitive pricing queries and the gathering of market intelligence across shopping services, etc. Sandvig, et al. also argue that it may be in violation of the federal Computer Fraud and Abuse Act (CFAA), which is extremely broad and has actually been used to mete out draconian sentences in some seemingly crazy situations (Wu, 2013; Williams, 2016; also recent and very relevant, Cushing, 2017). And I have no idea whether such an approach might be acceptable to a typical institutional review board (IRB) in a university research setting [25].

2. “New Nielsen families”

Older readers will remember the Nielsen Families who provided the raw data for establishing television show ratings starting in 1950 [26]: they were the anonymous mystery families across America who defined and documented the national experience in TV viewing, starting in an era of very limited audience segmentation [27]. So this is not a new idea; in fact it really dates back to the age of radio, before television. I will have more to say about what we might learn from the history of the Nielsen families in the next section.

Sandvig, et al. call this a “crowd-sourced audit” in the accountability context, though the genuine difficulties are poorly explored in their paper [28]. The fundamental idea is that rather than creating synthetic robot users, you recruit actual people and then look over their shoulders, recording their interactions. “Recruit,” as opposed to simply “invite,” is an important distinction here. But which users and how many of them need to be recruited? Can you identify, and then recruit, the “right” witnesses — an appropriately diverse collection of users of whatever system you are trying to document? How long can you sustain their engagement in the effort, and how do you renew the cadre as necessary over time? How the range of attribute clusters that we wish to document and those population members willing to serve as witnesses correlate with each other is a complex question.

There are also real (though less difficult) challenges in making such a witnessing cadre work technically. Humans are less fragile than software-based robots. Such a cadre can adapt much better to seriously interactive systems, for example; but how much freedom should they be given in their continuing interactions with the system? Is it correct to assume that witnesses are authentically “representing” the profiles for which they were initially recruited into the program, and thus that they can be left to their own personal reactions? And curators will need suitable tools to follow and record the witnesses’ interactions with the system to be documented.

The problems with a system’s terms of service are very likely not a major issue here (at least until they are amended to prevent this kind of use, if indeed they can be) as the witnesses are legitimate human users and not software robots, but again Sandvig, et al. raise the specter of the CFAA.

While the general approaches here are clear, and represent a huge departure from traditional thinking in digital archiving, the details are very, very complex and uncertain. Beyond the fundamental cost and curatorial subjectivity problems already mentioned, the next section will delve into some of the hard, underpinning theoretical and technical problems.



A few initial thoughts on curation, population selection, and interpretation

There are some very deep and complex problems that remain the same whether we think about the recruitment of a population of human witnesses or the construction of a population of robotic ones. How many such “witnesses” do we need? How do we describe these witnesses and their characteristics, and how do we create or recruit the right ones? How do we detect and correct for mistakes?

A key first step is to move away from thinking about characterizing two opposing camps, the sort of low-hanging fruit that the Wall Street Journal explored with its work in 2016 on “Red Feed, Blue Feed” [29].

To gain insight into these questions, I think it’s helpful to review some of the history of audience segmentation and capabilities for targeting advertising into those segments, and how these lead to the ways in which social media and personalization-based platforms might segment and offer access to specific groups of people. This will help to suggest the number of groups we might need to be concerned with, and how we might identify them. In today’s world, there’s a clear tension between population clusters and individual attributes selected by advertisers, which might result in extremely tiny, targeted audiences. It’s important to note that, particularly in regard to social media, individual preferences and choices and social networks are important factors in what is shown to users (Bakshy, et al., 2015). The attributes that social media platforms assign to their users, and consequently highly targeted advertising (e.g., Facebook’s custom audience and dark advertising), also play important roles. These attributes enhance the precision of highly targeted advertising (which is often then shared via friend-to-friend activity) and particularly so-called “dark” advertising, which is only shown to users meeting very specific demographic parameters [30]. Zeynep Tufekci has done superb work in illuminating the feedback loops between targeted advertising and social networks, particularly in the context of Facebook [31].

Efforts to measure audiences and to characterize demographics of various populations are highly relevant here. The prior art is limited and much of the key data is proprietary; some of the methodology hasn’t been widely scrutinized by the scholarly community [32], which suggests how deep the gap is between commercial practice and more scholarly explorations of audience characterization and demographics. It’s also too easy to focus on radio and television, just because these worlds are documented, however poorly. For example, as far as I know, the Nielsen Corporation has almost never disclosed the number of Nielsen families that they have operated at a given point in time, although they have indicated that this number has been steadily expanding as a consequence of ever-increasing audience fragmentation and segmentation, which isn’t surprising. There seems to be fairly widespread belief that in the early years the number of participants was very small (though I have had little success quantifying this), but of course, in those early years of TV there were very few choices (mainly three networks or not watching TV, and one TV set per household) for advertisers, and few seemed interested in trying to target anything but the crudest demographics via TV (magazines, for example, offered much more extensive audience segmentation). TV was about reaching large, highly aggregated audiences, nationally or perhaps regionally [33].

It’s clear that there’s a trajectory from the 1950s to the 1990s that maps ever-greater audience segmentation to needs and opportunities to support mass media advertising purchases. This is magnificently documented in studies such as The attention merchants (Wu, 2016) and Breaking up America (Turow, 1997). Starting in the mid-1970s, the increasing number of cable TV channels allowed advertisers to purchase ever more specialized demographics. In the 1950s, there were essentially three categories: families, children, and housewives (really viewers who were home during various parts of the day). By the early 1990s, cable television was beginning to cluster audiences (in new, very detailed and precise ways), to which advertisers presumably would purchase access. Cable channels continued to multiply in the 2000s with the subsequent move to digital cable transmission technologies [34]. Additionally, very specific location rather than demographic attribute targeting is much less discussed, but very real, for modern cable TV systems [35]. But in the late 1990s and early 2000s, progress along this trajectory began to become less significant, or perhaps it is more accurate to say that an alternative universe and set of opportunities emerged to compete for advertisers. Individual and arbitrary combinations of audience attributes of all sorts can now be purchased on the various Web-advertising platforms [36]. The entire framework changed. You don’t buy a mass-market audience segment anymore, not even a very narrow and demographically specialized one, unless that’s what you actually want [37]. Alternatively, you can buy the attention of increasingly arbitrary groups of individuals characterized by complex combinations of their actual or inferred attributes (age, income level, gender, pet ownership, etc.), but also including things like “people who have been searching for Buick automobiles lately.”

More traditional audience merchants (aggregators) often follow an approach employing “clustering,” an idea that is the logical endpoint of trends that have been playing out since the end of World War II. The idea is to break up America’s population into a set of clusters of people of similar characteristics that roughly cover the population, but represent a very small subset of the possible combinatorial arrangements of attributes that are known about individuals (income, interests, age, gender, family status, sexual orientation, political affiliation, etc.) [38]. Content offerings are then developed that aggregate audiences primarily consisting of one, or a few, related popular clusters. Hopefully, many of these clusters represent groups of interest to targeted advertisers of various sorts. The tension now, with the rise of the Internet, is between offering individual attribute set purchases (which the Internet-based advertising platforms do well) as opposed to aggregated channels such as specialized cable TV programming offerings that have high coverage of a specific cluster of interest — which group should an advertiser purchase access to? Clearly, this is a complex mix of cost, capability and precision tradeoffs.

The original efforts to cluster the U.S. population described in The clustering of America (Weiss, 1988) led to about 40 buckets; Acxiom (a publicly traded company that is perhaps the largest collector of personal data in the U.S.) today sorts people into about 70 buckets (Acxiom Corporation, n.d.; DEMA, n.d.) but there are closer to 1,500 distinct attributes in play in Acxiom’s databases (that some platforms may let you purchase in specific combinations). Most of these attributes aren’t binary, so the number of different combinations of attributes comprising the potential audience space is incomprehensibly and intractably vast. There are, no doubt, numerous combinations of attributes that literally do not apply to a single person in the country, and many more combinations will be extremely rare. As another data point, Experian’s Mosaic USA clustering system (Experian, 2016) uses 71 categories.
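
A minimal sketch of this arithmetic may help convey the scale. It assumes, purely for illustration, that each of the roughly 1,500 attributes takes only two values (a deliberate underestimate, since most are not binary) and that the U.S. population is on the order of 330 million; the function names and the population figure are assumptions introduced here, not data from Acxiom.

```python
import math

N_ATTRIBUTES = 1_500          # approximate attribute count cited in the text
US_POPULATION = 330_000_000   # assumption: rough order of magnitude only


def binary_combinations(n_attributes: int) -> int:
    """Lower bound on distinct profiles if every attribute had just two values."""
    return 2 ** n_attributes


def digits(n: int) -> int:
    """Number of decimal digits in a positive integer."""
    return len(str(n))


def max_occupied_fraction_log10(n_attributes: int, population: int) -> float:
    """log10 of the largest share of combinations that real people could occupy."""
    return math.log10(population) - n_attributes * math.log10(2)


combos = binary_combinations(N_ATTRIBUTES)
print("decimal digits in the lower bound:", digits(combos))
print("log10 of the occupied share, at most:",
      round(max_occupied_fraction_log10(N_ATTRIBUTES, US_POPULATION)))
```

Even under the binary simplification, the lower bound has hundreds of decimal digits, so almost every combination of attributes necessarily describes no one at all.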

Be mindful also that adaptive behavior is not only about segmenting the world to make money from advertisers of various sorts, though this has certainly been the main historical driver. And political campaigns, which have recently become much more important and sophisticated consumers of audience, are a growing but specialized subset of the advertising world, which segments the population in very different ways than most consumer-oriented advertisers. In the “Age of Algorithms,” different systems may have many and diverse goals in their segmentation, classification and other decision-making, and profound variation in their economic models. For some algorithms, the goal may be as simple as causing users (or some group of users) to spend more time per week engaging with the system. Recognize also that this will never capture the very fine details of proprietary algorithms that generate feeds or recommendations; at best it will be a very gross approximation. Most likely over time, the best personalization will be based primarily on history data (which is proprietary to the service that captures it) rather than demographic, though external witness generation will be largely driven by demographics rather than interaction history.

For example, as we try to understand Facebook’s role in disseminating “fake news” (propaganda and disinformation) to various targeted populations, and its vulnerability to targeted information operations [39] and the implications for recent elections around the world, we need much more than the traditional data collected in the name of “accountability.”

Scholars may well be interested in audiences other than those constructed and aggregated in the hopes of selling their attention to advertisers. Documenting the “Age of Algorithms” via witnesses will demand measures other than populating witness sets by reverse engineering the work of media and advertisers in segmenting attention in America into lucrative clusters.

Some very simple arithmetic suggests that even before we consider path dependencies, the space of possible values for the varied and numerous attributes that many social-scale systems know is such that probing or documenting the entire attribute space is totally intractable (Rosenthal, 2017). So the documentary challenge is basically reverse engineering, or otherwise surveying, the attribute clusters strongly distinguished for the system in question, and then ensuring that sufficient witnesses monitor the major clusters differentiated by the platform. Perhaps if a system cannot effectively cluster the audience at some level, it cannot successfully sell enough attention: this is a shaky hypothesis but it needs to be explored. Hopefully, it can be done in a way that is both technically and economically feasible. Actually carrying out and validating this work is an incredibly complex, multi-disciplinary research problem that, as far as I know, has been little addressed, but it will become a central issue in the coming years. Even once a documenter gets the clustering right, it needs to be revisited periodically, as the solution will shift over time as the system evolves; how often is this going to be necessary?

Such an effort also raises further questions about intent and illuminates the differences between auditability and documentation. In most of the work of auditability, one partitions the overall society (and hence the inputs to the algorithmic system) into a small number of clusters based on questions of interest (income level, or race, or gender, or zip code of residency, or whatever the case may be) and then examines the extent to which the system behaves differently on various clusters within this scheme. In other words, most accountability work begins by establishing user clusters independent of system behavior and then measures the system against those clusters. In documentation, the curator is trying to balance a number of objectives: to capture the largest clusters of “similar” responses (that is, defined by the system) and some information about what user attributes differentiate users from one cluster to another; to try to have good coverage across the overall user population (and be able to segment this in ways that may be useful to future researchers); to keep the number of witness clusters affordable; and, finally, perhaps, to also sample a few relatively unusual outlier behaviors. To a great extent, we need to let the system speak for itself and define the clusters, rather than rely on our pre-conceived ideas about clustering [40].
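
As a hedged illustration of letting the system define the clusters, the sketch below groups witnesses by the similarity of what each was shown rather than by attributes chosen in advance. The witness identifiers, item identifiers, similarity measure (Jaccard overlap), threshold, and greedy single-pass grouping are all assumptions made for this example; a real effort would need far more careful methods and validation.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of observed items (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_response(observations: Dict[str, Set[str]],
                        threshold: float = 0.5) -> List[List[str]]:
    """Greedy single-pass grouping: a witness joins the first cluster whose
    founding member saw a sufficiently similar set of items."""
    clusters: List[List[str]] = []
    for witness in sorted(observations):
        for cluster in clusters:
            founder = cluster[0]
            if jaccard(observations[witness], observations[founder]) >= threshold:
                cluster.append(witness)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([witness])
    return clusters


# Hypothetical witnesses and item identifiers, invented for illustration.
observed = {
    "w1": {"a", "b", "c"},
    "w2": {"a", "b", "d"},
    "w3": {"x", "y", "z"},
    "w4": {"x", "y"},
}
clusters = cluster_by_response(observed)
print(clusters)
```

Once response-defined groups exist, a curator could then ask which user attributes distinguish one group from another, which reverses the order of analysis used in most accountability work.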

On a more technical level, either approach depends to a great extent on understanding exactly how the initial user profiles for the system to be documented are provisioned and what data is used to provision them (e.g., external demographic or financial data), and how users are subsequently re-identified when they interact with the system (which may be as simple and clear as a login or a cookie stored in the user’s browser, or something much more subtle and complex).

There are also unexplored questions about how to record and store (and possibly reduce or abstract) the captured interactions (and witness characterizations), and much more complex challenges in devising good analytic methodologies and tools to explore various kinds of research questions utilizing the captured record, though these latter challenges are largely beyond the realm of stewardship. But, as these methodologies become well validated and accepted, they may inform archival practice going forward. For example, they might suggest that larger populations of witnesses are needed or smaller populations will suffice; they might also suggest that the range of population characteristics needs to be altered in various ways. Note that Sandvig, et al. do not discuss these issues in the context of auditing and accountability; I assume that this is because the auditing is being done to answer one or more reasonably well-defined questions as opposed to obtaining an open-ended documentary record suitable for multi-purpose reuse.
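
To make the recording question slightly more concrete, here is one possible shape for a single captured observation, serialized as a line of JSON so that it can be stored and re-read later. Every field name and value below is a hypothetical choice made for illustration; the paper does not prescribe a schema, and deciding what to keep, reduce, or abstract remains an open curatorial question.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class WitnessObservation:
    """One captured interaction between a witness and the documented system."""
    witness_id: str
    witness_kind: str                      # "robot" or "human"
    system: str                            # name of the system being documented
    captured_at: str                       # ISO 8601 timestamp, UTC
    declared_attributes: Dict[str, str] = field(default_factory=dict)
    action: str = ""                       # what the witness did
    items_shown: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to a single JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical record, invented for illustration.
record = WitnessObservation(
    witness_id="w-0001",
    witness_kind="human",
    system="example-social-platform",
    captured_at="2017-08-25T18:00:00Z",
    declared_attributes={"age_band": "35-44", "region": "midwest"},
    action="opened home feed",
    items_shown=["post-17", "ad-204"],
)
line = record.to_json()
restored = WitnessObservation(**json.loads(line))
print(line)
```

Keeping the witness characterization alongside each observation is one way to let future researchers re-segment the record along lines the original curators did not anticipate.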

We need to be skeptical and humble about the value and relevance of the data that we collect; this has not been tested against the needs of future scholars. A continual and critical dialog about the relevance of the data being collected is going to be absolutely essential. Over time, empirical validation will be essential. Cathy Marshall (and I am deeply grateful for this point) has underscored the impossibility of avoiding assumptions about the nature of future use, which may or may not be valid, and about the potential contradictions between the needs of future researchers and what we perceive as the inherent characteristics of the “media” which need to be preserved [41].



Concluding comments

From a stewardship point of view (seeking to preserve a reasonably accurate sense of the present for the future, as I would define it), there’s a largely unaddressed crisis developing as the archival paradigms that have, up to now, dominated stewardship in the digital world become increasingly inadequate. From the point of view of today’s society, there’s a related but somewhat different crisis about how we understand what the “Age of Algorithms” is doing to us, as consumers, as citizens, as a society, and how, when, and why we need to hold the systems in this world accountable. Those concerned primarily with the latter goals do not have perfect solutions by any means, but at least they are making pragmatic attempts (sometimes challenged or thwarted by legal or commercial mechanisms and constraints and perhaps regulatory failures) to address the situation. Until now, the stewards have been ineffective.

Honestly, the archival world, too often largely equated to the broader enterprise of stewardship in the digital world, has been mostly in denial; this new world is strange and inhospitable to most traditional archival practice. A few leading and unconventional thinkers, such as Brewster Kahle at the Internet Archive and David Rosenthal at Stanford University, have offered detailed and compelling insights into how the nature of the Web, for example, has been changing over the past decade and the growing limitations of relying on the consensus conceptual models of taking archival copies of more or less static (at a given time) Web pages. On a more operational level, those involved in crawling the Web at scale, including experts at a number of research universities and national libraries, are trying to cope with these shifts every day, without good solutions [42]. Further, the focus so far has been confined narrowly to the domain of Web page archiving, but the problem is much broader and deeper. While admittedly four years old now, the 2013 report on Web archiving from the Digital Preservation Coalition (DPC) (Pennock, 2013) hardly mentions the issues involved in personalization and dynamic content; the more recent report on preserving social media (Thompson, 2016) is mostly an analysis of why this is impossible for stewardship organizations, even national libraries, and a discussion about how individuals might archive their interactions with social media like Twitter and Facebook [43]. In the literature broadly, Twitter seems to be the major case study, perhaps because it has been at least somewhat hospitable to researchers who want to study public Twitter streams in that it offers, through various mechanisms, access to archives and real-time samples of these streams [44]; I don’t know of any other social media platforms that offer similar access.
This situation must change, and the existing models and conceptual frameworks of preserving some kind of “canonical” digital artifacts [45] are increasingly inapplicable in a world of pervasive, unique, personalized, non-repeatable performances. As stewards and stewardship organizations, we cannot continue to simply complain about the intractability of the problems or speak idealistically of fundamentally impossible “solutions.”

Recognize that the documentation of performances and events of various kinds — dance, ritual, theatre, musical performance, coronations and inaugurations, lectures, public addresses, riots and wars, etc. — is very old, actually long pre-dating the invention of writing. It starts with drawing and painting (arguably Neolithic cave paintings are a good example), embraces the oral traditions in poetry and history represented by Homer, and later incorporates writing, printing, sound recordings, photographic images, film and video, and other techniques. News and social media are relatively recent developments and a small though important part of a much larger picture. What’s new today is pervasive personalization (particularly at the extreme granularity that is now commonplace) and interaction. It’s also important to recognize that the objectives aren’t obvious: the rise of the telephone meant that there were a vast number of person-to-person calls that were never part of the record and that nobody expected to be. Are there older analogies to today’s personalized performances we can learn from? [46] Indeed, historically, personal letters had many similarities here. Are we expecting too much in terms of stewardship of today’s world?

If we are to successfully cope with the new “Age of Algorithms,” our thinking about a good deal of the digital world must shift from artifacts requiring mediation and curation to experiences. Specifically, it must focus on making pragmatic sense of an incredibly vast number of unique, personalized performances (including interaction with the participant) that can potentially be recorded or otherwise documented, or at least do the best we can with this. This is indeed a strange and challenging new landscape, but it is not entirely novel. As already noted, documenting performance is ancient; the invention of new tools (writing, musical notations, photography, sound recording, film, etc.) has steadily broadened these capabilities. And perhaps there is a question of roles here: historically, the role of the archivist, the steward, has most often been to preserve documentation and recordings created by others: documentary photographers and videographers, ethnomusicologists, folklorists, audio engineers, and others employing ethnographic methods. Reporters and journalists also play a role here. But it’s increasingly obvious that those creating the documentation are also essential and crucial participants in the enterprise of stewardship. If archivists will not create, capture, and curate the record of the “Age of Algorithms,” then we must quickly figure out who will undertake this task, and how to get the fruits of their work into the custody and safety of our memory organizations for long-term preservation. Traditional archivists seem most comfortable dealing with the outcomes of the work of various types of documenters, rather than creating the testimony: this is a professional constraint that needs to be explicitly recognized, considered, and if appropriate clarified and affirmed if it is to be the case going forward [47]. Can we fund and support a new discipline and profession of Internet documentation in the “Age of Algorithms,” if we must have it?
The historic economics of enterprises like documentary videography and filmmaking are not promising; documentary photography is perhaps somewhat more encouraging.

I do not want to suggest that stewardship has ever been perfect, or even as good as we might hope for it to be. But the digital era overall, and now the “Age of Algorithms” in particular, introduces a wealth of new shortcomings and failure modes. Sadly, as best as I can determine, stewardship in this new world is going to be extremely imperfect, and as I hope I have suggested here, one of the great challenges is going to be understanding the extent of these imperfections, the subjectivities inherent in them, and the tradeoffs between the limitations and shortcomings of the record and the level of investment in capturing it. Another great challenge will be cultural, broadening the scope of what is considered not only legitimate but also essential curatorial and stewardship work by archivists and scholars at our memory institutions. One can only speculate about the perhaps essential roles that the analogs of private collectors (perhaps more accurately now freelance documenters and curators) might play in this new world. I would not dare to claim that the solutions proposed here are optimal; they are clearly flawed, compromised, limited. But I hope they will form a point of departure for a discussion on how we might do better.


About the author

Clifford Lynch has been the Director of the Coalition for Networked Information (CNI) since July 1997. He is also Adjunct Professor at the School of Information at the University of California, Berkeley.
E-mail: cliff [at] cni [dot] org



Acknowledgments

As with so much of my thinking, the “Friday Afternoon” seminar at the UC Berkeley School of Information was essential in formulating these ideas. Jeanette Zerneke took extraordinary notes at the 24 March 2017 session when I first explored these ideas and was kind enough to share them with me. The seminar returned to a near-final version of my thinking on 25 August 2017. My thanks to Ginny Steel, Todd Grappone, and Christine Borgman for making it possible for me to explore these ideas further in a more public talk and discussion at UCLA in April 2017. Many people were kind enough to read and offer comments on various drafts of the paper, and I thank Cecilia Preston, Michael Buckland, Joan Lippincott, Elliott Shore, David Rosenthal, Bernard Riley, Cathy Marshall, and Roger Schonfeld for their help. Don Waters shared a particularly thorough and insightful set of comments on a draft as well. Diane Goldenberg-Hart was an essential contributor in the preparation of the paper.



Notes

1. See Knuth (1997).

2. Analysis of worst-case and average running times for these procedures has long been a hard central problem in computer science.

3. Nick Seaver has done some excellent but, unfortunately, as-yet unpublished work on this evolution; see, for example, his talk at the 2016 University of California, Berkeley “Algorithms & Culture” conference, which was sadly not recorded due to technical problems.

4. I’ll offer examples of these later in this section.

5. More precisely, some of the training data is used to generate the algorithm, and the remainder is used to validate the output of the training process.

6. There are other cases where algorithms, including machine learning algorithms, have had mixed results, e.g., trading in financial markets. We seem to understand much less about what happens when independent algorithms compete with each other, or with a mix of humans and other algorithms, as distinct from applications that use algorithms to supplement or supplant human judgment. See Johnson, et al. (2013).

7. A good example of this: the iPhone recently started offering a textual transcription of messages left for incoming calls that the owner did not answer.

8. Perhaps with some human and hence clearly subjective intervention, to complicate matters even further.

9. Along with many other factors, which might include what hotels and airlines are paying them for placement.

10. Note that the terminology here is problematic. It can cover everything from fairly autonomous algorithmic agents to much more mundane algorithms for managing large buy and sell orders across exchanges; for example, Wikipedia distinguishes between “Algorithmic Trading” (the latter) and “algorithmic trading systems” (the former).

11. It’s important to note that, as we identify and purge various agents of influence in the social media networks, this also creates new issues. When outputs from these sources are indiscriminately eliminated from the public record, they may not be available to future scholars attempting to study their influence. This is similarly the case for the advertising campaigns that twisted the election on Facebook. Considerable attention needs to be devoted to how to handle the archiving and documentation of propaganda and “fake news.” See, for example, Zelenkauskaite and Niezgoda, 2017; Bessi and Ferrara, 2016; Ferrara, 2017a, 2017b.

12. The issue here is usually systematic bias: for example, some years ago the Sabre airlines reservation system, which was built and owned by American Airlines, was originally deliberately designed to show American Airlines flights very early in the list of flights returned in response to a query from a travel agent. This gave rise to legal action (Friedman and Nissenbaum, 1996).

13. See, for example, Vaidhyanathan (2017) for an excellent example of how complex, ephemeral, difficult to detect, and insidious such activities have been, and, implicitly, how really hard they are going to be to document without specific advance targeting.

14. The specifics of “acceptable” price discrimination are tricky, both from a legal and a customer relations basis, and demand considerable care.

15. Also note that exposing training data is really dangerous to the extent that it can help an opponent to understand how to deceive or manipulate your classifier.

16. Notable recent conferences would include the Algorithms & Accountability Conference at New York University, the Algorithms in Culture Conference at UC Berkeley, and many others. Projects would include work underway at the Berkman Klein Center for Internet & Society at Harvard and Big Data and Society, founded by danah boyd.

17. While detractors and critics may equally argue that the real agenda is to conceal shortcomings and discriminatory biases in the algorithms (Angwin, 2016; Wexler, 2017).

18. Since 2010 the U.S. Library of Congress has been maintaining a database of all the public tweets (though I am not clear whether they have other things, like some record of the “follower” structure for Twitter users). This is a very large database, and to date, the Library of Congress has not made it accessible, either to vetted scholars on a case-by-case basis or to the broader public. I am told that the barriers are twofold: concerns about legal liability, and the inability to afford the storage and computational resources that would be needed to support meaningful access to this collection.

19. I will leave aside the occasional searcher for the “holy grail” of very long-lived storage media (a materials science undertaking) combined with very low-tech, easy to produce reading technology. These seem unlikely to have much genuine impact on digital preservation, as they are not currently anywhere near consistent with the kinds of data volumes and transfer rates that are needed to be useful; they are more appropriate for challenges like identifying long-lived high-level nuclear waste sites for millennia across a dark age following the fall of civilization, which doesn’t require much bandwidth. See Benford (1999).

20. I am deeply indebted to Don Waters for stressing this key observation.

21. Cf., for example Varol, et al. (2017), though this is much more focused on “emitters” (e.g., tweeters) rather than relatively passive and private recorders.

22. And perhaps requiring the active collaboration of some components of this ecosystem in order to deceive others, which is perhaps more easily done in the name of national security than stewardship, if it can be accomplished at all.

23. See, for example, the excellent blog post “Synthetic populations for ABMs,” at Note that there are some really tricky problems using this kind of synthetic population generation to generate cadres of simulated users, because the population attributes that the target system conditions on and cares about are often tightly held secrets.

24. Other than the problem of getting various data providers like Acxiom to attribute suitable characteristics to all the software agents, it is provocative to think about very smart witnessing software that does most of the work of creating and managing a population of robot witnesses, and even trying to figure out the demography of this population both initially and on an ongoing basis. This seems far out of reach today, but consider the recent work by Google where they took various single-user video games from the 1980s and connected them to a machine learning system that essentially could read the screen, manipulate a mouse, and recognize the score and when its game was over. Astoundingly, the system fairly rapidly discovered optimal strategies for how to play these games. See Mnih, et al. (2015) and Bellemare, et al. (2016).

25. This is actually a very complex and nuanced issue. It’s not clear what the scope of IRB jurisdiction is: it may well be that simply archiving material is not within the purview of the typical IRB, particularly if this isn’t being funded by grant support. At the same time, researchers trying to use archived materials later may find themselves subject to IRB oversight and constraints.

26. Regional market coverage started in 1954, though it’s not clear how many regional markets were covered.

27. And when the question was how many people were watching and how many weren’t, rather than how various audience segments and demographics were making choices from among a vast number of alternatives.

28. And their language is misleading — crowdsourcing will likely produce a cadre of witnesses that don’t correctly represent the population needed, perhaps for auditing but virtually certainly for stewardship-oriented documentation, unless you can somehow enroll a rather large randomly chosen subset of the entire population, which seems improbable.

29. See John Keegan, at and related pages; see also Wong, et al. (2016). This is going to be a very complex, multipolar population.

30. Facebook has recently announced that they will eliminate this type of secret highly targeted advertising, and will make all such advertisements available for public inspection. It remains to be seen if this makes any meaningful difference in practice.

31. I have not yet found a good published version of these insights; perhaps the best summaries are Tufekci’s comments in “Fireside panel: The truth about fake news” from the 2017 MIT Conference on Digital Experimentation (CODE), and her recent TED talk “We’re building a dystopia just to make people click on ads” — see

32. This is well covered in Balnaves, et al. (2011).

33. Trying to find data on the number of Nielsen families over the years is difficult and full of complications, quickly running into various, sometimes contradictory, statements by different authors, often made without any citation to a source. In the early days, Nielsen used a purely diary-based method, where viewers reported on their viewing by making hand entries into these diaries; this system was later supplemented by a device called the “set meter,” which was combined with the diaries. In 1987 Nielsen began to shift from diaries to an at least partially automated data collection tool, and in the next 30 years the technology evolved, but in the early days the automated system was apparently expensive and hence deployed in very limited numbers. As a further complication, they also had both national data as well as local market data in some regions. There are two excellent books on the overall evolution and issues of audience measurement: Audience ratings: Radio, television, cable (Beville, 1988) and the much more recent Rating the audience: The business of media (Balnaves, et al., 2011). The former suggests that Nielsen was using some 450 families in 1951, and 700 by 1953; by 1965 there were 1,150 households and 1,250 by 1982 (pp. 71–72). The details here are mind-bogglingly complex, and the interested reader is referred to these sources and the references there.

34. For readers unfamiliar with the evolution of cable television, here’s a very brief history for the United States. CATV (Community Antenna Television) systems first appeared in the very late 1940s. The idea was to share and redistribute signal from an antenna (perhaps situated on a mountaintop) to a community that couldn’t get TV reception. In the early days they were limited to carrying broadcast channel signals and had very limited deployment, mostly to rural communities. In the mid-1970s things began to change quickly, as some CATV systems shifted from community antennas to satellite downlinks, and new content appeared like Ted Turner’s Atlanta-based Super Station; premium offerings like HBO also began to be marketed. However, despite various incremental technology improvements during the 1980s and early 1990s that took most cable systems to perhaps 50-channel capabilities, these analog CATV systems were limited in the overall number of channels they could support. In the mid-1990s cable systems began to transition to digital delivery, which allowed two key developments: they could support a mind-numbing number of channels (many hundreds or more), and they could also move into the delivery of broadband data services; of course, there are tradeoffs between the number of channels and the amount of broadband data that a cable system can carry. For a superb, richly detailed history, see Parsons (2008).

35. Very little seems to be publicly available about how often this capability is exploited and how valuable it is. Its use in political campaigns seems reasonably well known, but oddly, doesn’t seem to be well documented in the literature.

36. And, indeed, it’s become controversial which ones are offered for sale by any given platform: teenagers with serious depression might in theory be targetable, but perhaps not available for targeting by advertisers (Machkovech, 2017).

37. If you do, you can buy some interesting, only-recently-available new ones, defined in ways ranging from people interested in Texas home renovation-flippers to a specific city block or political precinct in Manhattan.

38. Curiously, it turns out that for a variety of reasons, people falling into one cluster, or perhaps two or three similar ones, tend to flock together geographically, so there are some very specific geographic areas that are very densely populated by people belonging to specific clusters. Or at least they did historically. See Weiss (1988).

39. Perhaps intended to just sow confusion, as distinct from traditional advertising or propaganda activities (Starbird, 2017; Weedon, et al., 2017; Gallagher, 2017).

40. Though trying to understand how system-defined clusters relate to various commonplace external clustering ideas is probably of considerable interest to future scholars, and to the extent that that information can be captured, quite valuable as well.

41. I trust I am not doing an injustice to her thinking here.

42. One good example here is the pervasive use of JavaScript, which leads to hugely subtle questions about what “stage” of the performance of a Web page should be archived. The series of blog posts by David Rosenthal (2017) on “the Amnesiac Civilization” is a great survey here.

43. I want to be clear here that my critique is of the scope of the DPC publications, which are generally of high quality and represent much of the best consensus thinking of the digital preservation community.

44. Albeit in a somewhat compromised fashion, in that tweeters can “recall” tweets and these are removed from the archives, and those receiving and archiving real-time feeds are obligated to reflect these deletions. This is actually really complex. The terms of service have been amended repeatedly, as has practice about their enforcement. A key area where Twitter’s position has reversed repeatedly involves the preservation of subsequently recalled tweets by politicians and other major public figures, where there is at least arguably some reasonable expectation of accountability.

45. We are coming to grips more effectively at last with the challenge many of these executable digital artifacts create for digital preservation, be they scientific analysis tools or digital humanities projects or video games, simply because they make demands on and incorporate dependencies on complex digital environments that must accompany them to permit their re-execution. We are learning how to package and usefully document these through virtual machines, containerization, and similar techniques. Though as Rosenthal points out, there are still many limitations to this work.

46. This situation with telephone calls is much more complex than it seems. Certain “official” phone calls have been regularly recorded, transcribed, or otherwise documented. “Official” correspondence is generally part of the public record, at least at the federal level. And, at some point in the future, as a society we are going to have to decide what to do with the legacy of truly massive surveillance of telephony; there are interesting echoes here of what Germany, for example, has had to work through in dealing with the legacy of ubiquitous Stasi surveillance.

47. Note that I make no value judgment or criticism here, merely a plea for clarity and realism so that the key broader challenge can be met.



References

Acxiom Corporation, n.d. “Consumer data products catalog: The power of insight,” at, accessed 16 November 2017.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner, 2016. “Machine bias,” ProPublica (23 May), at, accessed 16 November 2017.

ACM U.S. Public Policy Council and ACM Europe Policy Committee, 2017. “Statement on algorithmic transparency and accountability” (25 May), at, accessed 16 November 2017.

Mark Balnaves, Tom O’Regan, and Ben Goldsmith, 2011. Rating the audience: The business of media. New York: Bloomsbury Academic.

Eytan Bakshy, Solomon Messing, and Lada A. Adamic, 2015. “Exposure to ideologically diverse news and opinion on Facebook,” Science, volume 348, number 6239 (5 June), pp. 1,130–1,132.
doi:, accessed 16 November 2017.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos, 2016. “Unifying count-based exploration and intrinsic motivation,” Advances in Neural Information Processing Systems 29, at, accessed 16 November 2017.

Gregory Benford, 1999. Deep time: How humanity communicates across millennia. New York: Avon.

Alessandro Bessi and Emilio Ferrara, 2016. “Social bots distort the 2016 U.S. Presidential election online discussion,” First Monday, volume 21, number 11, at, accessed 16 November 2017.
doi:, accessed 16 November 2017.

Hugh Malcolm Beville, Jr., 1988. Audience ratings: Radio, television, cable. Revised edition. Hillsdale, N.J.: L. Erlbaum Associates.

danah boyd, 2017. “Towards accountability: Data, fairness, algorithms, consequences,” Points (12 April), at, accessed 16 November 2017.

Catherine Caruso, 2016. “Can a social-media algorithm predict a terrorism attack,” MIT Technology Review (16 June), at, accessed 16 November 2017.

Tim Cushing, 2017. “Court says CFAA isn’t meant to prevent access to public data, orders LinkedIn to drop anti-scraper efforts,” Techdirt (15 August), at, accessed 16 November 2017.

Jeffrey Dastin, 2017. “Amazon trounces rivals in battle of the shopping ‘bots’,” Reuters (10 May), at, accessed 16 November 2017., n.d. “Life stage clustering system: ‘PersonicX.’,” at, accessed 16 November 2017.

Experian, 2016. “Mosaic® USA: Your consumer classification solution for consistent cross-channel marketing,” at, accessed 16 November 2017.

Emilio Ferrara, 2017a. “Disinformation and social bot operations in the run up to the 2017 French presidential election,” arXiv (1 July), at, accessed 16 November 2017.

Emilio Ferrara, 2017b. “Disinformation and social bot operations in the run up to the 2017 French presidential election,” First Monday, volume 22, number 8, at, accessed 16 November 2017.
doi:, accessed 16 November 2017.

Batya Friedman and Helen Nissenbaum, 1996. “Bias in computer systems,” ACM Transactions on Information Systems, volume 14, number 3, pp. 330–347.
doi:, accessed 16 November 2017.

Sean Gallagher, 2017. “Facebook enters war against ‘information operations,’ acknowledges election hijinx,” Ars Technica (3 May), at, accessed 16 November 2017.

Susan Grajek and the 2016–2017 EDUCAUSE IT Issues Panel, 2017. “Top 10 IT Issues, 2017: Foundations for student success,” EDUCAUSE Review, volume 52, number 1, at, accessed 16 November 2017.

David Gunning, n.d. “Explainable artificial intelligence (XAI),” Defense Advanced Research Projects Agency (DARPA), at, accessed 16 November 2017.

Nicolaus Henke, Jacques Bughin, Michael Chui, James Manyika, Tamim Saleh, Bill Wiseman, and Guru Sethupathy, 2016. “The age of analytics: Competing in a data-driven world,” McKinsey Global Institute report, at, accessed 16 November 2017.

Neil Johnson, Guannan Zhao, Eric Hunsader, Hong Qi, Nicholas Johnson, Jing Meng, and Brian Tivnan, 2013. “Abrupt rise of new machine ecology beyond human response time,” Scientific Reports, volume 3, number 2627 (11 September), at, accessed 16 November 2017.

Will Knight, 2017. “The U.S. military wants its algorithmic machines to explain themselves,” MIT Technology Review (14 March), at, accessed 16 November 2017.

Donald Knuth, 1997. The art of computer programming. Volume I: Fundamental algorithms. Third edition. Reading, Mass.: Addison-Wesley.

Donald Knuth, 1968. The art of computer programming. Volume I: Fundamental algorithms. Reading, Mass.: Addison-Wesley.

Tao Lei, Regina Barzilay, and Tommi Jaakkola, 2016. “Rationalizing neural predictions,” arXiv (2 November), at, accessed 16 November 2017.

Adam Liptak, 2017. “Sent to prison by a software program’s secret algorithms,” New York Times (1 May), at, accessed 16 November 2017.

Sam Machkovech, 2017. “Report: Facebook helped advertisers target teens who feel ‘worthless’ [Updated],” Ars Technica (1 May), at, accessed 16 November 2017.

Matt McFarland, 2017. “A rare look inside LAPD’s use of data,” CNNtech (11 September), at, accessed 16 November 2017.

Pamela McCorduck, 2004. Machines who think: A personal inquiry into the history and prospects of artificial intelligence. Twenty-fifth anniversary update. Natick, Mass.: A. K. Peters.

Cade Metz, 2016. “The rise of the artificially intelligent hedge fund,” Wired (25 January), at, accessed 16 November 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, 2015. “Human-level control through deep reinforcement learning,” Nature, volume 518, number 7540 (26 February), pp. 529–533.
doi:, accessed 16 November 2017.

Paul Mozur, 2017. “Google’s AlphaGo defeats Chinese Go master in win for A.I.,” New York Times (23 May), at, accessed 16 November 2017.

Siddhartha Mukherjee, 2017. “A.I. versus M.D.,” New Yorker (3 April), at, accessed 16 November 2017.

David Nield, 2017. “You probably don’t know all the ways Facebook tracks you,” Gizmodo (8 June), at, accessed 16 November 2017.

Cathy O’Neil, 2017. “How can we stop algorithms telling lies,” Guardian (16 July), at, accessed 16 November 2017.

Osonde A. Osoba and William Welser IV, 2017. “An intelligence in our image: The risks of bias and errors in artificial intelligence,” RAND Corporation, Research Report, RR-1744-RC, at, accessed 16 November 2017.
doi:, accessed 16 November 2017.

Eli Pariser, 2011. The filter bubble: How the new personalized Web is changing what we read and how we think. London: Penguin Books.

Simon Parkin, 2016. “The artificially intelligent doctor will hear you now,” MIT Technology Review (9 March), at, accessed 16 November 2017.

Patrick R. Parsons, 2008. Blue skies: A history of cable television. Philadelphia: Temple University Press.

Maureen Pennock, 2013. “Web-archiving,” Digital Preservation Coalition (DPC) Technology Watch Report, 13–01 (6 March), at, accessed 16 November 2017.

Lee Rainie and Janna Anderson, 2017. “Code-dependent: Pros and cons of the algorithm age,” Pew Research Center (8 February), at, accessed 16 November 2017.

John Riedl and Joseph Konstan with Eric Vrooman, 2002. Word of mouse: The marketing power of collaborative filtering. New York: Warner Books.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, 2016. “‘Why should I trust you?’ Explaining the predictions of any classifier,” KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1,135–1,144.
doi:, accessed 16 November 2017.

David S. H. Rosenthal, 2017. “The amnesiac civilization, parts 1–5,” DSHR’s Blog, at, accessed 16 November 2017.

David S. H. Rosenthal, 2015. “Emulation & virtualization as preservation strategies,” Andrew W. Mellon Foundation, at, accessed 16 November 2017.

Jeff Rothenberg, 1995. “Ensuring the longevity of digital documents,” Scientific American, volume 272, number 1, pp. 42–47; An expanded version, dated 22 February 1999, at, accessed 16 November 2017.

Royal Society, 2017. “Machine learning: The power and promise of computers that learn by example,” at, accessed 16 November 2017; See also, Ipsos MORI, 2017. “Public views of machine learning,” Royal Society, at, accessed 16 November 2017.

Christian Sandvig, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort, 2014. “Auditing algorithms: Research methods for detecting discrimination on Internet platforms,” paper presented at “Data and discrimination: Converting critical concerns into productive inquiry,” a preconference at the 64th annual meeting of the International Communication Association (22 May, Seattle, Wash.), at, accessed 16 November 2017.

Noah Shachtman, 2012. “Death by algorithm: West Point code shows which terrorists should disappear first,” Wired (6 December), at, accessed 16 November 2017.

Carl Shapiro and Hal R. Varian, 1999. Information rules: A strategic guide to the network economy. Boston, Mass.: Harvard Business School Press.

Ben Shneiderman, 2016. “Opinion: The dangers of faulty, biased or malicious algorithms requires independent oversight,” Proceedings of the National Academy of Sciences, volume 113, number 48 (29 November), pp. 13,538–13,540.
doi:, accessed 16 November 2017.

Brent Smith and Greg Linden, 2017. “Two decades of recommender systems at,” IEEE Internet Computing, volume 21, number 3, pp. 12–18.
doi:, accessed 16 November 2017.

Jackie Snow, 2017. “Brainlike computers are a black box. Scientists are finally peering inside,” Science (7 March), at, accessed 16 November 2017.
doi:, accessed 16 November 2017.

Kate Starbird, 2017. “Information wars: A window into the alternative media ecosystem,” Medium (15 March), at, accessed 16 November 2017; See also, Kate Starbird, 2017. “Examining the alternative media ecosystem through the production of alternative narratives of mass shooting events on Twitter,” at, accessed 16 November 2017.

Seth Stephens-Davidowitz, 2017. Everybody lies: Big data, new data and what the Internet can tell us about who we really are. New York: HarperCollins.

Joseph Turow, 1997. Breaking up America: Advertisers and the new media world. Chicago: University of Chicago Press.

U.S. Executive Office of the President, 2016. “Big data: A report on algorithmic systems, opportunity, and civil rights,” at, accessed 16 November 2017.

U.S. Federal Trade Commission, 2016. “Big data: A tool for inclusion or exclusion? Understanding the issues,” at, accessed 16 November 2017.

Jerry Useem, 2017. “How online shopping makes suckers of us all,” Atlantic Monthly, at, accessed 16 November 2017.

Siva Vaidhyanathan, 2017. “Facebook wins, democracy loses,” New York Times (8 September), at, accessed 16 November 2017.

Onur Varol, Emilio Ferrara, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini, 2017. “Online human-bot interactions: Detection, estimation, and characterization,” arXiv (27 March), at, accessed 16 November 2017.

James Vincent, 2017. “Magic AI: These are the optical illusions that trick, fool, and flummox computers,” Verge (12 April), at, accessed 16 November 2017.

Jen Weedon, William Nuland, and Alex Stamos, 2017. “Information operations and Facebook,” version 1.0 (27 April), at, accessed 16 November 2017.

Michael J. Weiss, 1988. The clustering of America. New York: Harper & Row.

Rebecca Wexler, 2017. “Code of silence: How private companies hide flaws in the software that governments use to decide who goes to prison and who gets out,” Washington Monthly, at, accessed 16 November 2017.

Jamie Williams, 2016. “Our fight to rein in the CFAA: 2016 in review,” Electronic Frontier Foundation (28 December), at, accessed 16 November 2017.

Julia Carrie Wong, Sam Levin, and Olivia Solon, 2016. “Bursting the Facebook bubble: We asked voters on the left and right to swap feeds,” Guardian (16 November), at, accessed 16 November 2017.

Tim Wu, 2016. The attention merchants: The epic scramble to get inside our heads. New York: Knopf.

Tim Wu, 2013. “Fixing the worst law in technology,” New Yorker (18 March), at, accessed 16 November 2017.

Asta Zelenkauskaite and Brandon Niezgoda, 2017. “‘Stop Kremlin trolls:’ Ideological trolling as calling out, rebuttal, and reactions on online news portal commenting,” First Monday, volume 22, number 5, at, accessed 16 November 2017.
doi:, accessed 16 November 2017.


Editorial history

Received 23 September 2017; revised 13 November 2017; accepted 14 November 2017.

Copyright © 2017, Clifford Lynch.

Stewardship in the “Age of Algorithms”
by Clifford Lynch.
First Monday, Volume 22, Number 12 - 4 December 2017