The big head and the long tail: An illustration of explanatory strategies for big data Internet studies
First Monday

by Rasmus Helles

This paper discusses how the advent of big data challenges established theories in Internet studies to redevelop existing explanatory strategies in order to incorporate the possibilities offered by this new empirical resource. The article suggests that established analytical procedures and theoretical frameworks used in Internet studies can be fruitfully employed to explain high–level structural phenomena that are only observable through the use of big data. The present article exemplifies this by offering a detailed analysis of how genre analysis of Web sites may be used to shed light on the generative mechanism behind the long–tail distribution of Web site use. The analysis shows that the long tail should be seen as a tiered version of popular top sites, and argues that downsizing of large–scale datasets in combination with qualitative and/or small–scale quantitative procedures may provide qualitatively better understandings of macro phenomena than purely automated, quantitative approaches.


1. Introduction
2. Data and empirical materials
3. The long tail of Internet use
4. Analyzing audience engagement
5. Sampling, data cleaning and limitations
6. Genre analysis
7. The anatomy of the long tail
8. Conclusion



1. Introduction

The notion of ‘big data’ takes on a range of different meanings that, at least nominally, unites people from vastly different scientific disciplines and businesses ranging from small to global in scale (Mayer–Schönberger, 2013). That this unity is perhaps more rhetorical than substantial becomes clear when the different meanings ascribed to the term ‘big data’ are investigated and unpacked (boyd and Crawford, 2012). The term is associated with a particular business paradigm based on the exploitation of automatically harvested transactional data from digital systems (Davenport and Harris, 2007), e.g., for purposes of innovation, prediction of customer behaviour or optimization of business processes. It also signifies a range of approaches concerned with the handling and utilization of large–scale transactional datasets for scientific study, not least concerning online behaviour such as the use of social media like Facebook and Twitter (Bruns, this issue).

All these definitions have their uses and justifications, but for the purposes of the present article, big data is taken to refer to repositories of transactional data from the Internet: big data are the (sometimes gigantic) banks of data that are created when user interactions in online systems are logged, from online games, via social media to logs of customers browsing online shops. Defined in this way, big data can include both meta–data (Jensen, this issue) about communication and the content of communication as well.

The ability to record and store digital transactions between people and between people and various online systems offers new possibilities for researchers from a variety of fields and disciplines, and may serve to extend existing lines of scientific inquiry into the Internet in a variety of different ways. How central big data will become to different strands of Internet research depends in no small part on how compatible this particular way of registering communication is with the fields’ established modes of theorizing for the purpose of constructing explanations and developing forms of analysis. The present article departs from a specific subset of Internet studies, that which deals with understanding how the Internet relates to wider social and cultural formations, and looks at the challenges that big data poses to established modes of relating theory and data in scientific explanations. The article makes use of a specific case study of Internet browsing patterns in Denmark to exemplify the dilemmas that big data poses to a field that relies on general social and cultural theories to explain the relationship between online phenomena and wider social processes and structures. These theories are not immediately amenable to the analysis of big banks of transactional data, and the article begins by considering how different kinds of big data can be situated in relation to established categorizations of data types.



2. Data and empirical materials

Despite the fact that collections of big data (in the sense used here) can seemingly be found online, e.g., in the form of sample log files from Web servers, it is important to remember that in scientific study, data is in principle always made. The Web server log file that we find online does not become data before we begin to conceptualize it within the context of a research project. Our found ‘data’ can only come to constitute data in the sense of the empirical basis of a scientific project once it has actively been selected by a researcher according to a specific set of research interests or a particular perspective (cf., Boellstorff, this issue; Jensen, 2012). This process of selection is always guided by assumptions about the constitution of the phenomena of interest, which in scientific inquiry takes the form of theoretical concepts. These may be more or less explicitly stated in actual examples of research, but theoretical notions are always used to delimit what can meaningfully constitute data for a particular research project — to state otherwise would amount to saying that anything and everything could be used as data for a given research project, which is clearly not the case.

The fact that big data may in principle be harnessed from a range of different perspectives does not in itself clarify how theory can become involved in the construction of data, since this depends on the way data are collected, e.g., whether they are collected by the researcher or provided by others. For the purposes of the present discussion, it is useful to consider this problem through the distinction between different forms of data used in social inquiry [1]: primary data, which are generated by the researcher; secondary data, which are generated by other researchers; and tertiary data, which are both generated and interpreted by others. The three levels of the typology map how far removed the researcher is from the central processes in the creation and evaluation of materials. With primary data, the researcher has optimal control over the creation of data, and is directly implicated in the process of defining which aspects of the object under study should be included. For example, if a researcher wishes to study user comments to articles at the public portion of a given newspaper’s Web site, a Web crawler can be programmed to traverse the pages at some specified interval, which involves decisions about what portions of the site to record, which metadata relating to the individual comments to include (e.g., profile information about the author of the comment), etc. Which choices are made will depend on the horizon defined in part by the historical and theoretical legacy of the researcher’s field, and will likely differ substantially between, for example, researchers from journalism studies and interpersonal communication. In addition to crawlers and other automated ways of harvesting public content on the Web, a typical way of collecting primary data for big data analysis is through server–side access to digital platforms, which allows researchers to set up custom–made logging scripts that record desired transactions in a specific format. Similarly, some application programming interfaces (APIs) allow researchers unfiltered access to a range of system variables (e.g., Twitter Firehose); the resulting data could also be considered primary data, since the researcher has extensive control over the selection and formatting of data.

Secondary data are data collected and prepared by somebody else. In traditional social research, secondary data often consist of survey responses stored in spreadsheets in data repositories, or, with qualitative data, transcripts of interviews etc. (Hammersley, 2010). With secondary data analysis of any kind, several of the choices that are available to researchers creating primary data are no longer open: important decisions regarding what to record, how and when, have been made by others, and the specific circumstances under which data were collected may not be documented. This means that the ability to shape inquiry by instrumentalizing theoretical concepts into categories for data collection is restricted to making choices within the universe defined by data that are already made. This clearly does not mean that secondary analysis is necessarily flawed or unable to achieve the same degree of scientific rigour as primary analysis, but it does pose important restrictions on what analysis can be made based on these data alone. Secondary data in big data analysis include standard system log files (e.g., from a Web server), and some types of APIs. Even if data can be harvested by the researcher directly through the API, it does not necessarily give direct access to original data, but rather to data that are filtered and/or aggregated prior to being made available through the API (Bruns, this issue; Vis, this issue). In this case the researcher has no influence on the central process of deciding how data should be aggregated and may also be unable to ascertain exactly by which process aggregation has taken place. The output of recommendation systems (such as Amazon’s ‘people who bought this book also bought...’) typically provides aggregated data of this kind, often for privacy reasons or to protect business interests.

Tertiary data are data where both the collection and interpretation of data are done by somebody else. In traditional social research tertiary analysis typically consists in researchers using tabulations of results found in published research and comparing them to contemporary results or to other, tertiary data. It is clear that tertiary data restricts the researcher’s influence on the very central processes of collection, analysis and interpretation of data. Of course, as with secondary data, tertiary data and analysis may be fully feasible, provided that the data concerns the right issues, and is presented in appropriate categories, and provided that the researcher has sufficient guarantees about the validity of the original process of collection and analysis. In big data analysis, tertiary data may consist in tabulations of traffic volumes, user activity and other high–level statistics found in reports from governments, companies and consultancy firms.

It should be noted that the lines between the categories are not always clear–cut. It might be argued, for example, that the statistics made available through Amazon’s recommendation system (as indicated above) are processed to a level that in itself constitutes a basic form of interpretation, since the processing is likely to be based on assumptions about user preferences that play a substantial role in shaping the results, so that the output should be counted as tertiary rather than secondary data.

Compared with the original notion of tertiary data, which is restricted to the re–use of scientific results, it is important to note that the automated collection and processing of data that takes place in many of the institutions and companies that generate and own the big data repositories often involves judgements about how to construe data. These judgements play a role similar to that of theory in scientific inquiry, since they involve making assumptions about the structure and dynamics that shape the slice of reality that data are supposed to represent. The data material used in the case study below is a clear example of this: the data consist of a high–level aggregate of the Web browsing behaviour of a representative panel of Danish Web users (N=5,080) during one month. The data take the form of a list of the approximately 12,000 sites visited by members of the panel, how many members visited each site, and the total time they spent there. The original data were collected and processed by the trade organization of major Danish online media [2], FDIM (Organization of Danish Interactive Media), through a combination of cookie–based tracking of Web users and data from a panel of users who agreed to have their browsing tracked through a script installed on their primary browsing device [3]. The granularity of the original data was at the level of individual URLs, recording the precise Web page visited and for how long, but the data were processed through a series of steps to produce a ranked list of top–level URLs, with aggregate user count and time spent per site. The assumptions guiding this aggregation concern the intended use of the data, which is not scientific, but commercial in nature: FDIM tracks Web use in order to create a baseline measurement that will allow for a neutral metric of traffic that can be used in pricing online advertising. The reduction (‘stemming’) of individual URLs of pages visited to the top–level URLs of each site is relevant to obtain a metric for overall site popularity and is also sufficient for analysis that seeks to break down traffic according to user demographics.
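
The kind of reduction described above can be illustrated with a minimal sketch. It is not FDIM's procedure, whose individual steps are not documented here: the record format (user, URL, seconds), the function names and the example URLs are all assumptions made for illustration, and a site is simply equated with the host name of the page.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Reduce a page URL to its host name, dropping a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def aggregate(visits):
    """Collapse (user_id, url, seconds) records into per-site totals.

    Returns a list of (site, user_count, total_seconds), ranked by time spent.
    """
    users = defaultdict(set)
    seconds = defaultdict(float)
    for user_id, url, secs in visits:
        site = site_of(url)
        users[site].add(user_id)   # count each panel member once per site
        seconds[site] += secs      # accumulate time across all pages of the site
    ranked = sorted(seconds, key=lambda s: seconds[s], reverse=True)
    return [(s, len(users[s]), seconds[s]) for s in ranked]


# Invented page-level records (hypothetical URLs, not taken from the panel data).
visits = [
    ("u1", "http://www.example.dk/news/article-1", 120),
    ("u1", "http://www.example.dk/sport", 30),
    ("u2", "http://example.dk/", 50),
    ("u2", "http://blog.example.org/post/7", 40),
]
ranked_sites = aggregate(visits)
```

Once the records are collapsed in this way, the page–level detail is gone: everything that happened below the top–level URL is represented only by a user count and a time total.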

As a tertiary data source, the list supports analysis from several different perspectives. It is, presumably, immediately useful for comparative analysis of information seeking behaviour of the kind conducted in information science (Kostagiolas, et al., 2012), since it supports numerical analysis of the statistical distribution of time spent on sites at different levels of popularity. In terms of understanding how Web use relates to wider aspects of social life, however, the list has very little direct value, and as such may be seen as an example of the kinds of dilemma that confront researchers seeking to utilize the information offered by big data about the problems they are interested in.

Yet the list clearly holds unique information about macroscopic patterns of the browsing behaviour of an entire nation. Perhaps the list does not lend itself immediately to interpretation about the relevance of these patterns to wider issues of social and cultural life. But because the list was generated through a relatively unobtrusive method (Halavais, this issue) and based on a comparatively large, representative sample, the list most likely reflects real dynamics and structures in the overall Internet behaviour of the population. Making the list speak more directly to research interests from the field of Internet studies as defined here, thus becomes an exemplary case of the challenges and potential benefits of big data to the sociology of Internet use.



3. The long tail of Internet use

An observable phenomenon that characterizes the list of sites in the study is the so–called long–tail distribution (see Figure 1, below), which is a statistical distribution that has been found to characterize many different forms of online activity (Brynjolfsson, et al., 2010). The distribution is revealed when a site’s share of the total time spent online by the panel is plotted against its rank (sites are ranked so that the most used site is number 1, and less used sites get a numerically higher rank). On the list in question, the ranks span from number 1 (held by Facebook) to number 12,324 (held by the defunct Web site of a small local newspaper). A special property of long–tail distributions, which is also seen in Figure 1, is that the sites at the top account for a disproportionately large share of the total time spent on all sites: The top ten sites (just 0.08 percent of the 12,324 sites visited by the panel) account for 48.7 percent of the total amount of time spent online by the panellists.


Figure 1: The long–tail distribution of Internet use.
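
The quantities plotted in such a figure can be derived from a list of per–site time totals, as in the following minimal sketch. The function names are hypothetical and the toy data are invented (time falls off as 1/rank); only the figures 10 and 12,324 come from the text.

```python
def rank_shares(time_per_site):
    """Return each site's share of total time, ordered by rank (rank 1 first)."""
    ordered = sorted(time_per_site, reverse=True)
    total = sum(ordered)
    return [t / total for t in ordered]


def head_share(shares, k):
    """Cumulative share of total time held by the k top-ranked sites."""
    return sum(shares[:k])


# Invented, Zipf-like toy data: time at rank r is proportional to 1 / r.
toy_times = [1000 / r for r in range(1, 1001)]
shares = rank_shares(toy_times)
top_ten = head_share(shares, 10)

# Fraction of all listed sites that the top ten represent (figures from the text).
top_ten_fraction = 10 / 12324
```

Plotting `shares` against rank gives the characteristic curve; `head_share` gives the kind of summary figure quoted above for the top ten sites, although the toy data are not meant to reproduce the panel's 48.7 percent.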


It is important to note that this long–tail distribution of Internet use only becomes visible due to the automated measurement technique used; the detailed ranking of sites would be completely beyond the means of any conventional method of measurement, as would indeed the ability to compile such an extensive and detailed list of the browsing behaviour of a panel of this size. While the long–tail distribution may not imply anything specific in and of itself, there is clearly something structuring the activity, since the highly skewed nature of the distribution suggests that a large number of panel members share a preference for whatever is offered by the sites at the top, while sites in the tail do not have the same appeal. In this sense, the long tail may be seen as terra incognita, since a location in the tail implies that a given site exclusively appeals to a limited number of users, while those in the top were used regularly by a large majority of users.

When interpreted in this way, the observed long–tail distribution resonates with a research interest in how Web use reflects the interests and priorities of users: It indicates that while the sites in the top may be taken to represent an online mainstream, which caters to interests and needs shared by a large proportion of users, the tail sites serve a more specialized range of interests. Understanding what separates the communicative and interactional possibilities offered at sites at the top from those in the tail will give insight into factors shaping Web use at a scale not available to research without the use of big data.

These considerations motivated a comparative study of sites located in the top and tail portions of the sample. The study reported below approaches the problem through a quantitative content analysis. The study was designed to compare sites in terms of the content they presented, the interactional features they offered and a number of aesthetic or formal means of expression (such as images or sound).

Categorizing content meant confronting an issue that follows directly from the macroscopic character of the data obtained from the panel, and which clashes with fundamental assumptions that normally underlie content analytical approaches: the problem of devising a coding scheme that would allow for the huge, anticipated variation in content across sites. Given that every conceivable topic is likely to be discussed somewhere online, and that a share of those topics are likely to be discussed at sites visited by panel members, constructing an appropriate set of content categories is clearly problematic. In any content analysis (Krippendorff, 2004; Krippendorff and Bock, 2009), the construction of coding schemes is closely related to the phenomena under study (e.g., a specific news topic), since it is the ability of the coding scheme to register discursive forms in sufficient detail that will subsequently allow the researcher to draw conclusions about, for example, changes in news coverage. In the present case it was obviously not possible to develop a coding scheme that would cover the different topics that could be expected to be found on sites visited by the panel: the expected variation would simply be too great to be empirically manageable and the resulting list of categories too long.

The problem was eventually handled by deriving a set of categories from a macro–sociological perspective, taking Habermas’ (1992) broad categories of social institutions as a point of departure. Rather than going for a long, exhaustive list of topics, the analysis was instead based on a scheme for coding site content according to the general distinctions between institutionalized spheres of social interaction and communication that are central to Habermas’ work on the public sphere. The details of the development of the content categories can be found below; at this point it is central to underline that the development of the coding scheme along these lines was necessary to accurately reflect the anticipated character of the online content it had to accommodate. It is also central to the process of bringing a theoretical approach that is germane to research interests within the sociology of Web use to bear on a big data source: Making the tertiary data source represented by the list speak to broader issues of socio–cultural structures in this case involved developing a supplementary set of primary data on the basis of the original data set.

The specific interest of the study in linking the differences between sites in the top and the tail to a broader view of the user interests they accommodate was served by embedding the content categories in a theoretical understanding of Web sites as elements in a genre system that caters to different aspects of users’ interests and allows for different types of social action and interaction.



4. Analyzing audience engagement

Since the publication of their foundational article on genre and organizational communication, the work of Yates and Orlikowski (1992) has been a natural point of reference for studies on genre and new media. Based on sociological ideas about the mutually constitutive relationship between structure and (communicative) agency (Giddens, 1984; Miller, 1984), their emphasis on the nexus between the generic forms of communicative practices, and the contexts in which they are routinely employed, has paved the way for a more nuanced understanding of the relationship between media technologies, communication and organisational structures.

In Yates and Orlikowski’s definition, a genre is “[...] a typified communicative action invoked in response to a recurrent situation [...] or socially defined need [...]” [4], stipulating a two–way relationship between the wider structural conditioning of social action, and the specific communicative shapes it takes. Genres reflect the practices they are a part of, in no small part because they play a key role in making communication simple or at least manageable: Moulding particular communicative functions into recognizable formats allows people to determine what the communication is about, without having to absorb the entire message at first sight.

A frequent motivation in empirical studies of genres of organisational communication (Yates and Orlikowski, 2002) is that the character of the genre repertoire of any given organisation can reveal deeper aspects of the social dynamics of that particular organisation: The combinations of content, form and mode of address in our routine interactions tell tales about the underlying social organisation of our communities.

The present study, on the one hand, shares this fundamental view of the relationship between genres and social structures, which holds important implications for the study of genres also outside organisational communication. On the other hand, the empirical study proposes to scale up the analysis by departing, not from the workings of a single organisation or community, but from the full range of genres on the Web. What are the genres that are used the most and the least?

Following this revised approach, Web sites can still be seen as a key element in the structuring of social relationships and interaction among a wide variety of social institutions and actors. At the same time, the emphasis is shifted toward a comparison of the concrete sites out there, and the kinds of interaction they enable — or disable. Following the notion of genre employed in Jensen and Helles [5], which builds on Raymond Williams (1977), a genre is characterised by three dimensions:

  • Characteristic subject matter. Genres are defined in part by the nature of the content they structure, e.g., private communication on dating sites or public communication on the sites of political candidates.
  • Formal composition. Different kinds of content are typically communicated using different formal or expressive means. Music may be appropriate for personal expressions on, e.g., Myspace profiles, but not for the ‘Proceed to payment’ page on Amazon.
  • Mode of address. The configuration of communicative possibilities on a given site defines the expectations and possibilities for interaction: Newspaper sites often enable user–user communication, reflecting a view of visitors as active and inviting interaction, while banking sites offer a more restricted set of communicative possibilities.

As noted above, a key challenge when building a coding scheme for an analysis of highly diverse genres is to develop a typology that will allow for sufficiently nuanced distinctions between content types.

The present analysis seeks to accomplish this by mapping combinations of discursive types that are drawn from the sociological theory of Jürgen Habermas (1992) and the distinctions between different spheres of social interaction and communication implicit in his work on the public sphere. The central rationale for choosing this model, in addition to its proven usefulness as a tool in sociology and media studies, is that it essentially aims to do the same thing: create a typology that allows for a high–level ordering of the most fundamental kinds of social interaction.

The model is summarised in Table 1, after Jensen [6]:


Table 1: Central communication forms.


The model points out five spheres of social communication, characterised by different topics and, historically, conducted by different institutions: Private enterprises emerged as the site for economic exchange, and the intimate sphere was prototypically located in the family home, where topics not fit for public discussion could be debated. Not least the development of media, from television (Meyrowitz, 1985) to digital media and social network sites (Baym, 2010; boyd and Ellison, 2007), has loosened the association between the institutions and the discursive forms in the five spheres, allowing forms of communication and exchange to take place across institutional spaces previously reserved for only some kinds of communication. Despite this historical trend towards increased dissociation between institutional place and communicative form, and despite an increased blurring of previously clear–cut meanings of ‘public’ and ‘private’ communication and the possible limits in applying the categories unequivocally in online contexts (boyd and Ellison, 2007; Papacharissi, 2002, 2009a; Svenningson Elm, 2009), the typology still captures recognizable communicative forms. In their operationalized form, the different types of communication were defined as follows in the empirical content analysis:

  • Communication in the intimate sphere was understood as communication regarding purely private affairs, even if they were stated in a publicly available site. In this sense, someone writing “I’ve been cleaning the kitchen today” on his or her blog would be making a statement of this type.
  • Communication in the cultural public sphere was defined as communication that concerned aspects of private life, but in a non–private way, e.g., in a fictional form, or in a way that tries to generalize a personal experience to a more general level, for example by framing it as an existential problem common to mankind.
  • Communication in the social sphere was defined as communication that deals with commercial exchanges or communication about other business–related topics. In this sense sites explicitly inviting commercial exchange (e.g., house listings at real estate sites) or the publication of a set of employee guidelines on a company Web site would be engaged in social sphere communication.
  • Communication in the political public sphere was understood as political debate, i.e., discursive manifestations with a general perspective on issues relating to the regulation of relationships between people or institutions: An individual assuming the discursive role of citizen when complaining about tax returns or gender roles on a private blog, or a company publishing a press release stating its view about environmental regulations, would both be coded as political communication.
  • State communication, finally, was understood to be communication of legal rules and regulations. In this sense, a public information site informing visitors about the possibilities of getting public help to quit smoking would be state communication, as would information about tax rules and regulation on a bank site.

The discursive forms were coded separately, so that a given site might be coded, for example, for political, private as well as cultural communication.

Form and mode of address

The variables for coding form, first, were designed to capture the different formal elements a site may have [7]. The formal categories describe what types of media are present at a site.

In addition to the list of formal categories pertaining to sensory modalities, a second set of categories was coded regarding the configuration of information and interactivity in the relationship between user and site [8]. This block of categories captures the ways a site presents information about itself or the social entity behind it to users, and also captures what kinds of potential communication are made available. The third and final set of variables concerns the affordances for user–user communication present at a given site, and includes facilities for communicating directly within the site (e.g., a forum), and features encouraging communication about the site in other contexts, for example through tagging, ready–made e–cards or link–forwarding facilities [9]. Taken together, the coding categories under the label of mode of address capture the kinds of activities a site stipulates for users, from the availability of features allowing users to communicate among themselves to the presence or absence of contact details allowing visitors to communicate back to the person or organisation behind a site.

All variables were coded as binary, and were coded as one if the site feature they describe was found present anywhere on the front page or two link steps down from the front page. Coding the front page and two steps down ensures that a broad variety of Web pages of the site were reached [10].
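
The coding rule can be expressed as a minimal sketch. The coding in the study was done manually; the variable names below are hypothetical stand–ins for the codebook, and the representation of a coder's notes as (link depth, feature) pairs is an assumption made for illustration.

```python
MAX_DEPTH = 2  # front page (depth 0) plus two link steps down

# Hypothetical variable names; the study's actual codebook is not reproduced here.
VARIABLES = ["intimate", "political", "images", "sound", "forum", "contact_details"]


def code_site(observations, variables=VARIABLES, max_depth=MAX_DEPTH):
    """Turn (depth, feature) observations into one binary code per variable."""
    present = {feature for depth, feature in observations if depth <= max_depth}
    return {var: int(var in present) for var in variables}


# Invented observations for a single site.
observations = [
    (0, "images"),
    (1, "political"),
    (2, "forum"),
    (3, "sound"),  # three steps down: outside the coded portion of the site
]
codes = code_site(observations)
```

Because every variable is coded independently, a single site can score one on several content categories at once, which is what allows combinations of discursive forms to be registered.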



5. Sampling, data cleaning and limitations

A final issue that arose from the large–scale nature of the panel data was that of reducing the data to a level that can be handled with manual coding. Making choices about features of content, form and communicative affordances of different sites at the level required for genre analysis remains beyond the scope of automated analysis (Lewis, et al., 2013). In this case, the challenge was to create sub–samples that retained the phenomena of interest (the structural features of the long tail), while reducing it to a manageable number of sites. The problem was handled by using a combination of random sampling (to create a representative sample of tail sites) and a sample comprising all the top sites.

By using the proposed analytical strategy, and by operating with limited sample sizes, it becomes possible to use quantitative methods for a precise identification of groups of similar sites, and at the same time allow for subsequent comparisons based on qualitative observations and characterisations of sites within the individual genres.

The analytical model resulted in the construction of a combination of purposive and random sampling:

  • A top sample, consisting of a total sample of the first 150 sites on the list.
  • A tail sample, consisting of N=200 sites drawn at random from the tail.
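
The two–part design can be sketched as follows. The sketch assumes that the tail comprises every site ranked below the top 150 and that the draw is a simple random sample without replacement in which each tail site is equally likely to be chosen regardless of its traffic; the function name, the seed and the placeholder site names are hypothetical.

```python
import random


def draw_samples(ranked_sites, top_n=150, tail_n=200, seed=0):
    """Split a ranked list into a complete top sample and a random tail sample."""
    top = list(ranked_sites[:top_n])        # total sample of the head
    tail_population = ranked_sites[top_n:]  # assumed definition of the tail
    rng = random.Random(seed)               # fixed seed so the draw can be repeated
    tail = rng.sample(tail_population, tail_n)  # sampling without replacement
    return top, tail


# Placeholder site names standing in for the 12,324 ranked sites.
ranked_sites = [f"site-{rank}" for rank in range(1, 12325)]
top_sample, tail_sample = draw_samples(ranked_sites)
```

Sampling sites uniformly means that the tail sample represents the population of tail sites rather than the traffic they receive; weighting the draw by time spent would answer a different question.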

The decision to define the top 150 as the mainstream has an inherently arbitrary component, as there is no natural cut–off point in long–tail distributions where the big head stops and the long tail begins. The choice was based on conversations with Web industry insiders, who agreed that sites beyond the first 150 were not likely to be produced specifically for the Danish market, and therefore would not have marketing budgets that would promote them to Danish users, pushing them up the ranks.

The method by which the original sample was generated is in many ways helpful, since the panel approach used to collect the online browsing behaviour that underlies the list ensures that all traffic data stems from real people (and not bots and crawlers), and because it can be considered representative of the general activity of the Danish Internet population. However, the method also imposes limitations on the study that need to be respected.

The absolute top site in the entire sample is Facebook, which accounts for 18.68 percent of the total time spent. But since everybody logging on to Facebook is transported to their individual page, which allocates content dynamically, and because many profile pages are private (Bucher, 2012; Papacharissi, 2009b), Facebook cannot be coded according to the definition of a Web site used in this study. For this reason, Facebook was removed from the top sample.

A similar problem relates to blogs, since the URLs of individual blogs are also aggregated under the top URL of the blogging service, rather than the individual pages. For the same reason as above, the major blogging services also had to be removed, since individual blogs could not be discerned and coded. Compared to the amount of traffic, the (quantitative) loss is smaller, as blogs (judging by the traffic on the four major blogging sites) accounted for less than one percent of the total time spent. Twitter and LinkedIn were also removed from the sample prior to the analysis for the same reasons.

The process resulted in a top sample of N=138. Two identical sites with different URLs in the tail sample were consolidated, resulting in a tail sample of N=198. The removal of sites means that the total sample (all remaining 12,310 sites) covered 79 percent of all Web traffic by the panel, and that the top sample alone covered about 60 percent of all Web traffic.



6. Genre analysis

The top sample was analysed using latent class analysis [11], and resulted in the identification of three well–defined classes of sites [12]. A subsequent analysis of the relative importance of each coding variable to the classes allowed for the identification of traits that could be considered important, by being either significantly [13] over– or under–represented in a given class. The resulting three classes can be characterised as follows:

6.1. Content is king

Sites in this category (N=49) are much more likely to offer video content and to display ads for third–party sites. They also have various features for on–site user–user communication, and frequently allow for some form of written exchange between users, e.g., a forum or facilities for commenting. They also often enable users to share views and experiences in more restricted formats such as voting and quizzes, where users can compare their own input to that of previous users. Users are also able to alert other people to site content in the form of tagging and ‘ready–mades’, the latter typically by sending an e–mail message with a link to the site.


Availability of characteristic forms of content and communication on sites in the content is king genre compared to sites not in the genre


An inspection of the sites in this cluster reveals that it comprises three distinct subgroups, defined by specific combinations of content variables.

  1. Private and cultural communication. This group of sites (N=18) mainly consists of commercial dating sites, which characteristically combine private communication with cultural and commercial communication: Users mix the private content of their self–presentations with general statements or discussions, e.g., about the definition of true love, or the appropriate balance between the genders in a relationship. The commercial content is typically offers made by the organisation behind the site, aimed at selling various packages (gold memberships etc.).
  2. Political, cultural and commercial communication: All the major online news sites and newspaper sites (N=19) form a distinct subgroup within the category. These sites share a unique combination of content variables, as they are the only sites in the top sample to simultaneously have cultural, political and business communication. While many other sites include business communication, and many also have the combination of business and cultural communication, news sites are the only ones to combine all three dimensions. This resonates with the role of the modern newspaper as the carrier of cultural and political debates, combined with the commercial basis for their operation, just as the many features for user–user interaction may reflect the classic, publicist role of newspapers, embodied in the transformation of the letter–to–the–editor to facilities for user commentary below articles (Chung, 2008).
  3. The final group of sites in this category mixes culture and commerce (N=12), and mainly consists of sites with video pornography (so–called tube sites) and other sites with user–upload of video content. They are similar to the other two subgroups because they allow users to communicate on site, e.g., in the form of commentary sections to the videos. They represent some of the few pornographic sites that attract enough sustained use to make it into the top sample, most likely because people are required to stay at the site to watch the videos.

As a summary of the features of the three subgroups, they are all sites that are heavy on content, either in the form of re–mediated (Bolter and Grusin, 1999) content formats (e.g., news and more or less tasteful forms of entertainment), or Web–specific formats such as asynchronous discussion forums. The combination of content features, formal features and mode of address all reflect this, in the sense that these sites are designed to keep users there, and to allow them to contribute to the attraction of the content already on offer by producing extra content (Bruns, 2008).

6.2. Citizen and consumer sites

Sites in this genre (N=48) are characterised mainly by their frequent attempt at selling things to visitors. This may be in the form of merchandise or services. At the same time, sites are very unlikely to display advertisements for other organisations, and they are also unlikely to invite users to communicate with each other on–site. In terms of content, they rarely display cultural or private communication.


Availability of characteristic forms of content and communication on sites in the Citizens and consumers genre compared to sites not in the genre


An inspection of the sites in this cluster reveals three subgroups (two big and one small) based on content variables.

  1. State and legal communication. This subgroup (N=16) represents sites owned by public authorities, and is characterised by combining state communication with business and (sometimes) cultural communication. These are the sites of tax authorities, public libraries and the municipalities of big cities. Beside the difference in subject matter available at the sites, they have many things in common with the sites of big corporations, not least in the way they approach the user: Like big companies, they are unlikely to allow users to register each other’s presence, just as they are unlikely to allow users to communicate asynchronously, e.g., via a forum. Users are invited to contribute and request information via forms, and once they log in with their identification number, they are able to access information held by the organisation about them (e.g., see the status of their tax report, or what library loans are registered in their name). These sites often collect fees for particular services through an e–shop.
  2. Business communication. This subgroup (N=25) consists of big retailers and business–to–consumer companies. These are sites of companies selling goods with broad appeal, such as IKEA, large travel agencies, banks, national real–estate chains and telecommunication companies. The absence of on–site user–to–user communication presumably reflects a desire to keep communication clear and uncluttered, and a desire to protect the online manifestation of the brand from comments by frustrated users. Instead, these sites position users as individuals who are offered meticulously crafted avenues of interaction with the organisation behind the site: Users can buy things or ask questions via preformatted forms, provided they identify themselves by first constructing a user profile.
  3. Business and culture. These sites (N=7) combine cultural and commercial communication, for example auction sites specialising in used quality items. Although these sites do allow some user contributions in the form of pictures and descriptions of items to be sold (e.g., antiques), they also employ strict moderation, disallowing users from engaging in off–topic exchanges.

All sites in this group share a selective approach to their users by pre–selecting a specific set of user needs or interests as relevant, and formatting the possibilities of users’ on–site activities accordingly: On these sites, you are either a customer of a particular kind of good or service, or a citizen with a specific set of needs to be taken care of.

6.3. Specialised services

The last group (N=29) contains sites that are very unlikely to have communication about anything besides business; for example, they very rarely offer communication about cultural topics, and they are also significantly less likely to have more advanced aesthetic features such as video. They are also less likely to offer retention features aimed at bringing users back at a later point, e.g., by suggesting they sign up for e–mail newsletters.


Availability of characteristic forms of content and communication on sites in the Specialised services genre compared to sites not in the genre


An inspection of the sites in this group reveals the prototype to be sites of online–only businesses: They are information aggregator sites, in the form of telephone directories, price aggregators and currency exchange rate information sites. They specialize in no–nonsense dissemination of a single type of information, and do not attempt to tie users into more elaborate or evolved forms of relationship by having them register a profile or subscribe to a newsletter.

The sites in this genre differ from others by their sole reliance on broadcast communication: They do not have anything to sell, nor do they have facilities that require you to log in in order to access them. All content is immediately available for consultation, and the only interactive features relate to search or input in forms.

6.4. Non–classified sites

A small group of sites (N=12) do not fit in any of the three main or sub genres [14].

They can be seen in the left panel of Figure 2 (below), which displays the fit between the classes found in the latent class analysis and the sites in the top sample. An interesting example is the site of Danmarks Radio, the main national public service broadcaster: It is similar to the other news sites in Class 1 (Table 5, below) in some respects, but differs among other things because the site does not feature advertisements, making it more similar to the sites in Class 2 (Table 5, below).


Figure 2: Fit between top and tail sample sites to the genre model derived from the top sample.




7. The anatomy of the long tail

Subsequent to the classification of the top sites, tail sites were classified according to the latent class typology that was based on the top sites, by calculating their predicted class membership. The result is plotted in the right hand panel of Figure 2, which clearly shows that while the majority of tail sites do belong to one of the three classes (genres), a number of sites do not fit any of the categories particularly well [15].
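The classification step can be sketched as an application of Bayes’ rule: given class sizes and per–class probabilities of each binary variable, a site’s posterior class membership is computed, and the site is left unclassified if no class reaches p=.95 (the threshold stated in notes 14 and 15). The class weights, the item probabilities and the four variables below are invented for illustration and are not the estimates from the study.

```python
def class_posteriors(site, class_weights, item_probs):
    """Posterior probability of each latent class for one binary-coded site."""
    joint = []
    for weight, probs in zip(class_weights, item_probs):
        likelihood = weight
        for present, p in zip(site, probs):
            # Variables are treated as independent within a class.
            likelihood *= p if present else (1.0 - p)
        joint.append(likelihood)
    total = sum(joint)
    return [value / total for value in joint]


def assign_class(site, class_weights, item_probs, threshold=0.95):
    """Return (class index or None, posteriors); None means unclassified."""
    posteriors = class_posteriors(site, class_weights, item_probs)
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return (best if posteriors[best] >= threshold else None), posteriors


# Hypothetical model with three classes and four variables:
# video, third-party ads, shop, user forum.
weights = [0.40, 0.35, 0.25]
probs = [
    [0.90, 0.90, 0.20, 0.80],  # class 0: content-heavy sites
    [0.20, 0.05, 0.90, 0.05],  # class 1: citizen and consumer sites
    [0.05, 0.30, 0.05, 0.05],  # class 2: specialised services
]

for site in ([1, 1, 0, 1], [0, 1, 0, 0]):
    label, posteriors = assign_class(site, weights, probs)
    print(site, label, [round(p, 3) for p in posteriors])
```

With these invented parameters, the first site is assigned to class 0 with near certainty, while the second is most similar to class 2 but stays below the threshold and is therefore treated as unclassified.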

An important observation concerns the split between classified and unclassified sites in the tail sample. Since the tail sample is a simple random sample, the findings suggest that 35.4 percent (±7 percent) of all the tail sites fall outside the classification of the top sites. Stated differently, something like 60–70 percent of the tail sites fit the classification of the top sites, while 30–40 percent of the tail sites fall outside the classification. In terms of understanding the anatomy of the long tail, the terra incognita discussed above turns out to look quite familiar, at least in the sense that the sites found here offer broadly the same combinations of interactive possibilities and content as those found in the top.
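The reported interval can be reproduced approximately with the standard normal approximation for a proportion. The sketch assumes a 95 percent confidence level (z=1.96) and ignores any finite population correction; neither assumption is stated in the text. The counts are the 70 unclassified sites out of the tail sample of 198.

```python
import math


def proportion_margin(successes, n, z=1.96):
    """Sample proportion and its margin of error (normal approximation)."""
    p = successes / n
    standard_error = math.sqrt(p * (1.0 - p) / n)
    return p, z * standard_error


p, margin = proportion_margin(70, 198)
print(f"{p:.3f} +/- {margin:.3f}")
```

Under these assumptions the margin comes to just under seven percentage points, consistent with the rounded figure given above.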

Before turning to the unclassified tail sites, we shall look at the well–classified ones. An inspection of the clusters of the tail sites similar to the one presented for the top sample reveals some interesting variations [16].

In Table 5 the top and tail sites are compared by entering the tail sites in a table with the main groups and subgroups that emerged from the analysis above.


Genres of the top and the tail samples


It is clear that the tail sites in many cases represent specialised versions of their top sample counterparts: The tail portion of the news category (Class 1b) comprises sites of niche newspapers and non–Danish language news sites (e.g., New York Times, Guardian). In Class 1c, the fictional content commented on by fans shifts from pornography to other kinds of fiction (e.g., a site for a fantasy trilogy), and sites relating to various hobbies. The difference between the top and the tail sites in Class 1 appears to follow a specialisation in content.

In the ‘Citizen and consumer’ Class 2, the underlying discriminatory mechanism is also quite clear, as large municipalities in Class 2a are replaced by smaller ones, and several more specialised state–owned sites (e.g., university sites) enter the picture. In the business segment (Class 2b), retailers with broad appeal are replaced by sites for individual brands, offering a less frequently needed set of services (e.g., downloads of manuals for software drivers). In general, the tail sites in Class 2 are differentiated from their top counterparts by virtue of a decline in potential audience (e.g., by servicing fewer customers or fewer citizens).

The largest differences are found in Class 3, where a new sub–category emerges. The category already familiar from top sites (online–only businesses) follows the same pattern of content specialisation that was seen in Class 2b and 2c, since the aggregators with broad appeal (e.g., currency exchange rate sites) are replaced by specialised ones (e.g., a pet trading site). The new category consists of sites for local businesses. They fall in the same category as the online–only businesses primarily because they combine business communication with a very limited set of formal and interactional affordances: The sites of local business are online ‘business cards’ containing little more than a few pictures and a description of services offered and contact information.

Taken together, the sub–genres in Classes 1–3 primarily manifest a trend towards smaller audiences based on two discernible factors, namely, either the geographically specific character of the services provided, or a thematic specialisation.

The unclassified long tail sites

The uncategorised sites in the long tail (N=70) do not support further classification using latent class analysis. Judging by their content composition, they are either business–only sites (N=44), or they mix business communication with some other type of communication (N=26).

Looking at the last group first, nine are pornographic sites, which, judging by the names implied by their URLs, cater to more specialized interests than their generically–named counterparts from the top sample (e.g., youporn). By the same evidence (URLs) a further six sites can be identified as music sites (the collaborative music database is among these sites), and contain various sorts of music related information (lyrics etc.).

Of the business–only sites, 32 can be separated into two groups, suggesting two opposite ways businesses can fail online: poorly made sites for local businesses that simply fail to do the most basic things that business sites are required to do, such as supplying contact details for the company behind the site, and failed attempts at creating aggregator sites. The first group are sites for existing businesses that have never really caught on to the logic of the Web; the failed aggregator sites are presumably developed by Web enthusiasts who have never really caught on to the logic of business.

The remaining sites are a mix of NGO sites, specialised business sites and two discussion forum sites, and fall outside of the established genre system primarily because they mix content categories: For example, the NGO sites combine communication about legal matters (e.g., formal funding criteria) with explicit partisan political communication. Several of the specialised business sites in the remainder group are privately run, semi–commercial activities, with a combined focus on personal communication with potential clients and on attempts at selling (e.g., a woman selling knitting patterns while inviting off–line socialising). Together with the discussion forums, these chatty home businesses are among the only ones in the tail inviting communication beyond broadcast messages of prices and links to content.



8. Conclusion

The present study has produced findings along two dimensions. First, it has produced a baseline of information about the kinds of sites that attract the attention and time of Danish Web users, and demonstrated that the long tail can usefully be seen as a tiered version of the top, with content specialization and geographical proximity as key factors characterizing the difference between top and tail. In addition, the analysis has shown the utility of departing from a corpus of well–known sites in order to map the terra incognita of the long tail. Second, it has demonstrated how theoretical perspectives specific to studies looking at the relationship between Internet use and wider socio–cultural dynamics may productively be brought to bear on big data, in this case by applying genre theory and content analysis to tertiary data on Web use.

Along the first dimension, the study has shown that the generic forms of the mainstream sites at the top of the long tail closely follow the characteristics of the underlying organisations: Sites with news and entertainment fall in the same category. Bureaucratic state organisations and large corporations with mass appeal approach their users in similar ways. Importantly, the analysis of activity also revealed that a substantial portion of sites in the long tail consists of specialised versions of the offers of top sites. Tail sites are, in a sense, top sites that offer something that fewer people find relevant or useful. The tail sites that did not match the mainstream genres were found to be high on business and low on conversation, which may well reflect that conversations which once took place on the Web have now moved on to Facebook. Finally, it was indicated that the unclassified portion of the tail sites, to a large extent, consists of either very specialised sites or failed sites.

Along the second dimension, the analysis serves as a demonstration of the necessity of introducing theory in order to make big data useful as an empirical resource. Big data will obviously require researchers across many disciplines to adjust to a new set of practical demands being posed by the scale and character of the big data sets that can be found or made. But solving those problems does not entail solving the epistemological problems that are involved in transforming big data sets to empirical materials. This requires the application of theory that maps the relationship between analytical categories and results, that is, the features of the world about which we make data speak.


About the author

Rasmus Helles is Associate Professor of Communication and IT at the Department of Media, Cognition and Communication at the University of Copenhagen, Denmark. His research focuses on the use of digital media, especially the relationship between media use and everyday life. Another research focus is the influence of digital media on media regulation and the existing media landscape. He is currently working on a project on the impact of the e–book on the market for commercial publishing in Denmark. He received his Ph.D. from the University of Copenhagen in 2009.
E–mail: rashel [at] hum [dot] ku [dot] dk



1. Blaikie, 2009, p.161ff.

2. FDIM:, last accessed 1 August 2013. The sample was for September 2009.

3. Although the measurement method has inherent limitations and problems, among other things because it does not trace parallel use of the web on different terminals (Helles, 2013; see also Baym, this issue) by the individual users, it nonetheless provides a robust indication of the relative distribution of activity across sites.

4. Yates and Orlikowski, 1992, p. 301.

5. Jensen and Helles, 2005, p. 98.

6. Jensen, 2012, p. 17.

7. A total of 11 formal categories were coded: Sound, music, speech, photographs, graphics, moving graphics, animation, video and background image. The final category was image size, which was coded for if more than two–thirds of the front page of the site was covered by images.

8. This included a total of 12 categories: Links to other sites, Login–area, archive, site information, contact details, other language version, downloads, pushing of later pull, membership, (a)synchronous user–site communication and user uploads, aggregation.

9. The list includes: Site–internal user–user features, external features, readymades, voting, user–user selling, voting, quizzes, shop, user–user uploads.

10. Intercoder reliability between the two coders was checked using Krippendorff’s alpha (Krippendorff, 2004) by letting both coders code a randomly chosen reliability sample of 10 percent of the combined top and tail samples. Alpha values ranged between .80 and .90 for 13 measures. Of the remaining 29 measures, all but five had alpha values at or above .90. The remaining five variables, which had alpha values between .56 and .64, measured the type and size of advertisements on the front page and a form of link aggregation. These variables were subsequently excluded from the analysis. The remaining variables were kept in the analysis based on the exploratory purpose of the study. Because sites were coded some weeks after the sample had been drawn, steps were taken to ensure that sites from the reliability sample were coded on the same day.

11. Agresti, 2002, p. 538; Linzer and Lewis, 2011.

12. Latent class analysis requires the analyst to stipulate the number of categories before analysis. The analysis was run consecutively with N=1 to N=8 categories, and the three–class model was chosen on the basis of minimizing the Bayesian Information Criterion (BIC), in accordance with Nylund, et al. (2007).

13. The significance of a given variable to a given class was determined by chi–square tests of variables–per–class against the other two classes, compared to expected values based on the even distribution of a given trait.

14. Sites were categorised as unclassified if their highest probability of belonging to any of the three clusters was below p=.95.

15. Tail sites were designated as unclassified if they did not achieve a minimum probability of p=.95 class membership for one of the classes.

16. It is important to note that due to the small size of the tail sample, the proportions of the different genres in the tail site cannot be considered statistically meaningful.



Alan Agresti, 2002. Categorical data analysis. Second edition. New York: Wiley–Interscience.

Nancy K. Baym, 2010. “Interpersonal life online,” In: Leah A. Lievrouw and Sonia Livingstone (editors). Handbook of new media: Social shaping and social consequences of ICTs. Updated student edition. London: Sage, pp. 35–54.

Norman W.H. Blaikie, 2009. Designing social research: The logic of anticipation. Second edition. Cambridge: Polity Press.

Jay David Bolter and Richard Grusin, 1999. Remediation: Understanding new media. Cambridge, Mass.: MIT Press.

danah boyd and Kate Crawford, 2012. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon,” Information, Communication & Society, volume 15, number 5, pp. 662–679.
doi:, accessed 18 September 2013.

danah boyd and Nicole B. Ellison, 2007. “Social network sites: Definitions, history, and scholarship,” Journal of Computer–Mediated Communication, volume 13, number 1, pp. 210–230.
doi:, accessed 18 September 2013.

Axel Bruns, 2008. Blogs, Wikipedia, Second life, and beyond: From production to produsage. New York: Peter Lang.

Erik Brynjolfsson, Yu Hu, and Michael D. Smith, 2010. “Research commentary — Long tails vs. superstars: The effect of information technology on product variety and sales concentration patterns,” Information Systems Research, volume 21, number 4, pp. 736–747.
doi:, accessed 18 September 2013.

Taina Bucher, 2012. “Want to be on the top? Algorithmic power and the threat of invisibility on Facebook,” New Media & Society, volume 14, number 7, pp. 1,164–1,180.
doi:, accessed 18 September 2013.

Deborah S. Chung, 2008. “Interactive features of online newspapers: Identifying patterns and predicting use of engaged readers,” Journal of Computer–Mediated Communication, volume 13, number 3, pp. 658–679.
doi:, accessed 18 September 2013.

Thomas H. Davenport and Jeanne G. Harris, 2007. Competing on analytics: The new science of winning. Boston: Harvard Business School Press.

Anthony Giddens, 1984. The constitution of society: Outline of the theory of structuration. Cambridge: Polity.

Jürgen Habermas, 1992. The structural transformation of the public sphere. An inquiry into a category of bourgeois society. Translated by Thomas Burger with the assistance of Frederick Lawrence. Cambridge: Polity Press.

Martyn Hammersley, 2010. “Can we re–use qualitative data via secondary analysis? Notes on some terminological and substantive issues,” Sociological Research Online, volume 15 number 1, at accessed 5 September 2013.
doi:, accessed 18 September 2013.

Rasmus Helles, 2013. “Mobile communication and intermediality,” Mobile Media & Communication, volume 1, number 1, pp. 14–19.
doi:, accessed 18 September 2013.

Klaus Bruhn Jensen, 2012. “Introduction: The state of convergence in media and communication research,” In: Klaus Bruhn Jensen (editor). A handbook of media and communication research: Qualitative and quantitative methodologies. Second edition. London: Routledge, pp. 1–19.

Klaus Bruhn Jensen and Rasmus Helles, 2005. “‘Who do you think we are?’ A content analysis of websites as resource for politics, business and civil society,” In: Klaus Bruhn Jensen (editor). Interface://culture: The World Wide Web as political resource and aesthetic form. Frederiksberg: Samfundslitteratur, pp. 93–122.

Petros A. Kostagiolas, Nikolaos Korfiatis, and Marios Poulos, 2012. “A long–tail inspired measure to assess resource use in information services,” Library & Information Science Research, volume 34, number 4, pp. 317–323.
doi:, accessed 18 September 2013.

Klaus Krippendorff, 2004. Content analysis: An introduction to its methodology. Second edition. Thousand Oaks, Calif.: Sage.

Klaus Krippendorff and Mary Angela Bock, 2009. The content analysis reader. Thousand Oaks, Calif.: Sage.

Seth C. Lewis, Rodrigo Zamith and Alfred Hermida, 2013. “Content analysis in an era of big data: A hybrid approach to computational and manual methods,” Journal of Broadcasting & Electronic Media, volume 57, number 1, pp. 34–52.
doi:, accessed 18 September 2013.

Drew A. Linzer and Jeffrey Lewis, 2011. “poLCA: Polytomous variable latent class analysis” (Version R package version 1.3.1), at, accessed 18 September 2013.

Joshua Meyrowitz, 1985. No sense of place: The impact of electronic media on social behavior. New York: Oxford University Press.

Carolyn R. Miller, 1984. “Genre as social action,” Quarterly Journal of Speech, volume 70, number 2, pp. 151–167.

Karen L. Nylund, Tihomir Asparouhov and Bengt O. Muthén, 2007. “Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study,” Structural Equation Modeling, volume 14, number 4, pp. 535–569.
doi:, accessed 18 September 2013.

Zizi Papacharissi, 2009b. “The virtual geographies of social networks: a comparative analysis of Facebook, LinkedIn and ASmallWorld,” New Media & Society, volume 11, numbers 1–2, pp. 199–220.
doi:, accessed 18 September 2013.

Zizi Papacharissi, 2009a. “The virtual public sphere 2.0: The Internet, the public sphere, and beyond,” In: Andrew Chadwick and Philip N. Howard (editors). Routledge handbook of Internet politics. London: Routledge, pp. 230–245.

Zizi Papacharissi, 2002. “The virtual sphere: The Internet as a public sphere,” New Media & Society, volume 4, number 1, pp. 9–27.
doi:, accessed 18 September 2013.

Malin Svenningson Elm, 2009. “How do various notions of privacy influence decisions in qualitative internet research?” In: Annette N. Markham and Nancy K. Baym (editors). Internet inquiry: Conversations about method. Los Angeles: Sage, pp. 69–87.

Raymond Williams, 1977. Marxism and literature. Oxford: Oxford University Press.

JoAnne Yates and Wanda Orlikowski, 2002. “Genre systems: Structuring interaction through communicative norms,” Journal of Business Communication, volume 39, number 1, pp. 13–35.
doi:, accessed 18 September 2013.

JoAnne Yates and Wanda J. Orlikowski, 1992. “Genres of organizational communication: A structurational approach to studying communication and media,” Academy of Management Review, volume 17, number 2, pp. 299–326.


Editorial history

Received 16 September 2013; accepted 17 September 2013.

© Rasmus Helles 2013, All Rights Reserved.

The big head and the long tail: An illustration of explanatory strategies for big data Internet studies
by Rasmus Helles.
First Monday, Volume 18, Number 10 - 7 October 2013


© First Monday, 1995-2019. ISSN 1396-0466.