First Monday

Big data experiments with the archived Web: Methodological reflections on studying the development of a nation's Web by Niels Brügger, Janne Nielsen, and Ditte Laursen



Abstract
This article explores how archived Web sources can be used for historical studies of an entire national Web domain and its development over time. It presents the methodological challenges of large-scale studies using Web archive content and discusses the limitations and potential of this new type of study of Web history. It uses the entire Danish Web domain .dk from 2006 to 2015, as it has been preserved in the Danish national Web archive, as a case to exemplify how ‘a nation’ can be delimited on the Web and how an analytical design for this type of big data analysis using the archived Web can be developed. This includes considering the characteristics of the archived Web as a historical source for academic studies as well as the specific characteristics of the data sources used. Our findings reveal some of the ways in which a nation’s digital landscape can be mapped by examining Web site sizes and hyperlinks, and we focus on discussing how these results shed light on the methodological challenges, reflections and choices that are an integral part of large-scale Web archive studies. The study demonstrates that hardware and software as well as human competences from various disciplines make it possible to perform large-scale historical studies of one of the biggest media sources of today, the World Wide Web.

Contents

Introduction
General methodological reflections
Analytical design
Examples of Danish Web development 2006–2015: Web site size and hyperlinks
Discussion
Concluding remarks and next steps

 


 

Introduction

This article takes some early steps into a new area of study in which the archived World Wide Web (or just: the Web) is used for big data studies of historical developments of the Web. The aim of the article is to conduct the first explorations into how large-scale Web history studies based on the archived Web can be performed, in this case exemplified with a national Web domain. We therefore ask: What are the methodological challenges of using Web archive content to study an entire national Web domain and its development, and what are the limitations and potential of adopting this approach?

Studying the history of national Web domains

Investigating national Web domains dates back to the early 2000s [1]; but these early studies focus primarily on technical issues and only rarely adopt a historical approach. In the mid-2010s, a few studies of national Web domains were published within the social sciences and humanities; but they were mainly based on the online Web, in contrast to the archived Web, which is why these studies are also only rarely historical (e.g., Rogers, et al., 2013; Ben-David, 2014). Parallel to these studies of online national Web domains, a number of historical Web archive-based research projects were launched, and, since then, this emerging field of study has grown, mainly as a result of the establishment or opening up of national Web archives in the same period, such as the U.K., French and Danish Web archives (for information about existing Web archives, see Webster, 2019). One of the first research projects using the archived Web to study the history of a national Web domain was ‘Big data: Demonstrating the value of the UK Web domain dataset for social science research’, which explored trends in the development of the U.K. Web domain (Hale, et al., 2014). This was followed by a number of research projects about national Web domains, including in the U.K. (Cowls, 2017), Canada (Milligan and Smyth, 2019), the Netherlands (WebART; see National Library of the Netherlands, n.d.), France (Schafer and Thierry, 2016; Merzeau, n.d.), and Denmark (Brügger and Laursen, 2019b). Many of these projects are discussed in Brügger and Laursen (2019b), the first-ever edited volume about the Web in different national settings. In addition, in the same period, a few studies of national Web domains were also conducted based on the holdings of the world’s largest Web archive, the transnational Internet Archive (established in 1996) (e.g., the history of the Web domain of former Yugoslavia; Ben-David, 2016) [2].

The study on which the present article is based moves beyond the existing literature in several respects. First, the study has had access to much larger and more diverse amounts of Web archive content than previous studies (an entire national Web domain from 2005 onwards). Second, the study has had specific datasets extracted to allow for different types of analyses (of files, of content on Web pages, of hyperlinks, etc.). Third, the extraction of large-scale datasets has made it possible to develop and test different novel methodological approaches and to inform discussions of major methodological challenges. Finally, on a more practical yet important level, from the outset, the study benefited from software and hardware support by specialist IT developers at the Web archive and was granted access to a high-performance computer cluster that allowed it to run big data analyses.

The present study is relevant for other types of Web studies for two reasons. First, although the scale of the study makes it an extreme case, the methods developed (e.g., delimiting a corpus and removing duplicates/versions) can be scaled either up or down and thus used to study more national Web domains or smaller Web entities, such as Web sites related to a specific event (e.g., political elections, sports events, or terrorist attacks), another geophysical entity such as a state, region or city, or a specific period of time. In addition, the study of an entire national Web constitutes the backcloth against which other Web entities can be seen and contextualized and therefore better understood. For instance, if one wishes to study the development of 1,000 Web sites related to parliamentary elections, it might be relevant to know what a typical Web site in the nation in question has looked like throughout history (in terms of, for example, size, number of images or videos and internal/external hyperlink structure), or what characterizes the overall link structure in which the websites in question are embedded (e.g., whether they are central or marginal in the national hyperlink network). It can be challenging to establish such baseline knowledge in other ways than through large-scale Web archive studies. Consequently, histories of national Webs constitute studies of Web history in their own right, as well as providing background information for future studies (in this latter case, their value remains to be seen).

National Web domains

In a technical sense, the World Wide Web is a global medium, with information published anywhere around the globe being available from anywhere around the globe. However, just because one Web page is only one click away from another, this does not mean that all Web pages are linked to each other, thus making the Web truly global. As this study indicates, link patterns tend to stay mainly within national borders (cf., section Connecting to entities outside of the Danish Web). In the Danish case, Web pages on the .dk Web domain link mostly to other Web pages on .dk.

If one looks at how people use the Web, at least three studies indicate that, to a large extent, the Web is often used as a ‘national Web’. Finnemann, et al. (2009) examined which media were used by Danes and concluded that the Web was generally used as a national (or even local) medium [3]; Schroeder (2018) questioned whether or not the Web was global and concluded that there is no one single Web but rather a “series of clusters: linguistic plus those that develop due to the policies of states and sites promoting shared interests such as economic development strategies” [4]; and Curran, et al. (2013) conducted a comparative study of news Web sites in nine different countries to show that online news is strongly nation-centered [5].

Finally, the vast majority of Web archives in the world operate on a national basis, primarily because in most cases they emerge out of national libraries and therefore strive to collect national Web domains. So they too are founded on the notion of a national Web.

*

Since studying archived national Web domains and their development on a large scale is a novel approach within internet studies, a number of the methodological issues related to this new type of study need to be investigated and discussed in detail. A number of the general methodological themes related to this type of study have been discussed in the literature (Ben-David, 2016, 2014; Brügger, 2018, 2017; Brügger and Laursen, 2019a, 2019b; Halavais, 2000; Hale, et al., 2014; Musso and Merletti, 2016; Rogers, et al., 2013); and therefore the main focus of this article will be placed on how these general themes can be translated into an analytical design, and on giving a few examples of the historical mapping of a nation’s Web domain. First, we will discuss a) the extent to which one can delimit ‘a nation’ on the Web; b) what characterizes the archived Web as a historical source for academic studies, irrespective of the size of the archived Web to be studied; and c) the general characteristics of our data source, the archived Web in the national Danish Web archive Netarkivet. Once this is in place, the article introduces an analytical design for how this type of big data analysis of an entire national Web can be performed, including a more detailed presentation of the data sources that were used and how data was processed to enable the study. This is followed by two examples of how the methods in question can be used: first, we look at whether the national Web is composed of small or large Web sites; and second, we focus on a few ways of studying hyperlinks. Finally, the methodological challenges and the limitations are discussed, followed by a few concluding remarks. Hopefully, this investigation can spur an interest among Internet scholars in conducting big data analyses of the archived Web, including raising awareness about the technical and conceptual complexities of such a venture [6].

 

++++++++++

General methodological reflections

In the following we briefly discuss how a national Web can be delimited, why the archived Web must be the main source for studies of national Webs, and what characterizes the archived Web in general [7].

Delimiting a national Web domain

Although there may be good reasons to talk about national Webs from a Web researcher’s perspective, it is sometimes difficult to determine exactly how ‘a national Web’ can be delimited as a subset of the World Wide Web. There are different ways of doing this.

The most obvious way of delimiting a nation on the Web is to follow the institutionalized nation building that comes with the Internet, namely the system of top-level domain names (TLDs), of which some are attributed to countries, the so-called country code top-level domains (ccTLDs) like .uk, .fr, and .dk for the United Kingdom, France and Denmark, respectively; whereas others like .com, and .org are assigned to more generic entities like ‘commercial’ and ‘organization’ (these are called gTLDs, generic top-level domains). The advantage of identifying a nation on the Web by ccTLD is that this can be automated and performed on a large scale. However, identifying a nation on the Web exclusively by its ccTLD has a number of drawbacks [8]. For instance, a lot of Web material of relevance to a nation is located at gTLDs, and some national ccTLDs are not used much, like .us and .ca (cf. Milligan and Smyth, 2019, regarding Canada’s .ca). In such cases, the Web material to be included in the study has to be identified manually.
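
The automated part of such a ccTLD-based delimitation is simple in principle. The following is a minimal sketch in R (the language used for the analyses presented later in the article); the table of archived URLs, the column names and the (incomplete) list of ccTLDs are illustrative assumptions only:

```r
# Minimal sketch: classifying archived URLs by top-level domain.
# The `pages` table and the (incomplete) ccTLD list are illustrative only.
library(dplyr)

cctlds <- c("dk", "uk", "fr", "de", "se", "no", "us", "ca")

pages <- tibble::tibble(url = c("https://example.dk/forside",
                                "http://example.com/about",
                                "https://ministerium.dk/en/"))

pages_tld <- pages %>%
  mutate(host = sub("^https?://([^/:]+).*$", "\\1", url),  # crude host extraction
         tld  = sub(".*\\.", "", host),                     # last label of the host name
         type = if_else(tld %in% cctlds, "ccTLD", "gTLD/other"))

count(pages_tld, tld, type, sort = TRUE)
```

As noted above, such an automated classification only captures the ccTLD part of a national Web; relevant material on gTLDs still has to be identified by other means.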

The need for the archived Web

Entire national Web domains need to be collected and preserved in a stable form if their history is to be studied. There are two interrelated reasons for this. First, the online Web changes constantly, and therefore the Web of the past is gone if it has not been preserved. Second, the size of the Web makes it challenging to study the online Web on a national scale while it is still online, because it is very likely to change during any such study [9]. Large-scale Web archiving is usually done with Web crawlers, i.e., software that collects material on the Web. Web crawlers work by following hyperlinks based on an initial seed list containing the Web addresses of all the Web pages to be collected; and once the Web crawler has archived these Web pages, it follows the hyperlinks from them, archives the material to which the links lead, and then continues this iterative process as far away from the seed list as specified in the crawl settings.
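
To make the crawling logic concrete, the toy sketch below mimics a seed-based crawl with a hop limit in R. It only illustrates the principle described above, not the far more elaborate crawler software used for large-scale Web archiving; the seed URL is hypothetical.

```r
# Toy illustration of seed-based crawling with a hop limit, not production
# archiving software. The seed URL is hypothetical.
library(rvest)   # read_html(), html_elements(), html_attr()
library(xml2)    # url_absolute()

crawl <- function(seeds, max_hops = 1) {
  queue   <- data.frame(url = seeds, hop = 0, stringsAsFactors = FALSE)
  visited <- character(0)
  while (nrow(queue) > 0) {
    current <- queue[1, ]
    queue   <- queue[-1, , drop = FALSE]
    if (current$url %in% visited) next
    visited <- c(visited, current$url)                        # 'archive' the page
    page <- tryCatch(read_html(current$url), error = function(e) NULL)
    if (is.null(page) || current$hop >= max_hops) next        # stop at the hop limit
    hrefs <- html_attr(html_elements(page, "a"), "href")
    hrefs <- url_absolute(hrefs[!is.na(hrefs)], current$url)
    hrefs <- hrefs[grepl("^https?://[^/]*\\.dk(/|$)", hrefs)] # stay on .dk, as a ccTLD crawl would
    if (length(hrefs) > 0) {
      queue <- rbind(queue, data.frame(url = unique(hrefs), hop = current$hop + 1,
                                       stringsAsFactors = FALSE))
    }
  }
  visited
}

# crawl("https://www.example.dk/", max_hops = 1)
```

Here the `max_hops` parameter plays the role of the crawl settings mentioned above that determine how far from the seed list the crawler is allowed to go.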

The Danish national Web archive Netarkivet was established in 2005, based on a revision of the existing legal deposit law stipulating that one copy of all publicly available material must be preserved by the Royal Danish Library. The revised law includes material in computer networks, which was not covered by the previous law (for more information about Netarkivet and its history, see Laursen and Møldrup-Dalum, 2017).

Since its inception Netarkivet has used three strategies to collect as much of the Danish Web as possible: 1) The broad crawl strategy, archiving all Web material on the Danish ccTLD .dk as well as material outside .dk that can be considered Danish. The broad crawl is usually performed three–four times per year, it takes one–three months to perform, and the material on .dk is identified based on the authoritative domain name list provided by the Danish domain name registrar DK Hostmaster; 2) The selective strategy, preserving a limited number of Web sites on a daily basis (approximately 100 Web sites are selected); and 3) The event strategy, crawling Web activities in relation to three–four events of national interest a year for as long as the event takes.

As the existence of these three strategies indicates, it is not possible to archive an entire national Web domain in all its dimensions as a snapshot. In other words, the Web archive contains different instances of the national Web, depending on which of the three strategies is used. As a consequence, the Danish Web material in Netarkivet is not necessarily the online Danish Web as it looked in the past — it is a certain representation of this Web, instead. Thus, every time the term ‘the Danish Web’ is used in the following, this means ‘the Danish Web as it was archived by and is represented in Netarkivet’. It is important to bear this in mind, because each of the strategies comes with a set of limitations and possible biases (these will be addressed in relation to the analysis where relevant).

General characteristics of the archived Web

Since studies of the development of national Web domains must be based on the archived Web as it is found in national or transnational Web archives, this type of study inherits the challenges stemming from the nature of the archived Web in general (for a detailed account of these, see Brügger, 2018). First, in contrast to digitized materials such as documents, newspapers, radio or television, it is impossible to return to the original because the Web of the past is very likely to have disappeared. Second, due to a combination of curatorial and technical challenges related to the process of archiving the Web, what was collected is probably incomplete, in the sense that not everything that was online may have been archived. Third, the archived Web is likely to be inconsistent in terms of time and space, which is the case with large amounts of Web data in particular. The archived Web is temporally inconsistent because not everything was archived at the same time — as mentioned above, the archiving of the entire Danish Web domain .dk may take one–three months. This means that if what was archived at the beginning of the process links to something that was archived several weeks later, the link source and link target will be from different points in time. The material may also be spatially inconsistent if all the Web sites were not archived at the same depth, for instance if only the front page of a given Web site was archived while other Web sites were archived several levels below the front page. Inconsistencies also occur if something is archived more than once; this happens when, in addition to what was initially intended for archiving, previous or later versions of the same Web entity are collected because hyperlinks from other Web sites point to this material, which is therefore archived several times at different points in time (as will be seen in a later section, this constitutes a great challenge for studies like the one presented here). Finally, it has to be highlighted that with regard to Web crawling in particular, what goes into the Web archive is not the Web as seen in a Web browser, but all the bits and pieces — HTML files, images, video, streaming, feeds, etc. — that are knitted together on the online Web to be presented as a single Web page in the user’s browser.

 

++++++++++

Analytical design

The present study is based on the material in the Danish Netarkivet, and we delimit ‘the Danish Web’ to what was present on the ccTLD .dk as well as the material on other TLDs (ccTLDs as well as gTLDs) that Netarkivet has identified and collected as relevant for the Danish Web. The number of registered domain names on the Danish .dk domain was 629,344 in 2005, 973,456 in 2009, 1,163,250 in 2012, and 1,277,035 in 2015 [10]. The use of Netarkivet is also the main reason why the period investigated starts in 2006, since the first full crawl in Netarkivet is from 2006. We also focus on the general characteristics of the archived Web where relevant for the different phases of the analysis. In the following, the concrete ways of transforming these overall choices into an analytical design are presented.

Working with this amount and complexity of data demands an analytical design that is rigorously and thoroughly thought out to make the analysis manageable. And because of the many methodological choices imposed on the analysis by the nature of the object of study, it is pivotal to present the methodological reflections in a detailed manner. This will also make it easier to replicate the study with material from another national Web archive. We distinguish between three main phases: 1) Extracting, transforming and loading (ETL); 2) Selecting the corpus; and 3) Translating research questions into code.

Extracting, transforming and loading data

Since the data source of the study is the Danish Web as it was archived by Netarkivet, the first step is to make the data available for the study. This step includes three sub-steps: Extracting the material from Netarkivet; transforming it into a format that can be computed on the high-performance computer that was used to make the analyses, the DeIC National Cultural Heritage Cluster at the Royal Danish Library; and, finally, loading the data onto the Cultural Heritage Cluster [11].

However, before the technical ETL process was initiated, the data which was to be extracted was identified by the researchers based on a thorough investigation of curatorial and Web crawl information from Netarkivet. It was decided to extract the first annual broad crawl from each year [12], and to supplement this with some of Netarkivet’s selective crawls if such crawls exist from the same period (these selective crawls include special crawls of ‘Very big sites’, and of ‘Government bodies’). The identification and ETL phases were closely interrelated, since negotiations had to take place on an ongoing basis to determine what could actually be extracted to enable the planned analyses.

Summarizing Møldrup-Dalum (2018), the steps in the ETL process are as follows: a) the data identified by the researchers is described in terms used by Netarkivet, such as harvest names, numbers and types; b) this description is mapped from Netarkivet terms to UNIX file paths to identify the Web archive files containing the actual data; c) a series of Hadoop Streaming jobs are submitted to the Cultural Heritage Cluster for extracting and transforming the data and subsequently storing it on the distributed file system of the Cultural Heritage Cluster. The present study is based on only two types of extracted data: 1) metadata, in this case the so-called crawl.log and seeds.txt files; and 2) hyperlinks from all Web pages. With regard to the former, the two files are produced during Web crawling, and the seeds.txt file contains the Web addresses that constitute the crawl origins, whereas the crawl.log file records all the movements of the Web crawler, what it encounters and tries to collect and when, and with what result [13]. With regard to the latter, hyperlinks were extracted as described below.
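
As an illustration of what working with this metadata can look like, the sketch below reads a crawl.log-style file into R. The whitespace-separated column layout and the column names are assumptions for the purpose of illustration and should be checked against the documentation of the concrete log format before use:

```r
# Sketch of reading a crawl.log-style file produced during Web crawling.
# The column layout and names are assumptions for illustration; check the
# actual log format of the archive in question.
library(readr)
library(dplyr)

crawl_log_cols <- c("timestamp", "status_code", "size_bytes", "url",
                    "discovery_path", "referrer", "mime_type", "thread",
                    "fetch_time", "digest", "source_tag", "annotations")

read_crawl_log <- function(path) {
  read_table(path, col_names = crawl_log_cols,
             col_types = cols(.default = col_character())) %>%
    mutate(status_code = as.integer(status_code),
           size_bytes  = suppressWarnings(as.numeric(size_bytes)))
}

# Hypothetical file names:
# log_2006   <- read_crawl_log("crawl-2006.log")
# seeds_2006 <- read_lines("seeds-2006.txt")   # one seed Web address per line
```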

Selecting and building the final corpus

As mentioned above, the ETL process extracts, transforms and loads the first annual broad crawl from each year from Netarkivet along with selected special crawls. However, because of the way Web crawling works, namely by following hyperlinks, the corpus extracted from each year can only be considered a gross corpus that has to be refined before it can serve as the basis of the analysis. As described briefly above, the challenge with Web crawling is that too many versions of ‘the same’ material may be stored in the archive. The following small example can illustrate this: at the time t1 the Web crawler sets out to crawl the Web sites ‘website1.dk’, ‘website2.dk’ and ‘website3.dk’, based on a seed list of the three Web sites. However, on ‘website3.dk’ there is a link to ‘website4.dk’, which was not in the seed list that was launched at the time t1, but it now goes into the archive because there is a link to it. A week later at the time t2 another crawl job of the broad crawl is initiated, aiming at archiving ‘website4.dk’, ‘website5.dk’, and more. It completes its job, and now ‘website4.dk’ is in the archive twice, although (probably) not as identical copies, because time has passed from t1 to t2. And there may even be a link on ‘website5.dk’ pointing to ‘website1.dk’, which will then be archived once again. As this very small example illustrates, hyperlinks are very likely to cause the same Web material to be archived several times; and as the amount of data grows and the time of archiving stretches over several weeks, it is not hard to imagine the scope of this challenge for studies using large amounts of archived Web material as their data source. In the present project, our analyses clearly showed the size of this problem: approximately 50 percent of the files in a broad crawl had been archived more than once.

This means that if the broad crawls were studied as they were crawled, the analysis would be skewed because some Web entities may be there more than once, and we cannot know how many times each entity is present. To counter this problem, an automated procedure was established that can distinguish between all the material that was on a seed list to be archived in a specific crawl job on the one hand; and all the material that was archived only because a hyperlink pointed to it (without being on the seed list for this job) on the other. We refer to the former as ‘main harvests’ and the latter as ‘by-harvests’, and an algorithm was developed to identify and select the main harvests.

Although establishing such an algorithm may sound like an easy task, it is quite challenging because a number of choices have to be made, each of which may have an impact on the analysis. For instance, the data that was extracted had to be cleaned (e.g., the data fields ‘timestamp’ and ‘fetch_time’ were validated) and transformed into so-called parquet files, a column-based data format that includes a schema for the data (one parquet file for each year). Then a decision has to be made as to which part of the data to include. We decided to filter the data by status code — the response sent from the Web server where the Web material was located when online — including only objects with a status code between 200 and 599 (e.g., ‘200’ is ‘ok’). Priorities also had to be set regarding which selective harvests to include (while excluding the same Web domains from the broad crawl). It also became clear that the seed lists were not always consistent within a broad crawl, because sometimes a domain name was entered in more than one job, although we would expect it to be there only once (we term these ‘false main harvests’). Finally, handling embeds was a challenge. Embeds are Web elements such as an image or a video that is located on a Web server (e.g., youtube.com) from where it can be embedded on Web pages on other Web domains. Strictly speaking, an embedded video is not part of the Web address on the Web crawler’s seed list, and should thus be considered a by-harvest. But we decided to regard embeds as part of the Web page on which they are embedded because from the perspective of Web site owners this is probably the intention, and from a user’s point of view this is probably how it is experienced. However, for technical reasons it was only possible to include embeds that were no more than one step away from the Web page on which they were embedded, which implies that the extracted material may contain more embedded elements than are analyzed here.
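
A much simplified sketch of this selection step is shown below. It assumes per-year tables produced by the ETL step: `records` (one row per archived object, with at least a `url`, a `status_code` and a `harvest_job` identifier) and `seeds` (one row per seed domain per crawl job). The column names and the domain-level matching are our own simplifications; the project's actual algorithm additionally handles false main harvests, selective crawls and embeds as described above.

```r
# Simplified sketch of selecting the 'main harvests': keep only objects with a
# status code between 200 and 599 whose domain was on the seed list of the
# crawl job that archived them. Column names are illustrative assumptions.
library(dplyr)

select_main_harvest <- function(records, seeds) {
  records %>%
    filter(between(status_code, 200, 599)) %>%                 # keep server responses only
    mutate(domain = sub("^https?://([^/:]+).*$", "\\1", url),
           domain = sub("^www\\.", "", domain)) %>%
    semi_join(seeds, by = c("harvest_job", "domain"))           # on the seed list of this job
}

# The complement, i.e., objects archived only because a hyperlink pointed to
# them (the 'by-harvests'), could be obtained with anti_join() instead.
```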

The result of running the selection algorithm on the gross corpus extracted is a refined corpus with only one version of each Web entity. This final corpus will be used as the data source for the analyses in the next step. The consequence of the selection step is that although there is only one version of each Web entity, it is one among others, and there is no guarantee that it is the best one or the one you want to include. However, given the amount of data it is not feasible to go through all versions manually (although it may be possible to develop an algorithm that evaluates versions and suggests which one to choose). The focus is not placed on each individual element when studying big data, but rather on the larger trends. Or as Mayer-Schönberger and Cukier put it in their book Big data: A revolution that will transform how we live, work, and think: “With big data, we’ll often be satisfied with a sense of general direction rather than knowing a phenomenon down to the inch, the penny, the atom” [14]. Nevertheless, this process of selecting and building the corpus may entail biases as to which Web elements were selected to go into the final corpus, thus raising the question of how accurately the corpus reflects what was actually on the online Web in the past. Each of these possible biases may affect the results, which is why we will address them in the following where relevant. Consequently, every time the term ‘the Danish Web’ is used in the following, this phrase should be read as ‘the Danish Web as it was archived by and is represented in Netarkivet’ (cf., above) with the following addition: ‘and as it was selected as an annual corpus’.

Regarding the corpus used for the mapping of hyperlinks, the delimitation of the refined corpus described above was used as a ‘master key’ to extract all the hyperlinks that were present on all the Web pages in the refined corpus (the hyperlinks were extracted from Netarkivet’s full-text index of the entire collection, a Solr index).
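
A sketch of this ‘master key’ filtering is shown below, assuming the extracted links are available as a table `links_raw` with a `source_url` column and the refined corpus as a table `refined` with a `url` column (table and column names are illustrative; in the project itself the links were pulled from the Solr index, which is not shown here):

```r
# Keep only hyperlinks whose source page is part of the refined corpus.
# Table and column names are illustrative assumptions.
library(dplyr)

filter_links_to_corpus <- function(links_raw, refined) {
  semi_join(links_raw, refined, by = c("source_url" = "url"))
}

# links <- filter_links_to_corpus(links_raw, refined)
```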

Translating research questions into code

With the final annual corpora in place, the next step was to translate the research questions into something that can be analyzed by applying computational methods. Large-scale studies of an entire national Web domain are big data studies and cannot be conducted manually but require the use of automated methods (at least in the first steps of the data analysis); so we need to find methods to analyze huge amounts of archived Web material. A Web page is not collected and preserved as one artifact, but rather as a collection of fragments in the form of HTML files (with the source code of the Web page), a variety of other file types (e.g., images, audio, PDFs), metadata etc. All these fragments can be analyzed — individually or in combination — and the methods applied depend on what types are included.

To perform the computational part of the analysis, we used the above-mentioned DeIC National Cultural Heritage Cluster at the Royal Danish Library. The cluster is equipped with Web-based tools that provide easy access to it, including the software packages RStudio and Jupyter Notebooks. To perform the data analyses, we used RStudio, so the things we wanted to study on the Danish Web had to be translated into R code that could be run on the data that had been extracted and selected. To manage and perform the computational analysis, R Markdown files (.Rmd) were created containing the individual scripts needed to run each step of the analysis for each year.
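
As a minimal example of such a translation, the sketch below computes one of the overall metrics mentioned in the next section, the size of the Danish Web in files and bytes per year, from hypothetical per-year parquet files of the refined corpus. The file names, the directory layout and the columns `domain` and `size_bytes` are illustrative assumptions:

```r
# Minimal example of translating a research question ('how big is the Danish
# Web per year, in files and bytes?') into R code. File and column names are
# hypothetical.
library(arrow)
library(dplyr)

years <- 2006:2015

web_size <- lapply(years, function(y) {
  corpus <- read_parquet(sprintf("refined_corpus_%d.parquet", y))
  corpus %>%
    summarise(year        = y,
              n_web_sites = n_distinct(domain),
              n_files     = n(),
              total_tb    = sum(size_bytes, na.rm = TRUE) / 1e12)
}) %>%
  bind_rows()

web_size
```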

 

++++++++++

Examples of Danish Web development 2006–2015: Web site size and hyperlinks

In the research project on which this article is based, a large number of metrics were generated to map the historical development of the Danish Web (materialized in the above-mentioned R scripts), such as the overall size of the entire Danish Web (in TB and number of files) and content types on the Danish Web, in particular the amounts of written text and images. To illustrate what a study of a national Web domain might look like, and to fuel the methodological discussion, we have selected two examples. First we look at the size of Web sites and in particular how Web sites of different sizes are distributed on the Danish Web; and second we look at the extent to which the Danish Web is connected to the Web outside its own Web domain, exemplified through links to other TLDs and social media.

Size of Web sites on the Danish Web

Web sites can be considered one of the main structuring entities of Web content, partly because they mirror the Web’s domain name system [15]. So in order to obtain valuable background knowledge about the characteristics of a national Web domain, it is relevant to have a closer look at what a typical Web site looks like, and how this may have changed over time. This first example looks at distinct Web sites, and we focus on their size to investigate whether Web sites have become smaller or bigger, or whether they have remained the same size. With a view to characterizing typical Web sites year by year, the analysis can be extended by looking at the ratio between written text, images and video, the length of pages and the number of internal/external hyperlinks, just to mention a few parameters.

A first approach to this question would be to calculate the average size of Web sites in terms of number of files and bytes, which shows a development from 375 files per Web site (2006) to approximately 700 (2009–2010) and down to 600 in 2015; and an average size per Web site varying from 10 MB (2006) to 24 MB in 2010 and 33 MB in 2015.

However, an average does not reveal much about the distribution of files and size across distinct Web sites. The question is whether the number of files and the size of Web sites are evenly distributed across the entire Danish Web.

 

Figure 1: Distribution of number of files per Web site.

 

 

Figure 2: Distribution of size of Web sites.

 

Figures 1 and 2 show that they are definitely not evenly distributed. In Figure 1 the distribution of files on Web sites is illustrated by sorting the Web sites by number of files and then dividing the Web sites into percentage groups, followed by a calculation of the average number of files for each percentage group (a similar method is used in Figure 2 regarding size). Both figures clearly show that the Danish Web has the shape of a very long tail: very few distinct Web sites have a very high number of files, and they are also very big in terms of bytes (for example, in 2006 the top one percent contained 52 percent of all files that year), whereas the vast majority of Web sites on the Danish Web are very small with regard to both number of files and bytes. If we then focus on the largest Web sites only (Figure 3), the same picture emerges: even within the top one percent, only approximately five percent of the Web sites are very big. At the top of the long tail, another long tail is nested.

 

Figure 3: Distribution of number of files per Web site on the top one percent largest Web sites (please note that the scale on the y-axis is different from the scale in Figure 1).

 

So we should not expect to find many average Web sites on the Danish Web — they are either very big or very small. And what is also striking is that, by and large, this has not changed from 2006 to 2015 (e.g., in 2006–2015 between 52 percent and 66 percent of the files were located in the top one percent of Web sites). This sort of information is valuable for scholars interested in finding out if the Web sites they are studying (in relation to parliamentary elections, for instance) are placed in the spike or somewhere in the long tail, because this may indicate their weight on the national Web, which can then be the starting point for further investigations.
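
The percentage-group calculation behind Figures 1–3 can be sketched as follows, assuming a per-year table `corpus` with one row per archived file and the (illustrative) columns `domain` and `size_bytes`:

```r
# Sketch of the percentage-group method used in Figures 1-3: aggregate files
# per Web site, sort the Web sites by number of files, divide them into 100
# percentage groups and average within each group. Column names are
# illustrative assumptions.
library(dplyr)

site_distribution <- function(corpus) {
  site_sizes <- corpus %>%
    group_by(domain) %>%
    summarise(n_files = n(),
              size_mb = sum(size_bytes, na.rm = TRUE) / 1e6,
              .groups = "drop")

  site_sizes %>%
    arrange(desc(n_files)) %>%
    mutate(percent_group = ntile(row_number(), 100)) %>%   # group 1 = the largest one percent
    group_by(percent_group) %>%
    summarise(mean_files     = mean(n_files),
              mean_mb        = mean(size_mb),
              share_of_files = sum(n_files) / sum(site_sizes$n_files),
              .groups = "drop")
}

# site_distribution(corpus_2006) %>% filter(percent_group == 1)   # the top one percent
```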

Connecting to entities outside of the Danish Web

The example above was based on one dataset, namely metadata. But as described above, hyperlinks were also extracted from each of the annual corpora. Hyperlinks are one of the defining features of the Web [16]; and for analytical purposes one of the strengths of hyperlinks is that they can show the relations between elements such as Web sites, making it possible to perform network analyses based on hyperlinks to map which Web sites are the most central in the Danish Web (the most central Web sites are usually identified as the ones with the most links pointing to them). But hyperlinks can also be used to map how the Danish Web domain relates to entities outside of the Danish domain; and when doing this the hyperlinks are understood as the possible routes to take on the Danish Web to get out of the Danish national Web domain, which we will illustrate by two examples: links to other national Web domains, and links to social media.

Before presenting these examples, it is worth noting that the total number of hyperlinks on the Danish Web increased from approximately 2.5 billion in 2006 to 10.5 billion in 2010, since when the total number of links has remained at the same level (10–11 billion). These numbers include all links, but in the following we only focus on Web site-external links in order to study links between different TLDs. When investigating how linked the Danish Web is to other national Web domains, it is relevant to start by asking how national the Danish Web is, which in a study of hyperlinks can be translated into how often Web sites on the .dk Web domain link to other Web sites on the .dk domain.

 

Figure 4: Number of ingoing links to the top four TLDs from ccTLD .dk.

 

Figure 4 shows that between one-third and one-half of the Web site-external hyperlinks from the ccTLD .dk point to the .dk domain itself, around one-third point to the gTLD .com, and the rest point to other TLDs. This might indicate that the Danish Web domain is linked to the Web outside of .dk, and thus is not only national (i.e., linking to itself). However, a closer look at the Web sites on .com to which the many hyperlinks on .dk point indicates two major trends: first, the vast majority of these links point to what could be termed ‘Web infrastructure Web sites’, i.e., Web sites that help run Web sites on the Danish Web such as mysql.com, phpbb.com, google.com, blogspot.com, blogger.com, adobe.com or addthis.com. Second, Web sites on the Danish Web domain link to social media Web sites like facebook.com or twitter.com, where it can be difficult to determine whether the links point to Danish pages on these social media sites or not. These trends develop differently over time: the first trend is most prevalent in the early years from 2006 onwards, whereas from about 2010 the second trend becomes more common. However, the overall picture is that to a very large extent the Danish Web domain links to itself, which may indicate that the Danish Web is indeed a nationally oriented Web.
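
A count of the kind shown in Figure 4 can be sketched as follows, assuming a link table `links` with the (illustrative) columns `source_url` and `target_url`, for instance the table produced by the ‘master key’ filtering sketched earlier:

```r
# Sketch of counting Web site-external links from the Danish Web by the TLD of
# the link target, as in Figure 4. Table and column names are illustrative,
# and hosts are compared directly as a simplification.
library(dplyr)

host_of <- function(url)  sub("^https?://([^/:]+).*$", "\\1", url)
tld_of  <- function(host) sub(".*\\.", "", host)

count_target_tlds <- function(links) {
  links %>%
    mutate(source_host = host_of(source_url),
           target_host = host_of(target_url)) %>%
    filter(source_host != target_host) %>%                    # Web site-external links only
    count(target_tld = tld_of(target_host), sort = TRUE)
}

# count_target_tlds(links) %>% slice_head(n = 5)   # the five most linked-to TLDs
```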

 

Figure 5: Number of hyperlinks from the Danish Web to the top five other ccTLDs.

 

Figure 5 illustrates the ccTLDs other than .dk to which the Danish Web domain is connected by its hyperlinks, and how this has developed. As can be seen, there is a clear trend over the ten-year period: the Danish Web is closely linked to the four nearest neighboring countries — Germany (.de), Norway (.no), Sweden (.se), and the UK (.uk). The number of links fluctuates over the years, but all four countries are present all the time. Another, more surprising, observation is that in 2008–2010 the number of hyperlinks from the Danish Web to the Polish Web skyrocketed, making .pl the ccTLD with the highest number of links in each of these years. We have looked more closely into which Web sites link to Poland; but we have not found any explanation for this increase so far, so more investigation is needed here.

 

Figure 6: Number of hyperlinks from the Danish Web to social media.

 

The second example investigates the links from the Danish Web to social media; and as can be seen in Figure 6, hyperlinks pointed mainly to Flickr and MySpace from 2006 to 2009, after which Facebook, Twitter and YouTube took over and continued to increase until 2015, when a slight decrease can be seen. If we look at image-based social media, Pinterest was present before Instagram; but in 2013–14 Instagram became bigger than Pinterest. Thus, Figure 6 shows the development of the number of links from the Danish Web to different social media Web sites. This analysis can be extended by mapping the Web sites doing the linking, in other words: which Web sites on the Danish Web link to each of the social media, and how has this evolved? A small test of this type of investigation shows that links pointing to MySpace and Facebook come from very different Web sites: among the top ten linking Web sites there is not one that links to both MySpace and Facebook, which indicates that if you link to MySpace you come from a different part of the Danish Web than if you link to Facebook.
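
The social media counts in Figure 6, and the small test of which Web sites do the linking, can be sketched along the same lines. The platform list below is illustrative, and the `links` table and the `host_of()` helper are reused from the sketch above:

```r
# Sketch of counting links to social media platforms and of listing the top
# linking Web sites per platform. Platform list and column names are illustrative.
library(dplyr)
library(stringr)

platforms <- c(facebook = "facebook.com", twitter   = "twitter.com",
               youtube  = "youtube.com",  myspace   = "myspace.com",
               flickr   = "flickr.com",   instagram = "instagram.com",
               pinterest = "pinterest.com")

social_links <- lapply(names(platforms), function(p) {
  links %>%
    mutate(source_host = host_of(source_url),
           target_host = host_of(target_url)) %>%
    filter(str_ends(target_host, fixed(platforms[[p]]))) %>%
    mutate(platform = p)
}) %>%
  bind_rows()

count(social_links, platform, sort = TRUE)          # links per platform (cf., Figure 6)

social_links %>%                                    # top ten linking Web sites per platform
  count(platform, source_host, sort = TRUE) %>%
  group_by(platform) %>%
  slice_head(n = 10)
```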

 

++++++++++

Discussion

The six figures above can all be considered part of a mapping of what the Danish Web has looked like and how it has evolved. This type of study comes with a number of challenges, some of which have already been briefly touched upon, but it also opens up a range of new analytical perspectives. In the following we will discuss the limitations and the potential of this type of study of Web history.

As with any other historical study, one major challenge is dependency on the available sources. To use a phrase coined by the American historian Roy Rosenzweig, historians who study national Web territories face both scarcity and abundance at the same time (Rosenzweig, 2003). In some countries the national ccTLD has not been collected and preserved, and there may simply not be enough material to map the national Web. Even if a national ccTLD has been archived by a national Web archive as in the Danish case, one can never be sure that what is in the archive is what one would have liked to include, since there may be some ‘Danish Web’ on other Web domains, in particular gTLDs. As mentioned above, this Web material is tracked down manually and it is very difficult to evaluate the extent to which it covers the national Danish Web domain. In particular, social media Web sites are a challenge, since it can be difficult to identify what could be referred to as the Danish part of Facebook, for instance. In addition, even when the collection of a national Web is based on the authoritative domain name list from the official registrar of domain names, what goes into the Web archive is not an accurate picture, simply because archiving an entire national Web takes time, two to three months, during which period Web domain names are very likely to come and go [17]. Thus, compared to the online Danish Web, the archived Web is a somewhat scarce source.

On the other hand, the archived Web may also be characterized as abundant. As outlined above, each Web page is frequently in the archive more than once as a consequence of the archiving process — as mentioned above, approximately 50 percent of the material in a broad crawl in Netarkivet is present more than once — and therefore only one copy of each has to be selected. But owing to the scale of this study, it is difficult to know, or even evaluate, the extent to which the best Web entity has been selected. Nevertheless, selections have to be made to avoid the potential and unsystematic bias of having too many versions of a Web entity. Consequently, a large-scale study of a Web domain based on material in a Web archive is always a two-step mapping process: first, a map of the national Web domain is drawn by the Web archive when the online Web is represented in the archive during the archiving process; then another map is drawn on the basis of the first map, namely the subset of the archived Web that is selected and used as the basis of the detailed mappings of specific parts of the national Web (size, content type, hyperlinks, etc.), which we have termed the refined corpus above.

Thus, one of the major challenges when using the archived Web as an object of study is to navigate between the Scylla of scarcity and the Charybdis of abundance. No matter how this is done, the mapping of the national Web domain will largely mirror the national Web archive’s way of delimiting the national Web, which is why a nation’s Web tends to become an effect of the national boundaries of the national Web archive on which it is based. Naturally, dependency on the available sources is a general historiographical challenge; but it comes in new forms with the archived Web, and we need new methods to understand and handle it.

On a more detailed level, the choices made in our analytical design have an effect on the results. In studies based on metadata in which the number of files is calculated — as is the case with the first example above about the size of Web sites — it is imperative to remove duplicates to avoid skewing the results. But as mentioned above, it is difficult to evaluate whether the version that is retained as part of the refined corpus is the best version or the version that you want to include. In these cases, methods have to be developed to inform the choice of versions. In the study of hyperlinks we chose the approach that is least affected by the temporal inconsistency mentioned above: the hyperlinks archived at the beginning of the archiving process may link to something that was archived several weeks later. Our example of hyperlinks focused purely on how many links point to a specific top-level domain or a specific Web site (in our case social media). But if we had chosen to make a link graph that included both ingoing and outgoing hyperlinks of all Web sites on the Danish Web, our results would have been affected by this temporal inconsistency between link source and link target. Such a study should include rigorous reflections on how to tackle this challenge — either by shortening the time span to limit the amount of material to be included, or by keeping the time span and thereby accepting the greater potential for temporal inconsistency.

Finally, since the aim of this study has been to investigate some of the methods of facilitating studies of a nation’s digital domain, we have not tried to explain the findings by embedding them in a wider historical context. But it is obvious that this would be a relevant next step (cf., Schafer and Thierry, 2019) to answer questions such as: Why has the Danish Web developed as it has? What could be the driving forces or constraints behind these developments? For instance, we expect that the developments in Web design and technical potential (both when creating and archiving Web sites) can help explain some of the changes — but what are the specific consequences, for what and when? What role does the spread of blogs, comment fields, and social media plugins (for instance) play? Do changes in Web performance like page load speed influence the use of specific MIME types? Questions like these can help us understand how to interpret the fluctuations over time.

Despite the challenges, studying an entire national Web domain also contains a lot of promise. Just as geography in the geophysical world is the environment for social, economic, cultural and individual actions, the national online Web is the backcloth on which the nation’s Web activities take place, whether this involves politics, sociality, culture or other issues. To fully understand these activities, it is imperative to relate them to the developments of the wider picture constituted by the entire national Web.

In addition, if one wants to understand how the online Web co-exists with the off-line world, studies of the online Web are key. Despite the challenge of substantiating claims about how the online and the off-line are entwined, such studies must rely on a broad understanding of the development of the online Web. So knowledge about the nation online can be used as an element in (historical) studies of the offline world — for instance, studies of hyperlinks going out from a national Web domain could be included in studies of immigration or trade; or a politician’s position (e.g., numbers of inlinks to that person’s Web site or profile on social media) might be correlated with success in an election, just to mention a couple of examples.

Finally, mappings of a national Web domain can constitute a generative element in formulating new research questions. Looking at one Web site at a time does not reveal the trends and developments of a Web domain. This is because these trends only become visible in a large-scale analysis; and once they have been identified they may be a cause for wonder, and thereby for new research questions. For instance, as indicated above, it is somewhat surprising that the Danish Web was so closely connected to the Polish Web in 2008–2010. Why was this?

It is also worth debating whether the archived Web is such a challenging object of study that it cannot serve as a reliable source. However, in this connection let us not forget that in many (if not most) historical studies the status of the available sources and of their potential use is always questioned. If we want to study a national Web domain, we may not have the material we really wanted, and it may be more complex and complicated than we would have liked it to be, but nevertheless this is probably still the best source we can imagine for such a study. No doubt historians and archaeologists would have liked to listen to Plato’s speeches on tape and actually see an entire Greek household as it stood 2,500 years ago. But they have managed to study these topics through written copies and potsherds all the same. What is important is to accompany the use of such sources with a high degree of methodological reflection, detailing each step and the potential consequences of all choices with a view to making choices which are as informed as possible. When it comes to large-scale Web archive studies, we hope that this article has contributed to such reflections.

 

++++++++++

Concluding remarks and next steps

The aim of this article has been to launch a methodological discussion about how to perform large-scale historical studies of one of the biggest media forms of today, the archived World Wide Web.

The project on which the article is based has shown that with the advent of digital sources such as the archived Web, Internet scholars have to develop new skills and competences that are in line with the digital source environment, with well-known tasks such as source criticism occurring in new forms (cf., Brügger, 2018). This is not to say that digital sources and computational methods will replace more familiar sources and methods, but they definitely open up an array of new research questions that can now be asked and in most cases also answered. One such question is how an entire national Web domain has developed.

New avenues for Internet studies?

Studying an entire national Web domain and its development is not a trivial task. Internet scholars have to enter uncharted waters with regard to the sources and the methods as well as their collaborators. Based on the experience gained during this project, what is needed to perform analyses of an entire national Web domain is a digital research infrastructure, including hardware and software as well as human competences. To unlock this new type of source material, it is pivotal that Internet historians have access to high-performance computing facilities, and that they collaborate with specialist IT developers and curators with the relevant domain knowledge about the archived Web and specific Web collections. In brief, this study was only possible thanks to interdisciplinary collaborations of this nature.

When the right research infrastructure is at hand, it is possible to make large-scale analyses of the digital cultural heritage that may be of great value for anyone studying our culture and its development since the mid-1990s. Many scholars within the humanities or the social sciences can benefit from knowledge about the Web, and they can add a new dimension to their studies by embedding these studies in a larger national digital environment, the national Web.

Future studies

As mentioned above, the results of this study are based on only two of the types of data that can be extracted from the archived Web in the Danish Web archive Netarkivet: metadata and hyperlinks extracted from all Web pages. Possible next steps could involve studying a large number of the other entities that can be extracted from Netarkivet.

Based on the algorithm of selecting only a single copy of each file, one can extract all the written words of the body text, all images, video or audio, all Web trackers and information about shopping baskets. Preliminary analyses have already been performed of the written words of the body text of the entire Danish Web, and a simple calculation of word frequencies without removing stop words showed that the words used most frequently were ‘for’, ‘med’ and ‘til’ (‘for’, ‘with’ and ‘to’), which — as in English — are among the commonest of words. Future work will delve deeper into textual analysis, for instance using topic modelling, as well as studying (for instance) the different languages in use on the Danish Web.
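
A word-frequency count of this kind can be sketched as follows, assuming a character vector `body_text` with the extracted body text of the Web pages (one element per page); the input name is illustrative:

```r
# Minimal sketch of a word-frequency count over extracted body text, without
# removing stop words. The input `body_text` is assumed to be a character
# vector with one element per Web page.
library(dplyr)
library(stringr)

word_frequencies <- function(body_text) {
  tibble::tibble(text = str_to_lower(body_text)) %>%
    mutate(word = str_extract_all(text, "[[:alpha:]æøå]+")) %>%
    tidyr::unnest(cols = word) %>%
    count(word, sort = TRUE)
}

# word_frequencies(body_text) %>% slice_head(n = 10)   # 'for', 'med', 'til', ...
```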

Despite the differences between Web archives and the challenges of studying even a single Web archive, transnational studies of Web domains would constitute a new field of study (as outlined in Brügger, 2019). This would make it possible to compare trends in Web development across nations.

In conclusion, this article has shown that it is possible to conduct a type of historical study that can add new dimensions to the array of familiar source types and methods, and hopefully this approach will prove useful in taking Internet studies one step further. End of article

 

About the authors

Niels Brügger is Professor in Media Studies, Head of NetLab, part of the Danish Digital Humanities Lab, and head of the Centre for Internet Studies at Aarhus University in Denmark.
E-mail: nb [at] cc [dot] au [dot] dk

Janne Nielsen is Assistant Professor at NetLab and board member of the Centre for Internet Studies at Aarhus University.
E-mail: janne [at] cc [dot] au [dot] dk

Ditte Laursen is responsible for acquisition of digitally born cultural heritage materials, long-term preservation of digital heritage collections and for access to digital cultural heritage collections at the Royal Danish Library.
E-mail: dila [at] kb [dot] dk

 

Acknowledgements

The authors thank Per Møldrup-Dalum, Ulrich Karstoft Have and Thomas Egense for their invaluable help with the project on which this article is based. We also thank the DeIC National Cultural Heritage Cluster at the Royal Danish Library, as well as NetLab/DIGHUMLAB for their support.

 

Notes

1. Cf., the overviews in Baeza-Yates, et al., 2007, pp. 3–5; Brügger, 2017, p. 62.

2. Recently, Web archive-based studies of national Web domains have found a home in the international research network WARCnet (Web ARChive studies network researching Web domains and events), funded by the Independent Research Fund Denmark (see warcnet.eu).

3. Finnemann, et al., 2009, pp. 31–35.

4. Schroeder, 2018, pp. 121–122.

5. Curran, et al., 2013, pp. 887–891.

6. The article is based on the research project ‘Probing a nation’s Web domain: The development of the Danish Web 2005–2015’. The first phases of this project have been highly explorative, and one of the aims was to become familiar with the source material, including developing the necessary methods to unlock the material and to make the first investigations of large amounts of this new type of digital cultural heritage, the archived Web (for a brief research history of studies of national Web domains, see Brügger and Laursen, 2019a, pp. 415–416).

7. For more unfolded reflections on all three points, see Brügger and Laursen, 2019a, pp. 417–421; Brügger, 2018.

8. Cf., Brügger and Laursen, 2019a, pp. 417–419.

9. Brügger, 2018, pp. 75–77.

10. Brügger, et al., 2017, p. 69.

11. The DeIC Cultural Heritage Cluster is a physical and virtual hybrid HPC infrastructure with nine Dell PowerEdge R730 compute and data nodes with a total of 324 cores, 2.3TB RAM and 288TB storage.

12. Except for 2011 and 2012, where we have chosen the second broad crawl because previous studies showed anomalies here due to changes in harvesting settings.

13. It is worth noting that during the archiving process the Web crawler sets up a so-called ‘frontier queue’ to maintain the internal state of the crawl, e.g., to check if a URL has already been visited. Depending on the settings of the frontier, everything that happens in the ‘frontier queue’ may not be recorded in the crawl.log file. Consequently, a given Web entity on the online Web may not have left a trace during the archiving if it was discarded in the ‘frontier queue’ and was thus not recorded in the crawl.log file. The consequence of this is that the online Web may be composed of far more files than the crawl.log file reports, which may bias some of the analysis because numbers of files are calculated on the basis of the metadata in the crawl.log file.

14. Mayer-Schönberger and Cukier, 2013, p. 13.

15. For a definition of Web site, see Brügger, 2018, p. 34.

16. Cf., Brügger, 2018, pp. 28–30.

17. The fact that this is actually the case has been substantiated in Brügger, et al., 2017, pp. 72–76.

 

References

A. Ben-David, 2016. “What does the Web remember of its deleted past? An archival reconstruction of the former Yugoslav top-level domain,” New Media & Society, volume 18, number 7, pp. 1,103–1,119.
doi: https://doi.org/10.1177/1461444816643790, accessed 10 February 2020.

A. Ben-David, 2014. “Mapping minority webspaces: The case of the Arabic webspace in Israel,” In: D. Caspi and N. Elias (editors). Ethnic minorities and media in the Holy Land. London: Vallentine Mitchell, pp. 137–157.

N. Brügger, 2019. “A national Web Trend Index,” In: N. Brügger and D. Laursen (editors). The historical Web and digital humanities: The case of national Web domains. London: Routledge, pp. 178–187.
doi: https://doi.org/10.4324/9781315231662, accessed 10 February 2020.

N. Brügger, 2018. The archived Web: Doing history in the digital age. Cambridge, Mass.: MIT Press.

N. Brügger, 2017. “Probing a nation’s Web domain: A new approach to Web history and a new kind of historical source,” In: G. Goggin and M. McLelland (editors). Routledge companion to global Internet histories. New York: Routledge, pp. 61–73.
doi: https://doi.org/10.4324/9781315748962, accessed 10 February 2020.

N. Brügger and D. Laursen, 2019a. “Historical studies of national Web domains,” In: N. Brügger and I. Milligan (editors). Sage handbook of Web history. London: Sage, pp. 413–427.
doi: http://dx.doi.org/10.4135/9781526470546, accessed 10 February 2020.

N. Brügger and D. Laursen (editors), 2019b. The historical Web and digital humanities: The case of national Web domains. London: Routledge.

N. Brügger, D. Laursen and J. Nielsen, 2017. “Exploring the domain names of the Danish Web,” In: N. Brügger and R. Schroeder (editors). The Web as history: Using Web archives to understand the past and the present. London: UCL Press, pp. 62–80, and at https://www.ucldigitalpress.co.uk/Book/Article/45/70/3430/, accessed 10 February 2020.
doi: https://doi.org/10.14324/111.9781911307563, accessed 10 February 2020.

J. Cowls, 2017. “Cultures of the UK Web,” In: N. Brügger and R. Schroeder (editors). The Web as history: Using Web archives to understand the past and the present. London: UCL Press, pp. 220–237.
doi: https://doi.org/10.14324/111.9781911307563, accessed 10 February 2020.

J. Curran, S. Coen, T. Aalberg, K. Hayashi, P.K. Jones, S. Splendore, S. Papathanassopoulos, D. Rowe and R. Tiffen, 2013. “Internet revolution revisited: a comparative study of online news,” Media, Culture & Society, volume 35, number 7, pp. 880–897.
doi: https://doi.org/10.1177/0163443713499393, accessed 10 February 2020.

N.O. Finnemann, P. Jauert, J.L. Jensen, K.K. Povlsen and A.S. Sørensen, 2012. “The media menus of Danish internet users 2009,” at https://comm.ku.dk/phd/?pure=en%2Fpublications%2Fthe-media-menus-of-danish-internet-users-2009(7ac0e756-6479-41c3-a80c-f911b565f230).html, accessed 23 December 2019.

A. Halavais, 2000. “National borders on the World Wide Web,” New Media & Society, volume 2, number 1, pp. 7–28.
doi: https://doi.org/10.1177/14614440022225689, accessed 10 February 2020.

S.A. Hale, T. Yasseri, J. Cowls, E.T. Meyer, R. Schroeder and H. Margetts, 2014. “Mapping the UK Webspace: Fifteen years of British universities on the Web,” WebSci ’14: Proceedings of the 2014 ACM Conference on Web Science, pp. 62–70.
doi: https://doi.org/10.1145/2615569.2615691, accessed 10 February 2020.

D. Laursen and P. Møldrup-Dalum, 2017. “Looking back, looking forward: 10 years of development to collect, preserve, and access the Danish Web,” In: N. Brügger (editor). Web 25: Histories from the first 25 Years of the World Wide Web. New York: Peter Lang, pp. 207–227.
doi: https://doi.org/10.3726/b11492, accessed 10 February 2020.

V. Mayer-Schönberger and K. Cukier, 2013. Big data: A revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt.

L. Merzeau, B. Thierry and V. Schafer, n.d. “ASAP — Archives sauvegarde attentats Paris,” at https://asap.hypotheses.org, accessed 23 December 2019.

I. Milligan and T.J. Smyth, 2019. “Studying the Web in the shadow of Uncle Sam: The case of the .ca domain,” In: N. Brügger and D. Laursen (editors). The historical Web and digital humanities: The case of national Web domains. London: Routledge, pp. 45–63.
doi: https://doi.org/10.4324/9781315231662, accessed 10 February 2020.

M. Musso and F. Merletti, 2016. “This is the future: A reconstruction of the UK business Web space (1996–2001),” New Media & Society, volume 18, number 7, pp. 1,120–1,142.
doi: https://doi.org/10.1177/1461444816643791, accessed 10 February 2020.

National Library of the Netherlands, n.d. “WebART: Enabling scholarly research in the KB Web Archive,” at https://www.kb.nl/en/organisation/research-expertise/research-on-digitisation-and-digital-preservation/webart-enabling-scholarly-research-in-the-kb-web-archive, accessed 23 December 2019.

R. Rogers, E. Weltevrede, E. Borra and S. Niederer, 2013. “National Web studies: The case of Iran online,” In: J. Hartley, J. Burgess, and A. Bruns (editors). A companion to new media dynamics. Oxford: Blackwell, pp. 142–166.
doi: https://doi.org/10.1002/9781118321607.ch8, accessed 10 February 2020.

R. Rosenzweig, 2003. “Scarcity or abundance? Preserving the past in a digital era,” American Historical Review, volume 108, number 3, pp. 735–762.
doi: https://doi.org/10.1086/ahr/108.3.735, accessed 10 February 2020.

V. Schafer and B.G. Thierry, 2019. “Web history in context,” In: N. Brügger and I. Milligan (editors). Sage handbook of Web history. London: Sage, pp. 59–72.
doi: http://dx.doi.org/10.4135/9781526470546, accessed 10 February 2020.

V. Schafer and B. Thierry, 2016. “The ‘Web of pros’ in the 1990s: The professional acclimation of the World Wide Web in France,” New Media & Society, volume 18, number 7, pp. 1,143–1,158.
doi: https://doi.org/10.1177/1461444816643792, accessed 10 February 2020.

R. Schroeder, 2018. Social theory after the Internet: Media, technology, and globalization. London: UCL Press.
doi: https://doi.org/10.14324/111.9781787351226, accessed 10 February 2020.

P. Webster, 2019. “Existing Web archives,” In: N. Brügger and I. Milligan (editors). Sage handbook of Web history. London: Sage, pp. 30–41.
doi: http://dx.doi.org/10.4135/9781526470546, accessed 10 February 2020.

 


Editorial history

Received 11 December 2019; revised 8 January 2020; accepted 17 January 2020.


CC0
To the extent possible under law, Niels Brügger, Janne Nielsen, and Ditte Laursen have waived all copyright and related or neighboring rights to this paper.

Big data experiments with the archived Web: Methodological reflections on studying the development of a nation’s Web
by Niels Brügger, Janne Nielsen, and Ditte Laursen.
First Monday, Volume 25, Number 3 - 2 March 2020
https://firstmonday.org/ojs/index.php/fm/article/download/10384/9396
doi: http://dx.doi.org/10.5210/fm.v25i3.10384