Preserving Government and Political Information: The Web–at–Risk Project

by Valerie D. Glenn



Contents

Web harvesting — what and why?
Web–at–Risk
Sample collections
Web harvesting — issues involved
Harvesting tools
Harvesting services

 


 

Today I will be discussing the Web–at–Risk project to harvest and preserve born–digital government and political information. During this presentation I’m hoping to provide an overview of Web harvesting – specifically, what it is and why you or your institution would want to do it. I’m also going to describe the Web–at–Risk project and its activities, discuss the issues involved in harvesting born–digital materials, give examples of some harvesting tools and services, and point you to a few resources for more information.

 

++++++++++

Web harvesting — what and why?

First, what is Web harvesting? It is the automated capture of Web–published material, sometimes referred to as born–digital information. These files can be PDFs, HTML pages, Windows Media files, GIFs, JPEGs, and so on. Now, why would you want to harvest this material? There are several reasons: to capture material in danger of disappearing, to capture a particular event or moment in time, and/or to build a collection of similar or related materials. (A small sketch of what an automated capture looks like follows the examples below.)

  1. To capture materials in danger of disappearing. We know that the Web is not the most stable environment – information appears and disappears, or is modified, all of the time. In terms of government and political information, a change in administration or an election win or loss can mean a Web site scrubbed of certain publications or taken down completely. The CyberCemetery at the University of North Texas (http://govinfo.library.unt.edu) is a good example of this – they capture the Web sites of federal government commissions or agencies that are going out of business, and house the files on their servers to ensure continued access to them.

  2. To capture a particular event, or moment in time. At the end of President Bush’s first term in office, the National Archives conducted a harvest of all .gov and .mil Web sites (http://webharvest.gov/collections/peth04/). They also conducted an end–of–session harvest at the close of the 109th Congress (http://webharvest.gov/collections/congress109th/). In the aftermath of Hurricane Katrina, numerous blogs, Web sites, and the like sprang up. The Web–at–Risk team used this opportunity to test their harvest tools and to ensure that the sites were captured for posterity.

  3. To build a collection of similar or related materials. Researchers may be interested in water management policies or flood control in particular areas, and how they may change over time. Or a library could be interested in capturing the Web sites of political parties in a particular country or region.
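As promised above, here is a minimal sketch of what an automated capture can look like, using nothing more than Python 3 and its standard library. The URL and output directory are placeholders for illustration; this is not one of the Web–at–Risk tools.

    # Minimal sketch of an automated capture of a single Web-published file.
    # The URL and output directory below are hypothetical placeholders.
    import os
    import urllib.request
    from datetime import datetime, timezone
    from urllib.parse import urlparse

    def capture(url, out_dir="harvest"):
        """Fetch one URL and save the bytes with a capture timestamp."""
        os.makedirs(out_dir, exist_ok=True)
        with urllib.request.urlopen(url, timeout=30) as response:
            content = response.read()
            content_type = response.headers.get("Content-Type", "unknown")
        # Name the file after the host and the capture time so repeated
        # harvests of the same page do not overwrite one another.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        host = urlparse(url).netloc.replace(":", "_")
        path = os.path.join(out_dir, f"{host}-{stamp}.capture")
        with open(path, "wb") as f:
            f.write(content)
        print(f"Captured {url} ({content_type}, {len(content)} bytes) -> {path}")

    if __name__ == "__main__":
        capture("http://example.com/")  # placeholder URL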

 

++++++++++

Web–at–Risk

The Web–at–Risk project is funded by a grant from the National Digital Information Infrastructure and Preservation Program (NDIIPP); the project partners are the California Digital Library, the University of North Texas, and New York University. The purpose of the project is to build tools that will allow librarians to ‘capture, curate and preserve Web–based government and political information.’ While the project team includes programmers, developers, and curators, among others, today I’m going to focus on the role of curators.

The Web–at–Risk curators are librarians who are familiar with the content. Our role is to develop collection plans and to test the capture tools built by the Web–at–Risk team. The collection plans are important in determining the scope and content of the harvest – they are where decisions are made that can affect the entire project. More on those in a moment.

All of the Web–at–Risk collection plans are available from the Web–at–Risk Wiki, at http://wiki.cdlib.org/WebAtRisk/tiki-index.php?page=WebCollectionPlans.

 

++++++++++

Sample collections

Three examples of collections identified by Web–at–Risk curators are:

The CyberCemetery at the University of North Texas (http://govinfo.library.unt.edu): this collection comprises the Web sites of federal agencies that no longer exist. Here, entire Web sites are captured once, just before agencies or commissions shut their doors.

The UCLA Online Campaign Literature Archive (http://digital.library.ucla.edu/campaign/): this site provides access to captured Web sites from California “local, state, and federal offices, and ballot measures affecting the Los Angeles area.”

The Islamic and Middle Eastern Political Web, from Stanford University: while this collection is not yet available online, its plan shows how and why a group of sites with a similar focus – the Web sites of political parties inside and outside the borders of Islamic and Middle Eastern countries – would be of interest to researchers and worth keeping for research purposes.

 

++++++++++

Web harvesting — issues involved

When creating the collection plan, there are several decisions that curators must make. First, content must be identified. What do you want to capture? Will the collection be based on content, publisher, or subject – or on a combination of those?

Another consideration is the depth of the capture. How much information should be captured? An entire Web site, or only certain levels? If an entire Web site is to be harvested, should any external links be included in the capture?
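To show how those depth and external–link decisions translate into practice, here is a short sketch of a depth–limited crawl, again in plain Python. It is only an illustration with placeholder values (seed URL, depth limit), not the capture tool the Web–at–Risk project is building.

    # Sketch of how capture depth and external links become crawl parameters.
    # The seed URL and limits are hypothetical.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkExtractor(HTMLParser):
        """Collect the href values of anchor tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_depth=2, follow_external=False):
        """Breadth-first fetch starting at seed, stopping after max_depth levels."""
        seed_host = urlparse(seed).netloc
        seen = {seed}
        frontier = [(seed, 0)]
        while frontier:
            url, depth = frontier.pop(0)
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception as err:
                print(f"skipped {url}: {err}")
                continue
            print(f"captured depth={depth} {url}")
            if depth >= max_depth:
                continue  # honor the depth decision from the collection plan
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)
                if not link.startswith("http"):
                    continue
                # The external-link decision: keep or drop links off the seed host.
                if not follow_external and urlparse(link).netloc != seed_host:
                    continue
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))

    if __name__ == "__main__":
        crawl("http://example.com/", max_depth=1, follow_external=False)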

Number and frequency of captures
How often should the information be captured? Is this a one–time snapshot, or should it be captured periodically in order to harvest new or changed information?
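One simple way to support a periodic capture schedule is to compare a checksum of the live page against the previous capture and only re–harvest when something has changed. The following is a rough sketch with assumed file names, not a feature of any particular harvesting tool.

    # Sketch: decide whether a periodic re-capture is needed by comparing a
    # checksum of the live page against the previous capture.
    # The URL and the state file name are hypothetical.
    import hashlib
    import os
    import urllib.request

    def needs_recapture(url, state_file="last_hash.txt"):
        """Return True if the page content has changed since the last check."""
        with urllib.request.urlopen(url, timeout=30) as resp:
            current = hashlib.sha256(resp.read()).hexdigest()
        previous = None
        if os.path.exists(state_file):
            with open(state_file) as f:
                previous = f.read().strip()
        with open(state_file, "w") as f:
            f.write(current)
        return current != previous

    if __name__ == "__main__":
        if needs_recapture("http://example.com/"):
            print("Content changed (or first run): schedule a new capture.")
        else:
            print("No change detected: skip this capture cycle.")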

Permissions
Simply because files are made available on the Web does not mean that there are no copyright concerns. Does the site owner or publisher need to be contacted for permission before you harvest the content you’ve identified?
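Seeking permission is a human and legal step, but a related mechanical courtesy many harvesting workflows also perform is consulting a site’s robots.txt before crawling. Here is a brief sketch using Python’s standard library; it does not settle any copyright question, and the user–agent string is a placeholder.

    # Sketch: a mechanical check of a site's robots.txt before crawling.
    # This is not a substitute for copyright permission from the site owner.
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed_to_fetch(url, user_agent="MyHarvester"):  # placeholder agent name
        """Return True if robots.txt permits user_agent to fetch url."""
        parsed = urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
        rp = RobotFileParser()
        rp.set_url(robots_url)
        try:
            rp.read()
        except Exception:
            return True  # no readable robots.txt; fall back to human judgment
        return rp.can_fetch(user_agent, url)

    if __name__ == "__main__":
        print(allowed_to_fetch("http://example.com/reports/annual.pdf"))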

Limitations
While harvesters and harvesting techniques are becoming increasingly sophisticated, so too are Web sites. It is difficult to harvest materials that are stored in databases, for instance, or generated dynamically, and certain harvesters handle JavaScript better than others. Be sure to review the content thoroughly in order to determine whether it can be harvested at all.
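A quick way to spot some of these limitations before committing to a harvest is to check whether text you can see in a browser actually appears in the raw HTML the server delivers; if it does not, the page is probably assembled by JavaScript or pulled from a database at request time. The sketch below uses a placeholder URL and phrase.

    # Sketch: a quick harvestability check. If visible text is missing from the
    # server-delivered HTML, a simple harvester will likely miss it too.
    import urllib.request

    def appears_in_raw_html(url, phrase):
        """Return True if phrase is present in the HTML as delivered."""
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        return phrase.lower() in html.lower()

    if __name__ == "__main__":
        if appears_in_raw_html("http://example.com/", "Example Domain"):
            print("Content is in the static HTML; a basic harvester should work.")
        else:
            print("Content may be generated dynamically; test with a richer tool.")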

 

++++++++++

Harvesting tools

Once the scope of the harvest has been determined, you’ll also need to decide how to harvest – that is, whether to use the open source tools that are available and do it yourself, or to employ a Web harvesting service.

There are several harvesters available for use – many of them can be found via the Web–at–Risk wiki. Two that I’m familiar with are Heritrix and HTTrack, both open source products. A third, the Web Curator Tool, has recently been released and has been tested by the Web–at–Risk project team.

Heritrix (http://crawler.archive.org/) is the harvester used by the Internet Archive, and it is the harvester being used in the Web–at–Risk project. It stores harvested content in ARC files, which require more massaging than the output of other tools before the captured content can be reproduced on a different Web server.
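To give a sense of the “massaging” involved, the harvested files sit inside ARC container records and have to be pulled back out before they can be served from an ordinary Web server. The sketch below assumes the third–party Python library warcio, which is not part of the Web–at–Risk toolset, and a hypothetical file name.

    # Sketch: list the records inside an ARC/WARC container produced by a crawl.
    # Assumes the third-party warcio library; the file name is hypothetical.
    from warcio.archiveiterator import ArchiveIterator

    def list_records(arc_path):
        """Print the type and payload size of each record in the container."""
        with open(arc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                payload = record.content_stream().read()
                print(record.rec_type, len(payload), "bytes")

    if __name__ == "__main__":
        list_records("crawl-sample.arc.gz")  # hypothetical file name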

HTTrack (http://www.httrack.com/) is a more user–friendly capture tool and bills itself as a “website copier.” While it does handle JavaScript fairly well, it cannot handle more dynamic content, and there is a greater possibility that files will be modified when captured.

The Web Curator Tool (http://webcurator.sourceforge.net/) was developed by the National Library of New Zealand and the British Library. It is used for managing the entire process of harvesting Web content – including defining the scope of the content, obtaining permissions, and running the harvest itself. The Web–at–Risk curators were asked to test this product, but I am not sure whether an analysis of that testing has been completed.

 

++++++++++

Harvesting services

As the practice of Web harvesting has evolved, harvesting services have developed. These services, such as the model developed in the Web–at–Risk project, allow clients to determine the scope of a harvest and pay someone else to capture and host the Web–published material they are interested in.

Archive-It (http://www.archive-it.org/) is a subscription service of the Internet Archive that allows institutions to create their own Web archive without having to host it. Some state governments are using this service, as are other institutions looking to preserve Web content without devoting too many in–house resources to the effort.

OCLC’s Digital Archive (http://www.oclc.org/digitalarchive/default.htm) is another harvesting service. In addition to harvesting entire sites, this product also allows individual publications to be harvested.

Finally, as part of the Web–at–Risk project itself, the Web Archiving Service is being developed. More information about it is on the Web–at–Risk wiki (http://wiki.cdlib.org/WebAtRisk/tiki-index.php).

I’ve gone through quite a bit of information in a short time, but there is much more out there for further research. The Web–at–Risk Wiki has a wealth of information about the project, including the collection plans developed by curators and presentations by the project team. Web harvesting can be a complex undertaking, and I hope that I’ve given you some things to think about before embarking on such an initiative. Thank you.

 

About the author

Valerie D. Glenn is the Government Documents Librarian at the University of Alabama Libraries.

 


 


Copyright ©2007, First Monday.

Copyright ©2007, Valerie D. Glenn.

Preserving Government and Political Information: The Web–at–Risk Project by Valerie D. Glenn
First Monday, volume 12, number 7 (July 2007),
URL: http://firstmonday.org/issues/issue12_7/glenn/index.html




