Web archiving: Management summary

Why collect and preserve the web?
Why should the Wellcome Library be interested in this?
Why should JISC be interested in this?
Collaboration
Challenges
Approaches

Why collect and preserve the web?

  • The web is a vital means of facilitating global communication and an important medium for scientific communication, publishing, e-commerce and much more. The 'fluid' nature of the web, however, means that pages or entire sites frequently change or disappear, often without leaving any trace.
  • In order to help counter this change and decay, web archiving initiatives are required to help preserve the informational, cultural and evidential value of the world wide web (or particular subsets of it).

Why should the Wellcome Library be interested in this?

  • The Wellcome Library has a particular focus on the history and progress of medicine. The web has had a huge impact on the availability of medical information and has also facilitated new types of communication between patients and practitioners, and between both groups and other types of organisation. The medical web, therefore, has potential long-term documentary value for historians of medicine.
  • To date, however, there has been no specific focus on collecting and preserving medical websites. While the Internet Archive has already collected much that would be of interest to future historians of medicine, a preliminary analysis of its current holdings suggests that significant content or functionality may be missing.
  • There is, therefore, an urgent need for a web archiving initiative that would have a specific focus on preserving the medical web. The Wellcome Library is well placed to facilitate this and such an initiative would nicely complement its existing strategy with regard to preserving the record of medicine past and present.

Why should JISC be interested in this?

The Joint Information Systems Committee of the Higher and Further Education Funding Councils (JISC) has a number of areas where web archiving initiatives would directly support its mission.

  • JISC funds a number of development programmes. It therefore has an interest in ensuring that the web-based outputs of these programmes (e.g. project records and publications) persist and remain available to the community and to JISC. Many of the websites of projects funded by previous JISC programmes have already disappeared.
  • JISC also supports national development of digital collections for HE/FE and the Resource Discovery Network (RDN) services that select and describe high-quality web resources judged to be of relevance to UK further and higher education. A web archiving initiative could underpin this effort by preserving copies of some of these sites, e.g. in case the original sites change or disappear. The expertise and subject knowledge of the RDN could in turn assist development of national and special collections by bodies such as the national libraries or Wellcome Trust. These collections would be of long-term value to HE/FE institutions.
  • JISC also funds the JANET network, which is used by most UK further and higher education institutions. UKERNA, as the network's operator, has overall responsibility for the ac.uk domain.

Collaboration

  • Collaboration will be the key to any successful attempt to collect and preserve the web.
  • The web is a global phenomenon. Many attempts are being made to collect and preserve it at a national or domain level, e.g. by national libraries and archives. This means that no single initiative (with the exception of the Internet Archive) can hope for total coverage of the web. Close collaboration between different web archiving initiatives, therefore, will be extremely important, e.g. to avoid unnecessary duplication in coverage or to share in the development of tools, guidelines, etc.
  • More specifically, there is a need for all organisations involved in web archiving initiatives in the UK to work together. In particular there is the opportunity to work closely with the British Library as it develops its proposals for web archiving as part of the national archive of publications. Potentially, many different types of organisation have an interest in collecting and preserving aspects of the UK web, while the British Library (BL), the Public Record Office (PRO) and the British Broadcasting Corporation (BBC) have already begun to experiment with web archiving. The Digital Preservation Coalition (DPC) is well placed to provide the general focus of this collaboration, although there may be a need for specific communications channels.

Challenges

The web poses preservation challenges for a number of reasons:

  • The web's fast growth rate and 'fluid' characteristics make it difficult to keep up with its content well enough for humans to decide what is worth preserving.
  • Web technologies are immature and constantly evolving. Increasingly, web content is delivered from dynamic databases that are extremely difficult to collect and preserve. Some sites use specific software (e.g. browser plug-ins) that may not be widely available or use non-standard features that may not work in all browsers. Other websites may belong to the part of the web known as the 'deep web', which is hard to find using most web search services and perhaps even harder to preserve.
  • Unclear responsibilities for preservation - the diverse nature of the web means that a variety of different organisation types are interested in its preservation. Archives are interested in websites when they may contain records, libraries when they contain publications or other resources of interest to their target communities. The global nature of the web also means that responsibility for its preservation does not fall neatly into the traditional national categories.
  • Legal issues: copyright, the lack of legal deposit mechanisms (at least in the UK), and liability related to data protection, content and defamation. These represent serious problems and are dealt with in a separate report [PDF 344KB] prepared by Andrew Charlesworth of the University of Bristol.

Approaches

Since the late 1990s, a small number of organisations have begun to develop approaches to the preservation of the web, or more precisely, well-defined subsets of it. Those organisations include national libraries and archives, scholarly societies and universities. Perhaps the most ambitious of these initiatives is the Internet Archive. This US-based non-profit organisation has been collecting broad snapshots of the web since 1996. In 2001, it began to give public access to its collections through the 'Wayback Machine'.

Current web archiving initiatives normally take one of three main approaches:

  • deposit, whereby web-based documents or 'snapshots' of websites are transferred into the custody of a repository body, e.g. national archives or libraries
  • automatic harvesting, whereby crawler programs attempt to download parts of the surface web - this is the approach of the Internet Archive (which has a broad collection strategy) and some national libraries, e.g. those of Sweden and Finland
  • selection, negotiation and capture, whereby repositories select web resources for preservation, negotiate their inclusion in cooperation with website owners and then capture them using software (e.g. for site replication or mirroring, harvesting, etc.) - this is the approach of the National Library of Australia and the British Library's recent pilot project.
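
As an illustration of the automatic harvesting approach, the following is a minimal sketch of a crawler that starts from a seed page, follows links breadth-first and stays within the seed's host. It is not the software used by any of the initiatives named above. The names harvest, LinkExtractor, fetch and max_pages are assumptions made for this example, and the 'site' is a small in-memory dictionary rather than the live web, so the sketch runs without network access. A production harvester would also need to respect robots.txt, limit its request rate and capture images, stylesheets and other embedded files.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url, staying on the seed's host.

    fetch is a caller-supplied function (hypothetical here) that takes a
    URL and returns the page's HTML as a string, or None if unavailable.
    Returns a dict mapping each captured URL to its HTML.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    archive = {}

    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # the page has moved or disappeared
        archive[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target, _ = urldefrag(urljoin(url, href))
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)

    return archive


# A tiny in-memory 'website' standing in for real HTTP requests.
site = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="http://other.example/">X</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.org/b.html": "<p>No links here.</p>",
}
captured = harvest("http://example.org/", site.get)
print(sorted(captured))
```

The link to other.example is skipped because it lies outside the seed's host, which is one simple way of restricting a harvest to a defined subset of the web. The selective approach differs mainly in that the list of seeds is chosen and negotiated by people rather than discovered by the crawler.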

These approaches are not mutually exclusive. Several web archiving initiatives (e.g. the Bibliothèque nationale de France and the National Library of New Zealand) plan to combine the selective and harvesting-based approaches. The selective approach can deal with some level of technical complexity in websites, as the capture of each site can be individually planned and associated with migration paths. This may make it a more successful approach for some parts of the so-called 'deep web'. However, hardware issues aside, selective collection would appear to be more expensive (per gigabyte archived) than the harvesting approach. Estimates of the relative costs vary, but the selective approach would normally be considerably more expensive in terms of staff time and expertise. This simple assessment, however, ignores factors related to the cost of preservation over time (whole of life costs), the potential for automation, and quality issues (i.e. fitness for purpose).

Wellcome Library, 183 Euston Road, London NW1 2BE, UK  tel: +44 (0)20 7611 8722  email: library@wellcome.ac.uk