The Documentalist

UT-Austin Library Web Clipper: Follow Up Questions

Posted in Archiving Solutions, technology by Sarah on November 10, 2009


Image courtesy of http://www.deepspace.com

A couple of weeks ago, Kevin Wood (University of Texas Libraries at Austin) and I posted an article titled “Archiving Web Pages: UT-Austin Library’s Web Clipper,” in which we described an innovative solution for capturing and preserving fragile human rights material from the World Wide Web.  The post generated a number of interesting questions, so we have decided to post this follow-up in Q&A style to provide more detail on how the Web Clipper works.  Special thanks again to Kevin for taking the time to craft answers to these questions.  Please do not hesitate to contact me if you have more questions.  We will post updates on the Web Clipper as Kevin and his team continue to develop it, and we will do our best to answer your questions here as we do so.  –Sarah

The UT Libraries’ Web Clipper

As part of a Bridgeway-funded initiative, the University of Texas Libraries at Austin is developing a means of harvesting and preserving fragile or endangered Web materials related to human rights violations and genocide.  Having tried a number of available technologies for harvesting Web material and found them unsatisfactory, a team of developers created an in-house Web Clipper program designed to meet the Libraries’ specific preservation needs.  A full description of the Web Clipper is available here.  What follows is a series of responses to questions generated by the first post about the Web Clipper.

Q1: When the clipper clips, does it save the file in the original formats (e.g., html, with all the associated files)?

A: Yes, to the extent possible.  There are challenges with JavaScript and streaming media that we are still working on with the new clipper.  In those cases we rely on attachments (see the answer to question 2 below).  Before designing the new Web Clipper, we’d gone through a few different clipping strategies and were not pleased with any of them.  Zotero does a good job of capturing what you see, but it modifies the files, which complicates preservation.  Placing Firefox behind a proxy captures a lot, but misses content that relies on user interactions if those interactions don’t occur.  Heritrix does the best job, but we’ve seen it struggle with more than 10% of the pages that have been clipped.

Q2: Are there limitations on what the Web Clipper can and cannot capture?

A: There are limitations to what our new Web Clipper can automatically capture, but it has the ability to accept attachments.  Extensions like DownloadHelper (a free Firefox extension for downloading and converting videos from many sites with minimum effort) can turn a streaming video into a file that can then be attached to a clipping.  The final format of the attachment depends on the tool used to create it, but generally matches the original.

Q3: Are the graduate research assistants who are testing the Clipper capturing multiple instances of the same site over time, or are these one-off?

A: Each capture is a one-off.  The Web Clipper allows users to dive deeper into sites and capture individual pages rather than whole sites (sometimes a site that wouldn’t normally carry relevant human rights information has an article or blog post that we want to preserve).  Where one might use tools such as Archive-It, WAS, WAX or Web Curator Tool to capture an entire blog, one uses the Web Clipper to capture and describe a single blog post or article, for example.

Q4: When the clipped files are submitted to The University of Texas Libraries’ DSpace (the local repository), is the submission process simple? That is, has an automated process been set up?

A: Yes, this process is automated.  We use SWORD (Simple Web-service Offering Repository Deposit) as the interface between the Web Clipper and DSpace for ingestion.  A script runs periodically, identifies new clippings and pushes them into the repository.
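
To make the "identify new clippings and push them" step concrete, here is a minimal sketch of what such a periodic script might look like. It is not the UT Libraries' actual script. The deposit URL, the packaging identifier, the idea that each clipping is a zip package in a directory, and all function names are assumptions made for illustration. The request is only built here; sending it is left to a caller-supplied function.

```python
import base64
import urllib.request
from pathlib import Path

# Hypothetical values: the post does not give the real endpoint or packaging.
DEPOSIT_URL = "https://repository.example.edu/sword/deposit/123456789/1"
PACKAGING = "http://purl.org/net/sword-types/METSDSpaceSIP"


def find_new_clippings(clip_dir, already_deposited):
    """Return zip packages in clip_dir whose names are not yet deposited."""
    return sorted(
        p for p in Path(clip_dir).glob("*.zip") if p.name not in already_deposited
    )


def build_deposit_request(package, user, password):
    """Build (but do not send) an HTTP POST carrying one clipping package."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        DEPOSIT_URL,
        data=package.read_bytes(),
        method="POST",
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": f"filename={package.name}",
            "X-Packaging": PACKAGING,
            "Authorization": f"Basic {token}",
        },
    )


def push_new_clippings(clip_dir, already_deposited, user, password, send):
    """Deposit every new package; `send` performs the request (e.g. urlopen)."""
    pushed = []
    for package in find_new_clippings(clip_dir, already_deposited):
        send(build_deposit_request(package, user, password))
        already_deposited.add(package.name)  # remember it for the next run
        pushed.append(package.name)
    return pushed
```

A scheduler such as cron could call `push_new_clippings(...)` with `urllib.request.urlopen` as `send`. Passing the sender in as a parameter keeps the sketch testable without a live repository.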

Q5: Regarding the use of a local Wayback machine for preserving the clipped materials: Are you capturing clipped material via Wayback in addition to DSpace, or is this all the same process with just one instance of the preserved site? If the latter, how does one set up a local Wayback version?

A: There is only one instance of the preserved site.  The repository contains a link out to the Wayback machine, not the preserved clipping itself.  The link allows a user to open the original record in the DSpace repository.  Although we could store ARC files (the Internet Archive’s container format for archived web content) in the repository, they wouldn’t be of much use to our users as such, so we’re only exposing the content through a local Wayback instance.  We use the open source version of the Wayback Machine.
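
As an illustration of the "link out" idea, a repository record could store a replay URL that points at the local Wayback instance. The sketch below assumes the common Wayback URL layout of a 14-digit UTC timestamp followed by the original address. The host name and the function name are hypothetical, and the post does not say how the UT Libraries actually build their links.

```python
from datetime import datetime, timezone

# Hypothetical base address of a local Wayback installation.
WAYBACK_BASE = "http://wayback.example.edu/wayback"


def wayback_link(original_url, captured_at):
    """Return a replay URL for one capture of original_url.

    captured_at should be a timezone-aware datetime; it is converted to UTC
    and formatted as the 14-digit timestamp used in Wayback replay URLs.
    """
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_BASE}/{stamp}/{original_url}"


link = wayback_link(
    "http://example.org/blog/post-1",
    datetime(2009, 11, 10, 15, 30, 0, tzinfo=timezone.utc),
)
print(link)
```

The repository then holds only this link plus the descriptive metadata, while the archived content itself stays behind the Wayback instance.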

Q6: Is access to the clipped documents restricted, or are they open to everyone via UT Libraries’ digital repository? Are there any privacy or confidentiality issues associated with the clipped material?

A: The clippings will be open to everyone, but while we’re in development they’re restricted.  We haven’t seen any privacy or confidentiality issues with our clipped material.  All of the clippings come from the public web.


