UT-Austin Library Web Clipper: Follow-Up Questions
A couple of weeks ago, Kevin Wood (University of Texas Libraries at Austin) and I posted an article titled “Archiving Web Pages: UT-Austin Library’s Web Clipper,” in which we described an innovative solution for capturing and preserving fragile human rights material from the World Wide Web. The post generated a number of interesting questions, so we have decided to post this follow-up in a Q&A style to provide additional information on how the Web Clipper works. Special thanks again to Kevin for taking the time to craft answers to these questions. Please do not hesitate to contact me if you have more questions. We will post updates on the Web Clipper's progress as Kevin and his team continue to develop it, and we will do our best to answer your questions here as we do so. –Sarah
The UT Libraries’ Web Clipper
As part of a Bridgeway-funded initiative, the University of Texas Libraries at Austin is developing a means of harvesting and preserving fragile or endangered Web materials related to human rights violations and genocide. After trying a number of available technologies for harvesting Web material and finding them unsatisfactory, a team of developers at the libraries created an in-house Web Clipper program designed to meet the libraries’ specific needs for preserving Web material. A full description of the Web Clipper is available here. What follows is a series of responses to questions generated by the first post about the Web Clipper.
Q1: When the clipper clips, does it save the file in the original formats (e.g., html, with all the associated files)?
Q2: Are there limitations on what the Web Clipper can and cannot capture?
A: There are limitations to what our new Web Clipper can automatically capture, but it has the ability to accept attachments. Extensions like DownloadHelper (a free Firefox extension for downloading and converting videos from many sites with minimum effort) can turn a streaming video into a file that can then be attached to a clipping. The final format of the attachment depends on the tool used to create it, but generally matches the original.
Q3: Are the graduate research assistants who are testing the Clipper capturing multiple instances of the same site over time, or are these one-off?
A: Each capture is a one-off. The Web Clipper allows users to dive deeper into sites and capture individual pages rather than whole sites (sometimes a site that wouldn’t normally carry relevant human rights information has an article or blog post that we want to preserve). Where one might use tools such as Archive-It, WAS, WAX or Web Curator Tool to capture an entire blog, one uses the Web Clipper to capture and describe a single blog post or article, for example.
Q4: When the clipped files are submitted to The University of Texas Libraries’ DSpace (the local repository), is the submission process simple? That is, is there an automated process created?
A: Yes, this process is automated. We use SWORD (Simple Web-service Offering Repository Deposit) as the interface between the Web Clipper and DSpace for ingestion. A script runs periodically, identifies new clippings and pushes them into the repository.
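To make the periodic deposit step more concrete, here is a minimal sketch in Python of what such a script might look like. It is not the libraries' actual script: the deposit URL, the assumption that clippings are packaged as zip files in a directory, the packaging identifier and all function names are hypothetical, chosen only for illustration. It assumes a SWORD 1.x style deposit, which is an HTTP POST of a package to a collection URL.

```python
import urllib.request
from pathlib import Path

# Hypothetical values: the post does not give the real endpoint or packaging.
DEPOSIT_URL = "https://repository.example.edu/sword/deposit/123456789/1"
PACKAGING = "http://purl.org/net/sword-types/METSDSpaceSIP"


def find_new_clippings(clip_dir, deposited):
    """Return packaged clippings (zip files) that have not been deposited yet."""
    return sorted(p for p in Path(clip_dir).glob("*.zip") if p.name not in deposited)


def build_deposit_request(package, url=DEPOSIT_URL):
    """Build (but do not send) an HTTP POST carrying one packaged clipping."""
    return urllib.request.Request(
        url,
        data=package.read_bytes(),
        method="POST",
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": f"filename={package.name}",
            "X-Packaging": PACKAGING,
        },
    )


def push_new_clippings(clip_dir, deposited, send=urllib.request.urlopen):
    """Deposit every new clipping and record its name so later runs skip it."""
    pushed = []
    for package in find_new_clippings(clip_dir, deposited):
        # The response is used as a context manager so the connection is closed.
        with send(build_deposit_request(package)):
            pass
        deposited.add(package.name)
        pushed.append(package.name)
    return pushed
```

A scheduler such as cron could call push_new_clippings at a fixed interval. A real deployment would also need authentication, error handling and a persistent record of what has already been deposited, all of which are omitted here.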
Q5: Regarding the use of a local Wayback machine for preserving the clipped materials: Are you capturing clipped material via Wayback in addition to DSpace, or is this all the same process with just one instance of the preserved site? If the latter, how does one set up a local Wayback version?
A: There is only one instance of the preserved site. The repository contains a link out to the Wayback Machine, not the preserved clipping itself. The link allows a user to open the original record in the DSpace repository. Although we could store ARC files (the Internet Archive's container format for archived web captures) in the repository, they wouldn’t be of much use to our users as such, so we’re only exposing the content through a local Wayback instance. We use the open source version of the Wayback Machine.
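As an illustration of the "link out" idea, the sketch below builds a replay link of the kind a repository record might store. It assumes the usual Wayback convention of a 14-digit UTC timestamp followed by the original URL. The base URL and the function name are hypothetical, and the libraries' actual link format is not described in this post.

```python
from datetime import datetime, timezone

# Hypothetical base URL for a local Wayback instance.
WAYBACK_BASE = "http://wayback.example.edu/wayback"


def wayback_link(original_url, captured_at, base=WAYBACK_BASE):
    """Build a replay link: base, 14-digit UTC capture timestamp, original URL."""
    # captured_at should be a timezone-aware datetime; it is normalized to UTC.
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base}/{stamp}/{original_url}"


link = wayback_link(
    "http://example.org/post",
    datetime(2010, 5, 3, 14, 30, 0, tzinfo=timezone.utc),
)
print(link)
```

Storing only a link like this in the repository record keeps the archived capture in one place while still letting users reach it from the record.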
Q6: Is access to the clipped documents restricted, or are they open to everyone via UT Libraries’ digital repository? Are there any privacy or confidentiality issues associated with the clipped material?
A: The clippings will be open to everyone, but while we’re in development they’re restricted. We haven’t seen any privacy or confidentiality issues with our clipped material. All of the clippings come from the public web.