Archiving Solutions | The Documentalist

Satellite Image Archives

Posted in Archiving Solutions, Reviews by Sarah on January 21, 2010

From satellites to archived files. Image courtesy of Integral Systems

The last couple of posts have dealt with the technology of satellite imagery and how this imagery can serve human rights. However, of more interest to some might be the archiving of satellite images. After all, the benefit of satellite imagery for human rights work is predicated on access to “before” and “after” images that illustrate the physical destruction of villages or farms in the wake of human rights atrocities. The “before” images necessarily come from archives of stored and cataloged imagery collected by the satellites of various geospatial imaging companies.

Because satellite imaging companies are for-profit, information about their archiving practices is fairly limited, but some information is available on-line and is summarized below for three large imaging firms: GeoEye, ImageSat International, and DigitalGlobe. These companies have each provided images to human rights organizations or to researchers investigating human rights events, either as donations or through purchase arrangements.

GeoEye

GeoEye maintains an archive of satellite images and a suite of services for accessing them. These services are available through their GeoFUSE program, described as follows:

GeoEye’s Imagery Sources collect vast amounts of high-resolution satellite and aerial imagery from around the globe each day. This imagery is processed and used in a multitude of applications such as mapping, disaster response, infrastructure management, and environmental monitoring. Now, with GeoEye’s new suite of Search & Discovery tools, our customers can browse the GeoEye image catalog archives, quickly and easily locating and previewing imagery for their specific needs. Using the information obtained through use of these tools, our customers can easily communicate the information necessary to place orders for imagery products that meet their project requirements.

Access services include: Online Maps, Google Earth Tools, Online Resource Center, Advanced Search Options, Toolbar for ArcMap (a desktop GIS application), Help & Documentation, and Image search (Resourcesat-1 catalogs). Some preliminary searching can be done through these tools at the website, and selected preview images can be stored in a personal file at the GeoEye Webpage for reference and purchase. GeoEye also offers imagery for free to academics, human rights organizations, and other non-profits through the GeoEye Foundation.

ImageSat International

ImageSat has an archive, but little information about it is available online. The Website states:

ImageSat maintains an imagery archive, which contains all imaged EROS A data, including that which is down-linked by the ground control stations in ImageSat’s Global Network. Customers may purchase this imagery at preferred prices. To enquire about purchasing imagery from the ImageSat Imagery Archive, contact our Order Desk or call us at +972-3-7960627.

There are sample images available through the gallery, but there does not appear to be a means of searching preview images as there is at GeoEye’s website. The page requests that you call for information.

DigitalGlobe

DigitalGlobe also maintains an archive of its satellite imagery, which you can learn more about by contacting the company directly:

Please Contact Customer Service for information on searching The DigitalGlobe™ Archive

E-mail: info@digitalglobe.com
Toll Free: 800.496.1225 or
Phone: 303.684.4561
Fax: 303.684.4562

Currently, DigitalGlobe is offering free imagery for coverage of the Haiti crisis, which is available on their gallery page. Click on the “Free Access to Haiti Imagery” button and you will be taken to an order form for imagery requests. They also offer an on-line image search feature that allows site visitors to sample the imagery in the archive according to region of the world. An interactive map leads visitors through the preview process. Standard imagery is available upon request:

Standard Imagery can be acquired directly from the DigitalGlobe archive or you can submit a new collection request. Standard Imagery is ordered by area, with a minimum purchase of 25 km² (~10 mi²) for archive orders. For tasking, the minimum area for ordering is 25 km² (~10 mi²), but minimum pricing rules apply, depending on the tasking level selected. If your order crosses more than one strip, one standard imagery product per scene is delivered.

Products are delivered on your choice of standard digital media with Image Support Data files including image metadata.

GLIFOS-Media: Rich Media Archiving

Posted in Archiving Solutions, Reviews, technology by Sarah on November 19, 2009

Rich-media preservation

As posted on November 4, 2009, The University of Texas Libraries Human Rights Documentation Initiative (HRDI) has been working with the Kigali Genocide Memorial Centre in Rwanda on a pilot digital archiving program that takes advantage of a rich media platform called GLIFOS media. GLIFOS provides a social media tool kit that was originally created to meet the needs of a distance learning program at the Universidad Francisco Marroquín in Guatemala, but it also proves to be promising as a tool for human rights archiving (see the article “Non-custodial archiving: U Texas and Kigali Memorial Center” at WITNESS Media Archive). As a rich-media wiki, GLIFOS is designed to integrate digital video, audio, text, and image documents through a process that “automates the production, cataloguing, digital preservation, access, and delivery of rich-media over diverse data transport platforms and presentation devices.” GLIFOS media accomplishes this by presenting related documents–for example, video of a lecture, a transcript of the same, and associated PowerPoint slides–in a synchronized fashion such that when a user highlights a particular segment of a transcript, for example, the program locates and plays the corresponding segment of the video and also locates the related PowerPoint slide. This ability to seamlessly synchronize and present related digital media translates well to the human rights context by allowing for the cataloging and integration of video material, documents containing testimonies, photographs, and transcripts. Materials that all relate to a single event can be pulled together and presented in a holistic fashion, which is useful for activism and scholarship.

GML: The Key to Preservation

In order to support the presentation of this integrated information for users, GLIFOS needed to ensure that materials can be read and accessed across existing digital presentation platforms (e.g., web browsers, DVDs, CDs) and readers (e.g., PCs or PDAs), as well as on platforms yet to be created (see “XML Saves the Day,” an article written by the developers in 2005, for more detail [1]). This was accomplished by indexing and annotating all digital documents stored in the GLIFOS repository with an XML-based language called the “GLIFOS Markup Language,” or GML. The claim is that “GML is technology, platform, and format independent” (ibid.), thus allowing for preservation of established relationships between materials. Basically, GML allows users of GLIFOS to create a metafile that defines the relationships between related multi-media records held in a repository in such a way that those relationships are maintained across a variety of media reading platforms. This is possible because GML is a significantly stripped-down markup language that requires little or no translation from one reader to the next, so content is preserved as technology changes and evolves.
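To make the idea of a relationship metafile concrete, here is a minimal sketch in Python. It does not use the real GML schema, which is not reproduced in this post; the element and attribute names (`record`, `media`, `segment`, `start`, `slide`, and so on) are invented for illustration only. It shows how a plain XML file could tie a transcript passage to the matching stretch of video and the matching slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical metafile. This is NOT the actual GML schema; every element
# and attribute name here is an assumption made for illustration.
METAFILE = """
<record id="lecture-01">
  <media role="video" href="lecture-01.mp4"/>
  <media role="transcript" href="lecture-01.txt"/>
  <media role="slides" href="lecture-01.ppt"/>
  <segment start="0" end="95" transcript="p1" slide="1"/>
  <segment start="95" end="240" transcript="p2" slide="2"/>
  <segment start="240" end="400" transcript="p3" slide="2"/>
</record>
"""


def load_segments(xml_text):
    """Parse the metafile and return its segments as plain dictionaries."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "start": int(seg.get("start")),      # video offset in seconds
            "end": int(seg.get("end")),
            "transcript": seg.get("transcript"),  # transcript passage id
            "slide": int(seg.get("slide")),       # slide number
        }
        for seg in root.findall("segment")
    ]


def locate(segments, transcript_id):
    """Return the segment linked to a transcript passage, or None."""
    for seg in segments:
        if seg["transcript"] == transcript_id:
            return seg
    return None


segments = load_segments(METAFILE)
hit = locate(segments, "p2")
print(hit["start"], hit["slide"])
```

Because the relationships live in a small text file rather than inside any one player, a different reader only needs to parse the same file to rebuild the synchronization.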

GLIFOS and Human Rights Documentation

Given that GLIFOS is designed to catalog, index, and synchronize a wide variety of digital media types, it proves to be a promising tool for aiding in digital archiving. The GLIFOS GML protocol allows the program to access and present cataloged materials through the meta-relationships it establishes for records; and because GML is a streamlined markup language that allows multiple platforms to present and read digital documents, these relationships have been successfully maintained when migrated to entirely new data reading and presentation platforms. Two conditions apply: the repository of documents that GLIFOS accesses must remain intact, both in terms of the materials stored there and their associated metadata, and new media platforms must continue to read older video and image media files. As long as both hold, GML aids preservation by providing a means of cataloging and indexing documents and by preserving, over time, the synchronized links and interactions that GLIFOS establishes between related documents.

[1] See http://www.glifos.com/wiki/images/f/f5/Arias_reichenbach_pasch_mLearn2005.pdf

UT-Austin Library Web Clipper: Follow Up Questions

Posted in Archiving Solutions, technology by Sarah on November 10, 2009

Image courtesy of http://www.deepspace.com

A couple of weeks ago, Kevin Wood (University of Texas Libraries at Austin) and I posted an article with the title “Archiving Web Pages: UT-Austin Library’s Web Clipper,” where we described an innovative solution to capturing and preserving fragile human rights material from the World Wide Web. The post generated a number of interesting questions, so we have decided to post this follow up in a Q&A style to provide additional information on how the Web Clipper works. Special thanks again to Kevin for taking the time to craft answers to these questions. Please do not hesitate to contact me with more questions if you have them. We will be writing updates on the Web Clipper progress as Kevin and his team continue to develop it and will do our best to answer your questions here as we do so. –Sarah

The UT Libraries’ Web Clipper

As part of a Bridgeway-funded initiative, the University of Texas Libraries at Austin is engaged in a project developing a means for harvesting and preserving fragile or endangered Web materials related to human rights violations and genocide. Having tried a number of available technologies for harvesting Web material and finding them to be unsatisfactory for their needs, a team of developers created an in-house Web Clipper program designed to meet the libraries’ specific needs for preserving Web material. A full description of the Web Clipper is available here. What follows is a series of responses to questions generated from the first post about the Web Clipper.

Q1: When the clipper clips, does it save the file in the original formats (e.g., html, with all the associated files)?

A: Yes, to the extent possible. There are challenges with JavaScript and streaming media that we are still working on with the new clipper. In those cases we rely on attachments (see the answer to question 2 below). Before designing the new Web Clipper, we’d gone through a few different clipping strategies and were not pleased with any. Zotero does a good job of capturing what you see, but makes modifications to the files, thus complicating preservation. Placing Firefox behind a proxy captures a lot, but misses content that relies on user interactions if those interactions don’t occur. Heritrix does the best job, but we’ve seen it struggle with more than 10% of the pages that have been clipped.

Q2: Are there limitations on what the Web Clipper can and cannot capture?

A: There are limitations to what our new Web Clipper can automatically capture, but it has the ability to accept attachments. Extensions like DownloadHelper (a free Firefox extension for downloading and converting videos from many sites with minimum effort) can turn a streaming video into a file that can then be attached to a clipping. The final format of the attachment depends on the tool used to create it, but generally matches the original.

Q3: Are the graduate research assistants who are testing the Clipper capturing multiple instances of the same site over time, or are these one-off?

A: Each capture is a one-off. The Web Clipper allows users to dive deeper into sites and capture individual pages rather than whole sites (sometimes a site that wouldn’t normally carry relevant human rights information has an article or blog post that we want to preserve). Where one might use tools such as Archive-It, WAS, WAX or Web Curator Tool to capture an entire blog, one uses the Web Clipper to capture and describe a single blog post or article, for example.

Q4: When the clipped files are submitted to The University of Texas Libraries’ DSpace (the local repository), is the submission process simple? That is, is there an automated process created?

A: Yes, this process is automated. We use SWORD (Simple Web-service Offering Repository Deposit) as the interface between the Web Clipper and DSpace for ingestion. A script runs periodically, identifies new clippings and pushes them into the repository.
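The periodic script described in this answer can be pictured with a small Python sketch. This is not the Libraries’ actual code: the directory layout, the `.zip` packaging, and the `deposit` callback are all assumptions made for illustration, and the real SWORD deposit into DSpace is deliberately left out and replaced by a stand-in function.

```python
import tempfile
from pathlib import Path


def find_new_clippings(clip_dir, deposited):
    """Return clipping packages in clip_dir that have not been deposited yet.

    `deposited` is a set of file names already pushed to the repository.
    The one-zip-per-clipping layout is an assumption for this sketch.
    """
    return sorted(p for p in Path(clip_dir).glob("*.zip") if p.name not in deposited)


def push_new_clippings(clip_dir, deposited, deposit):
    """Deposit every new clipping and remember it so it is not sent twice.

    `deposit` is a stand-in for the real repository deposit call.
    """
    pushed = []
    for package in find_new_clippings(clip_dir, deposited):
        deposit(package)
        deposited.add(package.name)
        pushed.append(package.name)
    return pushed


# Demo with a throwaway directory and a stand-in deposit function.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("clip-001.zip", "clip-002.zip"):
        (Path(tmp) / name).write_bytes(b"")
    already = {"clip-001.zip"}   # pretend this one was pushed earlier
    sent = []                    # records what the stand-in "deposited"
    first_run = push_new_clippings(tmp, already, sent.append)
    second_run = push_new_clippings(tmp, already, sent.append)

print(first_run, second_run)
```

Keeping a record of what has already been deposited is what lets the script run on a schedule without pushing the same clipping twice.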

Q5: Regarding the use of a local Wayback machine for preserving the clipped materials: Are you capturing clipped material via Wayback in addition to DSpace, or is this all the same process with just one instance of the preserved site? If the latter, how does one set up a local Wayback version?

A: There is only one instance of the preserved site. The repository contains a link out to the Wayback Machine, not the preserved clipping itself. The link allows a user to open the original record in the DSpace repository. Although we could store ARC files (the Internet Archive’s container format for captured web resources) in the repository, they wouldn’t be of much use to our users as such, so we’re only exposing the content through a local Wayback instance. We use the open source version of the Wayback Machine.

Q6: Is access to the clipped documents restricted, or are they open to everyone via UT Libraries’ digital repository? Are there any privacy or confidentiality issues associated with the clipped material?

A: The clippings will be open to everyone, but while we’re in development they’re restricted. We haven’t seen any privacy or confidentiality issues with our clipped material. All of the clippings come from the public web.

UT Human Rights Archiving and GLIFOS

Posted in Archiving Solutions, Reports, technology by Sarah on November 4, 2009

T-Kay Sangwand, the human rights archivist at the University of Texas Libraries in Austin, has contributed a guest post to the WITNESS Media Archive blog to close out Grace Lile’s series for Archives Month last month. The post discusses a non-custodial archiving arrangement that the University of Texas Libraries has established with the Kigali Memorial Centre (KMC) in Rwanda. Funded by the Bridgeway Foundation and the University of Texas Libraries, the project–called the Human Rights Documentation Initiative (HRDI)–consists of a collaborative effort to digitize, preserve, and catalogue a variety of documentation from the Rwandan genocide. In order to accomplish this, HRDI project team members traveled to Rwanda this summer to help KMC set up an archiving system that utilizes the GLIFOS media toolkit–a rich-media storage program and reader developed in Guatemala:

In order to facilitate access to KMC materials, the HRDI has been working with the Guatemala-based company, Glifos, that provides powerful software that allows for cataloging, indexing, and syncing audiovisual materials with transcripts and other materials for enhanced access. Using Glifos, the HRDI built a prototype for a digital archive for KMC and in July 2008, three members of the HRDI project team (Christian Kelleher, T-Kay Sangwand, and Amy Hamilton) traveled to Rwanda to demo the prototype.

A unique piece of this project is the supportive role that the University of Texas Libraries is playing as KMC establishes and maintains their archive. Specifically, the library is serving as a repository of the digitized materials created at Kigali, while Kigali maintains the original collection of physical paper documents, film footage, or audio recordings. GLIFOS will allow users in Rwanda to directly access the digital materials held in the Texas repository. See the entire article at the WITNESS Media Archive for the complete discussion of this project.

As illustrated by the HRDI project at Texas, the GLIFOS program proves to be a good means of cataloguing, indexing, and preserving rich-media content (that is, video, text, audio, and even materials in multiple languages) in a way that allows for ease of archiving and ease of access and use. A future post on this blog will discuss the technical specifications of GLIFOS in terms of its utility for digital archiving.

Capturing & Archiving Web Pages: UT-Austin Library’s Web Clipper

Posted in Archiving Solutions, Reports, technology by Sarah on October 23, 2009

The following was co-authored with Kevin Wood at the University of Texas Libraries at Austin. The post describes a promising experimental archiving strategy that the UT Libraries is developing for harvesting and preserving primary resources from the Web. Special thanks to Kevin for contributing his expertise and time by co-authoring this post.

–Sarah

University of Texas Libraries-Austin’s Web Clipper Project for Human Rights

Developer: Kevin Wood

Example of a Web page clipped from the web for archiving as a primary resource. Image: Kevin Wood, University of Texas Libraries-Austin

Background

In July of 2008, the University of Texas Libraries received a grant from the Bridgeway Foundation to support efforts to collect and preserve fragile records (records that are at risk of destruction either from environmental conditions or human activity) of human rights conflicts and genocide. These funds are helping the library to develop new means for collecting and cataloguing “fragile or transient Web sites of human rights advocacy and genocide watch;” sites that are important because the internet has become a primary means for distributing both information and misinformation about human rights abuses and for documenting human rights events. Thus these fragile Web sites become valuable primary resources for survivors, scholars, and activists as they pursue their work in human rights (see the library’s grant announcement for a press release on the grant).

Harvesting Web Sites for Archiving

In their first attempt to establish a reliable means for harvesting Web sites for preservation, archivists at the University of Texas Libraries used Zotero, a free Firefox extension that allows users to collect, manage and cite online resources for research. The program allows users to capture copies of webpages and catalog them in a bibliographic program that functions much like EndNote or Bookends. Archivists at the University of Texas planned to use the program to pull specific documentation of human rights events off of the internet and then submit the collected pages to their institutional repository for cataloging and preservation. However, Zotero wound up not meeting their needs. Zotero is geared toward individual work from a desktop. When it harvests a page, it rewrites links to be relative to the individual’s desktop rather than saving the original links as they are built into the webpage of interest. In terms of archiving and preservation, this is problematic because it calls into question the authenticity of the captured pages. Zotero can be made to keep the original links, but it was not originally designed to do so, so this becomes a cumbersome process. As Zotero continues to evolve in the direction of meeting the needs of individual users, the workaround becomes that much more difficult to maintain.

The solution to this problem is a custom web clipper, created in-house, that harvests pages without modifying them. It functions as a Firefox plug-in and was built from the ground up, borrowing heavily from open source programs that already have some of the right functionality for the libraries’ human rights archiving needs. The designer wants to keep the coding footprint of the web clipper as small as possible to minimize the deployment and maintenance burden. Therefore, the main logic of the clipper will be hosted on a server and accessed on individual machines or terminals through web services. Eventually, this will allow patrons to use the clipper from anywhere in the library system as a harvesting tool. The goal is to centralize the clipping process as much as possible without the need of customizing individual machines, thus streamlining collection, cataloging, and preservation processes.

The prototype clipper is currently housed on two computers at the library in Austin, and graduate research assistants are actively clipping web pages for archiving. As they clip a page (see the image above for an example of a clipped page), users enter metadata in predetermined fields and then assign descriptive terms as tags for subject and content cataloging. Users can either select from a thesaurus of human rights terms (in this case, they are beginning with the thesaurus from WITNESS and extending it with terms as appropriate) or assign arbitrary keywords. Though users have complete control over clipping, documenting, and tagging a Web page, a moderator or manager determines if new terms should be added to the thesaurus.
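The tagging workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample terms are made up rather than taken from the WITNESS thesaurus, and the moderator’s decision is reduced to a simple yes/no callback.

```python
def tag_clipping(tags, thesaurus):
    """Split a user's tags into thesaurus terms and free keywords.

    Matching is case-insensitive; both groups stay attached to the
    clipping, and only the free keywords go to the moderator.
    """
    approved = [t for t in tags if t.lower() in thesaurus]
    proposed = [t for t in tags if t.lower() not in thesaurus]
    return approved, proposed


def review(proposed, thesaurus, accept):
    """Moderator step: add each accepted keyword to the thesaurus."""
    for term in proposed:
        if accept(term):
            thesaurus.add(term.lower())


# Hypothetical starting thesaurus (illustrative terms only).
thesaurus = {"genocide", "displacement"}

approved, proposed = tag_clipping(["Genocide", "village burning"], thesaurus)
print(approved, proposed)

# Here the stand-in moderator accepts every proposed keyword.
review(proposed, thesaurus, lambda term: True)
print(sorted(thesaurus))
```

Separating the two steps mirrors the division of labor in the post: users tag freely at clipping time, and the controlled vocabulary only grows when a moderator approves a term.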

Regardless of whether a new term makes it into the thesaurus, the pages clipped by users get stored in the archive. Once items are clipped and tagged with descriptive terms, they are ingested into the UT Libraries’ institutional repository, based on DSpace. Metadata are stored in the repository with a link to a local instance of Internet Archive’s Wayback Machine. These copies appear exactly as the pages appeared when the material was first clipped and submitted for preservation, thus maintaining their value as primary resources.

Digital Archiving Tool: Amnesty International’s ADAM

Posted in Archiving Solutions by Sarah on October 15, 2009

Amnesty International’s International Secretariat recently released an in-house digital archiving program called ADAM–Amnesty Digital Asset Management. The program, designed in conjunction with Bright Interactive, allows Amnesty field workers to upload digitally created photos, videos, and audio recordings into a central repository that all Amnesty members can access from within the organization. ADAM is a customized application of Bright Interactive’s “Asset Bank” tool which:

is a digital asset management system, enabling your organisation to create a fully searchable, categorised library of digital images, videos and other documents. It is a high-performance, cost-effective server application to enable you to manage digital assets – all that is needed to access it is a web browser (from Asset Bank).

The description for the product goes on to specify that the Asset Bank program that ADAM is built from is customizable, scalable, and multi-lingual.

Because the program is accessible through a web browser, field workers can submit their field materials from anywhere in the world, as long as they have an internet link (sometimes a challenge in the further reaches of the world). As users upload their digital materials, they fill in required fields for metadata and context information. Use and access restrictions are also recorded in the record for each uploaded item. At this point, uploading material into ADAM is voluntary, but according to AI’s digital archivist, response has been enthusiastic. The hope is that uploading material into ADAM will become standard practice for all field workers, thus streamlining archiving processes and making material readily available for AI reports and campaigns. This material could also potentially be available for scholarly and legal work by outside parties–always dependent, of course, on the access agreements that AI holds with the creators of the material and the individuals represented in images, videos, or audio recordings.

Currently, ADAM holds approximately 36,000 records, 159 of which are available for public viewing at the ADAM Web site. Though Web site visitors from outside of AI can’t access the full holdings, the public holdings allow you to see the types of information that ADAM users submit when they upload their digital documentation items. Information ADAM currently collects is as follows:

Descriptive

Title of the video, image, or audio file
Description of the content
Keywords, or terms for searching and cataloging
Campaigns that the item contributes to or was created for
Tags
Copyright type
Copyright credit

Agreement Type:

Agreement specifies the level of use that the creator of the piece and individuals represented within the piece permit within Amnesty International. Some items are publicly available and others are highly restricted.
Agreement Notes specify additional use restrictions not covered in the standard agreements preset in ADAM
Shotlist/Transcript information for video and/or audio material
Date Created
Creation Date Accuracy is a space for stating level of confidence for when the item was created.
Place Created

Technical

Size of the digital image, video or audio recording in terms of image density and/or memory space required for the file
Orientation of images (landscape or portrait)

Admin

ID, a catalog number assigned to the item by ADAM
Date Last Modified
Embedded Data
Collections
Categories
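As a rough picture of how the fields listed above fit together in one record, here is a small Python sketch. It is not ADAM’s or Asset Bank’s actual data model: the class name, attribute names, and the choice of which fields to check are assumptions that simply paraphrase the list, and several fields are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class AssetRecord:
    """Illustrative record grouping some of the fields listed above."""
    # Descriptive
    title: str = ""
    description: str = ""
    keywords: list = field(default_factory=list)
    campaigns: list = field(default_factory=list)
    copyright_type: str = ""
    copyright_credit: str = ""
    agreement_type: str = ""
    agreement_notes: str = ""
    date_created: str = ""
    creation_date_accuracy: str = ""
    place_created: str = ""
    # Technical
    size: str = ""
    orientation: str = ""
    # Admin
    asset_id: str = ""
    date_last_modified: str = ""


def empty_fields(record, names):
    """Return the names from `names` that are still blank on the record."""
    return [name for name in names if not getattr(record, name)]


# Which fields an uploader must fill in is an assumption for this demo.
to_check = ["title", "description", "agreement_type"]
record = AssetRecord(title="Interview footage", agreement_type="restricted")
print(empty_fields(record, to_check))
```

A check like this is the kind of thing an upload form could run before accepting an item, so that context and use restrictions travel with the file.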

JISC-PoWR: A Resource for Digital Preservation

Posted in Archiving Solutions by Sarah on October 9, 2009

Image courtesy of JISC-PoWR (http://jiscpowr.jiscinvolve.org/handbook/)

The Joint Information Systems Committee, or JISC, is an organization in the UK dedicated to exploring the ways in which ICT (information and communication technologies) can support higher education and research. According to their Web site, they sponsor or fund more than 200 projects related to innovative uses of ICT, one of which focuses on archiving the internet, social media, Web 2.0 and digital materials in general. To this end, they have written a 104-page handbook on web archiving and support a blog, called JISC-PoWR, dedicated to all things digital archiving. There are a number of informative posts and resources available at the site, including discussions of how to harvest and archive Twitter tweets, blogs, fragile Web pages, and the like.

Best Practices for Human Rights Archiving & a Push by WITNESS

Posted in Archiving Solutions, Reports by Sarah on October 2, 2009

In a post titled “Archives Month: Focus on Human Rights,” Grace Lile of WITNESS calls attention to a recent UN report, Right to Truth, from the United Nations High Commissioner for Human Rights that contains best practices related to human rights archiving. This annual report created by the Human Rights Council highlights the importance of good documentation collection and preservation practices for upholding the mandate of the Universal Declaration of Human Rights and supporting justice. In short, documentation is essential for human rights action, and careful documentation practices should be a part of the operational mandate for human rights workers.

Also, in honor of Archives Month, Grace notes that at WITNESS:

we’d like to amplify the topic of Archives and Human Rights, through this blog, and through a series of videos to be highlighted later this month on the Hub. What is a human rights archive? How do archives support human rights? What are the most pressing issues in the field today? What can be done to strengthen the ability of archives to promote and support human rights?

Despite the increasing recognition of value and need noted by the UNHCHR, the challenges for documentation centers and archives are daunting, and range from poor documentation on the ground to the long-term preservation of increasingly ephemeral media. What can be done? What is being done? We’d love to hear from anyone who can contribute to this topic.

Please visit the blog to follow these posts, to learn more about other initiatives related to archiving in human rights, and to contribute your input on the questions WITNESS is asking.

New Twitter Terms Potentially Impact Archiving

Posted in Archiving Solutions, Twitter by Sarah on September 15, 2009

The confusing world of copyright. Image courtesy of vtualerts.com.

Twitter Announces New Copyright Terms

On September 10, 2009, Twitter announced that they have updated their copyright terms for user posts (see their blog at http://blog.twitter.com/ for an overview). Previously, Twitter simply assured that users’ posts were their own, but encouraged people to consider contributing their material to the public domain. The new terms specify that tweets remain the property of users, but that users–by virtue of agreeing to Twitter’s terms and conditions–grant Twitter a worldwide, non-exclusive, royalty-free license to use and distribute their tweets. That license includes the right to make tweets available to outside companies, organizations, or individuals who partner with Twitter for syndication, broadcast, distribution, or publication.

I’ve included Twitter’s original terms of use and their new terms below so that you can compare and contrast. It appears that the new terms may allow for straightforward harvesting and archiving of Twitter tweets.

Twitter’s original copyright statement:

Copyright (What’s Yours is Yours)

1. We claim no intellectual property rights over the material you provide to the Twitter service. Your profile and materials uploaded remain yours. You can remove your profile at any time by deleting your account. This will also remove any text and images you have stored in the system.

2. We encourage users to contribute their creations to the public domain or consider progressive licensing terms

3. Twitter undertakes to obey all relevant copyright laws. We will review all claims of copyright infringement received and remove content deemed to have been posted or distributed in violation of any such laws.

(Source: http://twitter.com/tos accessed 8/28/2009 at 1:00 pm. N.B.–clicking on the link to the left will take you to the new terms and conditions of use)

Twitter’s new copyright statement:

Your rights

You retain your rights to any Content you submit, post or display on or through the Services. By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed).

TIP: This license is you authorizing us to make your Tweets available to the rest of the world and to let others do the same. But what’s yours is yours – you own your content.

You agree that this license includes the right for Twitter to make such Content available to other companies, organizations or individuals who partner with Twitter for the syndication, broadcast, distribution or publication of such Content on other media and services, subject to our terms and conditions for such Content use.

TIP: Twitter has an evolving set of rules for how API developers can interact with your content. These rules exist to enable an open ecosystem with your rights in mind.

Such additional uses by Twitter, or other companies, organizations or individuals who partner with Twitter, may be made with no compensation paid to you with respect to the Content that you submit, post, transmit or otherwise make available through the Services.

We may modify or adapt your Content in order to transmit, display or distribute it over computer networks and in various media and/or make changes to your Content as are necessary to conform and adapt that Content to any requirements or limitations of any networks, devices, services or media.

You are responsible for your use of the Services, for any Content you provide, and for any consequences thereof, including the use of your Content by other users and our third party partners. You understand that your Content may be rebroadcasted by our partners and if you do not have the right to submit Content for such use, it may subject you to liability. Twitter will not be responsible or liable for any use of your Content by Twitter in accordance with these Terms. You represent and warrant that you have all the rights, power and authority necessary to grant the rights granted herein to any Content that you submit.

Twitter gives you a personal, worldwide, royalty-free, non-assignable and non-exclusive license to use the software that is provided to you by Twitter as part of the Services. This license is for the sole purpose of enabling you to use and enjoy the benefit of the Services as provided by Twitter, in the manner permitted by these Terms.

Twitter Rights

All right, title, and interest in and to the Services (excluding Content provided by users) are and will remain the exclusive property of Twitter and its licensors. The Services are protected by copyright, trademark, and other laws of both the United States and foreign countries. Nothing in the Terms gives you a right to use the Twitter name or any of the Twitter trademarks, logos, domain names, and other distinctive brand features. Any feedback, comments, or suggestions you may provide regarding Twitter, or the Services is entirely voluntary and we will be free to use such feedback, comments or suggestions as we see fit and without any obligation to you.

(Source: http://twitter.com/tos accessed 9/15/2009 at 10:45 am)

Harvesting and Preserving Twitter Tweets: A Model from the Web Ecology Project

Posted in Archiving Solutions, Twitter by Sarah on September 4, 2009

How do you capture that? Image courtesy of tweetwheel

Every day, users of the social media platform Twitter send out streams of “tweets” (short text messages of 140 characters or fewer) to communicate about events, share photos, and link readers, or “followers,” to other on-line sources of information. Thus, when users tweet about human rights events or issues, Twitter becomes a powerful tool for human rights work, both for mobilizing action and documenting events. In the case of human rights, a portion of tweets become first-person records of key events and therefore constitute a valuable potential resource for human rights scholarship, activism, and legal action. However, collecting and archiving those tweets for such work can be challenging due to the volume of tweets produced and their fleeting nature. Fortunately, the Web Ecology Project (WEP; an overview of the organization can be found at the end of this report) has devised a workable solution for harvesting Twitter tweets.[1] By using readily available server technologies, working with Twitter’s established access and data sharing policies, and drawing on the skills of trained programmers, the research team at the WEP collects, stores, and archives massive numbers of Twitter tweets.[2] Their tweet-harvesting set-up is straightforward and can potentially be implemented by any organization wishing to gather similar materials from Twitter, as long as it has access to a programmer who can help manage the process.

The first step to collecting and archiving Twitter tweets is gaining access to Twitter’s Application Programming Interface (API), which WEP accomplished by following a standard application process established by Twitter for permitting access to their data.[3] An API serves as a common access point that allows various programs and platforms to “talk” to each other through shared variables, even if they do not share the same programming language. Basically, the API allows programmers to build applications that share information between platforms (for example, the ability to post Twitter tweets via Facebook or Facebook updates via Twitter).

With API access secured, the next step is to capture and download data from Twitter’s database. The WEP’s programmers accomplish this by writing code that requests data from Twitter’s servers via the API. The code instructs Twitter’s server to harvest data that meet specific search criteria contained in the request, typically key words or phrases that appear in tweets about the event or topic of interest. For example, if a researcher wished to collect tweets related to the 2009 Iranian presidential election, she would submit search terms such as #iranelection, Neda, Ahmadinejad, et cetera. When Twitter’s data server receives the request, it pulls all tweets containing any of the requested terms, bundles them as a data packet, and sends the packet back to the WEP’s server.
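To make the idea of a keyword request concrete, here is a minimal sketch in Python. It is not the WEP's code and it does not reproduce Twitter's actual API: the endpoint URL, the "q" parameter name, and both function names are hypothetical placeholders chosen for illustration. It only shows how a list of search terms might be combined into a single "any of these terms" request, and how that matching rule works on a piece of text.

```python
from urllib.parse import urlencode

# Hypothetical placeholder endpoint; NOT Twitter's real search address.
SEARCH_ENDPOINT = "https://api.example.org/search"


def build_search_url(terms, endpoint=SEARCH_ENDPOINT):
    """Combine search terms into one request URL that asks for ANY of them.

    The "q" parameter name and the "OR" syntax are assumptions made for
    this sketch, not a documented interface.
    """
    query = " OR ".join(terms)
    return endpoint + "?" + urlencode({"q": query})


def matches_any(text, terms):
    """Return True if the text contains at least one term (case-insensitive).

    This mirrors the rule described above: a tweet is pulled when it
    contains any of the requested terms.
    """
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


if __name__ == "__main__":
    terms = ["#iranelection", "Neda", "Ahmadinejad"]
    print(build_search_url(terms))
    print(matches_any("Crowds gather after the #IranElection", terms))
```

A real harvest would send the resulting request over the network with whatever credentials the API requires; that part is left out here because it depends on the service's current rules.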

Once the data arrive on the WEP’s server, the tweets are loaded into a large database as individual text records accompanied by relevant metadata (the time and date the tweet was created, the Twitter user name, and the location, if available). The database is essentially a larger-scale form of an Excel spreadsheet organized in rows and columns; it is the sort of thing that any trained server programmer can create when establishing a server’s architecture. Once the tweets are grouped and stored in this database, they are searchable and sortable, so that both qualitative and quantitative analyses can be run on them. And, most importantly, the database is easily archived and shared because a database of this sort is a basic, long-established format that does not change much over time, meaning that the content will be readable down the line.
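The rows-and-columns storage described above can be sketched with SQLite, a small database engine that ships with Python. This is an illustration only, not the WEP's actual set-up: the table name, the column names, and the tweet_id field (used here to skip duplicates) are assumptions, while the metadata columns follow the list given in the paragraph above.

```python
import sqlite3


def create_archive(path=":memory:"):
    """Open (or create) a tweet archive with one row per tweet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               tweet_id   TEXT PRIMARY KEY,  -- assumed unique identifier
               created_at TEXT NOT NULL,     -- time and date of the tweet
               user_name  TEXT NOT NULL,     -- Twitter user name
               location   TEXT,              -- may be missing (NULL)
               text       TEXT NOT NULL      -- the tweet itself
           )"""
    )
    return conn


def store_tweets(conn, tweets):
    """Insert tweets given as dictionaries; duplicates by tweet_id are skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO tweets "
        "VALUES (:tweet_id, :created_at, :user_name, :location, :text)",
        tweets,
    )
    conn.commit()


def search_tweets(conn, term):
    """Return (user_name, text) pairs whose text contains the term, oldest first."""
    cursor = conn.execute(
        "SELECT user_name, text FROM tweets WHERE text LIKE ? ORDER BY created_at",
        ("%" + term + "%",),
    )
    return cursor.fetchall()
```

Passing a file name instead of ":memory:" writes the archive to a single file on disk, which can then be copied or shared like any other file.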

Though the request and delivery process that the Web Ecology Project has established is rapid and efficient, a couple of important limitations affect it. First, once a code request is sent, harvest and delivery of data are automatic; the request process itself, however, is not. Code must be hand written and manually sent, which can complicate archiving tweets for the duration of an important event. Typically, Twitter users responding to events send out tweets for a few days, which means that data need to be downloaded for the duration of the event in order to capture as much relevant material as possible. Since the WEP programmers have not yet devised a means of sending automated requests to Twitter, they have to manually resend requests for a particular set of terms at regular intervals over the course of several days as they follow a trending topic on Twitter. Second, although Twitter shares its data freely, it stipulates one limitation on harvesting: only data up to five days old may be collected in response to a code request (though Twitter does maintain a database of all of the tweets ever posted since it came on line in 2006). However, these limitations should not hinder harvesting if a researcher or archivist is diligent, begins requesting data shortly after an event begins to trend on Twitter, and then regularly re-sends the request until the event dies down.
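The re-sending routine described above could, in principle, be expressed as a simple loop. The sketch below is hypothetical and is not something the WEP reported using: the fetch argument stands in for whatever code actually sends a request and returns a list of tweets, and the tweet_id field used to discard repeats is an assumption. The only figure taken from the text is the five-day window, which the sketch uses to reject a waiting period so long that tweets could age out between requests.

```python
import time

# Five-day harvesting window described above, expressed in seconds.
FIVE_DAYS_SECONDS = 5 * 24 * 60 * 60


def harvest_repeatedly(fetch, terms, rounds, wait_seconds, sleep=time.sleep):
    """Send the same request several times and merge the results.

    fetch        -- hypothetical callable: takes the terms, returns tweet dicts
    rounds       -- how many times to send the request
    wait_seconds -- pause between requests; must be shorter than the window
    sleep        -- injectable so the pause can be skipped when testing
    """
    if wait_seconds >= FIVE_DAYS_SECONDS:
        raise ValueError("wait must be shorter than the five-day window")

    seen = {}  # tweet_id -> tweet, keeps the first copy of each tweet
    for round_number in range(rounds):
        for tweet in fetch(terms):
            seen.setdefault(tweet["tweet_id"], tweet)
        if round_number < rounds - 1:
            sleep(wait_seconds)  # no pause needed after the final request
    return list(seen.values())
```

Because successive requests overlap heavily while a topic is trending, discarding repeats matters as much as the schedule itself.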

These limitations aside, the process described above provides a model for one means of establishing and maintaining archives of fleeting, first-person, digital documentation of key events produced through social media platforms. Though the Web Ecology Project team established this process for collecting and archiving Twitter data, other social media platforms, such as Facebook, MySpace, or LinkedIn, also use APIs to integrate their functions with other social networking platforms in order to meet users’ desires to work seamlessly between their various social presences on the web. Therefore, the process described for the Web Ecology Project’s Twitter research would also apply to collecting and archiving digital documentation from a variety of social media sources.

Notes on the Web Ecology Project

The Web Ecology Project is an unaffiliated research group made up of independent researchers in the Boston area interested in the social processes of the internet and social media. Members pool their expertise and resources to conduct relevant research into social media trends. To date, the research group has received no outside funds (public or private); members of the WEP have therefore pooled their resources to purchase infrastructure such as servers, and all of the work they do is voluntary. That said, their business model is shifting and evolving as interest in their work grows, and they are contemplating a means of doing contractual research. With regard to the data they collect and archive, the WEP researchers make them available to interested parties when and where appropriate, within the limitations of legal restrictions with Twitter and Twitter users. If you are interested in learning more about data availability, contact the group at www.webecologyproject.org. Dataset availability is dependent upon WEP research; they can only make available data that they originally collected for their own research interests. At the time that this report was written, researchers at the WEP stated that they plan to store all of the databases and archives they create indefinitely as a resource for future investigators. For more information on the goals and objectives of the Web Ecology Project, see their mission statement at: http://www.webecologyproject.org/2009/08/reimagining-internet-studies/.

[1] Special thanks go to Dharmishta Rood of the Web Ecology Project for explaining the data harvesting and archiving process described in this report.

[2] Copyright on all tweets belongs to Twitter users; however, Twitter encourages users to contribute their tweets to the public domain (see http://twitter.com/tos for details on terms of service and copyright). Tweets submitted as such fall under fair use rules for copyright.

[3] See http://apiwiki.twitter.com/ for Twitter’s API application process.
