The Documentalist

Web Ecology Announces Free Translation Tool

Posted in technology, Twitter by Sarah on September 20, 2009

The Rosetta Stone. Image courtesy of

The Web Ecology Project (covered in the post from September 4, 2009) announced the release of their first open source tool on Friday, September 18, 2009.  The tool works with Google’s language tools to detect, translate, and transliterate text on the Web.  In the words of Jon Beilin, the author of the announcement:

One of the tenets of Web Ecology is accessibility to the field through open tools and open data. At the Web Ecology Project, we’re working to get more of our code in a clean, commented, and releasable state. The first tool that we have queued up for release is a Python module allowing easy use of Google Language Tools, involving language detection and translation, with transliteration in an experimental state (Google has not yet released the API spec for the transliteration portion so that was reverse-engineered).

Please visit the full post describing the Google Language Python Module to see an example of how the code works for translating text on the Web, and to download the program, which is an MIT/X11-licensed release.  Web Ecology plans to continue developing Web research tools and making them available, so keep an eye on the site to learn more as developments emerge.


New Twitter Terms Potentially Impact Archiving

Posted in Archiving Solutions, Twitter by Sarah on September 15, 2009
The confusing world of copyright.  Image courtesy of vtualerts.

Twitter Announces New Copyright Terms

On September 10, 2009, Twitter announced that they have updated their copyright terms for user posts (see their blog at for an overview).  Previously, Twitter simply assured users that their posts were their own, but encouraged people to consider their material as part of the public domain.  The new terms specify that tweets remain the property of users, but that users, by virtue of agreeing to Twitter’s terms and conditions, grant Twitter a worldwide, non-exclusive license to distribute their tweets, as well as the right to make tweets available to outside companies, organizations, or individuals who partner with Twitter.

I’ve included Twitter’s original terms of use and their new terms below so that you can compare and contrast.  It appears that the new terms may allow for straightforward harvesting and archiving of Twitter tweets.

Twitter’s original copyright statement:

Copyright (What’s Yours is Yours)

1. We claim no intellectual property rights over the material you provide to the Twitter service. Your profile and materials uploaded remain yours.  You can remove your profile at any time by deleting your account.  This will also remove any text and images you have stored in the system.

2. We encourage users to contribute their creations to the public domain or consider progressive licensing terms

3. Twitter undertakes to obey all relevant copyright laws.  We will review all claims of copyright infringement received and remove content deemed to have been posted or distributed in violation of any such laws.

(Source: accessed 8/28/2009 at 1:00 pm.  N.B.–clicking on the link to the left will take you to the new terms and conditions of use)

Twitter’s new copyright statement:

Your rights

You retain your rights to any Content you submit, post or display on or through the Services. By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed).

TIP: This license is you authorizing us to make your Tweets available to the rest of the world and to let others do the same. But what’s yours is yours – you own your content.

You agree that this license includes the right for Twitter to make such Content available to other companies, organizations or individuals who partner with Twitter for the syndication, broadcast, distribution or publication of such Content on other media and services, subject to our terms and conditions for such Content use.

TIP: Twitter has an evolving set of rules for how API developers can interact with your content. These rules exist to enable an open ecosystem with your rights in mind.

Such additional uses by Twitter, or other companies, organizations or individuals who partner with Twitter, may be made with no compensation paid to you with respect to the Content that you submit, post, transmit or otherwise make available through the Services.

We may modify or adapt your Content in order to transmit, display or distribute it over computer networks and in various media and/or make changes to your Content as are necessary to conform and adapt that Content to any requirements or limitations of any networks, devices, services or media.

You are responsible for your use of the Services, for any Content you provide, and for any consequences thereof, including the use of your Content by other users and our third party partners. You understand that your Content may be rebroadcasted by our partners and if you do not have the right to submit Content for such use, it may subject you to liability. Twitter will not be responsible or liable for any use of your Content by Twitter in accordance with these Terms. You represent and warrant that you have all the rights, power and authority necessary to grant the rights granted herein to any Content that you submit.

Twitter gives you a personal, worldwide, royalty-free, non-assignable and non-exclusive license to use the software that is provided to you by Twitter as part of the Services. This license is for the sole purpose of enabling you to use and enjoy the benefit of the Services as provided by Twitter, in the manner permitted by these Terms.

Twitter Rights

All right, title, and interest in and to the Services (excluding Content provided by users) are and will remain the exclusive property of Twitter and its licensors. The Services are protected by copyright, trademark, and other laws of both the United States and foreign countries. Nothing in the Terms gives you a right to use the Twitter name or any of the Twitter trademarks, logos, domain names, and other distinctive brand features. Any feedback, comments, or suggestions you may provide regarding Twitter, or the Services is entirely voluntary and we will be free to use such feedback, comments or suggestions as we see fit and without any obligation to you.

(Source: accessed 9/15/2009 at 10:45 am)

Breaking Tweets: Twitter Tweets Informing Human Rights News

Posted in Human Rights news, Twitter by Sarah on September 10, 2009
"A little bird told me" A weekly op-ed column at Breaking Tweets.  Image courtesy of

Here’s a fun site that illustrates how Twitter tweets can make informative news pieces related to human rights.  Breaking Tweets, in its own words, produces “world news, Twitter-style” by creating journalistic news articles based on first-person information posted on Twitter.  Tweets are pulled together into coherent stories; these stories, along with links to the tweets that inform them, are archived at the website.  Many of the news items produced by Breaking Tweets focus on issues directly related to human rights, social justice, or environmental justice.  The site also allows you to view stories by region, grouping them by major geographic areas: Africa, Americas, Asia, Europe, Mideast, and Oceania.

Harvesting and Preserving Twitter Tweets: A Model from the Web Ecology Project

Posted in Archiving Solutions, Twitter by Sarah on September 4, 2009

How do you capture that? Image courtesy of tweetwheel

Every day, users of the social media platform Twitter send out streams of “tweets” (short text messages of 140 characters or fewer) to communicate about events, share photos, and link readers, or “followers,” to other on-line sources of information.  Thus, when users tweet about human rights events or issues, Twitter becomes a powerful tool for human rights work, both for mobilizing action and documenting events.  In the case of human rights, a portion of tweets become first-person records of key events and therefore constitute a valuable potential resource for human rights scholarship, activism, and legal action.  However, collecting and archiving those tweets for such work can be challenging due to the volume of tweets produced and their fleeting nature.  Fortunately, the Web Ecology Project (WEP; an overview of the organization can be found at the end of this report) has devised a workable solution for harvesting Twitter tweets.[1] By using readily available server technologies, working with Twitter’s established access and data sharing policies, and drawing on the skills of trained programmers, the research team at the WEP collects, stores, and archives massive numbers of Twitter tweets.[2] Their tweet-harvesting set-up is straightforward and can potentially be implemented by any organization wishing to gather similar materials from Twitter, as long as they have access to a programmer who can help manage the process.

The first step to collecting and archiving Twitter tweets is gaining access to Twitter’s Application Programming Interface (API), which WEP accomplished by following a standard application process established by Twitter for permitting access to their data.[3] An API serves as a common access point that allows various programs and platforms to “talk” to each other through an agreed set of requests and responses, even if they are not written in the same programming language.  Basically, the API allows programmers to build applications that share information between platforms (for example, the ability to post Twitter tweets via Facebook or Facebook updates via Twitter).

With API access secured, the next step is to capture and download data from Twitter’s database.  The WEP’s programmers accomplish this by writing code that requests data from Twitter’s servers via the API.  The code instructs Twitter’s server to harvest data that meet specific search criteria contained in the request, typically key words or phrases that appear in tweets about the event or topic of interest.  For example, if a researcher wished to collect tweets related to the 2009 Iranian presidential election, she would submit search terms such as: #iranelection, Neda, Ahmadinejad, et cetera.  When Twitter’s data server receives the request, it pulls all tweets containing any of the requested terms, bundles them as a data packet, and sends the packet back to the WEP’s server.
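To make the idea of a keyword request more concrete, here is a minimal sketch of how several search terms might be combined into one request address.  This is not the WEP’s actual code: the endpoint address, the `q` parameter name, and the use of `OR` to mean “any of these terms” are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not Twitter's real address.
SEARCH_ENDPOINT = "https://api.example.org/search"


def build_search_url(terms, endpoint=SEARCH_ENDPOINT):
    """Combine search terms into a single "match any of these" request URL."""
    # Joining with OR is an assumed way of asking for tweets that contain
    # any one of the terms.
    query = " OR ".join(terms)
    # urlencode escapes characters such as "#" and spaces so the URL is valid.
    return endpoint + "?" + urlencode({"q": query})


url = build_search_url(["#iranelection", "Neda", "Ahmadinejad"])
print(url)
# prints https://api.example.org/search?q=%23iranelection+OR+Neda+OR+Ahmadinejad
```

The point of the sketch is only that a “request” is ordinary text: a list of terms, escaped and attached to an address that the receiving server knows how to interpret.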

Once the data arrive in the WEP’s server, the tweets are loaded into a massive database as individual text records accompanied by relevant metadata (the time and date the tweet was created, the Twitter user name, and the location, if available). The database is essentially a large-scale analogue of an Excel spreadsheet organized in rows and columns; it is the sort of thing that any trained server programmer can create when establishing a server’s architecture.  Once the tweets are grouped and stored in this database, they are searchable and sortable, so that both qualitative and quantitative analyses can be run on them.  And, most importantly, the database is easily archived and shared because a database of this sort is a fundamental type of programming that does not change much over time, meaning that the content will be readable down the line.
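The rows-and-columns storage described above can be sketched with the SQLite database that ships with Python.  This is an illustration under stated assumptions, not the WEP’s own set-up: the table layout, the field names, and the three sample records are all invented for the example.

```python
import sqlite3

# Invented sample records; the field names are assumptions for illustration.
sample_tweets = [
    {"id": 1, "user": "alice", "created_at": "2009-06-13T10:15:00",
     "location": "Tehran", "text": "Polls are open #iranelection"},
    {"id": 2, "user": "bob", "created_at": "2009-06-13T09:05:00",
     "location": None, "text": "Long lines at my polling place #iranelection"},
    {"id": 3, "user": "carol", "created_at": "2009-06-14T18:30:00",
     "location": "Boston", "text": "Following news about Ahmadinejad"},
]


def store_tweets(conn, tweets):
    """Create the table if needed and add one row per tweet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, user TEXT, created_at TEXT, "
        "location TEXT, text TEXT)"
    )
    # INSERT OR IGNORE skips a tweet whose id is already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO tweets "
        "VALUES (:id, :user, :created_at, :location, :text)",
        tweets,
    )
    conn.commit()


def search_tweets(conn, keyword):
    """Return (user, created_at, text) rows containing keyword, oldest first."""
    rows = conn.execute(
        "SELECT user, created_at, text FROM tweets "
        "WHERE text LIKE ? ORDER BY created_at",
        ("%" + keyword + "%",),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")  # a file path here would keep the archive on disk
store_tweets(conn, sample_tweets)
matches = search_tweets(conn, "#iranelection")
for user, created_at, text in matches:
    print(created_at, user, text)
```

Each tweet becomes one row and each piece of metadata one column, which is what makes the collection searchable (the `LIKE` clause) and sortable (the `ORDER BY` clause).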

Though the request and delivery process that the Web Ecology Project has established is rapid and efficient, a couple of important limitations affect it. First, once a request is sent, harvest and delivery of data are automatic; the request process itself, however, is not.  Code must be written by hand and sent manually, which can complicate archiving tweets for the duration of an important event.  Typically, Twitter users responding to events send out tweets for a few days, which means that data need to be downloaded for the duration of the event in order to capture as much relevant material as possible.  Since the WEP programmers have not yet devised a means of sending automated requests to Twitter, they have to manually resend requests for a particular set of terms at regular intervals over the course of several days as they follow a trending topic on Twitter. Second, although Twitter shares their data freely, they have one stipulated limitation on harvesting: only data up to five days old may be collected in response to a request (though Twitter does maintain a database of all of the tweets ever posted since it came on line in 2006).  However, these limitations should not hinder harvesting if a researcher or archivist is diligent and begins requesting data shortly after an event begins to trend on Twitter and then regularly re-sends the request until the event dies down.
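The resend-and-merge routine described above can be sketched as a simple loop.  The WEP performs this step by hand, so the code below is only a hypothetical illustration of the logic: `fetch` stands in for whatever function actually sends the request, and the record fields (`id`, `created_at`) are assumed names.

```python
import time
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=5)  # the five-day limit described in the text


def harvest(fetch, terms, rounds, interval_seconds, sleep=time.sleep):
    """Resend the same request `rounds` times and merge the results.

    `fetch` is a caller-supplied (hypothetical) function that takes the
    search terms and returns a list of dicts with "id" and "created_at" keys.
    """
    # Waiting longer than the five-day window between requests would let
    # tweets expire before they are collected.
    if timedelta(seconds=interval_seconds) >= MAX_AGE:
        raise ValueError("interval must be shorter than the five-day window")
    collected = {}
    for round_number in range(rounds):
        for tweet in fetch(terms):
            # Overlapping requests return some tweets twice; keep one copy.
            collected.setdefault(tweet["id"], tweet)
        if round_number < rounds - 1:
            sleep(interval_seconds)
    return sorted(collected.values(), key=lambda t: t["created_at"])


def make_fake_fetch():
    """Stand-in for a real request: returns two invented, overlapping batches."""
    batches = iter([
        [{"id": 1, "created_at": datetime(2009, 6, 13, 9, 0)},
         {"id": 2, "created_at": datetime(2009, 6, 13, 10, 0)}],
        [{"id": 2, "created_at": datetime(2009, 6, 13, 10, 0)},
         {"id": 3, "created_at": datetime(2009, 6, 14, 8, 0)}],
    ])
    return lambda terms: next(batches)


tweets = harvest(make_fake_fetch(), ["#iranelection"], rounds=2,
                 interval_seconds=3600, sleep=lambda seconds: None)
print([t["id"] for t in tweets])  # prints [1, 2, 3]
```

The two things the sketch highlights are the ones the paragraph stresses: requests must be repeated more often than the five-day window, and repeated requests overlap, so duplicates have to be discarded when the batches are merged.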

These limitations aside, the process described above provides a model for one means of establishing and maintaining archives of fleeting, first-person, digital documentation of key events produced through social media platforms.  Though the Web Ecology Project team established the process explained above for collecting and archiving Twitter data, other social media platforms, such as Facebook, MySpace, or LinkedIn, also use APIs to integrate their functions with other social networking platforms in order to meet users’ desires to work seamlessly between their various social presences on the web.  Therefore, the process described for the Web Ecology Project’s Twitter research could also apply to collecting and archiving digital documentation from a variety of social media sources.

Notes on the Web Ecology Project

The Web Ecology Project is an unaffiliated research group made up of independent researchers in the Boston area interested in the social processes of the internet and social media.  Members pool their expertise and resources to conduct relevant research into social media trends.  To date, the research group receives no outside funds (public or private); members of the WEP have therefore pooled their resources to purchase infrastructure such as servers, and all of the work they do is voluntary.  That said, their business model is shifting and evolving as interest in their work grows, and they are contemplating a means of doing contractual research.  With regard to the data they collect and archive, the WEP researchers make them available to interested parties when and where appropriate, within the limitations of legal restrictions with Twitter and Twitter users.  If you are interested in learning more about data availability, contact the group at Dataset availability is dependent upon WEP research; they can only make data available that they originally collected for their own research interests.  At the time that this report was written, researchers at the WEP stated that they plan to store all of the databases and archives they create indefinitely as a resource for future investigators.  For more information on the goals and objectives of the Web Ecology Project, see their mission statement at:

[1] Special thanks go to Dharmishta Rood of the Web Ecology Project for explaining the data harvesting and archiving process described in this report.

[2] Copyright on all tweets belongs to Twitter users; however, Twitter encourages users to contribute their tweets to the public domain (see for details on terms of service and copyright).  Tweets submitted as such fall under fair use rules for copyright.

[3] See for Twitter’s API application process.