Harvesting and Preserving Twitter Tweets: A Model from the Web Ecology Project

[Image: How do you capture that? Courtesy of tweetwheel]
Every day, users of the social media platform Twitter send out streams of “tweets” (short text messages of 140 characters or fewer) to communicate about events, share photos, and link readers, or “followers,” to other online sources of information. When users tweet about human rights events or issues, Twitter becomes a powerful tool for human rights work, both for mobilizing action and for documenting events: a portion of those tweets become first-person records of key events and therefore constitute a valuable potential resource for human rights scholarship, activism, and legal action. However, collecting and archiving those tweets can be challenging because of the sheer volume of tweets produced and their fleeting nature. Fortunately, the Web Ecology Project (WEP; an overview of the organization can be found at the end of this report) has devised a workable solution for harvesting Twitter tweets.[1] By using readily available server technologies, working with Twitter’s established access and data-sharing policies, and drawing on the skills of trained programmers, the research team at the WEP collects, stores, and archives massive numbers of tweets.[2] Their tweet-harvesting setup is straightforward and can potentially be implemented by any organization wishing to gather similar materials from Twitter, provided it has access to a programmer who can manage the process.
The first step to collecting and archiving Twitter tweets is gaining access to Twitter’s Application Programming Interface (API), which the WEP accomplished by following the standard application process Twitter has established for granting access to its data.[3] An API serves as a common access point that allows different programs and platforms to “talk” to each other through shared variables, even if they are written in different programming languages. In essence, the API lets programmers build applications that pass information between platforms (for example, posting Twitter tweets via Facebook, or Facebook updates via Twitter).
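To make the idea concrete, the sketch below shows what an API call looks like in practice: an HTTP request to an endpoint, answered with structured data (JSON) that any program can read. The endpoint and field names here are entirely hypothetical and serve only to illustrate the request-and-response pattern, not Twitter’s actual API.

    import json
    import urllib.request

    # Hypothetical endpoint, shown for illustration only. The point is that
    # an API request is just an HTTP call that returns structured data.
    url = "https://api.example.com/v1/messages?author=some_user"

    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))

    # The returned JSON can be parsed by any program, in any language --
    # this is what allows different platforms to share information.
    for message in data.get("messages", []):
        print(message["text"])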
With API access secured, the next step is to capture and download data from Twitter’s database. The WEP’s programmers accomplish this by writing code that requests data from Twitter’s servers via the API. The code instructs Twitter’s server to harvest data that meet specific search criteria contained in the request—typically key words or phrases that appear in tweets about the event or topic of interest. For example, if a researcher wished to collect tweets related to the 2009 Iranian presidential election, she would submit search terms such as #iranelection, Neda, and Ahmadinejad. When Twitter’s data server receives the request, it pulls all tweets containing any of the requested terms, bundles them into a data packet, and sends the packet back to the WEP’s server.
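A minimal sketch of such a request, using Twitter’s public Search API as it existed around 2009 (that endpoint has since been retired, and the exact code the WEP used is not public); the OR operator asks for tweets matching any of the terms:

    import json
    import urllib.parse
    import urllib.request

    # Search terms for the 2009 Iranian election example.
    query = urllib.parse.quote("#iranelection OR Neda OR Ahmadinejad")

    # Twitter's circa-2009 Search API endpoint (since retired);
    # "rpp" set the number of results returned per page.
    url = "http://search.twitter.com/search.json?q=%s&rpp=100" % query

    with urllib.request.urlopen(url) as response:
        packet = json.loads(response.read().decode("utf-8"))

    # Each tweet in the returned packet arrives with its metadata attached.
    for tweet in packet.get("results", []):
        print(tweet["created_at"], tweet["from_user"], tweet["text"])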
Once the data arrive on the WEP’s server, the tweets are dumped into a large database as individual text records accompanied by relevant metadata: the time and date the tweet was created, the Twitter user name, and the user’s location, if available. The database is essentially a scaled-up version of an Excel spreadsheet, organized in rows and columns; it is the sort of structure any trained server programmer can create when establishing a server’s architecture. Once the tweets are grouped and stored in this database, they are searchable and sortable, so both qualitative and quantitative analyses can be run on them. Most importantly, the database is easily archived and shared: databases of this sort rest on fundamental, slow-changing technology, meaning the content will remain readable well into the future.
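One plausible layout for a database of this kind is sketched below using SQLite; this is an illustration of the rows-and-columns structure described above, not the WEP’s actual schema, which is not public. The final query shows how stored tweets become searchable and sortable.

    import sqlite3

    conn = sqlite3.connect("tweets.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            tweet_id   TEXT PRIMARY KEY,
            created_at TEXT,   -- time and date the tweet was created
            user_name  TEXT,   -- Twitter user name
            location   TEXT,   -- user-reported location, if available
            text       TEXT    -- the tweet itself
        )
    """)

    # Each harvested tweet becomes one row, with its metadata alongside
    # the text. The values below are invented for illustration.
    conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?)",
        ("12345", "Sat, 13 Jun 2009 16:00:00 +0000", "example_user",
         "Tehran, Iran", "Example tweet text (#iranelection)"),
    )
    conn.commit()

    # A simple query: every stored tweet mentioning a term, in time order.
    for row in conn.execute(
            "SELECT created_at, user_name, text FROM tweets "
            "WHERE text LIKE '%#iranelection%' ORDER BY created_at"):
        print(row)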
Though the request and delivery process that the Web Ecology Project has established is rapid and efficient, a couple of important limitations affect it. First, once a code request is sent, harvest and delivery of the data are automatic; the request process itself, however, is not. Code must be hand-written and manually sent, which complicates archiving tweets for the duration of an important event. Typically, Twitter users responding to an event send out tweets for a few days, which means that data need to be downloaded over that whole period in order to capture as much relevant material as possible. Since the WEP programmers have not yet devised a means of sending automated requests to Twitter, they must manually resend requests for a particular set of terms at regular intervals over the course of several days as they follow a trending topic. Second, although Twitter shares its data freely, it places one stipulated limit on harvesting: only tweets up to five days old may be collected in response to a request (though Twitter does maintain a database of all tweets posted since the service came online in 2006). Neither limitation should hinder harvesting if a researcher or archivist is diligent, begins requesting data shortly after an event starts to trend on Twitter, and regularly resends the request until the event dies down.
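To be clear, the WEP resends these requests by hand; the sketch below is only a hypothetical illustration of how the re-request loop might be scripted, with harvest() standing in for the search-and-store steps shown earlier. The interval is kept well inside Twitter’s five-day window, and the duration matches the few days an event typically trends.

    import time

    INTERVAL_HOURS = 6    # hypothetical polling interval
    DURATION_DAYS = 4     # roughly how long an event tends to trend

    def harvest(terms):
        # Placeholder: issue the API request and store the results,
        # as in the earlier sketches.
        print("Requesting tweets for:", terms)

    deadline = time.time() + DURATION_DAYS * 24 * 3600
    while time.time() < deadline:
        harvest("#iranelection OR Neda OR Ahmadinejad")
        time.sleep(INTERVAL_HOURS * 3600)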
These limitations aside, the process described above provides a model for establishing and maintaining archives of fleeting, first-person, digital documentation of key events produced through social media platforms. Though the Web Ecology Project team built this process around Twitter, other social media platforms, such as Facebook, MySpace, and LinkedIn, also offer APIs that integrate their functions with other social networking platforms to meet users’ desire to move seamlessly between their various social presences on the web. The process described here would therefore also apply to collecting and archiving digital documentation from a variety of social media sources.
Notes on the Web Ecology Project
The Web Ecology Project is an unaffiliated research group made up of independent researchers in the Boston area interested in the social processes of the internet and social media. Members pool their expertise and resources to conduct research into social media trends. To date, the group receives no outside funds, public or private; members of the WEP have pooled their own resources to purchase infrastructure such as servers, and all of the work they do is voluntary. That said, their business model is shifting and evolving as interest in their work grows, and they are contemplating a means of doing contractual research. With regard to the data they collect and archive, the WEP researchers make them available to interested parties when and where appropriate, within the limits of their legal arrangements with Twitter and Twitter users. If you are interested in learning more about data availability, contact the group at www.webecologyproject.org. Dataset availability depends on WEP research: the group can only make available data that they originally collected for their own research interests. At the time this report was written, researchers at the WEP stated that they plan to store all of the databases and archives they create indefinitely as a resource for future investigators. For more information on the goals and objectives of the Web Ecology Project, see their mission statement at: http://www.webecologyproject.org/2009/08/reimagining-internet-studies/.
[1] Special thanks goes to Dharmishta Rood of the Web Ecology Project for explaining the data harvesting and archiving process described in this report.
[2] Copyright on all tweets belongs to Twitter users; however, Twitter encourages users to contribute their tweets to the public domain (see http://twitter.com/tos for details on terms of service and copyright). Tweets submitted as such fall under fair use rules for copyright.
[3] See http://apiwiki.twitter.com/ for Twitter’s API application process.