If you'd forgotten your drunken or embarrassing tweets from 2010, bad news: the Library of Congress is reportedly weeks away from finishing its project to archive the roughly 170 billion tweets sent between Twitter's founding in 2006 and April 2010, when the initiative was announced. Why are they archiving your tweets? All in the name of science and research.
Twitter is a new kind of collection for the Library of Congress, but an important one to its mission of serving both Congress and the public. As society turns to social media as a primary method of communication and creative expression, social media is supplementing and in some cases supplanting letters, journals, serial publications and other sources routinely collected by research libraries.
Archiving and preserving outlets such as Twitter will enable future researchers access to a fuller picture of today's cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes.
Only public tweets that were published six months ago or longer will be included in the library, which has so far received over 400 requests for information from researchers around the world.
Sounds like an interesting and useful enough project, right? Well, don't get too excited yet; it reportedly takes a full 24 hours to perform just one search, which seems mind-boggling and impossible (or, as the Library puts it, "an inadequate situation") until you consider the nightmarish, labyrinthine manner in which the Library is organizing the information.
Gnip, the designated delivery agent for Twitter, receives tweets in a single real-time stream from Twitter. Gnip organizes the stream of tweets into hour-long segments and uploads these files to a secure server throughout the day for retrieval by the Library. When a new file is available, the Library downloads the file to a temporary server space, checks the materials for completeness and transfer corruption, captures statistics about the number of tweets in each file, copies the file to tape, and deletes the file from the temporary server space.
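To make that workflow concrete, here is a minimal Python sketch of the per-file steps (verify, count, archive, delete). It is not the Library's actual code, and it leans on several assumptions: each hour-long segment holds one tweet per line, a SHA-256 checksum is supplied alongside the file, and a plain directory stands in for tape. The names `sha256_of`, `ingest_segment`, and `archive_dir` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large segments need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_segment(temp_file: Path, expected_sha256: str, archive_dir: Path) -> int:
    """Verify, count, archive, and clean up one hour-long segment."""
    # Check for transfer corruption before touching the archive.
    # (Assumption: the delivery agent publishes a checksum per file.)
    if sha256_of(temp_file) != expected_sha256:
        raise ValueError(f"checksum mismatch for {temp_file.name}")

    # Capture statistics: count non-empty lines as tweets.
    # (Assumption: one tweet per line.)
    with temp_file.open("rb") as handle:
        tweet_count = sum(1 for line in handle if line.strip())

    # Copy to long-term storage; a directory stands in for tape here.
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(temp_file, archive_dir / temp_file.name)

    # Free the temporary server space.
    temp_file.unlink()
    return tweet_count
```

Notice what is missing: nothing in this loop builds an index. Files are written out in arrival order and never organized by user, keyword, or topic, which goes a long way toward explaining why a single search has to crawl through everything.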
The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called "distributed and parallel computing". To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.
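For a rough sense of what "divide and simultaneously search" means, here is a small Python sketch that scans several segment files at once and merges the hits. It is only an illustration: a thread pool on one machine stands in for the fleet of servers the Library describes, the segments are assumed to be UTF-8 text files with one tweet per line, and `search_segment` and `search_archive` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def search_segment(path: Path, term: str) -> list[str]:
    """Scan one segment sequentially and return the lines containing term."""
    needle = term.lower()
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if needle in line.lower()]


def search_archive(segments: list[Path], term: str, workers: int = 8) -> list[str]:
    """Search every segment concurrently and merge hits in segment order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker scans a different file; map() preserves input order.
        per_segment = pool.map(lambda p: search_segment(p, term), segments)
        return [hit for hits in per_segment for hit in hits]
```

Each segment still gets a full linear scan, so the speedup comes only from running more scans at the same time. That is why the Library's estimate climbs into the hundreds or thousands of servers for an archive this size.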
Yikes. If you're one of those 400 curious researchers, whose noble pursuits range from "patterns in the rise of citizen journalism and interest in elected officials' communications to tracking vaccination rates and predicting stock market activity," you'll probably need to wait a while longer to get actual, usable information. Or, if you don't have the patience, just hire a small army of interns.