Library of Congress Archives Tweets

To quote the Library of Congress, or at least its first tweet and second tweet announcement (follow it at @librarycongress):

Library to acquire ENTIRE Twitter archive — ALL public tweets, ever, since March 2006! Details to follow.
Library acquires ENTIRE Twitter archive. ALL tweets. More info here http://go.usa.gov/ik4

This is old news to most of you, even though it’s barely over 2 days old. Heck, I heard about it on Twitter within a half hour thanks to a re-tweet from someone I follow (@zeldman). Then I saw it picked up on some sites that I follow within a couple hours (“Library of Congress to Preserve Tweets for Eternity” at Mashable). A few hours after that, The New York Times posted it on their site (“Library of Congress Will Save Tweets”). And of course there’s been plenty of chatter about it since (and I wasn’t even at Chirp to see everyone react to the announcement, but you can still read the Twitter announcement on its blog).

The Library of Congress has posted an explanation on its blog (“How Tweet It Is!: Library Acquires Entire Twitter Archive”), explaining that public tweets are getting archived and acknowledging that Twitter processes 50 million tweets a day. This doesn’t fall outside the Library’s domain. It currently archives legal blogs, web sites of candidates for national office, and sites of members of Congress, all dating back to 2000. That data alone already amounts to 167 terabytes. In addition, the Library operates the National Digital Information Infrastructure and Preservation Program at digitalpreservation.gov.

I would be the first to say that the bulk of tweets don’t really contain much value. If anything they show our fascination with the inane and our inability to spell as a culture. However, it also means that the average person can get his/her wacky (140 character) idea into the archives, something generally reserved for those who have the ability to get published (often considered the elite). It’s also a great log of the day-to-day history of the world (as told by the small percentage that tweets) and will provide insight into our current culture for future generations (or space aliens with lots of time to kill).

If you read the Library blog, however, you’ll see comments from people who are clearly missing the point. Someone asks who owns the tweets, but only weighs Twitter against the author, failing to ask about the Library or recognize that libraries don’t hold copyright to works they archive. Another asks why the government thinks it has a right to archive his “PRIVATE” tweets, clearly missing the part where the Library says it is only grabbing public tweets. Someone else thinks this will come back to hurt him/her by saving hurtful/stupid tweets for all time, while failing to recognize that it’s already happening via sites like Google and that perhaps they shouldn’t say just anything in what is truly a public forum. The only comments that have any merit are those lamenting the cost associated with doing this, but then nobody knows the actual cost. For all we know, the Library is just getting handed a giant database file to post to a back-up machine, all of a few minutes’ worth of work. In short, there is some entertaining reading in there.

If you are not at the office and willing to engage in some humor that is in poor taste, you can see the comic that summarizes some of the banality of Twitter. Read the Penny Arcade comic that started a new term.

Penny Arcade comic.

Update: July 20, 2015

Here we are, five years later, and it’s proving to be a slow and difficult process. As New Scientist reports, there are a few big issues:

One problem, of course, is the sheer number of tweets. When the library started the project, there was already a four-year backlog of about 21 billion tweets. Now, half a billion tweets are sent daily.

Another problem is that a tweet isn’t just a string of 140 characters. Each comes with a packet of metadata, including when and where the tweet was sent, who sent it, and how many people marked it as a favourite, shared it or responded. There are over 100 information fields in all.

It turns out running a search against just the data through 2010 can take 24 hours.
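To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above (the 21 billion tweet backlog, 50 million tweets a day in 2010, and half a billion a day in 2015). The constant and function names are my own, and the result is rough arithmetic, not anything the Library or New Scientist published.

```python
# Figures quoted in this post; the names below are made up for illustration.
BACKLOG_TWEETS = 21_000_000_000     # four-year backlog when the project started
TWEETS_PER_DAY_2010 = 50_000_000    # daily volume the Library cited in 2010
TWEETS_PER_DAY_2015 = 500_000_000   # "half a billion" tweets a day in 2015


def days_to_match(backlog: int, per_day: int) -> float:
    """Return how many days of traffic at `per_day` add up to `backlog` tweets."""
    return backlog / per_day


# Days of 2015-level traffic needed to equal the original four-year backlog.
print(days_to_match(BACKLOG_TWEETS, TWEETS_PER_DAY_2015))  # 42.0

# How many times larger the daily volume is in 2015 than in 2010.
print(TWEETS_PER_DAY_2015 // TWEETS_PER_DAY_2010)  # 10
```

In other words, at the 2015 rate Twitter produces the equivalent of that entire first four-year backlog roughly every six weeks, which goes a long way toward explaining why the archive is slow going.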

It also turns out that the EU “right to be forgotten” law can be a bit of a pickle should people demand their tweets be removed (even if I think those requests should be ignored). New Scientist also mentions that.
