Coding, is? Fun!

Sunday, January 04, 2009

Internet Archiving - who owns my data?

(Update below)

Something has been bothering me about the internet (or more precisely, the collection of websites I use on the internet).
As a user, I now have a broad range of options for publishing my content (such as blogs, images and video). But I do feel paranoid about certain qualities of the current internet.

Let me describe my situation - I write comments in several online discussion forums, I write blogs, and I upload videos. I also write comments on other people's blogs, whichever platform they may be on, and I save my bookmarks online.
Now, I value all this content I put on the web. For example, in a debate you may come up with a new way of looking at something, an effective reply to a "talking point", or a key piece of data that shuts everybody up. I am not talking about other people's content - I am specifically talking about content I myself put on the web in different websites. This is the age of User Generated Content and my content is distributed across different websites.
There are a couple of problems that I face with this distributed content:
1. How can I aggregate all my content and get updates when someone replies to me or links to me? This is a problem that RSS solves. I will not elaborate on this here.
2. How can I collect and provide a kind of catalog of my opinions across all these different websites? Let us say that in the near future I seek admission to Harvard. Is there some way that I can provide a collection of all my valuable content to Harvard so that I can be credentialed? Looking into the future, we can expect a new generation to start creating their identity online in their teens and thus leave a trail of their work and impressions (in whatever format) across the internet as they grow older. How can someone maintain this digital trail and leverage it?
The point is that this problem existed even before the age of the internet. For example, if you wanted to collect the complete works of Einstein, you went to every university he ever went to and searched its libraries and archives. Some of these archives are digitized now, but nobody came up with an easy mechanism to package up your life's work. It was close to impossible in the pre-internet world to have a centralized collection of all of one's life's work.
But this problem is solvable in the digital age - the Facebook, Orkut, MySpace generation is going to have access to the internet most of the time. Fifty years from now, it is reasonable to expect that a biographer or an anthologist could assemble a person's life's work from the digital world alone.

So, what prevents me from getting a digital collection of my own work distributed across the internet, right now?
I can, of course, prepare a set of links to my comments, my DailyKos blogs, my Blogspot blogs and my YouTube videos. But that is all I can do - the websites reserve the right to invalidate these links at any point in time. In fact, twenty years from now, many of these sites may have switched off their servers and gone home.
As an internet contributor, the core problem I face is this - the rich data that is part of "my" internet is not owned by me. It is owned by at least fifteen different websites. The same problem applies to everyone using the internet.
When I write an article on DailyKos, I want my article, along with the comments (which provide context), to be available for posterity. But I have no control over when they may "retire" the article or when Markos closes the site down.
In theory, this is no different from the problems faced by preceding generations - if you were a newspaper columnist, you took paper cuttings of your column, probably photocopied them and kept them at home. That is all you could do.
I think we in the internet age should demand more, though - because more is possible now. For example, taking a printout of a webpage with my article is not good enough - because someone could be commenting on that article this minute, and I don't want to lose that context.
With their myriad ways of annotating, commenting on and extending our content, the websites of the internet have made my content richer, more contextual and more centralized than in the pre-internet world. YouTube, Blogspot and Rediff have all made my contributions richer, but because of that it is all the more important that I be able to catalog and archive that content.

The tension here is between two poles - the websites have enabled me to contribute and reach a broad audience. For the survival of their business, they want me to keep coming back to their pages. So the data stays in different forms in different websites. I, on the other hand, would like to extract my data (in some format) and keep it in a set of archives so that the data is available for posterity. I am worried that all my valuable contributions will be gone some twenty years later.

I may sound paranoid, but I do care about the longevity of my thoughts - I think everyone who contributes to the internet does.

We should not allow a repeat of what happened in past centuries - there was no centralized publishing, so much work of enormous value was lost in a short time because the medium (like parchment or paper) perished. In this age almost everyone with access to the internet can publish their opinions and share their knowledge. It is all searchable. We should leverage the advantage of the digital medium and come up with a solution for extracting our data (even for a fee), along with some standards for extracting User Generated Content.
We also need a standard archiving solution that is not tied to any particular website. For a fee I should be able to store my data on different servers that are part of the meta-internet.
By the way, there is a browser add-in that provides a way of extracting any webpage for your personal storage. But the storage is still owned by the service behind it - I think we need an internet archiving project that is more community-driven.
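
To make the idea of a site-independent archive a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not an existing standard: the ArchiveRecord fields, the "ugc-archive-sketch/0" format label and the function names are all hypothetical. Each record keeps the content together with the replies that give it context, and a checksum lets any server holding a copy prove that the copy is intact.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ArchiveRecord:
    """One piece of user-generated content, captured with its context.

    All field names here are hypothetical; no existing standard is implied.
    """
    author: str
    source_url: str      # where the content originally lived
    captured_at: str     # ISO 8601 timestamp of the capture
    body: str            # the article, comment or description itself
    replies: List[str] = field(default_factory=list)  # comments giving context

    def checksum(self) -> str:
        """SHA-256 over a canonical JSON form, so any copy can be verified."""
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def export_catalog(records: List[ArchiveRecord]) -> str:
    """Serialize records to a plain JSON document that any server could store."""
    entries = [{"record": asdict(r), "sha256": r.checksum()} for r in records]
    return json.dumps(
        {"format": "ugc-archive-sketch/0", "entries": entries},
        sort_keys=True, indent=2, ensure_ascii=False,
    )


def verify_catalog(document: str) -> bool:
    """Return True if every entry still matches its stored checksum."""
    data = json.loads(document)
    for entry in data["entries"]:
        record = ArchiveRecord(**entry["record"])
        if record.checksum() != entry["sha256"]:
            return False
    return True
```

Because the catalog is plain JSON with no reference to the software of any one website, the same file could be copied to several independent servers, and each copy can be checked with verify_catalog. Getting the content out of each website in the first place is the part that still needs the websites' cooperation.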

Update I
Refer to what happened to Soapblox in this article.
The content of several blogs running on the Soapblox platform was almost wiped out by hackers. Several years' work could have been lost. This is why we need an open source extraction and archiving system.



  • That's true, point well made. All our data is at stake for sure. By the way, it's a good hint for a new business idea. You should explore further and come up with a business solution, a product in the making... People will love it if we get them a solution.

    By Blogger Vasanth Kumar Gopalakrishnan, at 7:49 AM  

  • There is a lot of talk going on about this topic - recently I read an article in The Hindu newspaper

    By Blogger Mathi Rajan, at 9:32 PM  
