useR! 2008: Gary King and the DataVerse Network


Data get lost or become unusable after a remarkably short time. What I try to achieve in Read.isi has been done on a massive scale by Gary King. He recognized how quickly data become inaccessible and how heterogeneous data formats are, and decided to initiate the Dataverse Network. The Dataverse Network is a server-based approach to storing, managing, and sharing the data that result from the countless surveys and experiments performed in science.

Intended to store virtually all data ever collected, the system uses its own algorithm to convert data into a single storage format. With everything in one format, storage becomes more reliable and analysis easier.

All very promising, but before everybody is willing to donate their data to archives, we need a solution to the political problem of receiving credit when people use your data. Basically, this requires a form of citation that applies to data as well, so that data can be referenced at the end of articles. The most distinctive part of this citation is the UNF code, which represents the content of the data in a unique, short string. This string helps to identify data sets without conveying any information about the content of the data. Confidentiality is thus retained.
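To give an idea of how such a fingerprint could work, here is a minimal Python sketch. It is emphatically not the actual UNF algorithm: the function names, the rounding to seven significant digits, the separator bytes, and the truncation to 16 bytes are all my own assumptions, chosen only to illustrate the principle. The idea is to write every value in a canonical text form first, so that the same data yield the same string regardless of how they were stored, and then to pass the result through a one-way hash, so that the string reveals nothing about the values themselves.

```python
import base64
import hashlib


def normalize(value, digits=7):
    """Write a value in a canonical text form (hypothetical rules, not the UNF spec)."""
    if value is None:
        return "missing"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        # Scientific notation with a fixed number of significant digits,
        # so 1, 1.0 and 0.1 + 0.2 vs 0.3 do not produce different fingerprints.
        return format(float(value), f".{digits - 1}e")
    return str(value)


def fingerprint(column, digits=7):
    """Return a short, one-way fingerprint of a sequence of values."""
    h = hashlib.sha256()
    for value in column:
        h.update(normalize(value, digits).encode("utf-8"))
        h.update(b"\n\x00")  # separator so that ["ab", "c"] differs from ["a", "bc"]
    # Keep the first 16 bytes of the digest and encode them as printable text.
    return base64.b64encode(h.digest()[:16]).decode("ascii")


# The same numbers stored as integers or as floats give the same fingerprint,
# while changing a single value gives a different one.
same = fingerprint([1, 2, 3]) == fingerprint([1.0, 2.0, 3.0])
different = fingerprint([1, 2, 3]) != fingerprint([1, 2, 4])
```

A citation could then carry such a string next to author, title, and year, so that a reader can check whether the data set at hand is the one that was cited.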

All the data are stored in a central archive, the DataVerse Network, but people can make the network directly accessible from their own website. In that way, you can present on your own website the data you collected yourself, the data you use in your papers, the data you recommend to your students, or whatever selection you'd like.

What is especially impressive, though, is that it is possible to perform advanced statistical analyses from within the DataVerse archive. They achieved this by writing the Zelig package, which should become ‘everyone’s statistical software’. It is basically a wrapper around many existing functions: the author of each function only needs to write a small bridge function between that function and the Zelig package. In that way, a universal syntax is achieved.
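The wrapper idea can be sketched in a few lines. Zelig itself is an R package, and the Python below is not its API: every name here (`BRIDGES`, `register`, `fit`, and the two toy estimators) is hypothetical and only meant to show the pattern. Two estimators with incompatible signatures each get a small bridge function, after which the user calls both through one common syntax.

```python
from statistics import mean

# Registry that maps a model name to its bridge function (hypothetical design).
BRIDGES = {}


def register(name):
    """Decorator that adds a bridge function to the registry under `name`."""
    def decorator(bridge):
        BRIDGES[name] = bridge
        return bridge
    return decorator


# Two stand-ins for independently written estimators with different signatures.
def fit_mean(values):
    return mean(values)


def fit_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Bridge functions: translate the common call into each estimator's own call.
@register("mean")
def bridge_mean(outcome, predictors, data):
    return {"model": "mean", "estimate": fit_mean(data[outcome])}


@register("slope")
def bridge_slope(outcome, predictors, data):
    return {"model": "slope", "estimate": fit_slope(data[predictors[0]], data[outcome])}


def fit(model, outcome, predictors, data):
    """The single entry point: the same syntax for every registered model."""
    if model not in BRIDGES:
        raise ValueError(f"no bridge registered for model {model!r}")
    return BRIDGES[model](outcome, predictors, data)


data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]}
slope_result = fit("slope", "y", ["x"], data)
mean_result = fit("mean", "y", [], data)
```

The point of the design is that adding a new model touches only its own bridge function; the call that the user types stays the same.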

I think that this is an excellent initiative, especially the attempt to unify data and to protect them from the wear of the ages. What makes this a potentially large success is that the developers clearly thought about the (political) structure present in science. Instead of trying to change that (and failing miserably), Gary King and his colleagues accepted the situation as it is and built upon it as best they could. So for now, I'll go gather some data and store it as soon as possible on my own part of the DataVerse Network.
