Rense Nieuwenhuis » data

WorldBank on iPhone: Great initiative, but not quite there …

Rense Nieuwenhuis — Thu, 27 May 2010 10:00:17 +0000

Last Tuesday, I wrote about the Open Data initiative of the World Bank, and mentioned the iPhone app that provides access to a great amount of data. Isn’t it lovely to be able to access information on a large number of indicators, covering many countries and years? Have you always wondered how the fertility rates in Samoa developed over time, or do you find yourself discussing country differences in government debt as a percentage of GDP? Now you can have the information in your pocket, and access it anywhere, at any time. For free.

I have been playing with this app, and found it very easy to use. First, you have to select one of the many available indicators, one or several countries, and the time-span you wish to plot. The first picture below shows a (very small) part of the list of indicators. Unfortunately, on my iPhone I was only able to access the indicators starting with A to C. I suppose the others will be added soon.

The second image shows the screen in which the plot is set up: an indicator is selected, and some countries are added to the list. After saving these settings, a screen pops up with the other plots you defined (third image). It is very nice to be able to save the data you want to access quickly. Selecting one of these shows the graphic as line plots (fourth image). You can even save the image of the graph to your photo library, or send the image or its data to an e-mail address. Unfortunately, on my iPhone, the graph doesn’t show properly: the lines do not align to the x-axis. In addition, I do not think these particular lines representing fertility rates are all that accurate.

The idea of a WorldBank DataFinder app for the iPhone is very nice. I think the app is well designed: it is easy to set up a graphic, and you can store several graphics for easy reference later on. The current version contains some serious bugs, but I reckon these will be resolved easily. Once they are ironed out, this is going to be a great app.

I reviewed version 1.1 of the World Bank DataFinder app on my iPhone 3G (OS version 3.1.3).

World Bank Open Data Initiative

Rense Nieuwenhuis — Mon, 24 May 2010 10:00:41 +0000

Following upon my declaration of the Decade of Data, I think it is very impressive that the World Bank decided to share its data. As part of their ‘open data initiative’, data from their large number of databases is made available through the internet. Together, these databases encompass over 2,000 indicators of countries all over the world, many of them covering a time-series of 50 years. Topics include:

If the mission of the World Bank is to “fight poverty with passion and professionalism for lasting results and to help people help themselves and their environment by providing resources, sharing knowledge, building capacity and forging partnerships in the public and private sectors”, I believe this initiative can only help. More data, more dissemination, and more people using the data will all contribute to more knowledge.

The website allows for making graphics, maps and tables using the data. It is also possible to download the data: be sure to do so, for it immediately gives access to the complete time-series. Moreover, the World Bank created an interface allowing other applications to access the data directly. One cool application using this interface has already been created by the people of the World Bank themselves: an iPhone application is available. Imagine having all this data in your pocket!

Thanks to

Have a great Decade of Data!

Rense Nieuwenhuis — Wed, 12 May 2010 10:00:41 +0000

Now that I am trying to get to a regular blogging schedule, I realized that I have not wished my readers a happy new year. Although I am traditionally late with these kinds of things, I suppose it is now too late to wish you all a very happy 2010. But perhaps it is not too late to wish you all a great new decade?

I think that 2010 could be the beginning of a beautiful decade. The Decade of Data perhaps? There have been so many data-related developments the last couple of years, that I tend to believe that a lovely stage has been set.

During the last decade, several books that are closely related to data availability have become immensely popular. Freakonomics may be the most prominent example of this new kind of book. It combines a popular way of writing about advanced statistical techniques with applications to interesting sets of data. Ian Ayres’ Super Crunchers perhaps takes this approach even further, by describing more about the nature of the applied statistical techniques (experiments and regression analysis) and by making the most of the increasing availability of data.

The improvements in data analysis (including the abovementioned, but also more academic innovations) are perhaps surpassed only by the improvements in data availability. For instance, Hans Rosling promotes the public availability and use of large amounts of data, and does so by providing the public with the means of creating mesmerizing graphics. See more on the website Gapminder.org.

Data collection is one thing, but data maintenance is something completely different. Gary King recognizes the speed at which data becomes inaccessible, and how heterogeneous data formats are, and decided to initiate the Dataverse Network. The Dataverse Network is a server-based approach to storing, managing, and providing to others the data resulting from the countless surveys and experiments performed in science. I think it is an impressive attempt to help (academic) researchers find and share their data.

Also, governments are trying to open up the collections of data their decisions are based upon. Think about the possibilities of using these data, either for checking your government, or for (other) academic purposes! For instance, in the USA, government databases are made public on data.gov. From their website:

As a priority Open Government Initiative for President Obama’s administration, Data.gov increases the ability of the public to easily find, download, and use datasets that are generated and held by the Federal Government. Data.gov provides descriptions of the Federal datasets (metadata), information about how to access the datasets, and tools that leverage government datasets. The data catalogs will continue to grow as datasets are added. Federal, Executive Branch data are included in the first version of Data.gov.

The above are merely a few examples of the exciting developments regarding the public availability of data. I will continue to write about this, both detailing the examples given above and adding more lovely examples. An overview of the data I find interesting is collected here.

useR! 2008: Gary King and the DataVerse Network

Rense Nieuwenhuis — Thu, 14 Aug 2008 08:08:50 +0000

Data gets lost or becomes unusable after a remarkably short time. What I try to achieve in Read.isi was done on a massive scale by Gary King. He recognizes the speed at which data becomes inaccessible, and how heterogeneous data formats are, and decided to initiate the Dataverse Network. The Dataverse Network is a server-based approach to storing, managing, and providing to others the data resulting from the countless surveys and experiments performed in science.

Intended to store virtually all data ever collected, the system has its own, unique, algorithm to store data in a single format. Based on this, storage and analysis can be exceptionally reliable and easy to use.

All very promising, but before everybody will be willing to donate their data to archives, we will need to find a solution to the political problem of receiving credit when people use your data. Basically, this requires a form of citation that applies to data as well, so that data can be referenced at the end of articles. The most distinctive part of this is formed by UNF codes, which represent the content of the data in a unique, short string. This string helps to identify data-sets without conveying any information about the content of the data. Confidentiality is thus retained.

All the data are stored in a central archive, the DataVerse Network, but people can have the network accessed directly from their own website. In that way, you can present from your own website the data you collected yourself, the data you use in your papers, the data you recommend to your students, or whatever selection you’d like.

What is especially impressive, though, is that it is possible to perform advanced statistical analyses from within the DataVerse archive. They achieved this by writing the Zelig package, which should become ‘everyone’s statistical software’. It is basically a wrapper for many functions, whose authors need to write a small bridge-function between their function and the Zelig package. In that way, a universal syntax is achieved.

I think that this is an excellent initiative, especially the attempt to unify data and to protect it from the wear of the ages. What makes this a potentially large success is that the developers clearly thought about the (political) structure present in science. Instead of trying to change that (and failing miserably), Gary King and his colleagues accepted the situation as it is and built upon it as best they could. So for now, I’ll go gather some data and store it as soon as possible on my own part of the DataVerse Network.

R-Sessions 09: Data Manipulation

Rense Nieuwenhuis — Mon, 11 Aug 2008 10:00:39 +0000

Today’s edition of R-Sessions deals with the manipulation of data that is stored in R-Project. Building upon the previous R-Session, attention is paid to the recoding of data, ordering, and finally the merging of several sets of data.

Recoding

The most direct way to recode data in R-Project is to use a combination of indexing and conditionals, as described elsewhere. To exemplify this, a simple data.frame will be created below, containing variables indicating gender and monthly income in thousands of euros.

gender <- c("male", "female", "female", "male", "male", "male", "female")
income <- c(54, 34, 556, 57, 88, 856, 23)
data <- data.frame(gender, income)
data

> gender <- c("male", "female", "female", "male", "male", "male", "female")
> income <- c(54, 34, 556, 57, 88, 856, 23)
> data <- data.frame(gender, income)
> data
  gender income
1   male     54
2 female     34
3 female    556
4   male     57
5   male     88
6   male    856
7 female     23

Some of the values on the income variable seem exceptionally high. Let’s say we want to remove the two values on income higher than 500. The which() command reveals which of the values are greater than 500. Next, the same condition is used for indexing the data$income variable. Finally, the indicator for missing values, ‘NA’, is assigned to the selected values of the ‘income’ variable. Obviously, we would normally only use the third line. The first two are shown here to make clear exactly what is happening.

which(data$income > 500)
data$income[data$income > 500]
data$income[data$income > 500] <- NA
data

> which(data$income > 500)
[1] 3 6
> data$income[data$income > 500]
[1] 556 856
> data$income[data$income > 500] <- NA
> data
  gender income
1   male     54
2 female     34
3 female     NA
4   male     57
5   male     88
6   male     NA
7 female     23
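
As a small aside to the session above, the sketch below checks that indexing with the positions returned by which() and indexing with the logical condition itself select the same values. It also recodes into a new variable, so that the original values are kept. The names too_high, positions and income_clean are chosen here for illustration only.

```r
# Rebuild the example data so that the sketch is self-contained
gender <- c("male", "female", "female", "male", "male", "male", "female")
income <- c(54, 34, 556, 57, 88, 856, 23)
data <- data.frame(gender, income)

# The condition itself is a logical vector; which() turns it into positions
too_high <- data$income > 500
positions <- which(too_high)

# Both forms of indexing select the same values
identical(data$income[too_high], data$income[positions])

# Recode into a new variable (income_clean is an illustrative name),
# so that the original income values are retained
data$income_clean <- ifelse(too_high, NA, data$income)
data
```

Keeping the original variable next to the recoded one makes it easy to check afterwards which values were changed.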

Sometimes, it is desirable to replace missing values by the mean of the respective variable. That is what we are going to do here. Note that in general practice it is not very sensible to impute two missing values using only five valid values. Nevertheless, we will proceed here.
The first row of the example below shows that it is not automatically possible to calculate the mean of a variable that contains missing values. Since R-Project cannot compute a valid value, NA is returned. This is not what we want. Therefore, we instruct R-Project to remove missing values by adding na.rm=TRUE to the mean() command. Now, the right value is returned. The selection technique used above does not work for missing values, because comparing a value with NA yields NA rather than ‘TRUE’ or ‘FALSE’. Therefore, we need the is.na() command, which returns a vector of logicals (‘TRUE’ and ‘FALSE’). Using is.na(), we can use the which() command to select the desired values on the income variable. To these, the calculated mean is assigned.

mean(data$income)
mean(data$income, na.rm=TRUE)
data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
data

> mean(data$income)
[1] NA
> mean(data$income, na.rm=TRUE)
[1] 51.2
> data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
> data
  gender income
1   male   54.0
2 female   34.0
3 female   51.2
4   male   57.0
5   male   88.0
6   male   51.2
7 female   23.0
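
The following minimal sketch, which is not part of the original session, illustrates why is.na() is needed: a comparison with NA never returns TRUE. It also shows that the logical vector returned by is.na() can be used for indexing directly, without which(). The vector income is recreated here with the two missing values already in place.

```r
# Income values with the two missing values from the example above
income <- c(54, 34, NA, 57, 88, NA, 23)

# Comparing with NA yields NA for every element, never TRUE or FALSE
income == NA

# is.na() returns a proper logical vector
is.na(income)

# The logical vector can index directly; which() is optional here
income[is.na(income)] <- mean(income, na.rm = TRUE)
income
```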

Ordering

It is easy to sort a data-frame using the command order. Combined with indexing functions, it works as follows:

x <- c(1,3,5,4,2)
y <- c('a','b','c','d','e')
df <- data.frame(x,y)

df
  x y
1 1 a
2 3 b
3 5 c
4 4 d
5 2 e

df[order(df$x),]
  x y
1 1 a
5 2 e
2 3 b
4 4 d
3 5 c
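
As a hedged extension of the example above, order() can also sort from high to low, or on several variables at once. The grouping variable g and the names df_desc and df_multi are made up for this sketch.

```r
x <- c(1, 3, 5, 4, 2)
y <- c('a', 'b', 'c', 'd', 'e')
g <- c('k', 'j', 'k', 'j', 'k')   # illustrative grouping variable
df <- data.frame(x, y, g)

# Sort from high to low on x
df_desc <- df[order(df$x, decreasing = TRUE), ]
df_desc

# Sort on g first, and on x within each value of g
df_multi <- df[order(df$g, df$x), ]
df_multi
```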

Merging

The merge() command puts two data.frames together, based on a unique identifier variable or on a combination of variables.

x <- c(1,2,5,4,3)
y <- c(1,2,3,4,5)
z <- c('a','b','c','d','e')

df1 <- data.frame(x,y)
df2 <- data.frame(x,z)
df3 <- merge(df1,df2,by=c("x"))

df3
  x y z
1 1 1 a
2 2 2 b
3 3 5 e
4 4 4 d
5 5 3 c
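
In the example above, every identifier occurs in both data.frames. The sketch below, using small made-up data.frames, shows what happens when that is not the case: by default merge() keeps only the identifiers found in both, while all = TRUE keeps all of them and fills the gaps with NA.

```r
# Two illustrative data.frames that share only the identifiers 2 and 3
df1 <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
df2 <- data.frame(x = c(2, 3, 4), z = c('b', 'c', 'd'))

# Default: only identifiers present in both data.frames are kept
inner <- merge(df1, df2, by = "x")
inner

# all = TRUE keeps every identifier and fills the gaps with NA
outer <- merge(df1, df2, by = "x", all = TRUE)
outer
```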

useR! 2008: Retrieving old data using ‘read.isi’

Rense Nieuwenhuis — Sun, 18 May 2008 19:09:19 +0000

Today I was notified that my proposal for a presentation at useR! 2008, the R User Conference 2008, has been approved. Actually, I applied for a poster presentation, but apparently the organization upgraded it to a full presentation. The presentation will be on a macro I programmed, enabling me to retrieve old statistical data that were incompatible with commonly used statistical programs.

From the proposal:

Due to technological and software development, it sometimes is no longer possible to automatically read older data-files into statistical software. Especially data-files that originate from the times magnetic tapes were used to store data are often distributed as raw (ASCII) data, without proper means to read those data into statistical packages.
However, for those interested in using data to perform longitudinal analyses, these older sets of data are very valuable.
In the Netherlands, the national archive for data storage (DANS) is currently organizing conferences on a unified and time-proof manner of storing data-files. But what to do with those data that already have become difficult to access?

The solution I came up with consists of a software macro that reads and interprets the code-book and converts it to syntax allowing the original data to be read into a statistical package. It is programmed for R-Project, the open-source software package for statistical analysis that I work with and write about. A first public release is scheduled shortly before the conference.
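
To give a rough idea of the general principle, the sketch below reads a small fixed-width ASCII file using the column widths from a code-book. This is not the read.isi macro itself: the code-book layout, the variable names and the raw data are all invented for illustration, and the standard read.fwf() function does the actual reading.

```r
# Hypothetical code-book: variable names and the width (in characters)
# of each variable in the raw ASCII file. Not the format read.isi uses.
codebook <- data.frame(
  name  = c("id", "gender", "income"),
  width = c(3, 1, 4),
  stringsAsFactors = FALSE
)

# Three lines of made-up fixed-width data, written to a temporary file
raw_lines <- c("00110054", "00220034", "00310057")
raw_file <- tempfile(fileext = ".dat")
writeLines(raw_lines, raw_file)

# Use the code-book to tell read.fwf() how to split each line
old_data <- read.fwf(raw_file, widths = codebook$width,
                     col.names = codebook$name)
old_data
```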

The conference will be held in Dortmund, August 12-14. It will be the ideal opportunity to share my approach with experts in the field and perhaps to find some people who are interested in using it.