Rense Nieuwenhuis » useR! 2008

useR! 2008: Read.isi reception

Rense Nieuwenhuis — Sun, 17 Aug 2008 13:55:35 +0000

Just a few days since the useR! 2008 conference, it is time for some initial evaluation of the reception of my presentation of Read.isi.

The presentation itself went well. I was presenting in one of the larger rooms of the conference, to an audience of about 50-60 people. I was rather content with that number, especially because I received some nice and supportive feedback. This feedback made it rather clear, however, that I’m not a professional programmer (which I obviously already knew). Fortunately, the suggestions were focused on performance, rather than on preventing errors.

It is always a nice topic to talk about with fellow conference attendees: are you presenting something? Quite a few of the people who asked me that appeared to have read my abstract. After my presentation, one person came to speak to me. He used to work with the World Fertility Survey and said he would have loved my package to be available back then.

But most of all, the chair of the session in which I presented Read.isi, who is also one of the members of the R core group, suggested referring to my package from within the data manual of R-Project itself. To me, this is a level of endorsement I hadn’t expected, and it is a strong stimulus to keep improving my package.

All in all, I’m very happy with the way Read.isi was received.

useR! 2008: Goodbye

Rense Nieuwenhuis — Thu, 14 Aug 2008 13:30:53 +0000

Unfortunately, the useR! 2008 conference has come to an end. At the conference closing, Uwe Ligges thanked us all for our presence and summarized the conference:

more than 400 participants from all over the world
more than 170 presentations
many CRAN package authors and R core team members
7 invited lectures and 13 pre-conference tutorials

As a last sign of geekiness (of all people present, not of him personally), he showed that over the years a strong correlation has developed between the size of the R-Project base software (in megabytes) and the number of participants of the R conferences. Funnily Scary!

As some real closing remarks, the announcement was made that useR! 2009 will be held in Rennes, France.

useR! 2008: The last session

Rense Nieuwenhuis — Thu, 14 Aug 2008 13:04:37 +0000

useR! 2008 is almost done already. The last session I attended was in the Audimax, the largest lecture room of the conference site. The focus of the majority of the presentations was on teaching R and on motivating people to use it. However, the most interesting presentation was on R Analytic Flow, a rather simple application aimed at providing a hierarchical overview of your complex analyses.

R Analytic Flow has something very simple, but very interesting, to offer, especially for the more complex projects. Earlier, I wrote about a project called MORET for organizing your analyses, and R Analytic Flow has something similar to offer. Basically, R Analytic Flow offers a hierarchical view of your analysis. Each node of the hierarchical tree contains a basic function with corresponding parameters, which can be executed directly. Also, future versions will allow for caching, so that the entire workspace is saved in the corresponding node. In that way, complex and time-consuming calculations do not need to be repeated. This is something I will surely try (see: http://www.ef-prime.com), and it will help me to keep an overview of complex projects.

Perhaps R Analytic Flow might even help people to start using R-Project, which was more explicitly the purpose of the other presentations. I’ve seen Nicholas Lewin-Koh presenting how he implemented in R-Project the analyses that are quite standard and often performed in his medical department. To give students a better feel for statistics, Adrian Bowman wrote simulation modules for often-used distributions and analyses. Richard Pugh from Mango Solutions gave a nice presentation on the experiences of his company with teaching R-Project to people. I totally agree with him that a firm basis should be created first, before the more interesting capabilities are taught.
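To make the idea of a tree of executable, cached steps a bit more concrete, here is a minimal sketch in Python. It is not R Analytic Flow's actual design: the `Node` class, its fields and the example steps are all hypothetical names made up for illustration, and the cache is kept in memory instead of saving a whole workspace.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Node:
    """One analysis step: a function plus its parameters (hypothetical design)."""
    name: str
    func: Callable[..., Any]
    params: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    cached: bool = False
    result: Any = None

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def run(self, upstream: Any = None) -> Any:
        # Reuse the stored result so an expensive step is computed only once.
        if not self.cached:
            self.result = self.func(upstream, **self.params)
            self.cached = True
        return self.result

    def run_all(self, upstream: Any = None) -> dict:
        # Run this node, then feed its result to every child node.
        out = {self.name: self.run(upstream)}
        for child in self.children:
            out.update(child.run_all(self.result))
        return out


calls = []  # records how often the "expensive" step really runs


def load(_, n):
    calls.append("load")
    return list(range(1, n + 1))


def mean(values):
    return sum(values) / len(values)


def total(values):
    return sum(values)


root = Node("load", load, {"n": 4})
root.add(Node("mean", mean))
root.add(Node("total", total))
results = root.run_all()
root.run_all()  # second run: every node answers from its cache
```

Note that this simple cache never checks whether the upstream result has changed; a real tool would have to invalidate the nodes below an edited step.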

useR! 2008: Gary King and the DataVerse Network

Rense Nieuwenhuis — Thu, 14 Aug 2008 08:08:50 +0000

Data gets lost or becomes unusable after a remarkably short time. What I try to achieve with Read.isi has been done on a massive scale by Gary King. He recognizes the speed at which data becomes inaccessible, and how heterogeneous data formats are, and decided to initiate the Dataverse Network. The Dataverse Network is a server-based approach to storing, managing, and providing to others the data resulting from the countless surveys and experiments performed in science.

Intended to store virtually all data ever collected, the system has its own, unique, algorithm to store data in a single format. Based on this, storage and analysis can be exceptionally reliable and easy to use.

All very promising, but before everybody will be willing to donate their data to archives, we will need to find a solution to the political problem of receiving credit when people use your data. Basically, this requires a form of citation that applies to data as well, so that data can be referenced at the end of articles. The most unique part of this is formed by UNF codes, which represent the content of the data in a unique, short string. This unique string helps to identify data sets without conveying any information about the content of the data. Confidentiality is thus retained.

All the data are stored in a central archive, the DataVerse Network, but people can have the network accessed directly from their own website. In that way, you can present from your own website the data you collected yourself, or the data you use in your papers, or the data you recommend to your students, or whatever selection you’d like.

What is especially impressive, though, is that it is possible to perform advanced statistical analyses from within the DataVerse archive. They achieved this by writing the Zelig package, which should become ‘everyone’s statistical software’. It is basically a wrapper for many functions, whose authors need to write a small bridge function between their function and the Zelig package. In that way, a universal syntax is achieved.

I think that this is an excellent initiative, especially the attempt to unify data and to protect it from the wear of the ages. What makes this a potentially large success is that the developers clearly thought about the (political) structure present in science. Instead of trying to change that (and failing miserably), Gary King and his colleagues accepted the situation as it is and built upon that as best they could. So for now, I’ll go gather some data and store it as soon as possible on my own part of the DataVerse Network.
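The general idea of identifying a data set by a short string derived from its content can be sketched as follows. This is explicitly not the UNF algorithm itself, whose details were not covered in the talk as I noted it; the `fingerprint` function, the rounding to a fixed number of digits and the use of SHA-256 are my own assumptions for illustration.

```python
import hashlib


def fingerprint(rows, digits=7):
    """Return a short digest of tabular data (illustrative, not the UNF algorithm)."""
    h = hashlib.sha256()
    for row in rows:
        cells = []
        for value in row:
            if isinstance(value, float):
                # Round floats so tiny representation noise does not change the digest.
                cells.append(f"{value:.{digits}e}")
            else:
                cells.append(str(value))
        h.update(("\t".join(cells) + "\n").encode("utf-8"))
    # A one-way hash identifies the content without revealing the values themselves.
    return h.hexdigest()[:16]


data = [("a", 1, 0.3), ("b", 2, 1.5)]
same_data = [("a", 1, 0.1 + 0.2), ("b", 2, 1.5)]  # 0.1 + 0.2 differs from 0.3 only by float noise
changed = [("a", 1, 0.3), ("b", 2, 1.6)]

print(fingerprint(data) == fingerprint(same_data))
print(fingerprint(data) == fingerprint(changed))
```

Because the string depends only on the content, two people holding the same data can check that they cite the same version, without either of them publishing the data.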
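The wrapper-plus-bridge idea can be sketched in a few lines. This is a Python illustration of the pattern only, not Zelig's actual interface: the `register` decorator, the `fit` function and the two toy "models" are hypothetical names I made up.

```python
from statistics import mean, median

_BRIDGES = {}  # model name -> bridge function (hypothetical registry)


def register(model_name):
    """Decorator that a model author would use to plug a bridge into the wrapper."""
    def decorator(bridge):
        _BRIDGES[model_name] = bridge
        return bridge
    return decorator


@register("mean")
def _mean_bridge(data, outcome):
    # Translate the universal call into whatever the underlying function expects.
    return {"model": "mean", "estimate": mean(row[outcome] for row in data)}


@register("median")
def _median_bridge(data, outcome):
    return {"model": "median", "estimate": median(row[outcome] for row in data)}


def fit(model, data, outcome):
    """One syntax for every registered model."""
    try:
        bridge = _BRIDGES[model]
    except KeyError:
        raise ValueError(f"no bridge registered for model {model!r}") from None
    return bridge(data, outcome)


survey = [{"age": 1}, {"age": 2}, {"age": 6}]
print(fit("mean", survey, "age"))
print(fit("median", survey, "age"))
```

The point of the design is that the user only ever learns `fit(model, data, outcome)`, while each author writes one small bridge for their own function.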

useR! 2008: Model Management

Rense Nieuwenhuis — Wed, 13 Aug 2008 14:57:01 +0000

I didn’t fully understand what was meant by the term ‘model management’ when I entered this session, but it turned out to be quite an interesting session, although there are apparently some widely different interpretations of what it actually means.

With increased computer power, it has become very easy to estimate models. It has even become so easy that we readily estimate loads of models, resulting in lots of data piling up. Managing large sets of models by hand can be cumbersome work, as was stated by Ralf Seger. He presented MORET – A software for model management. MORET collects all input from R, and stores the data, corresponding input and models in a database. The software then allows the comparison of global model characteristics. It is even possible to manually define what elements of information need to be extracted from what types of models!

After models have been stored in MORET, they can be accessed from within R-Project, or they can be ‘dragged’ from within the MORET interface. So, basically, you can retrieve the full history of all the analyses you’ve done over a long period, or even a career! Check it out on: www.rosuda.org

In a very different meaning of model management, Werner Stahel presented an augmented version of a regression function (“Yet another Regression Function”). It primarily has a different way of doing residual analysis and improves the way anova tables are calculated from regression objects. At times, he moves away from what is customary in regression analysis, so I wonder how many people will use it, or especially report the new measures in their publications.
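As a rough impression of what storing models in a database and comparing them might look like, here is a small sketch using Python's built-in `sqlite3` module. It says nothing about how MORET itself is built: the table layout, the function names and the AIC values are all invented for the example.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create (or open) a tiny model archive; the schema is hypothetical."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS models ("
        "id INTEGER PRIMARY KEY, model_call TEXT, dataset TEXT, summary TEXT)"
    )
    return con


def save_model(con, model_call, dataset, summary):
    """Store the call, the name of the data set and a dict of model characteristics."""
    cur = con.execute(
        "INSERT INTO models (model_call, dataset, summary) VALUES (?, ?, ?)",
        (model_call, dataset, json.dumps(summary)),
    )
    con.commit()
    return cur.lastrowid


def compare(con, key):
    """List one chosen characteristic for every stored model."""
    rows = con.execute(
        "SELECT id, model_call, summary FROM models ORDER BY id"
    ).fetchall()
    return [(i, call, json.loads(s).get(key)) for i, call, s in rows]


con = open_store()
# Made-up numbers, purely to show the comparison step.
save_model(con, "lm(y ~ x)", "survey_2008", {"aic": 210.5, "n": 100})
save_model(con, "lm(y ~ x + z)", "survey_2008", {"aic": 204.1, "n": 100})
print(compare(con, "aic"))
```

Choosing which key to extract in `compare` loosely mirrors the option of defining yourself what information is pulled from which type of model.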

useR! 2008: Count data and Model comparison

Rense Nieuwenhuis — Wed, 13 Aug 2008 14:01:00 +0000

Today’s first focus session was planned around modeling. Two presentations stood out for me: the ones by Christian Kleiber on generalized regression on count data, and Gianmarco Altoè on bootstrapped model comparison.

Christian Kleiber presented a very interesting package regarding regression models for count data. A classical count data model is, for instance, Poisson regression, which is already offered by several packages in R-Project. Using much of the code already available in R, Kleiber wrote several functions, for instance for efficiently estimating zero-inflated models or so-called hurdle models. Although apparently developed for use in econometrics, I can easily see the use for this package, especially regarding the zero-inflated models.

I think that the presentation given by Gianmarco Altoè, and especially the package DeltaR he developed, can be very valuable to some types of research. As a statistician, he was asked for the possibility to compare the proportion of variance explained by different regression models, estimated using different samples. I don’t see myself using this, since as a sociologist I try to get samples that cover the whole population as best as possible anyway. However, especially in disciplines such as psychology, management studies, or perhaps even development studies, I can see the use of model comparisons.

I do wonder, though, whether in comparing models based on different samples we are not implicitly assuming that the two samples are subsets of a single sample. If that is the case, we don’t need to apply this type of comparison; we would do better to merge the data and perform a single analysis, focused on the comparison between the two groups.
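For readers who have not met these models: the difference between a zero-inflated and a hurdle model is easiest to see in their probability functions. The sketch below writes out the standard textbook definitions for the Poisson case in Python; it is not code from Kleiber's package, and the parameter names `pi` and `p_zero` are my own.

```python
import math


def poisson_pmf(k, lam):
    """P(Y = k) for an ordinary Poisson count with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi an 'extra' zero, else Poisson(lam)."""
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, lam, p_zero):
    """Hurdle model: zero with probability p_zero, positive counts from a
    zero-truncated Poisson(lam)."""
    if k == 0:
        return p_zero
    return (1 - p_zero) * poisson_pmf(k, lam) / (1 - math.exp(-lam))


# Both variants put more mass on zero than a plain Poisson with the same lam.
print(poisson_pmf(0, 2.0), zip_pmf(0, 2.0, 0.3), hurdle_pmf(0, 2.0, 0.5))
```

In the zero-inflated model a zero can come from either source, whereas in the hurdle model all zeros come from the first stage and the count part only produces positive values.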
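To give an impression of what a bootstrapped comparison of explained variance between two samples could look like, here is a small Python sketch. I do not know the exact procedure DeltaR uses, so this is a generic percentile bootstrap for the difference in R² of a simple one-predictor regression; all function names are my own.

```python
import random


def r_squared(xs, ys):
    """R-squared of a simple linear regression of y on x (squared correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate resample: no variance to explain
    return sxy * sxy / (sxx * syy)


def resample(pairs, rng):
    """Draw a bootstrap sample of (x, y) pairs with replacement."""
    return [rng.choice(pairs) for _ in pairs]


def bootstrap_r2_difference(sample_a, sample_b, reps=2000, seed=1):
    """95% percentile interval for R2(sample_a) - R2(sample_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        a = resample(sample_a, rng)
        b = resample(sample_b, rng)
        diffs.append(r_squared(*zip(*a)) - r_squared(*zip(*b)))
    diffs.sort()
    return diffs[int(0.025 * reps)], diffs[int(0.975 * reps) - 1]
```

If the resulting interval excludes zero, the two samples would seem to differ in how well the model fits; each sample is resampled separately, so they are treated as independent.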

useR! 2008: Duncan Murdoch and package development

Rense Nieuwenhuis — Wed, 13 Aug 2008 13:52:55 +0000

Duncan Murdoch gave an invited lecture on why we need packages in R-Project. He explained that R-Project is normally distributed with only 12 base packages and 14 recommended packages. Nevertheless, the total number of packages available amounts to more than 1500 (and counting). So, “R is basically its packages”.

He first described other methods used to distribute work that is done in R-Project, mainly distributing complete workspaces or using script files.

Sure, you can save the workspace, but it is easy to forget how some objects were created, and you often save more than you would like. It is also possible to save code to script files, and then to run those when needed. This does work, but when the number of functions in the script files increases, it rapidly becomes a rather cluttered working process.

So, we need packages. These are small, compact, and complete ways to easily distribute your work. A lot can be stored in packages, such as the functions we wrote in separate script files and data sets that are stored in the native R format. The package also offers the opportunity to add manual files and vignettes (a specific type of manual file that also contains R code). But, most of all, R packages can contain program code in different programming languages, such as C, C++, Fortran, and Objective C. The benefit of these lower-level programming languages is an enormous gain in the speed of functions. Murdoch estimated that when loops are transformed from R code to a variant of C, a speed increase of a factor 100 can be achieved! Also, a package can contain explicit tests that allow the developer to be warned almost automatically when an update in the R-Project base code breaks the package.

The rest of the presentation was focused on creating packages in Windows. I will not be detailing that process, so be sure to find his presentation on the conference site when it becomes available.

useR! 2008: Collaboration and visualization

Rense Nieuwenhuis — Wed, 13 Aug 2008 10:43:33 +0000

The first session of presentations this morning is a kaleidoscope session, so as was to be expected the presentations were highly diverse. Three presentations really stood out to me, ranging from online collaborating on R-packages using R-forge, to visualizing categorical data and dynamic representation of the results of principal components analysis.

46% of the R-packages are developed and maintained by more than one author, which at times leads to difficulties regarding the cooperation. Building upon statements made earlier by Kurt Hornik, a challenge for future development of R-Project might just be in this area. How do we keep people motivated to keep working on complex packages? Well, Stefan Theussl and the other people working on R-Forge must have thought: “by keeping them facilitated”. R-Forge is an open source, online collaboration facility, specifically bound to R-Project. If you’re the developer of an R-package, you might find it to be the right way of sharing your code with others while it is still under development.

The presentation by David Meyer was something completely different: visualizing categorical data and the corresponding VCD package. What I loved about this presentation, is that it contained some ideas on how one should properly visualize data. Still, we run into some weird instances of graphics, as was illustrated during the presentation (for instance the use of 3d bar-plots, in which we cannot compare the bar-height). One of the possibly interesting strategies, was to use color shaded mosaic plots, in which differences that were statistically significant to a higher extent were indicated by stronger colors. The direction of the relationship determined the nature of the color, the strength of the relationship the intensity of the color.
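The shading rule described here can be sketched with Pearson residuals, which I assume is what the colors were based on: the sign gives the hue and the size gives the intensity. The Python code below is an illustration only, not the VCD package's code; the cutoffs of 2 and 4 and the blue/red choice are my assumptions.

```python
import math


def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            out.append((observed - expected) / math.sqrt(expected))
        residuals.append(out)
    return residuals


def shade(residual, cutoffs=(2.0, 4.0)):
    """Direction picks the hue, strength picks the intensity (assumed cutoffs)."""
    size = abs(residual)
    if size < cutoffs[0]:
        return "none"
    hue = "blue" if residual > 0 else "red"
    intensity = "strong" if size >= cutoffs[1] else "light"
    return f"{intensity} {hue}"


table = [[30, 10], [10, 30]]  # made-up 2x2 table of counts
for row in pearson_residuals(table):
    print([shade(r) for r in row])
```

Each tile of the mosaic would then be filled according to `shade`, so that the cells deviating most from independence catch the eye first.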

At the end of the session, a very nice piece of software was demonstrated, aimed at the dynamic interpretation and representation of principal components analyses. What the package basically does, is to take an object containing the results of a PCA, and then to represent that in a java based application. By that, we can easily select the variables or factors to represent. It is even possible to only show those factors, that have an impact above a specified threshold. Very nice visualization, and in the case of enormous numbers of variables, a real life saver.

useR! 2008: Social Sciences

Rense Nieuwenhuis — Tue, 12 Aug 2008 13:28:58 +0000

The next session I attended, was a Focus session on Social Sciences. Two presenters were present, and unfortunately for my beloved discipline, the session was held in one of the smaller rooms.

The first presentation addressed a problem initially thought to be simple: what is the sample size we need to answer our research question(s), and how do we keep costs low? All basic introductory books on statistics teach you how to do this, given any expected sample variance and correlation sizes. However, it appears that these methods can be largely improved by a stratified sampling design. This, however, makes the calculation of the required sample sizes remarkably more complex, for it needs to be done for each stratum, while at the same time the required sample sizes of the strata influence each other.
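Why stratification can reduce the required sample size is easiest to see with a textbook allocation rule. The sketch below uses Neyman allocation, which is my own choice of illustration and not necessarily the method the presenters implemented; the strata sizes and standard deviations are made up, and the finite population correction is ignored.

```python
def proportional_allocation(n, sizes):
    """Spread n over strata in proportion to stratum size."""
    total = sum(sizes)
    return [n * N / total for N in sizes]


def neyman_allocation(n, sizes, sds):
    """Spread n in proportion to size times standard deviation (Neyman allocation)."""
    weights = [N * s for N, s in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


def variance_of_mean(sizes, sds, alloc):
    """Variance of the stratified mean, without finite population correction."""
    total = sum(sizes)
    return sum((N / total) ** 2 * s**2 / n_h for N, s, n_h in zip(sizes, sds, alloc))


# Made-up example: three strata with very different spread.
sizes = [1000, 500, 100]
sds = [1.0, 5.0, 20.0]
n = 200

prop = proportional_allocation(n, sizes)
neym = neyman_allocation(n, sizes, sds)
print(variance_of_mean(sizes, sds, prop))
print(variance_of_mean(sizes, sds, neym))
```

With the same total of 200 units, sampling the small but highly variable stratum more heavily gives a clearly smaller variance, which is the same as saying that a given precision can be reached with fewer units. The allocations are left as fractions here; in practice they would be rounded.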

In a sample study on Italian farms in 103 provinces, they were able to achieve a reduction in required total sample size of approximately 45%. Evidently, this will lead to an enormous reduction in the associated costs. While only in their starting phase, I think that these authors have made an interesting package, that I will surely be investigating.

The second presentation focused on the problem of small groups and questionnaires. Coming from educational evaluation research, the problem had arisen that standard evaluation methods, using questionnaires with Likert-scale items, were not applicable to faculties with (very) small numbers of students. Having developed a probability model, the new package should allow for more exact measures of student satisfaction. I found this presentation a little too much focused on sampling theory and probability functions, because the more interesting part was too short. It was on the automation of scanning all the questionnaires using OCR software, and analyzing the data resulting from that process. I think a presentation with a stronger focus on that part of the project would have been much more interesting to a broad useR! conference.

useR! 2008: Harrell already wrote it …

Rense Nieuwenhuis — Mon, 11 Aug 2008 15:33:27 +0000

Unfortunately, Frank E. Harrell Jr. already wrote the book that I would have loved to (be able to) write, probably somewhere at the end of my career. If at all. Fortunately, I can learn a lot very much faster now. I’m talking about a book on statistics that also contains a perspective and opinion on the application of statistics. Harrell called his book “Regression Modeling Strategies”. Oh, and he also demonstrates his main arguments in R-Project. And now he is telling me that his philosophy on applied statistics is also condensed in an R-package (the Design package).

An eye-opener to me was his description of non-statisticians being afraid of continuous variables. Indeed, when I doubt the linearity of a continuous variable and I can’t find a way to fix it, I tend to categorize or even dichotomize it. I feel that this is not very uncommon, but now that I have heard Harrell’s criticism of this strategy, I hope to never do that again and will give some serious thought to his suggestion of using ‘spline functions’. He argued that we tend to dichotomize a not-completely-linear variable, because we do not believe that it is linear in reality. But, given the finding of near linearity in our data, do we believe reality to be dichotomized? Probably our actions will bring our model farther away from correspondence with reality than would our inaction. “Nature is not that kind. There is no reason to expect linearity.”
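A tiny sketch may show what is lost by dichotomizing and what a spline keeps. Harrell advocates restricted cubic splines; to keep things short, the Python code below uses a much simpler piecewise-linear spline basis instead, and the cut point and knots are arbitrary values I picked for the example.

```python
def dichotomize(x, cut):
    """Collapse a continuous variable into two groups."""
    return 1 if x >= cut else 0


def linear_spline_basis(x, knots):
    """Piecewise-linear spline terms: x itself plus one 'hinge' per knot.

    A regression on these terms can change slope at every knot,
    while still treating x as continuous.
    """
    return [x] + [max(0.0, x - k) for k in knots]


ages = [30, 39, 45, 70]
for age in ages:
    print(age, dichotomize(age, 40), linear_spline_basis(age, [40, 60]))
```

After dichotomizing at 40, ages 30 and 39 become indistinguishable, and so do 45 and 70, whereas the spline terms keep every age distinct and merely allow the slope to bend at 40 and at 60.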

This does, again, raise the question on how theory and statistical model relate to each other. Is there a need to analyze data with models that are richer in detail than the actual theory that we are testing? If so, does this subsequently mean that our theories are not fit (enough) to be tested with the models that we are forced to estimate?

It is too easy to find interpretations of parameters that basically don’t have any meaning at all. Harrell told a story of how he had found a hugely significant interaction parameter. He went to the cardiologist he worked with, who soon thought of an interpretation. Then, Harrell found that he had made a mistake, and had to correct the sign of the parameter. Shockingly, the cardiologist immediately had a new, and completely different, interpretation.

Several other topics were addressed, such as the treatment of missing values. It is all too easy to add a category ‘missing’ to our analyses, but this does mess up our number of degrees of freedom. He explained how a technique as horrid as variable / model selection came about: computers were able to perform the technique before they could run the simulations needed to test it properly. Harrell explained that the purpose of data imputation is not to recover missing data, but to retain the data on the other variables that were not missing to begin with.

The last half hour of the presentation was spent on analyzing some magnificent data with survival rates of passengers of the Titanic. “What did ‘women and children first’ really mean?” Applying much of what he had already discussed, he was not only able to show how age, sex, and social class affected chances of survival, but moreover how they interacted. For instance, younger people indeed had relatively high chances of survival, but lower class young males (> 20 years) had almost no chance of survival.

A lot of other topics were covered, only some of which I hinted at here. All were stated with confidence and from a clear perspective. I don’t really know what to think of the implications for the relationship between theory and empirical analysis, but Harrell sure has given some input to my thought on that. However, what I learned on the most fundamental level, I think, is that I can’t wait for the conference bookstore to open tomorrow.