Rense Nieuwenhuis » R-Sessions

Applied R: Manual for the quantitative social scientist

Rense Nieuwenhuis — Wed, 23 Mar 2011 10:50:31 +0000

R-Project is an advanced software package for statistical analysis. Several years ago I wrote an introductory manual covering several analyses that can be performed with R. Although several parts of it are available on my blog as the R-Sessions, I never published the full document. That changes now: for those looking for an applied guide to R-Project, here it is!

This manual was written specifically as an introduction for the quantitative social scientist. In my opinion, R-Project is a magnificent statistical program, ready to be accepted and implemented in the social sciences. The flexibility of the program and the way data are handled give the user a sense of closeness to and control over the data. I think this inspires users to analyze their data more creatively and sometimes in a more advanced way. At present, this manual has a strong focus on multilevel regression techniques. The reason is that in R-Project it is very easy to estimate these types of models, even the more complex variants. The more basic and fundamental aspects of R-Project are introduced as well. All this is done with the needs of the quantitative social scientist in mind.

Of course, this manual is provided without any warranty. Please realize that I wrote it almost four years ago.

I’d love to hear any feedback for (future) improvements!

Download:

Applied R for the quantitative social scientist

Index of the R-Sessions

Rense Nieuwenhuis — Mon, 17 May 2010 10:00:19 +0000

The R-Sessions are a series of blog entries on using R. A large part consists of an R-manual I once wrote. Other posts include some tricks I found out, as well as entries detailing functions and packages I wrote for R. The series already comprises over forty posts, so I decided to create an index, which is found below. On a fixed page on this website (www.rensenieuwenhuis.nl/r-project/r-sessions-index/) I will continue to update this index with new installments of the R-Sessions.

A PDF manual containing much of the R-Sessions material is available here.

Introducing R

Data Manipulation

Graphics

Mixed Models

Influence.ME: Tools for detecting influential cases in mixed models

Some small functions I wrote

Books

Various

R Sessions 33: Select (nested) observations with equal number of occurrences

Rense Nieuwenhuis — Wed, 23 Sep 2009 10:00:05 +0000

Recently, I was contacted with a question about R code. A researcher I know was working with nested data that were unbalanced. He was working with data in a ‘long’ format: all observations nested within the same group had the same identification number. However, the number of observations differed between groups (hence: unbalanced data).

He asked me for a piece of code that creates a subset of the data that is balanced, i.e. all observations that are nested within equally sized groups. Or, as an alternative, all observations nested within groups with at least a minimum number of observations.

I solved it the quick and dirty way: the solution involves creating an additional variable, a new data.frame, and merging. It can surely be done more elegantly, but it works.

So, I share it below:

id <- c("a", "b","b", "c","c","c", "d","d","d","d", "e","e","e")
y <- c(3,4,3,2,4,5,6,5,6,7,5,4,3)
df <- data.frame(id, y) # setting up original data.frame

tab <- data.frame(id=names(table(df$id)), fre=as.vector(table(df$id))) # table of frequencies
df.new <- merge(df, tab, by="id") # merging frequencies-variable

subset(df.new, fre==3) # subsetting
subset(df.new, fre>3)
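
A more compact variant is sketched below. It is not the code I sent to the researcher, but an alternative that assumes base R's ave() is acceptable for counting group sizes, which avoids the separate frequency table and the merge. The object names balanced and at.least.3 are made up for this illustration.

```r
# Same example data as above
id <- c("a", "b","b", "c","c","c", "d","d","d","d", "e","e","e")
y <- c(3,4,3,2,4,5,6,5,6,7,5,4,3)
df <- data.frame(id, y)

# ave() applies FUN within each level of id and returns one value per row,
# so every observation receives the size of its own group
df$fre <- ave(seq_along(df$id), df$id, FUN = length)

balanced   <- subset(df, fre == 3)  # groups with exactly three observations
at.least.3 <- subset(df, fre >= 3)  # groups with three or more observations
```

Because the group size is stored on every row, the threshold in subset() can be changed without rebuilding anything.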

R-Sessions 31: Combining lmer output in a single table (UPDATED)

Rense Nieuwenhuis — Thu, 05 Feb 2009 11:00:38 +0000

There are various ways of getting your output from R to your publication draft. Most of them are highly efficient, but unfortunately I couldn’t find a function that combines the output from several (lmer) models and presents it in a single table. lmer is the mixed effects model function from the lme4 package. So, I wrote a simple function that does exactly that.

I wrote it for a specific purpose, so it is not a general-purpose function, but it can easily be adapted for use in other settings. Here it goes:

require(lme4)
require(mlmRev)


model.1 <- lmer(normexam ~ 1 + (1 | school), data=Exam)

model.2 <- lmer(normexam ~ standLRT + (1 | school), data=Exam)

model.3 <- lmer(normexam ~ standLRT + sex + (1 | school), data=Exam)

model.4 <- lmer(normexam ~ standLRT + sex + schavg + (1 | school), data=Exam)
model.a <- lmer(use ~ 1 + (1 | district), family=binomial, data=Contraception)

model.b <- lmer(use ~ livch + (1 | district), family=binomial, data=Contraception)

model.c <- lmer(use ~ age + (1 | district), family=binomial, data=Contraception)

model.d <- lmer(use ~ livch + age + (1 | district), family=binomial, data=Contraception)
m1 <- c(model.1, model.2, model.3, model.4)

m2 <- c(model.a, model.b, model.c, model.d)
combine.output.lmer <- function(models, labels=FALSE)

	{
	fix.coef <- lapply(models, function(x) summary(x)@coefs)

	var.coef <- lapply(models, function(x) summary(x)@REmat)

	n.par <- dim(summary(models[[1]])@coefs)[2]
	fix.labels <- if (identical(labels, FALSE)) colnames(summary(models[[1]])@coefs) else labels
	var.labels <- colnames(var.coef[[1]])
	# Creating table with fixed parameters

	output.coefs <- data.frame(Row.names=row.names(fix.coef[[1]]))

	for (i in 1:length(models))

		{
		a <- fix.coef[[i]]

		colnames(a) <- paste("Model", i, fix.labels)

		output.coefs <- merge(output.coefs, a, by.x=1, by.y=0, all=T, sort=FALSE)
		}

	output.coefs[,1] <- as.character(output.coefs[,1])

	output.coefs[dim(output.coefs)[1]+2, 1] <- "Loglikelihood"

	LL <- unlist(lapply(models, function(x) as.numeric(logLik(x))))

	output.coefs[dim(output.coefs)[1], 1:length(models)*n.par-n.par+2] <- LL
	# Creating table with random parameters

	output.vars <- data.frame(var.coef[[1]])[,1:2]

	for (i in 1:length(models))

		{
		a <- var.coef[[i]]

		colnames(a) <- paste("Model", i, var.labels)

		output.vars <- merge(output.vars, a, by.x=1:2, by.y=1:2, all=T, sort=FALSE)
		}
	# Combining output.coefs and output.vars

	n.cols <- dim(output.coefs)[2]

	n.coefs <- dim(output.coefs)[1]

	n.vars <- dim(output.vars)[1]
	output <- matrix(ncol=n.cols +1 , nrow=n.vars+n.coefs+2)
	output[1:n.coefs, -2] <- as.matrix(output.coefs)

	output[n.coefs+2, 1] <- "Variance Components"

	output[(n.coefs+3) : (n.coefs+n.vars+2), 1:2] <- as.matrix(output.vars[,1:2])

	output[

		(n.coefs+3) : (n.coefs+n.vars+2),

		which(rep(c(1,1,rep(0, n.par-2)),length(models))!=0)+2] <- as.matrix(output.vars[,c(-1,-2)])
	colnames(output) <- c("Parameter", "Random", colnames(output.coefs)[-1])
	return(output)

	}
combined <- combine.output.lmer(m1)

combined <- combine.output.lmer(m2)
combined <- combine.output.lmer(m1, labels=c("appel", "banaan", "grapefruit"))

combined <- combine.output.lmer(m2, labels=c("appel", "peer", "banaan", "grapefruit"))

write.csv(combined, "combined.csv", na=" ")

In this example I estimate two sets of four mixed effects models, each concatenated in a single object ('m1' and 'm2'). The function itself is called 'combine.output.lmer', and is applied to these objects. The output is a matrix with the parameter names in the first column. Parameters that are not estimated in a model are indicated by 'NA' in the respective columns. When the 'combined' object is written to an external file, the NAs are replaced by blanks, and the file can be read into other software, such as the OpenOffice spreadsheet or Excel. Use the xtable package to get the table into your LaTeX document.
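
The merging idea at the heart of the function can be tried without lme4. The sketch below is a simplified illustration, not a replacement for combine.output.lmer: it assumes ordinary lm() models on the built-in mtcars data, keeps only estimates and standard errors, and uses the hypothetical names combine.coefs and combined.lm.

```r
# Three nested linear models on a data set that ships with R
models <- list(
  lm(mpg ~ 1, data = mtcars),
  lm(mpg ~ wt, data = mtcars),
  lm(mpg ~ wt + hp, data = mtcars)
)

combine.coefs <- function(models) {
  # One small table per model, with the parameter name as a regular column
  tabs <- lapply(seq_along(models), function(i) {
    est <- coef(summary(models[[i]]))[, c("Estimate", "Std. Error"), drop = FALSE]
    out <- data.frame(Parameter = rownames(est), est,
                      check.names = FALSE, stringsAsFactors = FALSE)
    names(out)[-1] <- paste("Model", i, colnames(est))
    out
  })
  # Merge the tables one by one; all = TRUE keeps parameters that are
  # missing from some of the models and fills those cells with NA
  Reduce(function(a, b) merge(a, b, by = "Parameter", all = TRUE, sort = FALSE), tabs)
}

combined.lm <- combine.coefs(models)
print(combined.lm)
```

As in the function above, a parameter that a model does not contain shows up as NA in that model's columns.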

UPDATE
I updated and improved the code somewhat, for I wasn't satisfied with the results. Now the code adapts to the number of parameters derived from the models' summary, allows you to add your own names to the columns, and, most importantly, also reports the random slopes.

Please note: due to the internal matching procedure, errors may occur when the same variable is random 'within' more than one other variable. This is only the case when other variables are random within each nesting factor as well.

R-Sessions 30: Visualizing missing values

Rense Nieuwenhuis — Thu, 08 Jan 2009 10:00:39 +0000

It always takes some time to get a grip on a new dataset, especially a large one. The code-books are often as indispensable as they are massive, and not always as clear as one would want. Routings, and the strange patterns of missing values that result from them, are at times difficult to find.

I found a nice way to plot missing values, using R. Basically, I thought it would be nice to calculate the percentage of missings on each variable, and do so for each year represented in the data. These numbers could be visualized using a levelplot(), which resulted in the graph below.

In this example I used a small subset of variables from the cumulative file of the General Social Survey, which is freely available from the web. I used this syntax:

testing.NA <- matrix(ncol=26, nrow=21)
for (i in 1:dim(GSS)[2])
	{
	testing.NA[i,] <- tapply(GSS[[i]], GSS$year, function(x) sum(is.na(x)) / length(x))
	}


dimnames(testing.NA) <- list(

	names(GSS),

	sort(unique(GSS$year)))
library(lattice)

levelplot(testing.NA, scales=list(x=list(rot=90)), main="Percentage missing values on variables in GSS", xlab="Variable", ylab="Year")

First, I defined the testing.NA matrix, using the number of years and variables. Then, in a loop, I calculate the percentage missing values, basically using is.na() and length(). I assign dimnames to the matrix and use the levelplot() function from the lattice-library to plot the matrix. That's it, easy does it.
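
For readers without the GSS file at hand, the same calculation can be tried on a toy data set. This is only a sketch: the data frame toy and its values are invented for the illustration, and sapply() is used instead of the loop so that the matrix dimensions do not have to be typed in by hand.

```r
# Hypothetical stand-in for the GSS extract: two variables observed in three years
toy <- data.frame(
  year  = rep(c(1986, 1987, 1988), each = 4),
  abany = c(NA, NA, NA, NA, 1, 2, 1, NA, 2, 1, 1, 2),
  educ  = c(12, 16, NA, 10, 12, 12, 14, 18, 9, NA, NA, 12)
)

vars <- setdiff(names(toy), "year")

# The mean of a logical vector is the proportion of TRUEs, so this gives the
# share of missing values per year; sapply() puts the years in the rows,
# hence the t() to get variables in rows and years in columns
prop.NA <- t(sapply(toy[vars], function(v) tapply(is.na(v), toy$year, mean)))
print(prop.NA)
```

The resulting matrix already carries its dimnames and can be passed to levelplot() in the same way as testing.NA.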

But: does it help? I think it does. Of course, all this information can be gained from the code-book, and needs to be verified. However, it does give us some immediate notes on the availability of these variables. For instance, we see that in the first few years the abany variable is missing, whereas the other variables on abortion are not. When creating scales this needs to be taken into account, so as not to lose the complete data on the first few years. The speduc variable (spouse's educational level) has a high number of missings, as does the denom variable. This, however, makes sense: not everybody has a spouse, and the denom variable only applies to Protestants. Finally, this graph gives some pointers on a change in survey strategy from 1988 onwards regarding the items on induced abortion. The percentage of missing values increased sharply at that point, and did so for all abortion-related variables.

This graph does not tell what exactly happened, but it does provide nice pointers on what to look for when reading the code-book.

R-Sessions 29: Running R-Project twice on Apple Mac OS X

Rense Nieuwenhuis — Mon, 24 Nov 2008 10:00:40 +0000

Working with statistics can be quite time consuming. As anyone working with relatively advanced models and large amounts of data knows, especially the waiting can be excruciating. Your statistical software is locked up while crunching those numbers, while you’d actually prefer to run some minor procedures, such as post-estimations, testing some loops, or simply displaying the output of a previously estimated model. With Apple’s Mac OS X you now can run R-Project twice, making the most of your dual core processor.

The procedure is very easy, and it works like a charm. Mind though that, obviously, it drains your computer's resources heavily, so the performance of each instance of R-Project decreases at least slightly. For that to change, we would need dual-hard disk laptops, dual-RAM laptops and the like. Dual laptop-laptops, basically.

Back to running R-Project twice. Just start R-Project as usual. Then go to your applications folder and secondary-click on the R-Project app. Select ‘duplicate’, and there you are: an app named R copy emerges. Start this as usual and start working.

In the image below you see two instances of R-Project running. The first is working on a heavy-weight function that results in some output every hour or so and runs 96 times. In other words: it takes ages. However, it stores the output in an external file, and since each little bit of output needs some post-estimation before being interpreted, I can use the second instance to load that data and examine it (not shown).

Although you don’t need to re-install packages, the only thing I did not (yet) find out how to do is to share resources between these two instances of R-Project. Being able to share variables, models, and such would be great. Ideas anyone?
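
One low-tech way to pass objects from one instance to the other is through a file, much like the external output file mentioned above. The sketch below assumes a version of R that provides saveRDS() and readRDS(); the file location and object names are hypothetical. It does not share memory between the instances, it only copies an object.

```r
# In the first instance: fit a model and write it to a file.
# tempdir() is used here only to keep the example self-contained; in real use
# this must be a fixed path that both instances can reach.
shared.file <- file.path(tempdir(), "model.rds")
fit <- lm(mpg ~ wt, data = mtcars)
saveRDS(fit, shared.file)

# In the second instance: read the object back in under any name you like
fit.copy <- readRDS(shared.file)
summary(fit.copy)
```

The copy is a snapshot: changes made to the object in one instance are not seen by the other until it is saved and read again.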

– – — — —– ——–

Discuss this article and pose additional questions in the R-Sessions Forum

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.
——– —– — — – –

R-Sessions 28: Impressive R Speeds

Rense Nieuwenhuis — Thu, 30 Oct 2008 10:00:22 +0000

Yesterday, I received my new Apple MacBook. It’s running a Core 2 Duo at 2.4 GHz and it’s fast. Really fast!

Apparently, it’s very cool to show off the speed of R-Project on your system. Optimized .DLL files help to speed up your R on Windows systems (and possibly other systems as well) with respect to matrix transformations, which has led to enormous speed increases. So, let’s perform a speed-test of our own.

First of all, in the syntax below, the Matrix package is activated, using the require() command. Since we will be creating random data, we set the seed in order to receive the exact same data every time the test is run. This is done with set.seed(). The next line creates a matrix X, which in the last three lines is manipulated in different ways.

To test how long this takes, we enclose the matrix operations in the system.time() function, which clocks the operation.

require(Matrix)
set.seed(123)
X <- Matrix(rnorm(1e6), 1000)
system.time(for(i in 1:25) X%*%X)
system.time(for(i in 1:25) solve(X))
system.time(for(i in 1:10) svd(X))

This results in the following output:

> X <- Matrix(rnorm(1e6), 1000)
> system.time(for(i in 1:25) X%*%X)
   user  system elapsed
  8.306   0.591   5.031
> system.time(for(i in 1:25) solve(X))
   user  system elapsed
  8.933   1.331   6.684
> system.time(for(i in 1:10) svd(X))
   user  system elapsed
 36.989   3.665  33.384

WOW! This is the fastest I've seen in real life, even faster than some of the desktops that I know people currently work with (i.e. my own). I am, however, very sure that it is not the fastest possible, let alone compared with how fast calculations will be in the future.

Additionally, in the near future my MacBook will be configured with 4 GB RAM, so I'm curious to find out whether or not this will result in an additional speed increase. I expect, however, most benefit from the additional RAM when fitting binomial mixed effects models, so expect a comparative benchmark on those as well, as soon as the new RAM arrives.

So, in the meantime, you can use this code to do some benchmarks yourself, on various computers. Please post the results here, or discuss them in the R-Sessions Forum.
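
For a quick comparison between machines, the benchmark can also be wrapped in a small function. This is a sketch under a few assumptions of my own: it uses base R's matrix() rather than the Matrix package, a much smaller matrix and fewer repetitions so that it finishes quickly, and the hypothetical name run.benchmark. The timings are therefore not comparable with the numbers reported above.

```r
run.benchmark <- function(n = 200, seed = 123) {
  set.seed(seed)                  # same random matrix on every machine
  X <- matrix(rnorm(n * n), n)    # base R matrix instead of Matrix()
  # Keep only the elapsed (wall clock) time of each operation
  c(
    multiply = system.time(for (i in 1:5) X %*% X)[["elapsed"]],
    solve    = system.time(for (i in 1:5) solve(X))[["elapsed"]],
    svd      = system.time(for (i in 1:2) svd(X))[["elapsed"]]
  )
}

timings <- run.benchmark()
print(timings)
```

Calling run.benchmark(n = 1000) brings the matrix back to the size used in the test above.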

UPDATE:
I also tested my old Powerbook G4 (1.5 GHz, 1.25 GB RAM):
> set.seed(123)
> X <- Matrix(rnorm(1e6), 1000)
> system.time(for(i in 1:25) X%*%X)
   user  system elapsed
 34.661   1.590  47.528
> system.time(for(i in 1:25) solve(X))
   user  system elapsed
 37.184   1.656  51.516
> system.time(for(i in 1:10) svd(X))
   user  system elapsed
247.694  11.258 331.979


R Sessions 26: Text editors for R: Internal editor on OS X

Rense Nieuwenhuis — Mon, 06 Oct 2008 10:00:38 +0000

Since R-Project is essentially syntax based, one needs a good text editor to write some code before it is executed in R. And, since we are all writing high quality code, we need a high quality text editor. This is the first in a series on text editors for use with R-Project on Mac OS X.

The first editor to look at is the internal one. The Mac OS X version of R-Project comes with quite a strong, although basic, text editor. It is shown in the picture below, where it is being used to edit a fragment of code of my own. We readily see some syntax coloring, which is a great help regarding the readability of the syntax (syntax colouring is only available in the Mac OS X version of R-Project). Also, at the top of the window, a drop-down list is shown, which now shows ‘dp.HI.cook’, which happens to be the name of one of the functions defined in the syntax file. By clicking one of the items in this drop-down list, the cursor automatically jumps to that section of the file, allowing for fast and easy navigation. As is to be expected, the code can easily be sent from the editor to the R prompt, where it is executed.

To my liking, this editor is just a little too light-weight. It does a nice job colouring the syntax and such, but it lacks other features such as advanced find & replace, or management of multiple files. If you’re editing more than a single file, you will be doing it in more than a single window, which might be a little inconvenient. Also, when working on large files for a longer period, I found that at times the text wasn’t rendered any longer. I then had to save and re-open the file. Also, working on long files can be a tad slow, for it seems that the text-colouring tends to lag behind easily.

Nevertheless, having your text editor integrated in R-Project does have one strong advantage: the syntax help provided by the Mac OS X version of R-Project is also shown in the editor. So, if you type the name of a function and the opening bracket ‘(’, you immediately see all the pre-defined parameters of that function. This greatly reduces the need to consult the help-pages, and it is provided by only very few of the external editors.

All in all, the internal editor of R-Project is not bad at all, particularly on Mac OS X. For simple analyses and writing or editing of some small helper functions, it suffices. However, for more serious projects it will soon prove to be too ill equipped. Fortunately, we have some excellent external text editors for this, some of which will be discussed in the upcoming R-Sessions.


R-Sessions 25: Book – Mixed Effects Models in S and S-PLUS (Pinheiro & Bates, 2000)

Rense Nieuwenhuis — Wed, 01 Oct 2008 10:00:51 +0000

Despite the reference to S and S-PLUS in the title of this book, it offers an excellent guide for the nlme-package in R-Project. The reason for this is the close resemblance between R and S. The nlme-package, available in R-Project for estimation of both linear and non-linear multilevel models, is written and maintained by the authors of this book.

The book is not an introduction to R. Basic knowledge of R-Project (or S / S-PLUS) is required to get the most out of it, as well as some knowledge of multilevel theory. Although the book forms a thorough introduction to multilevel modeling, addressing some theory, the mathematics and of course the estimation and specification in R-Project (or S / S-PLUS), the learning curve it offers is quite steep. The authors do not shy away from applying matrix algebra and from specifying exactly which estimation procedures are used.

Not only the specification of basic models is described, but many other subjects are brought up. A specific grouped-data object is considered, as well as ways to visualize hierarchical data and multilevel models. Heteroscedasticity, often a violation of assumptions, can be caught in the models easily, as is described clearly in one of the chapters. Finally, not only linear models are tackled, but non-linear models as well.

All in all, this book is an excellent addition for those who have prior knowledge of both R-Project and multilevel analysis. Using real-data examples and by providing tons of output, the authors succeed in making clear the necessity of the more complex models and thereby invite the reader to invest time in the more fundamental aspects of multilevel analysis.


R-Sessions 23: Book: Data Analysis Using Regression and Multilevel/Hierarchical Models — Gelman & Hill (2007)

Rense Nieuwenhuis — Tue, 23 Sep 2008 10:00:05 +0000

Data Analysis Using Regression and Multilevel/Hierarchical Models

Andrew Gelman is known for his expertise on Bayesian statistics. Based on that knowledge he wrote a book on multilevel regression using R and WinBUGS. This book aims to be a thorough description of (multilevel) regression techniques, implementation of these techniques in R and BUGS, and a guide on interpreting the results of your analyses. In short, the book excels on all three subjects.

Admittedly, this review has been written based on first impressions of the book. But a sunny day in the park reading this book (literally) led me to believe that I have some understanding of what this book is trying to achieve. I bought this book in order to have an overview of fitting multilevel regression models using R. Starting to read the book, I soon found out that that is indeed what it has to offer me, but it offers me a lot more. After some introductory chapters, the book starts off with an introduction to linear regression and introduces the reader to the R software by showing how to fit linear regression models in R. This is readily expanded to logistic regression and generalized regression models. All is richly illustrated with many examples.

Before these ‘basic’ regression models are extended to multilevel models, Bayesian statistics are introduced. Based on simulation techniques, causal inferences, based on regression models, are made. The multilevel section of the book is set up similarly. First, ‘basic’ multilevel regression models are introduced. Throughout the book, the lmer function is used. This function is not only able to fit simple multilevel models, but logistic and generalized models as well. It can even estimate non-nested models. All in all, this forms a thorough introduction to multilevel regression analysis in itself, but the book continues here as well to introduce the reader to Bayesian statistics.

All above-mentioned models, as well as more complicated models, are fitted using WinBUGS as well. This very flexible method allows the reader to estimate a greater variety of (multilevel) models. Causal inference on multilevel models, using Bayesian statistics, is described as well. The third main part of the book goes beyond the skills the reader needs for ‘just’ fitting models. It teaches the reader to really think about what is going on. Topics such as ‘understanding and summarizing the fitted models’, ‘sample size and power calculations’, and most of all ‘model checking and comparison’ each receive their own chapter of the book. In this we can see that the authors of this book aimed higher than just writing instructions on how to let R fit (multilevel) regression models. The aim of this book is to teach the reader how to analyze data the proper way. Much attention is paid to assumptions, testing theory, and interpretation of what you’re doing. To quote the authors: “If you show something, be prepared to explain it”.

This philosophy seems to have been a guideline for the authors while writing this book, as does flexibility. The book starts off with some examples of the authors’ own research. These examples return throughout the book, so that the reader becomes familiar with the data. Due to this, the concepts, models and analyses described are certainly easier to understand. As a reader, you start to think along with the authors when a new problem is described. The relative worth of the techniques, as well as their drawbacks, are made perfectly clear. The use of R software, as well as WinBUGS, pays off: it requires some more effort to master these programs, but in that process the reader learns to think deeply about what he really wants to do and how it is done properly.

I did not find it an easy book, but thanks to the many examples throughout the book it can be fully understood by people with some prior knowledge of regression techniques. You can try all of the examples in the book yourself, since the data and syntax are available on the authors’ website for the book. This helps the reader to get some feel for the more difficult subjects of the book. All in all, this seems to me a great book for every applied researcher who has a basic prior understanding of regression analysis. Due to its focus on one set of techniques, a great depth of understanding can be derived from this book.