<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rense Nieuwenhuis &#187; R-Project</title>
	<atom:link href="http://www.rensenieuwenhuis.nl/tag/r-project/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rensenieuwenhuis.nl</link>
	<description>&#34;The extra-ordinary lies within the curve of normality&#34;</description>
	<lastBuildDate>Thu, 12 Mar 2026 14:58:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.2.2</generator>
	<item>
		<title>A novel method for modelling interaction between categorical variables</title>
		<link>http://www.rensenieuwenhuis.nl/a-novel-method-for-modelling-interaction-between-categorical-variables/</link>
		<comments>http://www.rensenieuwenhuis.nl/a-novel-method-for-modelling-interaction-between-categorical-variables/#comments</comments>
		<pubDate>Tue, 18 Apr 2017 08:00:43 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Peer Reviewed]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[dummy coding]]></category>
		<category><![CDATA[interaction]]></category>
		<category><![CDATA[moderation]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[wec]]></category>
		<category><![CDATA[weighted effect coding]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=6075</guid>
		<description><![CDATA[We have been developing weighted effect coding in an ongoing series of publications (hint: a publication in the R Journal will follow). To include nominal and ordinal variables as predictors in regression models, their categories ...]]></description>
				<content:encoded><![CDATA[<p>We have been developing weighted effect coding in an ongoing series of publications (hint: a publication in the R Journal will follow). To include nominal and ordinal variables as predictors in regression models, their categories first have to be transformed into so-called &#8216;dummy variables&#8217;. Many transformations are available; a popular choice is &#8216;dummy coding&#8217;, in which the estimates represent deviations from a preselected &#8216;reference category&#8217;. </p>
<p>To avoid choosing a reference category, weighted effect coding provides estimates representing deviations from the sample mean. This is particularly useful when the data are unbalanced (i.e., when categories hold different numbers of observations). The basics of this technique, with applications in R, were <a href="http://www.rensenieuwenhuis.nl/when-size-matters-weighted-effect-coding/">detailed here</a>.</p>
<p><a href="http://link.springer.com/article/10.1007/s00038-016-0902-0">In a new publication, available open access</a>, we show that weighted effect coding can also be applied to regression models with interaction effects (also commonly referred to as moderation). The weighted effect coded interactions represent the additional effects over and above the main effects obtained from the model without these interactions. </p>
<p>Procedures to apply the weighted effect coding introduced in these papers are available for R, SPSS, and Stata. For R, we created the &#8216;wec&#8217; package, which can be installed by typing:</p>
<blockquote><p>
install.packages(&#8220;wec&#8221;)
</p></blockquote>
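<p>As a minimal sketch of what this looks like in practice (mirroring the example from the package documentation; the PUMS data and variable names below are those shipped with the &#8216;wec&#8217; package), an interaction between a weighted effect coded factor and an interval variable can be specified as follows:</p>
<p><code><br />
library(wec)<br />
data(PUMS)<br />
PUMS$race.wec <- factor(PUMS$race)<br />
# Weighted effect coding contrasts: estimates are deviations from the sample mean<br />
contrasts(PUMS$race.wec) <- contr.wec(PUMS$race.wec, "White")<br />
# Interaction terms: additional effects over and above the main effects<br />
PUMS$race.educint <- wec.interact(PUMS$race.wec, PUMS$education.int)<br />
m <- lm(wage ~ race.wec + education.int + race.educint, data=PUMS)<br />
</code></p>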
<h3>References (Open Access!)</h3>
<p>Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., &#038; Konig, R. (2017). <b>A novel method for modelling interaction between categorical variables</b>. <i>International Journal of Public Health</i>, 62(3), 427–431. <a href="http://link.springer.com/article/10.1007/s00038-016-0902-0">http://link.springer.com/article/10.1007/s00038-016-0902-0</a> </p>
<p>Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., &#038; Konig, R. (2017). <b>When size matters: advantages of weighted effect coding in observational studies</b>. <i>International Journal of Public Health</i>, 62(1), 163–167. <a href="http://doi.org/10.1007/s00038-016-0901-1">http://doi.org/10.1007/s00038-016-0901-1</a> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/a-novel-method-for-modelling-interaction-between-categorical-variables/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>When Size Matters: Weighted Effect Coding</title>
		<link>http://www.rensenieuwenhuis.nl/when-size-matters-weighted-effect-coding/</link>
		<comments>http://www.rensenieuwenhuis.nl/when-size-matters-weighted-effect-coding/#comments</comments>
		<pubDate>Fri, 24 Feb 2017 07:50:02 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Peer Reviewed]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[dummy]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[wec]]></category>
		<category><![CDATA[weighted effect coding]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=6056</guid>
		<description><![CDATA[Categorical variables in regression models are often included by dummy variables. In R, this is done with factor variables with treatment coding. Typically, the difference and significance of each category are tested against a preselected ...]]></description>
				<content:encoded><![CDATA[<p>Categorical variables in regression models are often included by dummy variables. In R, this is done with factor variables with treatment coding. Typically, the difference and significance of each category are tested against a preselected reference category. We present a useful alternative. </p>
<p>If all categories have (roughly) the same number of observations, you can also test all categories against the grand mean using effect (ANOVA) coding. In observational studies, however, the number of observations per category typically varies. <a href="http://link.springer.com/article/10.1007/s00038-016-0901-1">Our new paper shows how categories of a factor variable can be tested against the sample mean</a>. Although the paper has been online for some time now (and this post is an update to an earlier post from some time ago), we are happy to announce that our paper has now officially been published in the International Journal of Public Health.</p>
<p>Procedures to apply the weighted effect coding introduced in these papers are available for R, SPSS, and Stata. For R, we created the ‘wec’ package, which can be installed by typing:</p>
<blockquote><p>
install.packages(&#8220;wec&#8221;)
</p></blockquote>
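<p>As a minimal sketch (based on the example in the &#8216;wec&#8217; package documentation; the PUMS data and variable names are those shipped with the package), categories can then be tested against the sample mean like this:</p>
<p><code><br />
library(wec)<br />
data(PUMS)<br />
PUMS$race.wec <- factor(PUMS$race)<br />
# Each category is now tested against the sample mean, not a reference category<br />
contrasts(PUMS$race.wec) <- contr.wec(PUMS$race.wec, "White")<br />
m <- lm(wage ~ race.wec, data=PUMS)<br />
summary(m)<br />
</code></p>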
<h3>References</h3>
<p>Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., &#038; Konig, R. (2017). When size matters: advantages of weighted effect coding in observational studies. <i>International Journal of Public Health</i>, 62(1), 163–167. <a href="http://doi.org/10.1007/s00038-016-0901-1">http://doi.org/10.1007/s00038-016-0901-1</a> </p>
<p>Sweeney, R., &#038; Ulveling, E. F. (1972). A transformation for simplifying the interpretation of coefficients of binary variables in regression analysis. <i>The American Statistician</i>, 26, 30–32.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/when-size-matters-weighted-effect-coding/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New version of WEC: focus on interactions</title>
		<link>http://www.rensenieuwenhuis.nl/new-version-of-wec-focus-on-interactions/</link>
		<comments>http://www.rensenieuwenhuis.nl/new-version-of-wec-focus-on-interactions/#comments</comments>
		<pubDate>Tue, 17 Jan 2017 11:00:52 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[My Publications]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[Academic Software]]></category>
		<category><![CDATA[open-source]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[wec]]></category>
		<category><![CDATA[weighted effect coding]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=6037</guid>
		<description><![CDATA[We have uploaded a new version of WEC, an R package to apply &#8216;weighted effect coding&#8217; to your dummy variables. With weighted effect coding, your dummy variables represent the deviation of their respective category from ...]]></description>
				<content:encoded><![CDATA[<p>We have uploaded a new version of WEC, an R package to apply &#8216;weighted effect coding&#8217; to your dummy variables. With weighted effect coding, your dummy variables represent the deviation of their respective category from the sample mean, rather than the deviation from a reference category. Particularly with observational data, which are often unbalanced, this can have attractive interpretations. We recently published two articles in which we discuss some of the advantages:</p>
<p><a href="http://doi.org/10.1007/s00038-016-0901-1"> Grotenhuis, M., Ben Pelzer, Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., &#038; Konig, R. (2016b). When size matters: advantages of weighted effect coding in observational studies. <i>International Journal of Public Health</i>, 1–5. http://doi.org/10.1007/s00038-016-0901-1 </a></p>
<p><a href="http://doi.org/10.1007/s00038-016-0902-0"> Grotenhuis, M., Ben Pelzer, Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., &#038; Konig, R. (2016a). A novel method for modelling interaction between categorical variables. <i>International Journal of Public Health</i>, 1–5. http://doi.org/10.1007/s00038-016-0902-0 </a></p>
<p>As some of the real advantages of weighted effect coding come into play when using interactions, that is what we focused on in the current update to our &#8216;wec&#8217; package (version 0.4). The package now supports interactions between a weighted effect coded factor variable and an interval variable, and the calculation of interactions between two weighted effect coded factor variables has been much improved. An example is given below (with more to follow, hopefully soon).</p>
<p><code><br />
library(wec)<br />
data(PUMS)<br />
PUMS$race.wec <- factor(PUMS$race)<br />
contrasts(PUMS$race.wec) <- contr.wec(PUMS$race.wec, "White")<br />
PUMS$race.educint <- wec.interact(PUMS$race.wec, PUMS$education.int)<br />
m.wec.educ <- lm(wage ~ race.wec + education.int + race.educint, data=PUMS)<br />
summary(m.wec.educ)$coefficients<br />
</code></p>
<p>The code above results in a regression model (shown below) in which the main effect for education (9048) remains the same, whether the interaction terms are included or not (you can try this yourself). Thus, the interaction terms represent how much the average education effect varies by race.</p>
<pre>
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                     52320        559    93.5  0.0e+00
race.wecHispanic                -4955       1736    -2.9  4.3e-03
race.wecBlack                  -11276       1817    -6.2  5.7e-10
race.wecAsian                    5151       2381     2.2  3.1e-02
education.int                    9048        287    31.6 2.3e-208
race.educintinteractHispanic    -3266        977    -3.3  8.3e-04
race.educintinteractBlack       -3293        990    -3.3  8.8e-04
race.educintinteractAsian        3575       1217     2.9  3.3e-03
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/new-version-of-wec-focus-on-interactions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weighted Effect Coding: Dummy coding when size matters</title>
		<link>http://www.rensenieuwenhuis.nl/weighted-effect-coding-dummy-coding-when-size-matters/</link>
		<comments>http://www.rensenieuwenhuis.nl/weighted-effect-coding-dummy-coding-when-size-matters/#comments</comments>
		<pubDate>Mon, 31 Oct 2016 11:00:32 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[My Publications]]></category>
		<category><![CDATA[Peer Reviewed]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[dummies]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=5979</guid>
		<description><![CDATA[If your regression model contains a categorical predictor variable, you commonly test the significance of its categories against a preselected reference category. If all categories have (roughly) the same number of observations, you can also ...]]></description>
				<content:encoded><![CDATA[<p>If your regression model contains a categorical predictor variable, you commonly test the significance of its categories against a preselected reference category. If all categories have (roughly) the same number of observations, you can also test all categories against the grand mean using effect (ANOVA) coding. In observational studies, however, the number of observations per category typically varies. We <a href="http://link.springer.com/article/10.1007/s00038-016-0901-1">published a paper in the International Journal of Public Health</a>, showing how all categories can be tested against the sample mean.</p>
<p>In a <a href="http://link.springer.com/article/10.1007/s00038-016-0902-0">second paper in the same journal</a>, the procedure is expanded to regression models that test interaction effects. Within this framework, the weighted effect coded interaction displays the extra effect on top of the main effect found in a model without the interaction effect. This offers a promising new route to estimate interaction effects in observational data, where different category sizes often prevail.</p>
<p>Procedures to apply the weighted effect coding introduced in these papers are <a href="http://www.ru.nl/sociology/mt/wec/downloads/">available for R, SPSS, and Stata</a>. For R, we created the <a href="https://cran.r-project.org/web/packages/wec/index.html">‘wec’ package</a>, which can be installed by typing:</p>
<blockquote><p>
install.packages(&#8220;wec&#8221;)
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/weighted-effect-coding-dummy-coding-when-size-matters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introducing Influence.ME: Tools for detecting influential data in mixed models</title>
		<link>http://www.rensenieuwenhuis.nl/introducing-influenceme/</link>
		<comments>http://www.rensenieuwenhuis.nl/introducing-influenceme/#comments</comments>
		<pubDate>Wed, 29 Apr 2009 09:03:25 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Influence.ME]]></category>
		<category><![CDATA[lmer]]></category>
		<category><![CDATA[mixed models]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[regression]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=914</guid>
		<description><![CDATA[I&#8217;m highly excited to announce that influence.ME is now available. Influence.ME is a new software package for R, providing statistical tools for detecting influential data in mixed models. It has been developed by Rense Nieuwenhuis, ...]]></description>
				<content:encoded><![CDATA[<p>I&#8217;m highly excited to announce that <a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/">influence.ME</a> is now available. <a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/">Influence.ME</a> is a new software package for <a href="http://www.r-project.org">R</a>, providing statistical tools for detecting influential data in mixed models. It has been developed by <a href="http://www.rensenieuwenhuis.nl">Rense Nieuwenhuis</a>, <a href="http://benpelzer.ruhosting.nl/">Ben Pelzer</a>, and <a href="http://www.ru.nl/methodenentechnieken/syntax/mtg/">Manfred te Grotenhuis</a>. The basic rationale behind identifying influential data is that when single units are iteratively omitted from the data, models based on the remaining data should not produce substantially different estimates. To standardize the assessment of how influential units are, several measures of influence are commonly used, such as DFBETAS and Cook&#8217;s Distance.<br />
<!--adsense--><br />
<span id="more-914"></span><br />
Mixed effects regression models are becoming common practice in the Social Sciences, but diagnostic tools for evaluating these models lag behind. For instance, there is no generally applicable tool to check whether all units (or cases) have roughly the same influence on the regression parameters. It is, however, commonly accepted that tests for influential cases should be performed, especially when estimates are based on a relatively small number of cases. Testing for influence in mixed effects models is especially important in Social Science applications, for two reasons. First, models in the Social Sciences are frequently based on large numbers of individuals, while the number of higher level units is often relatively small. Second, the higher level units are often remarkably similar, for instance in the case of neighboring countries. Influence.ME is a new package for R which provides two innovations for evaluating influential cases: it extends existing procedures for use with mixed effects models, and it allows one to search not only for single influential cases, but also for combinations of cases that jointly exert too much influence.</p>
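<p>As a rough sketch of the intended workflow (the model below uses the Exam data from the mlmRev package; the function and argument names follow the influence.ME documentation, so treat this as an illustration rather than a definitive reference), one would estimate a mixed model and then pass it to influence.ME:</p>
<p><code><br />
library(lme4)<br />
library(mlmRev)<br />
library(influence.ME)<br />
model <- lmer(normexam ~ standLRT + (1 | school), data=Exam)<br />
# Re-estimate the model, iteratively omitting one school at a time<br />
estex.model <- influence(model, group="school")<br />
# Standardized measures of influence per school<br />
dfbetas(estex.model)<br />
cooks.distance(estex.model)<br />
</code></p>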
<p>I plan to use my blog to provide more information about influence.ME. For instance, you can expect some example analyses soon. Other developments, new features, or exciting applications in research papers will be discussed here as well in due time. <a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/">A static page on influence.ME</a> is available as well, where all important information is collected.  </p>
<p>Questions, comments, thoughts, experiences, notes on bugs (and other vermin), feature requests, and what more: it is all highly appreciated. They can be sent by e-mail, or placed in the comments-section on this blog.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/introducing-influenceme/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>useR! 2009 acceptance: presenting influence.ME</title>
		<link>http://www.rensenieuwenhuis.nl/user-2009-acceptance-presenting-influenceme/</link>
		<comments>http://www.rensenieuwenhuis.nl/user-2009-acceptance-presenting-influenceme/#comments</comments>
		<pubDate>Thu, 23 Apr 2009 10:23:56 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Influence.ME]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[glmer]]></category>
		<category><![CDATA[influential data]]></category>
		<category><![CDATA[lme4]]></category>
		<category><![CDATA[lmer]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[useR! 2009]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=935</guid>
		<description><![CDATA[The organizing committee of the useR! 2009 conference just informed me, that my submission for presenting my extension package influence.ME, has been accepted! Influence.ME is a new R package that I&#8217;m currently developing, with the ...]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/"><img src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2009/04/logo-influence.jpg?w=450" alt="Logo influence.ME" title="Logo influence.ME" data-recalc-dims="1" /></a></p>
<p><!--adsense--></p>
<p>The organizing committee of the useR! 2009 conference just informed me that my submission to present my extension package influence.ME has been accepted! Influence.ME is a new R package that I&#8217;m currently developing, with the indispensable help of <a href="http://benpelzer.ruhosting.nl/">Ben Pelzer</a> and <a href="http://www.ru.nl/methodenentechnieken/methoden_technieken/medewerkers/vm_medewerkers/manfred_te/">Manfred te Grotenhuis</a>. Although I did not yet introduce influence.ME on this blog, rest assured that I will do so within just a few weeks. Now is the time for celebration!<br />
<span id="more-935"></span></p>
<p><a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/">Influence.ME</a> is an <a href="http://www.r-project.org">R</a> package that provides a collection of tools for detecting influential data in mixed effects models. Testing for influence in mixed effects models is especially important in Social Science applications, for two reasons. First, models in the Social Sciences are frequently based on large numbers of individuals, while the number of higher level units is often relatively small. Second, the higher level units are often remarkably similar, for instance in the case of neighboring countries. </p>
<p>useR! is a yearly conference on exciting applications of R. The <a href="http://www2.agrocampus-ouest.fr/math/useR-2009/">useR! 2009 edition</a> will be held in Rennes, France. A great variety of packages, applications, and other developments relating to R will be discussed. I visited the <a href="http://www.rensenieuwenhuis.nl/category/science/user-2008/">useR! 2008 conference</a> last year (in Dortmund, Germany), and found it a highly stimulating environment for those interested in exciting, practical applications of statistics using R. </p>
<p>Influence.ME is a project I&#8217;ve been working on for the last few months, together with Ben Pelzer and Manfred te Grotenhuis. I&#8217;m still working &#8211; quite hard! &#8211; to iron out the last quirks, and we have tons of ideas for extending its functionality. I&#8217;m very happy to be able to present the result of this work to an R-minded audience this summer.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/user-2009-acceptance-presenting-influenceme/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>R-Sessions 32: Forward.lmer: Basic stepwise function for mixed effects in R</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-32/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-32/#comments</comments>
		<pubDate>Fri, 13 Feb 2009 10:59:03 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[forward]]></category>
		<category><![CDATA[hierarchical]]></category>
		<category><![CDATA[lme4]]></category>
		<category><![CDATA[mixed effects models]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[stepwise]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=897</guid>
		<description><![CDATA[Intended to be a customized solution, it may have grown to be a little more. forward.lmer is an early installment of a full stepwise function for mixed effects regression models in R-Project. I may put ...]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img title="R-Sessions" src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" alt="" data-recalc-dims="1" /></a> </p>
<p>Intended to be a customized solution, it may have grown to be a little more. forward.lmer is an early installment of a full stepwise function for mixed effects regression models in R-Project. I may put in some work to extend it, or I may not. Nevertheless, in a &#8216;forward sense of stepwise&#8217;, I think it can be pretty useful as it is. Also, it has an interesting take on the stepwise concept, I think.<br />
<!--adsense--><br />
<span id="more-897"></span></p>
<p>Most stepwise functions (as far as I know) take a base model and a bunch of variables, and then iteratively add and/or subtract some variables, according to various criteria, to arrive at the best fitting regression model. All very interesting, but how to deal with interaction variables? And moreover: most existing functions do not work with mixed effects models (I use the term &#8216;mixed effects models&#8217; here to refer to what are often called hierarchical or multilevel regression models as well). </p>
<p>Built around the lme4 package in R, forward.lmer provides a forward stepwise procedure for mixed effects models. It also allows the user to enter not only single variables into models, but blocks of variables as well. This opens up many options: users can add complete interactions at once (i.e. both the original and the multiplicative terms), or add these consecutively. Future development will focus on additional selection criteria for interactions, such as the criterion that at least the multiplicative term needs to be statistically significant. </p>
<p>The user provides a starting model and a set of variables to evaluate. The procedure then updates the starting model with the addition of every single variable (or block of variables). The models are ordered based on their log-likelihood (other criteria, i.e. BIC and AIC, to follow soon), after which the best fitting model is evaluated against one of two criteria. The first criterion is that at least one of the added parameters is statistically significant. The other criterion is that the addition of the parameters together is statistically significant. </p>
<p>There are several parameters to be specified:</p>
<ul>
<li>start.model: The starting model the procedure starts with. This can be a null-model, or a model already containing several variables. All lmer-models (i.e. logistic, poisson, linear) are supported.</li>
<li>blocks: a vector of variable names (as character strings) to be added to the model. Several variables can be concatenated within the same character string, so that they are added as a block of variables instead of one variable at a time.</li>
<li>max.iter: The maximum number of variables that are evaluated. If max.iter is reached, the procedure stops without adding more variables. </li>
<li>sig.level: This is the p-value against which it is tested whether the new model fits better than a base model. Either sig.level or zt needs to be specified, but not both at once.</li>
<li>zt: This is either the T or Z value that is used to test whether (at least) one of the added variables is statistically significant. T values are used for linear regression, Z values for binary response models.</li>
<li>print.log: Should a log be printed? The log contains information on which variables (and on which criteria) were added in each step.</li>
</ul>
<p>The forward.lmer function returns the best fitting model (according to the given criteria). Of course, one can use this resulting model as a starting model for a new stepwise procedure.</p>
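<p>A hypothetical call (the data and variable names below are illustrative, using the Exam data from the mlmRev package; &#8216;sex + schavg&#8217; is evaluated as one block of two variables):</p>
<p><code><br />
library(lme4)<br />
library(mlmRev)<br />
start <- lmer(normexam ~ 1 + (1 | school), data=Exam)<br />
# Evaluate two blocks, adding at most two of them, at a 5% significance level<br />
best <- forward.lmer(start, blocks=c("standLRT", "sex + schavg"),<br />
	max.iter=2, sig.level=0.05, print.log=TRUE)<br />
</code></p>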
<p><code><br />
forward.lmer <- function(<br />
	start.model, blocks,<br />
	max.iter=1, sig.level=FALSE,<br />
	zt=FALSE, print.log=TRUE)<br />
	{</p>
<p>	# forward.lmer: a function for stepwise regression using lmer mixed effects models<br />
	# Author: Rense Nieuwenhuis</p>
<p>	# Initialising internal variables<br />
	log.step <- 0<br />
	log.LL <- log.p <- log.block <- zt.temp <- log.zt <- NA<br />
	model.basis <- start.model</p>
<p>	# Maximum number of iterations cannot exceed number of blocks<br />
	if (max.iter > length(blocks)) max.iter <- length(blocks)</p>
<p>	# Setting up the outer loop<br />
	for(i in 1:max.iter)<br />
		{</p>
<p>		models <- list()</p>
<p>		# Iteratively updating the model with addition of one block of variable(s)<br />
		# Also: extracting the loglikelihood of each estimated model<br />
		for(j in 1:length(blocks))<br />
			{<br />
			models[[j]] <- update(model.basis, as.formula(paste(". ~ . + ", blocks[j])))<br />
			}</p>
<p>		LL <- unlist(lapply(models, logLik))</p>
<p>		# Ordering the models based on their loglikelihood.<br />
		# Additional selection criteria apply<br />
		for (j in order(LL, decreasing=TRUE))<br />
			{</p>
<p>			##############<br />
			############## Selection based on ANOVA-test<br />
			##############</p>
<p>			if(sig.level != FALSE)<br />
				{<br />
				if(anova(model.basis, models[[j]])[2,7] < sig.level)<br />
					{</p>
<p>					model.basis <- models[[j]]</p>
<p>					# Writing the logs<br />
					log.step <- log.step + 1<br />
					log.block[log.step] <- blocks[j]<br />
					log.LL[log.step] <- as.numeric(logLik(model.basis))<br />
					log.p[log.step] <- anova(model.basis, models[[j]])[2,7]</p>
<p>					blocks <- blocks[-j]</p>
<p>					break<br />
					}<br />
				}</p>
<p>			##############<br />
			############## Selection based on significance of the added variable block<br />
			##############	</p>
<p>			if(zt != FALSE)<br />
				{<br />
				b.model <- summary(models[[j]])@coefs<br />
				diff.par <- setdiff(rownames(b.model), rownames(summary(model.basis)@coefs))<br />
				if (length(diff.par)==0) break<br />
				sig.par <- FALSE</p>
<p>				for (k in 1:length(diff.par))<br />
					{<br />
					if(abs(b.model[which(rownames(b.model)==diff.par[k]),3]) > zt)<br />
						{<br />
						sig.par <- TRUE<br />
						zt.temp <- b.model[which(rownames(b.model)==diff.par[k]),3]<br />
						break<br />
						}<br />
					}					</p>
<p>				if(sig.par==TRUE)<br />
					{<br />
					model.basis <- models[[j]]</p>
<p>					# Writing the logs<br />
					log.step <- log.step + 1<br />
					log.block[log.step] <- blocks[j]<br />
					log.LL[log.step] <- as.numeric(logLik(model.basis))<br />
					log.zt[log.step] <- zt.temp<br />
					blocks <- blocks[-j]</p>
<p>					break<br />
					}<br />
				}<br />
			}<br />
		}</p>
<p>	## Create and print log<br />
	log.df <- data.frame(log.step=1:log.step, log.block, log.LL, log.p, log.zt)<br />
	if(print.log == TRUE) print(log.df, digits=4)</p>
<p>	## Return the 'best' fitting model<br />
	return(model.basis)<br />
	} </p>
<p></code></p>
<p>As always, you're invited to use this function, or to adapt it to your needs. However, you are required to make mention of this function and its author. Additionally, since I intend to continue working on this function (perhaps even evolving it into a 'package' on CRAN), I would love to hear about any experiences in using it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-32/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>R-Sessions 31: Combining lmer output in a single table (UPDATED)</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-31-combining-lmer-output-in-a-single-table/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-31-combining-lmer-output-in-a-single-table/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 11:00:38 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[lme4]]></category>
		<category><![CDATA[lmer]]></category>
		<category><![CDATA[mixed effect models]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=891</guid>
		<description><![CDATA[There are various ways of getting your output from R to your publication draft. Most of them are highly efficient, but unfortunately I couldn&#8217;t find a function that combines the output from several (lmer) models ...]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img title="R-Sessions" src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" alt="" data-recalc-dims="1" /></a><br />
<!--adsense--></p>
<p>There are various ways of getting your output from R to your publication draft. Most of them are highly efficient, but unfortunately I couldn&#8217;t find a function that combines the output from several (lmer) models and presents it in a single table. lmer is the mixed effects model function from the lme4 package. So, I wrote a simple function that does exactly that.<br />
<span id="more-891"></span></p>
<p>I wrote it for a specific purpose, so it is not a fully general function, but it can easily be adapted for use in other settings. Here it goes:</p>
<p><code><br />
require(lme4)<br />
require(mlmRev)</p>
<p>model.1 <- lmer(normexam ~ 1 + (1 | school), data=Exam)<br />
model.2 <- lmer(normexam ~ standLRT + (1 | school), data=Exam)<br />
model.3 <- lmer(normexam ~ standLRT + sex + (1 | school), data=Exam)<br />
model.4 <- lmer(normexam ~ standLRT + sex + schavg + (1 | school), data=Exam)</p>
<p>model.a <- lmer(use ~ 1 + (1 | district), family=binomial, data=Contraception)<br />
model.b <- lmer(use ~ livch + (1 | district), family=binomial, data=Contraception)<br />
model.c <- lmer(use ~ age + (1 | district), family=binomial, data=Contraception)<br />
model.d <- lmer(use ~ livch + age + (1 | district), family=binomial, data=Contraception)</p>
<p>m1 <- c(model.1, model.2, model.3, model.4)<br />
m2 <- c(model.a, model.b, model.c, model.d)</p>
<p>combine.output.lmer <- function(models, labels=FALSE)<br />
	{</p>
<p>	fix.coef <- lapply(models, function(x) summary(x)@coefs)<br />
	var.coef <- lapply(models, function(x) summary(x)@REmat)<br />
	n.par <- dim(summary(models[[1]])@coefs)[2]</p>
<p>	ifelse(labels==FALSE,<br />
		fix.labels <- colnames(summary(models[[1]])@coefs),<br />
		fix.labels <- labels)</p>
<p>	var.labels <- colnames(var.coef[[1]])</p>
<p>	# Creating table with fixed parameters<br />
	output.coefs <- data.frame(Row.names=row.names(fix.coef[[1]]))<br />
	for (i in 1:length(models))<br />
		{</p>
<p>		a <- fix.coef[[i]]<br />
		colnames(a) <- paste("Model", i, fix.labels)<br />
		output.coefs <- merge(output.coefs, a, by.x=1, by.y=0, all=T, sort=FALSE)</p>
<p>		}<br />
	output.coefs[,1] <- as.character(output.coefs[,1])<br />
	output.coefs[dim(output.coefs)[1]+2, 1] <- "Loglikelihood"<br />
	LL <- unlist(lapply(models, function(x) as.numeric(logLik(x))))<br />
	output.coefs[dim(output.coefs)[1], 1:length(models)*n.par-n.par+2] <- LL</p>
<p>	# Creating table with random parameters<br />
	output.vars <- data.frame(var.coef[[1]])[,1:2]<br />
	for (i in 1:length(models))<br />
		{</p>
<p>		a <- var.coef[[i]]<br />
		colnames(a) <- paste("Model", i, var.labels)<br />
		output.vars <- merge(output.vars, a, by.x=1:2, by.y=1:2, all=T, sort=FALSE)</p>
<p>		}</p>
<p>	# Combining output.coefs and output.vars<br />
	n.cols <- dim(output.coefs)[2]<br />
	n.coefs <- dim(output.coefs)[1]<br />
	n.vars <- dim(output.vars)[1]</p>
<p>	output <- matrix(ncol=n.cols +1 , nrow=n.vars+n.coefs+2)</p>
<p>	output[1:n.coefs, -2] <- as.matrix(output.coefs)<br />
	output[n.coefs+2, 1] <- "Variance Components"<br />
	output[(n.coefs+3) : (n.coefs+n.vars+2), 1:2] <- as.matrix(output.vars[,1:2])<br />
	output[<br />
		(n.coefs+3) : (n.coefs+n.vars+2),<br />
		which(rep(c(1,1,rep(0, n.par-2)),length(models))!=0)+2] <- as.matrix(output.vars[,c(-1,-2)])</p>
<p>	colnames(output) <- c("Parameter", "Random", colnames(output.coefs)[-1])</p>
<p>	return(output)<br />
	}</p>
<p>combined <- combine.output.lmer(m1)<br />
combined <- combine.output.lmer(m2)</p>
<p>combined <- combine.output.lmer(m1, labels=c("appel", "banaan", "grapefruit"))<br />
combined <- combine.output.lmer(m2, labels=c("appel", "peer", "banaan", "grapefruit"))</p>
<p>write.csv(combined, "combined.csv", na=" ")<br />
</code></p>
<p>In this example I estimate two sets of four mixed effects models, which are concatenated into the list objects 'm1' and 'm2'. The function itself is called 'combine.output.lmer', and is applied to each of these objects in turn. The output is a table with the variable names in the first column. Parameters that were not estimated in a given model are indicated by 'NA' in that model's columns. By writing the 'combined' object to an external file, the NAs are replaced by blanks and the file can be read into other software, such as an OpenOffice spreadsheet or Excel. Use the xtable package to get it into your LaTeX document. </p>
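<p>To see what the na=" " argument in the write.csv() call does, here is a small self-contained sketch using a toy data frame instead of the lmer output above (the names and numbers are made up for illustration): parameters a model did not estimate end up as blank cells rather than the literal text "NA".</p>

```r
# Toy stand-in for the 'combined' table: Model 2 lacks the standLRT parameter
combined.toy <- data.frame(
  Parameter = c("(Intercept)", "standLRT"),
  Model.1   = c(-0.01, 0.56),
  Model.2   = c(0.45, NA)
)

# na=" " writes a blank instead of the text "NA", which spreadsheet
# software then shows as an empty cell
out.file <- tempfile(fileext = ".csv")
write.csv(combined.toy, out.file, na = " ", row.names = FALSE)

readLines(out.file)  # the text "NA" appears nowhere in the file
```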
<p>UPDATE<br />
I updated and improved the code somewhat, as I wasn't satisfied with the results. The code now adapts to the number of parameters derived from the models' summary, lets you add your own names to the columns, and, most importantly, also reports the random slopes.</p>
<p>Please note: due to the internal matching procedure, errors may occur when the same variable is random 'within' more than one other variable. This is only the case when other variables are random within each nesting factor as well. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-31-combining-lmer-output-in-a-single-table/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>R-Sessions 30: Visualizing missing values</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-30-visualizing-missing-values/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-30-visualizing-missing-values/#comments</comments>
		<pubDate>Thu, 08 Jan 2009 10:00:39 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[GSS]]></category>
		<category><![CDATA[GSS cumulative file]]></category>
		<category><![CDATA[Missing values]]></category>
		<category><![CDATA[R-Project]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=872</guid>
		<description><![CDATA[<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img title="R-Sessions" src="http://www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg" alt="" width="470" /></a> 

It always takes some time to get a grip on a new dataset, especially a large one. The code-books are often as indispensable as they are massive, and not always as clear as one would want. Routings, and the resulting strange patterns of missing values, are at times difficult to find.

I found a nice way to plot missing values, using R. Basically, I thought it would be nice to calculate the percentage of missings on each variable, and do so for each year represented in the data. These numbers could be visualized using a levelplot(), which resulted in the graph below.]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img title="R-Sessions" src="http://i0.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" alt="" data-recalc-dims="1" /></a> </p>
<p>It always takes some time to get a grip on a new dataset, especially a large one. The code-books are often as indispensable as they are massive, and not always as clear as one would want. Routings, and the resulting strange patterns of missing values, are at times difficult to find.</p>
<p>I found a nice way to plot missing values, using R. Basically, I thought it would be nice to calculate the percentage of missings on each variable, and do so for each year represented in the data. These numbers could be visualized using a levelplot(), which resulted in the graph below.</p>
<p><a href="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2009/01/missings.jpg"><img src="http://i0.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2009/01/missings.jpg?w=450" alt="missings" title="missings" class="alignnone size-medium wp-image-873" data-recalc-dims="1" /></a><br />
<span id="more-872"></span><br />
In this example I used a small subset of variables from the <a href="http://www.icpsr.umich.edu/cocoon/ICPSR/STUDY/04697.xml">cumulative file of the General Social Survey</a>, which is freely available from the web. I used this syntax:</p>
<p><code><br />
testing.NA <- matrix(nrow=dim(GSS)[2], ncol=length(unique(GSS$year)))<br />
for (i in 1:dim(GSS)[2])<br />
	{<br />
	testing.NA[i,] <- tapply(GSS[[i]], GSS$year, function(x) sum(is.na(x)) / length(x))<br />
	}</p>
<p>dimnames(testing.NA) <- list(<br />
	names(GSS),<br />
	sort(unique(GSS$year)))</p>
<p>library(lattice)</p>
<p>levelplot(testing.NA,<br />
	scales=list(x=list(rot=90)),<br />
	main="Percentage missing values on variables in GSS",<br />
	xlab="Variable",<br />
	ylab="Year")<br />
</code></p>
<p>First, I defined the testing.NA matrix, using the number of variables and years. Then, in a loop, I calculate the percentage of missing values, basically using is.na() and length(). I assign dimnames to the matrix and use the levelplot() function from the lattice library to plot the matrix. That's it, easy does it.</p>
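<p>For readers without the GSS file at hand, the same calculation can be tried on a toy data frame (all names and values below are made up for illustration): one row per variable, one column per year, filled with the share of missing values.</p>

```r
# Toy stand-in for GSS: two variables observed in each of three years
toy <- data.frame(
  year  = rep(c(1972, 1973, 1974), each = 4),
  abany = c(NA, NA, NA, NA,  1, 0, NA, 1,   1, 1, 0, 0),
  educ  = c(12, 16, NA, 14,  12, 12, 10, NA, 16, 12, 14, 12)
)

# For every variable, tapply() splits it by year and computes the
# proportion of NAs; t() puts variables in rows and years in columns
vars <- setdiff(names(toy), "year")
missings <- t(sapply(toy[vars], function(x)
  tapply(x, toy$year, function(v) sum(is.na(v)) / length(v))))

missings["abany", "1972"]  # 1: abany is entirely missing in 1972
```

<p>The resulting matrix can be fed straight into levelplot(), exactly as in the GSS example above.</p>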
<p>But: does it help? I think it does. Of course, all this information can be gained from the code-book, and needs to be verified. However, it does give us some immediate notes on the availability of these variables. For instance, we see that in the first few years, the abany variable is missing, whereas other variables on abortion are not. When creating scales this needs to be taken into account, so as not to lose the first few years of data entirely. The speduc variable (spouse's educational level) has a high number of missings, as does the denom variable. This, however, makes sense: not everybody has a spouse, and the denom variable only applies to Protestants. Finally, this graph gives some pointers on a change in survey strategy from 1988 onwards regarding the items on induced abortion. The percentage of missing values increased sharply at that point, and did so for all abortion-related variables. </p>
<p>This graph does not tell us what exactly happened, but it does provide nice pointers on what to look for when reading the code-book. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-30-visualizing-missing-values/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>R-Sessions 29: Running R-Project twice on Apple Mac OS X</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-29-running-r-project-twice-on-apple-mac-os-x/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-29-running-r-project-twice-on-apple-mac-os-x/#comments</comments>
		<pubDate>Mon, 24 Nov 2008 10:00:40 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[computing]]></category>
		<category><![CDATA[statistical software]]></category>
		<category><![CDATA[twice]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=835</guid>
		<description><![CDATA[Working with statistics can be quite time consuming. As anyone working with relatively advanced models and large amounts of data knows, especially the waiting can be excruciating. Your statistical software is locked up while crunching those numbers, while you'd actually prefer to run some minor procedures, such as post-estimations, testing some loops, or simply displaying the output of a previously estimated model. With Apple's Mac OS X you now can run R-Project twice, making the most of your dual core processor. ]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img title="R-Sessions" src="http://i0.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" alt="" data-recalc-dims="1" /></a><br />
Working with statistics can be quite time consuming. As anyone working with relatively advanced models and large amounts of data knows, especially the waiting can be excruciating. Your statistical software is locked up while crunching those numbers, while you&#8217;d actually prefer to run some minor procedures, such as post-estimations, testing some loops, or simply displaying the output of a previously estimated model. With Apple&#8217;s Mac OS X you now can run R-Project twice, making the most of your dual core processor. <span id="more-835"></span></p>
<p>The procedure is very easy, and it works like a charm. Mind, though, that it obviously drains your computer&#8217;s resources heavily, so the performance of each instance of R-Project decreases at least slightly. For that to change, we would need dual-hard-disk laptops, dual-RAM laptops, and such. Dual laptop-laptops, basically.</p>
<p><img src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/11/r-copy.jpg?resize=500%2C287" alt="" title="Duplicating R-Project to run it twice" class="aligncenter size-full wp-image-837" data-recalc-dims="1" /></p>
<p>Back to running R-Project twice. Just start R-Project as usual. Then go to your applications folder and secondary-click on the R-Project app. Select &#8216;duplicate&#8217;, and there you are: an app named R copy emerges. Start this as usual and start working. </p>
<p>In the image below you see two instances of R-Project running. The first is working on a heavy-weight function that produces some output every hour or so and runs 96 times. In other words: it takes ages. However, it stores the output in an external file, and since each little bit of output needs some post-estimation before it can be interpreted, I can use the second instance to load that data and examine it (not shown).</p>
<p><a href="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/11/2-r.jpg"><img src="http://i2.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/11/2-r.jpg?w=470" alt="" title="R-Project running twice on Mac OS X" class="alignnone size-medium wp-image-836" data-recalc-dims="1" /></a></p>
<p>Although you don&#8217;t need to re-install packages, the one thing I have not (yet) found out is how to share resources between these two instances of R-Project. Being able to share variables, models, and such would be great. Ideas anyone? </p>
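<p>One simple workaround, sketched below, is a hand-off via disk rather than true shared memory: one instance writes its objects with save(), the other picks them up with load(). In real use the two instances would agree on a fixed file path; the sketch uses a temporary file only to demonstrate the mechanics within a single session.</p>

```r
# A file both instances can reach; replace with a fixed path in real use,
# since tempdir() differs between separate R sessions
handoff <- file.path(tempdir(), "handoff.RData")

# "Instance 1": fit something and write it to disk
model <- lm(dist ~ speed, data = cars)
save(model, file = handoff)

# "Instance 2": remove and reload -- 'model' reappears in the workspace
rm(model)
load(handoff)
coef(model)  # intercept and slope, as fitted by the first "instance"
```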
<p><!--adsense--></p>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;</p>
<ul>
<li><strong><a href="http://www.rensenieuwenhuis.nl/r-forum/topic/r-sessions-29-running-r-project-twice-on-apple-mac-os-x">Discuss this article and pose additional questions in the R-Sessions Forum</a></strong></li>
</ul>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;<br />
<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/">R-Sessions</a> is a collection of manual chapters for R-Project, which are maintained on <a href="http://www.rensenieuwenhuis.nl">Curving Normality</a>. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, as it is paid for by the advertisements, but please refer to it in any work it inspires. Feedback and topic requests are highly appreciated.<br />
&#8212;&#8212;&#8211; &#8212;&#8211; &#8212; &#8212; &#8211; &#8211;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-29-running-r-project-twice-on-apple-mac-os-x/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
