<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rense Nieuwenhuis &#187; multilevel</title>
	<atom:link href="http://www.rensenieuwenhuis.nl/tag/multilevel/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rensenieuwenhuis.nl</link>
	<description>&#34;The extra-ordinary lies within the curve of normality&#34;</description>
	<lastBuildDate>Thu, 12 Mar 2026 14:58:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.2.2</generator>
	<item>
		<title>Influence.ME: Simple Analysis</title>
		<link>http://www.rensenieuwenhuis.nl/influence-me-simple-analysis/</link>
		<comments>http://www.rensenieuwenhuis.nl/influence-me-simple-analysis/#comments</comments>
		<pubDate>Thu, 16 Jul 2009 11:00:19 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Influence.ME]]></category>
		<category><![CDATA[example]]></category>
		<category><![CDATA[influential data]]></category>
		<category><![CDATA[lme4]]></category>
		<category><![CDATA[mixed effects]]></category>
		<category><![CDATA[multilevel]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=1015</guid>
		<description><![CDATA[With the introduction of our new package for influential data influence.ME, I&#8217;m currently writing a manual for the package. This manual will address topics for both the experienced, and the inexperienced users. I will also ...]]></description>
				<content:encoded><![CDATA[<p>With the introduction of our new package for influential data, influence.ME, I&#8217;m currently writing a manual for the package. This manual will address topics for both experienced and inexperienced users. </p>
<p>I will also present much of the content of this manual on my blog. Feel free to comment on it; readers are encouraged to discuss the content of the manual here. All information will be accessible from the <a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/">influence.ME website</a> as well. Note that updates to the manual will be made available on <a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/">that website</a>, instead of in updates to this blog post. So, please refer to the influence.ME website for the most up-to-date information.</p>
<p>This is the first section on influence.ME, which deals with a very simple analysis of students nested within 23 schools. Only the effect of a single variable measured at the school level is estimated.</p>
<p><span id="more-1015"></span></p>
<h2>A basic example analysis</h2>
<p>The school23 data contains information on the math test performance of 519 students, who are nested within 23 schools. For this example, we are interested in the relationship between class structure (in this data measured at the school level) and students&#8217; performance on a math test. The research question is: To what extent does the classroom structure determine the students&#8217; math test outcomes? </p>
<p>Initially, we will estimate the effect of class structure on the result of the math performance test, without any further covariates. We do take into account the nesting structure of the data, however, and allow the intercept to be random over schools. This model is estimated using the following syntax, and is assigned to an object we call &#8216;model&#8217;.</p>
<pre>
model <- lmer(math ~  structure + (1 | school.ID), data=school23)
summary(model)
</pre>
<p>The call for a summary of the model results in the output shown below. In this summary, the original model formula is shown, as well as the data on which the model was estimated. Both random and fixed effects are summarized. The amount of intercept variance associated with the nesting structure of students within schools is considerable (23.9, out of 81.3 + 23.9 = 105.2 in total). The effect of interest is that of the structure variable, which is -2.343 and statistically insignificant by most reasonable standards (t=-1.609).</p>
<pre>
Linear mixed model fit by REML 
Formula: math ~ structure + (1 | school.ID) 
   Data: school23 
  AIC  BIC logLik deviance REMLdev
 3802 3819  -1897     3798    3794
Random effects:
 Groups    Name        Variance Std.Dev.
 school.ID (Intercept) 23.884   4.8871  
 Residual              81.270   9.0150  
Number of obs: 519, groups: school.ID, 23
Fixed effects:
            Estimate Std. Error t value
(Intercept)   60.002      5.853  10.252
structure     -2.343      1.456  -1.609
Correlation of Fixed Effects:
          (Intr)
structure -0.982
</pre>
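<p>As an aside, the share of the total variance situated at the school level (the intraclass correlation) can be computed directly from the variance components shown above. A minimal sketch in base R, with the numbers copied from the output:</p>
<pre>
# Intraclass correlation from the reported variance components
var.school <- 23.884   # intercept variance (between schools)
var.resid  <- 81.270   # residual variance (between students)
icc <- var.school / (var.school + var.resid)
round(icc, 3)   # 0.227: roughly 23% of the variance lies between schools
</pre>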
<h3>Iteratively re-estimate model</h3>
<p>Building upon the example model estimated in section 2.1, the first step in the procedure of the influence.ME package is to iteratively exclude the influence of the observations nested within each school separately. This is done using the estex() function. The name estex refers to the ESTimates that are returned while EXcluding the influence of each of the grouping levels separately. Thus, in the case of the math test example, in which students are nested in 23 schools, the estex procedure re-estimates the original model 23 times, each time excluding the influence of a higher-level unit (i.e. a school). The function returns the relevant estimates of these 23 re-estimations, which in Figure [fig:Three-steps] are referred to as 'altered estimates'.</p>
<p>The estex() function requires the specification of two parameters: the mixed effects model to be evaluated, and the grouping factor for which the influence of the nested observations is to be evaluated. In the syntax example below, the original object 'model' is specified, and 'school.ID' is the relevant grouping factor. school.ID is the name of the variable used to indicate the grouping factor when the original model was specified. The estex() function works perfectly well when more than a single grouping is present in the model, but only one grouping factor can be addressed at a time. </p>
<p>In the example below, the estimates excluding the influence of the respective grouping levels, as returned by the estex() function, are assigned to an object, which in this case is called estex.model (the name of this object can be freely chosen by the user). </p>
<pre>
estex.model <- estex(model, "school.ID")
</pre>
<p>Note that in the case of complex mixed models (i.e. models with large numbers of observations, complex nesting structures, and/or many nesting groups) the execution of estex() may consume a considerable amount of time. The examples offered by the school23 data should pose no such problems, however.</p>
<h3>Calculate measures of influence</h3>
<p>The object estex.model containing the altered estimates can be used to calculate several measures of influential data. To determine the Cook's distance, the ME.cook() function is to be used. In its most basic specification, the ME.cook() function only requires an object to which the altered estimates as returned by the estex() function were assigned: </p>
<pre>
ME.cook(estex.model) 
</pre>
<p>This basic specification returns a matrix, with the rows representing the groups in which the observations are nested and a single column containing the associated values of Cook's distance. Clearly, these can also be assigned to an object for later modification. The output below shows the result of the syntax above, representing the Cook's distance associated with each school in the school23 data.</p>
<pre>
              [,1]
6053  2.927552e-02
6327  2.557810e-02
6467  1.402948e-02
7194  3.443392e-05
7472  1.115626e+00
7474  8.142758e-02
7801  3.007558e-04
7829  1.005329e-01
7930  5.525680e-03
24371 4.334659e-03
24725 4.387907e-02
25456 5.644399e-04
25642 1.470130e-02
26537 2.369898e-02
46417 2.204840e-02
47583 1.891108e-02
54344 1.445087e-01
62821 3.593314e-01
68448 2.427028e-02
68493 1.538479e-02
72080 3.471805e-04
72292 6.387956e-03
72991 1.316049e-02
</pre>
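<p>Since ME.cook() returns an ordinary matrix, base R can be used to pick out the most influential group programmatically. A small illustration, using a hand-copied subset of the values printed above:</p>
<pre>
# Subset of the Cook's distances shown above, for illustration
cooks <- matrix(c(2.927552e-02, 1.115626e+00, 8.142758e-02, 3.593314e-01),
                dimnames=list(c("6053", "7472", "7474", "62821"), NULL))
rownames(cooks)[which.max(cooks)]   # "7472"
max(cooks)                          # 1.115626
</pre>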
<p>Based on the output shown above, the Cook's distance of school number 7472 is the largest. This corresponds very well to what was concluded based on Figure [fig:Bivariate-influence-plots]. For those who prefer to evaluate the Cook's distance based on a visual representation, the ME.cook() function can also plot its output. To do so, an additional parameter is required, stating plot=TRUE. Additional parameters are allowed as well, which are passed on to the internal dotplot() function (Deepayan Sarkar, 2008) and are used to format the resulting plot. In this case, the example syntax below also specifies the xlab= and ylab= parameters, labelling the two axes. The resulting plot is shown in the figure below. These kinds of plots can be used to more easily assess the influence a grouped set of observations exerts on the outcomes of an analysis, relative to the influence exerted by other groups of observations. </p>
<p>In this case, it (again) is clear that the observation of the level of class structure of school number 7472 exerts the highest influence. This is based both on the calculated value of Cook's distance and on the fact that this influence clearly exceeds that of the other schools. </p>
<pre>
ME.cook(estex.model, plot=TRUE,
    xlab="Cook's Distance, Class structure",
    ylab="School")
</pre>
<p><a href="http://i0.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2009/07/Cook1.jpg"><img src="http://i2.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2009/07/Cook1.jpg?w=450" alt="Figure Cook&#039;s Distance" title="Figure Cook&#039;s Distance" class="aligncenter size-full wp-image-1042" data-recalc-dims="1" /></a></p>
<h3>Exclude influence, and Repeat</h3>
<p>Based on the analyses and graphs shown in the previous sections, there are strong indications that the observations in school number 7472 exert too much influence on the outcomes of the analysis, and thereby unjustifiably determine the outcomes of these analyses. To definitively decide whether or not the influence of these observations indeed is too large, the value of Cook&#8217;s distance of this school can be compared with a given cut-off value. Regarding Cook&#8217;s distance, it has been argued that observations exceeding a Cook&#8217;s distance of 4/n are too influential (Belsley et al., 1980), and need to be dealt with. In this formula, &#8216;n&#8217; refers to the number of observations on which Cook&#8217;s distance was calculated; in the case of mixed effects models, this refers to the number of groups in which the observations are nested. </p>
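<p>This cut-off can be computed in one line of base R. The arithmetic below assumes the 4/n rule, with n the number of groups:</p>
<pre>
n.groups <- 23        # number of schools in the school23 data
round(4 / n.groups, 2)   # 0.17: cut-off for the full data
round(4 / 22, 2)         # 0.18: cut-off after excluding one school
</pre>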
<p>The Cook&#8217;s distance of school number 7472 was determined to be 1.12, which readily exceeds the cut-off value of 4/23 &#8776; .17. Thus, it can be concluded that the influence of school number 7472 needs to be excluded from the analysis, before the results of that analysis are interpreted. This is done using the function exclude.influence(). This function basically has three parameters: the model from which the influence of some observations is to be excluded needs to be specified, together with the grouping factor and the specific level of that grouping factor in which the said observations are nested. The function modifies the original model and returns a new model, which can be checked again for possible influential data.</p>
<p>In the example below, the influence of school number 7472 is excluded from the original regression model, which was assigned to object &#8216;model&#8217; in section 2.1. </p>
<p>The result of the exclude.influence() function again has the form of a mixed effects model and is here assigned to object model.2 (again, this name is to be chosen by the user). </p>
<pre>
model.2 <- exclude.influence(model, "school.ID", "7472")
summary(model.2)
</pre>
<p>Functions that work with &#8216;normal&#8217; mixed effects models estimated with lme4 also work with models that were modified with the exclude.influence() function. A summary of model.2 was therefore requested as well, which is shown below. A few things are clear from this output. The estimate of the effect of class structure is now much stronger (-4.55) and statistically significant (t=-2.95). This corresponds to what may have been expected based on the graphical representation of the data in Figure [fig:Bivariate-influence-plots]. Some other changes have been made to the model as well. The original intercept vector (which originally was indicated by (Intercept)) is now replaced by a variable called intercept.alt. This variable is basically an ordinary intercept vector (thus, with a value of 1 for each observation), except for the observations that are nested in the excluded nesting group. For these observations, the intercept.alt variable has score 0. Also, a new variable called estex.7472 is shown. This variable is a dummy variable, indicating the observations that are nested in school number 7472. One such dummy variable is added to the model for each nesting group whose influence is excluded. Generally, these modifications of the model ensure that the observations nested within the excluded nesting group contribute to neither the level nor the variance of the intercept, and do not alter the higher-level estimates unjustifiably. </p>
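<p>The construction of these two variables can be sketched on a toy grouping vector. This is an illustration of the idea only, not the actual code of exclude.influence():</p>
<pre>
# Toy grouping vector with one school (7472) to be excluded
school.ID <- c("6053", "7472", "7472", "7829")
intercept.alt <- as.numeric(school.ID != "7472")   # ordinary intercept, but 0 for 7472
estex.7472    <- as.numeric(school.ID == "7472")   # dummy for school 7472
intercept.alt   # 1 0 0 1
estex.7472      # 0 1 1 0
</pre>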
<pre>
Linear mixed model fit by REML 
Formula: math ~ intercept.alt + estex.7472 + structure + 
(0 + intercept.alt | school.ID) - 1 
   Data: ..2 
  AIC  BIC logLik deviance REMLdev
 3792 3814  -1891     3790    3782
Random effects:
 Groups    Name          Variance Std.Dev.
 school.ID intercept.alt 17.874   4.2277  
 Residual                81.301   9.0167  
Number of obs: 519, groups: school.ID, 23

Fixed effects:
              Estimate Std. Error t value
intercept.alt   69.346      6.314  10.983
estex.7472      54.839      3.617  15.163
structure       -4.550      1.545  -2.945

Correlation of Fixed Effects:
           intrc. e.7472
estex.7472  0.843       
structure  -0.987 -0.854
</pre>
<p>As is shown in the procedural schematic in Figure [fig:Three-steps], it is advisable to repeat this procedure until the user is satisfied with the stability of the model, for instance when no group of observations exceeds the cut-off value. To do so in this example, the model.2 object is again passed to the estex() function, the results of which are stored in a second altered-estimates object, which we call estex.model.2:</p>
<pre>
estex.model.2 <- estex(model.2, "school.ID")
ME.cook(estex.model.2, plot=TRUE, 
    xlab="Cook's Distance, Class structure",
    ylab="School", 
    cutoff=.18)
</pre>
<p>Again, ME.cook() is used to calculate the values for Cook's distance, which returns the output shown below. School number 62821 is associated with the largest value for Cook's distance (.37). The cut-off value now differs (slightly) from the previous one, because the number of (effective) groups in which the observations are nested has decreased by 1 now that the influence of school number 7472 has been excluded. Thus, the cut-off value now is 4/22 &#8776; .18. Based on the output below, it can thus be concluded that school number 62821 is influential as well. </p>
<p>Finally, the call for ME.cook() in the syntax example above shows one more distinguishing characteristic. Again plot=TRUE is specified, together with specifications for labels on both the x and y axes. A plot of the Cook's distances is thus created, shown in Figure [fig:Cook-2]. In addition to this, the cut-off value of .18 is now indicated as well using cutoff=.18. As a result of this, all Cook's distances with a value larger than .18 will be indicated differently in the plot, as is the case in Figure [fig:Cook-2] regarding the two schools numbered 62821 and 7474. Note that the Cook's distance for school number 7472 now equals 0, indeed, indicating that this school now no longer influences the parameter estimates. </p>
<pre>
              [,1]
6053  2.186203e-03
6327  2.645659e-02
6467  1.326879e-02
7194  1.319258e-02
7472  0.000000e+00
7474  2.273674e-01
7801  1.378937e-03
7829  7.780663e-02
7930  4.728342e-03
24371 8.621802e-03
24725 7.072999e-02
25456 1.985731e-03
25642 2.487072e-02
26537 1.900817e-03
46417 2.409483e-02
47583 7.919332e-02
54344 1.248145e-01
62821 3.706191e-01
68448 1.752182e-01
68493 2.607158e-02
72080 2.669324e-05
72292 1.193296e-02
72991 1.311974e-02
</pre>
<p>Further analysis of this example would thus entail excluding the influence of the observations nested within school number 62821, and then rechecking the model by running through the three steps of the procedure again. This is not shown here, to keep the exercise from becoming overly lengthy. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/influence-me-simple-analysis/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Presenting influence.ME at useR!</title>
		<link>http://www.rensenieuwenhuis.nl/presenting-influence-me-at-user/</link>
		<comments>http://www.rensenieuwenhuis.nl/presenting-influence-me-at-user/#comments</comments>
		<pubDate>Fri, 10 Jul 2009 09:49:33 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Influence.ME]]></category>
		<category><![CDATA[lme4]]></category>
		<category><![CDATA[mixed effects]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[social sciences]]></category>
		<category><![CDATA[useR!]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=1028</guid>
		<description><![CDATA[Today I presented influence.ME at the useR! conference in Rennes. Influence.ME is an R package for detecting influential data in mixed models. I developed this package together with Ben Pelzer and Manfred te Grotenhuis. More ...]]></description>
				<content:encoded><![CDATA[<p><img src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2009/04/logo-influence.jpg?w=450" alt="Logo influence.ME" title="Logo influence.ME" data-recalc-dims="1" /></p>
<p>Today I presented influence.ME at the useR! conference in Rennes. Influence.ME is an R package for detecting influential data in mixed models. I developed this package together with Ben Pelzer and Manfred te Grotenhuis.</p>
<p>More information about influence.ME can be found on <a href="http://www.rensenieuwenhuis.nl/r-project/influenceme/">another section of my website</a>.</p>
<p>Below, please find the slides of the presentation.<br />
<a href='http://rensenieuwenhuis.nl/documents/Slides-2009-Rennes-influenceME.pdf'>Presentation Influence.ME at Rennes, useR! 2009</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/presenting-influence-me-at-user/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>R-Sessions 32: Forward.lmer: Basic stepwise function for mixed effects in R</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-32/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-32/#comments</comments>
		<pubDate>Fri, 13 Feb 2009 10:59:03 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[forward]]></category>
		<category><![CDATA[hierarchical]]></category>
		<category><![CDATA[lme4]]></category>
		<category><![CDATA[mixed effects models]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[stepwise]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=897</guid>
		<description><![CDATA[Intended to be a customized solution, it may have grown to be a little more. forward.lmer is an early installment of a full stepwise function for mixed effects regression models in R-Project. I may put ...]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img title="R-Sessions" src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" alt="" data-recalc-dims="1" /></a> </p>
<p>Intended to be a customized solution, it may have grown to be a little more. forward.lmer is an early installment of a full stepwise function for mixed effects regression models in R-Project. I may put in some work to extend it, or I may not. Nevertheless, in a &#8216;forward sense of stepwise&#8217;, I think it can be pretty useful as it is. It also takes an interesting angle on the stepwise concept.<br />
<!--adsense--><br />
<span id="more-897"></span></p>
<p>Most stepwise functions (as far as I know) take a base model and a bunch of variables, and then iteratively add and/or subtract some variables, according to various criteria, to come to the best fitting regression model. All very interesting, but how to deal with interaction variables? And moreover: most existing functions do not work with mixed effects models ((Here I use the term &#8216;mixed effects model&#8217; to refer also to what are often called hierarchical or multilevel regression models.)). </p>
<p>Built around the lme4 package in R, forward.lmer provides a forward stepwise procedure for mixed effects models. Also, it allows the user not only to enter single variables into models, but also to do the same with blocks of variables. This opens up many options: users can add a complete interaction at once (i.e. both the original and the multiplicative terms), or add these consecutively. Future development will focus on additional selection criteria for interactions, such as the criterion that at least the multiplicative term needs to be statistically significant. </p>
<p>The user provides a starting model and a set of variables to evaluate. The procedure then updates the starting model with the addition of every single variable (or block of variables). The models are ordered based on their log-likelihood (other criteria, e.g. BIC and AIC, following soon), after which the best fitting model is evaluated against one of two criteria. The first criterion is that at least one of the added parameters is statistically significant. The other criterion is that the addition of the parameters together is statistically significant. </p>
<p>There are several parameters to be specified:</p>
<ul>
<li>start.model: The starting model the procedure starts with. This can be a null-model, or a model already containing several variables. All lmer-models (i.e. logistic, poisson, linear) are supported.</li>
<li>blocks: a vector of variable names (as character strings) to be added to a model. Several variables can be concatenated within the same character string, so that these are added as a block of variables instead of as single variables.</li>
<li>max.iter: The maximum number of variables that are evaluated. If max.iter is reached, the procedure stops without adding more variables. </li>
<li>sig.level: This is the p-value against which it is tested whether the new model fits better than a base model. Either sig.level or zt needs to be specified, but not both at once.</li>
<li>zt: This is either the T or Z value that is used to test whether (at least) one of the added variables is statistically significant. T values are used for linear regression, Z values for binary response models.</li>
<li>print.log: Should a log be printed? The log contains information on which variables (and on which criteria) were added in each step.</li>
</ul>
<p>The forward.lmer function returns the best fitting model (according to the given criteria). Of course, one can use this resulting model as a starting model for a new stepwise procedure.</p>
<pre>
forward.lmer <- function(
	start.model, blocks,
	max.iter=1, sig.level=FALSE,
	zt=FALSE, print.log=TRUE)
	{

	# forward.lmer: a function for stepwise regression using lmer mixed effects models
	# Author: Rense Nieuwenhuis

	# Initialising internal variables
	log.step <- 0
	log.LL <- log.p <- log.block <- zt.temp <- log.zt <- NA
	model.basis <- start.model

	# Maximum number of iterations cannot exceed number of blocks
	if (max.iter > length(blocks)) max.iter <- length(blocks)

	# Setting up the outer loop
	for(i in 1:max.iter)
		{

		models <- list()

		# Iteratively updating the model with the addition of one block of variable(s)
		for(j in 1:length(blocks))
			{
			models[[j]] <- update(model.basis, as.formula(paste(". ~ . + ", blocks[j])))
			}

		# Extracting the log-likelihood of each estimated model
		LL <- unlist(lapply(models, logLik))

		# Evaluating the models in order of their log-likelihood.
		# Additional selection criteria apply
		for (j in order(LL, decreasing=TRUE))
			{

			##############
			############## Selection based on ANOVA-test
			##############

			if(sig.level != FALSE)
				{
				if(anova(model.basis, models[[j]])[2,7] < sig.level)
					{
					# Writing the logs
					# (p-value is computed before model.basis is replaced)
					log.step <- log.step + 1
					log.block[log.step] <- blocks[j]
					log.p[log.step] <- anova(model.basis, models[[j]])[2,7]

					model.basis <- models[[j]]
					log.LL[log.step] <- as.numeric(logLik(model.basis))

					blocks <- blocks[-j]

					break
					}
				}

			##############
			############## Selection based on significance of added variable-block
			##############

			if(zt != FALSE)
				{
				b.model <- summary(models[[j]])@coefs
				diff.par <- setdiff(rownames(b.model), rownames(summary(model.basis)@coefs))
				if (length(diff.par)==0) break
				sig.par <- FALSE

				for (k in 1:length(diff.par))
					{
					if(abs(b.model[which(rownames(b.model)==diff.par[k]),3]) > zt)
						{
						sig.par <- TRUE
						zt.temp <- b.model[which(rownames(b.model)==diff.par[k]),3]
						break
						}
					}

				if(sig.par==TRUE)
					{
					model.basis <- models[[j]]

					# Writing the logs
					log.step <- log.step + 1
					log.block[log.step] <- blocks[j]
					log.LL[log.step] <- as.numeric(logLik(model.basis))
					log.zt[log.step] <- zt.temp
					blocks <- blocks[-j]

					break
					}
				}
			}
		}

	## Create and print log
	log.df <- data.frame(log.step=1:log.step, log.block, log.LL, log.p, log.zt)
	if(print.log == TRUE) print(log.df, digits=4)

	## Return the 'best' fitting model
	return(model.basis)
	}
</pre>
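<p>A hypothetical call might look as follows. The data frame, outcome, grouping factor, and predictor names below are purely illustrative, and the code assumes the version of lme4 that was current when this function was written:</p>
<pre>
library(lme4)

# Null model as starting point (all names are made up for illustration)
base.model <- lmer(y ~ 1 + (1 | grp), data=dat)

# Evaluate three blocks: two single variables and one interaction block;
# sig.level selects on the ANOVA test (specify either sig.level or zt, not both)
best.model <- forward.lmer(base.model,
    blocks=c("x1", "x2", "x1 + x1:x2"),
    max.iter=3, sig.level=0.05)

summary(best.model)
</pre>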
<p>As always, you're invited to use this function, or to adapt it and use that. However, it is required to make mention of this function and its author. Additionally, since I intend to continue working on this function (perhaps even evolve it to a 'package' on CRAN), I would love to hear about any experiences in using it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-32/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>R-Sessions 25: Book &#8211; Mixed Effects Models in S and S-PLUS (Pinheiro &amp; Bates, 2000)</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-25-book-mixed-effects-models-in-s-and-s-plus-pinheiro-bates-2000/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-25-book-mixed-effects-models-in-s-and-s-plus-pinheiro-bates-2000/#comments</comments>
		<pubDate>Wed, 01 Oct 2008 10:00:51 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Book]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[hierarchical]]></category>
		<category><![CDATA[mixed effects models]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[regression]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=628</guid>
		<description><![CDATA[<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg" alt="R-Sessions" title="R-Sessions" width="470" /></a>
<img src="http://www.rensenieuwenhuis.nl/wp-content/uploads/2007/07/cover-pinheiro.png" alt="Cover: Mixed-Effects Models in S and S-PLUS" />
Despite the reference to S and S-PLUS in the title of this book, it offers an excellent guide for the nlme-package in R-Project. Reason for this is the close resemblance between R and S. The nlme-package, available in R-Project for estimation of both linear and non-linear multilevel models, is written and maintained by the authors of this book.

]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://i2.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" alt="R-Sessions" title="R-Sessions" data-recalc-dims="1" /></a><br />
<!--adsense--></p>
<p>Despite the reference to S and S-PLUS in the title of this book, it offers an excellent guide for the nlme-package in R-Project. Reason for this is the close resemblance between R and S. The nlme-package, available in R-Project for estimation of both linear and non-linear multilevel models, is written and maintained by the authors of this book.<br />
<span id="more-628"></span><br />
The book is not an introduction to R. Basic knowledge of R-Project (or S / S-PLUS) is required to get the most out of it, as well as some knowledge of multilevel theory. Although the book forms a thorough introduction to multilevel modeling, addressing the theory, the mathematics and of course the estimation and specification in R-Project (or S / S-PLUS), the learning curve it offers is quite steep. The authors do not shy away from applying matrix algebra and specify exactly the estimation procedures used.</p>
<p>Not only is the specification of basic models described; many other subjects are brought up as well. A specific grouped-data object is considered, as well as ways to visualize hierarchical data and multilevel models. Heteroscedasticity, often a violation of model assumptions, can easily be accommodated in the models, as is described clearly in one of the chapters. Finally, not only linear models are tackled, but non-linear models as well.</p>
<p>All in all, this book is an excellent addition for those who have prior knowledge of both R-Project and multilevel analysis. Using real-data examples and by providing tons of output, the authors succeed in making clear the necessity of the more complex models, and thereby invite the reader to invest time in the more fundamental aspects of multilevel analysis.</p>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;</p>
<ul>
<li><b><a href="http://www.rensenieuwenhuis.nl/R-forum/">Discuss this article and pose additional questions in the R-Sessions Forum</a></b></li>
<li><a href="http://www.rensenieuwenhuis.nl/r-project/books/pinheiro-bates-2000/">Find the original article embedded in the manual.</a></li>
</ul>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;<br />
<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/">R-Sessions</a> is a collection of manual chapters for R-Project, which are maintained on <a href="http://www.rensenieuwenhuis.nl">Curving Normality</a>. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid for by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.<br />
&#8212;&#8212;&#8211; &#8212;&#8211; &#8212; &#8212; &#8211; &#8211;<br />
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-25-book-mixed-effects-models-in-s-and-s-plus-pinheiro-bates-2000/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>R-Sessions 23: Book: Data Analysis Using Regression and Multilevel/Hierarchical Models &#8212; Gelman &amp; Hill (2007)</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-23-book-data-analysis-using-regression-and-multilevelhierarchical-models-gelman-hill-2007/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-23-book-data-analysis-using-regression-and-multilevelhierarchical-models-gelman-hill-2007/#comments</comments>
		<pubDate>Tue, 23 Sep 2008 10:00:05 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Book]]></category>
		<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[Andrew Gelman]]></category>
		<category><![CDATA[hierarchical]]></category>
		<category><![CDATA[Jennifer Hill]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[regression]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=608</guid>
		<description><![CDATA[]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" title="R-Sessions" data-recalc-dims="1" /></a></p>
<h2>Data Analysis Using Regression and Multilevel/Hierarchical Models</h2>
<p><a href="http://www.stat.columbia.edu/%7Egelman/arm/"><img src="http://i1.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2007/07/cover-gelmann.png?w=200" alt="Cover Gelman" data-recalc-dims="1" /></a>  Andrew Gelman is known for his expertise in Bayesian statistics. Building on that knowledge, he wrote a book on multilevel regression using R and WinBUGS. This book aims to be a thorough description of (multilevel) regression techniques, a guide to implementing these techniques in R and BUGS, and a guide to interpreting the results of your analyses. Put briefly, the book excels at all three.<br />
<span id="more-608"></span><br />
<br />
 Admittedly, this review has been written based on first impressions of the book. But a sunny day in the park spent (literally) reading it leads me to believe that I have some understanding of what this book is trying to achieve. I bought the book to get an overview of fitting multilevel regression models using R. Once I started reading, I soon found that this is indeed what it offers, and a lot more besides.  After some introductory chapters, the book starts off with an introduction to linear regression and, at the same time, to the R software, by showing how to fit linear regression models in R. This is readily expanded to logistic regression and generalized regression models. Everything is richly illustrated with many examples. </p>
<p>Before these &#8216;basic&#8217; regression models are extended to multilevel models, Bayesian statistics are introduced, and simulation techniques are used to draw causal inferences from regression models.  The multilevel section of the book is set up similarly. First, &#8216;basic&#8217; multilevel regression models are introduced. Throughout the book, the lmer function is used. This function can fit not only simple multilevel models, but logistic and generalized models as well. It can even estimate non-nested models. All in all, this forms a thorough introduction to multilevel regression analysis in itself, but here too the book goes on to introduce the reader to Bayesian statistics. </p>
<p>All of the above-mentioned models, as well as more complicated ones, are also fitted using WinBUGS. This very flexible approach allows the reader to estimate a greater variety of (multilevel) models. Causal inference on multilevel models, using Bayesian statistics, is described as well.  The third main part of the book moves beyond &#8216;just&#8217; fitting models: it teaches the reader to really think about what is going on. Topics such as &#8216;understanding and summarizing the fitted models&#8217;, &#8216;sample size and power calculations&#8217;, and above all &#8216;model checking and comparison&#8217; each receive their own chapter. Here we can see that the authors aimed higher than just writing instructions on how to make R fit (multilevel) regression models. The aim of this book is to teach the reader how to analyze data properly. Much attention is paid to assumptions, testing theory, and interpreting what you&#8217;re doing. To quote the authors: &#8220;If you show something, be prepared to explain it&#8221;. </p>
<p>This philosophy, along with flexibility, seems to have guided the authors throughout the book. The book starts off with some examples from the authors&#8217; own research. These examples return throughout the book, so the reader becomes familiar with the data. As a result, the concepts, models, and analyses described are much easier to understand; as a reader, you start to think along with the authors whenever a new problem is described. The relative worth of the techniques, as well as their drawbacks, is made perfectly clear. The use of R and WinBUGS pays off well: it takes some extra effort to master these programs, but in the process the reader learns to think carefully about what he really wants to do and how to do it properly.  </p>
<p>It is not an easy book, but thanks to the many examples it can be fully understood by readers with some prior knowledge of regression techniques. All of the examples can be tried out, since the data and syntax are available on the authors&#8217; website for the book. This helps the reader get a feel for the more difficult subjects. All in all, this seems to me a great book for every applied researcher with a basic prior understanding of regression analysis. Thanks to its focus on one set of techniques, a great depth of understanding can be gained from it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-23-book-data-analysis-using-regression-and-multilevelhierarchical-models-gelman-hill-2007/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>R-Sessions 21: Multilevel Model Specification (NLME)</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-21-multilevel-model-specification-nlme/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-21-multilevel-model-specification-nlme/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 10:00:16 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[lme]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[nlme]]></category>
		<category><![CDATA[statistical analysis]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=585</guid>
		<description><![CDATA[<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg" title="R-Sessions" width="470" /></a>
<!--adsense-->
Multilevel models, or mixed-effects models, can easily be estimated in R. Several packages are available. Here, the lme() function from the nlme-package is described. The specification of several types of models will be shown, using a fictitious example. A detailed description of the specification rules is given. Output of the specified models is shown, but not described or interpreted.
Please note that this description closely parallels the <a href="http://www.rensenieuwenhuis.nl/r-project/manual/multilevel-analysis/model-specification/">description of the specification of the lmer() function of the lme4-package</a>. The results are similar, and exactly the same possibilities are offered here.

In this example, the dependent variable is the standardized result of a student on a specific exam. This variable is called "normexam". In estimating the score on the exam, two levels are discerned: student and school. On each level, one explanatory variable is present. On the individual level, we take into account the standardized score of the student on an LR-test ("standLRT"). On the school level, we take into account the average intake score ("schavg").

]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://i2.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" title="R-Sessions" data-recalc-dims="1" /></a><br />
<!--adsense--><br />
Multilevel models, or mixed-effects models, can easily be estimated in R. Several packages are available. Here, the lme() function from the nlme-package is described. The specification of several types of models will be shown, using a fictitious example. A detailed description of the specification rules is given. Output of the specified models is shown, but not described or interpreted.<br />
Please note that this description closely parallels the <a href="http://www.rensenieuwenhuis.nl/r-project/manual/multilevel-analysis/model-specification/">description of the specification of the lmer() function of the lme4-package</a>. The results are similar, and exactly the same possibilities are offered here. </p>
<p>In this example, the dependent variable is the standardized result of a student on a specific exam. This variable is called &#8220;normexam&#8221;. In estimating the score on the exam, two levels are discerned: student and school. On each level, one explanatory variable is present. On the individual level, we take into account the standardized score of the student on an LR-test (&#8220;standLRT&#8221;). On the school level, we take into account the average intake score (&#8220;schavg&#8221;).<br />
<span id="more-585"></span></p>
<h2>Preparation</h2>
<p>Before analyses can be performed, preparation needs to take place. Using the library() command, two packages are loaded. The nlme-package contains functions for estimation of multilevel or hierarchical regression models. The mlmRev-package contains, amongst many other things, the data we are going to use here. In the output below, we see that R-Project automatically loads the Matrix- and the lattice-packages as well. These are needed for the mlmRev-package to work properly.<br />
Finally, the names() command is used to examine which variables are contained in the &#8216;Exam&#8217; data.frame.</p>
<blockquote><p>
require(nlme)<br />
require(mlmRev)<br />
names(Exam)
</p></blockquote>
<p>Note that in the output below, R-Project notifies us that the objects &#8216;Oxboys&#8217; and &#8216;bdf&#8217; are masked when the mlmRev-package is loaded. This is simply because objects with the same names were already loaded as part of the nlme-package. If we know that these objects contain exactly the same data or functions, there is nothing to worry about. If we don&#8217;t know that, or if we know the objects do differ, we should be careful about the order in which the packages are loaded. Since we don&#8217;t need those objects here, there is no need to investigate further.</p>
<pre>
> require(nlme)
Loading required package: nlme
[1] TRUE
> require(mlmRev)
Loading required package: mlmRev
Loading required package: lme4
Loading required package: Matrix
Loading required package: lattice

Attaching package: 'mlmRev'


	The following object(s) are masked from package:nlme :

	 Oxboys,
	 bdf 

[1] TRUE
> names(Exam)
 [1] "school"   "normexam" "schgend"  "schavg"   "vr"       "intake"  
 [7] "standLRT" "sex"      "type"     "student" 
</pre>
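<p>When an object is masked, both copies remain accessible: the double-colon operator selects the package explicitly. A minimal sketch, assuming both packages are installed:</p>

```r
# Both data sets remain available after masking;
# 'package::object' picks the copy from a specific package.
library(nlme)
library(mlmRev)

dim(nlme::Oxboys)    # the copy shipped with nlme
dim(mlmRev::Oxboys)  # the copy shipped with mlmRev
```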
<h2>null-model</h2>
<p>The syntax below specifies the simplest multilevel regression model of all: the null-model. Only the levels are defined. Using the lme-function, the first level (here: students) does not have to be specified. It is assumed that the dependent variable (here: normexam) is measured at the first level (which it should be).</p>
<p>The model is specified using two standard R formulas: one for the fixed part and one for the random part, the latter indicating the different levels as well. The fixed and the random formulas are preceded by &#8216;fixed =&#8217; and &#8216;random =&#8217;, respectively. In the formula for the fixed part, first the dependent variable is given, followed by a tilde ( ~ ). The ~ should be read as: &#8220;follows&#8221;, or: &#8220;is defined by&#8221;. Next, the predictors are defined. In this case, only the intercept is specified, by entering a &#8216;1&#8217;.<br />
The formula for the random part starts with a tilde (~); the dependent variable is not given here. Then the random terms are given, followed by a vertical bar ( | ), after which the grouping level is specified.</p>
<p>After the model specification, several arguments can be passed to the function. Here, we specify the data to be used with data=Exam. Another often-used argument indicates the estimation method. If left unspecified, restricted maximum likelihood (REML) is used. Another option is method=&#8221;ML&#8221;, which calls for full maximum likelihood estimation. All this leads to the following model specification:</p>
<blockquote><p>
lme(fixed = normexam ~ 1,<br />
	data = Exam,<br />
	random = ~ 1 | school)
</p></blockquote>
<p>This leads to the following output:</p>
<pre>
> lme(fixed = normexam ~ 1, 
+ 	data = Exam,
+ 	random = ~ 1 | school)
Linear mixed-effects model fit by REML
  Data: Exam 
  Log-restricted-likelihood: -5507.327
  Fixed: normexam ~ 1 
(Intercept) 
-0.01325213 

Random effects:
 Formula: ~1 | school
        (Intercept)  Residual
StdDev:   0.4142457 0.9207376

Number of Observations: 4059
Number of Groups: 65 
</pre>
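<p>From this null-model output, the intraclass correlation coefficient can be computed by hand: the between-school variance as a share of the total variance. Note that lme() reports standard deviations, so they must be squared first. A small sketch in base R, using the numbers from the output above:</p>

```r
# Standard deviations taken from the null-model output above
sd.school   <- 0.4142457  # between-school (intercept) standard deviation
sd.residual <- 0.9207376  # within-school (residual) standard deviation

# Intraclass correlation: between-group variance / total variance
icc <- sd.school^2 / (sd.school^2 + sd.residual^2)
round(icc, 3)  # about 0.168: some 17% of the variance lies between schools
```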
<h2>random intercept, fixed predictor in individual level</h2>
<p>For the next model, we add a predictor to the individual level. We do this by replacing the &#8216;1&#8217; in the formula for the fixed part of the previous model with the predictor (here: standLRT). An intercept is always assumed, so it is still estimated here; it only needs to be specified explicitly when no other predictors are present. Since we don&#8217;t want the effect of the predictor to vary between groups, the specification of the random part of the model remains identical to the previous model. The same data is used, so we specify data=Exam again. </p>
<blockquote><p>
lme(fixed = normexam ~ standLRT,<br />
	data = Exam,<br />
	random = ~ 1 | school)
</p></blockquote>
<pre>
> lme(fixed = normexam ~ standLRT, 
+ 	data = Exam,
+ 	random = ~ 1 | school)
Linear mixed-effects model fit by REML
  Data: Exam 
  Log-restricted-likelihood: -4684.383
  Fixed: normexam ~ standLRT 
(Intercept)    standLRT 
0.002322823 0.563306914 

Random effects:
 Formula: ~1 | school
        (Intercept)  Residual
StdDev:   0.3063315 0.7522402

Number of Observations: 4059
Number of Groups: 65
</pre>
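<p>The gain from adding the predictor can be tested formally with a likelihood-ratio test using anova(). Since REML log-likelihoods are not comparable between models with different fixed parts, both models should be refitted with method=&#8220;ML&#8221;. A sketch, assuming the nlme- and mlmRev-packages are loaded as above (the object names m0 and m1 are ours):</p>

```r
# Refit the null-model and the predictor model with full maximum
# likelihood; REML fits with different fixed parts are not comparable.
m0 <- lme(fixed = normexam ~ 1, data = Exam,
          random = ~ 1 | school, method = "ML")
m1 <- lme(fixed = normexam ~ standLRT, data = Exam,
          random = ~ 1 | school, method = "ML")

anova(m0, m1)  # likelihood-ratio test of the added fixed effect
```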
<h2>random intercept, random slope</h2>
<p>The next model to be specified is a model with a random intercept and a predictor whose effect is allowed to vary between groups. In other words, the effect of the standardized intake test on the exam score varies between schools. In order to estimate this model, the &#8216;1&#8217; that indicates the intercept in the random part of the model specification is replaced by the variable whose effect we want to vary between the groups. </p>
<blockquote><p>
lme(fixed = normexam ~ standLRT,<br />
	data = Exam,<br />
	random = ~ standLRT | school)
</p></blockquote>
<pre>
> lme(fixed = normexam ~ standLRT, 
+ 	data = Exam,
+ 	random = ~ standLRT | school)
Linear mixed-effects model fit by REML
  Data: Exam 
  Log-restricted-likelihood: -4663.8
  Fixed: normexam ~ standLRT 
(Intercept)    standLRT 
-0.01164834  0.55653379 

Random effects:
 Formula: ~standLRT | school
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.3034980 (Intr)
standLRT    0.1223499 0.494 
Residual    0.7440699       

Number of Observations: 4059
Number of Groups: 65 
</pre>
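<p>The school-specific regression lines implied by this model can be inspected directly: coef() returns, for each school, the fixed effects plus that school&#8217;s random effects. A sketch, assuming the packages and data from above (the object name model.rs is ours):</p>

```r
# Store the random-slope model and inspect per-school coefficients
model.rs <- lme(fixed = normexam ~ standLRT, data = Exam,
                random = ~ standLRT | school)

head(coef(model.rs))  # one row per school: intercept and standLRT slope
```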
<h2>random intercept, individual and group level predictor</h2>
<p>It is possible to enter variables on the group level as well. Here, we will add a predictor on the school level: the average score on the intake test (&#8220;schavg&#8221;). The lme()-function needs this variable to be of the same length as the variables on the individual level. In other words, for every unit on the lowest level, the variable holding the group-level value should have a value. For this example, this implies that all respondents attending the same school have the same value on the variable &#8220;schavg&#8221;. We enter this variable into the model in the same way as individual-level variables, leading to the following syntax:</p>
<blockquote><p>
lme(fixed = normexam ~ standLRT + schavg,<br />
	data = Exam,<br />
	random = ~ standLRT | school)
</p></blockquote>
<pre>
> lme(fixed = normexam ~ standLRT + schavg, 
+ 	data = Exam,
+ 	random = ~ standLRT | school)
Linear mixed-effects model fit by REML
  Data: Exam 
  Log-restricted-likelihood: -4661.943
  Fixed: normexam ~ standLRT + schavg 
 (Intercept)     standLRT       schavg 
-0.001422435  0.552241377  0.294758810 

Random effects:
 Formula: ~standLRT | school
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.2778443 (Intr)
standLRT    0.1237837 0.373 
Residual    0.7440440       

Number of Observations: 4059
Number of Groups: 65 
</pre>
<h2>random intercept, cross-level interaction</h2>
<p>Finally, a cross-level interaction is specified. This works the same as any other interaction in R. In contrast with many other statistical packages, it is not necessary to compute separate interaction variables (but you&#8217;re free to do so, of course).<br />
In this example, the cross-level interaction between the individual intake-test score and the school&#8217;s average intake score can be specified by entering a model formula containing standLRT * schavg. This leads to the following syntax and output.</p>
<blockquote><p>
lme(fixed = normexam ~ standLRT * schavg,<br />
	data = Exam,<br />
	random = ~ standLRT | school)
</p></blockquote>
<pre>
> lme(fixed = normexam ~ standLRT * schavg, 
+ 	data = Exam,
+ 	random = ~ standLRT | school)
Linear mixed-effects model fit by REML
  Data: Exam 
  Log-restricted-likelihood: -4660.194
  Fixed: normexam ~ standLRT * schavg 
    (Intercept)        standLRT          schavg standLRT:schavg 
   -0.007091769     0.557943270     0.373396511     0.161829150 

Random effects:
 Formula: ~standLRT | school
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.2763292 (Intr)
standLRT    0.1105667 0.357 
Residual    0.7441656       

Number of Observations: 4059
Number of Groups: 65 
</pre>
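<p>With a cross-level interaction in the model, the slope of standLRT depends on schavg: it equals the standLRT coefficient plus the interaction coefficient times the school&#8217;s average intake score. A quick base-R check with the estimates reported above (the function name slope.at is ours, and the schavg values are illustrative):</p>

```r
# Fixed effects taken from the output above
b.standLRT    <- 0.557943270
b.interaction <- 0.161829150

# Implied standLRT slope at a given school average intake score
slope.at <- function(schavg) b.standLRT + b.interaction * schavg

slope.at(0)    # at an average school, the slope is simply b.standLRT
slope.at(0.5)  # at a school half a unit above average: about 0.639
```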
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;</p>
<ul>
<li><b><a href="http://www.rensenieuwenhuis.nl/R-forum/">Discuss this article and pose additional questions in the R-Sessions Forum</a></b></li>
<li><b><a href="http://www.rensenieuwenhuis.nl/r-project/manual/multilevel-analysis/model-specification-nlme/">Find the original article embedded in the manual.</a></b></li>
</ul>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;<br />
<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/">R-Sessions</a> is a collection of manual chapters for R-Project, maintained on <a href="www.rensenieuwenhuis.nl">Curving Normality</a>. All posts are linked to the chapters of the R-Project manual on this site. The manual is free to use, as it is funded by advertisements, but please cite it in any work it inspires. Feedback and topic requests are highly appreciated.<br />
&#8212;&#8212;&#8211; &#8212;&#8211; &#8212; &#8212; &#8211; &#8211;<br />
</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-21-multilevel-model-specification-nlme/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>R-Sessions 20: Plotting Multilevel Models</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-20-plotting-multilevel-models/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-20-plotting-multilevel-models/#comments</comments>
		<pubDate>Tue, 09 Sep 2008 10:00:56 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[fixef]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[plotting]]></category>
		<category><![CDATA[ranef]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=573</guid>
		<description><![CDATA[<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg" title="R-Sessions" width="470" /></a>

Plotting the results of a multilevel analysis in R can be quite complicated without the extension package 'lattice'. Using only the basic packages and the multilevel packages (nlme and lme4), no functions are readily available for this task. So, this is a good point in this manual to put some of our programming skills to use. It also makes clear exactly how the results of a multilevel analysis are stored in R.

]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://i2.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" title="R-Sessions" data-recalc-dims="1" /></a><br />
<!--adsense--></p>
<p>Plotting the results of a multilevel analysis in R can be quite complicated without the extension package &#8216;lattice&#8217;. Using only the basic packages and the multilevel packages (nlme and lme4), no functions are readily available for this task. So, this is a good point in this manual to put some of our programming skills to use. It also makes clear exactly how the results of a multilevel analysis are stored in R.</p>
<p>In order to be able to plot a multilevel model, we first need such a model. We will estimate a model here that we have seen before. We want to estimate the effect that a standardized test at school-entry has on a specific exam the students took. Students are, of course, nested within schools, which is taken into account in our analysis, as is the average score on the intake test for each of the schools. The effect the test at school-entry has on the exam result is allowed to vary by school. In other words, we are estimating a random intercept model with a random slope.<br />
<span id="more-573"></span><br />
In order to do so, we have to load the nlme-package, as well as the mlmRev-package, which contains the data we will use. Then, for technical reasons, we have to unload the lme4-package, using the detach() function. This is what is done in the syntax below.</p>
<blockquote><p>
library(nlme)<br />
library(mlmRev)<br />
detach(&#8220;package:lme4&#8221;)</p>
<p>model.01 <- lme(fixed = normexam ~ standLRT + schavg,<br />
			data = Exam,<br />
			random = ~ standLRT | school)</p>
<p>summary(model.01)
</p></blockquote>
<p>The requested output of the model.01 results in:</p>
<pre>
> summary(model.01)
Linear mixed-effects model fit by REML
 Data: Exam 
       AIC     BIC    logLik
  9337.885 9382.04 -4661.943

Random effects:
 Formula: ~standLRT | school
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.2778443 (Intr)
standLRT    0.1237837 0.373 
Residual    0.7440440       

Fixed effects: normexam ~ standLRT + schavg 
                 Value  Std.Error   DF   t-value p-value
(Intercept) -0.0014224 0.03725472 3993 -0.038181  0.9695
standLRT     0.5522414 0.02035359 3993 27.132378  0.0000
schavg       0.2947588 0.10726821   63  2.747867  0.0078
 Correlation: 
         (Intr) stnLRT
standLRT  0.266       
schavg    0.089 -0.085

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.82942010 -0.63168683  0.03258589  0.68512318  3.43634584 

Number of Observations: 4059
Number of Groups: 65 
</pre>
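<p>The summary above reports point estimates and standard errors. Approximate 95% confidence intervals for the fixed effects and the variance components can be requested directly with the intervals() function from the nlme-package:</p>

```r
# Approximate confidence intervals for all estimated parameters
intervals(model.01)

# Restrict the output to the fixed effects only
intervals(model.01, which = "fixed")
```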
<p>Probably the best way to visualize this model is to create a plot of the individual-level relationship between the score a student had on the standardized intake test (standLRT) and the result on the specific exam (normexam), shown for each school separately. To do this, several steps are needed:</p>
<ul>
<li>A plot region has to be set</li>
<li>Coefficients for each school need to be extracted from the model</li>
<li>The minimum and maximum value on the x-axis for each school need to be extracted</li>
<li>Based on the model, the data actually needs to be plotted</li>
</ul>
<p>Each of the steps will be described explicitly below. Since many steps need to be performed, they will be gathered in a function, which we&#8217;ll call visualize.lme(). It has three named parameters: the model we want to use, the predictor, and the group-variable that will be taken into account.</p>
<blockquote><p>
visualize.lme <- function (model, coefficient, group, ...)<br />
{<br />
	r <- ranef(model)<br />
	f <- fixef(model)</p>
<p>	effects <- data.frame(r[,1]+f[1], r[,2]+f[2])</p>
<p>	number.lines <- nrow(effects)
</p></blockquote>
<p>Above, the first row defines the function visualize.lme(), an arbitrary name we chose ourselves. Between the brackets, four elements are placed. The three required elements described above are named &#8216;model&#8217;, &#8216;coefficient&#8217;, and &#8216;group&#8217;. The fourth element ( &#8230; ) means that any additional arguments can be given; these will be transferred to the plot() function that is used to set up the graphics device.<br />
Then, four variables are created within the function, into which data from the model is extracted. First, ranef() extracts the random coefficients from the model, while fixef() does the same for the fixed effects.<br />
The effects variable is created next. This is a data.frame containing the intercepts and the slopes that will be plotted. The number.lines variable contains the number of rows in the effects data.frame, which is equal to the number of groups in the model.</p>
<blockquote><p>
	predictor.min <- tapply(model$data[[coefficient]], model$data[[group]], min)<br />
	predictor.max <- tapply(model$data[[coefficient]], model$data[[group]], max)</p>
<p>	outcome.min <- min(predict(model))<br />
	outcome.max <- max(predict(model))
</p></blockquote>
<p>Before the plotting area can be set up, we need four coordinates; these are obtained above. First, the minimum and maximum values of the predictor are gathered for each group. Next, the minimum and maximum of the predicted values are determined, using predict(). The predict() function takes the specified model, uses the original data and the estimated model formula (both stored inside the model object), and returns a vector of predicted values for the outcome variable. Using min() and max(), the minimum and maximum values are obtained.</p>
<blockquote><p>
	plot (c(min(predictor.min),max(predictor.max)),c(outcome.min,outcome.max), type=&#8221;n&#8221;, &#8230;)</p>
<p>	for (i in 1:number.lines)<br />
	{<br />
		expression <- function(x) {effects[i,1] + (effects[i,2] * x) }<br />
		curve(expression, from=predictor.min[i], to=predictor.max[i], add=TRUE)<br />
	}<br />
}
</p></blockquote>
<p>Finally, the plot is created above. First, the area in which to plot is set up. Using the four coordinates obtained before, the plot region is created in such a way that the lines that will be plotted next fit in exactly. Specifying type=&#8221;n&#8221; sets up a plotting region on a graphics device without drawing an actual plot; only the axes can be seen. The &#8230; parameter in the plot() call passes on all additional parameters that are given to the visualize.lme() function we&#8217;re creating. We can use it to give proper names to both axes, as well as to set a title for the plot.</p>
<p>Then, a loop is created using for(). For each of the groups, a line is plotted. To do so, a function is created from the intercept and slope extracted earlier. This function (called &#8216;expression&#8217;) is then passed to the curve()-function, which draws a graph of a function from a specified starting point to a specified end point. This process is repeated for every group in the model, resulting in a basic graph of our multilevel random intercept, random slope regression model.</p>
<p>Finally, for convenience, the complete syntax of the function is shown below, together with the result when it is applied to the model we estimated at the start of this paragraph.</p>
<blockquote>
<p>visualize.lme <- function (model, coefficient, group, ...)<br />
{<br />
	r <- ranef(model)<br />
	f <- fixef(model)</p>
<p>	effects <- data.frame(r[,1]+f[1], r[,2]+f[2])</p>
<p>	number.lines <- nrow(effects)</p>
<p>	predictor.min <- tapply(model$data[[coefficient]], model$data[[group]], min)<br />
	predictor.max <- tapply(model$data[[coefficient]], model$data[[group]], max)</p>
<p>	outcome.min <- min(predict(model))<br />
	outcome.max <- max(predict(model))</p>
<p>	plot (c(min(predictor.min),max(predictor.max)),c(outcome.min,outcome.max), type="n", ...)</p>
<p>	for (i in 1:number.lines)<br />
	{<br />
		expression <- function(x) {effects[i,1] + (effects[i,2] * x) }<br />
		curve(expression, from=predictor.min[i], to=predictor.max[i], add=TRUE)<br />
	}<br />
}	</p>
<p>visualize.lme(model.01, "standLRT", "school", xlab="Student test at school-entry", ylab="Result on Exam", main="Exam results for 65 schools")
</p></blockquote>
<p><img src='http://i2.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2007/06/multilevel-plot.gif?w=470' alt='Multilevel Plot' data-recalc-dims="1" /></p>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;</p>
<ul>
<li><b><a href="http://www.rensenieuwenhuis.nl/R-forum/">Discuss this article and pose additional questions in the R-Sessions Forum</a></b></li>
<li><b><a href="http://www.rensenieuwenhuis.nl/r-project/manual/multilevel-analysis/plotting/">Find the original article embedded in the manual.</a></b></li>
</ul>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;<br />
<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/">R-Sessions</a> is a collection of manual chapters for R-Project, maintained on <a href="www.rensenieuwenhuis.nl">Curving Normality</a>. All posts are linked to the chapters of the R-Project manual on this site. The manual is free to use, as it is funded by advertisements, but please cite it in any work it inspires. Feedback and topic requests are highly appreciated.<br />
&#8212;&#8212;&#8211; &#8212;&#8211; &#8212; &#8212; &#8211; &#8211;<br />
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-20-plotting-multilevel-models/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>R-Sessions 19: Extractor Functions</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-19-extractor-functions/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-19-extractor-functions/#comments</comments>
		<pubDate>Fri, 05 Sep 2008 10:58:43 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[anova]]></category>
		<category><![CDATA[extractor functions]]></category>
		<category><![CDATA[fixef]]></category>
		<category><![CDATA[lme4]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[ranef]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=571</guid>
		<description><![CDATA[<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg" title="R-Sessions" width="470" /></a>

Unlike most statistical software packages, R often stores the results of an analysis in an object. The advantage of this is that, while not all output is shown on the screen at once, it is also not necessary to estimate the statistical model again when different output is required.

This section shows the kind of data that is stored in a multilevel model estimated in R and introduces some functions that make use of these data.]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://i0.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" title="R-Sessions" data-recalc-dims="1" /></a><br />
<!--adsense--></p>
<p>Unlike most statistical software packages, R often stores the results of an analysis in an object. The advantage of this is that, while not all output is shown on the screen at once, it is also not necessary to estimate the statistical model again when different output is required.</p>
<p>This section shows the kind of data that is stored in a multilevel model estimated in R and introduces some functions that make use of these data.<br />
<span id="more-571"></span></p>
<h2>Inside the model</h2>
<p>Let&#8217;s first estimate a simple multilevel model using the nlme-package. For this section we will use a model estimated earlier: the education model with a random intercept and a random slope. This time, though, we will assign it to an object called model.01. It is estimated as follows:</p>
<blockquote><p>
require(nlme)<br />
require(mlmRev)</p>
<p>model.01 <- lme(fixed = normexam ~ standLRT, data = Exam,<br />
	random = ~ standLRT | school)
</p></blockquote>
<p>Estimating the model itself produces no output at all, although loading the packages prints a few messages. Basic results can be obtained by simply calling the object:</p>
<blockquote><p>
model.01
</p></blockquote>
<pre>
> model.01
Linear mixed-effects model fit by REML
  Data: Exam 
  Log-restricted-likelihood: -4663.8
  Fixed: normexam ~ standLRT 
(Intercept)    standLRT 
-0.01164834  0.55653379 

Random effects:
 Formula: ~standLRT | school
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.3034980 (Intr)
standLRT    0.1223499 0.494 
Residual    0.7440699       

Number of Observations: 4059
Number of Groups: 65 
</pre>
<p>This gives a first impression of the estimated model. But there is more. To get an idea of the elements that are actually stored inside the model, we use the names() function, which gives us the names of all the elements of the model. </p>
<blockquote><p>
names(model.01)<br />
model.01$method<br />
model.01$logLik
</p></blockquote>
<p>The output below shows that our model.01 contains seventeen elements. For reasons of space, only some are described here. &#8216;contrasts&#8217; contains information on the way categorical variables were handled, &#8216;coefficients&#8217; contains the model parameters, &#8216;call&#8217; stores the model formula, and &#8216;data&#8217; even stores the original data. </p>
<pre>
> names(model.01)
 [1] "modelStruct"  "dims"         "contrasts"    "coefficients"
 [5] "varFix"       "sigma"        "apVar"        "logLik"      
 [9] "numIter"      "groups"       "call"         "terms"       
[13] "method"       "fitted"       "residuals"    "fixDF"       
[17] "data"        
> model.01$method
[1] "REML"
> model.01$logLik
[1] -4663.8
</pre>
<p>In the syntax above, two specific elements of the model were requested: the estimation method and the log-likelihood. This is done by subsetting the model using the $-sign, followed by the name of the desired element. The output tells us that model.01 was estimated using Restricted Maximum Likelihood and that the log-likelihood is -4663.8.</p>
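<p>The $-subsetting used above is ordinary R list extraction; a minimal self-contained sketch with toy values (not taken from a fitted model):</p>

```r
# A list mimicking two of the seventeen elements stored in model.01
m <- list(method = "REML", logLik = -4663.8)

# Extract single elements with the $-sign, exactly as with the fitted model
m$method   # "REML"
m$logLik   # -4663.8
```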
<h2>Summary</h2>
<p>As we have seen, all the information we could possibly want is stored inside the model. To obtain concise results, many functions exist that extract some of the elements from the model and present them clearly. The most basic of these extractor functions is probably summary():</p>
<blockquote><p>
summary(model.01)
</p></blockquote>
<pre>
> summary(model.01)
Linear mixed-effects model fit by REML
 Data: Exam 
     AIC     BIC  logLik
  9339.6 9377.45 -4663.8

Random effects:
 Formula: ~standLRT | school
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.3034980 (Intr)
standLRT    0.1223499 0.494 
Residual    0.7440699       

Fixed effects: normexam ~ standLRT 
                 Value  Std.Error   DF   t-value p-value
(Intercept) -0.0116483 0.04010986 3993 -0.290411  0.7715
standLRT     0.5565338 0.02011497 3993 27.667639  0.0000
 Correlation: 
         (Intr)
standLRT 0.365 

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-3.8323045 -0.6316837  0.0339390  0.6834319  3.4562632 

Number of Observations: 4059
Number of Groups: 65 
</pre>
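<p>Besides summary(), the nlme-package offers more targeted extractor functions, such as fixef(), ranef(), and VarCorr(). A self-contained sketch with simulated grouped data (the Exam model above would work the same way; the simulated variables are purely illustrative):</p>

```r
library(nlme)

# Simulated data: 10 groups of 20 observations, standing in for the Exam data
set.seed(1)
d <- data.frame(school = factor(rep(1:10, each = 20)), x = rnorm(200))
d$y <- 0.5 * d$x + rep(rnorm(10, sd = 0.3), each = 20) + rnorm(200)

m <- lme(fixed = y ~ x, data = d, random = ~ 1 | school)

fixef(m)     # fixed-effects estimates: intercept and slope
ranef(m)     # per-school deviations from the fixed intercept
VarCorr(m)   # variance components of the random effects and residual
AIC(m)       # fit criterion, also reported by summary()
```

<p>None of these functions re-estimate the model; they merely extract what is already stored inside the object.</p>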
<h2>Anova</h2>
<p>The last extractor function shown here is anova(). This is a very general function that can be applied to a great variety of models. When it is applied to a multilevel model, it provides a basic test of the statistical significance of the model parameters, as is shown below. </p>
<blockquote><p>
anova(model.01)</p>
<p>model.02 <- lme(fixed = normexam ~ standLRT, data = Exam,<br />
	random = ~ 1 | school)</p>
<p>anova(model.02,model.01)
</p></blockquote>
<p>In the syntax above, an additional model is estimated that is very similar to model.01 but does not have a random slope. It is stored in the object model.02. This makes it possible to test whether the random-slope model fits the data better than the fixed-slope model. The output below shows that this is indeed the case.</p>
<pre>
> anova(model.01)
            numDF denDF  F-value p-value
(Intercept)     1  3993 124.3969  <.0001
standLRT        1  3993 765.4983  <.0001
> 
> model.02 <- lme(fixed = normexam ~ standLRT, data = Exam,
+ 	random = ~ 1 | school)
> 
> anova(model.02,model.01)
         Model df      AIC      BIC    logLik   Test  L.Ratio
model.02     1  4 9376.765 9401.998 -4684.383                
model.01     2  6 9339.600 9377.450 -4663.800 1 vs 2 41.16494
         p-value
model.02        
model.01  <.0001
</pre>
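<p>The L.Ratio reported by anova() is simply twice the difference between the two log-likelihoods, which is easy to verify by hand:</p>

```r
# Log-likelihoods as printed by anova(model.02, model.01) above
loglik.02 <- -4684.383   # random intercept only
loglik.01 <- -4663.800   # random intercept and random slope

# Likelihood-ratio statistic: twice the gain in log-likelihood
lr <- 2 * (loglik.01 - loglik.02)
lr   # 41.166, matching the reported 41.16494 up to rounding
```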
<p>- - -- --- ----- --------</p>
<ul>
<li><b><a href="http://www.rensenieuwenhuis.nl/R-forum/">Discuss this article and pose additional questions in the R-Sessions Forum</a></b></li>
<li><b><a href="http://www.rensenieuwenhuis.nl/r-project/manual/multilevel-analysis/extractor-functions/">Find the original article embedded in the manual.</a></b></li>
</ul>
<p>- - -- --- ----- --------<br />
<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/">R-Sessions</a> is a collection of manual chapters for R-Project, maintained on <a href="http://www.rensenieuwenhuis.nl">Curving Normality</a>. All posts are linked to the chapters of the R-Project manual on this site. The manual is free to use, as it is paid for by the advertisements, but please refer to it in work inspired by it. Feedback and topic requests are highly appreciated.<br />
-------- ----- --- -- - -<br />
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-19-extractor-functions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Public opinion on induced abortion, comparison in Western Europe</title>
		<link>http://www.rensenieuwenhuis.nl/public-opinion-on-induced-abortion-comparison-in-western-europe/</link>
		<comments>http://www.rensenieuwenhuis.nl/public-opinion-on-induced-abortion-comparison-in-western-europe/#comments</comments>
		<pubDate>Thu, 04 Sep 2008 10:00:46 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[abortion]]></category>
		<category><![CDATA[attitude]]></category>
		<category><![CDATA[Europe]]></category>
		<category><![CDATA[induced abortion]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[Opinion]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=562</guid>
		<description><![CDATA[Building upon the <a href="http://www.rensenieuwenhuis.nl/archive/paradoxical-negative-spill-over-of-catholics-attitudes-on-induced-abortion/">paper written by Jelen et al. (1993)</a> that I wrote about a few days ago, I'd like to bring to your attention a more recent paper by Dutch researchers. It also addresses attitudes toward abortion in Western Europe, but does so in a rather more advanced manner. 

As might be expected from an article written 15 years later, much progress has been made in research on public opinion regarding induced abortion, both on a theoretical and on a methodological level. Let's take a look at the outcomes of those improvements.]]></description>
				<content:encoded><![CDATA[<p><span style="float: left; padding: 5px;"><a href="http://www.researchblogging.org"><img alt="ResearchBlogging.org" src="http://i0.wp.com/www.researchblogging.org/public/citation_icons/20_rb2_large_gray.png?w=1170" style="border:0;" data-recalc-dims="1"/></a></span><br />
<!--adsense--></p>
<p>Building upon the <a href="http://www.rensenieuwenhuis.nl/archive/paradoxical-negative-spill-over-of-catholics-attitudes-on-induced-abortion/">paper written by Jelen et al. (1993)</a> that I wrote about a few days ago, I&#8217;d like to bring to your attention a more recent paper by Dutch researchers. ((<i>Declaration of interest:</i> I personally know and work with most of the authors of this paper. Therefore, please don&#8217;t regard this blog post as a neutral, or possibly critical, review, but rather as a &#8212; hopefully &#8212; interesting perspective on and notification of fascinating research.))  It also addresses attitudes toward abortion in Western Europe, but does so in a rather more advanced manner. As might be expected from an article written 15 years later, much progress has been made in research on public opinion regarding induced abortion, both on a theoretical and on a methodological level. Let&#8217;s take a look at the outcomes of those improvements.<br />
<span id="more-562"></span><br />
The authors state three main mechanisms on which the formation of attitudes toward induced abortion is based. At first, it is known that people adjust their opinion to ruling legislation in the country they live in. Secondly, based on the seminal work by Ã‰mile Durkheim, the authors state that in general people adjust their norms (and thereby attitudes) on topics to the norms prevalent in the (intermediary) groups they are a member of. Thirdly, previous research found that people tend to adjust their opinion to what is commonly thought to be good, or commonly done, in the &#8216;public domain&#8217;. They refer to this as the &#8216;marketplace of opinions and behaviour&#8217;. </p>
<p>Based on these three fundamental mechanisms, several interesting hypotheses are formulated, of which I will name only a few. Generally, it is expected that due to educational expansion people became more liberal between 1981 and 2000. This is also, to some extent, expected due to a general trend toward more liberal legislation of induced abortion in Western Europe during the last few decades. Most churches object to (the possibility of) induced abortion, with the Catholic church expressing the most pronounced pro-life stance. It is thus hypothesised that members of stricter churches will object to induced abortion more strongly. Regarding the &#8216;marketplace of opinions and behaviour&#8217;, it is expected that people will express more favourable opinions toward the possibility of abortion when living in a country with high abortion ratios. </p>
<p>The authors tested these (and other) expectations on 14 European countries, over a time span from 1981 to 2000, by performing multilevel regression analyses on data from the European Values Survey. One finding I find especially interesting is that people who live in a country with many non-religious people tend to have fewer objections to induced abortion. Also, when more induced abortions are performed in a country (measured by abortion ratios), people tend to have more liberal attitudes on the subject. The authors accounted for some causality issues by taking the abortion ratio measured two years prior to the measurement of the attitude. It was also found that while church members and frequent church attendants have relatively negative attitudes toward induced abortion (compared with non-members and infrequent attendants), this impact waned over time. No differences between Protestants and non-members were found. Finally, by taking into account several demographic variables, the educational level and religious denomination of respondents, and the levels of religiousness and abortion ratios of countries, the authors were able to explain much of the between-country differences in attitudes toward abortion. </p>
<h2>Reference</h2>
<p><span class="Z3988" title="ctx_ver=Z39.88-2004&#038;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&#038;rft.jtitle=Mens+en+Maatschappij&#038;rft.id=info:DOI/&#038;rft.atitle=Mening+over+abortus+in+West-Europa%2C+1981-2000&#038;rft.date=2008&#038;rft.volume=83&#038;rft.issue=1&#038;rft.spage=5&#038;rft.epage=22&#038;rft.artnum=&#038;rft.au=Ariana+Need&#038;rft.au=Wout+Ultee&#038;rft.au=Mark+Levels&#038;rft.au=Marike+van+Tienen&#038;bpr3.included=1&#038;bpr3.tags=Social+Science%2CSociology%2C+abortion%2C+induced%2C+Europe%2C+attitude%2C+public+opinion">Ariana Need, Wout Ultee, Mark Levels, Marike van Tienen (2008). Mening over abortus in West-Europa, 1981-2000 <span style="font-style: italic;">Mens en Maatschappij, 83</span> (1), 5-22</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/public-opinion-on-induced-abortion-comparison-in-western-europe/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>R-Sessions 18: Helper Functions</title>
		<link>http://www.rensenieuwenhuis.nl/r-sessions-18-helper-functions/</link>
		<comments>http://www.rensenieuwenhuis.nl/r-sessions-18-helper-functions/#comments</comments>
		<pubDate>Wed, 03 Sep 2008 10:00:05 +0000</pubDate>
		<dc:creator><![CDATA[Rense Nieuwenhuis]]></dc:creator>
				<category><![CDATA[R-Project]]></category>
		<category><![CDATA[R-Sessions]]></category>
		<category><![CDATA[aggregate]]></category>
		<category><![CDATA[contrasts]]></category>
		<category><![CDATA[helper functions]]></category>
		<category><![CDATA[mixed effect models]]></category>
		<category><![CDATA[multilevel]]></category>
		<category><![CDATA[plot]]></category>

		<guid isPermaLink="false">http://www.rensenieuwenhuis.nl/?p=560</guid>
		<description><![CDATA[<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg" title="R-Sessions" width="470" /></a>

Several functions already present in R-Project are very useful when analyzing multilevel models or when preparing data to do so. Three of these helper functions will be described: aggregating data, the behavior of the plot() function when applied to a multilevel model, and setting contrasts for categorical variables. Note that none of these functions is specific to multilevel analysis.]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/"><img src="http://i0.wp.com/www.rensenieuwenhuis.nl/wp-content/uploads/2008/07/r-sessions.jpg?w=470" title="R-Sessions" data-recalc-dims="1" /></a><br />
<!--adsense--></p>
<p>Several functions already present in R-Project are very useful when analyzing multilevel models or when preparing data to do so. Three of these helper functions will be described: aggregating data, the behavior of the plot() function when applied to a multilevel model, and setting contrasts for categorical variables. Note that none of these functions is specific to multilevel analysis.<br />
<span id="more-560"></span></p>
<h2>Aggregate</h2>
<p>We will continue to work with the Exam-dataset that is made available through the mlmRev-package. In the syntax below we load that package and use the names() function to see which variables are available in the Exam-dataset. One of the variables on the individual level is the normexam-variable. Let&#8217;s say we want to aggregate this variable to the school level, so that the new variable represents, for each student, the mean exam score of his or her school.</p>
<blockquote><p>
library(mlmRev)<br />
names(Exam)</p>
<p>meansch <- tapply(Exam$normexam, Exam$school, mean)<br />
meansch[1:10]<br />
Exam$meanexam <- meansch[Exam$school]<br />
names(Exam)
</p></blockquote>
<p>Using tapply(), we create a table of the normexam-variable by school, calculating the mean for each school. The result is stored in the meansch-variable, which now contains 65 values representing the mean score on the normexam-variable for each school. The first ten of these values are shown. This meansch-variable is only a temporary helper variable. We then create a new variable in the Exam-data.frame, called meanexam (this can be any name you want, as long as it does not already exist in the data.frame). </p>
<p>We give each respondent the correct value on this new variable by indexing meansch with the school-index stored in the Exam-data.frame. When we now look at the names of the Exam-data.frame, we see that our new variable has been added.</p>
<pre>
> library(mlmRev)
> names(Exam)
 [1] "school"   "normexam" "schgend"  "schavg"   "vr"      
 [6] "intake"   "standLRT" "sex"      "type"     "student" 
[11] "meansch" 
> 
> meansch <- tapply(Exam$normexam, Exam$school, mean)
> meansch[1:10]
           1            2            3            4            5 
 0.501209573  0.783102291  0.855444696  0.073628516  0.403608663 
           6            7            8            9           10 
 0.944579967  0.391500023 -0.048192541 -0.435682141 -0.269390664 
> Exam$meanexam <- meansch[Exam$school]
> names(Exam)
 [1] "school"   "normexam" "schgend"  "schavg"   "vr"      
 [6] "intake"   "standLRT" "sex"      "type"     "student" 
[11] "meansch"  "meanexam"
</pre>
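<p>The same aggregate-and-expand pattern works on any data.frame; a minimal self-contained sketch with made-up data (the variable names are illustrative only, not part of the Exam data):</p>

```r
# Toy data: six pupils in three groups
d <- data.frame(grp = factor(c("a", "a", "b", "b", "c", "c")),
                score = c(1, 3, 5, 7, 9, 11))

# Step 1: mean score per group
grp.mean <- tapply(d$score, d$grp, mean)

# Step 2: expand back to one value per pupil by indexing with the group factor
d$grp.mean <- grp.mean[d$grp]

unname(d$grp.mean)   # 2 2 6 6 10 10
```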
<h2>plot (multilevel.model)</h2>
<p>R-Project has many generic functions that behave differently depending on the data provided to them. A good example of this is the plot()-function. We have already seen how this function can be used to plot simple data. When it is applied to an estimated multilevel model, it extracts the residuals and produces a plot of the standardized residuals.</p>
<blockquote><p>
library(nlme)</p>
<p>model.01 <- lme(fixed = normexam ~ standLRT,<br />
	data = Exam,<br />
	random = ~ 1 | school)</p>
<p>plot(model.01, main="Residual plot of model.01")
</p></blockquote>
<p>In the syntax above, this is illustrated using a basic multilevel model based on the Exam-data. First, the nlme-package is loaded, which results in a few warning messages (not shown here) indicating that some data-sets are loaded twice (they were already loaded with the mlmRev-package). We assign the estimated model, which has a random intercept by school and a predictor on the first (individual) level, to an object we have called &#8216;model.01&#8217;. </p>
<p>When this is plotted, we see that the residuals are distributed quite homoscedastically. The layout of the plot window is quite different from what we have seen before, because the lattice-package for advanced graphics is automatically used here. </p>
<h2>Contrast</h2>
<p>When using categorical data in regression analyses (amongst other types of analysis), dummy variables or contrasts need to be coded. This can of course be done manually, as is required in some other statistical packages, but R-Project does it automatically. </p>
<p>The &#8216;vr&#8217;-variable in the Exam-dataset represents &#8220;Student level Verbal Reasoning (VR) score band at intake &#8211; a factor. Levels are bottom 25%, mid 50%, and top 25%.&#8221; (quoted from the description of the dataset, found by using ?Exam). In the syntax below, a summary of this variable is requested, confirming that it indeed contains three categories. </p>
<p>Then, a basic model is estimated using the &#8216;vr&#8217;-variable and assigned to an object called model.02 (a random-intercept model at school level containing two variables on the individual level (&#8216;vr&#8217; and &#8216;standLRT&#8217;) and one variable on the second level (&#8216;schavg&#8217;)). The summary of that model shows that the first category, representing the lowest Verbal Reasoning score band, is used as the reference category in the analyses.</p>
<blockquote><p>
summary(Exam$vr)</p>
<p>model.02 <- lme(fixed = normexam ~ vr + standLRT + schavg,<br />
	data = Exam,<br />
	random = ~ 1 | school)</p>
<p>summary(model.02)
</p></blockquote>
<pre>
> summary(Exam$vr)
bottom 25%    mid 50%    top 25% 
       640       2263       1156 
> 
> model.02 <- lme(fixed = normexam ~ vr + standLRT + schavg, 
+ 	data = Exam,
+ 	random = ~ 1 | school)
> 	
> summary(model.02)
Linear mixed-effects model fit by REML
 Data: Exam 
       AIC     BIC    logLik
  9379.377 9423.53 -4682.689

Random effects:
 Formula: ~1 | school
        (Intercept)  Residual
StdDev:   0.2844674 0.7523679

Fixed effects: normexam ~ vr + standLRT + schavg 
                 Value  Std.Error   DF  t-value p-value
(Intercept)  0.0714396 0.16717547 3993  0.42733  0.6692
vrmid 50%   -0.0834973 0.16341508   61 -0.51095  0.6112
vrtop 25%   -0.0537963 0.27433650   61 -0.19610  0.8452
standLRT     0.5594779 0.01253795 3993 44.62276  0.0000
schavg       0.4007454 0.28016707   61  1.43038  0.1577
 Correlation: 
          (Intr) vrm50% vrt25% stnLRT
vrmid 50% -0.946                     
vrtop 25% -0.945  0.886              
standLRT   0.000  0.000  0.000       
schavg     0.863 -0.794 -0.914 -0.045

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.71568977 -0.63052325  0.02781354  0.68294752  3.25580231 

Number of Observations: 4059
Number of Groups: 65 
> 
</pre>
<p>Should we want to change that, we need to use the contrasts() function. When used on the &#8216;vr&#8217;-variable, it makes clear why the first category of the variable is used as reference: in the resulting matrix it is the only category without a &#8216;1&#8217; in any of the columns. </p>
<p>To change this, we have two options at hand. First, we use the contr.treatment()-function. We want the third category to be used as reference now, so we specify that the contrasts should be based on the levels of the &#8216;vr&#8217;-variable and that the base, or reference, should be the third. The result of the contr.treatment()-function looks exactly like the result of the contrasts()-function (except for a different reference category, obviously). To change the reference used in an analysis, the result of the contr.treatment()-function should be assigned to contrasts(Exam$vr). The old contrasts settings are thereby replaced by the new ones.</p>
<p>Instead of using the contr.treatment()-function, we can simply create a matrix ourselves and use it to change the contrasts. This is done below to make clear what actually happens when the contr.treatment()-function is used. First the matrix is shown, then it is assigned to the contrasts of the &#8216;vr&#8217;-variable. Now we have chosen the middle category to be the reference. Although this works perfectly, the drawback of this procedure is that the value labels of the variable are lost, while they are maintained when contr.treatment() is used.</p>
<p>Finally, using the new contrasts settings of the &#8216;vr&#8217;-variable, the same model is estimated again. In the output below, we see that the value labels are indeed gone now (because a plain matrix was used to set the contrasts) and that the middle category is used as reference, although the categories are now referred to as &#8216;1&#8217; and &#8216;2&#8217;, which might lead to confusing interpretations.</p>
<blockquote><p>
contrasts(Exam$vr)<br />
contrasts(Exam$vr) <- contr.treatment(levels(Exam$vr), base=3)<br />
contrasts(Exam$vr)<br />
matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)<br />
contrasts(Exam$vr) <- matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)<br />
contrasts(Exam$vr)</p>
<p>model.03 <- lme(fixed = normexam ~ vr + standLRT + schavg,<br />
	data = Exam,<br />
	random = ~ 1 | school)</p>
<p>summary(model.03)
</p></blockquote>
<pre>
> contrasts(Exam$vr)
           mid 50% top 25%
bottom 25%       0       0
mid 50%          1       0
top 25%          0       1
> contr.treatment(levels(Exam$vr), base=3)
           bottom 25% mid 50%
bottom 25%          1       0
mid 50%             0       1
top 25%             0       0
> contrasts(Exam$vr) <- contr.treatment(levels(Exam$vr), base=3)
> contrasts(Exam$vr)
           bottom 25% mid 50%
bottom 25%          1       0
mid 50%             0       1
top 25%             0       0
> matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)
     [,1] [,2]
[1,]    1    0
[2,]    0    0
[3,]    0    1
> contrasts(Exam$vr) <- matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)
> contrasts(Exam$vr)
           [,1] [,2]
bottom 25%    1    0
mid 50%       0    0
top 25%       0    1
> 
> model.03 <- lme(fixed = normexam ~ vr + standLRT + schavg, 
+ 	data = Exam,
+ 	random = ~ 1 | school)
> 	
> summary(model.03)
Linear mixed-effects model fit by REML
 Data: Exam 
       AIC     BIC    logLik
  9379.377 9423.53 -4682.689

Random effects:
 Formula: ~1 | school
        (Intercept)  Residual
StdDev:   0.2844674 0.7523679

Fixed effects: normexam ~ vr + standLRT + schavg 
                 Value  Std.Error   DF  t-value p-value
(Intercept) -0.0120577 0.05427674 3993 -0.22215  0.8242
vr1          0.0834973 0.16341508   61  0.51095  0.6112
vr2          0.0297010 0.15018792   61  0.19776  0.8439
standLRT     0.5594779 0.01253795 3993 44.62276  0.0000
schavg       0.4007454 0.28016707   61  1.43038  0.1577
 Correlation: 
         (Intr) vr1    vr2    stnLRT
vr1      -0.096                     
vr2      -0.551 -0.530              
standLRT  0.000  0.000  0.000       
schavg    0.267  0.794 -0.806 -0.045

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.71568977 -0.63052325  0.02781354  0.68294752  3.25580231 

Number of Observations: 4059
Number of Groups: 65 
> 
</pre>
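<p>As an aside (not used in the syntax above), base R&#8217;s relevel() offers a simpler route when all you want is a different reference category, and it preserves the value labels; a self-contained sketch with a made-up factor:</p>

```r
# Toy factor with three bands, mirroring the structure of Exam$vr
vr <- factor(c("bottom 25%", "mid 50%", "top 25%", "mid 50%"),
             levels = c("bottom 25%", "mid 50%", "top 25%"))

# Make the top band the reference category; value labels are preserved
vr2 <- relevel(vr, ref = "top 25%")

levels(vr2)   # "top 25%" "bottom 25%" "mid 50%"
```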
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;</p>
<ul>
<li><b><a href="http://www.rensenieuwenhuis.nl/R-forum/">Discuss this article and pose additional questions in the R-Sessions Forum</a></b></li>
<li><b><a href="http://www.rensenieuwenhuis.nl/r-project/manual/multilevel-analysis/helper-functions/">Find the original article embedded in the manual.</a></b></li>
</ul>
<p>&#8211; &#8211; &#8212; &#8212; &#8212;&#8211; &#8212;&#8212;&#8211;<br />
<a href="http://www.rensenieuwenhuis.nl/archive/category/r-project/r-sessions/">R-Sessions</a> is a collection of manual chapters for R-Project, maintained on <a href="http://www.rensenieuwenhuis.nl">Curving Normality</a>. All posts are linked to the chapters of the R-Project manual on this site. The manual is free to use, as it is paid for by the advertisements, but please refer to it in work inspired by it. Feedback and topic requests are highly appreciated.<br />
&#8212;&#8212;&#8211; &#8212;&#8211; &#8212; &#8212; &#8211; &#8211;<br />
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rensenieuwenhuis.nl/r-sessions-18-helper-functions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
