Influence.ME: Tools for Detecting Influential Data in Multilevel Regression Models

Rense Nieuwenhuis — Thu, 20 Dec 2012 14:40:11 +0000

Despite the increasing popularity of multilevel regression models, the development of diagnostic tools has lagged behind. Typically, in the social sciences multilevel regression models are used to account for the nesting structure of the data, such as students in classes, migrants from origin-countries, and individuals in countries. The strength of multilevel models lies in analyzing data on a large number of groups with only a few observations within each group, such as students in classes.

Nevertheless, in the social sciences multilevel models are often used to analyze data on a limited number of groups with a large number of observations per group. A typical example would be the analysis of data on individuals nested within countries. By nature, only a limited number of countries exist. In practice, typical country-comparative analyses are based on about 25 countries. With such a small number of groups (e.g. countries), observations on a single group can easily be overly influential on the outcomes. This means that the conclusions based on the multilevel regression model may no longer hold when a single group is removed from the data.

In our recent publication in the R Journal, we introduce influence.ME, software that provides tools for detecting influential data in multilevel regression models (or: in mixed effects models, as these are commonly referred to in statistics). influence.ME is a publicly available R package that evaluates multilevel regression models that were estimated with the lme4.0 package. It calculates standardized measures of influential data for the point estimates of generalized mixed effects models, such as DFBETAS and Cook’s distance, as well as percentile change and a test for changing levels of significance. influence.ME calculates these measures of influence while accounting for the nesting structure of the data. In the paper, the package and the measures of influential data are introduced, a practical example is given, and strategies for dealing with influential data are suggested.
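
As a rough sketch of how these measures can be requested, the snippet below fits a random-intercept model and computes DFBETAS and Cook's distance per school. It assumes current CRAN releases of lme4, mlmRev (for the Exam data) and influence.ME; the model and the cut-off values 2/sqrt(n) and 4/n (with n the number of groups) are chosen for illustration only and are not a substitute for the worked example in the paper.

```r
library(lme4)
library(mlmRev)        # assumed source of the Exam data
library(influence.ME)

# Illustrative model: exam score on LR-test score, random intercept per school
m <- lmer(normexam ~ standLRT + (1 | school), data = Exam)

# Re-estimate the model once per school, each time leaving that school out.
# The explicit namespace avoids any clash with other influence() functions.
est <- influence.ME::influence(m, group = "school")

n_groups <- nlevels(Exam$school)

db <- dfbetas(est)          # one row per school, one column per fixed effect
cd <- cooks.distance(est)   # one row per school

# Schools exceeding the illustrative cut-offs
which(apply(abs(db) > 2 / sqrt(n_groups), 1, any))
which(cd[, 1] > 4 / n_groups)
```

Schools flagged by either rule are candidates for closer inspection, not automatic removal.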

With this publication, and of course with the software that was available for quite some time, we hope to contribute to a better usage of multilevel regression models. The provided example and guidelines were geared towards applications in the social sciences, but are applicable in all disciplines.

On a final note, the editorial of the R Journal describes how this journal is rapidly gaining (academic) recognition:

Thomson Reuters has informed us that The R Journal has been accepted for listing in the Science Citation Index-Expanded (SCIE), including the Web of Science, and the ISI Alerting Service, starting with volume 1, issue 1 (May 2009). This complements the current listings by EBSCO and the Directory of Open Access Journals (DOAJ), and completes a process started by Peter Dalgaard in 2010.

More information on our influence.ME software is available on this website.

Download the paper from the R Journal
Rense Nieuwenhuis, Manfred te Grotenhuis, & Ben Pelzer (2012). Influence.ME: tools for detecting influential data in mixed effects models. R Journal, 4(2), 38-47.

Influential Data in Multilevel Regression: What are your strategies?

Rense Nieuwenhuis — Tue, 13 Nov 2012 22:18:41 +0000

The application of multilevel regression models has become common practice in the field of social sciences. Multilevel regression models take into account that observations on individual respondents are nested within higher-level groups such as schools, classrooms, states, and countries.

In the application of multilevel models in country-comparative studies, however, it has long been overlooked that on the country-level only a limited number of observations are available. As a result, measurements on single countries can easily overly influence the regression outcomes.

Diagnostic tools for detecting influential data in multilevel regression are becoming available (including our own influence.ME), but what are your experiences with influential cases in country-comparative (multilevel) studies? How do you deal with influential cases if you encounter them?

My blog as a Word Cloud

Rense Nieuwenhuis — Fri, 19 Sep 2008 10:00:20 +0000

Ok, this is truly a cool gadget! Surf to wordle.net and create beautiful representations of some text. I simply entered the URL of my blog, pressed ‘randomize’ a few times and this is what came out! A cool representation of the text on my blog.

But, there I was confronted with a reality check. I always thought that I was writing about science in general, with some focus on (advanced) data analysis in R-Project. But, as it seems, the word cloud clearly shows that I primarily write about statistical analyses, with a strong focus on multilevel regression analysis.

So, there we have it… I’m just trying a new gadget, and now I find myself confronted with what I’m actually writing about. Perhaps it is time for a slight re-focus.

What do you think about Curving Normality?

R-Sessions 16: Multilevel Model Specification (lme4)

Rense Nieuwenhuis — Wed, 27 Aug 2008 10:00:47 +0000

Multilevel models, or mixed effects models, can easily be estimated in R. Several packages are available. Here, the lmer() function from the lme4-package is described. The specification of several types of models will be shown, using a fictive example. A detailed description of the specification rules is given. Output of the specified models is given, but not described or interpreted.
Please note that this description is very closely related to the description of the specification of the lme() function of the nlme-package. Both functions give similar results, and exactly the same model types are covered here.

In this example, the dependent variable is the standardized result of a student on a specific exam. This variable is called “normexam”. In estimating the score on the exam, two levels will be discerned: student and school. On each level, one explanatory variable is present. On the individual level, we take into account the standardized score of the student on a LR-test (“standLRT”). On the school level, we take into account the average intake-score (“schavg”).

Preparation

Before analyses can be performed, preparation needs to take place. Using the library() command, two packages are loaded. The lme4-package contains functions for estimation of multilevel or hierarchical regression models. The mlmRev-package contains, amongst many other things, the data we are going to use here. In the output below, we see that R-Project automatically loads the Matrix- and the lattice-packages as well. These are needed for the lme4-package to work properly.
Finally, the names() command is used to examine which variables are contained in the ‘Exam’ data.frame.

library(lme4)
library(mlmRev)
names(Exam)

>library(lme4)
Loading required package: lme4
Loading required package: Matrix
Loading required package: lattice
[1] TRUE
>library(mlmRev)
Loading required package: mlmRev
[1] TRUE
>names(Exam)
 [1] "school"   "normexam" "schgend"  "schavg"   "vr"       "intake"
 [7] "standLRT" "sex"      "type"     "student"

null-model

The syntax below specifies the simplest multilevel regression model of all: the null-model. Only the levels are defined. Using the lmer-function, the first level (here: students) does not have to be specified. It is assumed that the dependent variable (here: normexam) is on the first level (which it should be).

The model is specified using standard R formulas: First the dependent variable is given, followed by a tilde ( ~ ). The ~ should be read as: “follows”, or: “is defined by”. Next, the predictors are defined. In this case, only the intercept is defined by entering a ‘1’. Next, the random elements are specified between parentheses ( ). Inside these parentheses we specify the random predictors, followed by a vertical bar ( | ), after which the group-level is specified.
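
The pieces of such a formula can be inspected with base R alone, without fitting anything. The following is only a small sketch; the object name `f` is made up for illustration.

```r
# The null-model formula, stored as an ordinary R formula object
f <- normexam ~ 1 + (1 | school)

all.vars(f)                    # "normexam" "school"
deparse(f[[2]])                # the dependent variable: "normexam"
attr(terms(f), "intercept")    # 1, i.e. an intercept is part of the model
attr(terms(f), "term.labels")  # the remaining term: the random part
```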

After the model specification, several parameters can be given to the model. Here, we specify the data that should be used by data=Exam. Another often used parameter indicates the estimation method. If left unspecified, restricted maximum likelihood (REML) is used. Another option would be method="ML", which calls for full maximum likelihood estimation (newer lme4 releases express this as REML=FALSE instead). All this leads to the following model specification:

lmer(normexam ~ 1 + (1 | school), data=Exam)

This leads to the following output:

> lmer(normexam ~ 1 + (1 | school), data=Exam)
Linear mixed-effects model fit by REML
Formula: normexam ~ 1 + (1 | school)
   Data: Exam
   AIC   BIC logLik MLdeviance REMLdeviance
 11019 11031  -5507      11011        11015
Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept) 0.17160  0.41425
 Residual             0.84776  0.92074
number of obs: 4059, groups: school, 65

Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.01325    0.05405 -0.2452
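
The output above was produced with the lme4 version available when this was written. As a hedged sketch for newer lme4 releases (1.0 and later, which is an assumption about the reader's installation), the choice between REML and full maximum likelihood is made with the logical REML argument; the object names are made up for illustration.

```r
library(lme4)
library(mlmRev)   # provides the Exam data

# Null model, restricted maximum likelihood (the default)
m0_reml <- lmer(normexam ~ 1 + (1 | school), data = Exam)

# Same model, full maximum likelihood
m0_ml <- lmer(normexam ~ 1 + (1 | school), data = Exam, REML = FALSE)

isREML(m0_reml)   # TRUE
isREML(m0_ml)     # FALSE

# Variance components under both estimators
VarCorr(m0_reml)
VarCorr(m0_ml)
```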

random intercept, fixed predictor on individual level

For the next model, we add a predictor to the individual level. We do this by replacing the ‘1’ of the previous model by the predictor (here: standLRT). An intercept is always assumed, so it is still estimated here. It only needs to be specified when no other predictors are specified. Since we don’t want the effect of the predictor to vary between groups, the specification of the random part of the model remains identical to the previous model. The same data is used, so we specify data=Exam again.

lmer(normexam ~ standLRT + (1 | school), data=Exam)

> lmer(normexam ~ standLRT + (1 | school), data=Exam)
Linear mixed-effects model fit by REML
Formula: normexam ~ standLRT + (1 | school)
   Data: Exam
  AIC  BIC logLik MLdeviance REMLdeviance
 9375 9394  -4684       9357         9369
Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept) 0.093839 0.30633
 Residual             0.565865 0.75224
number of obs: 4059, groups: school, 65

Fixed effects:
            Estimate Std. Error t value
(Intercept) 0.002323   0.040354    0.06
standLRT    0.563307   0.012468   45.18

Correlation of Fixed Effects:
         (Intr)
standLRT 0.008
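
The remark that an intercept is always assumed can be checked on the formula itself. The sketch below uses base R only; the three object names are made up for illustration.

```r
with_default  <- normexam ~ standLRT + (1 | school)
with_explicit <- normexam ~ 1 + standLRT + (1 | school)
without_int   <- normexam ~ 0 + standLRT + (1 | school)

attr(terms(with_default),  "intercept")  # 1: intercept assumed
attr(terms(with_explicit), "intercept")  # 1: same model, written out
attr(terms(without_int),   "intercept")  # 0: fixed intercept suppressed
```

Note that `0 +` only removes the fixed intercept; the `(1 | school)` term still requests a random intercept per school.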

random intercept, random slope

The next model to be specified has a random intercept and a predictor whose effect is allowed to vary between groups. In other words, the effect of the score on the LR-test (standLRT) on the exam score varies between schools. In order to estimate this model, the ‘1’ that indicates the intercept in the random part of the model specification is replaced by the variable whose effect we want to vary between the groups. The random intercept is still estimated, since an intercept is always assumed.

lmer(normexam ~ standLRT + (standLRT | school), data=Exam, method="ML")

> lmer(normexam ~ standLRT + (standLRT | school), data=Exam, method="ML")
Linear mixed-effects model fit by maximum likelihood
Formula: normexam ~ standLRT + (standLRT | school)
   Data: Exam
  AIC  BIC logLik MLdeviance REMLdeviance
 9327 9358  -4658       9317         9328
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 school   (Intercept) 0.090406 0.30068
          standLRT    0.014548 0.12062  0.497
 Residual             0.553656 0.74408
number of obs: 4059, groups: school, 65

Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.01151    0.03978  -0.289
standLRT     0.55673    0.01994  27.917

Correlation of Fixed Effects:
         (Intr)
standLRT 0.365
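
To see what a varying effect means in practice, the school-specific intercepts and slopes can be extracted from the fitted model. This is a sketch under the assumption of a newer lme4 release, hence REML=FALSE rather than the method argument used above; writing (1 + standLRT | school) only makes the implicit random intercept explicit. Object names are made up for illustration.

```r
library(lme4)
library(mlmRev)   # provides the Exam data

m_slope <- lmer(normexam ~ standLRT + (1 + standLRT | school),
                data = Exam, REML = FALSE)

# One row per school: fixed effect plus that school's random deviation
school_coefs <- coef(m_slope)$school
head(school_coefs)

# Spread of the school-specific effect of standLRT
range(school_coefs$standLRT)
```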

random intercept, individual and group level predictor

It is possible to enter variables on group level as well. Here, we will add a predictor that indicates the average intake score of the school (schavg). The lmer-function needs this variable to be of the same length as variables on individual level. In other words: for every unit on the lowest level, the variable indicating the group level value (here: the average score on the intake-test for every school) should have a value. For this example, this implies that all respondents that attend the same school have the same value on the variable “schavg”. We enter this variable to the model in the same way as individual level variables, leading to the following syntax:

lmer(normexam ~ standLRT + schavg + (1 + standLRT | school), data=Exam)

> lmer(normexam ~ standLRT + schavg + (1 + standLRT | school), data=Exam)
Linear mixed-effects model fit by REML
Formula: normexam ~ standLRT + schavg + (1 + standLRT | school)
   Data: Exam
  AIC  BIC logLik MLdeviance REMLdeviance
 9336 9374  -4662       9310         9324
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 school   (Intercept) 0.077189 0.27783
          standLRT    0.015318 0.12377  0.373
 Residual             0.553604 0.74405
number of obs: 4059, groups: school, 65

Fixed effects:
             Estimate Std. Error t value
(Intercept) -0.001422   0.037253  -0.038
standLRT     0.552243   0.020352  27.135
schavg       0.294737   0.107262   2.748

Correlation of Fixed Effects:
         (Intr) stnLRT
standLRT  0.266
schavg    0.089 -0.085
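
If a group-level variable is not yet available at the individual level, it can be spread over the lower-level units before fitting. The sketch below uses a small made-up data frame (not the Exam data) and base R's ave(), which by default returns the group mean repeated for every member of the group.

```r
# Hypothetical toy data: five students in two schools
students <- data.frame(
  school = c("A", "A", "B", "B", "B"),
  intake = c(1, 3, 2, 4, 6)
)

# School average of intake, one value for every student
students$schavg <- ave(students$intake, students$school)
students$schavg   # 2 2 4 4 4

# Check: exactly one distinct value per school
tapply(students$schavg, students$school, function(x) length(unique(x)))
```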

random intercept, cross-level interaction

Finally, a cross-level interaction is specified. This basically works the same as any other interaction specified in R. In contrast with many other statistical packages, it is not necessary to calculate separate interaction variables (but you’re free to do so, of course).
In this example, the cross-level interaction between the student's score on the LR-test (standLRT) and the average intake score of the school (schavg) can be specified by entering a model formula containing standLRT * schavg. This leads to the following syntax and output.

lmer(normexam ~ standLRT * schavg + (1 + standLRT | school), data=Exam)

> lmer(normexam ~ standLRT * schavg + (1 + standLRT | school), data=Exam)
Linear mixed-effects model fit by REML
Formula: normexam ~ standLRT * schavg + (1 + standLRT | school)
   Data: Exam
  AIC  BIC logLik MLdeviance REMLdeviance
 9334 9379  -4660       9303         9320
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 school   (Intercept) 0.076326 0.27627
          standLRT    0.012240 0.11064  0.357
 Residual             0.553780 0.74416
number of obs: 4059, groups: school, 65

Fixed effects:
                Estimate Std. Error t value
(Intercept)     -0.00709    0.03713  -0.191
standLRT         0.55794    0.01915  29.134
schavg           0.37341    0.11094   3.366
standLRT:schavg  0.16182    0.05773   2.803

Correlation of Fixed Effects:
            (Intr) stnLRT schavg
standLRT     0.236
schavg       0.070 -0.064
stndLRT:sch -0.065  0.087  0.252
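
The four fixed effects in this output follow from how R expands the * operator. A base R sketch of that expansion, with made-up object names and the random part left out for brevity:

```r
f_star  <- normexam ~ standLRT * schavg
f_colon <- normexam ~ standLRT + schavg + standLRT:schavg

attr(terms(f_star), "term.labels")
# "standLRT" "schavg" "standLRT:schavg"

# Both ways of writing the model yield the same terms
identical(attr(terms(f_star),  "term.labels"),
          attr(terms(f_colon), "term.labels"))   # TRUE
```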

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.
——– —– — — – –