R-Sessions 18: Helper Functions


Several functions already present in R-Project are very useful when analyzing multilevel models or when preparing data to do so. Three of these helper functions will be described: aggregating data, the behavior of the plot() function when applied to a multilevel model and finally setting contrasts for categorical functions. Note that none of these functions are related to multilevel analysis only.

Aggregate

We will continue to work with the Exam-dataset that is made available through the mlmRev-package. In the syntax below we load that package and use the names() function to see what variables are available in the Exam data-set. One of the variables on individual level is the normexam-variable. Let’s say we want to aggregate this variable to the school-level, so that the new variable represents for each student the level of the exam-score at school level.

library(mlmRev)
names(Exam)

meansch <- tapply(Exam$normexam, Exam$school, mean)
meansch[1:10]
Exam$meanexam <- meansch[Exam$school]
names(Exam)

Using tapply(), we create a table of the normexam-variable by school, calculating the mean for each school. The result of this is stored in the meansch-variable. This variable now contains 65 values representing the mean score on the normexam-variable for each school. The first ten of these values is shown. This meansch-variable is only a temporary helping variable. Then we create a new variable in the Exam-data.frame called meanexam as well (this can be any name you want, as long as it does not already exist in the data.frame).

We give the correct value on this new variable to each respondent, by indexing the meansch by the school-index stored in the Exam-data.frame. When we look at the names of the Exam data.frame now, we see that our new variable is added to it.

> library(mlmRev)
> names(Exam)
 [1] "school"   "normexam" "schgend"  "schavg"   "vr"      
 [6] "intake"   "standLRT" "sex"      "type"     "student" 
[11] "meansch" 
> 
> meansch <- tapply(Exam$normexam, Exam$school, mean)
> meansch[1:10]
           1            2            3            4            5 
 0.501209573  0.783102291  0.855444696  0.073628516  0.403608663 
           6            7            8            9           10 
 0.944579967  0.391500023 -0.048192541 -0.435682141 -0.269390664 
> Exam$meanexam <- meansch[Exam$school]
> names(Exam)
 [1] "school"   "normexam" "schgend"  "schavg"   "vr"      
 [6] "intake"   "standLRT" "sex"      "type"     "student" 
[11] "meansch"  "meanexam"

plot (multilevel.model)

R-Project has many generic functions, that behave differently when different data is provided to it. A good example of this is the plot()-function. We have already seen how this function can be used to plot simple data. When it is used to plot an estimated multilevel model, it will extract the residuals from it and the result is a plot of the standardized residuals.

library(nlme)

model.01 <- lme(fixed = normexam ~ standLRT,
data = Exam,
random = ~ 1 | school)

plot(model.01, main="Residual plot of model.01")

In the syntax above, this is illustrated using a basic multilevel model, based on the Exam-data. First, the nlme-package is loaded, which results in a few warning messages (not shown here) that indicate that some data-sets are loaded twice (they were already loaded with the mlmRev-package). We assign the estimated model, that has a random intercept by schools and a predictor on the first (individual) level, to an object we have called ‘model.01′.

When this is plotted, we see that the residuals are distributed quite homoscedastically. The layout of the plot-window is quite different from what we have seen before. The reason for this is that here the lattice-package for advanced graphics is automatically used.

Contrast

When using categorical data in regression analyses (amongst other types of analysis), it is needed to code dummy-variables or contrasts. This can be done manually of course, as needs to be done in some other statistical packages, but in R-Project this is done automatically.

The ‘vr’-variable in the Exam-dataset represents “Student level Verbal Reasoning (VR) score band at intake – a factor. Levels are bottom 25%, mid 50%, and top 25%.” (quote from the description of the dataset, found by using ?Exam). In the syntax below, a summary of this variable is asked for, resulting in a confirmation that this variable contains three categories indeed.

Then, a basic model is estimated using the ‘vr’-variable and assigned to an object called model.02 (it is a random intercept model at school level containing two variables on individual level (‘vr’ and ‘standLRT’) and one variable on second level (‘schavg’). The summary of that model shows that it is the first category, representing the lowest Verbal Reasoning score band, that is used as reference category in the analyses.

summary(Exam$vr)

model.02 <- lme(fixed = normexam ~ vr + standLRT + schavg,
data = Exam,
random = ~ 1 | school)

summary(model.02)

> summary(Exam$vr)
bottom 25%    mid 50%    top 25% 
       640       2263       1156 
> 
> model.02 <- lme(fixed = normexam ~ vr + standLRT + schavg, 
+ 	data = Exam,
+ 	random = ~ 1 | school)
> 	
> summary(model.02)
Linear mixed-effects model fit by REML
 Data: Exam 
       AIC     BIC    logLik
  9379.377 9423.53 -4682.689

Random effects:
 Formula: ~1 | school
        (Intercept)  Residual
StdDev:   0.2844674 0.7523679

Fixed effects: normexam ~ vr + standLRT + schavg 
                 Value  Std.Error   DF  t-value p-value
(Intercept)  0.0714396 0.16717547 3993  0.42733  0.6692
vrmid 50%   -0.0834973 0.16341508   61 -0.51095  0.6112
vrtop 25%   -0.0537963 0.27433650   61 -0.19610  0.8452
standLRT     0.5594779 0.01253795 3993 44.62276  0.0000
schavg       0.4007454 0.28016707   61  1.43038  0.1577
 Correlation: 
          (Intr) vrm50% vrt25% stnLRT
vrmid 50% -0.946                     
vrtop 25% -0.945  0.886              
standLRT   0.000  0.000  0.000       
schavg     0.863 -0.794 -0.914 -0.045

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.71568977 -0.63052325  0.02781354  0.68294752  3.25580231 

Number of Observations: 4059
Number of Groups: 65 
> 

Should we want to change that, we need to use the contrast() function. When used on the ‘vr’-variable, it makes clear why the first category of the variable is used as reference: in the resulting matrix it is the only category without a ‘1’ on one of the columns.

To change this, we have two options at hand. First we use the contr.treatment()-function. We want the third category used as reference now. So, we specify that the contrasts should be treated in such a way that the categories are based on the levels of the ‘vr’-variable and that the base, or reference, should be the third. The result of the contr.treatment()-function looks exactly like the result of the contrasts()-function (except for a different reference category obviously). To change the used reference in an analysis, the result of the contr.treatment()-function should be assigned to the contrast(Exam$vr). The old contrasts -settings are replaced by the new ones in that way.

Instead of using the contr.treatment() function we can simply create a matrix ourselves and use this to change the contrasts. This is done below in order to make clear what is actually happening when the contr.treatment()-function is used. First the matrix is shown, then it is assigned to the contrasts of the ‘vr’-variable. Now we have chosen the middle category to be the reference. Although this works perfectly, the draw-back of this procedure is that the value-labels of the variable are lost, while these are maintained when contr.treatment() is used.

Finally, using the new contrasts-settings of the ‘vr’-variable, the same model is estimated again. In the output below, we see that indeed the value labels are gone now (because of using a basic matrix to set the contrasts) and that the middle category is used as reference, although the categories are now referred to as ‘1’ and ‘2’ which might lead to confusing interpretations.

contrasts(Exam$vr)
contrasts(Exam$vr) <- contr.treatment(levels(Exam$vr), base=3)
contrasts(Exam$vr)
matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)
contrasts(Exam$vr) <- matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)
contrasts(Exam$vr)

model.03 <- lme(fixed = normexam ~ vr + standLRT + schavg,
data = Exam,
random = ~ 1 | school)

summary(model.03)

> contrasts(Exam$vr)
           mid 50% top 25%
bottom 25%       0       0
mid 50%          1       0
top 25%          0       1
> contr.treatment(levels(Exam$vr), base=3)
           bottom 25% mid 50%
bottom 25%          1       0
mid 50%             0       1
top 25%             0       0
> contrasts(Exam$vr) <- contr.treatment(levels(Exam$vr), base=3)
> contrasts(Exam$vr)
           bottom 25% mid 50%
bottom 25%          1       0
mid 50%             0       1
top 25%             0       0
> matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)
     [,1] [,2]
[1,]    1    0
[2,]    0    0
[3,]    0    1
> contrasts(Exam$vr) <- matrix(data=c(1,0,0,0,0,1), nrow = 3, ncol = 2, byrow=FALSE)
> contrasts(Exam$vr)
           [,1] [,2]
bottom 25%    1    0
mid 50%       0    0
top 25%       0    1
> 
> model.03 <- lme(fixed = normexam ~ vr + standLRT + schavg, 
+ 	data = Exam,
+ 	random = ~ 1 | school)
> 	
> summary(model.03)
Linear mixed-effects model fit by REML
 Data: Exam 
       AIC     BIC    logLik
  9379.377 9423.53 -4682.689

Random effects:
 Formula: ~1 | school
        (Intercept)  Residual
StdDev:   0.2844674 0.7523679

Fixed effects: normexam ~ vr + standLRT + schavg 
                 Value  Std.Error   DF  t-value p-value
(Intercept) -0.0120577 0.05427674 3993 -0.22215  0.8242
vr1          0.0834973 0.16341508   61  0.51095  0.6112
vr2          0.0297010 0.15018792   61  0.19776  0.8439
standLRT     0.5594779 0.01253795 3993 44.62276  0.0000
schavg       0.4007454 0.28016707   61  1.43038  0.1577
 Correlation: 
         (Intr) vr1    vr2    stnLRT
vr1      -0.096                     
vr2      -0.551 -0.530              
standLRT  0.000  0.000  0.000       
schavg    0.267  0.794 -0.806 -0.045

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.71568977 -0.63052325  0.02781354  0.68294752  3.25580231 

Number of Observations: 4059
Number of Groups: 65 
> 

– – — — —– ——–

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.
——– —– — — – –

One comment on “R-Sessions 18: Helper Functions

Leave a Reply