Rense Nieuwenhuis » open-source

New version of WEC: focus on interactions

Rense Nieuwenhuis — Tue, 17 Jan 2017 11:00:52 +0000

We have uploaded a new version of WEC, an R package to apply ‘weighted effect coding’ to your dummy variables. With weighted effect coding, your dummy variables represent the deviation of their respective category from the sample mean, rather than the deviation from a reference category. Particularly with observational data, which are often unbalanced, this can have attractive interpretations. We recently published two articles in which we discuss some of the advantages:

Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2016b). When size matters: advantages of weighted effect coding in observational studies. International Journal of Public Health, 1–5. http://doi.org/10.1007/s00038-016-0901-1

Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2016a). A novel method for modelling interaction between categorical variables. International Journal of Public Health, 1–5. http://doi.org/10.1007/s00038-016-0902-0

As some of the real advantages of weighted effect coding come into play when using interactions, that is what we focused on in the current update to our ‘wec’ package (version 0.4). The package now supports interactions between a weighted effect coded factor variable and an interval variable, and the calculation of interactions between two weighted effect coded factor variables has been much improved. An example is given below (with more to follow, hopefully soon).

library(wec)
data(PUMS)
PUMS$race.wec <- factor(PUMS$race)
contrasts(PUMS$race.wec) <- contr.wec(PUMS$race.wec, "White")
PUMS$race.educint <- wec.interact(PUMS$race.wec, PUMS$education.int)
m.wec.educ <- lm(wage ~ race.wec + education.int + race.educint, data=PUMS)
summary(m.wec.educ)$coefficients

The code above results in a regression model (shown below) in which the main effect for education (9048) remains the same, whether the interaction terms are included or not (you can try this yourself). Thus, the interaction terms represent how much the average education effect varies by race.

                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                     52320        559    93.5  0.0e+00
race.wecHispanic                -4955       1736    -2.9  4.3e-03
race.wecBlack                  -11276       1817    -6.2  5.7e-10
race.wecAsian                    5151       2381     2.2  3.1e-02
education.int                    9048        287    31.6 2.3e-208
race.educintinteractHispanic    -3266        977    -3.3  8.3e-04
race.educintinteractBlack       -3293        990    -3.3  8.8e-04
race.educintinteractAsian        3575       1217     2.9  3.3e-03
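To make the idea behind weighted effect coding concrete, the sketch below builds the contrast matrix by hand in base R for a single factor. It is not the implementation used in the wec package, only an illustration under stated assumptions: the toy data (`g`, `y`) are hypothetical, and the coding is derived from the definition that each dummy represents the deviation of its category from the sample mean.

```r
# Hypothetical toy data: three unbalanced groups (5, 3 and 2 observations).
g <- factor(rep(c("a", "b", "c"), times = c(5, 3, 2)))
y <- c(10, 12, 11, 13, 14, 20, 22, 21, 30, 32)

# Group sizes as a plain named vector.
n <- sapply(split(y, g), length)

# Start from ordinary dummy (treatment) coding with "a" as the omitted
# category, then replace the all-zero row of "a" by the negative ratios
# of the group sizes. This is what makes the coding "weighted".
W <- contr.treatment(levels(g))
W["a", ] <- -n[c("b", "c")] / n["a"]
contrasts(g) <- W

fit <- lm(y ~ g)
coef(fit)

# For comparison: the sample mean, and each group's deviation from it.
mean(y)
tapply(y, g, mean) - mean(y)
```

In this toy example the intercept matches the sample mean, and the two coefficients match the deviations of groups ‘b’ and ‘c’ from that mean, rather than their distance to the omitted group ‘a’.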

Update influence.ME, or why I love the open source community

Rense Nieuwenhuis — Wed, 17 Aug 2016 11:39:28 +0000

The other day, Kevin Darras contacted me about my R package influence.ME. The package didn’t work with the kind of models he wanted to estimate, and Kevin was looking for a solution. He had been able to go ‘under the hood’ of the program code in influence.ME and to program a solution, which he kindly shared with me. After some testing, and some adjustments, the influence.ME package is now updated and uploaded to CRAN, available for anyone to use. That’s well within a week after his first e-mail.

This is why I love the open source community so much. Users can extend influence.ME, and all other R packages, to do things that the package authors or maintainers did not implement, to check procedures, or to fix mistakes. Moreover, in line with the positive attitude towards sharing in the open source community, the improved code was shared back so that other users can benefit.

So, thanks to the help of the community, I am happy to announce an update to influence.ME, with two improvements:

influence.ME now better handles binomial models
influence.ME now supports functions inside the model call; for instance:
model.a <- lmer(math ~ structure + scale(SES) + (1 | school.ID), data=school23)

influence.ME is an extension package for the R statistical software. It provides tools for detecting influential data in multilevel regression models (also known as mixed effects models). It was introduced in the R Journal (Nieuwenhuis, Te Grotenhuis & Pelzer, 2012). influence.ME can be downloaded from CRAN from within the R software.

Nieuwenhuis, R., Te Grotenhuis, H. F., & Pelzer, B. J. (2012). influence.ME: Tools for detecting influential data in mixed effects models. R Journal, 4(2), 38–47.

R-Sessions 11: Tables

Rense Nieuwenhuis — Fri, 15 Aug 2008 10:00:27 +0000

One of the most frequently used operations in the analysis of statistical data is the creation of tables. This edition of the R-Sessions describes the use of several functions to do some nifty cross-tabulations. And more.

TAPPLY

The function tapply() can be used to perform calculations on table marginals. Different functions can be applied, such as mean, sum, var, sd, and length (for frequency tables). Note that R is case-sensitive, so these names must be typed in lower case. For example:

x <- c(0,1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,1,1,1,2,2,2,2)
tapply(x,y,mean)
tapply(x,y,sum)
tapply(x,y,var)
tapply(x,y,length)

> x <- c(0,1,2,3,4,5,6,7,8,9)
> y <- c(1,1,1,1,1,1,2,2,2,2)
> tapply(x,y,mean)
  1     2
2.5   7.5
> tapply(x,y,sum)
 1  2
15 30
> tapply(x,y,var)
       1        2
3.500000 1.666667
> tapply(x,y,length)
1 2
6 4
>
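As a small extension of the example above, tapply() also accepts a list of grouping variables, in which case it returns one value per combination of groups. The sketch below assumes a second grouping variable `z` (the same values as used in the FTABLE example further on) and additionally shows sd as the applied function.

```r
x <- c(0,1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,1,1,1,2,2,2,2)
z <- c(1,1,1,2,2,2,2,2,1,1)   # second grouping variable, assumed for illustration

# Standard deviation of x within each group of y
sd.by.y <- tapply(x, y, sd)
sd.by.y

# Mean of x for every combination of y and z; the result is a
# matrix with the groups of y in the rows and those of z in the columns
mean.by.yz <- tapply(x, list(y = y, z = z), mean)
mean.by.yz
```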

FTABLE

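More elaborate frequency tables can be created with the ftable() function, which ‘flattens’ a cross-tabulation of three or more variables into a single two-dimensional table. For example: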

x <- c(0,1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,1,1,1,2,2,2,2)
z <- c(1,1,1,2,2,2,2,2,1,1)
ftable(x,y,z)

> x <- c(0,1,2,3,4,5,6,7,8,9)
> y <- c(1,1,1,1,1,1,2,2,2,2)
> z <- c(1,1,1,2,2,2,2,2,1,1)
> ftable(x,y,z)
    z 1 2
x y
0 1   1 0
  2   0 0
1 1   1 0
  2   0 0
2 1   1 0
  2   0 0
3 1   0 1
  2   0 0
4 1   0 1
  2   0 0
5 1   0 1
  2   0 0
6 1   0 0
  2   0 1
7 1   0 0
  2   0 1
8 1   0 0
  2   1 0
9 1   0 0
  2   1 0
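The table above is long because x takes ten different values. A more compact flat table can be obtained by first collapsing x into classes. The sketch below does this with cut(); the class boundaries and the labels ‘low’ and ‘high’ are assumptions made for illustration only. It also shows the row.vars and col.vars arguments of ftable(), which control which variables end up in the rows and which in the columns.

```r
x <- c(0,1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,1,1,1,2,2,2,2)
z <- c(1,1,1,2,2,2,2,2,1,1)

# Collapse x into two classes: 0 to 4 ("low") and 5 to 9 ("high")
x.class <- cut(x, breaks = c(-Inf, 4, Inf), labels = c("low", "high"))

# Three-way frequency table, then its flattened version
tab <- table(x.class, y, z)
ftable(tab)

# The same counts, with z and y in the rows and x.class in the columns
ftable(tab, row.vars = c("z", "y"), col.vars = "x.class")
```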

– – — — —– ——–

Discuss this article and pose additional questions in the R-Sessions Forum

Find the original article embedded in the manual.

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, as it is paid for by advertisements, but please cite it in any work it inspires. Feedback and topic requests are highly appreciated.
——– —– — — – –

R-Sessions 10: Conditionals

Rense Nieuwenhuis — Wed, 13 Aug 2008 10:00:21 +0000

Conditionals, or logicals, are used to check vectors of data against conditions. In practice, this is used to select subsets of data or to recode values. Here, only some of the fundamentals of conditionals are described.

Basics

The general form of a conditional is two values, or two sets of values, and the condition to test against. Examples of such tests are ‘is larger than’, ‘equals’, and ‘is smaller than’. In the example below the values ‘3’ and ‘4’ are tested using these three tests.

3 > 4
3 == 4
3 < 4

> 3 > 4
[1] FALSE
> 3 == 4
[1] FALSE
> 3 < 4
[1] TRUE

Numerical returns

The output shown directly above makes clear that R-Project returns the values ‘TRUE’ and ‘FALSE’ to conditional tests. The results here are pretty straightforward: 3 is not larger than 4, therefore R returns FALSE. If you don’t desire TRUE or FALSE as response, but a numeric output, use the as.numeric() command which transforms the values to numerics, in this case ‘0’ or ‘1’. This is shown below.

as.numeric(3 > 4)
as.numeric(3 < 4)

> as.numeric(3 > 4)
[1] 0
> as.numeric(3 < 4)
[1] 1

Conditionals on vectors

As with most functionality in R-Project, conditionals work on vectors (multiple values) as well as on single values. Testing a vector against a condition results in a succession of tests, one for each value in the vector, and the output is a vector of ‘TRUE’ and ‘FALSE’ values. The examples below show two things. First, the values 1 to 10 are tested against the condition ‘is smaller than or equal to 5’. Second, when these values are assigned to a variable (here: ‘x’), this variable can be tested against the same condition, giving exactly the same results.

1:10
1:10 <= 5
x <- 1:10
x <= 5

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 1:10 <= 5
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> x <- 1:10
> x <= 5
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
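Because ‘TRUE’ and ‘FALSE’ are treated as 1 and 0 in calculations, a vector of logicals can be counted or averaged directly, and it can also be used to select the matching values. The following is a minimal sketch of these uses; the variable names `small`, `n.small`, `prop.small` and `x.small` are chosen here for illustration.

```r
x <- 1:10
small <- x <= 5            # vector of logicals, one per value of x

n.small <- sum(small)      # number of values that meet the condition
prop.small <- mean(small)  # proportion of values that meet the condition
x.small <- x[small]        # the values themselves

n.small
prop.small
x.small
```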

Conditionals and multiple tests

Several tests can be combined into one conditional expression. For instance, building on the example above, the first row of the next example tests whether the values of variable ‘x’ are smaller than or equal to 4, or larger than or equal to 6. This results in ‘TRUE’ for all the values except 5. Since the ‘|’ operator is used, only one of the conditions needs to be true. The second row tests the same values against two conditions as well, namely ‘equal to or larger than 4’ and ‘equal to or smaller than 6’. Since this time the ‘&’ operator is used, both conditions need to be true.

x <= 4 | x >= 6
x >= 4 & x <= 6

> x <= 4 | x >= 6
 [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> x >= 4 & x <= 6
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
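Combined conditions can be stored, negated with the ‘!’ operator, and summarised with any() and all(). The sketch below is one possible illustration; the names `inside` and `outside` are assumptions introduced here.

```r
x <- 1:10

inside <- x >= 4 & x <= 6   # both conditions must hold
outside <- !inside          # negation: TRUE becomes FALSE and vice versa

# Negating 'a & b' gives the same result as 'not a | not b'
identical(outside, x < 4 | x > 6)

x[inside]      # the values that lie between 4 and 6
any(x > 9)     # is at least one value larger than 9?
all(x > 0)     # are all values larger than 0?
```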

Conditionals on character values

In the example below, a string variable ‘gender’ is constructed, containing the values ‘male’ and ‘female’ (first row). The second row tests which of these values equal ‘male’.

gender <- c("male","female","female","male","male","male","female")
gender == "male"

> gender <- c("male","female","female","male","male","male","female")
> gender == "male"
[1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE

Additional functions

The last examples demonstrate two other functions using conditionals, using the same ‘gender’ variable as above. The first is an additional way to get a numerical output of the same test as in the row above. The ifelse() command has three arguments: the first is a conditional, the second is the desired output if the conditional is ‘TRUE’, and the third is the output in case the result of the test is ‘FALSE’. The second example shows a way to obtain a list of which values match the condition tested against. In the output above, the second, third and last values are ‘female’. Using which() with the condition gender == "male" returns the indices of the values in variable ‘gender’ that equal ‘male’.

ifelse(gender=="male",0,1)
which(gender=="male")

> ifelse(gender=="male",0,1)
[1] 0 1 1 0 0 0 1
> which(gender=="male")
[1] 1 4 5 6

Conditionals on missing values

Missing values (‘NA’) form a special case in many ways, such as when using conditionals. Normal conditionals cannot be used to find the missing values in a range of values, as is shown below.

x <- c(4,3,6,NA,4,3,NA)
x == NA
which(x == NA)
is.na(x)
which(is.na(x))

The last two rows of the syntax above show what can be done. The is.na() command tests whether a value or a vector of values is missing. It returns a vector of logicals (‘TRUE’ or ‘FALSE’), that indicates missing values with a ‘TRUE’. Nesting this command in the which() command described earlier enables us to find which of the values are missing. In this case, the fourth and the seventh values are missing.

> x <- c(4,3,6,NA,4,3,NA)
> x == NA
[1] NA NA NA NA NA NA NA
> which(x == NA)
integer(0)
> is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
> which(is.na(x))
[1] 4 7
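A few related operations build directly on is.na(). The sketch below counts the missing values, keeps only the valid ones, and shows that a missing value also propagates through other comparisons; the names `n.missing` and `x.valid` are assumptions introduced for this illustration.

```r
x <- c(4,3,6,NA,4,3,NA)

n.missing <- sum(is.na(x))   # number of missing values
x.valid <- x[!is.na(x)]      # only the non-missing values
anyNA(x)                     # is there at least one missing value?

n.missing
x.valid

# A comparison involving NA returns NA for that element ...
x > 3
# ... while which() only reports the elements that are TRUE
which(x > 3)
```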


R-Sessions 09: Data Manipulation

Rense Nieuwenhuis — Mon, 11 Aug 2008 10:00:39 +0000

Today’s edition of R-Sessions deals with the manipulation of data that is stored in R-Project. Building upon the previous R-Session, attention is paid to the recoding of data, ordering, and finally the merging of several sets of data.

Recoding

The most direct way to recode data in R-Project is using a combination of both indexing and conditionals as described elsewhere. To exemplify this, a simple data.frame will be created below, containing variables indicating gender and monthly income in thousands of euros.

gender <- c("male", "female", "female", "male", "male", "male", "female")
income <- c(54, 34, 556, 57, 88, 856, 23)
data <- data.frame(gender, income)
data

> gender <- c("male", "female", "female", "male", "male", "male", "female")
> income <- c(54, 34, 556, 57, 88, 856, 23)
> data <- data.frame(gender, income)
> data
  gender income
1   male     54
2 female     34
3 female    556
4   male     57
5   male     88
6   male    856
7 female     23

Some of the values on the income variable seem exceptionally high. Let’s say we want to remove the two values on income higher than 500. In order to do so, we first use the which() command to reveal which of the values are greater than 500. Next, the same condition is used for indexing the data$income variable. Finally, the indicator for missing values, ‘NA’, is assigned to the selected values of the ‘income’ variable. Obviously, we would normally only use the third line. The first two are shown here to make clear exactly what is happening.

which(data$income > 500)
data$income[data$income > 500]
data$income[data$income > 500] <- NA
data

> which(data$income > 500)
[1] 3 6
> data$income[data$income > 500]
[1] 556 856
> data$income[data$income > 500] <- NA
> data
  gender income
1   male     54
2 female     34
3 female     NA
4   male     57
5   male     88
6   male     NA
7 female     23
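The recoding above overwrites the original values. An alternative, sketched below, is to write the recoded values to a new variable with ifelse(), so that the original ‘income’ variable stays available. The variable names `income.clean` and `income.high` are assumptions introduced for this illustration.

```r
gender <- c("male", "female", "female", "male", "male", "male", "female")
income <- c(54, 34, 556, 57, 88, 856, 23)
data <- data.frame(gender, income)

# New variable: NA where income exceeds 500, the original value otherwise
data$income.clean <- ifelse(data$income > 500, NA, data$income)

# Logical flag that records which rows were recoded
data$income.high <- data$income > 500

data
```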

Sometimes, it is desirable to replace missing values by the mean of the respective variable. That is what we are going to do here. Note that in general practice it is not very sensible to impute two missing values using only five valid values. Nevertheless, we will proceed here.
The first row of the example below shows that the mean of a variable that contains missing values is not calculated automatically. Since R-Project cannot compute a valid value, NA is returned. This is not what we want. Therefore, we instruct R-Project to remove missing values by adding na.rm=TRUE to the mean() command. Now, the right value is returned. The selection technique used above does not work for finding the missing values, because a comparison with NA returns NA rather than ‘TRUE’ or ‘FALSE’. Therefore, we need the is.na() command, which returns a vector of logicals (‘TRUE’ and ‘FALSE’). Using is.na(), we can use the which() command to select the desired values on the income variable. To these, the calculated mean is assigned.

mean(data$income)
mean(data$income, na.rm=TRUE)
data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
data

> mean(data$income)
[1] NA
> mean(data$income, na.rm=TRUE)
[1] 51.2
> data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
> data
  gender income
1   male   54.0
2 female   34.0
3 female   51.2
4   male   57.0
5   male   88.0
6   male   51.2
7 female   23.0
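The same imputation can be written without which(), since a vector of logicals can be used for indexing directly. The sketch below also stores which values were imputed, which is often useful afterwards; the name `imputed` is an assumption introduced here.

```r
income <- c(54, 34, NA, 57, 88, NA, 23)

# Remember which values were missing before anything is replaced
imputed <- is.na(income)

# The mean of the valid values is computed first, then assigned
# to the positions where 'imputed' is TRUE
income[imputed] <- mean(income, na.rm = TRUE)

income
imputed
```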

ORDER

It is easy to sort a data-frame using the command order. Combined with indexing functions, it works as follows:

x <- c(1,3,5,4,2)
y <- c('a','b','c','d','e')
df <- data.frame(x,y)

df
  x y
1 1 a
2 3 b
3 5 c
4 4 d
5 2 e

df[order(df$x),]
  x y
1 1 a
5 2 e
2 3 b
4 4 d
3 5 c
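order() can also sort in descending order and on more than one variable. The sketch below extends the example with a hypothetical grouping variable `g` to show both.

```r
x <- c(1,3,5,4,2)
y <- c('a','b','c','d','e')
g <- c(2,1,2,1,2)          # hypothetical grouping variable
df <- data.frame(x, y, g)

# Descending order on x
df.desc <- df[order(df$x, decreasing = TRUE), ]
df.desc

# Ascending on g, and within each group descending on x
# (the minus sign reverses the order of a numeric variable)
df.multi <- df[order(df$g, -df$x), ]
df.multi
```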

MERGE

The merge() function joins two data.frames, based on an identifier variable (or a combination of variables) that uniquely identifies the rows and is present in both.

x <- c(1,2,5,4,3)
y <- c(1,2,3,4,5)
z <- c('a','b','c','d','e')

df1 <- data.frame(x,y)
df2 <- data.frame(x,z)
df3 <- merge(df1,df2,by=c("x"))

 df3
  x y z
1 1 1 a
2 2 2 b
3 3 5 e
4 4 4 d
5 5 3 c
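In the example above every value of x occurs in both data.frames. When that is not the case, the arguments all, all.x and all.y of merge() determine which rows are kept. The sketch below uses two small hypothetical data.frames to illustrate this.

```r
# Hypothetical data: x = 1 only occurs in df1, x = 4 only in df2
df1 <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
df2 <- data.frame(x = c(2, 3, 4), z = c('b', 'c', 'd'))

# Default: only rows with a match in both data.frames
inner <- merge(df1, df2, by = "x")
inner

# Keep all rows of df1; z is NA where df2 has no match
left <- merge(df1, df2, by = "x", all.x = TRUE)
left

# Keep all rows of both; unmatched cells are filled with NA
full <- merge(df1, df2, by = "x", all = TRUE)
full
```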

R-Sessions 05: Getting Help

Rense Nieuwenhuis — Fri, 01 Aug 2008 10:00:14 +0000

In keeping with the spirit of the open source community, R-Project is accompanied by many additional sources of help. Most of them are freely available.

The help() – function

R-Project has a help function built in. This functionality is focused on informing the user about the parameters that functions have. For almost all functions some examples are given as well.

A general help page is available, which contains several introductory documents such as ‘An Introduction to R’, ‘Frequently Asked Questions’, and ‘The R Language Definition’. More advanced documents are made available as well, such as ‘Writing R Extensions’ and ‘R Internals’. This general help page is called for by entering:

help.start()

To obtain help on a specific function, you use help() with the name of the function between the brackets. For instance, if you want help on the plot() function, use the following syntax:

help(plot)

This results in a page that gives a short definition of the function, shows the parameters of the function, links to related functions, and finally gives some examples.
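When the exact name of a function is not known, or only its arguments need to be checked, a few other built-in functions can help. The sketch below uses apropos() and args(); tapply is used merely as an example of a function to look up.

```r
# Find all objects whose name contains a given piece of text
found <- apropos("tapply")
found

# Show only the arguments of a function, without opening its help page
args(tapply)

# ?plot is shorthand for help(plot)
# help.search("regression") searches the installed documentation for a topic
```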

Freely available documents

More elaborate documents can be found on the website of R-Project (http://www.r-project.org) in the documents section. This can be found by clicking on ‘manuals’ from the home-page, just below the ‘documents’ header. First, a couple of documents written by the core development team of R-Project are offered, but don’t forget to click on the ‘Contributed Documentation’ link, which leads to many more documents, often of a very high quality.

Books on R-Project

Many books have been written on R-Project, ranging from very basic-level introductions to the ones that address the fundamental parts of the software. In this manual I review some of these books, which I can recommend to every beginning or more advanced user of R-Project:

An R and S-PLUS Companion to Applied Regression, by John Fox
Introductory Statistics with R, by Peter Dalgaard
Data Analysis Using Regression and Multilevel / Hierarchical Models, by Andrew Gelman and Jennifer Hill

R-help mailinglist

When all help fails, there is always the R-Help mailing-list. This is a service where all members receive the e-mails that are sent to a specific address. The quality and speed of the given answers and solutions are often very high. Questions are asked and answered many times a day, so be prepared to receive a high volume of e-mail when signing up for this service.

More information on the R-help mailing-list, as well as the ability to sign-up, can be found on: https://stat.ethz.ch/mailman/listinfo/r-help
