Rense Nieuwenhuis » regression

Weighted Effect Coding: New publication in the R Journal

Rense Nieuwenhuis — Mon, 03 Jul 2017 08:00:21 +0000

Weighted effect coding is a technique for dummy coding that can have attractive properties, particularly when analysing observational data. In a new publication in the R Journal we explain the rationale of weighted effect coding, introduce the ‘wec’ package, and provide examples that include interactions.

The attractive property of applying weighted effect coding to categorical (‘factor’) variables is that the estimate for each category represents the deviation of that category from the sample mean. This is unlike the more commonly used treatment coding, where a specific category has to be selected as a reference. Weighted effect coding is a generalized form of effect coding that applies to both balanced and unbalanced data.
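To make the idea concrete, here is a minimal sketch that builds a weighted effect coding matrix by hand in base R. It is an illustration under assumptions, not the implementation in the ‘wec’ package: the mtcars data, the object names, and the choice of the omitted category are picked purely for the example.

```r
# Weighted effect coding built by hand in base R.
# Assumptions: mtcars is only an illustrative data set, and the choice of
# "8" as the omitted category is arbitrary.
dat <- data.frame(mpg = mtcars$mpg, cyl = factor(mtcars$cyl))

lev  <- levels(dat$cyl)
n    <- setNames(as.vector(table(dat$cyl)), lev)  # unbalanced category sizes
ref  <- "8"                                       # category without its own coefficient
keep <- setdiff(lev, ref)

# Start from treatment (dummy) coding, then replace the zeros in the omitted
# category's row by minus the ratio of category sizes.
wec_matrix <- contr.treatment(lev, base = match(ref, lev))
wec_matrix[ref, keep] <- -n[keep] / n[ref]
contrasts(dat$cyl) <- wec_matrix

fit <- lm(mpg ~ cyl, data = dat)

# The intercept is the sample mean; each coefficient is a category's
# deviation from that mean.
coef(fit)
mean(dat$mpg)
tapply(dat$mpg, dat$cyl, mean) - mean(dat$mpg)
```

Comparing the three printed results shows the property described above: the intercept matches the sample mean, and the coefficients match the category deviations, even though the categories differ in size.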

A form of weighted effect coding was already formulated in 1972 by Sweeney and Ulveling, but it seems to never have found its place in statistical repertoires. Weighted effect coding was not implemented in mainstream statistical software. In an ongoing project, we have now further developed weighted effect coding to also apply to interactions (with both categorical and continuous variables), and provide procedures for mainstream statistical software. For R, we developed the ‘wec’ package, and procedures for STATA and SPSS are available as well.

A key innovation in our article in the R Journal is the formulation of interactions between a categorical variable and a continuous variable. This is visualised in the Figure above. The benefit of estimating such an interaction with weighted effect coding is that upon entering the interaction terms the estimate for the continuous variable (as well as the ‘main effects’ for the categorical variable) does not change. The ‘main’ continuous term reflects the average effect in the sample, and the interaction terms represent the deviation of the effect size for each category.

References

Te Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2017b). A novel method for modelling interaction between categorical variables. International Journal of Public Health, 62(3), 427–431. (open access!)

Te Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2017a). When size matters: advantages of weighted effect coding in observational studies. International Journal of Public Health, 62, 163–167. (open access!)

Nieuwenhuis, R., Te Grotenhuis, M., & Pelzer, B. (2017). Weighted Effect Coding for Observational Data with wec. R Journal, 9(1), 477–485. (open access!)

Sweeney, R. E., & Ulveling, E. F. (1972). A transformation for simplifying the interpretation of coefficients of binary variables in regression analysis. The American Statistician.

Super Crunchers – Ayres (2007) – 1/2

Rense Nieuwenhuis — Wed, 30 Sep 2009 10:00:30 +0000

With The Triumph of Numbers, I read and wrote about the power of using numbers, and how the observation of empirical regularities led to basic knowledge of how to use such numbers. The Triumph of Numbers already indicated how valuable (numerical) data were considered to be, for instance by recalling how the first censuses were treated as state secrets, because the information could be used to make assertions about the military strength of (rival) nations.

Unfortunately, I.B. Cohen’s Triumph of Numbers ended quite abruptly with a description of Florence Nightingale. It felt unfinished. But the use of numbers has evolved since, and quite substantially so.

How much our use of numerical data has evolved, and to what extent it has invaded our daily lives (without many of us knowing it!), is convincingly described by Ian Ayres in his magnificent book ‘Super Crunchers’ (2007).

Companies know more and more (and more!) about you: you buy products online, you speak with the customer relations department (with a person behind a computer), you gain discounts with customer cards, and of course you are careful to make sure you receive your frequent flyer miles. Right? If not, you may have bought it all using a credit card, the transactions of which are stored anyway.

So the companies you buy from know all this, because they have learned to store all this precious information. And using this information – and believe me, we’re dealing with massive amounts of data – each of these companies crunches the data and is able to predict very precisely what each of its customers will do next. Grocery stores successfully predict what to stock next summer, based on what they sold weeks or months ago. Casinos know how to predict how much money each individual customer is willing to lose before leaving (it’s actually called the ‘pain point’). You can be sure that when a gambler reaches this pain point, an employee of the casino steps forward to offer him or her an incentive to stay (e.g. a free drink or meal). Airlines predict when you will become dissatisfied with their service (e.g. because they lost your baggage too often), and will upgrade your seat (for free) just before you would start flying with another company.

The list of excellent examples goes on and on. But the general – and possibly frightening – conclusion drawn by Ian Ayres is that if a company starts giving you gifts, you have probably paid too much.

Introducing Influence.ME: Tools for detecting influential data in mixed models

Rense Nieuwenhuis — Wed, 29 Apr 2009 09:03:25 +0000

I’m highly excited to announce that influence.ME is now available. Influence.ME is a new software package for R, providing statistical tools for detecting influential data in mixed models. It has been developed by Rense Nieuwenhuis, Ben Pelzer, and Manfred te Grotenhuis. The basic rationale behind identifying influential data is that when single units are omitted from the data one at a time, models based on these data should not produce substantially different estimates. To standardize the assessment of how influential data are, several measures of influence are commonly used, such as DFBETAS and Cook’s Distance.
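The deletion idea can be sketched in a few lines of base R. This is not the influence.ME interface, only an illustration of the rationale under simplifying assumptions: an ordinary lm() stands in for a mixed model, the built-in ChickWeight data stand in for real research data, and the raw change in a slope is used instead of a standardized measure such as DFBETAS.

```r
# Leave-one-group-out deletion, sketched with lm() on a base R data set.
# Assumption: an ordinary linear model stands in for the mixed model.
cw <- data.frame(weight = ChickWeight$weight,
                 Time   = ChickWeight$Time,
                 Chick  = as.character(ChickWeight$Chick))

full <- lm(weight ~ Time, data = cw)

# Refit the model once per chick, each time leaving that chick out, and
# record how much the slope for Time changes.
chicks <- unique(cw$Chick)
slope_change <- sapply(chicks, function(id) {
  reduced <- lm(weight ~ Time, data = cw[cw$Chick != id, ])
  coef(reduced)[["Time"]] - coef(full)[["Time"]]
})

# Chicks whose omission shifts the slope the most
head(sort(abs(slope_change), decreasing = TRUE), 3)
```

Units at the top of this list are the candidates for a closer look. Standardized measures make such changes comparable across parameters and models.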

Mixed effects regression models are becoming common practice in the Social Sciences. However, diagnostic tools to evaluate these models lag behind. For instance, there is no generally applicable tool to check whether all units (or cases) have roughly the same influence on the regression parameters. It is, however, commonly accepted that tests for influential cases should be performed, especially when estimates are based on a relatively small number of cases. Testing for influence in mixed effects models is especially important in Social Science applications, for two reasons. First, models in the Social Sciences are frequently based on large numbers of individuals, while the number of higher-level units is often relatively small. Second, the higher-level units are often remarkably similar, for instance in the case of neighboring countries. Influence.ME is a new package for R which provides two innovations for evaluating influential cases: it extends existing procedures for use with mixed effects models, and it allows searching not only for single influential cases, but also for combinations of cases that together exert too much influence.

I plan to use my blog to provide more information about influence.ME. For instance, you can expect some example analyses soon. Other developments, new features, or exciting applications in research papers will be discussed here as well in due time. A static page on influence.ME is available as well, where all important information is collected.

Questions, comments, thoughts, experiences, notes on bugs (and other vermin), feature requests, and more: all are highly appreciated. They can be sent by e-mail, or placed in the comments-section on this blog.

R-Sessions 25: Book – Mixed Effects Models in S and S-PLUS (Pinheiro & Bates, 2000)

Rense Nieuwenhuis — Wed, 01 Oct 2008 10:00:51 +0000

Despite the reference to S and S-PLUS in the title of this book, it offers an excellent guide to the nlme-package in R-Project. The reason for this is the close resemblance between R and S. The nlme-package, available in R-Project for estimation of both linear and non-linear multilevel models, is written and maintained by the authors of this book.

The book is not an introduction to R. Basic knowledge of R-Project (or S / S-PLUS) is required to get the most out of it, as well as some knowledge of multilevel theory. Although the book forms a thorough introduction to multilevel modeling, addressing some theory, the mathematics, and of course the estimation and specification in R-Project (or S / S-PLUS), the learning curve it offers is quite steep. The authors do not shy away from applying matrix algebra, and they specify exactly which estimation procedures are used.

Not only the specification of basic models is described, but many other subjects are brought up. A specific grouped-data object is considered, as well as ways to visualize hierarchical data and multilevel models. Heteroscedasticity, often a violation of assumptions, can be caught in the models easily, as is described clearly in one of the chapters. Finally, not only linear models are tackled, but non-linear models as well.
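To give an impression of what modelling heteroscedasticity looks like in nlme, here is a minimal sketch. It is not taken from this review: it assumes the Orthodont data that ship with nlme, and it lets the residual variance differ between boys and girls through the weights argument with varIdent().

```r
library(nlme)  # ships with standard R installations

# Random-intercept model with one residual variance for everyone
fit_hom <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Same model, but with a separate residual variance for boys and girls
fit_het <- update(fit_hom, weights = varIdent(form = ~ 1 | Sex))

# Likelihood-ratio comparison of the two variance structures
anova(fit_hom, fit_het)
```

The comparison in the last line indicates whether allowing for unequal variances improves the fit enough to justify the extra parameter.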

All in all, this book is an excellent addition for those who have prior knowledge of both R-Project and multilevel analysis. Using real-data examples and providing tons of output, the authors succeed in making clear the necessity of the more complex models, and thereby invite the reader to invest time in the more fundamental aspects of multilevel analysis.

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.
——– —– — — – –

R-Sessions 23: Book: Data Analysis Using Regression and Multilevel/Hierarchical Models — Gelman & Hill (2007)

Rense Nieuwenhuis — Tue, 23 Sep 2008 10:00:05 +0000

Data Analysis Using Regression and Multilevel/Hierarchical Models

Andrew Gelman is known for his expertise in Bayesian statistics. Based on that knowledge, he wrote, together with Jennifer Hill, a book on multilevel regression using R and WinBUGS. This book aims to be a thorough description of (multilevel) regression techniques, the implementation of these techniques in R and BUGS, and a guide to interpreting the results of your analyses. In short, the book excels on all three counts.

Admittedly, this review is based on first impressions of the book. But a sunny day in the park reading this book (literally) led me to believe that I have some understanding of what this book is trying to achieve. I bought this book in order to get an overview of fitting multilevel regression models using R. Once I started reading, I soon found out that this is indeed what it has to offer, but it offers a lot more. After some introductory chapters, the book starts off with an introduction to linear regression and at the same time introduces the reader to R, by showing how to fit linear regression models in R. This is readily expanded to logistic regression and generalized linear models. Everything is richly illustrated with many examples and figures.

Before these ‘basic’ regression models are extended to multilevel models, Bayesian statistics are introduced. Using simulation techniques, causal inferences are made based on regression models. The multilevel section of the book is set up similarly. First, ‘basic’ multilevel regression models are introduced. Throughout the book, the lmer function is used. This function is not only able to fit simple multilevel models, but logistic and generalized linear models as well. It can even estimate non-nested models. All in all, this forms a thorough introduction to multilevel regression analysis in itself, but here too the book continues by introducing the reader to Bayesian statistics.

All above-mentioned models, as well as more complicated models, are fitted using WinBUGS as well. This very flexible method allows the reader to estimate a greater variety of (multilevel) models. Causal inference on multilevel models, using Bayesian statistics, is described as well. The third main part of the book builds on the skills the reader needs for ‘just’ fitting models. It teaches the reader to really think about what is going on. Topics such as ‘understanding and summarizing the fitted models’, ‘sample size and power calculations’, and most of all ‘model checking and comparison’ each receive their own chapter of the book. In this we can see that the authors aimed higher than just writing instructions on how to let R fit (multilevel) regression models. The aim of this book is to teach the reader how to analyze data the proper way. Much attention is paid to assumptions, testing theory, and the interpretation of what you’re doing. To quote the authors: “If you show something, be prepared to explain it”.

This philosophy seems to have been a guideline for the authors while writing this book, as was flexibility. The book starts off with some examples from the authors’ own research. These examples return throughout the book, so the reader becomes familiar with the data. Because of this, the concepts, models and analyses described are certainly easier to understand. As a reader, you start to think along with the authors when a new problem is described. The relative worth of the techniques, as well as their drawbacks, is made perfectly clear. The use of R, as well as WinBUGS, pays off well: it requires some more effort to master these programs, but in that process the reader learns to think deeply about what he or she really wants to do and how it is done properly.

I did not find it an easy book, but thanks to the many examples throughout, it can be fully understood by people with some prior knowledge of regression techniques. You can try all of the examples in the book yourself, since the data and syntax are available on the authors’ website for the book. This helps the reader to get some feel for the more difficult subjects of the book. All in all, this seems to me a great book for every applied researcher who has a basic prior understanding of regression analysis. Due to its focus on one set of techniques, a great depth of understanding can be derived from this book.

R-Sessions 17: Generalized Multilevel {lme4}

Rense Nieuwenhuis — Mon, 01 Sep 2008 10:00:40 +0000

Although all introductions to regression seem to be based on the assumption that the data are normally distributed, in practice this is often not the case. Many other types of distributions exist alongside the normal (Gaussian) distribution, such as the binomial and Poisson distributions. The lmer()-function in the lme4-package can easily estimate models based on these distributions. This is done by adding the ‘family’-argument to the command syntax, thereby specifying that a generalized linear multilevel model needs to be estimated instead of a linear one (in lme4 version 1.0 and later, the same argument is passed to glmer() rather than lmer()).

Logistic Multilevel Regression

Let us say we want to estimate the chance that a student in a specific school succeeds on a test. For this, we can use the Exam data-set in the mlmRev-package, which contains the standardized scores on a test. Here, we’ll define success on the test as having a standardized score of 0 or larger. This is recoded to a 0-1 variable below, using the ifelse() function. The recoding is checked using summary(). The needed packages are loaded as well, using the library() function.

library(lme4)
library(mlmRev)
names(Exam)

Exam$success <- ifelse(Exam$normexam >= 0,1,0)
summary(Exam$normexam)
summary(Exam$success)

> library(lme4)
Loading required package: Matrix
Loading required package: lattice
> library(mlmRev)
> names(Exam)
 [1] "school"   "normexam" "schgend"  "schavg"   "vr"       "intake"  
 [7] "standLRT" "sex"      "type"     "student" 
> 
> Exam$success <- ifelse(Exam$normexam >= 0,1,0)
> summary(Exam$normexam)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-3.6660000 -0.6995000  0.0043220 -0.0001138  0.6788000  3.6660000 
> summary(Exam$success)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  1.0000  0.5122  1.0000  1.0000

In order to properly use the binary ‘success’ variable created this way, a logistic regression model needs to be estimated. This is done by specifying the binomial family with the logit as link function, using family = binomial(link = "logit"). The rest of the specification is exactly the same as for a normal linear multilevel regression model using the lmer() function.

lmer(success ~ schavg + (1|school), data=Exam, family=binomial(link = "logit"))

> lmer(success~ schavg + (1|school), 
+ 	data=Exam, 
+ 	family=binomial(link = "logit"))
Generalized linear mixed model fit using Laplace 
Formula: success ~ schavg + (1 | school) 
   Data: Exam 
 Family: binomial(logit link)
  AIC  BIC logLik deviance
 5323 5342  -2658     5317
Random effects:
 Groups Name        Variance Std.Dev.
 school (Intercept) 0.23113  0.48076 
number of obs: 4059, groups: school, 65

Estimated scale (compare to  1 )  0.9909287 

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.08605    0.07009   1.228    0.220    
schavg       1.60548    0.21374   7.511 5.86e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Correlation of Fixed Effects:
       (Intr)
schavg 0.072
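Because the fixed effects are estimated on the logit scale, they are easier to read after a transformation. The sketch below copies the two estimates from the output above and converts them using base R. The three values of schavg are picked only for illustration, and the probabilities hold for a school whose random intercept is zero.

```r
# Fixed-effect estimates copied from the output above (logit scale)
b0 <- 0.08605   # intercept
b1 <- 1.60548   # schavg

# Probability of success for a school with a random intercept of zero,
# at three illustrative values of schavg
schavg <- c(-0.5, 0, 0.5)
prob   <- plogis(b0 + b1 * schavg)
round(prob, 2)

# Odds ratio for a one-unit increase in schavg
exp(b1)
```

At schavg = 0 the predicted probability of success is about 0.52, and a one-unit increase in schavg multiplies the odds of success by roughly 5.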

– – — — —– ——–