Rense Nieuwenhuis » Statistics

Statistical Tools – Te Grotenhuis and Van der Weegen (2009)

Rense Nieuwenhuis — Wed, 16 Sep 2009 10:00:29 +0000

How does one teach statistics? Is it more important to start with mathematical thoroughness, or to help students to gain a conceptual understanding first? Few give a comprehensive introduction to statistics for those without the otherwise indispensable mathematical background. Manfred te Grotenhuis and Theo van der Weegen recently published an introductory book on statistics, explaining statistical concepts using words and graphs, rather than formulas.

Less than a year ago, I wrote these exact words. I then discussed the publication of a Dutch book on statistics, to which I provided minor assistance. Now, I repeat these words to introduce the English translation of this conceptual introduction to statistics, called Statistical Tools. Again, I contributed to this publication, this time by providing a first, rough translation from Dutch to English. Let me repeat below what I wrote before on this blog, for of course it is just as relevant to the English translation:

With the focus on practical application rather than statistical theory, the first chapter starts by explaining the goal of inferential statistics, meanwhile introducing the concepts of measurement and variables. Considerable attention is paid to the importance of high quality data to perform your analyses on. The second chapter deals with descriptive statistics, both in a numerical and a graphical way. Here, the concepts of a distribution and of correlation are introduced as well. The third and final chapter discusses the testing of hypotheses, using techniques such as the cross-table, tests for means and proportions, various forms of correlation, and finally multiple regression.

Clearly, the setup of this book is what one might expect from an introduction to statistics. However, I think this book has a unique approach in its strong focus on the conceptual level, rather than the (mathematical) statistical theory. Nevertheless, it does not shy away from relatively complex subjects such as multiple regression. Even on the conceptual level, it pays a lot of attention to the assumptions required for the various analyses discussed. The practical approach of this book is enhanced even further, because all examples come from ‘real life’ research. On the accompanying website SPSS data files and syntax files are made available, so that every example from the book can be repeated by the reader.

Aimed at the novice statistics student, this book offers a comprehensible and conceptual approach to statistics. It will surely help students of statistics to grasp what they’re actually doing when pushing SPSS’s buttons or trying to interpret published figures. In that sense, I think that for many statistics students, this book successfully reaches its goal of transforming statistics from an abstract undertaking into an actually useful and applicable tool.

– – — — —– ——–
This post is part of my ‘Reading List’. In this series I jot down some thoughts about the books I read and enjoyed. Some posts may give a somewhat balanced overview of a book, others will just focus on some aspects that, for whatever reason, caught my attention. These posts are never meant as an evaluation or even a review of the book. I just like to share some impressions.

An overview of my Reading List is available, which contains both a list of the books that I wrote about, and another list of books I’m planning to read.

The Triumph of Numbers – Cohen (2005)

Rense Nieuwenhuis — Tue, 08 Sep 2009 10:00:54 +0000

My new job involves working with numbers. A lot. So, I started reading about using numbers, and I very much enjoyed ‘The Triumph of Numbers’ by I.B. Cohen (2005). This book gives an historical account not only of how numbers were used in different times, but also of ‘how counting shaped modern life’.

The book starts out by illustrating the power of numbers. Just by using very simple calculations, Cohen quickly arrives at the conclusion that the building of the ancient pyramids involved placing one giant block of stone in the structure every two minutes. Since the weight of such stones is enormous, this required quite advanced techniques. Knowing the vast size of such an operation helps us to understand how the Egyptians may have done it, and the level of technology available to them.

People have long been fascinated by numbers. Cohen’s description of the history of using numbers therefore starts with numerology. The reader is treated to lovely exercises in numerology: it is quite amazing how we can prove just about anything, simply by reordering numbers that somehow correspond to letters. If only there were an empirical basis for such magic.

Off to more serious applications of numbers (by today’s standards), Cohen locates the proper start of using numbers in Hutcheson’s Moral Arithmetic. Hutcheson used formulae (which are based on numbers) to make his claims about morality. Here, numbers were only used to illustrate a claim, but not much later people started to relate such numbers to observable phenomena. An example of this is Benjamin Franklin, who used his mathematical genius to find arguments based on numbers for his political claims regarding the safety of inoculation against smallpox. He used numbers to show it was safe to have your children inoculated.

Many more examples are given of how claims were backed up with (increasingly advanced) numbers, and calculations based on these numbers. For instance, Alexandre Louis’s statistics showed the ineffectiveness of blood-letting in treating patients. Laplace used probability theory to suggest improvements to both the British and the French judicial system. Guerry was struck by the regularities he found in his tables on crime. Quetelet, referred to as the ‘powerhouse of the statistical movement’, introduced the ‘average man’. He explicitly started using statistics to gain an understanding of society. Quetelet is seen as the founder of statistically based sociology.

What did I learn

I think that the central claim of the book is that statistics became interesting when society became more complex. Especially in warfare, knowing how many troops one has, and can expect in the coming years, provides key insights into military strength. Unsurprisingly, the results of early censuses were highly confidential, so as not to give opponents the benefit of the information. From a sociological perspective, this insight allows the rise of the use of statistics to be understood from an evolutionary perspective: the foundations of societies change, and as a result so does the way people think.

Florence Nightingale: the lady with the numbers

I was especially intrigued by Cohen spending a complete chapter on Florence Nightingale. As early readers of this blog may know, I used to be in the nursing profession myself, and was inspired by how she and her ideas were a strong force behind the movement towards a professional nursing practice. Of course, Nightingale saved many lives during the Crimean War, ‘simply’ by improving sanitary and hygienic conditions in the war hospitals. Later, she improved these conditions in other hospitals, saving many more.

What I didn’t know about Nightingale is that she greatly admired Quetelet. Cohen praises her not for inventing new statistics, but for using them appropriately in a time when such use was not common at all. By recording causes of death, Nightingale found that many soldiers died from infections, rather than war wounds. Using these records, she was also able to show the results of the sanitary and hygienic changes she made. In that, she was a very early proponent of evidence-based medicine.

Conclusion: Abrupt Ending

Unfortunately, the book seems to come to an abrupt ending after Florence Nightingale’s interest in statistics is described. As a result, only the distant history of statistics is dealt with in this book, whereas the introduction seems to hint at more recent developments in the use of numbers as well. Since the book was published after his death in 2003, I suspect that Cohen was unable to finish his work. Despite this abrupt ending, however, I think the book is a very nice introduction to the history of using numbers, and provides an insightful overview of many of those in history who have contributed to the modern applications of statistics.


Curving Normality Blog Carnival #1

Rense Nieuwenhuis — Mon, 01 Dec 2008 10:00:11 +0000

Today, I am happy to present to you the first edition of the Curving Normality blog carnival. It is all about the quantitative social sciences, and aims at bringing together high quality blog posts about our lovely profession. With just a few weeks of preparation, I am very pleased with the number of submissions, and especially with their quality. Apparently, quantitative social scientists are quite well represented in the blogosphere!

The first article was submitted really quickly by Inti Suarez. In his series on the applicability of (social) science articles for political practice, he investigates the worth of an article on Terrorism and the world economy. After sharing some of his own experiences in politics with the difficulty of properly defining the concept of ‘terrorism’, he praises the article for confining itself to a single issue. In short: “The claim of this paper is straightforward: if a country is threaten by terrorism, it will attract less investments.” Does this have practical relevance? “What is painful to realize is that this conclusion might reinforce the terrorist agenda, instead of weaken it.”

Secondly, statistics aficionado Stijn Ruiter writes on his blog ‘Your Sixth Degree’ about the advanced use of statistics. In his post on the presidential elections and the so-called Bradley effect, however, he shows that without asking the right question, advanced statistics gets you nowhere. The election of Barack Obama contradicts this Bradley effect, which “basically refers to the idea that a black American would not get elected because in the election booth voters would decide against what they said in the polls.” However, research should perhaps have a more detailed starting point: “The Bradley effect hypothesis is rather general, and as it is generally described (as above), it does not really specify who the voters are and what characteristics they (should) have. It only specifies whom to choose from, a black candidate or a white candidate. But there are two sides to the voting equation, namely voters and candidates. […] So, the question becomes who votes for whom.” (Also see Gary King’s note on a paper investigating the (decline of the) Bradley effect.)

Such a detailed perspective was also taken up in an article on the educational achievement of migrants’ children, which I described myself a while ago. “The authors of the article – recently published in American Sociological Review – were able to take into account influences from both (characteristics of) country of origin, country of destination, and the migrant community in the country of origin.” Doing so has led to some interesting findings, which would have remained unclear had this level of detail not been maintained. “Counter-intuitively, immigrant children from countries with lower levels of economic development have better scholastic performance than comparable children who emigrate from countries with higher levels of economic development.”

Also focused on the educational attainment of migrants’ children, in relation to integration in the host society, Fëanor on ‘Just a Mon’ discusses a ‘natural experiment’. This natural experiment came about when, after Indonesian independence, thousands of Moluccans were allowed to settle in various Dutch municipalities. The socio-economic backgrounds of these people were rather similar, which allowed the researchers to compare their children on educational achievement, and cross-tabulate this with measures of integration. They found that “children from Moluccan fathers and native mothers have a higher educational attainment than children from ethnic homogeneous Moluccan couples or children from a Moluccan mother and a native father.”

Finally, a ‘natural experiment’ is nice, but what about the holy grail of scientific rigour: a real experiment? These are often difficult to achieve in the social sciences, but it has been done. Ed Yong on ‘Not exactly Rocket Science’ discusses an experimental test of the ‘broken windows theory’, “which suggests that signs of petty crimes, like broken windows, serve as a trigger for yet more criminal behaviour”. The article, published in Science, describes how simple experiments were conducted, such as measuring ‘littering’ when a wall was severely tainted by graffiti, or when it was completely painted over. A very interesting article, and Ed Yong gives a thorough summary. “All in all, the suite of experiments, all in a realistic setting, provide powerful evidence that the Broken Windows Theory is valid and all of Keiser’s results were statistically significant”

That’s it for today. No more entries for this first edition of the Curving Normality blog carnival. I would like to thank everyone who submitted an entry. It was very nice to read all your blogs and to tie them all together in this editorial. The next edition will be published on the first day of 2009, so please submit your next article in the comments below as soon as it’s ready!

Newsflash: Lucia de B. gets re-trial!

Rense Nieuwenhuis — Wed, 08 Oct 2008 10:00:37 +0000

Dutch nurse Lucia de B., sentenced to life imprisonment for the murder of seven infants during her shifts, is now entitled to a re-trial. Why do I write about it here? Because one of the grounds she was convicted on was a statistical argument. A statistical argument that has been thoroughly contested by prominent statisticians, arguing that according to the court’s line of reasoning, one out of every nine nurses would go to jail!

I have written before about this statistical argument, but did so in Dutch. For those interested, I’ll give you a short recap, and a nice movie.

Lucia de B. was convicted of the murder of seven children on numerous grounds. Most of these have been contested or have already been refuted. Ton Derksen, a Dutch philosopher of science, even wrote a book discussing many of the court’s considerations. One of the main arguments was that a statistically highly improbable number of children died during her shifts. There are many arguments against this statement. For instance, after she was linked to one unusual death, investigators specifically searched for other unusual deaths during her shifts. Clearly, this increases the numerator of the abovementioned chance. Later, it was discovered that the ‘unusual’ deaths need not have been unusual at all, for the ‘unusual’ substance in the infants’ blood had been mixed up with a similarly named, but completely different, substance that is very often found in infants’ blood.

Nevertheless, one of the court’s main considerations was that the chance that so many infants would die during or shortly after the shifts of Lucia de B. was so low that she had to be guilty. Apparently, the court reasoned that the probability of these events (deaths during her shifts) was so low that other explanations would be highly improbable.

Clearly, something goes horribly wrong here. In the movie below a similar case is addressed by Peter Donnelly in a very accessible way. In the case that is addressed in the movie, a woman was convicted of the murder of her two children. These two children, independently, had died from sudden infant death syndrome. Sudden infant death syndrome is rather rare (and a tragedy for the family). As a matter of fact, it is so rare that the chance of it happening to two babies of the same mother was judged to be extremely small (according to the judge: an extremely small chance times another extremely small chance), and on that basis this mother was sent to jail.

I’m not going to completely summarise Peter Donnelly’s arguments, but what it comes down to is that we should interpret the court’s decision as a ‘test’. And we know that with statistical tests two errors can be made: we can erroneously conclude that an event is highly improbable, while in fact it is not. Or, we can erroneously conclude that something is not highly improbable, while in fact it is.

What this has to do with the case of the mother who lost two of her babies to sudden infant death syndrome, and correspondingly with the case of Lucia de B., is made clear in the movie below. The basic argument, which relates to the case of Lucia de B., is that although some events are rather rare, if there are enough opportunities for the event to occur (lots of mothers have two babies, many nurses work with infants who die), the probability that the event occurs somewhere in the whole population isn’t that small at all. Watch and see how Peter Donnelly explains this eloquently:

And by the way, for the statisticians amongst us: the prominent statisticians I mentioned before basically argue that the assumption of homoscedasticity has not been met, which makes matters even worse!
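
The basic argument about rare events and many opportunities can be sketched in a few lines of R. The numbers below are invented purely for illustration (they are assumptions, not figures from either court case): suppose the chance that any one nurse sees a suspicious-looking cluster of deaths by pure coincidence is 1 in 10,000, and that 25,000 nurses work in comparable settings.

```r
# Hypothetical numbers, for illustration only (not taken from the court cases)
p_single <- 1 / 10000   # chance that one given nurse sees such a cluster by coincidence
n_nurses <- 25000       # number of nurses who could have seen such a cluster

# Probability that at least one nurse sees such a cluster purely by chance
p_at_least_one <- 1 - (1 - p_single)^n_nurses
p_at_least_one

# Expected number of nurses who see such a cluster purely by chance
expected_nurses <- n_nurses * p_single
expected_nurses
```

Under these made-up assumptions, an event that is very unlikely for any single nurse is close to certain to happen to someone, which is why a small probability on its own says little about guilt.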

R-Sessions 11: Tables

Rense Nieuwenhuis — Fri, 15 Aug 2008 10:00:27 +0000

One of the most frequently used operations in the analysis of statistical data is the creation of tables. This edition of the R-Sessions describes the use of several functions to do some nifty cross-tabulations. And more.

TAPPLY

The function tapply() can be used to perform calculations on table marginals. Different functions can be used, such as mean, sum, var, sd, and length (for frequency tables). Note that R is case-sensitive, so these names are typed in lower case. For example:

x <- c(0,1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,1,1,1,2,2,2,2)
tapply(x,y,mean)
tapply(x,y,sum)
tapply(x,y,var)
tapply(x,y,length)

> x <- c(0,1,2,3,4,5,6,7,8,9)
> y <- c(1,1,1,1,1,1,2,2,2,2)
> tapply(x,y,mean)
  1     2
2.5   7.5
> tapply(x,y,sum)
 1  2
15 30
> tapply(x,y,var)
       1        2
3.500000 1.666667
> tapply(x,y,length)
1 2
6 4
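
As a small extension of the example above, the sketch below labels the groups and applies a self-written function. The labels ‘first’ and ‘second’ and the name ‘spread’ are made up for this illustration.

```r
x <- c(0,1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,1,1,1,2,2,2,2)

# Turn y into a factor with readable labels (the labels are made up)
group <- factor(y, levels = c(1, 2), labels = c("first", "second"))

# tapply() accepts any function that takes a vector,
# including one written on the spot: here the range width per group
spread <- tapply(x, group, function(v) max(v) - min(v))
spread
```

The result is a named vector, so a single group can be picked out by its label, for instance with spread["first"].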

FTABLE

More elaborate frequency tables can be created with the ftable() function. For example:

x <- c(0,1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,1,1,1,2,2,2,2)
z <- c(1,1,1,2,2,2,2,2,1,1)
ftable(x,y,z)

> x <- c(0,1,2,3,4,5,6,7,8,9)
> y <- c(1,1,1,1,1,1,2,2,2,2)
> z <- c(1,1,1,2,2,2,2,2,1,1)
> ftable(x,y,z)
    z 1 2
x y
0 1   1 0
  2   0 0
1 1   1 0
  2   0 0
2 1   1 0
  2   0 0
3 1   0 1
  2   0 0
4 1   0 1
  2   0 0
5 1   0 1
  2   0 0
6 1   0 0
  2   0 1
7 1   0 0
  2   0 1
8 1   0 0
  2   1 0
9 1   0 0
  2   1 0
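
Because ‘x’ has ten distinct values, the table above is long and mostly filled with zeros. A more compact view, sketched below with the same data, cross-tabulates only ‘y’ and ‘z’; the object name ‘counts’ is just chosen for this illustration.

```r
y <- c(1,1,1,1,1,1,2,2,2,2)
z <- c(1,1,1,2,2,2,2,2,1,1)

# table() counts the combinations of y and z
counts <- table(y, z)
counts

# ftable() can also flatten an existing table object
ftable(counts)
```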

– – — — —– ——–

Discuss this article and pose additional questions in the R-Sessions Forum

Find the original article embedded in the manual.

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.
——– —– — — – –

R-Sessions 10: Conditionals

Rense Nieuwenhuis — Wed, 13 Aug 2008 10:00:21 +0000

Conditionals, or logicals, are used to check vectors of data against conditions. In practice, this is used to select subsets of data or to recode values. Here, only some of the fundamentals of conditionals are described.

Basics

The general form of a conditional is two values, or two sets of values, and the condition to test against. Examples of such tests are ‘is larger than’, ‘equals’, and ‘is smaller than’. In the example below the values ‘3’ and ‘4’ are tested using these three tests.

3 > 4
3 == 4
3 < 4

> 3 > 4
[1] FALSE
> 3 == 4
[1] FALSE
> 3 < 4
[1] TRUE

Numerical returns

The output shown directly above makes clear that R-Project returns the values ‘TRUE’ and ‘FALSE’ to conditional tests. The results here are pretty straightforward: 3 is not larger than 4, therefore R returns FALSE. If you don’t desire TRUE or FALSE as response, but a numeric output, use the as.numeric() command which transforms the values to numerics, in this case ‘0’ or ‘1’. This is shown below.

as.numeric(3 > 4)
as.numeric(3 < 4)

> as.numeric(3 > 4)
[1] 0
> as.numeric(3 < 4)
[1] 1

Conditionals on vectors

As with most functionality in R-Project, conditionals can be applied not only to single values, but to variables containing multiple values (vectors) as well. This results in a succession of tests, one for each value in the variable. The output is a vector of values, ‘TRUE’ or ‘FALSE’. The examples below show two things: the consecutive values 1 to 10 are tested against the condition ‘is smaller than or equals 5’. It is shown as well that when these values are assigned to a variable (here: ‘x’), this variable can be tested against the same condition, giving exactly the same results.

1:10
1:10 <= 5
x <- 1:10
x <= 5

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 1:10 <= 5
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> x <- 1:10
> x <= 5
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
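
A logical vector like this is useful for more than inspection. The short sketch below (the names ‘small’, ‘n_small’ and ‘share_small’ are chosen only for this illustration) counts the values that meet the condition and uses the vector to select them.

```r
x <- 1:10
small <- x <= 5

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_small <- sum(small)

# mean() of a logical vector gives the proportion of TRUE values
share_small <- mean(small)

# A logical vector inside square brackets selects the matching values
selected <- x[small]
selected
```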

Conditionals and multiple tests

More tests can be combined into one conditional expression. For instance, building on the example above, the first row of the next example tests the values of variable ‘x’ against being smaller than or equal to 4, or being larger than or equal to 6. This results in ‘TRUE’ for all the values, except for 5. Since the ‘|’-operator is used, only one of the conditions needs to be true. The second row of the example tests the same values against two conditions as well, namely ‘equal to or larger than 4’ and ‘equal to or smaller than 6’. Since this time the ‘&’-operator is used, both conditions need to be true.

x <= 4 | x >= 6
x >= 4 & x <= 6

> x <= 4 | x >= 6
 [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> x >= 4 & x <= 6
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
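
Combined conditions are typically used to select a subset. As a minimal sketch (the names ‘middle’ and ‘outside’ are made up here), the ‘!’-operator negates a condition, so the same test can select either the values inside or those outside a range.

```r
x <- 1:10

# Values for which both conditions hold
middle <- x[x >= 4 & x <= 6]
middle

# '!' turns every TRUE into FALSE and vice versa
outside <- x[!(x >= 4 & x <= 6)]
outside
```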

Conditionals on character values

In the example below, a string variable ‘gender’ is constructed, containing the values ‘male’ and ‘female’. The first row of the example creates the variable; the second row tests which of its values equal ‘male’.

gender <- c("male","female","female","male","male","male","female")
gender == "male"

> gender <- c("male","female","female","male","male","male","female")
> gender == "male"
[1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE
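
Two related tests on character values are sketched below: ‘!=’ (is not equal to) and ‘%in%’, which checks each value against a set of allowed values. The names ‘valid’ and ‘n_female’ are chosen only for this illustration.

```r
gender <- c("male","female","female","male","male","male","female")

# 'Is not equal to'
gender != "male"

# %in% tests every value against a set of allowed values,
# which is handy for spotting typing errors in a string variable
valid <- gender %in% c("male", "female")
all(valid)

# Counting the number of times a value occurs
n_female <- sum(gender == "female")
n_female
```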

Additional functions

The last examples demonstrate two other functions using conditionals, using the same ‘gender’ variable as above. The first is an additional way to get a numerical output of the same test as in the row above. The ifelse() command has three arguments: the first is a conditional, the second is the desired output if the conditional is ‘TRUE’, the third is the output in case the result of the test is ‘FALSE’. The second example shows a way to obtain a list of which values match the condition tested against. In the output above, the second, third and last values are ‘female’. Using which() and the condition “== ‘male’ ” (equals ‘male’) returns the indices of the values in variable ‘gender’ that equal ‘male’.

ifelse(gender=="male",0,1)
which(gender=="male")

> ifelse(gender=="male",0,1)
[1] 0 1 1 0 0 0 1
> which(gender=="male")
[1] 1 4 5 6
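
Both functions are more flexible than the example suggests. In the sketch below (the labels ‘M’ and ‘F’ and the object names are made up), ifelse() returns character labels instead of numbers, and the positions found by which() are used to index the variable.

```r
gender <- c("male","female","female","male","male","male","female")

# The outputs of ifelse() need not be numbers
short <- ifelse(gender == "male", "M", "F")
short

# The indices returned by which() can be used for indexing
male_positions <- which(gender == "male")
gender[male_positions]
```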

Conditionals on missing values

Missing values (‘NA’) form a special case in many ways, such as when using conditionals. Normal conditionals cannot be used to find the missing values in a range of values, as is shown below.

x <- c(4,3,6,NA,4,3,NA)
x == NA
which(x == NA)
is.na(x)
which(is.na(x))

The last two rows of the syntax above show what can be done. The is.na() command tests whether a value or a vector of values is missing. It returns a vector of logicals (‘TRUE’ or ‘FALSE’), that indicates missing values with a ‘TRUE’. Nesting this command in the which() command described earlier enables us to find which of the values are missing. In this case, the fourth and the seventh values are missing.

> x <- c(4,3,6,NA,4,3,NA)
> x == NA
[1] NA NA NA NA NA NA NA
> which(x == NA)
integer(0)
> is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
> which(is.na(x))
[1] 4 7
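
Building on is.na(), the sketch below shows a few common follow-up steps: counting the missing values, keeping only the observed ones, and calculating a mean despite the missings. The object names are chosen only for this illustration.

```r
x <- c(4,3,6,NA,4,3,NA)

# Number of missing values
n_missing <- sum(is.na(x))
n_missing

# Keep only the values that are not missing
observed <- x[!is.na(x)]
observed

# Many functions return NA as soon as one value is missing ...
mean(x)

# ... unless they are told to remove the missing values first
mean_observed <- mean(x, na.rm = TRUE)
mean_observed
```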


useR! 2008: Harrell already wrote it …

Rense Nieuwenhuis — Mon, 11 Aug 2008 15:33:27 +0000

Unfortunately, Frank E. Harrell Jr. already wrote the book that I would have loved to (be able to) write, probably somewhere at the end of my career. If at all. Fortunately, I can now learn a lot, and much faster. I’m talking about a book on statistics that also contains a perspective and opinion on the application of statistics. Harrell called his book “Regression Modeling Strategies”. Oh, and he also demonstrates his main arguments in R-Project. And now he is telling me that his philosophy on applied statistics is also condensed in an R-package (the Design package).

An eye-opener to me was his description of non-statisticians being afraid of continuous variables. Indeed, when I doubt the linearity of a continuous variable and I can’t find a way to fix it, I tend to categorize or even dichotomize it. I feel that this is not an uncommon thing to do, but now that I have heard Harrell’s criticism of this strategy, I hope to never do that again and will give some serious thought to his suggestion of using ‘spline functions’. He argued that we tend to dichotomize a not-completely-linear variable, because we do not believe that it is linear in reality. But, given the finding of near linearity in our data, do we believe reality to be dichotomized? Probably our actions will bring our model farther away from correspondence with reality than our inaction would. “Nature is not that kind. There is no reason to expect linearity.”
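
To make the contrast concrete for myself, here is a minimal sketch on simulated data. It is not Harrell’s own example: the data-generating process, the cut-point of 5, and the use of ns() from the splines package with 4 degrees of freedom are all assumptions made for illustration.

```r
library(splines)  # part of the standard R distribution

set.seed(42)
# Simulated data (an assumption): a smooth, clearly non-linear relation
x <- runif(200, min = 0, max = 10)
y <- sin(x / 2) + rnorm(200, sd = 0.3)

# Strategy 1: dichotomize the predictor at an arbitrary cut-point
fit_dich <- lm(y ~ I(x > 5))

# Strategy 2: keep x continuous and model it with a natural cubic spline
fit_spline <- lm(y ~ ns(x, df = 4))

# Compare how much of the variance each model accounts for
r2_dich <- summary(fit_dich)$r.squared
r2_spline <- summary(fit_spline)$r.squared
c(dichotomized = r2_dich, spline = r2_spline)
```

In this simulated setting the spline model follows the curve, whereas the dichotomized model can only describe two plateaus. How large the difference is in real data depends, of course, on the data.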

This does, again, raise the question of how theory and statistical model relate to each other. Is there a need to analyze data with models that are richer in detail than the actual theory that we are testing? If so, does this subsequently mean that our theories are not fit (enough) to be tested with the models that we are forced to estimate?

It is too easy to find interpretations of parameters that basically don’t have any meaning at all. Harrell told a story of how he had found a hugely significant interaction parameter. He went to the cardiologist he worked with, who soon thought of an interpretation. Then, Harrell found that he had made a mistake, and had to correct the sign of the parameter. Shockingly, the cardiologist immediately had a new, and completely different, interpretation.

Several other topics were addressed, such as the treatment of missing values. It is all too easy to add a category ‘missing’ to our analyses, but this does mess up our number of degrees of freedom. He analyzed how a technique as horrid as variable / model selection came about, by arguing that computers were able to perform the technique before they could run the simulations needed to properly test it. Harrell explained that the purpose of data imputation is not to recover missing data, but to retain the data on the other variables that was not missing to begin with.

The last half hour of the presentation was spent on analyzing some magnificent data with survival rates of passengers of the Titanic. “What did ‘women and children first’ really mean?” Applying much of what he had already discussed, he was not only able to show how age, sex, and social class affected chances of survival, but moreover how they interacted. For instance, younger people indeed had relatively high chances of survival, but lower class young males (> 20 years) had almost no chance of survival.
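
A rough impression of this kind of analysis can be had with the ‘Titanic’ table that ships with R. Note that this is an assumption on my part and not the data set Harrell used: the built-in table only distinguishes children from adults, so it cannot show the detailed age effects he presented.

```r
# Built-in 4-way table with dimensions Class, Sex, Age (Child/Adult), Survived
titanic <- as.data.frame(Titanic)

# Proportion survived by sex: sum over class and age, then divide by row totals
by_sex <- prop.table(apply(Titanic, c(2, 4), sum), margin = 1)
by_sex

# Logistic regression with a class-by-sex interaction;
# each row is weighted by the number of passengers it represents
fit <- glm(Survived ~ Class * Sex + Age,
           family = binomial, weights = Freq, data = titanic)
coef(fit)
```

The interaction terms allow the difference between men and women to vary by class, which is the kind of question behind ‘what did women and children first really mean?’.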

A lot of other topics were covered, only some of which I hinted at here. All were stated with confidence and from a clear perspective. I don’t really know what to think of the implications for the relationship between theory and empirical analysis, but Harrell sure has given some input to my thinking on that. However, what I learned on the most fundamental level, I think, is that I can’t wait for the conference bookstore to open tomorrow.

useR! 2008: Bates excels on mixed models

Rense Nieuwenhuis — Mon, 11 Aug 2008 12:11:42 +0000

Douglas Bates excelled during my first tutorial session of the useR! 2008 conference. He gave a three-hour talk on mixed models, in which he managed both to give an overview of the theory and basic specification of this kind of model in R-Project, and to address highly advanced and avant-garde issues. I’m impressed. During the break he was so kind as to answer a question regarding mixed models that had nothing much to do with what he addressed during his talk. We even ended up having a short but nice talk about Dutch politics.

During what was basically his introduction, he gave a nice guideline regarding a discussion that we have been having at our own university. It is the discussion on when we can apply mixed models to grouped data, and when we can’t. Although he basically didn’t add anything that was new to me, his statements gave a lot of clarity to my thoughts on the subject. His basic argument was that we can estimate a grouping factor as a mixed effect only if it is reasonable to assume that its values come from a larger collection of such values. So, for instance, the distinction between male and female would not be a good mixed effect, because should we repeat the ‘experiment’, we would automatically end up with the same values (male and female) on our grouping factor. A good example would be the class that a school child is in, for when we repeat the experiment with a new sample, we would end up with students in different classes. More interesting, though, was his acknowledgment that there are simply grey areas. These are found at the two extremes of the same dimension. When only a small number of groups is present, we end up with problems estimating the model. On the other hand, if we have (almost) all existing groups (e.g. all American states in a survey research project), then we wouldn’t end up with different groups (states) if the project were repeated. I find the fact that these extremes are defined as a grey area rather clarifying, and more informative than simply taking one of the extreme positions (‘always estimate mixed models’ or ‘mixed models are completely flawed in such cases’).
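
The school class example can be sketched as follows. This is a minimal illustration on simulated data, using the nlme package that ships with R; it is not taken from the tutorial, and all variable names and numbers are made up.

```r
library(nlme)  # recommended package, included in standard R installations

set.seed(1)
n_class <- 20   # number of school classes (hypothetical)
n_per   <- 15   # pupils per class (hypothetical)

# Each class gets its own deviation from the overall mean
class_id     <- factor(rep(seq_len(n_class), each = n_per))
class_effect <- rnorm(n_class, sd = 2)

# Simulated pupil-level data: test score depends on hours of study
hours <- runif(n_class * n_per, min = 0, max = 10)
score <- 50 + 1.5 * hours + class_effect[as.integer(class_id)] +
  rnorm(n_class * n_per, sd = 3)
pupils <- data.frame(score, hours, class_id)

# Random intercept for class: the classes are treated as a sample
# from a larger collection of possible classes
fit <- lme(score ~ hours, random = ~ 1 | class_id, data = pupils)
fixef(fit)     # overall intercept and effect of hours
VarCorr(fit)   # variance between classes and residual variance
```

Sex, by contrast, would simply enter such a model as an ordinary predictor, since a repetition of the study would yield the very same two categories.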

Following this introduction, a wide array of issues was addressed: longitudinal models with time as a covariate, interactions on the level of the grouping factor, the theory of generalized linear models, an example of these generalized linear models, and finally some attention was paid to non-linear mixed models.

What I found especially interesting, though, was the explanation of how item response models can be represented using generalized linear mixed models. Item response models are based on theory that basically states that the responses people give to a stimulus (e.g. a survey question) are due both to characteristics of the stimulus and to characteristics of the respondent. We thus need a method for disentangling both influences. Douglas Bates demonstrated a method of doing so by applying mixed models. For a long time, computers were not capable of properly estimating such models. Now, it has become possible to analyze such models by treating the responses to the items as nested within individuals. Both item characteristics and person characteristics can then be added to this basic model.

To sum up, I found this session to be extremely fascinating. It gave a very good overview of mixed models, I learned some new things, and I saw things that I did not understand. At all. That’s the risk that lies in getting a statistics course given by a mathematician. But, since we have the slides and books, these sections of the course will still function as pointers to topics to study in the future.

Being in such an interdisciplinary setting as the useR! conference does that to you: you see topics and methods used in a completely different context than what you’re used to. From that you can easily gain a more general understanding of the techniques you work with within the safe confines of your own discipline. Very enriching and inspiring, and I think the applause was well deserved.

More to come this afternoon, when I will attend a session by Frank E. Harrell Jr. on regression modeling strategies.

R-Sessions 09: Data Manipulation

Rense Nieuwenhuis — Mon, 11 Aug 2008 10:00:39 +0000

Today’s edition of R-Sessions deals with the manipulation of data that is stored in R-Project. Building upon the previous R-Session, attention is paid to the recoding of data, ordering, and finally the merging of several sets of data.

Recoding

The most direct way to recode data in R-Project is using a combination of both indexing and conditionals as described elsewhere. To exemplify this, a simple data.frame will be created below, containing variables indicating gender and monthly income in thousands of euros.

gender <- c("male", "female", "female", "male", "male", "male", "female")
income <- c(54, 34, 556, 57, 88, 856, 23)
data <- data.frame(gender, income)
data

> gender <- c("male", "female", "female", "male", "male", "male", "female")
> income <- c(54, 34, 556, 57, 88, 856, 23)
> data <- data.frame(gender, income)
> data
  gender income
1   male     54
2 female     34
3 female    556
4   male     57
5   male     88
6   male    856
7 female     23

Some of the values on the income variable seem exceptionally high. Let’s say we want to remove the two values on income that are higher than 500. In order to do so, we first use the which() command, which reveals the positions of the values greater than 500. Next, the same condition is used for indexing the data$income variable, which shows the selected values themselves. Finally, the indicator for missing values, ‘NA’, is assigned to the selected values of the ‘income’ variable. Obviously, we would normally only use the third line. The first two are shown here to make clear exactly what is happening.

which(data$income > 500)
data$income[data$income > 500]
data$income[data$income > 500] <- NA
data

> which(data$income > 500)
[1] 3 6
> data$income[data$income > 500]
[1] 556 856
> data$income[data$income > 500] <- NA
> data
  gender income
1   male     54
2 female     34
3 female     NA
4   male     57
5   male     88
6   male     NA
7 female     23
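
As a small sketch of the same idea, the lines below check that indexing with a condition and indexing with which() select the same values here, and show ifelse() as an alternative that returns a recoded copy instead of overwriting the variable. The object names by_logical, by_which and recoded are only illustrative.

```r
# Rebuild the example data so that this sketch runs on its own.
gender <- c("male", "female", "female", "male", "male", "male", "female")
income <- c(54, 34, 556, 57, 88, 856, 23)
data <- data.frame(gender, income)

# Indexing with a logical condition and indexing with which()
# select the same elements when there are no missing values.
by_logical <- data$income[data$income > 500]
by_which <- data$income[which(data$income > 500)]
identical(by_logical, by_which)

# ifelse() returns a recoded copy; data$income itself is left untouched.
recoded <- ifelse(data$income > 500, NA, data$income)
recoded
```

Keeping the recoded values in a separate object makes it easy to compare them with the original variable before deciding to overwrite it.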

Sometimes, it is desirable to replace missing values by the mean on the respective variables. That is what we are going to do here. Note that in general practice it is not very sensible to impute two missing values using only five valid values. Nevertheless, we will proceed here.
The first line of the example below shows that it is not automatically possible to calculate the mean of a variable that contains missing values. Since R-Project cannot compute a valid value, NA is returned. This is not what we want. Therefore, we instruct R-Project to remove missing values by adding na.rm=TRUE to the mean() command. Now, the right value is returned. The selection technique used above does not work for missing values, because a comparison with NA (such as data$income == NA) itself returns NA rather than TRUE or FALSE. Therefore, we need the is.na() command, which returns a vector of logicals (‘TRUE’ and ‘FALSE’). Wrapping is.na() in the which() command gives the positions of the missing values on the income variable. To these, the calculated mean is assigned.

mean(data$income)
mean(data$income, na.rm=TRUE)
data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
data

> mean(data$income)
[1] NA
> mean(data$income, na.rm=TRUE)
[1] 51.2
> data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
> data
  gender income
1   male   54.0
2 female   34.0
3 female   51.2
4   male   57.0
5   male   88.0
6   male   51.2
7 female   23.0
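
The difference between comparing with NA and using is.na() can be sketched on a plain vector. The name income_mean is only illustrative, and the vector repeats the income values from the example above.

```r
income <- c(54, 34, NA, 57, 88, NA, 23)

# A comparison with NA is never TRUE: every element of the result is NA.
income == NA

# is.na() returns a proper logical vector that can be used for indexing.
is.na(income)

# Compute the mean of the valid values once,
# then assign it to the missing positions.
income_mean <- mean(income, na.rm = TRUE)
income[is.na(income)] <- income_mean
income
```

Here the logical vector from is.na() is used for indexing directly, so the extra which() call is not strictly needed.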

Ordering

It is easy to sort a data.frame using the order() command. It returns the row numbers in sorted order, and these are then used to index the rows of the data.frame. Combined with indexing, it works as follows:

x <- c(1,3,5,4,2)
y <- c('a','b','c','d','e')
df <- data.frame(x,y)

df
  x y
1 1 a
2 3 b
3 5 c
4 4 d
5 2 e

df[order(df$x),]
  x y
1 1 a
5 2 e
2 3 b
4 4 d
3 5 c
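
Two common variations are sorting from high to low and sorting on more than one variable. The sketch below assumes an extra, hypothetical grouping variable g that is not part of the example above.

```r
x <- c(1, 3, 5, 4, 2)
y <- c("a", "b", "c", "d", "e")
g <- c("k", "j", "k", "j", "k")   # hypothetical grouping variable
df <- data.frame(x, y, g)

# Sort from high to low on x.
df_desc <- df[order(df$x, decreasing = TRUE), ]
df_desc

# Sort on g first and, within equal values of g, on x.
df_nested <- df[order(df$g, df$x), ]
df_nested
```

Note the comma after the call to order(): the sorted row numbers go in the row position of the index, and the empty column position keeps all variables.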

Merging

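The merge() command puts two data.frames together, matching the rows on an identifier variable (or a combination of variables) that uniquely identifies each case. In the example below, both data.frames contain the variable x, which is used as the identifier.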

x <- c(1,2,5,4,3)
y <- c(1,2,3,4,5)
z <- c('a','b','c','d','e')

df1 <- data.frame(x,y)
df2 <- data.frame(x,z)
df3 <- merge(df1,df2,by=c("x"))

df3
  x y z
1 1 1 a
2 2 2 b
3 3 5 e
4 4 4 d
5 5 3 c
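
In practice the identifier often has a different name in each data.frame, and not every case appears in both. The following sketch uses made-up data and the hypothetical variable names id and key to show how merge() handles this with its by.x, by.y and all parameters.

```r
df1 <- data.frame(id = c(1, 2, 3), y = c(10, 20, 30))
df2 <- data.frame(key = c(2, 3, 4), z = c("b", "c", "d"))

# Default behaviour: only identifiers present in both data.frames are kept.
inner <- merge(df1, df2, by.x = "id", by.y = "key")
inner

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
outer <- merge(df1, df2, by.x = "id", by.y = "key", all = TRUE)
outer
```

The parameters all.x = TRUE and all.y = TRUE do the same for only the first or only the second data.frame.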

R-Sessions 08: Getting Data into R

Rense Nieuwenhuis — Fri, 08 Aug 2008 10:00:13 +0000

Introduction

Various ways are provided to enter data into R. The most basic method is entering it manually, but this tends to get very tedious. An often more useful way is using the read.table command. It has some variants, as will be shown below. Another way of getting data into R is using the clipboard. The drawback thereof is the loss of some control over the process. Finally, it will be described how data from SPSS can be read in directly.

Only basic ways of entering data into R are shown here. Much more is possible as other functions offer almost unlimited control. Here the emphasis will be on day-to-day usage.

Reading data from a file

The most general data-files are basically plain text-files that store the data. Rows generally represent the cases (or respondents), although the top row will often state the variable labels. The values these variables can take are written in columns, separated by some kind of indicator, often spaces, commas or tabs. Another variant is that there is no separating character. In that case all variables belonging to a single case are written in succession. Each variable then needs to have a specific number of character places defined, to be able to distinguish between variables. Variable labels are often left out of these types of files.

R is able to read all of the above-mentioned filetypes with the read.table() command, or its derivatives read.csv() and read.delim(). The exception to this are fixed-width files. These are loaded using the read.fwf() command, which uses different parameters. The derivatives of read.table() are basically the same command, but have different defaults. Because they are so convenient to use, they will be used here.

Comma / Tab separated files

As said, the most generic way of reading data is the read.table() command. When given only the filename as parameter, it treats white space (one or more spaces or tabs) as the separator (so beware of using spaces in variable labels) and assumes that there are no variable names on the first row of the data. The decimal sign is a ".". This leads to the first row of the syntax below, which assigns the contents of a datafile "filename" to the object data, which becomes a data.frame.

The read.csv() and read.delim() commands are basically the same as read.table(), but they have different default values for their parameters. read.csv() is used for comma-separated files (such as those Microsoft Excel can export). The syntax for read.csv() is very simple, as shown below. The read.table() command can be used for the exact same purpose by altering its parameters. The header=TRUE parameter means that the first row of the file is regarded as containing the variable names. The sep parameter indicates the comma "," as the separating character. fill=TRUE tells the function that if a row contains fewer columns than there are variables defined by the header row, the missing fields are added implicitly. For numeric variables these fields get the value ‘NA’ (missing); for text variables they are empty strings. With dec="." the character used for decimal points is set to a point (so that it does not interfere with the separating comma). In contrast with the read.table() function, the comment.char is disabled (set to an empty string). Normally, if the comment.char (by default "#") is found in the data, nothing after that sign on the same row is read. In read.csv() this is disabled by default.

The last two rows of the syntax below show the read.delim() command and the parameters needed to obtain the same functionality from read.table(). The read.delim() function is used to read tab-delimited data, so the sep parameter is set to "\t" (which means tab) by default. The other parameters are identical to those that read.csv() defaults to.

data <- read.table("filename")

data <- read.csv("filename")
data <- read.table("filename", header = TRUE, sep = ",", dec = ".", fill = TRUE, comment.char = "")

data <- read.delim("filename")
data <- read.table("filename", header = TRUE, sep = "\t", dec = ".", fill = TRUE, comment.char = "")
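
To try this out without a data file at hand, the sketch below writes a small, made-up comma-separated file to a temporary location and reads it back in both ways. The file contents and the object names csv_file, data_csv and data_table are only illustrative.

```r
# Write a small comma-separated file to a temporary location.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Age,Income,Gender",
             "34,2.5,female",
             "51,3.1,male",
             "28,,male"),          # the Income value is missing on this row
           csv_file)

# read.csv() with its defaults ...
data_csv <- read.csv(csv_file)

# ... and read.table() with the corresponding parameters set by hand.
data_table <- read.table(csv_file, header = TRUE, sep = ",", dec = ".",
                         fill = TRUE, comment.char = "")

# Both calls should give the same data.frame for this file.
identical(data_csv, data_table)
data_csv
```

The empty Income field on the last row is read as NA, since blank fields count as missing in numeric variables.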

Variable labels

Data that is read into a data.frame can be given variable names. For instance, if the above commands were used to read a data-file containing three variables, variable names can be assigned in several ways. Two ways will be described here: assigning them after the data is read or assigning them using the read.table() command.

names(data) <- c("Age", "Income", "Gender")
data <- read.table("filename", col.names = c("Age", "Income", "Gender"))

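In the syntax above, the names() command is used to assign names to the columns of the data.frame (representing the variables). The names are given as strings (hence the quotation marks) and gathered using the c() command. The second line does the same while the data is being read, using the col.names parameter of read.table().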
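
Both ways of naming can be compared on a small, made-up file without a header row. Note that the parameter of read.table() is spelled col.names. The object names txt_file, default_names and data2 are only illustrative.

```r
# A header-less, space-separated file with three variables.
txt_file <- tempfile(fileext = ".txt")
writeLines(c("34 2.5 female", "51 3.1 male"), txt_file)

# Option 1: read first, assign the names afterwards.
data <- read.table(txt_file)
default_names <- names(data)   # R generates V1, V2, V3 when no names are given
names(data) <- c("Age", "Income", "Gender")

# Option 2: assign the names while reading.
data2 <- read.table(txt_file, col.names = c("Age", "Income", "Gender"))

names(data)
names(data2)
```

Naming the variables while reading keeps the whole import in a single command, which is convenient when the same file is read more than once.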

Fixed width files

When reading files in the ‘fixed width’ format, we cannot rely on a single character that indicates the separations between variables. Instead, the read.fwf() function has a widths parameter by which we tell the function how many characters each variable occupies, and thus where one variable ends and the next one starts. A negative width means that the corresponding number of characters is skipped, as in the second line below. Just as with read.table(), a data.frame is returned. Variable labels are treated in the same way as described in the previous section.

data <- read.fwf("filename", widths = c(2, 5, 1), col.names = c("Age", "Income", "Gender"))
data <- read.fwf("filename", widths = c(-5, 2, 5, -2, 1), col.names = c("Age", "Income", "Gender"))
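
The effect of the widths parameter, including negative values, can be sketched with a small, made-up fixed-width file. The layout (a five-character identifier to skip, then Age, Income, two filler characters and Gender) is an assumption for illustration only.

```r
# Each line: 5 characters to skip, Age (2), Income (5),
# 2 characters to skip, Gender (1).
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("R00013402500XXf",
             "R00025103100XXm"),
           fwf_file)

# Negative widths skip characters; positive widths define the variables.
data_fwf <- read.fwf(fwf_file, widths = c(-5, 2, 5, -2, 1),
                     col.names = c("Age", "Income", "Gender"))
data_fwf
```

The lower-case codes f and m are used on purpose: a column holding only a single upper-case F or T would be converted to the logical values FALSE and TRUE when the data is read.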

Reading data from the clipboard

data <- read.table(pipe("pbpaste"))
data <- read.table("clipboard")

The same commands that read files can also read from the clipboard: read.table() for data separated by white space, read.csv() for comma-separated data and read.delim() for tab-delimited data. read.table(pipe("pbpaste")) reads data from the clipboard on a Mac, and read.table("clipboard") does so on Windows. Instead of read.table(pipe("pbpaste")) you can use read.delim(pipe("pbpaste")) as well, which is convenient for cells copied from a spreadsheet, as these are typically separated by tabs.

Reading data from other statistical packages {foreign}

library(foreign)
data <- read.spss("filename")

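library(foreign) loads the foreign package, which contains the read.spss() function for reading data files written by the SPSS software. Note that read.spss() returns a list by default; adding to.data.frame=TRUE returns a data.frame instead.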

