Unfortunately, Frank E. Harrell Jr. has already written the book that I would have loved to (be able to) write, probably somewhere at the end of my career. If at all. Fortunately, I can now learn a lot much faster. I’m talking about a book on statistics that also contains a perspective and opinion on the application of statistics. Harrell called his book “Regression Modeling Strategies”. Oh, and he also demonstrates his main arguments in R. And now he tells me that his philosophy on applied statistics is also condensed in an R package (the Design package).
An eye-opener to me was his description of non-statisticians being afraid of continuous variables. Indeed, when I doubt the linearity of a continuous variable and can’t find a way to fix it, I tend to categorize or even dichotomize it. I feel that doing so is not uncommon, but now that I have heard Harrell’s criticism of this strategy, I hope never to do it again and will give some serious thought to his suggestion of using ‘spline functions’. He argued that we tend to dichotomize a not-completely-linear variable because we do not believe that it is linear in reality. But, given the finding of near linearity in our data, do we believe reality to be dichotomized? Probably our actions will bring our model farther away from correspondence with reality than our inaction would. “Nature is not that kind. There is no reason to expect linearity.”
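To get a feel for what a spline function offers over dichotomization, here is a minimal sketch in Python of the restricted cubic spline basis (Harrell’s own examples are in R, and the knots and numbers below are my own illustration, not his): the variable enters the model through a few smooth basis columns instead of one crude cut-point, and the fitted curve is constrained to be linear in the tails.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis with linear tails, following the
    standard construction (as used in Harrell's Design/rms package).

    For k knots, returns k-1 columns: x itself plus k-2 nonlinear
    terms; a regression on these columns fits a smooth curve instead
    of the step function that dichotomization imposes."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    cube = lambda u: np.maximum(u, 0.0) ** 3  # truncated cubic (x - t)_+^3
    cols = [x]
    for j in range(k - 2):
        # The two correction terms cancel the cubic growth beyond the
        # boundary knots, so the basis is linear in both tails.
        term = (cube(x - t[j])
                - cube(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + cube(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        cols.append(term)
    return np.column_stack(cols)

# Four illustrative knots give a 3-column design: flexible in the middle,
# linear (hence well-behaved) outside the outer knots.
B = rcs_basis(np.linspace(0.0, 5.0, 50), [1.0, 2.0, 3.0, 4.0])
```

The appeal over dichotomization is exactly the point Harrell makes: nothing in the data is thrown away, and no artificial step is imposed where nature is presumably smooth.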
This does, again, raise the question of how theory and statistical model relate to each other. Is there a need to analyze data with models that are richer in detail than the actual theory we are testing? If so, does this subsequently mean that our theories are not fit (enough) to be tested with the models we are forced to estimate?
It is too easy to find interpretations of parameters that basically don’t have any meaning at all. Harrell told a story of how he had found a hugely significant interaction parameter. He went to the cardiologist he worked with, who soon thought of an interpretation. Then, Harrell found that he had made a mistake, and had to correct the sign of the parameter. Shockingly, the cardiologist immediately had a new, and completely different, interpretation.
Several other topics were addressed, such as the treatment of missing values. It is all too easy to add a ‘missing’ category to our analyses, but this messes up our degrees of freedom. He analyzed how such a horrid technique as variable/model selection came about, arguing that computers were able to perform the technique before they could run the simulations needed to properly test it. Harrell explained that the purpose of data imputation is not to recover the missing data, but to retain the data that was not missing to begin with on the other variables.
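That last point is easy to see in a toy example. The sketch below (plain Python, made-up numbers; simple mean imputation stands in here for the more principled multiple-imputation methods Harrell advocates) shows that dropping every row with a missing value also discards values that were perfectly well observed on the other variable:

```python
# Hypothetical records: (age, blood_pressure); None marks a missing value.
rows = [
    (50, 120), (61, None), (45, 130), (70, 140), (None, 125),
]

# Complete-case analysis: only 3 of 5 rows survive, and the observed
# age 61 and blood pressure 125 are thrown away along with the gaps.
complete = [r for r in rows if None not in r]

def column_mean(i):
    """Mean of column i over the rows where it was observed."""
    vals = [r[i] for r in rows if r[i] is not None]
    return sum(vals) / len(vals)

# Crude mean imputation: fill each gap with that variable's mean, so the
# values that WERE observed in the other column stay in the analysis.
imputed = [(a if a is not None else column_mean(0),
            b if b is not None else column_mean(1))
           for a, b in rows]
```

The imputed values themselves are not the point; keeping the five observed numbers that complete-case analysis would have discarded is.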
The last half hour of the presentation was spent analyzing some magnificent data on the survival of the passengers of the Titanic. “What did ‘women and children first’ really mean?” Applying much of what he had already discussed, he was not only able to show how age, sex, and social class affected chances of survival, but moreover how they interacted. For instance, younger passengers indeed had relatively high chances of survival, but lower-class males over 20 years of age had almost no chance of survival.
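What an interaction means here can be sketched very simply. The toy records below are entirely made up (not the real Titanic data, and far cruder than Harrell’s spline-based model): stratifying survival by sex and class jointly, rather than one variable at a time, shows that the effect of class differs by sex.

```python
from collections import defaultdict

# Hypothetical passengers: (sex, pclass, survived) -- illustrative only.
passengers = [
    ("female", 1, 1), ("female", 1, 1), ("female", 3, 1), ("female", 3, 1),
    ("male", 1, 1), ("male", 1, 0), ("male", 3, 0), ("male", 3, 0),
]

def survival_rate(records, *fields):
    """Proportion surviving within each combination of the given fields."""
    idx = {"sex": 0, "pclass": 1}
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[idx[f]] for f in fields)].append(rec[2])
    return {g: sum(v) / len(v) for g, v in groups.items()}

by_sex = survival_rate(passengers, "sex")
by_sex_class = survival_rate(passengers, "sex", "pclass")
# In these toy numbers, class makes no difference for women but a large
# one for men -- the signature of an interaction.
```

In a regression model the same idea appears as a product term (e.g. sex × class), which is what Harrell’s analysis estimated, alongside smooth age effects.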
A lot of other topics were covered, only some of which I have hinted at here. All were stated with confidence and from a clear perspective. I don’t really know what to think of the implications for the relationship between theory and empirical analysis, but Harrell has surely given me some input for my thoughts on that. However, what I learned on the most fundamental level, I think, is that I can’t wait for the conference bookstore to open tomorrow.