On choosing a statistical analysis

There are hundreds of different statistical analyses available to the applied statistician. While each scientific field tends to have its common, go-to analyses, that still leaves a plethora from which to choose. So, we are often left with the question, “Which analysis should I use?” The answer is simple: use an analysis that tests your hypothesis. After all, the broad goal of the social sciences is to generate knowledge by testing hypotheses that address gaps in the scientific literature. Statistics are simply the mathematical tools for conducting these tests. Therefore, the applied statistician should always start with a clear, specific hypothesis before beginning any analysis.

A justifiable analysis is not the one that is the most complicated or difficult to understand; again, it is the analysis that tests the hypothesis. As with many human creations, statistics can be used as a way to inflate one’s ego: “Look at how sophisticated I am for doing a complicated analysis!” However, if the analysis does not help test the hypothesis, then it is nothing more than hot air. The problem compounds when readers, intimidated by the statistics, take the author’s statistical interpretations as gospel because they don’t know how to evaluate their validity. Some would argue that the analysis should be as simple as possible so that more readers can understand it. While there may not be one right analysis, there are certainly wrong ones, and some are better than others.

I recently came across a scientific article published in a top journal by Swartout et al. (2015) that did just what I advise against: the authors chose an analysis that did not test their hypothesis, or at least did not test it very well. Instead, the article used a very complicated analysis. I will give the authors the benefit of the doubt and assume that they were not using their statistics as an ego boost or to intimidate their readers. The article sought to test the serial rapist assumption, which states that most rapists will rape multiple times rather than only once.

To test this assumption, the authors ran a latent class growth (or trajectory) analysis. Participants were college men who completed a survey every year of college. At each of these four time points, participants self-reported how many times they forced a woman to have sex in the past year (i.e., rape). The authors created a dichotomous variable out of the responses from each year: whether a participant raped at least once in the past year. These four observed variables were entered into a latent class growth analysis to place participants into groups (or classes) based upon how their probability of rape changed over time. The authors found that most participants fell into a latent class that showed no growth in the probability of rape over time. A minority of participants showed an increase over time, such that their probability of rape started at around zero in freshman year and rose through senior year. Another minority of participants showed a decrease over the course of college. The authors concluded that because most participants showed no growth in their probability of rape, the serial rapist assumption was not supported.

Hopefully, many of you realize the problems with this interpretation of the statistical analyses. Let’s return to what the serial rapist assumption states: the majority of rapists will rape more than once. The first problem with the authors’ analysis was the way they created their observed variables. They coded participants for raping at least once in the past year. This coding stripped the data of any information about whether participants raped multiple times within a year! This obviously prevented them from fully testing the serial rapist assumption. The authors were only able to see whether participants raped during multiple years of college. The second problem is the choice of analysis: a growth curve analysis, which tests whether variables (or probabilities of variables) increase or decrease over time. However, the serial rapist assumption says nothing about the frequency or probability of rape increasing or decreasing over time. It simply states that the majority of rapists will rape more than once.
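To make the first problem concrete, here is a minimal Python sketch of what dichotomizing the yearly counts throws away. Everything in it is an assumption for illustration: the participant labels, the counts, and the function names are made up, and none of it comes from the Swartout et al. (2015) data.

```python
# Hypothetical self-reported counts per year of college (freshman..senior).
# These values are invented for illustration; they are NOT real data.
yearly_counts = {
    "A": [0, 0, 0, 0],  # never offended
    "B": [3, 0, 0, 0],  # three times, all within one year
    "C": [1, 0, 0, 0],  # once only
    "D": [1, 1, 0, 0],  # once in each of two different years
}


def dichotomize(counts):
    """Recode each yearly count as 1 (at least once) or 0 (never)."""
    return [1 if c >= 1 else 0 for c in counts]


def is_repeat_from_counts(counts):
    """Repeat offender according to the raw counts: more than one act in total."""
    return sum(counts) > 1


def is_repeat_from_indicators(indicators):
    """Repeat offender according to the 0/1 indicators: offended in more than one year."""
    return sum(indicators) > 1


for pid, counts in yearly_counts.items():
    indicators = dichotomize(counts)
    print(
        pid,
        "raw counts:", is_repeat_from_counts(counts),
        "| dichotomized:", is_repeat_from_indicators(indicators),
    )
```

Participant B is the telling case: the raw counts identify a repeat offender, but once the data are recoded to 0/1 per year, B is indistinguishable from participant C, who offended once.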

A justifiable analysis would be some type of relative frequency analysis. This could be as simple as a single percentage, such as the percentage of total rapists who raped more than once. One of the samples that the authors used, and its associated dataset, is publicly available. After the article’s publication, other rape researchers decided to run an analysis that actually tested the serial rapist assumption. They found that 63% of the total rapists in the dataset raped more than once, which supported the serial rapist assumption. To test Swartout et al.’s (2015) hypothesis, they did not have to run a latent variable analysis; they did not have to run a growth curve analysis; and they did not have to estimate logit equations. All they had to do was obtain two numbers: the total number of rapists and the number of rapists who raped more than once.
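A relative frequency analysis of this kind takes only a few lines of code. The sketch below is one way it might look in Python, assuming each participant has a single total count across all survey years. The function name and the numbers are hypothetical, chosen only to show the arithmetic; they are not the published dataset and do not reproduce the 63% figure.

```python
def repeat_offender_share(total_counts):
    """Share of offenders (total >= 1) who offended more than once.

    Returns None when nobody in the sample offended, since the
    percentage is undefined without a denominator.
    """
    offenders = [c for c in total_counts if c >= 1]
    if not offenders:
        return None
    repeaters = [c for c in offenders if c > 1]
    return len(repeaters) / len(offenders)


# Hypothetical totals per participant, invented for illustration only.
hypothetical_totals = [0, 0, 1, 2, 0, 5, 1, 3, 0, 0]

share = repeat_offender_share(hypothetical_totals)
print(f"{share:.0%} of offenders offended more than once")  # prints 60% for these made-up numbers
```

Note that the denominator is the number of offenders, not the whole sample. The assumption is a claim about rapists, so participants who never offended do not enter the calculation at all.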

Right now, there is a lot of controversy surrounding the validity of the serial rapist assumption and its implications for public and collegiate policy. While important for any scientific investigation, choosing an analysis that tests your hypothesis is arguably even more important for a theory with such influential, real-world implications. Future research should perform some type of relative frequency analysis. As for the implications of the 63% result in one of the Swartout et al. (2015) samples... I will leave that for the politicians.