The matrix algebra of linear regression
Since the advent of modern computers, social scientists have analyzed their data with computer software such as the Statistical Package for the Social Sciences (SPSS). Long gone are the days of Raymond Cattell doing factor analysis in his university’s gymnasium. Consequently, most social scientists are not aware of the mathematics behind their statistical analyses. The computer program does the math for them and spits out the final numbers used for interpretation. And that is arguably all the social scientist needs. Graduate training emphasizes interpreting output rather than the mathematical formulas behind it.
However, have you ever wondered what SPSS is doing when you ask it to run your statistical analyses? Are you ever curious how exactly it turns your large spreadsheet of data into a few key numbers in your output? I remember in my first year of graduate statistics asking my professor what formulas I could use to do certain advanced statistics by hand. I liked knowing the math behind my analyses to ensure I understood what was going on. His response was “SPSS’s formulas,” implying the math was now too complicated to do by hand. I accepted his unsatisfying answer at the time, but my curiosity never really dissipated.
Now, for some basic statistical analyses, we already know the math. Everyone learns how to calculate the mean in middle school, which has a relatively straightforward formula: you add up all the scores and divide by the number of scores. In undergraduate statistics courses, most of us also learned how to calculate the variance and standard deviation. We even learned how to obtain the intercept and coefficient from a simple regression model. These slightly more complicated formulas usually involve summing the Xs, the X²s, and maybe even the X*Ys. But what about when there are multiple predictors in your regression model? Most of us don’t learn how to obtain the intercept and set of coefficients from a multiple regression model. Once you move beyond descriptive statistics and univariate models, the mathematics behind statistics is almost all matrix algebra. Although matrix algebra is an introductory course for undergraduate math majors, most social scientists never see a lick of it in undergrad.
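To make that jump concrete, here is how those formulas are commonly written, in standard notation rather than anything taken from the attached document: the familiar sums-of-Xs-and-Ys version for simple regression, and the matrix version that handles any number of predictors at once.

```latex
% Simple regression y = b_0 + b_1 x: slope and intercept built from sums of the Xs, X^2s, and XYs
b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}

% Multiple regression: X is the n-by-(p+1) design matrix (a column of 1s, then the predictors),
% y is the vector of outcomes, and beta-hat holds the intercept and all coefficients at once
\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{y}
```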
I want to take the time to walk you through the basics of matrix algebra and its application to one of the most widely used statistical analyses in social science: linear regression. The formulas I will go over are essentially what your statistical software is computing when you click Analyze -> Regression -> Linear Regression.
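If you want to see that computation in action, here is a minimal sketch in Python with NumPy (not SPSS syntax) of the matrix solution for a multiple regression. The two predictors and the outcome values are made up purely for illustration.

```python
import numpy as np

# Hypothetical example data: two predictors and an outcome (values are made up)
x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = np.array([3.1, 6.8, 7.9, 12.2, 13.0])

# Design matrix: a column of 1s for the intercept, then one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares solution: beta-hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("Intercept and coefficients:", beta_hat)

# In practice, software solves the least-squares problem with a more
# numerically stable routine than an explicit inverse, e.g.:
beta_hat_stable, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Same answer via lstsq:", beta_hat_stable)
```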
This is my first attempt at a mathematical blog post! I am not going to go over the proofs behind the formulas I present, but I likely will in a future blog post. Because Squarespace does not allow for mathematical notation in its blog posts, I wrote the post in a Word doc and have attached it here. Enjoy!
P.S. If you are curious where I learned the math behind this post, check out Dr. Gilbert Strang’s free linear algebra lectures online (you can also find them on iTunes U). Dr. Strang is an adorably awkward mathematics professor from MIT. He and I bonded over my weekly commutes for the past year as I listened to him while driving. Matrix calculus is next…