PSYC 7804 - Regression with Lab
We want to see how happiness (happiness) relates to age (age) and number of friends (friends).
We can calculate the correlations among our variables.
Although useful, correlation by itself may not be the measure we are interested in. Rather, we may want to know the relationship between two variables after accounting for another set of variables.
Partial correlation and semi-partial correlation are what we are looking for in this case.
If \(r_{xy}\) represents the correlation between \(x\) and \(y\), then
Partial correlation:
\[r_{xy.z} = \frac{r_{xy} - r_{xz}r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}\]
where \(z\) is controlled for in both \(x\) and \(y\)
Semi-partial correlations:
\[r_{y(x.z)} = \frac{r_{xy} - r_{xz}r_{yz}}{\sqrt{1 - r_{xz}^2}}\]
where \(z\) is controlled for only in \(x\)
\[r_{x(y.z)} = \frac{r_{xy} - r_{xz}r_{yz}}{\sqrt{1 - r_{yz}^2}}\]
where \(z\) is controlled for only in \(y\)
Spot differences and similarities in related formulas
Notice that the numerator (top part) of these formulas is always the same, and the denominator (bottom part) changes slightly. This means that if you want to understand the differences, you should not look at the numerator at all (pretend it does not exist!), but only at the denominator.
As always, I like calculating things by hand first to make sure I know what is going on 🤷 Let’s say that \(y =\) happiness, \(x =\) age, and \(z =\) friends.
Let’s first get our vanilla correlations
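Since the lab's data frame is not reproduced here, here is a minimal by-hand sketch with simulated stand-in data (the variable names mirror the text; the coefficients in the simulation are arbitrary, so the numbers will differ from the ones reported below):

```r
set.seed(123)
n <- 100
friends   <- rpois(n, lambda = 5)                   # z
age       <- rnorm(n, mean = 40, sd = 10)           # x
happiness <- 0.3 * friends - 0.02 * age + rnorm(n)  # y

# Vanilla (zero-order) correlations
r_xy <- cor(age, happiness)
r_xz <- cor(age, friends)
r_yz <- cor(happiness, friends)

# Partial correlation: friends removed from BOTH age and happiness
r_xy.z <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

# Semi-partial correlations: friends removed from only ONE variable
r_y_xz <- (r_xy - r_xz * r_yz) / sqrt(1 - r_xz^2)  # removed from age only
r_x_yz <- (r_xy - r_xz * r_yz) / sqrt(1 - r_yz^2)  # removed from happiness only
```

Note that the semi-partial correlations can never exceed the partial correlation in absolute value, since their denominators are larger.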
So, \(r_{xy.z} = -.16\) is the correlation between happiness and age after taking out the proportion of variance explained by friends in both variables.
The semi-partial correlations between \(x\) and \(y\) accounting for \(z\) can be either:
\(r_{y(x.z)} = -.15\), the relationship between \(y\) and \(x\) after removing from \(x\) the variance explained by \(z\).
\(r_{x(y.z)} = -.16\), the relationship between \(x\) and \(y\) after removing from \(y\) the variance explained by \(z\).
ppcor Functions
In practice, we use the ppcor package.
happiness age friends
happiness 1.0000000 -0.15415119 0.32093907
age -0.1629192 1.00000000 0.03298535
friends 0.3251053 0.03161529 1.00000000
The functions from ppcor work in the same way as the ones used in Lab 2, where significance and other information is saved to a separate element.
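Under the hood, these estimates are equivalent to correlating regression residuals. A base-R sketch of that equivalence (simulated stand-in data; this is not the ppcor API itself, just what it computes):

```r
set.seed(42)
n <- 100
friends   <- rpois(n, 5)
age       <- rnorm(n, 40, 10)
happiness <- 0.3 * friends - 0.02 * age + rnorm(n)

# Partial correlation: correlate the parts of age and happiness
# that friends does NOT explain
age_res <- resid(lm(age ~ friends))
hap_res <- resid(lm(happiness ~ friends))
partial_xy <- cor(age_res, hap_res)

# Semi-partial: correlate raw happiness with the friends-free part of age
semipartial_y_xz <- cor(happiness, age_res)
```

Both quantities match the formula-based versions above exactly, which is a useful sanity check when learning the notation.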
We are often (actually always) interested in comparing different models. One often relevant question is: “is it worth it to add more variables as predictors?”
In the case of regression, the most popular way of comparing models is by comparing \(R^2\) from nested models. Let’s go back to how happiness (happiness) relates to age (age); then, we want to know whether adding number of friends (friends) improves our prediction.
The answer seems fairly straightforward: the model with both age and friends does better. But there is a catch with \(R^2\) 🤨
The “problem” with most popular measures of model fit such as \(R^2\) is that they always increase as variables are added, even when the added variables are unrelated to \(Y\).
\(R^2\) increases when we add a random variable 😦
However, this is certainly a case where adding a randomly generated variable is not worth it.
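To see this in action, a quick sketch with simulated data, using a pure-noise predictor named random_var to mirror the text:

```r
set.seed(1)
n <- 100
df <- data.frame(age = rnorm(n, 40, 10), friends = rpois(n, 5))
df$happiness  <- 0.3 * df$friends - 0.02 * df$age + rnorm(n)
df$random_var <- rnorm(n)  # pure noise, unrelated to happiness

fit_small <- lm(happiness ~ age + friends, data = df)
fit_big   <- lm(happiness ~ age + friends + random_var, data = df)

summary(fit_small)$r.squared    # R^2 without the junk predictor
summary(fit_big)$r.squared      # always at least as large, despite the junk
summary(fit_big)$adj.r.squared  # adjusted R^2 penalizes the extra parameter
```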
We want to apply the principle of Occam’s razor, and choose the simpler model if the increase in \(R^2\) is not worth it.
In hierarchical regression we check whether adding variables to a model yields a significant \(R^2\) improvement (\(\Delta R^2\)). We can use the anova() function to compare nested regression models. The F-test tells us whether the \(R^2\) improvement is significant.
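A sketch of the calls that produce tables like the one below (simulated stand-in data, since the lab's data frame is not reproduced here):

```r
set.seed(7)
n <- 100
df <- data.frame(age = rnorm(n, 40, 10), friends = rpois(n, 5))
df$happiness <- 0.3 * df$friends - 0.02 * df$age + rnorm(n)

reg_age    <- lm(happiness ~ age, data = df)            # Model 1
reg_age_fr <- lm(happiness ~ age + friends, data = df)  # Model 2

# F-test on the R^2 improvement from adding friends
comparison <- anova(reg_age, reg_age_fr)
print(comparison)

# Delta R^2 itself, computed directly
delta_r2 <- summary(reg_age_fr)$r.squared - summary(reg_age)$r.squared
```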
Analysis of Variance Table
Model 1: happiness ~ age
Model 2: happiness ~ age + friends
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 234.6
2 97 209.8 1 24.807 11.47 0.001023 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
friends provides a significant improvement in variance explained, \(\Delta R^2 = .1, F(1, 97) = 11.47, p = .001\)
Analysis of Variance Table
Model 1: happiness ~ age + friends
Model 2: happiness ~ age + friends + random_var
Res.Df RSS Df Sum of Sq F Pr(>F)
1 97 209.80
2 96 208.04 1 1.7575 0.811 0.3701
This time the improvement is not significant: adding random_var is not worth it, \(F(1, 96) = 0.81, p = .370\).
There also exists a family of statistics for comparing models based on information theory. What you will most often see are the Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978).
Both AIC and BIC calculate model fit, and then penalize it based on model complexity (number of estimated parameters). The logic is that if you are going to add an extra variable to a regression, it should improve fit enough to offset the penalty term. For regression:
\[AIC = N \times \ln\left(\frac{SS_{\mathrm{residuals}}}{N}\right) + 2p\]
\(N =\) sample size
\(p =\) number of model parameters (i.e., intercept, slopes, residual variance)
\(SS_{\mathrm{residuals}} =\) sum of squares of the residuals
\(\ln() =\) the natural logarithm
\[BIC = N \times \ln\left(\frac{SS_{\mathrm{residuals}}}{N}\right) + p \times \ln(N)\]
The \(N \times \ln(\frac{SS_{\mathrm{residuals}}}{N})\) part, which is shared by both formulas, measures model fit. In contrast, \(2p\) is the AIC penalty, while \(p \times \ln(N)\) is the BIC penalty. The BIC penalty is stricter whenever \(N \geq 8\) (since \(\ln(8) > 2\)).
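These formulas can be checked by hand against a fitted model. A sketch with simulated data; note that R's built-in AIC()/BIC() add the constant \(N \times \ln(2\pi) + N\) from the full Gaussian log-likelihood, which does not affect model rankings:

```r
set.seed(99)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
fit <- lm(y ~ x)

ss_res <- sum(resid(fit)^2)
p <- 3  # intercept, slope, residual variance

# The textbook formulas above
aic_hand <- n * log(ss_res / n) + 2 * p
bic_hand <- n * log(ss_res / n) + p * log(n)

# Adding the Gaussian constant recovers R's AIC() and BIC() exactly
aic_hand + n * (log(2 * pi) + 1)  # = AIC(fit)
bic_hand + n * (log(2 * pi) + 1)  # = BIC(fit)
```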
AIC() and BIC() Functions
We can calculate AIC and BIC for all our models at once with the AIC() and BIC() functions. For both AIC and BIC, the model with the smallest value fits best according to the criterion.
df AIC
reg_age 3 375.0604
reg_age_fr 4 365.8845
reg_age_fr_rand 5 367.0433
The second model fits best.
In both cases the second model fits best: the random variable does not improve fit enough to offset the penalty for the additional parameter (which is to be expected, since we added a random variable).
That said, AIC and BIC can disagree (often, in my experience!). Usually AIC favors more complex models, while BIC prefers simpler ones. If I have to choose, I tend to trust BIC more.
Finally, information criteria like AIC and BIC can be used to compare non-nested models (fitted to the same data), unlike \(\Delta R^2\), which can only compare nested models.
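For example, happiness ~ age and happiness ~ friends are non-nested (neither predictor set contains the other), so anova() would not give a valid test, but an AIC/BIC comparison still works. A sketch with simulated stand-in data:

```r
set.seed(5)
n <- 100
df <- data.frame(age = rnorm(n, 40, 10), friends = rpois(n, 5))
df$happiness <- 0.3 * df$friends - 0.02 * df$age + rnorm(n)

# Two NON-nested models: neither contains the other's predictors
m_age     <- lm(happiness ~ age, data = df)
m_friends <- lm(happiness ~ friends, data = df)

# AIC()/BIC() accept several models at once and return a comparison table
aic_tab <- AIC(m_age, m_friends)
bic_tab <- BIC(m_age, m_friends)
print(aic_tab)
```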
There are a couple of caveats to be aware of when using AIC and BIC.
Different functions/software will calculate AIC and BIC differently (I was confused for a bit myself before finding this and this). You should not compare AIC and BIC if they come from different functions/software (or, if you need to, be very careful).
The AIC() function adds \(N \times \ln(2\pi) + N\) to the AIC (it is derived from the full Gaussian log-likelihood), but other functions may not. So always use the same function/software to calculate AIC and BIC.
Lab 6: Semi-Partial, Partial-Correlations, and Model Comparison