SEM and multiple (linear) regression in R

After much peer pressure, I've started to play with structural equation modelling (SEM). I thought I'd try John Fox's sem package in R, starting with a simple multiple regression.

I should mention Fox (2006) [Structural Equation Modeling With the sem Package in R. Structural Equation Modeling, 13:465-486] as a useful starting point.

First get the package in:

install.packages("sem")
library(sem)

I'm using simulated data to start with: here are 100 values each for x1, x2, and x3, drawn from Gaussian distributions.

x1 = rnorm(100, 20, 20)
x2 = rnorm(100, 50, 10)
x3 = rnorm(100, 100, 15)

And also some noise:

e = rnorm(100,0,50)

Now set an arbitrarily chosen relationship between y and the x’s:

y = 20*x1 + 3*x2 + 1.5*x3 + 40 + e

And throw it all into a data frame:

thedata = data.frame(x1,x2,x3,y)
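
One caveat: the exact numbers below depend on the random draw, so your results will differ. To make a run reproducible, you could set a seed before the rnorm calls, e.g.

set.seed(1)  # any fixed seed will do; purely for reproducibility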

Eventually, after much impolite language, the following model specification made sense. (Note that specify.model() reads the specification from the console, so the rows below are entered after the call.)

thedata.mod1 = specify.model()
y <-> y, e.y, NA,
x1 <-> x1, e.x1, NA,
x2 <-> x2, e.x2, NA,
x3 <-> x3, e.x3, NA,
y <- x1, bx1, NA,
y <- x2, bx2, NA,
y <- x3, bx3, NA

The first four rows specify the error (variance) terms and the last three specify the regression paths I want to fit. NA in the third column means no starting value is supplied.
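
(As an aside, since the third column holds the starting value, putting a number there should seed the optimiser instead. For example, to start bx1 at 10 the corresponding row would read:

y <- x1, bx1, 10,

though I haven't needed that here.)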

The sem function requires the model specification, a covariance matrix of the data (from cov), and the number of observations (using dim):

sem.thedata1 = sem(thedata.mod1, cov(thedata), N=dim(thedata)[1], debug=T)
summary(sem.thedata1)
summary(lm(y ~ x1 + x2 + x3))

Let’s look at the output for the last two:

> summary(sem.thedata1)

Model Chisquare = 1.7523 Df = 3 Pr(>Chisq) = 0.62537
Chisquare (null model) = 397.79 Df = 6
Goodness-of-fit index = 0.99112
Adjusted goodness-of-fit index = 0.9704
RMSEA index = 0 90% CI: (NA, 0.13780)
Bentler-Bonnett NFI = 0.9956
Tucker-Lewis NNFI = 1.0064
Bentler CFI = 1
SRMR = 0.049717
BIC = -12.063

Normalized Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.861 -0.357 -0.122 -0.139 0.000 0.993

Parameter Estimates
Estimate Std Error z value Pr(>|z|)
e.y 3011.4667 428.11714 7.0342 2.0040e-12 y <--> y
e.x1 390.1715 55.46767 7.0342 2.0040e-12 x1 <--> x1
e.x2 107.8570 15.33320 7.0342 2.0040e-12 x2 <--> x2
e.x3 254.7381 36.21417 7.0342 2.0040e-12 x3 <--> x3
bx1 20.3897 0.28029 72.7436 0.0000e+00 y <--- x1
bx2 3.3891 0.53566 6.3269 2.5015e-10 y <--- x2
bx3 1.7304 0.34732 4.9822 6.2860e-07 y <--- x3

Iterations = 0
> summary(lm(y ~ x1 + x2 + x3))

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
Min 1Q Median 3Q Max
-127.84 -30.55 -5.85 32.10 128.30

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.5809 43.1330 -0.083 0.934
x1 20.3897 0.2846 71.633 < 2e-16 ***
x2 3.3891 0.5440 6.230 1.23e-08 ***
x3 1.7304 0.3527 4.906 3.79e-06 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 55.73 on 96 degrees of freedom
Multiple R-Squared: 0.9817, Adjusted R-squared: 0.9811
F-statistic: 1716 on 3 and 96 DF, p-value: < 2.2e-16

Reassuringly, the estimates for bx1, bx2, and bx3 match the corresponding estimates from lm (the standard errors are close but not identical). I don't yet understand the fit statistics, or the error model.
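
The agreement in the slopes makes sense: the OLS slope estimates are a function of the covariance matrix alone, b = S_xx^{-1} s_xy, and the covariance matrix is exactly what sem was given. A quick sketch to check this by hand, reusing the variable names above:

S = cov(thedata)
xs = c("x1", "x2", "x3")
solve(S[xs, xs], S[xs, "y"])  # should reproduce bx1, bx2, bx3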

There's no intercept, though, which makes sense: sem only saw the covariance matrix, which carries no information about the means. There are apparently multiple ways to get one; one is to use the raw moments matrix.

thedata.mod2 = specify.model()
(Intercept) <-> (Intercept), e.i, NA,
y <-> y, e.y, NA,
x1 <-> x1, e.x1, NA,
x2 <-> x2, e.x2, NA,
x3 <-> x3, e.x3, NA,
y <- x1, bx1, NA,
y <- x2, bx2, NA,
y <- x3, bx3, NA,
y <- (Intercept), b0, NA

raw.thedata = raw.moments(~ ., data = thedata)
sem.thedata2 = sem(thedata.mod2, raw.thedata, N=dim(thedata)[1], debug=T)
summary(sem.thedata2)
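
For what it's worth, my understanding (an assumption on my part, not checked against the package source) is that the raw-moment matrix is just the cross-product of the data with a unit column prepended, divided by N, so unlike the covariance matrix it retains the variable means, which is what makes an intercept estimable. Something along the lines of:

X = cbind("(Intercept)" = 1, as.matrix(thedata))
crossprod(X) / nrow(thedata)  # should resemble raw.thedata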

Running the sem call gave a couple of warnings: "Negative parameter variances. Model is probably underidentified." Let's compare again with lm anyway:

> summary(sem.thedata2)

Model Chisquare = 741.02 Df = 6 Pr(>Chisq) = 0
Chisquare (null model) = 1273 Df = 10
Goodness-of-fit index = 0.37618
Adjusted goodness-of-fit index = -0.55955
RMSEA index = 1.1124 90% CI: (1.0455, 1.1807)
Bentler-Bonnett NFI = 0.41789
Tucker-Lewis NNFI = 0.030061
Bentler CFI = 0.41804
SRMR = 0.63208
BIC = 713.39

Normalized Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 5.89 6.60 6.60 9.65 11.70

Parameter Estimates
Estimate Std Error z value Pr(>|z|)
e.i 1.0000 0.14211 7.0368 1.9669e-12 (Intercept) <--> (Intercept)
e.y 2981.3520 423.70206 7.0364 1.9722e-12 y <--> y
e.x1 689.3882 97.78209 7.0503 1.7859e-12 x1 <--> x1
e.x2 2627.5144 372.62751 7.0513 1.7724e-12 x2 <--> x2
e.x3 10049.2181 1427.00874 7.0422 1.8929e-12 x3 <--> x3
bx1 20.3897 0.26869 75.8850 0.0000e+00 y <--- x1
bx2 3.3891 0.27530 12.3104 0.0000e+00 y <--- x2
bx3 1.7304 NaN NaN NaN y <--- x3
b0 -3.5809 NaN NaN NaN y <--- (Intercept)

Iterations = 0

Aliased parameters: bx3 b0
Warning message:
NaNs produced in: sqrt(diag(object$cov))

The point estimates look okay (bx1, bx2, bx3, and b0 all match lm), but something has clearly gone wrong: bx3 and b0 come back with NaN standard errors and are reported as aliased :-( I shall try again later.
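
In the meantime, another way to recover the intercept without the raw-moments machinery is to plug the covariance-based slopes and the variable means into b0 = mean(y) - bx1*mean(x1) - bx2*mean(x2) - bx3*mean(x3). A sketch, hard-coding the slope estimates printed above:

b = c(bx1 = 20.3897, bx2 = 3.3891, bx3 = 1.7304)  # from the first sem fit
mean(thedata$y) - sum(b * colMeans(thedata[, c("x1", "x2", "x3")]))  # should come out around -3.58, matching lm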

Next up, some latent variable magic. Maybe, if sem allows me, it would also be nice to play with, say, binomially distributed data. And what do all those fit statistics mean…?
