# SEM and multiple (linear) regression in R

After much peer pressure I’ve started to play with SEM. I thought I’d try John Fox’s sem package in R, starting with a simple multiple regression.

I should mention Fox (2006) [“Structural Equation Modeling With the sem Package in R”, *Structural Equation Modeling*, 13:465–486] as a useful starting point.

First get the package in:

```r
install.packages("sem")
library(sem)
```

Using simulated data to start with, here are 100 values for x1, x2, and x3 from Gaussian distributions:

```r
x1 = rnorm(100, 20, 20)
x2 = rnorm(100, 50, 10)
x3 = rnorm(100, 100, 15)
```

And also some noise:

```r
e = rnorm(100, 0, 50)
```

Now set an arbitrarily chosen relationship between y and the x’s:

```r
y = 20*x1 + 3*x2 + 1.5*x3 + 40 + e
```

And throw this all into a data frame:

```r
thedata = data.frame(x1, x2, x3, y)
```
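A quick sanity check on the simulated data (no seed is set, so exact values vary from run to run):

```r
# Sample means should land near the targets 20, 50 and 100, and y should
# correlate strongly with x1, which carries the largest coefficient
colMeans(thedata[, c("x1", "x2", "x3")])
cor(thedata$x1, thedata$y)
```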

Eventually, after much impolite language, the following model specification made sense:

```r
thedata.mod1 = specify.model()
y <-> y,   e.y,  NA,
x1 <-> x1, e.x1, NA,
x2 <-> x2, e.x2, NA,
x3 <-> x3, e.x3, NA,
y <- x1,   bx1,  NA,
y <- x2,   bx2,  NA,
y <- x3,   bx3,  NA
```

The first four rows specify the error terms, the last three specify the regression I want to fit. NA is used to specify no starting value.
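As a sanity check on the specification, the degrees of freedom work out: a four-variable covariance matrix has 10 unique elements, and the model above estimates 7 free parameters.

```r
# 4 observed variables (y, x1, x2, x3) give p(p+1)/2 = 10 unique
# variances/covariances; the model frees 4 error variances + 3 paths = 7,
# leaving 10 - 7 = 3 degrees of freedom (the "Df = 3" sem reports)
p = 4
moments = p * (p + 1) / 2
free = 7
moments - free   # 3
```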

The *sem* function requires the model specification, a covariance matrix of the data (from *cov*), and the number of observations (from *dim*):

```r
sem.thedata1 = sem(thedata.mod1, cov(thedata), N=dim(thedata)[1], debug=T)
summary(sem.thedata1)
summary(lm(y ~ x1 + x2 + x3))
```

Let’s look at the output for the last two:

```
> summary(sem.thedata1)

 Model Chisquare =  1.7523   Df =  3   Pr(>Chisq) = 0.62537
 Chisquare (null model) =  397.79   Df =  6
 Goodness-of-fit index =  0.99112
 Adjusted goodness-of-fit index =  0.9704
 RMSEA index =  0   90% CI: (NA, 0.13780)
 Bentler-Bonnett NFI =  0.9956
 Tucker-Lewis NNFI =  1.0064
 Bentler CFI =  1
 SRMR =  0.049717
 BIC =  -12.063

 Normalized Residuals
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -0.861  -0.357  -0.122  -0.139   0.000   0.993

 Parameter Estimates
       Estimate   Std Error  z value  Pr(>|z|)
 e.y   3011.4667  428.11714   7.0342  2.0040e-12  y <--> y
 e.x1   390.1715   55.46767   7.0342  2.0040e-12  x1 <--> x1
 e.x2   107.8570   15.33320   7.0342  2.0040e-12  x2 <--> x2
 e.x3   254.7381   36.21417   7.0342  2.0040e-12  x3 <--> x3
 bx1     20.3897    0.28029  72.7436  0.0000e+00  y <--- x1
 bx2      3.3891    0.53566   6.3269  2.5015e-10  y <--- x2
 bx3      1.7304    0.34732   4.9822  6.2860e-07  y <--- x3

 Iterations =  0
```

```
> summary(lm(y ~ x1 + x2 + x3))

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
     Min       1Q   Median       3Q      Max
 -127.84   -30.55    -5.85    32.10   128.30

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -3.5809     43.1330   -0.083     0.934
x1            20.3897      0.2846   71.633   < 2e-16 ***
x2             3.3891      0.5440    6.230  1.23e-08 ***
x3             1.7304      0.3527    4.906  3.79e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.73 on 96 degrees of freedom
Multiple R-Squared: 0.9817,     Adjusted R-squared: 0.9811
F-statistic:  1716 on 3 and 96 DF,  p-value: < 2.2e-16
```

Reassuringly, the estimates for bx1, bx2, and bx3 match the corresponding estimates from *lm*. I don’t yet understand the fit statistics. I also don’t understand the error model.
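The match with *lm* is no accident: for this model the path estimates are just the OLS slopes, which can be pulled straight out of the covariance matrix. A quick base-R check, using `thedata` from above:

```r
# OLS slopes from the covariance matrix: b = S_xx^{-1} S_xy --
# the same quantities the path model recovers from cov(thedata)
S = cov(thedata)
xs = c("x1", "x2", "x3")
b = solve(S[xs, xs], S[xs, "y"])
all.equal(unname(b), unname(coef(lm(y ~ x1 + x2 + x3, data = thedata))[-1]))
# TRUE
```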

There’s no intercept in this model. Apparently there are multiple ways to get one; one is to use the raw moments matrix in place of the covariance matrix.
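If I’ve understood it correctly, the raw moments matrix is just the matrix of mean cross-products X'X/n, with a constant column of 1s standing in for the intercept. A base-R sketch of what *raw.moments* computes, assuming `thedata` from above:

```r
# Raw (uncentered) moment matrix: t(X) %*% X / n, including a column of 1s
# for the intercept -- used in place of cov() so an intercept path can be fit
X = cbind("(Intercept)" = 1, as.matrix(thedata))
raw = crossprod(X) / nrow(X)
raw
```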

```r
thedata.mod2 = specify.model()
(Intercept) <-> (Intercept), e.i, NA,
y <-> y,    e.y,  NA,
x1 <-> x1,  e.x1, NA,
x2 <-> x2,  e.x2, NA,
x3 <-> x3,  e.x3, NA,
y <- x1,    bx1,  NA,
y <- x2,    bx2,  NA,
y <- x3,    bx3,  NA,
y <- (Intercept), b0, NA
```

```r
raw.thedata = raw.moments(~ ., data = thedata)
sem.thedata2 = sem(thedata.mod2, raw.thedata, N=dim(thedata)[1], debug=T)
summary(sem.thedata2)
```

Running this gave a couple of warnings: “Negative parameter variances. Model is probably underidentified.” Let’s compare with *lm* anyway:

```
> summary(sem.thedata2)

 Model Chisquare =  741.02   Df =  6   Pr(>Chisq) = 0
 Chisquare (null model) =  1273   Df =  10
 Goodness-of-fit index =  0.37618
 Adjusted goodness-of-fit index =  -0.55955
 RMSEA index =  1.1124   90% CI: (1.0455, 1.1807)
 Bentler-Bonnett NFI =  0.41789
 Tucker-Lewis NNFI =  0.030061
 Bentler CFI =  0.41804
 SRMR =  0.63208
 BIC =  713.39

 Normalized Residuals
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    5.89    6.60    6.60    9.65   11.70

 Parameter Estimates
       Estimate    Std Error   z value  Pr(>|z|)
 e.i       1.0000     0.14211   7.0368  1.9669e-12  (Intercept) <--> (Intercept)
 e.y    2981.3520   423.70206   7.0364  1.9722e-12  y <--> y
 e.x1    689.3882    97.78209   7.0503  1.7859e-12  x1 <--> x1
 e.x2   2627.5144   372.62751   7.0513  1.7724e-12  x2 <--> x2
 e.x3  10049.2181  1427.00874   7.0422  1.8929e-12  x3 <--> x3
 bx1      20.3897     0.26869  75.8850  0.0000e+00  y <--- x1
 bx2       3.3891     0.27530  12.3104  0.0000e+00  y <--- x2
 bx3       1.7304         NaN      NaN         NaN  y <--- x3
 b0       -3.5809         NaN      NaN         NaN  y <--- (Intercept)

 Iterations =  0

 Aliased parameters: bx3 b0

Warning message:
NaNs produced in: sqrt(diag(object$cov))
```

The estimates look okay but something has gone wrong :-( I shall try again later.
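In the meantime, there’s a simpler route to the intercept for a plain regression: fit the slopes from the covariance matrix and recover the intercept from the variable means. A base-R sketch, assuming `thedata` from above:

```r
# Slopes from the covariance matrix: b = S_xx^{-1} S_xy
S = cov(thedata)
xs = c("x1", "x2", "x3")
b = solve(S[xs, xs], S[xs, "y"])
# The intercept then falls out of the means: b0 = mean(y) - sum(b * mean(x))
b0 = mean(thedata$y) - sum(b * colMeans(thedata[, xs]))
b0   # equals coef(lm(y ~ x1 + x2 + x3, data = thedata))[1]
```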

Next up, some latent variable magic. Maybe, if *sem* allows me, it would also be nice to play with, say, binomially distributed data. And what do all those fit statistics mean…?