Basic area-level model

The basic area-level model (Fay and Herriot 1979; Rao and Molina 2015) is given by \[ \begin{aligned} y_i | \theta_i &\stackrel{\mathrm{ind}}{\sim} {\cal N} (\theta_i, \psi_i) \,, \\ \theta_i &= \beta' x_i + v_i \,, \end{aligned} \] where \(i\) runs from 1 to \(m\), the number of areas, \(\beta\) is a vector of regression coefficients for given covariates \(x_i\), and \(v_i \stackrel{\mathrm{iid}}{\sim} {\cal N} (0, \sigma_v^2)\) are random area effects. For each area a direct estimate \(y_i\) is available with known sampling variance \(\psi_i\).
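
Conditional on \(\beta\) and \(\sigma_v^2\), normal-normal conjugacy gives the posterior of each \(\theta_i\) in closed form. This standard result is not stated above and is added here only as a sketch of why model-based estimates shrink direct estimates toward the regression line; the symbol \(\gamma_i\) is introduced for this illustration: \[ \theta_i | y_i, \beta, \sigma_v^2 \sim {\cal N} \left( \gamma_i y_i + (1 - \gamma_i) \beta' x_i ,\, \gamma_i \psi_i \right) \,, \qquad \gamma_i = \frac{\sigma_v^2}{\sigma_v^2 + \psi_i} \,. \] Areas with a small variance \(\psi_i\) have \(\gamma_i\) close to 1 and keep most of their direct estimate, whereas areas with a large \(\psi_i\) lean on the regression prediction \(\beta' x_i\).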

First we generate some data according to this model:

m <- 75L  # number of areas
df <- data.frame(
  area=1:m,      # area indicator
  x=runif(m)     # covariate
)
v <- rnorm(m, sd=0.5)    # true area effects
theta <- 1 + 3*df$x + v  # quantity of interest
psi <- runif(m, 0.5, 2) / sample(1:25, m, replace=TRUE)  # given variances
df$y <- rnorm(m, theta, sqrt(psi))

A sampler function for a model with a regression component and a random intercept is created by

library(mcmcsae)
model <- y ~ reg(~ 1 + x, name="beta") + gen(factor = ~iid(area), name="v")
sampler <- create_sampler(
  model,
  family=f_gaussian(var.prior=pr_fixed(1), var.vec = ~ psi),
  linpred="fitted", data=df
)

The meaning of the arguments used is as follows:

  • the first argument is a formula specifying the response variable and the linear predictor to model the mean of the sampling distribution.
  • the family argument selects one of a number of sampling distributions and can pass additional family-dependent arguments. Here the scalar observation-level variance parameter is fixed at 1 and var.vec supplies the unequal variances psi, so that observation i has sampling variance psi[i].
  • linpred="fitted" indicates that we wish to obtain samples from the posterior distribution for the vector \(\theta\) of small area means.
  • data is the data.frame in which variables used in the model specification are looked up.

An MCMC simulation using this sampler function is then carried out as follows:

sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)

A summary of the results is obtained by

(summ <- summary(sim))
## llh_ :
##       Mean  SD t-value  MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -23.4 5.7    -4.1 0.117 -32.9 -23.1 -14.1  2377 0.999
## 
## linpred_ :
##    Mean    SD t-value    MCSE q0.05 q0.5 q0.95 n_eff R_hat
## 1  2.85 0.226   12.58 0.00430  2.48 2.85  3.23  2772 1.000
## 2  3.57 0.212   16.85 0.00387  3.22 3.57  3.92  3000 1.000
## 3  1.80 0.202    8.93 0.00372  1.47 1.80  2.13  2941 1.000
## 4  2.61 0.287    9.11 0.00524  2.15 2.61  3.08  3000 0.999
## 5  2.25 0.208   10.80 0.00394  1.90 2.25  2.59  2793 1.000
## 6  3.39 0.259   13.06 0.00473  2.96 3.39  3.81  3000 0.999
## 7  3.11 0.203   15.32 0.00381  2.78 3.11  3.45  2847 0.999
## 8  2.70 0.276    9.78 0.00526  2.23 2.70  3.15  2751 1.000
## 9  2.27 0.256    8.87 0.00468  1.86 2.27  2.69  3000 1.000
## 10 3.64 0.256   14.22 0.00487  3.22 3.64  4.06  2765 1.000
## ... 65 elements suppressed ...
## 
## beta :
##             Mean    SD t-value    MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 1.02 0.131    7.76 0.00240 0.802 1.02  1.23  3000     1
## x           2.96 0.226   13.14 0.00412 2.595 2.96  3.33  3000     1
## 
## v_sigma :
##          Mean     SD t-value    MCSE q0.05  q0.5 q0.95 n_eff R_hat
## v_sigma 0.469 0.0587    7.98 0.00136 0.378 0.465  0.57  1875     1
## 
## v :
##        Mean    SD t-value    MCSE   q0.05    q0.5   q0.95 n_eff R_hat
## 1   0.55957 0.233  2.3993 0.00437  0.1717  0.5639  0.9349  2845 1.000
## 2   0.30816 0.225  1.3710 0.00410 -0.0604  0.3015  0.6861  3000 1.000
## 3  -0.22721 0.214 -1.0633 0.00390 -0.5804 -0.2257  0.1225  3000 1.000
## 4   0.46068 0.289  1.5943 0.00528 -0.0117  0.4629  0.9222  3000 0.999
## 5  -0.56288 0.218 -2.5879 0.00409 -0.9312 -0.5620 -0.2169  2826 1.000
## 6  -0.46397 0.269 -1.7231 0.00492 -0.8963 -0.4590 -0.0192  3000 1.000
## 7  -0.34243 0.218 -1.5693 0.00411 -0.7108 -0.3477  0.0105  2813 1.000
## 8  -0.62941 0.278 -2.2648 0.00522 -1.0970 -0.6239 -0.1777  2836 1.000
## 9   0.14692 0.258  0.5703 0.00470 -0.2750  0.1399  0.5672  3000 1.000
## 10  0.00776 0.264  0.0294 0.00498 -0.4202  0.0047  0.4335  2812 1.000
## ... 65 elements suppressed ...
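
The columns of the summary tables appear to be related in a simple way: in the rows shown, t-value matches Mean/SD and MCSE matches SD/sqrt(n_eff), up to rounding of the displayed values. The check below is a sketch based only on the printed numbers for the first linpred_ row; these relations are inferred from the output, not taken from the package documentation.

```r
# Values copied from the first linpred_ row of the summary output above
post_mean <- 2.85
post_sd   <- 0.226
n_eff     <- 2772

t_value <- post_mean / post_sd     # compare with the printed t-value 12.58
mcse    <- post_sd / sqrt(n_eff)   # compare with the printed MCSE 0.00430

# small differences are expected because the inputs are rounded for display
round(c(t_value = t_value, MCSE = mcse), 5)
```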

In this example we can compare the model parameter estimates to the ‘true’ parameter values that have been used to generate the data. In the next plots we compare the estimated and ‘true’ random effects, as well as the model estimates and ‘true’ estimands. In the latter plot, the original ‘direct’ estimates are added as red triangles.

plot(v, summ$v[, "Mean"], xlab="true v", ylab="posterior mean"); abline(0, 1)
plot(theta, summ$linpred_[, "Mean"], xlab="true theta", ylab="estimated"); abline(0, 1)
points(theta, df$y, col=2, pch=2)
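
The visual comparison can be complemented by a numerical one. The following base-R sketch does not use the fitted model: it regenerates data with the same recipe as above (with an arbitrary seed, which is an assumption of this sketch) and compares the direct estimates with an 'oracle' shrinkage estimate \(\gamma_i y_i + (1 - \gamma_i) \beta' x_i\), \(\gamma_i = \sigma_v^2 / (\sigma_v^2 + \psi_i)\), that plugs in the true \(\beta\) and \(\sigma_v\). The names gamma, theta_oracle and rmse are introduced here for illustration only.

```r
set.seed(42)   # arbitrary seed (an assumption; the text above does not fix one)
m <- 75L
x <- runif(m)                       # covariate
v <- rnorm(m, sd=0.5)               # true area effects
theta <- 1 + 3*x + v                # quantity of interest
psi <- runif(m, 0.5, 2) / sample(1:25, m, replace=TRUE)  # given variances
y <- rnorm(m, theta, sqrt(psi))     # direct estimates

# 'oracle' shrinkage using the true beta = (1, 3) and sigma_v = 0.5
gamma <- 0.5^2 / (0.5^2 + psi)
theta_oracle <- gamma * y + (1 - gamma) * (1 + 3*x)

# root mean squared error with respect to the true theta
rmse <- function(est) sqrt(mean((est - theta)^2))
c(direct = rmse(y), oracle = rmse(theta_oracle))
```

On average over repeated data sets the shrinkage estimate has the smaller error, mostly because of the areas with large psi. The posterior means in summ$linpred_ can be passed to the same rmse function to see how close the fitted model comes to this benchmark.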

We can compute model selection measures DIC and WAIC by

compute_DIC(sim)
##      DIC    p_DIC 
## 95.89987 49.19739
compute_WAIC(sim, show.progress=FALSE)
##    WAIC1  p_WAIC1    WAIC2  p_WAIC2 
## 66.95183 20.24189 88.48055 31.00625
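
Under the usual definition, DIC is the posterior mean deviance plus the effective number of parameters p_DIC, where the deviance is -2 times the log-likelihood. The printed numbers are consistent with this: the sketch below recovers the mean deviance from the DIC output and compares it with the posterior mean of llh_ in the summary above. That compute_DIC follows exactly this definition is an assumption here, checked only against the displayed values.

```r
# Values copied from the output shown above
DIC      <- 95.89987
p_DIC    <- 49.19739
llh_mean <- -23.4      # posterior mean of llh_, rounded to one decimal

dbar_from_dic <- DIC - p_DIC     # mean deviance implied by the DIC output
dbar_from_llh <- -2 * llh_mean   # mean deviance implied by the summary

# the two agree to within the rounding error of llh_mean (at most 2 * 0.05)
c(from_DIC = dbar_from_dic, from_llh = dbar_from_llh)
```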

Posterior means of residuals can be extracted from the simulation output using the residuals method. Here is a plot of (posterior means of) residuals against covariate \(x\):

plot(df$x, residuals(sim, mean.only=TRUE), xlab="x", ylab="residual"); abline(h=0)

A linear predictor in a linear model can be expressed as a weighted sum of the response variable. If we set compute.weights=TRUE then such weights are computed for all linear predictors specified in argument linpred. In this case it means that a set of weights is computed for each area.

sampler <- create_sampler(
  model,
  family=f_gaussian(var.prior=pr_fixed(1), var.vec = ~ psi),
  linpred="fitted", data=df, compute.weights=TRUE
)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)

Now the weights method returns a matrix of weights, in this case a 75 \(\times\) 75 matrix with elements \(w_{ij}\), the weight of direct estimate \(i\) in linear predictor \(j\). To verify that the weights applied to the direct estimates reproduce the model-based estimates, we plot the weighted averages against the posterior means. The second plot shows, for each area, the weight of its own direct estimate in its predictor against the variance of that direct estimate.

summ <- summary(sim)  # refresh the summary so it matches the new simulation run
plot(summ$linpred_[, "Mean"], crossprod(weights(sim), df$y),
     xlab="estimate", ylab="weighted average")
abline(0, 1)
plot(psi, diag(weights(sim)), ylab="weight")
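
To see why such weights exist, it may help to construct them by hand in a simplified setting. The base-R sketch below treats \(\sigma_v^2\) as known (the model above estimates it), so this is the classical plug-in predictor rather than a reproduction of what weights(sim) computes internally. The data are regenerated with an arbitrary seed, and the names sigma_v2, gamma, H and W are introduced for illustration. Note that row i of W holds the weights for predictor i, which is the transpose of the \(w_{ij}\) convention used above.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
m <- 75L
x <- runif(m)
psi <- runif(m, 0.5, 2) / sample(1:25, m, replace=TRUE)
sigma_v2 <- 0.25                     # assumed known in this sketch
y <- 1 + 3*x + rnorm(m, sd=sqrt(sigma_v2)) + rnorm(m, sd=sqrt(psi))

X <- cbind(1, x)                     # design matrix with intercept
Vinv <- diag(1 / (sigma_v2 + psi))   # inverse marginal covariance of y
gamma <- sigma_v2 / (sigma_v2 + psi) # shrinkage factor per area

# GLS 'hat' matrix: H %*% y gives the regression predictions X %*% beta_hat
H <- X %*% solve(crossprod(X, Vinv %*% X), crossprod(X, Vinv))
# weight matrix: shrink each direct estimate toward its regression prediction
W <- gamma * diag(m) + (1 - gamma) * H
theta_w <- drop(W %*% y)             # predictor as weighted sum of y

# the same predictor computed directly, without forming W
beta_hat <- solve(crossprod(X, Vinv %*% X), crossprod(X, Vinv %*% y))
theta_direct <- gamma * y + (1 - gamma) * drop(X %*% beta_hat)
all.equal(theta_w, theta_direct)
```

In this construction each row of W sums to 1, and the diagonal element, the weight of an area's own direct estimate, decreases as psi grows.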

References

Fay, R. E., and R. A. Herriot. 1979. “Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data.” Journal of the American Statistical Association 74 (366): 269–77.
Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.