How is the survey R package estimating covariances between domain estimates?

An example and an explanation of the package’s new method of using influence functions to estimate covariances

statistics
surveys
Author

Ben Schneider

Published

September 3, 2021

A nice recent addition to the survey R package is the ability to easily estimate between-domain covariances for several different kinds of statistics. For example, let’s say you want to estimate the ratio of total hospitalizations to total Covid-19 cases in two populations, such as vaccinated and unvaccinated persons. The two estimates have sampling variances as well as a sampling covariance. The covariance is important to take into account when estimating basic quantities such as the difference in the ratios between the two populations, as we know from basic stats rules such as the following:

\[ \begin{aligned} Var(R_{vax} - R_{unvax}) = Var(R_{vax}) + Var(R_{unvax}) - 2 \times Cov(R_{vax}, R_{unvax}) \end{aligned} \]

Previously, the only way to estimate between-domain covariances was to use replication methods such as the bootstrap or BRR, which meant you had to create replicate weights for your data. That did the job, but it required extra care to make sure you chose an appropriate replication method with a sufficient number of replicates. From a research perspective, it also made your results a bit harder for others to reproduce: to get exactly the same variance estimates as you, someone else would need either your file of replicate weights or the same software and the exact same random seed you used to generate them.

Now, with version 4.0 of the survey package you can use the familiar method of linearization to estimate the covariances. I’ll give an example of how to do this and then provide a little bit of background for how this works.

An Example

Here’s an example of how you would estimate ratios in two domains (i.e. subpopulations) and their covariances. We’ll use the apiclus1 survey data from the survey package, which represents a cluster sample of schools in California. With this data, we’ll estimate the percentage of students eligible for subsidized meals (i.e. the ratio of total students eligible for subsidized meals to the total number of students), separately for elementary school and middle school students. If we want to estimate the difference in percentages and get an accompanying variance estimate, then we need to estimate the covariance between the elementary-school estimate and the middle-school estimate.

First you’d set up your survey design object (i.e. specifying weights and other sample design variables such as strata/cluster IDs).

library(survey)

data(api)
apiclus1$total_elig_sub <- round((apiclus1$meals/100) * apiclus1$enroll)
tsl_svy_design <- svydesign(id = ~ dnum, weights = ~ pw, data = apiclus1, fpc = ~ fpc)

Prior to version 4.0 of the survey package, you would need to create replicate weights (e.g. bootstrap weights) for your design.

rep_svy_design <- survey::as.svrepdesign(tsl_svy_design, type = "bootstrap",
                                         replicates = 200)

To estimate the percentage of students eligible for subsidized meals, you’d use svyratio(~ total_elig_sub, ~ enroll). To get these estimates separately for elementary school and middle school students, you’d use the svyby() function. And if you want to make sure to estimate the covariance between the two estimates, then you’d need to specify covmat = TRUE.

tsl_ratio_estimates <- survey::svyby(~ total_elig_sub, denominator = ~ enroll,
                                     FUN = svyratio, covmat = TRUE,
                                     by = ~ stype,
                                     design = subset(tsl_svy_design, stype %in% c("E", "M")))

Below are the estimates along with their standard errors. We’d estimate that roughly 54% of elementary school students are eligible for subsidized meals compared to 41% of middle school students.

print(tsl_ratio_estimates)
  stype total_elig_sub/enroll se.total_elig_sub/enroll
E     E             0.5430043               0.07552383
M     M             0.4096155               0.05852004

The difference between the two domains seems to be about 13 percentage points. If we want to estimate the sampling variance of the difference, then we need to get the variances of the two estimates along with their covariance.

\[ \begin{aligned} Var(R_{E} - R_{M}) = Var(R_{E}) + Var(R_{M}) - 2 \times Cov(R_{E}, R_{M}) \end{aligned} \]

We can get this using the covariance matrix.

vcov_matrix <- vcov(tsl_ratio_estimates)

cov_of_estimates <- vcov_matrix['E','M']
print(cov_of_estimates)
[1] 0.0039271

Now we’d estimate the difference and its sampling variance as follows:

# Get the estimates
r_e <- subset(tsl_ratio_estimates, stype == "E")[['total_elig_sub/enroll']]
r_m <- subset(tsl_ratio_estimates, stype == "M")[['total_elig_sub/enroll']]

# Calculate their difference
diff <- r_e - r_m

# Calculate the variance of their difference
var_diff <- vcov_matrix['E', 'E'] + vcov_matrix['M','M'] - 2 * vcov_matrix['E','M']

# Calculate the standard error
se_diff <- sqrt(var_diff)

# Print as formatted output
sprintf(paste("%s%% is the estimated difference",
              "%s%% is the estimated variance of the difference",
              "%s%% is the estimated standard error of the difference",
              sep = "\n"),
        100*round(diff, 2), 100*round(var_diff, 4), 100*round(se_diff, 3)) |> writeLines()
13% is the estimated difference
0.13% is the estimated variance of the difference
3.6% is the estimated standard error of the difference

So we’d estimate that the share of students eligible for subsidized meals is about 13 percentage points higher in elementary schools than in middle schools. And since the standard error of this estimated difference is only about 3.6 percentage points, the difference is large relative to its sampling variability, so it likely reflects a real population difference rather than just sampling variability.

If we had ignored the covariance, our estimated standard error would have been noticeably larger, since for a difference a positive covariance reduces the variance:

var_diff_ignoring_covariance <- vcov_matrix['E', 'E'] + vcov_matrix['M','M']
se_diff_ignoring_covariance <- sqrt(var_diff_ignoring_covariance)

sprintf("%s%%", 100*round(se_diff_ignoring_covariance, 3))
[1] "9.6%"
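The whole difference-and-variance calculation can also be done in one step with svycontrast(), which pulls in the full covariance matrix automatically. Here's a sketch; the contrast names E and M are my assumption about how the svyby coefficients are labeled, so check names(coef(tsl_ratio_estimates)) on your own output before relying on this:

```r
library(survey)

# Same setup as above
data(api)
apiclus1$total_elig_sub <- round((apiclus1$meals / 100) * apiclus1$enroll)
tsl_svy_design <- svydesign(id = ~ dnum, weights = ~ pw,
                            data = apiclus1, fpc = ~ fpc)

tsl_ratio_estimates <- svyby(~ total_elig_sub, denominator = ~ enroll,
                             FUN = svyratio, covmat = TRUE, by = ~ stype,
                             design = subset(tsl_svy_design, stype %in% c("E", "M")))

# Contrast of the two domain ratios; names assumed to match the domains
diff_est <- svycontrast(tsl_ratio_estimates, c(E = 1, M = -1))
coef(diff_est)  # estimated difference in ratios
SE(diff_est)    # standard error that accounts for the covariance
```

Note that covmat = TRUE is still needed so that svyby() retains the full covariance matrix for svycontrast() to use.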

How does this work?

The release notes for version 4.0 of the survey package provide only a minimal explanation of how the covariances are estimated.

4.0 Some (and eventually nearly all) functions now return influence functions when called with a survey.design2 object and the influence=TRUE option. These allow svyby() to estimate covariances between domains, which could previously only be done for replicate-weight designs, and so allow svycontrast() to do domain contrasts - svymean, svytotal, svyratio, svymle, svyglm, svykappa

Dr. Thomas Lumley provides a little bit of background in this blog post.
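You can also inspect these influence functions directly: per the release notes, passing influence = TRUE to an estimation function attaches them to the result. A sketch using the same design as above; I'm assuming the values come back as an attribute named "influence" (one row per sampled school), which is worth verifying on your installed version:

```r
library(survey)

data(api)
tsl_svy_design <- svydesign(id = ~ dnum, weights = ~ pw,
                            data = apiclus1, fpc = ~ fpc)

# Request the influence functions alongside the estimate
mean_est <- svymean(~ api00, tsl_svy_design, influence = TRUE)

# Assumed attribute name: "influence"
infl <- attr(mean_est, "influence")
head(infl)  # one influence value per sampled school
```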

If you’re like me, you were confused when you saw the use of the words “influence functions.” What on god’s green earth is an influence function? And why are they useful for variance estimation in surveys?

Influence functions are not something that are typically taught to statistics students outside of the area of robust statistics or advanced sampling theory. There’s nothing in my favorite survey statistics textbooks: Dr. Sharon Lohr’s “Sampling: Design and Analysis”; Heeringa, West, and Berglund’s “Applied Survey Data Analysis”; Rao and Molina’s “Small Area Estimation”, etc. And they’re never mentioned by name in the standard reference, “Introduction to Variance Estimation” by Dr. Kirk Wolter.

After an embarrassing amount of time spent searching, I found that the underlying theory for using influence functions to estimate variances in surveys seems to originate from the following paper:

Deville, 1999. “Variance estimation for complex statistics and estimators: Linearization and residual techniques”

And a very brief summary of the Deville 1999 paper is given in the first couple pages of the following paper:

Goga, Deville, Ruiz-Gazen, 2009. “Use of functionals in linearization and composite estimation with application to two-sample survey data”

But so far, the most helpful resource I’ve seen on this topic is the following paper, which includes an overview of how influence functions are used for estimating variances of survey statistics and contrasts it to the more familiar method of Taylor Series Linearization:

Osier, 2009. “Variance estimation for complex indicators of poverty and inequality using linearization techniques.”

In a nutshell, the idea of the influence function method is this. Let’s say you estimate a population quantity such as a ratio with a statistic that is a “substitution estimator”, i.e. the population quantity calculated as if your (weighted) sample were the population. If your sample is “large enough”, you can estimate that statistic’s sampling variance by defining a special variable \(z_i\) for each respondent and estimating the sampling variance of the weighted total \(\hat{Z}=\sum_{i \in s} w_i z_i\). In other words, you make the following approximation:

\[ \begin{aligned} Var(\hat{\theta}) &\approx Var(\hat{Z}) \\ \text{where } \hat{Z} &= \sum_{i \in s} w_i z_i \end{aligned} \]

This works for vectors of parameters/statistics, too. For example, you could define a parameter vector \(\theta= \left[\theta_{E} \ \theta_M\right]\) where \(\theta_E\) is the parameter for elementary school students and \(\theta_M\) is the parameter for middle school students. In that case, you’d define variables \(z_{E,i}\) and \(z_{M,i}\) and estimate the variance-covariance matrix of the vector-valued statistic \(\left[\hat{Z}_E \ \hat{Z}_M\right]\).
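To see where a nonzero between-domain covariance comes from in a cluster sample like apiclus1, here's a small base-R sketch with synthetic data (equal weights and the usual with-replacement variance estimator are assumed). The influence-function values are aggregated to cluster (PSU) totals, and because each cluster contributes to both domains, the two columns of cluster totals co-vary:

```r
set.seed(1)
n_clusters <- 10
units_per  <- 6
cluster <- rep(seq_len(n_clusters), each = units_per)
domain  <- rep(c("E", "M"), times = n_clusters * units_per / 2)
w <- rep(20, n_clusters * units_per)             # equal sampling weights (assumed)
y <- rnorm(n_clusters * units_per, mean = 10) +  # outcome with a shared
  rep(rnorm(n_clusters), each = units_per)       # cluster effect

# Estimated influence functions for the two domain means:
# z_i = (y_i - ybar_domain) / N_hat_domain inside the domain, 0 outside
z <- sapply(c("E", "M"), function(d) {
  in_d   <- domain == d
  N_d    <- sum(w[in_d])
  ybar_d <- sum(w[in_d] * y[in_d]) / N_d
  ifelse(in_d, (y - ybar_d) / N_d, 0)
})

# Aggregate w_i * z_i to cluster (PSU) totals, then apply the usual
# with-replacement variance estimator to the matrix of cluster totals
wz_psu   <- rowsum(w * z, cluster)
centered <- sweep(wz_psu, 2, colMeans(wz_psu))
V <- n_clusters / (n_clusters - 1) * crossprod(centered)
V  # 2x2 matrix: diagonal = variances, off-diagonal = between-domain covariance
```

If the data were an unclustered simple random sample, the two influence-function columns would have disjoint support and the estimated covariance would be exactly zero; it is the aggregation to shared clusters that makes the domain estimates co-vary.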

So, nu, how do we define those special variables \(z_i\)? We define them to be influence functions! For a given population quantity \(T\) (e.g. a mean or a ratio) and a given population member \(i\), we can calculate an influence function. Very roughly speaking, the influence function indicates how the population quantity \(T\) is influenced by the presence of person \(i\) in the population. For example, if \(T\) is a population total of a variable \(y\), then the influence function of \(T\) for person \(i\) is simply \(y_i\): person \(i\) contributes \(y_i\) to the population total.
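For the ratio statistics estimated earlier, the influence function is only slightly more involved. A standard result (given, for example, in the Deville 1999 and Osier 2009 papers above) for the ratio \(R = Y/X\) of two population totals is:

\[ \begin{aligned} z_i = \frac{1}{X}\left(y_i - R \, x_i\right) \end{aligned} \]

Intuitively, person \(i\) pulls the ratio up when their \(y_i\) is large relative to \(R \, x_i\), and the \(1/X\) factor scales that influence by the size of the denominator.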

Now, there’s some more nuance to this idea which we’ve ignored so far. For instance, the influence function for most statistics (e.g. the mean) depends on unknown population quantities, and so we have to estimate the influence function by plugging parameter estimates into its formula. For example, the influence function of a mean \(\bar{Y}\) is \(z_i = \frac{1}{N}(y_i - \bar{Y})\), and so it must be estimated as \(\tilde{z}_i = \frac{1}{\hat{N}}(y_i - \hat{\bar{Y}})\), which gives the variance estimator \(var(\hat{\bar{Y}}) = var \left( \sum_{i \in s} w_i \tilde{z}_i \right)\).
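To make the recipe concrete, here's a small base-R sketch (synthetic data; a simple random sample with replacement and equal weights are assumed) showing that plugging the estimated influence function of a mean into the usual with-replacement variance estimator reproduces the familiar \(s^2/n\):

```r
set.seed(42)
n <- 50
N <- 1000                    # assumed population size
y <- rnorm(n, mean = 10, sd = 2)
w <- rep(N / n, n)           # equal weights for a simple random sample

# Estimated influence function of the mean: z_i = (y_i - ybar_hat) / N_hat
N_hat    <- sum(w)
ybar_hat <- sum(w * y) / N_hat
z        <- (y - ybar_hat) / N_hat

# With-replacement variance estimator of the weighted total sum(w * z)
wz    <- w * z
v_hat <- n / (n - 1) * sum((wz - mean(wz))^2)

c(influence = v_hat, textbook = var(y) / n)  # the two agree exactly
```

In this equal-weight case the algebra collapses: \(w_i \tilde{z}_i = (y_i - \bar{y})/n\), so the estimator reduces to \(s^2/n\) exactly, not just approximately.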

But that’s the essence of the idea. In a future post, I’ll try to provide more explanation. I’m still working on understanding the underlying mathematical proofs from the Deville (1999) paper, as evidenced by this question I put on StackExchange.