Understanding the code used to implement the {survey} package’s recursive variance estimation for multistage samples

Unlike SAS and SPSS, the {survey} package in R can properly estimate variances for designs that combine multistage sampling with non-negligible sampling fractions. The estimation is implemented using a recursive algorithm, whose basic idea is well documented but whose R code can be intimidating at first read. This post walks through the R code used to implement this algorithm.


The {survey} package has a certain conceptual simplicity to it. It calculates linearization-based estimates of the variance of several kinds of statistics (totals, ratios, regression coefficients, etc.) using a single core function, whose purpose is simply to estimate the sampling variance of an estimated total.

This works because of the method of linearization based on estimating functions and influence functions. For each statistic, you determine the estimating functions or influence functions, and you estimate the variance of their estimated total. So there are three key steps to linearization-based variance estimation in the {survey} package:

  1. Appropriately calculate (or estimate) the estimating functions or influence functions for the statistic of interest.

  2. Determine the appropriate variance estimator for a total under a specified sampling design.

  3. Plug the estimating/influence functions into the variance estimator for the total.
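As a concrete illustration of step 1 (this is the standard linearization result for a ratio, not anything specific to the package): for a ratio \(\hat{R} = \hat{Y}/\hat{X}\), the linearized variable is

\[ z_i = \frac{1}{\hat{X}}\left(y_i - \hat{R} \, x_i\right) \]

and the variance of \(\hat{R}\) is then estimated as the variance of the estimated total of the \(z_i\) (steps 2 and 3).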

The purpose of this blog post is to dig into the internals of the {survey} package and see how exactly it implements the linearization-based estimation of sampling variances for totals in a multistage, stratified design.

Motivating Formulas

To understand the code used in the {survey} package, it’s helpful to first know what mathematical formulas it’s using to estimate the variance of a total under multistage sampling.

At the first stage of sampling, the sampling variance of a total is estimated separately by stratum, and then the variances from all the strata are added together.

\[ \begin{aligned} &\textbf{The estimated sampling variance} \\ &\textbf{of the estimated population total:} \\ &v(\hat{Y}) = \sum_{h=1}^{H} v_h(\hat{Y}_h) \\ &\text{where } h \text{ denotes one of the } H \text{ strata} \\ &\text{and } \hat{Y}_h \text{ denotes the estimated population total in stratum } h \end{aligned} \]

Within stratum \(h\), the sampling variance of the estimated stratum population total is estimated by first computing the sample variance of the weighted totals of the sampled first-stage units, then multiplying this by the number of sampled units (\(n_h\)), and finally multiplying by the stratum’s finite population correction, \((1 - f_h)\).

\[ \begin{aligned} & v_h(\hat{Y}_h) = (1 - f_h) \times n_h \times \frac{1}{n_h-1} \sum_{i=1}^{n_h}(y_{hi.} - \bar{y}_{h..})^2 \\ &\text{where } n_h \text{ denotes the number of sampled first-stage sampling units}, \\ & y_{hi.} \text{ is the weighted total of first-stage sampling unit } i, \\ & \bar{y}_{h..} = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi.}, \\ &\text{and } f_h = n_h/N_h \text{ is the sampling fraction of first-stage sampling units} \end{aligned} \]

That’s the story for single-stage sampling designs.1
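Before digging into the package internals, here’s a quick hand computation of this formula for a single made-up stratum (a sketch with invented numbers, not package code):

y_hi <- c(120, 150, 90)  # weighted totals of the n_h = 3 sampled PSUs
n_h  <- length(y_hi)     # number of sampled first-stage units
N_h  <- 10               # number of first-stage units in the population
f_h  <- n_h / N_h        # first-stage sampling fraction

# (1 - f_h) * n_h * (1/(n_h - 1)) * (sum of squared deviations)
(1 - f_h) * n_h * sum((y_hi - mean(y_hi))^2) / (n_h - 1)
[1] 1890

The last two terms are just \(n_h\) times the sample variance of the PSU totals, so (1 - f_h) * n_h * var(y_hi) gives the same answer.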

For multiple stages of sampling, a first-stage unit’s weighted total is only an estimate of that unit’s true, unknown population total, so we need to account for the additional variance this introduces. To do this, we apply the above formula to the second-stage sampling units and strata inside a given first-stage unit. We then sum these second-stage variances across all of the sampled first-stage units and multiply the sum by the first-stage sampling fraction.

\[ \begin{aligned} v_h(\hat{Y})_{\text{two-stage}} &= (\text{est. variance from 1st stage}) + (\text{est. variance from 2nd stage}) \\ &= \text{previous formula} + f_h \sum_{j=1}^{n_h} v_{hj}(\hat{Y}_{hj}) \\ &\text{where } \hat{Y}_{hj} \text{ is the estimated total in first-stage unit } j, \\ &v_{hj}(\hat{Y}_{hj}) \text{ is the estimated sampling variance of } \hat{Y}_{hj} \text{ from the second stage of sampling}, \\ &\text{and } f_h = n_h/N_h \text{ is the first-stage sampling fraction} \end{aligned} \]

For three or more stages of sampling, the above method is applied recursively. Note that for each stage of sampling after the first stage, the variance contribution from that stage equals the sum of within-unit sampling variances, multiplied by the product of the sampling fractions from all previous stages. This means that later-stage sampling has increasingly small impacts on the overall sampling variance.
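For intuition on that last point, suppose (purely hypothetically) that every stage samples 1 unit in 10. Then the stage-two contribution is scaled by \(f_1 = 0.1\), the stage-three contribution by \(f_1 f_2 = 0.01\), and so on:

# Hypothetical sampling fractions at stages 1, 2, and 3
f <- c(0.1, 0.1, 0.1)

# Scale factors applied to the variance contributions from stages 2, 3, and 4
cumprod(f)
[1] 0.100 0.010 0.001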

The overview from Dr. Lumley’s book

Thomas Lumley’s book provides a nice overview of the functions used for the recursive variance estimator.

All standard error computations for linearization go through svyrecvar(), which is passed either the estimating functions or the influence functions for the estimator. The first step is to project the influence functions orthogonal to any post-stratifying or raking or calibration variables. For post-stratification, post-stratum means are computed using the original sampling weights and subtracted off. For raking, the post-stratification procedure is done iteratively. For calibration, a weighted QR decomposition of the matrix of auxiliary variables weighted by the original sampling weights is used to compute residuals. The resulting residuals are passed to multistage(), which implements the recursive algorithm for multistage variances.

The variance contribution for the current stage is computed by onestage(), which calls onestrat() to do the actual calculations for one stratum, and then multistage() is recursively called for the next stage of sampling. In onestrat() the observations for each cluster are added, then centered within each stratum and the outer product is taken of the row vector resulting for each cluster. This is added within strata, multiplied by a degrees-of-freedom correction and by a finite population correction (if supplied), and added across strata in onestage().

p.240, Complex Surveys: A Guide to Analysis Using R. By Thomas Lumley

The rest of this post goes into more detail based on this explanation. For simplicity, we’ll set aside the content about influence functions and estimating functions and concentrate on estimating the variance of the total of a simple variable, such as income.2

We’ll create an example survey design object to help us understand what’s going on. This design will feature strata and two stages of sampling (with two sets of finite population corrections).

set.seed(1999)

library(dplyr)
data(api, package = 'survey')

apipop$district_group <- apipop[,c("meals", "api00")] |>
  dist() |>
  hclust() |>
  cutree(k = 12)

sample_data <- apipop %>%
  # Sample five district groups per stratum
  group_by(stype) %>%
  mutate(N_Dist_Groups = n_distinct(district_group)) %>%
  filter(district_group %in% sample(x = unique(district_group),
                                    size = 5, replace = FALSE)) %>%
  # Calculate the first-stage sampling probability
  mutate(stage_1_prob = 5/N_Dist_Groups) %>%
  # Sample 3 schools from each district group
  group_by(stype, district_group) %>%
  mutate(N_in_Dist_Group = n()) %>%
  sample_n(size = 3, replace = FALSE) %>%
  # Calculate the second-stage sampling probability
  mutate(stage_2_prob = 3/N_in_Dist_Group) %>%
  ungroup()

library(survey)
options("survey.ultimate.cluster" = FALSE)

design <- svydesign(id = ~ district_group + snum,
                    fpc = ~ N_Dist_Groups + N_in_Dist_Group,
                    strata = ~ stype,
                    nest = TRUE,
                    data = sample_data)
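As a quick sanity check on this design object (the exact numbers will vary with the random sample): the overall inclusion probabilities stored in the design should equal the product of the two stage-specific sampling probabilities we computed while drawing the sample, and we can estimate a total right away.

# Overall inclusion probability = product of the stage-specific probabilities
all.equal(unname(design$prob),
          sample_data$stage_1_prob * sample_data$stage_2_prob)

# Estimate a population total and its standard error
svytotal(~ meals, design)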

The high-level function, svyrecvar()

Let’s look at an example of a function that uses svyrecvar() to estimate the variance for a statistic. We’ll focus on survey:::svytotal.survey.design2, the method that estimates totals (and their variances) for survey.design2 objects. Within svytotal.survey.design2(), we can see that the variance is estimated using the following code.

attr(total, "var") <- v <- svyrecvar(
  x = x/design$prob, 
  clusters =  design$cluster, 
  stratas = design$strata,
  fpcs = design$fpc,
  postStrata = design$postStrata
)

Interestingly, we can see that the base sampling weights (i.e. inverses of inclusion probabilities) are not supplied directly to svyrecvar(). Rather, the input variables (income, height, etc.) are first weighted using the base sampling weights, and then the weighted input variables are supplied to svyrecvar().
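We can confirm this convention with our example design (a small check, not package code): the weighted variable sums to the estimated total, and feeding it to svyrecvar() reproduces the variance that svytotal() reports.

x_wt <- as.matrix(design$variables$meals / design$prob)

# The weighted variable sums to the estimated population total
all.equal(sum(x_wt), unname(coef(svytotal(~ meals, design))))

# And svyrecvar() applied to it reproduces svytotal()'s variance estimate
v <- svyrecvar(x_wt,
               clusters = design$cluster,
               stratas = design$strata,
               fpcs = design$fpc,
               postStrata = design$postStrata)
all.equal(as.numeric(v), as.numeric(vcov(svytotal(~ meals, design))))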

View example input data

For the example survey design object we created, here’s what the inputs to svyrecvar() look like.

x <- design$variables$meals
x |> head()
[1] 63 68 41 22 32 11
design$prob |> head()
          1           2           3           4           5           6 
0.002577320 0.002577320 0.002577320 0.002340824 0.002340824 0.002340824 
design$cluster |> head()
  district_group     snum
1            E.2 E.2.3433
2            E.2 E.2.3171
3            E.2 E.2.5263
4            E.3 E.3.1860
5            E.3 E.3.3145
6            E.3 E.3.5499
design$strata |> head()
  stype    V2
1     E E.E.2
2     E E.E.2
3     E E.E.2
4     E E.E.3
5     E E.E.3
6     E E.E.3
design$fpc |> str()
List of 2
 $ popsize : int [1:45, 1:2] 12 12 12 12 12 12 12 12 12 12 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:45] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "N_Dist_Groups" "N_in_Dist_Group"
 $ sampsize: int [1:45, 1:2] 5 5 5 5 5 5 5 5 5 5 ...
 - attr(*, "class")= chr "survey_fpc"
design$fpc$popsize |> head()
  N_Dist_Groups N_in_Dist_Group
1            12             485
2            12             485
3            12             485
4            12             534
5            12             534
6            12             534
design$fpc$sampsize |> head()
     [,1] [,2]
[1,]    5    3
[2,]    5    3
[3,]    5    3
[4,]    5    3
[5,]    5    3
[6,]    5    3
design$postStrata |> head()
NULL

Note that the postStrata object is NULL, since we didn’t conduct any form of post-stratification or calibration. For illustration though, let’s consider what it would look like if we had calibrated our design, for example with the rake() function.

raked_design <- rake(design, sample.margins = list(~ sch.wide, ~ comp.imp),
                     population.margins = list(xtabs(~ sch.wide, data = apipop),
                                               xtabs(~ comp.imp, data = apipop)))

str(raked_design$postStrata)
List of 1
 $ :List of 2
  ..$ : int [1:45] 1 2 2 2 2 2 2 2 2 2 ...
  .. ..- attr(*, "oldweights")= Named num [1:45] 302 437 222 481 245 ...
  .. .. ..- attr(*, "names")= chr [1:45] "1" "2" "3" "4" ...
  .. ..- attr(*, "weights")= Named num [1:45] 302 437 222 481 245 ...
  .. .. ..- attr(*, "names")= chr [1:45] "1" "2" "3" "4" ...
  ..$ : int [1:45] 1 2 1 2 1 2 1 2 2 2 ...
  .. ..- attr(*, "oldweights")= Named num [1:45] 302 437 222 481 245 ...
  .. .. ..- attr(*, "names")= chr [1:45] "1" "2" "3" "4" ...
  .. ..- attr(*, "weights")= Named num [1:45] 302 437 222 481 245 ...
  .. .. ..- attr(*, "names")= chr [1:45] "1" "2" "3" "4" ...
  ..- attr(*, "class")= chr "raking"

The documentation in the {survey} package provides the following description for survey::svyrecvar:

The main use of this function is to compute the variance of the sum of a set of estimating functions under multistage sampling. The sampling is assumed to be simple or stratified random sampling within clusters at each stage except perhaps the last stage. The variance of a statistic is computed from the variance of estimating functions as described by Binder (1983).

Use one.stage=TRUE for compatibility with other software that does not perform multi-stage calculations, and set options(survey.ultimate.cluster=TRUE) to make this the default.

The idea of a recursive algorithm is due to Bellhouse (1985). Texts such as Cochran (1977) and Sarndal et al (1991) describe the decomposition of the variance into a single-stage between-cluster estimator and a within-cluster estimator, and this is applied recursively.

If one.stage is a positive integer it specifies the number of stages of sampling to use in the recursive estimator.

Let’s take a look inside. For now, we’ll ignore all the special code used to handle post-stratification and raking.

svyrecvar <- function(x,
                      clusters,
                      stratas,
                      fpcs,
                      postStrata = NULL, lonely.psu = getOption("survey.lonely.psu"), 
                      one.stage = getOption("survey.ultimate.cluster")) 
{
  x <- as.matrix(x)
  cal <- NULL
  
  multistage(x = x,
             clusters = clusters, stratas = stratas,
             nPSUs = fpcs$sampsize, fpcs = fpcs$popsize, 
             lonely.psu = getOption("survey.lonely.psu"),
             one.stage = one.stage,
             stage = 1,
             cal = cal)
}

From this, we can see that regardless of whether the survey design actually has multiple stages of sampling, the multistage() function is always used in the estimation of variances.
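We can verify this with our example design by calling survey:::multistage() directly, unpacking the design components the same way svyrecvar() does (this sketch reuses the x_wt matrix from earlier and relies on the internal signature shown above):

v_multi <- survey:::multistage(x_wt,
                               clusters = design$cluster,
                               stratas = design$strata,
                               nPSUs = design$fpc$sampsize,
                               fpcs = design$fpc$popsize,
                               lonely.psu = getOption("survey.lonely.psu"),
                               one.stage = getOption("survey.ultimate.cluster"),
                               stage = 1,
                               cal = NULL)

# Matches what svyrecvar() returns for the same inputs
all.equal(v_multi,
          svyrecvar(x_wt, design$cluster, design$strata, design$fpc))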

Where the recursion happens: survey:::multistage()

Here’s the code for multistage(), minus some unnecessary details about how calibration and raking are handled. We can see that, if there’s only a single stage of sampling, then multistage() only gets used once. If there are multiple stages of sampling, then multistage() will be used recursively; the definition of the multistage() function includes a call to multistage().

multistage <- function(x,
                       clusters, stratas,
                       nPSUs, fpcs,
                       lonely.psu = getOption("survey.lonely.psu"), 
                       one.stage = FALSE, stage,
                       cal) {
  n <- NROW(x) # Gives the total number of observations in the dataset
  
  # Calculate the variance-covariance matrix from the first stage
  v <- onestage(x,
                stratas[, 1],
                clusters[, 1],
                nPSUs[, 1], 
                fpcs[, 1],
                lonely.psu = lonely.psu,
                stage = stage,
                cal = cal)
  
  # Check whether later-stage variances are supposed to be estimated
  if (one.stage != TRUE && !is.null(fpcs) && NCOL(clusters) > 1) {
    
    # Iterate over each first-stage cluster
    v.sub <- by(
      data = 1:n,
      INDICES = list(as.numeric(clusters[, 1])),
      FUN = function(index) {

        # Use `multistage()` within a given first-stage cluster
        multistage(x[index, , drop = FALSE],
                   clusters[index, -1, drop = FALSE],
                   stratas[index, -1, drop = FALSE], 
                   nPSUs[index, -1, drop = FALSE],
                   fpcs[index, -1, drop = FALSE],
                   lonely.psu = lonely.psu, 
                   one.stage = one.stage - 1, stage = stage + 1, 
                   cal = cal) * nPSUs[index[1], 1]/fpcs[index[1], 1]
        
      })
    for (i in 1:length(v.sub)) v <- v + v.sub[[i]]
  }
  
  dimnames(v) <- list(colnames(x), colnames(x))
  v
}

We can see that at the outset, no matter what, the variance for the first stage of sampling (e.g. counties) is estimated using onestage(). Variances for later stages of sampling (households, persons within households, etc.) are only estimated if all three of the following conditions are met:

  1. The value of the argument one.stage is not TRUE. Its default is getOption("survey.ultimate.cluster"), and by default this option is set to FALSE, so one.stage = FALSE and later-stage variances are estimated. (Because the check is one.stage != TRUE, a positive integer also passes; each recursive call decrements it by one, which is how the “number of stages” behavior described in the documentation works.) In general, the function svyrecvar() is the only place in the package that is directly affected by the value of the "survey.ultimate.cluster" option; a short demonstration of the option follows this list.

  2. There are multiple stages of units. The object clusters is a matrix, with one row for every observation in the entire dataset and one column for each stage of sampling. So this is why the code checks for NCOL(clusters) > 1.

  3. The user has provided finite population corrections, which is checked using the logic !is.null(fpcs).
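To see condition 1 in action, we can toggle the option and re-estimate (a quick demonstration; the exact numbers depend on our random sample, so output is omitted):

# Default: full multistage variance estimation
options(survey.ultimate.cluster = FALSE)
SE(svytotal(~ meals, design))

# "Ultimate cluster" approximation: only the first stage contributes.
# With a first-stage sampling fraction as large as ours (5/12), dropping
# the later-stage contributions understates the variance.
options(survey.ultimate.cluster = TRUE)
SE(svytotal(~ meals, design))

options(survey.ultimate.cluster = FALSE)  # restore the default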

If all three of these conditions are met, then multistage() will be called recursively to get the variance contributions from later stages of sampling. The by() function in the code above applies multistage() separately within each current-stage sampling unit. In other words, it splits the objects x, clusters, stratas, fpcs, and nPSUs into separate matrices, one for each current-stage sampling unit. In each of these recursive calls, the input matrices (x, clusters, stratas, fpcs, and nPSUs) all have their first column peeled off. Each column represents one stage of sampling, and since we’re moving on to later stages, the current stage’s column is no longer needed.

After the recursive multistage() call returns the variance from later stages of sampling within a given current-stage sampling unit, the result is multiplied by nPSUs[index[1], 1]/fpcs[index[1], 1], which is simply the current-stage sampling fraction (\(n_h/N_h\)) for the stratum containing that unit.

The results are then added up across all of the current-stage sampling units.

Calculating the variance of a single stage of sampling

Within multistage(), the function named onestage() is used at every stage of sampling. This function essentially implements the calculation we described at the beginning of this post.

\[ \begin{aligned} &v(\hat{Y}) = \sum_{h=1}^{H} v_h(\hat{Y}_h) \\ & v_h(\hat{Y}_h) = (1 - f_h) \times n_h \times \frac{1}{n_h-1} \sum_{i=1}^{n_h}(y_{hi.} - \bar{y}_{h..})^2 \\ \end{aligned} \]

onestage <- function(x, strata, clusters, nPSU, fpc,
                     lonely.psu=getOption("survey.lonely.psu"),
                     stage=0, cal) {
  
   if (NROW(x)==0) {
      return(matrix(0,NCOL(x),NCOL(x)))
   }
  # Calculate the variances in each stratum
  stratvars<- tapply(1:NROW(x), list(factor(strata)), function(index){
    onestrat(x[index,,drop=FALSE], clusters[index],
             nPSU[index][1], fpc[index], ##changed from fpc[index][1], to allow pps(brewer)
             lonely.psu=lonely.psu,stratum=strata[index][1], stage=stage,cal=cal)
  })
  p<-NCOL(x)
  nstrat<-length(unique(strata))
  # Sum the variances across all the strata
  # and do some special handling 
  # needed if there are any "lonely PSUs" (i.e. strata with only one PSU)
  nokstrat<-sum(sapply(stratvars,function(m) !any(is.na(m))))
  apply(array(unlist(stratvars),c(p,p,length(stratvars))),1:2,sum,na.rm=TRUE)*nstrat/nokstrat
}
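To see onestage() in isolation, we can run it on just the first-stage components of our example design. The result is the first-stage piece of the variance, which is exactly what svyrecvar() returns when one.stage = TRUE (again a sketch reusing x_wt and calling internal functions directly):

v_stage1 <- survey:::onestage(x_wt,
                              strata = design$strata[, 1],
                              clusters = design$cluster[, 1],
                              nPSU = design$fpc$sampsize[, 1],
                              fpc = design$fpc$popsize[, 1],
                              lonely.psu = getOption("survey.lonely.psu"),
                              stage = 1,
                              cal = NULL)

# Matches the "ultimate cluster" (first-stage-only) variance
all.equal(unname(v_stage1),
          unname(svyrecvar(x_wt, design$cluster, design$strata, design$fpc,
                           one.stage = TRUE)))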

The function survey:::onestrat() is a sort of “helper” function used inside onestage() that does the computation for a single stratum. In other words, it handles the following calculation.

\[ \begin{aligned} & v_h(\hat{Y}_h) = (1 - f_h) \times n_h \times \frac{1}{n_h-1} \sum_{i=1}^{n_h}(y_{hi.} - \bar{y}_{h..})^2 \\ \end{aligned} \]

There’s some added nuance owing to the special handling needed for strata with only a single PSU (which the {survey} package refers to as “lonely PSUs”). The code of onestrat() is included below.

Show R Code
onestrat <- function(x, cluster, nPSU, fpc, lonely.psu,
                     stratum=NULL, stage=1, cal=cal) {
  
  # Calculate the term (1-f_h) * n_h * (1/(n_h - 1))
  # which is the constants in the formula described above
  if (is.null(fpc))
      f<-rep(1,NROW(x))
  else{
      f<-ifelse(fpc==Inf, 1, (fpc-nPSU)/fpc)
  }

  if (nPSU>1)
      scale<-f*nPSU/(nPSU-1)
  else
      scale<-f
  
  # For self-representing strata, just return variances of zero
  # and skip unnecessary further calculations
  
  if (all(f<0.0000001))## self-representing stratum
      return(matrix(0,NCOL(x),NCOL(x)))

  scale<-scale[!duplicated(cluster)]
  
  x<-rowsum(x,cluster)
  nsubset<-nrow(x)
  
  # Calculate stratum mean, with some special handling
  # if there is a "lonely PSU"
  if (nsubset<nPSU) {
    ##can't be PPS, so scale must be a constant
    x<-rbind(x,matrix(0,ncol=ncol(x),nrow=nPSU-nrow(x)))
    scale<-rep(scale[1],NROW(x))
  }
  
  ## Update `x` by subtracting the stratum mean(s)
  if (lonely.psu!="adjust" || nsubset>1 ||
      (nPSU>1 & !getOption("survey.adjust.domain.lonely")))
      x<-sweep(x, 2, colMeans(x), "-")

  if (nsubset==1 && nPSU>1 && getOption("survey.adjust.domain.lonely")){ 
      warning("Stratum (",stratum,") has only one PSU at stage ",stage)
      if (lonely.psu=="average" && getOption("survey.adjust.domain.lonely"))
          scale<-NA
    }
  
  # Calculate the sum of squares from the formula described earlier, multiply by the scale factor
  if (nPSU>1){
      return(crossprod(x*sqrt(scale)))
  } else {
      rval<-switch(lonely.psu, 
                   certainty=crossprod(x*sqrt(scale)),
                   remove=crossprod(x*sqrt(scale)),
                   adjust=crossprod(x*sqrt(scale)),
                   average=NA*crossprod(x),
                   fail= stop("Stratum (",stratum,") has only one PSU at stage ",stage),
                   stop("Can't handle lonely.psu=",lonely.psu)
            )
      rval
  }
}

Interestingly, the variance of the PSU totals in a stratum is calculated using crossprod(x * sqrt(scale)), where x is the matrix of data after the stratum means have been subtracted (i.e. the data are recentered around the stratum mean), and scale is the combined term \((1-f_h) \times n_h \times \frac{1}{n_h - 1}\) from the formulas we saw earlier. This might be surprising: since the crossproduct of a mean-centered matrix equals \((n_h - 1)\) times its variance-covariance matrix, the same result could seemingly be obtained as \((1-f_h) \times n_h \times\) var(x), where var() is the base R function that computes the variance-covariance matrix of a supplied matrix. Why does the {survey} package use crossprod() instead? The reason is singleton strata (“lonely PSUs”, in the language of the {survey} package), which are strata containing only one PSU. For a singleton stratum there is only one data observation available for the variance calculation, so the variance is undefined and var(x) would return a matrix of missing values. With crossprod(), this problem can be avoided: if the option for handling lonely PSUs is set to "adjust", then x can be recentered not around the stratum mean (which for a singleton stratum yields a matrix of all zeroes) but around a grand mean (e.g. the mean of all PSUs from all strata), so that crossprod() can produce a reasonable nonzero variance estimate.
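As a small check of that algebra (generic R, nothing package-specific), we can confirm that crossprod() of a mean-centered matrix reproduces \((n-1)\) times what var() computes:

set.seed(42)
x <- matrix(rnorm(12), ncol = 2)            # six "PSU totals" for two variables
x_centered <- sweep(x, 2, colMeans(x), "-") # recenter around the column means
all.equal(crossprod(x_centered), var(x) * (nrow(x) - 1))
[1] TRUE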

Footnotes

  1. Note that SAS and SPSS treat all designs as if they were single-stage designs. They ignore any stages of sampling beyond the first. See the SAS documentation for details.↩︎

  2. The beauty of influence functions and estimating functions is that they can be treated like any simple variable (income, acreage, etc.): the sampling variance of their estimated total gives us the variance of complicated statistics such as ratios or regression coefficients.↩︎