An Optimization-based Bootstrap

Creating bootstrap weights for survey data can be viewed as a constrained optimization problem: obtain low-error variance estimates with as few replicates as possible using nonnegative weights. The standard solution to this problem is to use an effective but inefficient random resampling method. But what if we tried using an optimization algorithm instead?

Categories: sampling, statistics, R

Published: May 23, 2023

Creating bootstrap weights for survey data can be viewed as a constrained optimization problem: make the bootstrap variance estimate as close as possible to some target variance estimator, with as few replicates as possible, and using nonnegative weights. The standard solution to this problem is to use a tried-and-tested recipe: randomly resample the data a few thousand times in a certain way, and then rescale the weights for resampled observations a certain way. Various such recipes have been invented that work for certain kinds of sample designs: the Rao-Wu bootstrap method, for example, works for single-stage cluster samples, while Preston’s bootstrap method works for multistage simple random sampling designs. Many common sampling designs, such as systematic sampling or two-phase designs, do not yet have a bootstrap recipe invented for them.

All of those random resampling methods work, but because they rely so heavily on randomness, they’re terribly inefficient. Even for a small sample of a dozen observations, the random resampling bootstrap methods require several hundred replicates to achieve a stable variance estimate. This raises the question: what if we didn’t rely on randomization to generate our bootstrap weights but instead used modern optimization techniques to generate bootstrap weights with the desired properties?

This blog post explores that idea. First, I explicitly spell out the objectives and constraints we have in mind when generating bootstrap weights. I show how these can be translated into a constrained optimization problem with a smooth loss function, and then I demonstrate how to use a modern, high-dimensional optimization algorithm to generate our bootstrap weights. Finally, I discuss a few of the practical reasons why this optimization approach is worthwhile.

Goals We Have When Forming Bootstrap Weights

Recently on the blog, I wrote about the generalized bootstrap and why I think it’s such an exciting tool for survey statisticians. One of the nice things I’ve learned from working on the generalized bootstrap is how to think of the requirements and goals we have when constructing a bootstrap variance estimator. Representing our bootstrap replicates as \(n\)-vectors of adjustment factors (denoted \(\mathbf{a}\)) helps clarify these objectives, which can be broken into “core requirements” and “nice-to-have” features.

Notation and Set-up

Let \(v(\hat{T}_y)\) be the textbook variance estimator for an estimated population total \(\hat{T}_y\) of some variable \(y\). The base weight for case \(i\) in our sample is \(w_i\), and we let \(\breve{y}_i\) denote the weighted value \(w_iy_i\). So \(\hat{T}_y = \sum_{i=1}^{n} w_i y_i\).

Based on our sample design, we have a textbook variance estimator we want to use. This might be the usual Horvitz-Thompson estimator or, say, the successive-differences estimator commonly used for systematic samples. It could even be something especially complex like a kernel-weighted variance estimator. In all these cases, we can represent our textbook variance estimator as a quadratic form: \(v(\hat{T}_y) = \mathbf{\breve{y}}^{\prime}\Sigma\mathbf{\breve{y}}\), for some \(n \times n\) matrix \(\Sigma\). The only constraint on \(\Sigma\) is that, for our sample, it must be symmetric and positive semidefinite.

As an example, for a simple random sample of size \(n\) from a population of size \(N\) (so each \(w_i = N/n\)), consider the usual variance estimator for an estimated total: \(v(\hat{T}_y)=N^2(1-\frac{n}{N})\frac{s_y^2}{n}\), where \(s_y^2\) is the sample variance of \(y\). Written in terms of the weighted values, this is \(v(\hat{T}_y)=(1-\frac{n}{N})\frac{n}{n-1}\sum_{i=1}^{n}(\breve{y}_i - \bar{\breve{y}})^2\), which can be represented as a quadratic form \(\mathbf{\breve{y}}^{\prime}\boldsymbol{\Sigma}\mathbf{\breve{y}}\) where \(\boldsymbol{\Sigma}\) has all diagonal elements equal to \(1-\frac{n}{N}\), and all off-diagonal elements equal to \(-(1-\frac{n}{N})\times(n-1)^{-1}\).
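
As a quick check of that representation, here's a small standalone R sketch with toy values of my own choosing (none of these objects are used elsewhere in this post):

# Toy illustration: build the SRS quadratic form matrix and check that it
# reproduces the textbook variance estimator for an estimated total
n_toy <- 10
N_toy <- 100
y_toy     <- rnorm(n_toy, mean = 50, sd = 10)
y_wtd_toy <- (N_toy / n_toy) * y_toy   # Weighted values, w_i * y_i

# Diagonal entries: (1 - n/N); off-diagonal entries: -(1 - n/N)/(n - 1)
Sigma_srs <- matrix(-(1 - n_toy/N_toy) / (n_toy - 1),
                    nrow = n_toy, ncol = n_toy)
diag(Sigma_srs) <- 1 - n_toy/N_toy

# Quadratic form vs. the textbook formula N^2 * (1 - n/N) * s^2 / n
quad_form_est <- as.numeric(t(y_wtd_toy) %*% Sigma_srs %*% y_wtd_toy)
textbook_est  <- N_toy^2 * (1 - n_toy/N_toy) * var(y_toy) / n_toy
all.equal(quad_form_est, textbook_est)  # Should be TRUE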

The bootstrapping process creates \(B\) sets of replicate weights, where the \(b\)-th set of replicate weights is a vector of length \(n\) denoted \(\mathbf{a}^{(b)}\), whose \(k\)-th value is denoted \(a_k^{(b)}\). This yields \(B\) replicate estimates of the population total, \(\hat{T}_y^{*(b)}=\sum_{k \in s} a_k^{(b)} \breve{y}_k\), for \(b=1, \ldots,B\), which can be used to estimate sampling variance.

\[ v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B} \]

Like the textbook variance estimator, the bootstrap variance estimator can also be written as a quadratic form:

\[ v_B\left(\hat{T}_y\right) =\mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}} \]

where

\[ \boldsymbol{\Sigma}_B = \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B} \]
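
To see the equivalence concretely, here's a small R sketch (my own illustration, with hypothetical inputs: `A`, an \(n \times B\) matrix of adjustment factors, and `y_wtd`, the vector of weighted values). Computing \(v_B\) from the replicate totals and computing it through \(\boldsymbol{\Sigma}_B\) give the same answer.

# Illustration with hypothetical inputs: `A` is an n x B matrix whose b-th
# column is the vector of adjustment factors a^(b); `y_wtd` is the vector
# of weighted values w_i * y_i.
boot_var_from_replicates <- function(A, y_wtd) {
  full_sample_total <- sum(y_wtd)
  replicate_totals  <- colSums(A * y_wtd)   # Each replicate's estimated total
  mean((replicate_totals - full_sample_total)^2)
}

boot_var_from_quad_form <- function(A, y_wtd) {
  B       <- ncol(A)
  Sigma_B <- tcrossprod(A - 1) / B          # (1/B) * sum_b (a^(b) - 1)(a^(b) - 1)'
  as.numeric(t(y_wtd) %*% Sigma_B %*% y_wtd)
}

# For any such A and y_wtd, the two functions return the same value
# (up to floating-point error).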

Core Requirements

When creating bootstrap adjustment factors, there are a couple of requirements we have to adhere to if we want an unbiased bootstrap variance estimate (for totals).

Core Requirement 1: \(\quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n\)

The first requirement is that the bootstrap distribution generates adjustment factors with expectation equal to 1. This means that, on average, the bootstrap estimate for a population total matches the full-sample estimated population total.

Core Requirement 2: \(\quad \mathbf{E}_*\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}=\boldsymbol{\Sigma}\)

This requirement means that bootstrap variance estimates on average match the estimate we would get from the target variance estimator.
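
To see why these two requirements are exactly what we need, note that each replicate estimate can be written as \(\hat{T}_y^{*(b)}-\hat{T}_y=\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime} \mathbf{\breve{y}}\). The first requirement therefore ensures that \(\mathbf{E}_*\left(\hat{T}_y^{*(b)}\right)=\hat{T}_y\), while the second ensures that

\[ \mathbf{E}_*\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2 = \mathbf{\breve{y}}^{\prime}\, \mathbf{E}_*\left[\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}\right] \mathbf{\breve{y}} = \mathbf{\breve{y}}^{\prime}\boldsymbol{\Sigma}\mathbf{\breve{y}} = v\left(\hat{T}_y\right) \]

In other words, each replicate's squared deviation from the full-sample estimate matches, in expectation, the target variance estimate.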

As far as I know, Bob Fay was the first person to clearly articulate these requirements and to propose a “generalized replication” method that fulfills them for an arbitrary variance estimator. These ideas were laid out nicely in a series of three papers in the 1980s (1, 2, and 3). Bertail and Combris (1997) and Beaumont and Patak (2012) applied Fay's ideas about generalized replication to the specific goal of forming bootstrap weights, yielding the “generalized survey bootstrap.” Unfortunately, while the generalized replication and generalized bootstrap methods proposed so far fulfill these “core requirements,” each is missing one or more “nice-to-have” features which we could potentially obtain by using an optimization method.

“Nice-to-Have” Features

In addition to the core requirements listed above, there are some “nice-to-have” features that would make the bootstrap adjustment factors much more useful in practice.

Nice-to-have Feature 1: \(\mathbf{a} > \mathbf{0}\)

Generally, we want to have positive weights, since negative weights will cause popular statistical software procedures to fail and are also just hard to explain to users. And having nonzero weights (even if they’re some tiny positive number) would be helpful to avoid issues with domain estimates having a sample size of zero for some replicates.

Nice-to-have Feature 2: Minimize \(\left\Vert \mathbf{\Sigma}_{B} - \mathbf{\Sigma} \right\Vert_F\)

Finally, we want our bootstrap estimates to be as close to the target variance estimator as possible. This is accomplished by making the bootstrap adjustment factor’s variance-covariance matrix as close as possible to the target quadratic form matrix.
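
Here, \(\left\Vert \cdot \right\Vert_F\) denotes the Frobenius norm,

\[ \left\Vert \mathbf{M} \right\Vert_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} M_{ij}^2} \]

so minimizing it amounts to minimizing the total squared difference between the entries of \(\boldsymbol{\Sigma}_{B}\) and the corresponding entries of \(\boldsymbol{\Sigma}\).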

Fulfilling These Goals

The “Recipe”-Based Approach

Typically, bootstrap adjustment factors are generated by using a “recipe” that a clever statistician has determined will satisfy the core requirements and also have the other nice features outlined above. These recipes generally involve randomly resampling cases some integer number of times. For complex surveys, the Rao-Wu method is probably the best-known such “recipe.” These recipes rely on some clever math and generally have been worked out on a case-by-case basis, such as for designs with small sampling fractions or for stratified single-stage cluster samples. Some important classes of designs (e.g., systematic samples and two-phase designs) don’t yet have a nice recipe-based approach for generating bootstrap replicate weights.

See the ‘svrep’ R package documentation or Mashreghi’s 2016 overview paper for a summary of the most popular recipes and when they’re appropriate for different sample designs.

In general, these recipes are inefficient: they require several hundred replicates to get stable estimates, even if the sample size is small and the target variance estimator they converge to has a low-rank quadratic form matrix. For example, even with a sample size of 10, the Rao-Wu method of generating bootstrap replicates would require over 100 replicates to get a stable variance estimate.

In short, the recipes are great, but they’re inefficient and don’t apply to all of the survey designs commonly used in practice.

An Optimization-based Approach

Instead of coming up with a clever recipe for generating bootstrap weights that satisfy our goals, why not have a computer figure it out for us? Isn’t the whole point of bootstrapping that we can replace difficult human work with cheap computing?

Let’s just tell the computer what our goals are and ask it to use numerical optimization to generate replicate weights that fulfill those goals. Hopefully, as in many problems in statistics, we’ll be able to simply substitute cheap numerical optimization by a computer for careful mathematical work by a highly-educated (read: expensive) statistician.

That’s the idea behind the rest of this blog post.

Setting Up the Optimization Problem

Our inputs to the optimization process are the following:

  1. A target quadratic form matrix, \(\mathbf{\Sigma}\), which is an \(n \times n\) matrix.

  2. A specified number of bootstrap replicates, \(B\).

The output is an \(n \times B\) matrix, \(A\), composed of \(B\) columns of replicate adjustment factor vectors, \(a^{(b)}\), each of length \(n\). We'll constrain our solution to matrices \(A\) whose entries are all strictly positive (though possibly very small), so that none of the replicate weights are negative or zero. This is a “nice-to-have” feature that's usually a practical necessity in real-world applications.

There are a few different ways we could define a loss function so as to capture the other goals. For example, we could define our loss function as \(\left\Vert \mathbf{\Sigma}_{B} - \mathbf{\Sigma} \right\Vert_F^2\). This makes intuitive sense, because the fully-efficient and bootstrap variance estimators are both quadratic forms in \(\breve{y}\), and so the error of the bootstrap variance estimator for a total has an upper bound that is closely related to \(\left\Vert \mathbf{\Sigma}_{B} - \mathbf{\Sigma} \right\Vert_F^2\).
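
To make that a bit more precise: because both estimators are quadratic forms in \(\mathbf{\breve{y}}\), for any variable \(y\) we have

\[ \left\vert v_B\left(\hat{T}_y\right) - v\left(\hat{T}_y\right) \right\vert = \left\vert \mathbf{\breve{y}}^{\prime}\left(\boldsymbol{\Sigma}_B - \boldsymbol{\Sigma}\right)\mathbf{\breve{y}} \right\vert \leq \left\Vert \boldsymbol{\Sigma}_B - \boldsymbol{\Sigma} \right\Vert_F \left\Vert \mathbf{\breve{y}} \right\Vert_2^2 \]

so driving the Frobenius distance toward zero drives the worst-case error of the bootstrap variance estimator toward zero as well.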

But that wouldn’t capture our additional requirement that \(\mathbf{E}_*(\mathbf{a})=\mathbf{1}_n\). So we should define our loss function to incorporate that requirement.

As a first pass, we’ll define our loss function as follows:

\[ L(\mathbf{A}) = \left\Vert \mathbf{\Sigma}_{B} - \mathbf{\Sigma} \right\Vert_F^2 + \frac{1}{n}\left\Vert \left( \frac{1}{B}\sum_{b=1}^{B}\mathbf{a}^{(b)} \right) - \mathbf{1}_n \right\Vert_F^2 \]

In other words, this is the total squared “error” of the entries of the covariance matrix plus the average squared “error” of the entries of the mean vector. If we require the loss to stay below some maximum value, say \(0.01\), then whenever we use our bootstrap weights to estimate the sampling variance \(v_B\left(\hat{T}_y\right)\) of a population total \(\hat{T}_y\), the estimated CV, \(\frac{\sqrt{v_B\left(\hat{T}_y\right)}}{\vert\hat{T}_y\vert}\), should be off by no more than about \(1\%\).

We’ll consider a solution \(A\) acceptable if its loss is below some threshold, say, \(0.01\).

Creating Example Data

To illustrate, we’ll use some example data and run some R code. For the example data, we’ll load a survey dataset included in the ‘svrep’ package. This is a stratified, systematic sample of public libraries in the U.S.

library(dplyr)
library(survey)
library(svrep)

# Load survey data
data('library_stsys_sample', package = 'svrep')

# Create a survey design object
library_stsys_sample <- library_stsys_sample |>
  arrange(SAMPLING_SORT_ORDER)

survey_design <- svydesign(
  data   = library_stsys_sample,
  ids    = ~ 1,
  strata = ~ SAMPLING_STRATUM,
  fpc    = ~ STRATUM_POP_SIZE
)

Because it’s a systematic sample, we want to use the “SD2” variance estimator (commonly known as the “successive differences” variance estimator). We can use the ‘svrep’ package to obtain the quadratic form matrix for this variance estimator.

The SD2 estimator is the basis of Fay’s successive-differences replication (SDR) estimator. Here’s what the SD2 estimator looks like:

\[ \begin{aligned} \hat{v}_{S D 2}(\hat{T}_y) &= \left(1-\frac{n}{N}\right) \frac{1}{2}\left[\sum_{k=2}^n\left(\breve{y}_k-\breve{y}_{k-1}\right)^2+\left(\breve{y}_n-\breve{y}_1\right)^2\right] \\ \text{where }&\breve{y}_k = w_k y_k \text{ is the weighted value for unit } k \end{aligned} \]
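
As an aside, here's a tiny helper function (my own illustration, not used in the code below) that computes that formula directly from the successive differences of the weighted values within a single stratum, given the stratum's sampling fraction. The quadratic form matrix obtained below from get_design_quad_form() encodes essentially this same calculation, stratum by stratum.

# Illustrative helper (not used later): the SD2 estimate for one stratum,
# computed directly from successive differences of the weighted values.
# Assumes `y_wtd` is sorted in the stratum's sampling sort order.
sd2_one_stratum <- function(y_wtd, samp_fraction) {
  n_str <- length(y_wtd)
  successive_diffs <- diff(y_wtd)               # (y_k - y_{k-1}) for k = 2, ..., n
  wraparound_diff  <- y_wtd[n_str] - y_wtd[1]   # The (y_n - y_1) term
  (1 - samp_fraction) * 0.5 * (sum(successive_diffs^2) + wraparound_diff^2)
}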

# Get the quadratic form for a target variance estimator
Sigma <- get_design_quad_form(
  design = survey_design,
  variance_estimator = "SD2"
)

We want our bootstrap estimator to be efficient: that is, the number of replicates should be as close as possible to the rank of \(\Sigma\).

rank_of_Sigma <- Sigma |> Matrix::rankMatrix() |> as.numeric()

print(rank_of_Sigma)
[1] 164

Since the rank of \(\Sigma\) is 164, we’ll use \(B = 170\). So the matrix of replicate weights we’ll form will have dimension \(n \times B\).

B <- 170
n <- nrow(survey_design)

print(n)
[1] 219

Implementing an Optimization Algorithm

Fortunately, there are several powerful algorithms and accompanying software which we can apply to this problem. The machinery of neural network training is particularly well-suited due to its focus on quickly optimizing large arbitrary matrices of weights based on customized loss functions. We’ll use the Torch library, which implements several optimization methods based on automatic differentiation. This idea of using the Torch library specifically was first suggested by a StackExchange user in response to a question I posed about the topic. We’ll start by using the Adam algorithm, an adaptive gradient descent algorithm which is relatively efficient for high-dimensional parameters.

To ensure that the adjustment factors remain strictly positive, the parameter we'll actually optimize is log_A, the matrix of elementwise logarithms of \(A\); exponentiating log_A gives us back \(A\).

First, we’ll define the loss function.

library(torch)

#' @param log_A The logarithms of the matrix A we are trying to optimize 
#' @param target_sigma The target quadratic form matrix
loss_fn <- function(log_A, target_sigma) {
  A = torch_exp(log_A)
  # Loss consists of:
  # (1) Squared Frobenius distance
  #     between the adjustment factors' covariance matrix
  #     and the target covariance matrix
  cov_loss = (
    A$cov(correction = 0) - target_sigma
  )$square()$sum()
  
  # (2) Squared Euclidean distance between the mean of the
  #     adjustment factors across replicates and a vector of 1's
  #     (note: `n` is taken from the global environment, defined earlier)
  mean_loss = (A$t()$mean(1)-torch_ones(n))$square()$sum()

  # Total loss is the sum of (1) and (2),
  # with (2) down-weighted by a factor of 1/n
  sse = cov_loss + (n^(-1))*mean_loss
  return(sse)
}

Second, we’ll create “tensor” data objects. In the jargon of the Torch library, tensors are essentially multi-dimensional arrays, with matrices and vectors being special cases.

# Create a tensor object for the target quadratic form matrix
Sigma_tensor = Sigma |> as.matrix() |> torch_tensor()

Next, we’ll initialize a solution matrix. Our initial solution will be based on a matrix of bootstrap adjustment factors created using the ‘svrep’ package. These adjustment factors should have the correct expected mean and have a covariance matrix that’s proportional to the target covariance matrix.

set.seed(2022)

initial_solution <- make_gen_boot_factors(
  Sigma = Sigma,
  num_replicates = B,
  tau = 'auto', exact_vcov = TRUE
)

log_A = torch_tensor(log(initial_solution),
                     requires_grad = TRUE)

Now we’ll initialize the optimization algorithm.

optimizer = optim_adam(params = log_A, 
                       lr = 0.01)

And now we’ll write a simple while-loop to update the matrix log_A using the iterative optimization algorithm.

# Determine when the iterations should end
max_iter <- 15000 # Either after maximum # of iterations
max_loss <- 0.01  # Or when loss is below a certain value

# Initialize loss and iteration index
iteration <- 1L
loss <- loss_fn(log_A, Sigma_tensor)

start_time <- Sys.time() # Keep track of run time

# Conduct the loop
while ( (as.numeric(loss) > max_loss) & (iteration < max_iter) ) {
  
  # Print information every 500 iterations
  if ((iteration %% 500) == 0) {
    cat("Iteration: ", iteration, "   Loss: ", loss$item(), "\n")
  }
  
  # Set the gradient to zero
  optimizer$zero_grad()
  
  # Determine loss at current iteration
  loss = loss_fn(log_A, Sigma_tensor)
  
  # Calculate gradients
  loss$backward()
  
  # Update `log_A`
  optimizer$step()

  iteration <- iteration + 1L
}
Iteration:  500    Loss:  0.1749303 
Iteration:  1000    Loss:  0.1464402 
Iteration:  1500    Loss:  0.1205335 
Iteration:  2000    Loss:  0.09569879 
Iteration:  2500    Loss:  0.07386597 
Iteration:  3000    Loss:  0.0557612 
Iteration:  3500    Loss:  0.04135401 
Iteration:  4000    Loss:  0.03023168 
Iteration:  4500    Loss:  0.02184549 
Iteration:  5000    Loss:  0.01564433 
Iteration:  5500    Loss:  0.01113603 
end_time <- Sys.time()

After the iterations have concluded, we obtain our solution by exponentiating log_A.

A <- log_A |> torch_exp() |> as.matrix()

Let’s see if we were able to find a solution.

loss |> as.numeric() |> print()
[1] 0.009995561
converged <- as.numeric(loss) < max_loss

print(converged)
[1] TRUE

Success! Now let’s see how long it took.

print(iteration)
[1] 5658
elapsed_time_in_seconds <- end_time - start_time

print(elapsed_time_in_seconds)
Time difference of 1.012221 mins

While that’s a lot of iterations, the whole loop took only about a minute to run.
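
Before putting the weights to use, here's an optional sanity check (my own addition, not required for the workflow): we can compare the adjustment factors' empirical moments directly against the targets.

# Optional sanity check on the solution matrix `A` (n x B):
# the row means should be close to 1, and the matrix Sigma_B
# defined earlier should be close to the target matrix Sigma
Sigma_B <- tcrossprod(A - 1) / B

max(abs(rowMeans(A) - 1))                     # Should be near zero
norm(Sigma_B - as.matrix(Sigma), type = "F")  # Should be small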

Now let’s use the bootstrap replicate weights.

optim_boot_design <- svrepdesign(
  data             = library_stsys_sample,
  weights          = library_stsys_sample$SAMPLING_PROB^(-1),
  repweights       = A,
  combined.weights = FALSE,
  type = "bootstrap"
)

We’ll estimate the population totals of TOTCIR and TOTSTAFF as well as the variance-covariance matrix of those estimates.

optim_boot_var_est <- optim_boot_design |> 
  svytotal(x = ~ TOTCIR + TOTSTAFF, na.rm = TRUE) |>
  vcov()

print(optim_boot_var_est)
               TOTCIR     TOTSTAFF
TOTCIR   3.207583e+17 2.093672e+13
TOTSTAFF 2.093672e+13 1.521484e+09
attr(,"means")
[1] 2399788328.8     198286.9

We can compare the bootstrap variance estimate to the fully-efficient variance estimator.

# Obtain weighted values of variables
Y <- model.frame(formula = ~ TOTCIR + TOTSTAFF,
                 optim_boot_design$variables, 
                 na.action = na.pass)
Y[is.na(Y)] <- 0
wtd_Y <- as.matrix(Y) * optim_boot_design$pweights

# Produce variance estimate using quadratic form
fully_efficient_var_est <- t(wtd_Y) %*% Sigma %*% wtd_Y
print(fully_efficient_var_est)
2 x 2 Matrix of class "dgeMatrix"
               TOTCIR     TOTSTAFF
TOTCIR   3.188938e+17 2.081503e+13
TOTSTAFF 2.081503e+13 1.512631e+09
# Calculate ratio of variance estimates
ratio_of_var_ests <- optim_boot_var_est / fully_efficient_var_est
print(ratio_of_var_ests)
2 x 2 Matrix of class "dgeMatrix"
           TOTCIR TOTSTAFF
TOTCIR   1.005847 1.005846
TOTSTAFF 1.005846 1.005853

In the output above, we can see that the bootstrap variance estimates are almost exactly the same as the fully-efficient variance estimates: the ratios differ from 1 by only about 0.6%. Because we were able to push our loss below the 0.01 threshold, the variance estimates for totals are off by no more than about 1%.

Success!

Now that we’ve created the full set of bootstrap replicates, we can use all of them or, if we need to reduce the size of our data file, a random subset of them, as sketched below.
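
For instance, here's a hypothetical sketch (the choice of 100 retained replicates is arbitrary) that keeps a random subset of the columns of A and rebuilds the replicate design object with the same call used above:

# Hypothetical sketch: keep a random subset of 100 of the B replicate columns
kept_replicates <- sample(ncol(A), size = 100)

subset_boot_design <- svrepdesign(
  data             = library_stsys_sample,
  weights          = library_stsys_sample$SAMPLING_PROB^(-1),
  repweights       = A[, kept_replicates],
  combined.weights = FALSE,
  type             = "bootstrap"
)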

Why Is This Ultimately Worthwhile?

The main appeal of replication methods is that they let us replace tedious, expensive hours of human computing with cheap, fast machine computations, without sacrificing much if anything in terms of accuracy. However, in practice, survey statisticians end up spending a lot of expensive human hours figuring out how to generate replicate weights (bootstrap or otherwise) for complex survey designs, and we often sacrifice accuracy so that we can reduce the amount of machine computation and data storage required for the replicate weights. For example, we shoehorn sample designs into a two-PSU-per-stratum approximation so that we can use balanced half-sample replication, or we treat a multistage sample as a single-stage sample so that we can use jackknife methods. The main appeal of an “optimization-based bootstrap” is that we can stop spending so many labor hours figuring out approximations to the variance estimator we actually want to use. With an optimization approach, we just tell the computer what our survey design is and what variance estimator we want to use, and then we let it chug away for a few minutes or a few hours to figure out how to turn that into replicate weights as efficiently as possible. This computational approach is cheaper than the statistician's tedious work and more accurate than the approximations it replaces.

In addition to reducing costs, an optimization approach could also help us use better variance estimation approaches in our surveys. Consider for example a recently-proposed kernel-based variance estimator for systematic samples. That variance estimator makes a lot of sense to use for systematic samples or other highly-stratified surveys. But for many government surveys it couldn’t be used in practice, because it’s typically a practical necessity to use replicate weights and the kernel-based variance estimator can’t be implemented as a jackknife or BRR replication estimator. The random generalized bootstrap could be used, but unfortunately that method can result in negative replicate weights and might require several hundred replicates to get a stable variance estimate. An optimization-based bootstrap would let us turn that textbook estimator into a replication estimator with a manageable number of replicates.

In short, an optimization-based bootstrap could help survey statisticians spend our limited survey budgets on more productive work. Rather than laboring over the approximations required to use jackknife or half-sample replication methods, we could choose better variance estimators and let cheap computing work out precisely how to translate them into replicate weights.