Simulation-based Variance Estimation for the Cube Method

This post shows a couple approaches for using simulation to help improve variance estimation for samples selected using the cube method. It also proposes a new variance estimator that combines the strengths of existing variance estimators to potentially yield more reliable variance estimates for the cube method.

Published: September 3, 2024

The Analytical Challenge Posed by the Cube Method

Recently on this blog, I wrote a post about the cube method and why I think it’s a promising method for survey sampling. The cube method in many ways simplifies the goals we have when we draw a stratified sample: it allows us to allocate sample sizes to several (potentially overlapping) domains and automatically gives us a sample that is well-calibrated to known population information. However, one drawback of the cube method is that it complicates analysis: it’s not immediately obvious how to estimate sampling variances for estimates produced from a sample selected using the cube method. There is no simple analytic formula that yields exactly unbiased variance estimates, and it’s not obvious how one would construct an appropriate replication method using the bootstrap or other approaches.

Simulations vs. Approximations

There’s a small literature on this problem, which has generally taken two approaches:

  1. Use the Horvitz-Thompson variance estimator with estimates of the joint inclusion probabilities, where those estimates are obtained by simulation. Here, simulation means repeating the sampling process several thousand times, tallying the share of simulations in which each pair of units ends up in the sample together, and using those shares as the estimated joint probabilities.

  2. Make some simplifying approximations and use a variance estimator proposed by Deville and Tillé. In the technical jargon of the cube method, this amounts to assuming that all of the variance comes from the “flight phase” rather than the “landing phase” of the algorithm, so that the sampling method can be treated as if it were a conditional Poisson sample design. (A sketch of the estimator’s general form is given just below.)
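
For reference, the Deville-Tillé estimator is, roughly, a residual-based estimator. One common variant (sketched here from Deville and Tillé (2005); treat the exact finite-population correction as approximate) has the form

$$\hat{V}_{DT}(\hat{Y}) = \frac{n}{n - q} \sum_{k \in S} (1 - \pi_k) \left( \frac{y_k - x_k^{T} \hat{B}}{\pi_k} \right)^{2}$$

where $q$ is the number of balancing variables, $x_k$ is the vector of balancing variables for unit $k$, and $\hat{B}$ comes from a weighted regression of $y_k / \pi_k$ on $x_k / \pi_k$. In other words, the estimator only “sees” the part of the variation in $y$ that is not explained by the balancing variables.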

The Deville-Tillé estimator works fine in many cases, but it has some important drawbacks. For starters, we don’t have a great sense of its error properties in any given application. The only way to figure that out is through simulations. But we do have some guidance about when it won’t work well. In particular, the Deville-Tillé approximation will underestimate the sampling variance when the sampling process has a hard time with the balancing; that is, when the landing phase has to give up on selecting a perfectly balanced sample and results in substantial imbalance. This can frequently happen when using the “stratified cube method” of Chauvet (2009). So in practice, the Deville-Tillé estimator can often yield systematically too small estimates of sampling variance, which is concerning for official statistics as it means that data users end up overconfident about how much the data can tell them about the population.

Maybe We Should Be Using Simulations More Often

Despite these issues, the simulation-based approach has largely been ignored in favor of the approximation-based approach of Deville and Tillé.

The hope behind a simulation-based approach is that, if we throw enough computing resources at the problem, we can get arbitrarily accurate variance estimates. In the “Computer Age of Statistical Inference” that we now live in, simulation seems like it should be totally feasible and even easy for many applications of the cube method. Every year that passes, simulation-based inference gets easier and easier as our computers keep getting more powerful. Nowadays, statisticians working on standard-issue laptops routinely use Monte Carlo simulation to estimate uncertainty for complex Bayesian models in hours or even minutes.

Maybe survey sampling and variance estimation for complex samples is an area where we can benefit from using more simulations.

An Example in R

To show how easy it is to adopt the simulation-based method of variance estimation for the cube method, I put together an example in R. The data source we’ll use is a dataset of all counties in the U.S., with a measure of literacy levels obtained from the PIAAC small area estimation program.

Data Source: https://nces.ed.gov/surveys/piaac/skillsmap/

This example is nice because it’s a national survey where we have clear outcome variables (literacy) and an obvious choice of balancing variables: the covariates used in a validated small area estimation model.

Creating a Sampling Frame

The sampling frame for this example is the set of all counties in the U.S. as of 2017, with data obtained from the public PIAAC small area estimation dataset.

Click to show R code for downloading and importing data
library(readxl)
library(dplyr)
library(stringr)

# Download the PIAAC Skills Map data to a temporary location
data_url <- "https://nces.ed.gov/surveys/piaac/skillsmap/static/media/SAE_website_dataset.c15d59a2d7e219fcb6d1.xlsx"
downloaded_filepath <- "sae_website_dataset.xlsx"

if (!file.exists(downloaded_filepath)) {
  download.file(data_url, destfile = downloaded_filepath,
                method = "curl")
}

# Load the data for U.S. counties
skills_map_data <- readxl::read_xlsx(
  path = downloaded_filepath, sheet = "County",
  guess_max = 5000
) |> 
  rename_with(toupper) |>
  filter(GRPNAME == "all") |>
  mutate(FIPS_CODE = str_pad(FIPS_CODE, width = 5,
                             side = "left", pad = "0"))

We’ll select only the variables of interest. Since some counties have a missing value for population size, we’ll impute using a simple method.

Click to show R code
sampling_frame <- skills_map_data |>
  # Impute missing population size for a small number of counties
  mutate(POP_SIZE = as.numeric(POP_DOMAIN),
         POP_SIZE = ifelse(is.na(POP_SIZE),
                           round(0.75 * as.numeric(POP), 0),
                           POP_SIZE) |> as.integer()) |>
  select(
    # Geographic identifiers
    FIPS_CODE, COUNTY, STATE,
    # Population size
    POP_SIZE,
    # Literacy and numeracy measures
    LIT_A, LIT_P1, LIT_P2, LIT_P3,
    NUM_A, NUM_P1, NUM_P2, NUM_P3,
    # Auxiliary variables derived from ACS tables
    BLACK, HISPANIC,
    LESS_HS, HS, MORE_HS,
    POVERTY_100, SNAP, UNEMPLOYED, OCC_SERVICE, NO_INSURANCE,
    FB, ENG_NOT_WELL
  ) |>
  mutate(across(-c(FIPS_CODE, COUNTY, STATE),
                as.numeric))

Here’s what the data look like. There’s one row per county, a population size variable POP_SIZE, literacy and numeracy measures, and several auxiliary variables.

sampling_frame |> head(5)
# A tibble: 5 × 24
  FIPS_CODE COUNTY STATE POP_SIZE LIT_A LIT_P1 LIT_P2 LIT_P3 NUM_A NUM_P1 NUM_P2
  <chr>     <chr>  <chr>    <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1 01001     Autau… Alab…    39765  262.  0.205  0.384  0.411  247.  0.328  0.363
2 01003     Baldw… Alab…   148385  270.  0.161  0.347  0.493  257.  0.261  0.353
3 01005     Barbo… Alab…    19450  236.  0.394  0.404  0.201  216.  0.581  0.273
4 01007     Bibb … Alab…    16970  248.  0.269  0.454  0.277  232.  0.417  0.402
5 01009     Bloun… Alab…    41955  258.  0.248  0.37   0.382  246.  0.352  0.341
# ℹ 13 more variables: NUM_P3 <dbl>, BLACK <dbl>, HISPANIC <dbl>,
#   LESS_HS <dbl>, HS <dbl>, MORE_HS <dbl>, POVERTY_100 <dbl>, SNAP <dbl>,
#   UNEMPLOYED <dbl>, OCC_SERVICE <dbl>, NO_INSURANCE <dbl>, FB <dbl>,
#   ENG_NOT_WELL <dbl>

Determining a Sample Design

We’ll draw a sample of 200 counties, with their sampling probabilities proportional to their population size.

library(sampling)

# Define sampling probabilities
sampling_frame <- sampling_frame |>
  mutate(SAMP_PROB = inclusionprobabilities(
    a = POP_SIZE, n = 200
  )) |>
  ungroup() |>
  relocate(SAMP_PROB, .after = POP_SIZE)
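
For context, the inclusionprobabilities() function from the ‘sampling’ package produces probability-proportional-to-size inclusion probabilities that are, roughly,

$$\pi_k = \min\!\left(1, \; n \cdot \frac{z_k}{\sum_{l \in U} z_l}\right)$$

where $z_k$ is the county’s population size; any probabilities that would exceed one are set to one, and the remaining probabilities are rescaled so that they still sum to $n$.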

To balance the sample, we’ll use a handful of auxiliary variables known to be correlated with literacy and numeracy. These are most of the variables that were used as predictors for the small area estimation models.

balancing_vars <- model.matrix(
  object = ~ -1 + POVERTY_100 + BLACK + HISPANIC + NO_INSURANCE + OCC_SERVICE,
  data   = sampling_frame
)

To ensure a fixed sample size when using the cube method, the matrix of balancing variables will include a variable which is simply equal to the sampling probabilities.

balancing_vars <- cbind(
  'SAMP_PROB' = sampling_frame[['SAMP_PROB']],
  balancing_vars
)
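
Why does this ensure a fixed sample size? For each balancing variable $x_k$, the cube method tries to satisfy the balancing equation $\sum_{k \in S} x_k / \pi_k = \sum_{k \in U} x_k$ exactly. With $x_k = \pi_k$, that equation becomes

$$\sum_{k \in S} \frac{\pi_k}{\pi_k} = \#S = \sum_{k \in U} \pi_k = n$$

so any sample that satisfies it has exactly $n$ units.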

Note that this set of balancing variables is quite strong.

balancing_variables_model <- lm(
  data    = sampling_frame, 
  formula = LIT_A ~ POVERTY_100 + BLACK + HISPANIC +
                    NO_INSURANCE + OCC_SERVICE
)

balancing_variables_model |> summary()

Call:
lm(formula = LIT_A ~ POVERTY_100 + BLACK + HISPANIC + NO_INSURANCE + 
    OCC_SERVICE, data = sampling_frame)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.222  -4.798  -0.518   4.471  33.446 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  289.2886     0.6801 425.368  < 2e-16 ***
POVERTY_100  -95.2764     2.6486 -35.972  < 2e-16 ***
BLACK        -20.4995     1.0357 -19.793  < 2e-16 ***
HISPANIC     -29.7682     1.0763 -27.659  < 2e-16 ***
NO_INSURANCE -50.2900     3.1318 -16.058  < 2e-16 ***
OCC_SERVICE  -21.6135     3.8592  -5.601 2.32e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.348 on 3136 degrees of freedom
Multiple R-squared:  0.6965,    Adjusted R-squared:  0.6961 
F-statistic:  1440 on 5 and 3136 DF,  p-value: < 2.2e-16

Selecting a Sample

In R, we can use the ‘BalancedSampling’ R package to select a sample using the cube method, like so.

library(BalancedSampling)

# Select sample, get row indices of sample elements on frame
  sample_indices <- cube(
    prob  = sampling_frame[['SAMP_PROB']],
    x     = balancing_vars
  )

# Get the data for the selected sample
  selected_sample <- sampling_frame |> slice(sample_indices)

Creating Replicate Weights Using Simulation

We’ll use simulation to estimate the joint sampling probabilities, and then use those to construct replicate weights.

Estimating Joint Probabilities Using Direct Simulation

One approach would be to simply re-run the sampling method a large number of times and count the number of times each pair of units is sampled together. We’ll call that the “direct simulation” approach.

Here’s some R code that does that using the lightning-fast cube sampling function from the ‘BalancedSampling’ package, which makes it feasible to run tens of thousands of simulations in just a few minutes on a laptop. The example below uses 10,000 simulations.

library(BalancedSampling)

### Estimating Joint Probabilities Using Direct Simulation

# Use the sample we already selected above; we want to estimate
# the joint inclusion probabilities for its 'n' sampled units
n <- length(sample_indices)

joint_probs <- matrix(data = 0, nrow = n, ncol = n)

# Repeatedly re-run the sampling process, tallying how often each pair
# of the originally-sampled units is selected together
B <- 10000
b <- 1
while (b <= B) {
  
  sim_sample <- cube(
    prob = sampling_frame[['SAMP_PROB']],
    x    = balancing_vars
  )
  
  # Indicators of which originally-sampled units appear in this simulated sample
  in_sim <- as.numeric(sample_indices %in% sim_sample)
  joint_probs <- joint_probs + tcrossprod(in_sim)
  
  b <- b + 1
}

# Convert the tallies to estimated joint probabilities
joint_probs <- (1/B) * joint_probs

That approach is effective but it’s unfortunately very inefficient. When the population size is large relative to the sample size, some of the estimated joint probabilities might be zero, even after 1,000,000 simulations, which poses a big problem for variance estimation.
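
To see why zero (or near-zero) estimated joint probabilities are such a problem, recall that the standard Horvitz-Thompson variance estimator divides by each joint inclusion probability:

$$\hat{V}_{HT}(\hat{Y}) = \sum_{k \in S} \sum_{l \in S} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}} \cdot \frac{y_k}{\pi_k} \cdot \frac{y_l}{\pi_l}$$

(using the convention $\pi_{kk} = \pi_k$). If an estimated $\hat{\pi}_{kl}$ is exactly zero, the corresponding term is undefined, and if it is merely close to zero, that term can blow up.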

Estimating Joint Probabilities Using The Breidt-Chauvet Simulation Method

Fortunately, Breidt and Chauvet (2011) developed a much more efficient way of estimating joint probabilities with simulations, which takes advantage of some nice mathematical properties of the cube method for sampling. The only problem is that there is no published software for the Breidt-Chauvet method.
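
As best I can summarize it, the key idea is that each step of the cube method’s flight phase moves the vector of inclusion probabilities by a random amount with mean zero, and the covariance contributed by each step has a simple closed form. At a step that moves along a direction $u$, the update is

$$\psi \leftarrow \psi + \lambda_1 u \;\text{ with probability } \frac{\lambda_2}{\lambda_1 + \lambda_2}, \qquad \psi \leftarrow \psi - \lambda_2 u \;\text{ with probability } \frac{\lambda_1}{\lambda_1 + \lambda_2}$$

so the step has conditional mean zero and conditional covariance $\lambda_1 \lambda_2 \, u u^{T}$. Accumulating those contributions within each simulated run and averaging across runs yields an estimate of the covariance matrix of the sampling indicators, without having to wait for rare pairs of units to show up together in simulated samples.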

Since there’s no published implementation, I coded this up in Julia, which is a nice tool for prototyping statistical methods and running simulations. The Julia script below defines a module with functions for conducting cube sampling and for estimating the joint inclusion probabilities of the sampled elements.

Click to show the contents of the Julia script
```{julia}
module fast_cube_with_probs

# Helper function to obtain a vector in matrix's null space
import LinearAlgebra: nullspace
import Random: Random

import Base.Threads: @threads, nthreads

function get_null_space_vector(X::Matrix{Float64}) 
    try nullspace(X)[:,end] 
    catch Nothing 
    end
end

# Helper function to flag which probabilities have been resolved to zero or one
function is_finalized(π::Vector{Float64}; ϵ=1e-11)
    (π .< ϵ) .| (π .> (1-ϵ))
end

# Define function to conduct the "fast" flight phase
function cube_fast_flight(π::Vector{Float64}, X::Matrix{Float64}; ϵ = 1e-11, π_0 = π, Δ_indices = Vector(1:length(π)))
    
    result = copy(π)

    already_resolved = is_finalized(π; ϵ = ϵ)
    π = π[.!already_resolved]
    π_0 = π_0[.!already_resolved]
    X = X[.!already_resolved,:]
    
    if (length(Δ_indices) > 0)
        Δ_result = zeros(length(Δ_indices), length(Δ_indices))
    else
        Δ_result = Matrix{Float64}(undef, 0, 0)
    end
    Δ_result_indices = findall(map(in(findall(.!already_resolved)), Δ_indices))
    Δ_indices = findall(map(in(Δ_indices), findall(.!already_resolved)))
    
    # Conduct Step 1 of Algorithm 2 of Chauvet (2006)
    N = length(π)
    J = size(X, 2)
    ψ = π[begin:(J+1)]
    r = Vector(1:(J+1))
    B = Matrix(transpose((π_0[1:(J+1)].^(-1)) .* X[1:(J+1),:]))
    k = J + 2
    N_plus_one = N + 1

    if (length(Δ_indices) > 0)
        Δ = zeros(length(Δ_indices), length(Δ_indices))
        Δ_ind_dict = Dict(zip(Δ_indices, Vector(1:length(Δ_indices))))
    else
        Δ = Matrix{Float64}(undef, 0, 0)
    end  

    while (k <= N_plus_one)
        # Determine random direction to move ψ
        # within the nullspace of B
        u_B = get_null_space_vector(B)

        u_B_nonzero = (u_B .!= 0)
        vec_1 = (1 .- ψ[u_B_nonzero]) ./ u_B[u_B_nonzero]
        vec_2 = (ψ[u_B_nonzero]) ./ u_B[u_B_nonzero]

        λ_1 = minimum(max.(vec_1, -vec_2))
        λ_2 = minimum(max.(-vec_1, vec_2))

        q_1 = λ_2/(λ_1 + λ_2)

        if (rand(Float64) < q_1)
            λ =  λ_1
        else 
            λ = -λ_2
        end
        ψ = ψ .+ (λ .* u_B)

        which_r_relevant = findall(map(in(Δ_indices), r))
        if (length(which_r_relevant) > 0) 
            Δ[[Δ_ind_dict[i] for i in r[which_r_relevant]],
              [Δ_ind_dict[i] for i in r[which_r_relevant]]] += (λ_1 * λ_2) .* (u_B[which_r_relevant] * transpose(u_B[which_r_relevant]))
        end

        # Update π for cases resolved by the update to ψ,
        # and then remove those case from ψ and B,
        # replacing them with cases not yet handled
        for i in 1:(J+1)
            if (ψ[i] < ϵ) | (ψ[i] > (1 - ϵ))
                π[r[i]] = ψ[i]
                if (k <= N)
                    r[i] = k
                    ψ[i] = π[k]
                    B[:,i] = (π_0[k].^(-1)) .* X[k,:]
                end
                k += 1
            end
        end
    end

    π[r[1:(J+1)]] = ψ[1:(J+1)]
    
    π[π .< ϵ] .= 0
    π[π .> (1-ϵ)] .= 1

    result[.!already_resolved] = π
    if (length(Δ_indices) > 0)
        Δ_result[Δ_result_indices, Δ_result_indices] = Δ
    end
    return result, Δ_result
end
 
# Define function to conduct the landing phase by dropping balancing variables
# and repeating the fast flight phase with the reduced set of balancing variables
function cube_land(π::Vector{Float64}, π_0::Vector{Float64}, X::Matrix{Float64}; ϵ = 1e-11, Δ_indices = Vector(1:length(π)))
 
    # Get dimensions of problem
      N = length(π)
      J = size(X, 2)
  
      # Initialize variables
      t = 0
      π_t = copy(π)
      remaining_indices = findall(.!is_finalized(π))
      n_remaining = length(remaining_indices)
      n_dropped_variables = 0
  
      if (length(Δ_indices) > 0)
          Δ = zeros(length(Δ_indices), length(Δ_indices))
      else 
          Δ = Matrix{Float64}(undef, 0, 0)
      end    
   
      while ((n_remaining > 0))
   
          # Drop variables as needed
          n_dropped_variables += 1
          J -= 1
  
          Δ_entries_to_update = findall(map(in(intersect(Δ_indices,remaining_indices)), Δ_indices)) 
   
          π_t[remaining_indices], Δ_t = cube_fast_flight(
              π_t[remaining_indices], 
              X[remaining_indices,1:J];
              π_0 = π_0[remaining_indices],
              Δ_indices = findall(map(in(Δ_indices), remaining_indices))
          )
          if (length(Δ_entries_to_update) > 0)
              Δ[Δ_entries_to_update, Δ_entries_to_update] += Δ_t
          end
   
          remaining_indices = findall((π_t .> ϵ) .& (π_t .< (1-ϵ)))
          n_remaining = length(remaining_indices)
      end
   
      a = π_t
      a[a .> (1-ϵ)] .= 1
      a[a .< ϵ] .= 0
      result = Int64.(round.(a; digits = 0))
      return result, Δ
end
 
# Combine the flight and landing phases, conduct simulations if requested
function fast_cube(π::Vector{Float64}, X::Matrix{Float64}; seed::Union{Int64, Nothing} = nothing, sim_method = "Breidt-Chauvet", n_sims = 1e+5, ϵ = 1e-11)

    if (!isnothing(seed))
        Random.seed!(seed)
    end

    π_f, _   = cube_fast_flight(π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)
    a  , _   = cube_land(π_f, π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)

    if (n_sims > 0)
        Δ = zeros(sum(a), sum(a))
        Δ_indices = findall(a .== 1)
    else
        Δ = zeros(0, 0)
    end

    if ((n_sims > 0) & (sim_method == "Monte Carlo"))
        for sim in 1:n_sims
            π_f    , _ = cube_fast_flight(π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)
            a_sim  , _ = cube_land(π_f, π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)
            Δ += (a_sim[Δ_indices] - π[Δ_indices]) * transpose(a_sim[Δ_indices] - π[Δ_indices])
        end
        Δ /= n_sims
    end

    if ((n_sims > 0) & (sim_method == "Breidt-Chauvet"))
        chunk_sizes = [ceil(Int, n_sims / nthreads()) for thread in 1:nthreads()]
        chunk_index = 1
        while (sum(chunk_sizes) > n_sims)
            chunk_sizes[chunk_index] -= 1
            chunk_index    += 1
        end
        Δ_for_thread = [zeros(length(Δ_indices), length(Δ_indices)) for thread in 1:nthreads()]
        @threads for i in 1:nthreads()
            for sim in 1:chunk_sizes[i]
                π_f, Δ_f = cube_fast_flight(π, X; Δ_indices = Δ_indices, ϵ = ϵ)
                _  , Δ_l = cube_land(π_f, π, X; Δ_indices = Δ_indices, ϵ = ϵ)
                Δ_for_thread[i] += Δ_f + Δ_l
            end
        end
        Δ = reduce(+, Δ_for_thread)
        Δ /= n_sims
    end

    return (indicators = a, Delta = Δ)
end

end

```

The R code below starts a Julia session, and runs the Julia script so that the functions are available in R.

library(JuliaConnectoR)

# Let Julia use high-level parallelization through multithreading
  Sys.setenv(JULIA_NUM_THREADS = 6)

# Let Julia use low-level multithreading for linear algebra operations
  juliaEval("using LinearAlgebra")
  juliaEval("BLAS.set_num_threads(4)")
  
# Set a random seed for the Julia session
  juliaEval("import Random")
  juliaEval("Random.seed!(2024)")
<Julia object of type Random.TaskLocalRNG>
Random.TaskLocalRNG()
# Run the Julia script
  julia_script <- file.path(
    "cube-method-with-probabilities.jl"
  ) |> normalizePath()
  
  juliaCall("include", julia_script)
$name
[1] "fast_cube_with_probs"

attr(,"JLTYPE")
[1] "Module"
# Import the module of Julia functions so that the functions can be called in R
  cube_module <- juliaImport(".fast_cube_with_probs")

Now that the Julia module is available in R, we can run the main function of interest. The code below selects a sample using the cube method, and then uses the Breidt-Chauvet simulation method to estimate joint sampling probabilities for the sampled elements, using a total of 100,000 simulations, which takes about five minutes to run on my laptop.

# Run the Julia function, using inputs from R
  sampling_output <- juliaGet(
    cube_module$fast_cube(
      sampling_frame[['SAMP_PROB']],
      balancing_vars,
      n_sims     = 100000,
      sim_method = "Breidt-Chauvet"
    )
  )
  
# Get the data for the selected sample
  selected_sample <- sampling_output$indicators |>
    as.logical() |> which() |>
    slice(.data = sampling_frame)

Let’s look at the structure of the output. The code below shows us that we have a list object with two elements. The first element is a vector of inclusion indicators of length N, with all entries equal to 0 or 1. The second element is a square matrix named Delta.

str(sampling_output)
List of 2
 $ indicators: int [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
 $ Delta     : num [1:200, 1:200] 2.48e-01 -9.14e-05 -1.96e-03 -1.01e-03 0.00 ...
 - attr(*, "JLTYPE")= chr "@NamedTuple{indicators::Vector{Int64}, Delta::Matrix{Float64}}"

This matrix Δ gives the estimated variance-covariance matrix of sampling indicators for the sampled elements. For any pair of elements k and l in the sample, the entry in row k and column l is as follows:

$$\Delta_{kl} = \hat{\pi}_{kl} - \pi_k \pi_l$$

We can easily convert those to estimates of the joint probabilities $\hat{\pi}_{kl}$ by adding the known quantities $\pi_k \pi_l$ to each element $(k, l)$ of the matrix.

get_joint_probs <- function(pi, Delta, indicators) {
  Delta + tcrossprod(pi[as.logical(indicators)])
}
joint_probs_bc <- get_joint_probs(
  pi         = sampling_frame[['SAMP_PROB']], 
  Delta      = sampling_output$Delta, 
  indicators = sampling_output$indicators
)

Let’s look at the first few entries of the matrix.

joint_probs_bc[1:5, 1:5]
           [,1]        [,2]        [,3]        [,4]       [,5]
[1,] 0.45640569 0.010918199 0.064186377 0.094806829 0.45618667
[2,] 0.01091820 0.024177025 0.003547846 0.005406388 0.02413396
[3,] 0.06418638 0.003547846 0.145303146 0.014087532 0.14499282
[4,] 0.09480683 0.005406388 0.014087532 0.210428247 0.21004013
[5,] 0.45618667 0.024133960 0.144992823 0.210040131 1.00000000

As we’d expect, the diagonal entries (which estimate the first-order sampling probabilities $\pi_k$ of the individual elements) closely match the known sampling probabilities.

# Compare estimated to actual probabilities
joint_probs_bc |> diag() |> head(5)
[1] 0.45640569 0.02417702 0.14530315 0.21042825 1.00000000
sampling_frame[['SAMP_PROB']][as.logical(sampling_output$indicators)] |>
  head(5)
[1] 0.45618667 0.02413396 0.14499282 0.21004013 1.00000000

Fay’s Generalized Replication Method

Now that we have the estimated joint probabilities, we can create replicate weights using Fay’s generalized replication method. This approach creates replicates based around a target variance estimator’s quadratic form. As our target variance estimator, we’ll use the Horvitz-Thompson variance estimator based on the estimated joint probabilities.
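
In rough terms: if a target variance estimator can be written as a quadratic form in the weighted observations, Fay’s generalized replication method builds replicates from a decomposition of that quadratic form’s matrix, so that the usual sum of squared deviations across replicates reproduces the target estimate (at least approximately, once the number of replicates is capped):

$$\hat{v}(\hat{Y}) = \breve{y}^{T} \Sigma \, \breve{y} \;\approx\; \sum_{r=1}^{R} c_r \left( \hat{Y}_r - \hat{Y} \right)^{2}, \qquad \breve{y}_k = \frac{y_k}{\pi_k}$$

where $\hat{Y}_r$ is the estimate computed using the $r$-th set of replicate weights and the $c_r$ are scale factors stored alongside the replicate factors.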

The R code below creates the replication factors.

library(svrep)

quad_form_matrix <- make_quad_form_matrix(
  variance_estimator = "Horvitz-Thompson",
  joint_probs        = joint_probs_bc
)
replicate_factors <- make_fays_gen_rep_factors(
  quad_form_matrix,
  max_replicates = 200
)

And the R code below prepares our data for variance estimation using the replicate factors.

library(survey)

ht_rep_design <- svrepdesign(
  data       = selected_sample,
  weights    = selected_sample$SAMP_PROB^(-1),
  repweights = replicate_factors,
  type       = 'other',
  combined   = FALSE,
  scale      = attr(replicate_factors, 'scale'),
  rscales    = attr(replicate_factors, 'rscales'),
  mse        = TRUE
)

Let’s look at a couple example estimates. Below, we have the average county’s share of persons who are at the lowest level of literacy.

ht_rep_design |> svymean(x = ~ LIT_P1)
          mean     SE
LIT_P1 0.22208 0.0131

That’s pretty close to the actual population value.

sampling_frame$LIT_P1 |> mean()
[1] 0.2171292

Comparing to An Alternative Variance Estimator

An alternative is the Deville-Tillé (2005) variance estimator developed for balanced sampling designs. We can create replicate weights based on this estimator using the ‘svrep’ package, again via Fay’s generalized replication method.

dt_rep_design <- selected_sample |>
  svydesign(data = _, ids = ~ 1, probs = ~ SAMP_PROB) |>
  as_fays_gen_rep_design(
    variance_estimator = "Deville-Tille",
    aux_var_names      = colnames(balancing_vars) 
  )

Now let’s compare variance estimates.

ht_rep_design |> svymean(x = ~ LIT_A)
        mean     SE
LIT_A 259.85 1.9415
dt_rep_design |> svymean(x = ~ LIT_A)
        mean     SE
LIT_A 259.85 1.1423
ht_rep_design |> svytotal(x = ~ LIT_P2)
        total     SE
LIT_P2 1054.8 113.72
dt_rep_design |> svytotal(x = ~ LIT_P2)
        total     SE
LIT_P2 1054.8 29.694

It’s not surprising that the Deville-Tillé estimated standard errors are smaller. Part of the problem with that estimator is that it tends to be downwardly biased, since it ignores a systematic part of the sampling error: it behaves as if the cube method always resulted in perfectly balanced samples, but that’s often far from true.

Below, we compare the sample estimates of the population totals to the actual population totals. The discrepancies are pretty big for several of the balancing variables.

# Get vector of population totals
pop_totals <- balancing_vars |> colSums()

# Get the sample-based estimates of population totals
estimated_totals <- svytotal(
  x      = reformulate(colnames(balancing_vars)),
  design = ht_rep_design
) |> coef()

# Calculate relative differences
(estimated_totals - pop_totals) / pop_totals
   SAMP_PROB  POVERTY_100        BLACK     HISPANIC NO_INSURANCE  OCC_SERVICE 
  0.00000000  -0.03974905   0.24282185  -0.25767936  -0.06059120  -0.11482742 

Fortunately, the imbalance can be fixed up by calibrating on the balancing variables so that their estimates actually do have a variance of zero.

# Calibrate the samples to the population totals
ht_rep_design_calibrated <- ht_rep_design |>
  calibrate(formula = reformulate(colnames(balancing_vars),
                                  intercept = FALSE),
            population = pop_totals,
            compress = FALSE)

dt_rep_design_calibrated <- dt_rep_design |>
  calibrate(formula = reformulate(colnames(balancing_vars),
                                  intercept = FALSE),
            population = pop_totals,
            compress = FALSE)

Here are the variance estimates for a couple key outcome variables after calibration. The two variance estimators are now very closely aligned.

ht_rep_design_calibrated |> svymean(x = ~ LIT_A)
        mean     SE
LIT_A 261.83 1.4581
dt_rep_design_calibrated |> svymean(x = ~ LIT_A)
        mean     SE
LIT_A 261.83 1.3796
ht_rep_design_calibrated |> svytotal(x = ~ LIT_P2)
        total     SE
LIT_P2 1175.4 57.096
dt_rep_design_calibrated |> svytotal(x = ~ LIT_P2)
        total     SE
LIT_P2 1175.4 39.695

Still, it’s a little concerning that one of these variance estimators would perform so badly if we didn’t calibrate the sample. The variance estimator seemed to have a large downward bias, which would lead us to be overconfident in our inferences. As we can see below, the point estimate was drastically changed by the calibration, but the variance estimate hardly changed at all, which is a warning sign that the variance estimator before calibration was badly biased.

dt_rep_design            |> svymean(x = ~ LIT_A)
        mean     SE
LIT_A 259.85 1.1423
dt_rep_design_calibrated |> svymean(x = ~ LIT_A)
        mean     SE
LIT_A 261.83 1.3796

One advantage we might hope to see from the simulation-based approach is that we should be able to make the bias of our variance estimator as small as we like: the Horvitz-Thompson estimator itself is unbiased, so any bias in our variance estimator comes from the estimation of the joint probabilities that go into it, and we can shrink that bias simply by throwing more simulations at the problem.
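
How quickly does that bias shrink? For the direct simulation approach, each estimated joint probability is just a binomial proportion across the $B$ simulated samples, so its Monte Carlo standard error shrinks at the usual square-root rate:

$$SE\!\left(\hat{\pi}_{kl}\right) = \sqrt{\frac{\pi_{kl} \left(1 - \pi_{kl}\right)}{B}}$$

which is also a reminder of the catch discussed next: when $\pi_{kl}$ itself is tiny, an enormous number of simulations is needed before its estimate is reliably nonzero.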

The Problem with Simulating Joint Probabilities

There’s one big catch however. With the Horvitz-Thompson estimator and other estimators that rely on knowing (or estimating) joint inclusion probabilities, the variance estimator will be undefined if the sampling design leads to joint inclusion probabilities of zero. And we can’t know whether the sampling design produces zero joint inclusion probabilities just from our selected sample: it’s a truism that selected samples will only ever contain pairs of units that have positive joint inclusion probabilities. On top of that, for many applications, simulating joint probabilities for the entire population is computationally infeasible.

So that’s a notable advantage in favor of other variance estimators that don’t rely on joint probabilities. The Deville-Tillé estimator works because it relies on a reasonable approximating model for the sampling variance, and that model can hold true even when the sampling design involves some joint inclusion probabilities being zero. But the problem with the Deville-Tillé variance estimator is that it relies on a frequently unreasonable assumption about the quality of balancing for the cube method. So whether we simulate joint probabilities or use the Deville-Tillé estimator, we have to rely on an assumption that may be flawed and lead to bad inference.

Using Simulation to Augment the Deville-Tillé Estimator

Fortunately, the Deville-Tillé estimator’s assumption of perfect balancing can be assessed using simulation. Instead of assuming that the cube method leads to perfect or near-perfect balancing for auxiliary variables, we can use B simulations to estimate the variance of the sample balance.

$$\hat{V}(\hat{X}) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{X}_b - X\right)\left(\hat{X}_b - X\right)^{T}$$

The R code below runs $B = 10{,}000$ simulations to estimate the variance-covariance matrix of the vector of weighted totals of the balancing variables, $\hat{X}$. While we’re at it, we’ll also estimate the variance of the estimated total for the outcome variable.

Click to show R code
# Initialize variance-covariance matrix for balancing variables
pop_totals <- balancing_vars |> colSums()
Y <- sampling_frame$LIT_A |> sum()
X_vcov <- matrix(0, nrow = length(pop_totals), ncol = length(pop_totals))
Yhat_vcov <- 0

# Initialize simulation counter
n_sim_samples <- 10000
sim_index <- 1

# Run simulations, updating estimate of X_vcov each time
while (sim_index <= n_sim_samples) {
  
  # Select a sample using the cube method
  sim_sample <- juliaGet(
    cube_module$fast_cube(
      sampling_frame[['SAMP_PROB']],
      balancing_vars,
      n_sims = 0
    )
  ) |> 
    getElement("indicators") |>
    as.logical() |> which()
  
  # Get weighted totals of balancing variables
  X_hat <- colSums(
    balancing_vars[sim_sample,] / sampling_frame[['SAMP_PROB']][sim_sample]
  )
  
  # Get weighted total for outcome variable
  Yhat <- sum(
    sampling_frame$LIT_A[sim_sample] / sampling_frame[['SAMP_PROB']][sim_sample]
  )
  
  # Update variance estimate
  X_vcov    <- X_vcov    + (n_sim_samples - 1)^(-1) * tcrossprod(X_hat - pop_totals)
  Yhat_vcov <- Yhat_vcov + (n_sim_samples - 1)^(-1) * (Yhat - Y)^2
  
  sim_index <- sim_index + 1L
}

Now let’s look at the coefficient of variation for each component of $\hat{X}$.

(X_vcov |> diag() |> sqrt()) / pop_totals
   SAMP_PROB  POVERTY_100        BLACK     HISPANIC NO_INSURANCE  OCC_SERVICE 
   0.0000000    0.1174322    0.1705357    0.3130446    0.2030957    0.1282019 

So the assumption of perfect balancing isn’t right at all.

To assess what kind of impact that might have on the variance estimates, let’s consider the theoretical variance of the estimator $\hat{Y}$. We can decompose it as follows using the law of total variance.

$$V(\hat{Y}) = E\left[V\!\left(\hat{Y} \mid \hat{X}\right)\right] + V\left[E\!\left(\hat{Y} \mid \hat{X}\right)\right]$$

The Deville-Tillé estimator assumes that the variance of $\hat{X}$ is zero, and so it is simply an estimate of the first component in the equation above. But since the variance of $\hat{X}$ is actually nonzero, the second component in the decomposition is nonzero and so the estimator is biased downward.

We can estimate the second term using a quadratic form.

$$\hat{V}\left(E\!\left(\hat{Y} \mid \hat{X}\right)\right) = \hat{\beta}^{T} \, \hat{V}(\hat{X}) \, \hat{\beta}$$

The term $\hat{\beta}$ is the estimated coefficient vector from a working model for the conditional expectation of $\hat{Y}$ given $\hat{X}$. Below, we fit a linear model of the outcome on the balancing variables, weighted by the sampling weights (the inverse sampling probabilities). We specify the option intercept = FALSE because one of the balancing variables is simply the sampling probabilities, so it already plays the role of an intercept in the working model.

working_model <- lm(
  data    = selected_sample,
  formula = reformulate(
    response   = "LIT_A",
    termlabels = colnames(balancing_vars),
    intercept  = FALSE
  ),
  weights = selected_sample$SAMP_PROB^(-1)
)
beta_hat <- coef(working_model)

This gives us the following estimate of the component $\hat{\beta}^{T} \hat{V}(\hat{X}) \hat{\beta}$.

b_X_b_var_est   <- t(beta_hat) %*% X_vcov %*% beta_hat

And we can add this to the estimate of the first component, which is simply the value from the Deville-Tillé estimator.

augmented_dt_std_error <- sqrt(
  as.numeric(vcov(svytotal(x = ~ LIT_A, dt_rep_design))) + 
  b_X_b_var_est
)
print(augmented_dt_std_error)
         [,1]
[1,] 110029.1

Let’s compare that to the variance estimate we got from using a Horvitz-Thompson estimator with estimated joint probabilities.

ht_rep_design |> svytotal(x = ~ LIT_A)
       total    SE
LIT_A 730247 76125

Interesting. The Horvitz-Thompson type variance estimate is a lot larger.

Now let’s see how all of these variance estimates compare to the true sampling variance, which we approximated using 10,000 simulations.

std_error_estimates <- data.frame(
  'True'                     = Yhat_vcov |> sqrt(),
  'HT - Joint Probabilities' = svytotal(x = ~ LIT_A, ht_rep_design) |> SE(),
  'Deville-Tillé'            = svytotal(x = ~ LIT_A, dt_rep_design) |> SE(),
  'Augmented Deville-Tillé'  = augmented_dt_std_error,
  check.names = FALSE
)
print(std_error_estimates)
      True HT - Joint Probabilities Deville-Tillé Augmented Deville-Tillé
1 107591.2                 76124.93      21053.16                110029.1

For simplicity, let’s scale these errors relative to the true standard error.

std_error_estimates / std_error_estimates[['True']]
  True HT - Joint Probabilities Deville-Tillé Augmented Deville-Tillé
1    1                0.7075385     0.1956773                1.022658

The Deville-Tillé estimate was much too small. But the alternative simulation-based variance estimators yielded much better variance estimates.

The “augmented” Deville-Tillé estimator considered here might be able to combine the relative strengths of the simulation-based approach and the analytic variance estimator. On the one hand, we can use simulation to weaken the assumption of perfect balance required by the Deville-Tillé variance estimator, and thereby reduce its bias. On the other hand, by not relying on simulation to estimate joint probabilities for a Horvitz-Thompson type variance estimator, we don’t have to assume that our sampling design has strictly positive joint inclusion probabilities, which is not guaranteed with the cube method.

And like the Deville-Tillé estimator, we can easily form replicate weights for it using Fay’s generalized replication method; a rough sketch of how that might work is below. I’m going to do some more work building out this idea and try to poke some holes in it, but I think it might have some promise.
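
As a rough sketch (not a worked-out implementation) of what that replication step might look like: since $\hat{\beta}$ is a linear function of the sampled outcome values, the second component $\hat{\beta}^{T} \hat{V}(\hat{X}) \hat{\beta}$ is itself a quadratic form in those values, so in principle it can be added to the quadratic form matrix of the Deville-Tillé estimator and the combined matrix passed to Fay’s generalized replication method. The code below assumes, as with the Horvitz-Thompson quadratic form earlier, that the quadratic form is applied to the weighted values $y_k / \pi_k$; the object Sigma_dt is a hypothetical placeholder for the Deville-Tillé quadratic form matrix for this sample.

library(svrep)

# Design information for the sampled units
pi_s <- selected_sample$SAMP_PROB
X_s  <- balancing_vars[as.logical(sampling_output$indicators), , drop = FALSE]
W    <- diag(1 / pi_s)

# The working-model coefficient vector is a linear function of the outcomes:
# beta_hat = M %*% y_s, where y_s is the vector of sampled outcome values
M <- solve(t(X_s) %*% W %*% X_s) %*% t(X_s) %*% W

# So beta_hat' V(X_hat) beta_hat = y_s' (M' V(X_hat) M) y_s.
# Rescaling by diag(pi_s) on both sides expresses this as a quadratic form
# in the weighted values y_k / pi_k used by Fay's generalized replication.
Sigma_balance <- diag(pi_s) %*% t(M) %*% X_vcov %*% M %*% diag(pi_s)

# 'Sigma_dt' is a hypothetical placeholder for the quadratic form matrix
# of the Deville-Tillé estimator for this sample
Sigma_augmented <- Sigma_dt + Sigma_balance

# Replicate factors targeting the augmented variance estimator
augmented_rep_factors <- make_fays_gen_rep_factors(
  Sigma_augmented,
  max_replicates = 200
)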