library(readxl)
library(dplyr)
library(stringr)
# Download the PIAAC Skills Map data to a temporary location
<- "https://nces.ed.gov/surveys/piaac/skillsmap/static/media/SAE_website_dataset.c15d59a2d7e219fcb6d1.xlsx"
data_url <- "sae_website_dataset.xlsx"
downloaded_filepath
if (!file.exists(downloaded_filepath)) {
download.file(data_url, destfile = downloaded_filepath,
method = "curl")
}
# Load the data for U.S. counties
skills_map_data <- readxl::read_xlsx(
  path = downloaded_filepath, sheet = "County",
  guess_max = 5000
) |>
  rename_with(toupper) |>
  filter(GRPNAME == "all") |>
  mutate(FIPS_CODE = str_pad(FIPS_CODE, width = 5,
                             side = "left", pad = "0"))
The Analytical Challenge Posed by the Cube Method
Recently on this blog, I wrote a post about the cube method and why I think it’s a promising method for survey sampling. In many ways, the cube method simplifies the goals we have when we draw a stratified sample: it allows us to allocate sample sizes to several (potentially overlapping) domains and automatically gives us a sample that is well-calibrated to known population information. However, one drawback of the cube method is that it complicates analysis: it’s not immediately obvious how to estimate sampling variances for estimates produced from a sample selected using the cube method. There is no simple analytic formula that yields exactly unbiased variance estimates, and it’s not obvious how one would generate an appropriate replication method using the bootstrap or other approaches.
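As a quick refresher on what “balanced” means here (a sketch in standard notation, not taken from the earlier post): the cube method selects a sample $s$ that respects the prescribed inclusion probabilities $\pi_k$ while trying to satisfy the balancing equations

$$\sum_{k \in s} \frac{\mathbf{x}_k}{\pi_k} \;=\; \sum_{k \in U} \mathbf{x}_k,$$

where $\mathbf{x}_k$ is the vector of balancing variables for unit $k$ and $U$ is the population. The “landing phase” of the algorithm kicks in precisely when these equations cannot be satisfied exactly.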
Simulations vs. Approximations
There’s a small literature on this problem, which has generally taken two approaches:
Use the Horvitz-Thompson variance estimator with estimated joint inclusion probabilities, where the estimates are obtained using simulation. Here, simulation means repeating the sampling process several thousand times, tallying up the percentage of simulations in which each pair of units ends up in the sample together, and using those percentages as the estimated joint probabilities (see the formula sketch below).
Make some simplifying approximations and use a variance estimator proposed by Deville and Tillé. In the technical jargon of the cube method, this amounts to assuming that all of the variance comes from the “flight phase” rather than the “landing phase” of the algorithm, so that the sampling method can be treated as if it were a conditional Poisson sampling design.
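For reference, the Horvitz-Thompson variance estimator that the first approach plugs the simulated joint probabilities $\hat{\pi}_{kl}$ into looks like this (standard notation):

$$\widehat{V}\!\left(\hat{Y}_{HT}\right) \;=\; \sum_{k \in s} \sum_{l \in s} \frac{\hat{\pi}_{kl} - \pi_k \pi_l}{\hat{\pi}_{kl}} \cdot \frac{y_k}{\pi_k} \cdot \frac{y_l}{\pi_l},$$

with $\hat{\pi}_{kk} = \pi_k$. The formula also makes clear why pairs with an estimated joint probability of zero are a problem: $\hat{\pi}_{kl}$ sits in the denominator.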
The Deville-Tillé estimator works fine in many cases, but it has some important drawbacks. For starters, we don’t have a great sense of its error properties in any given application; the only way to figure that out is through simulation. But we do have some guidance about when it won’t work well. In particular, the Deville-Tillé approximation will underestimate the sampling variance when the sampling process has a hard time with the balancing; that is, when the landing phase has to give up on selecting a perfectly balanced sample and results in substantial imbalance. This can frequently happen when using the “stratified cube method” of Chauvet (2009). So in practice, the Deville-Tillé estimator can often yield variance estimates that are systematically too small, which is concerning for official statistics because it means that data users end up overconfident about how much the data can tell them about the population.
Maybe We Should Be Using Simulations More Often
Despite these issues, the simulation-based approach has been ignored in favor of the approximation-based approach of Deville and Tillé.
The hope behind a simulation-based approach is that, if we throw enough computing resources at the problem, we can get arbitrarily accurate variance estimates. In the “Computer Age of Statistical Inference” that we now live in, simulation seems like it should be totally feasible and even easy for many applications of the cube method. Every year that passes, simulation-based inference gets easier and easier as our computers keep getting more powerful. Nowadays, statisticians working on standard-issue laptops routinely use Monte Carlo simulation to estimate uncertainty for complex Bayesian models in hours or even minutes.
Maybe survey sampling and variance estimation for complex samples is an area where we can benefit from using more simulations.
An Example in R
To show how easy it is to adopt the simulation-based method of variance estimation for the cube method, I put together an example in R. The data source we’ll use is a dataset of all counties in the U.S., with a measure of literacy levels obtained from the PIAAC small area estimation program.
Data Source: https://nces.ed.gov/surveys/piaac/skillsmap/
This example is nice because it’s a national survey where we have clear outcome variables (literacy) and an obvious choice of balancing variables: the covariates used in a validated small area estimation model.
Creating a Sampling Frame
The sampling frame for this example is the set of all counties in the U.S. as of 2017, with data obtained from the public PIAAC small area estimation dataset.
We’ll select only the variables of interest. Since some counties have a missing value for population size, we’ll impute using a simple method.
sampling_frame <- skills_map_data |>
  # Impute missing population size for a small number of counties
  mutate(POP_SIZE = as.numeric(POP_DOMAIN),
         POP_SIZE = ifelse(is.na(POP_SIZE),
                           round(0.75 * as.numeric(POP), 0),
                           POP_SIZE) |> as.integer()) |>
  select(
    # Geographic identifiers
    FIPS_CODE, COUNTY, STATE,
    # Population size
    POP_SIZE,
    # Literacy and numeracy measures
    LIT_A, LIT_P1, LIT_P2, LIT_P3,
    NUM_A, NUM_P1, NUM_P2, NUM_P3,
    # Auxiliary variables derived from ACS tables
    BLACK, HISPANIC,
    LESS_HS, HS, MORE_HS,
    POVERTY_100, SNAP, UNEMPLOYED, OCC_SERVICE, NO_INSURANCE,
    FB, ENG_NOT_WELL
  ) |>
  mutate(across(-c(FIPS_CODE, COUNTY, STATE),
                as.numeric))
Here’s what the data look like. There’s one row per county, a population size variable POP_SIZE, literacy and numeracy measures, and several auxiliary variables.
sampling_frame |> head(5)
# A tibble: 5 × 24
FIPS_CODE COUNTY STATE POP_SIZE LIT_A LIT_P1 LIT_P2 LIT_P3 NUM_A NUM_P1 NUM_P2
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01001 Autau… Alab… 39765 262. 0.205 0.384 0.411 247. 0.328 0.363
2 01003 Baldw… Alab… 148385 270. 0.161 0.347 0.493 257. 0.261 0.353
3 01005 Barbo… Alab… 19450 236. 0.394 0.404 0.201 216. 0.581 0.273
4 01007 Bibb … Alab… 16970 248. 0.269 0.454 0.277 232. 0.417 0.402
5 01009 Bloun… Alab… 41955 258. 0.248 0.37 0.382 246. 0.352 0.341
# ℹ 13 more variables: NUM_P3 <dbl>, BLACK <dbl>, HISPANIC <dbl>,
# LESS_HS <dbl>, HS <dbl>, MORE_HS <dbl>, POVERTY_100 <dbl>, SNAP <dbl>,
# UNEMPLOYED <dbl>, OCC_SERVICE <dbl>, NO_INSURANCE <dbl>, FB <dbl>,
# ENG_NOT_WELL <dbl>
Determining a Sample Design
We’ll draw a sample of 200 counties, with their sampling probabilities proportional to their population size.
library(sampling)
# Define sampling probabilities
sampling_frame <- sampling_frame |>
  mutate(SAMP_PROB = inclusionprobabilities(
    a = POP_SIZE, n = 200
  )) |>
  ungroup() |>
  relocate(SAMP_PROB, .after = POP_SIZE)
To balance the sample, we’ll use a handful of auxiliary variables known to be correlated with literacy and numeracy. These are most of the variables that were used as predictors for the small area estimation models.
balancing_vars <- model.matrix(
  object = ~ -1 + POVERTY_100 + BLACK + HISPANIC + NO_INSURANCE + OCC_SERVICE,
  data = sampling_frame
)
To ensure a fixed sample size when using the cube method, the matrix of balancing variables will include a variable which is simply equal to the sampling probabilities.
balancing_vars <- cbind(
  'SAMP_PROB' = sampling_frame[['SAMP_PROB']],
  balancing_vars
)
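Why does this guarantee a fixed sample size? Including $\pi_k$ itself among the balancing variables means the balancing equations force

$$\sum_{k \in s} \frac{\pi_k}{\pi_k} \;=\; n_s \qquad \text{and} \qquad \sum_{k \in U} \pi_k = 200,$$

so (up to the imperfection of the landing phase) every balanced sample contains exactly the target number of counties.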
Note that this set of balancing variables is quite strong.
balancing_variables_model <- lm(
  data = sampling_frame,
  formula = LIT_A ~ POVERTY_100 + BLACK + HISPANIC +
    NO_INSURANCE + OCC_SERVICE
)

balancing_variables_model |> summary()
Call:
lm(formula = LIT_A ~ POVERTY_100 + BLACK + HISPANIC + NO_INSURANCE +
OCC_SERVICE, data = sampling_frame)
Residuals:
Min 1Q Median 3Q Max
-25.222 -4.798 -0.518 4.471 33.446
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 289.2886 0.6801 425.368 < 2e-16 ***
POVERTY_100 -95.2764 2.6486 -35.972 < 2e-16 ***
BLACK -20.4995 1.0357 -19.793 < 2e-16 ***
HISPANIC -29.7682 1.0763 -27.659 < 2e-16 ***
NO_INSURANCE -50.2900 3.1318 -16.058 < 2e-16 ***
OCC_SERVICE -21.6135 3.8592 -5.601 2.32e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.348 on 3136 degrees of freedom
Multiple R-squared: 0.6965, Adjusted R-squared: 0.6961
F-statistic: 1440 on 5 and 3136 DF, p-value: < 2.2e-16
Selecting a Sample
In R, we can use the ‘BalancedSampling’ R package to select a sample using the cube method, like so.
library(BalancedSampling)
# Select sample, get row indices of sample elements on frame
sample_indices <- cube(
  prob = sampling_frame[['SAMP_PROB']],
  x = balancing_vars
)

# Get the data for the selected sample
selected_sample <- sampling_frame |> slice(sample_indices)
Creating Replicate Weights Using Simulation
We’ll use simulation to estimate the joint sampling probabilities, and then use those to construct replicate weights.
Estimating Joint Probabilities Using Direct Simulation
One approach would be to simply re-run the sampling method a large number of times and count the number of times each pair of units is sampled together. We’ll call that the “direct simulation” approach.
Here’s some R code that does that using the lightning-fast cube sampling function from the ‘BalancedSampling’ package, which lets us conduct 100,000 simulations in under five minutes.
library(BalancedSampling)
### Estimating Joint Probabilities Using Simple Simulations
sample_indices <- cube(
  prob = sampling_frame[['SAMP_PROB']],
  x = balancing_vars
)

# Number of sampled units
n <- length(sample_indices)

joint_probs <- matrix(data = 0, nrow = n, ncol = n)

B <- 10000
b <- 1
while (b <= B) {
  sim_sample <- cube(
    prob = sampling_frame[['SAMP_PROB']],
    x = balancing_vars
  )
  joint_probs <- joint_probs + tcrossprod(sample_indices %in% sim_sample)
  b <- b + 1
}
joint_probs <- (1/B) * joint_probs
That approach is effective but it’s unfortunately very inefficient. When the population size is large relative to the sample size, some of the estimated joint probabilities might be zero, even after 1,000,000 simulations, which poses a big problem for variance estimation.
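As a quick diagnostic of that problem (a hypothetical check of my own, not part of the post’s workflow), you could look at what share of the off-diagonal entries of the directly simulated joint-probability matrix ended up exactly zero:

# Hypothetical diagnostic: proportion of off-diagonal estimated
# joint probabilities that are exactly zero after the direct simulations
off_diagonal_probs <- joint_probs[upper.tri(joint_probs)]
mean(off_diagonal_probs == 0)

Any such zero entries would make the Horvitz-Thompson variance estimator undefined.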
Estimating Joint Probabilities Using The Breidt-Chauvet Simulation Method
Fortunately, Breidt and Chauvet (2011) developed a much more efficient way of estimating joint probabilities with simulations, which takes advantage of some nice mathematical properties of the cube method for sampling. The only problem is that there is no published software for the Breidt-Chauvet method.
That’s why I coded this up in Julia, which is a nice tool for prototyping statistical methods and running simulations. This Julia script below defines a module with functions for conducting cube sampling and also estimating the joint inclusion probabilities of the sampled elements.
```{julia}
module fast_cube_with_probs
# Helper function to obtain a vector in matrix's null space
import LinearAlgebra: nullspace
import Random: Random
import Base.Threads: @threads, nthreads
function get_null_space_vector(X::Matrix{Float64})
try nullspace(X)[:,end]
catch Nothing
end
end
# Helper function to check whether the vector of probabilities is all zeros or ones
function is_finalized(π::Vector{Float64}; ϵ=1e-11)
(π .< ϵ) .| (π .> (1-ϵ))
end
# Define function to conduct the "fast" flight phase
function cube_fast_flight(π::Vector{Float64}, X::Matrix{Float64}; ϵ = 1e-11, π_0 = π, Δ_indices = Vector(1:length(π)))
result = copy(π)
already_resolved = is_finalized(π; ϵ = ϵ)
π = π[.!already_resolved]
π_0 = π_0[.!already_resolved]
X = X[.!already_resolved,:]
if (length(Δ_indices) > 0)
Δ_result = zeros(length(Δ_indices), length(Δ_indices))
else
Δ_result = Matrix{Float64}(undef, 0, 0)
end
Δ_result_indices = findall(map(in(findall(.!already_resolved)), Δ_indices))
Δ_indices = findall(map(in(Δ_indices), findall(.!already_resolved)))
# Conduct Step 1 of Algorithm 2 of Chauvet (2006)
N = length(π)
J = size(X, 2)
ψ = π[begin:(J+1)]
r = Vector(1:(J+1))
B = Matrix(transpose((π_0[1:(J+1)].^(-1)) .* X[1:(J+1),:]))
k = J + 2
N_plus_one = N + 1
if (length(Δ_indices) > 0)
Δ = zeros(length(Δ_indices), length(Δ_indices))
Δ_ind_dict = Dict(zip(Δ_indices, Vector(1:length(Δ_indices))))
else
Δ = Matrix{Float64}(undef, 0, 0)
end
while (k <= N_plus_one)
# Determine random direction to move ψ
# within the nullspace of B
u_B = get_null_space_vector(B)
u_B_nonzero = (u_B .!= 0)
vec_1 = (1 .- ψ[u_B_nonzero]) ./ u_B[u_B_nonzero]
vec_2 = (ψ[u_B_nonzero]) ./ u_B[u_B_nonzero]
λ_1 = minimum(max.(vec_1, -vec_2))
λ_2 = minimum(max.(-vec_1, vec_2))
q_1 = λ_2/(λ_1 + λ_2)
if (rand(Float64) < q_1)
λ = λ_1
else
λ = -λ_2
end
ψ = ψ .+ (λ .* u_B)
which_r_relevant = findall(map(in(Δ_indices), r))
if (length(which_r_relevant) > 0)
Δ[[Δ_ind_dict[i] for i in r[which_r_relevant]],
[Δ_ind_dict[i] for i in r[which_r_relevant]]] += (λ_1 * λ_2) .* (u_B[which_r_relevant] * transpose(u_B[which_r_relevant]))
end
# Update π for cases resolved by the update to ψ,
# and then remove those case from ψ and B,
# replacing them with cases not yet handled
for i in 1:(J+1)
if (ψ[i] < ϵ) | (ψ[i] > (1 - ϵ))
π[r[i]] = ψ[i]
if (k <= N)
r[i] = k
ψ[i] = π[k]
B[:,i] = (π_0[k].^(-1)) .* X[k,:]
end
k += 1
end
end
end
π[r[1:(J+1)]] = ψ[1:(J+1)]
π[π .< ϵ] .= 0
π[π .> (1-ϵ)] .= 1
result[.!already_resolved] = π
if (length(Δ_indices) > 0)
Δ_result[Δ_result_indices, Δ_result_indices] = Δ
end
return result, Δ_result
end
# Define function to conduct the landing phase by dropping balancing variables
# and repeating the fast flight phase with the reduced set of balancing variables
function cube_land(π::Vector{Float64}, π_0::Vector{Float64}, X::Matrix{Float64}; ϵ = 1e-11, Δ_indices = Vector(1:length(π)))
# Get dimensions of problem
N = length(π)
J = size(X, 2)
# Initialize variables
t = 0
π_t = copy(π)
remaining_indices = findall(.!is_finalized(π))
n_remaining = length(remaining_indices)
n_dropped_variables = 0
if (length(Δ_indices) > 0)
Δ = zeros(length(Δ_indices), length(Δ_indices))
else
Δ = Matrix{Float64}(undef, 0, 0)
end
while ((n_remaining > 0))
# Drop variables as needed
n_dropped_variables += 1
J -= 1
Δ_entries_to_update = findall(map(in(intersect(Δ_indices,remaining_indices)), Δ_indices))
π_t[remaining_indices], Δ_t = cube_fast_flight(
π_t[remaining_indices],
X[remaining_indices,1:J];
π_0 = π_0[remaining_indices],
Δ_indices = findall(map(in(Δ_indices), remaining_indices))
)
if (length(Δ_entries_to_update) > 0)
Δ[Δ_entries_to_update, Δ_entries_to_update] += Δ_t
end
remaining_indices = findall((π_t .> ϵ) .& (π_t .< (1-ϵ)))
n_remaining = length(remaining_indices)
end
a = π_t
a[a .> (1-ϵ)] .= 1
a[a .< ϵ] .= 0
result = Int64.(round.(a; digits = 0))
return result, Δ
end
# Combine the flight and landing phases, conduct simulations if requested
function fast_cube(π::Vector{Float64}, X::Matrix{Float64}; seed::Union{Int64, Nothing} = nothing, sim_method = "Breidt-Chauvet", n_sims = 1e+5, ϵ = 1e-11)
if (!isnothing(seed))
Random.seed!(seed)
end
π_f, _ = cube_fast_flight(π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)
a , _ = cube_land(π_f, π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)
if (n_sims > 0)
Δ = zeros(sum(a), sum(a))
Δ_indices = findall(a .== 1)
else
Δ = zeros(0, 0)
end
if ((n_sims > 0) & (sim_method == "Monte Carlo"))
for sim in 1:n_sims
π_f , _ = cube_fast_flight(π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)
a_sim , _ = cube_land(π_f, π, X; Δ_indices = Vector{Int64}(undef, 0), ϵ = ϵ)
Δ += (a_sim[Δ_indices] - π[Δ_indices]) * transpose(a_sim[Δ_indices] - π[Δ_indices])
end
Δ /= n_sims
end
if ((n_sims > 0) & (sim_method == "Breidt-Chauvet"))
chunk_sizes = [ceil(Int, n_sims / nthreads()) for thread in 1:nthreads()]
chunk_index = 1
while (sum(chunk_sizes) > n_sims)
chunk_sizes[chunk_index] -= 1
chunk_index += 1
end
Δ_for_thread = [zeros(length(Δ_indices), length(Δ_indices)) for thread in 1:nthreads()]
@threads for i in 1:nthreads()
for sim in 1:chunk_sizes[i]
π_f, Δ_f = cube_fast_flight(π, X; Δ_indices = Δ_indices, ϵ = ϵ)
_ , Δ_l = cube_land(π_f, π, X; Δ_indices = Δ_indices, ϵ = ϵ)
Δ_for_thread[i] += Δ_f + Δ_l
end
end
Δ = reduce(+, Δ_for_thread)
Δ /= n_sims
end
return (indicators = a, Delta = Δ)
end
end
```
The R code below starts a Julia session, and runs the Julia script so that the functions are available in R.
library(JuliaConnectoR)
# Let Julia use high-level parallelization through multithreading
Sys.setenv(JULIA_NUM_THREADS = 6)
# Let Julia use low-level multithreading for linear algebra operations
juliaEval("using LinearAlgebra")
juliaEval("BLAS.set_num_threads(4)")
# Set a random seed for the Julia session
juliaEval("import Random")
juliaEval("Random.seed!(2024)")
<Julia object of type Random.TaskLocalRNG>
Random.TaskLocalRNG()
# Run the Julia script
julia_script <- file.path(
  "cube-method-with-probabilities.jl"
) |> normalizePath()
juliaCall("include", julia_script)
$name
[1] "fast_cube_with_probs"
attr(,"JLTYPE")
[1] "Module"
# Import the module of Julia functions so that the functions can be called in R
cube_module <- juliaImport(".fast_cube_with_probs")
Now that the Julia module is available in R, we can run the main function of interest. The code below selects a sample using the cube method, and then uses the Breidt-Chauvet simulation method to estimate joint sampling probabilities for the sampled elements, using a total of 100,000 simulations, which takes about five minutes to run on my laptop.
# Run the Julia function, using inputs from R
sampling_output <- juliaGet(
  cube_module$fast_cube(
    sampling_frame[['SAMP_PROB']],
    balancing_vars,
    n_sims = 100000,
    sim_method = "Breidt-Chauvet"
  )
)

# Get the data for the selected sample
selected_sample <- sampling_output$indicators |>
  as.logical() |> which() |>
  slice(.data = sampling_frame)
Let’s look at the structure of the output. The code below shows us that we have a list object with two elements. The first element, indicators, is a vector of inclusion indicators with one entry for each county on the sampling frame. The second element, Delta, is a 200 × 200 matrix corresponding to the sampled elements.
str(sampling_output)
List of 2
$ indicators: int [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
$ Delta : num [1:200, 1:200] 2.48e-01 -9.14e-05 -1.96e-03 -1.01e-03 0.00 ...
- attr(*, "JLTYPE")= chr "@NamedTuple{indicators::Vector{Int64}, Delta::Matrix{Float64}}"
This matrix, Delta, is an estimate of the covariances of the sampling indicators for each pair of sampled units: that is, of the quantities $\pi_{kl} - \pi_k \pi_l$. We can easily convert those to estimates of the joint probabilities $\pi_{kl}$ by adding the known quantities $\pi_k \pi_l$.
get_joint_probs <- function(pi, Delta, indicators) {
  Delta + tcrossprod(pi[as.logical(indicators)])
}

joint_probs_bc <- get_joint_probs(
  pi = sampling_frame[['SAMP_PROB']],
  Delta = sampling_output$Delta,
  indicators = sampling_output$indicators
)
Let’s look at the first few entries of the matrix.
joint_probs_bc[1:5, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 0.45640569 0.010918199 0.064186377 0.094806829 0.45618667
[2,] 0.01091820 0.024177025 0.003547846 0.005406388 0.02413396
[3,] 0.06418638 0.003547846 0.145303146 0.014087532 0.14499282
[4,] 0.09480683 0.005406388 0.014087532 0.210428247 0.21004013
[5,] 0.45618667 0.024133960 0.144992823 0.210040131 1.00000000
As we’d expect, the diagonal entries (which are estimates of the sampling probabilities of the individual elements, since $\pi_{kk} = \pi_k$) closely match the known sampling probabilities.
# Compare estimated to actual probabilities
joint_probs_bc |> diag() |> head(5)
[1] 0.45640569 0.02417702 0.14530315 0.21042825 1.00000000
sampling_frame[['SAMP_PROB']][as.logical(sampling_output$indicators)] |>
  head(5)
[1] 0.45618667 0.02413396 0.14499282 0.21004013 1.00000000
Fay’s Generalized Replication Method
Now that we have the estimated joint probabilities, we can create replicate weights using Fay’s generalized replication method. This approach creates replicates based around a target variance estimator’s quadratic form. As our target variance estimator, we’ll use the Horvitz-Thompson variance estimator based on the estimated joint probabilities.
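Roughly speaking (my summary, not the post’s own notation), the target variance estimator is written as a quadratic form in the weighted observations, and the replicate factors are constructed so that the usual replication formula reproduces that quadratic form for linear statistics:

$$\widehat{V}\!\left(\hat{Y}\right) \;=\; \breve{y}^{\top} \Sigma\, \breve{y}, \qquad \breve{y}_k = \frac{y_k}{\pi_k}, \qquad \widehat{V}_{\text{rep}}\!\left(\hat{Y}\right) \;=\; c \sum_{r=1}^{R} c_r \left(\hat{Y}_{(r)} - \hat{Y}\right)^2,$$

where $\Sigma$ is the quadratic form matrix (here, the Horvitz-Thompson one built from the estimated joint probabilities) and the constants $c$ and $c_r$ correspond to the 'scale' and 'rscales' attributes used further below.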
The R code below creates the replication factors.
library(svrep)
quad_form_matrix <- make_quad_form_matrix(
  variance_estimator = "Horvitz-Thompson",
  joint_probs = joint_probs_bc
)

replicate_factors <- make_fays_gen_rep_factors(
  quad_form_matrix,
  max_replicates = 200
)
And the R code below prepares our data for variance estimation using the replicate factors.
library(survey)
ht_rep_design <- svrepdesign(
  data = selected_sample,
  weights = selected_sample$SAMP_PROB^(-1),
  repweights = replicate_factors,
  type = 'other',
  combined = FALSE,
  scale = attr(replicate_factors, 'scale'),
  rscales = attr(replicate_factors, 'rscales'),
  mse = TRUE
)
Let’s look at a couple example estimates. Below, we have the average county’s share of persons who are at the lowest level of literacy.
ht_rep_design |> svymean(x = ~ LIT_P1)
mean SE
LIT_P1 0.22208 0.0131
That’s pretty close to the actual population value.
sampling_frame$LIT_P1 |> mean()
[1] 0.2171292
Comparing to An Alternative Variance Estimator
An alternative is the Deville-Tillé (2005) variance estimator developed for balanced sampling designs. We can create replicate weights based on this variance estimator using the ‘svrep’ package, again using Fay’s generalized replication method.
dt_rep_design <- selected_sample |>
  svydesign(data = _, ids = ~ 1, probs = ~ SAMP_PROB) |>
  as_fays_gen_rep_design(
    variance_estimator = "Deville-Tille",
    aux_var_names = colnames(balancing_vars)
  )
Now let’s compare variance estimates.
ht_rep_design |> svymean(x = ~ LIT_A)
mean SE
LIT_A 259.85 1.9415
dt_rep_design |> svymean(x = ~ LIT_A)
mean SE
LIT_A 259.85 1.1423
ht_rep_design |> svytotal(x = ~ LIT_P2)
total SE
LIT_P2 1054.8 113.72
dt_rep_design |> svytotal(x = ~ LIT_P2)
total SE
LIT_P2 1054.8 29.694
It’s not surprising that the Deville-Tillé estimated standard errors are smaller. Part of the problem with that estimator is that it tends to be downwardly biased, since it ignores a systematic part of the sampling error: it behaves as if the cube method always resulted in perfectly balanced samples, but that’s often far from true.
Below, we compare the sample estimates of the population totals to the actual population totals. The discrepancies are pretty big for several of the balancing variables.
# Get vector of population totals
pop_totals <- balancing_vars |> colSums()

# Get the sample-based estimates of population totals
estimated_totals <- svytotal(
  x = reformulate(colnames(balancing_vars)),
  design = ht_rep_design
) |> coef()

# Calculate relative differences
(estimated_totals - pop_totals) / pop_totals
SAMP_PROB POVERTY_100 BLACK HISPANIC NO_INSURANCE OCC_SERVICE
0.00000000 -0.03974905 0.24282185 -0.25767936 -0.06059120 -0.11482742
Fortunately, the imbalance can be fixed up by calibrating on the balancing variables so that their estimates actually do have a variance of zero.
# Calibrate the samples to the population totals
ht_rep_design_calibrated <- ht_rep_design |>
  calibrate(formula = reformulate(colnames(balancing_vars),
                                  intercept = FALSE),
            population = pop_totals,
            compress = FALSE)

dt_rep_design_calibrated <- dt_rep_design |>
  calibrate(formula = reformulate(colnames(balancing_vars),
                                  intercept = FALSE),
            population = pop_totals,
            compress = FALSE)
Here are the variance estimates for a couple key outcome variables after calibration. The two variance estimators are now very closely aligned.
ht_rep_design_calibrated |> svymean(x = ~ LIT_A)
mean SE
LIT_A 261.83 1.4581
dt_rep_design_calibrated |> svymean(x = ~ LIT_A)
mean SE
LIT_A 261.83 1.3796
ht_rep_design_calibrated |> svytotal(x = ~ LIT_P2)
total SE
LIT_P2 1175.4 57.096
dt_rep_design_calibrated |> svytotal(x = ~ LIT_P2)
total SE
LIT_P2 1175.4 39.695
Still, it’s a little concerning that one of these variance estimators would perform so badly if we didn’t calibrate the sample. The variance estimator seemed to have a large downward bias, which would lead us to be overconfident in our inferences. As we can see below, the point estimate was drastically changed by the calibration, but the variance estimate hardly changed at all, which is a warning sign that the variance estimator before calibration was badly biased.
dt_rep_design |> svymean(x = ~ LIT_A)
mean SE
LIT_A 259.85 1.1423
dt_rep_design_calibrated |> svymean(x = ~ LIT_A)
mean SE
LIT_A 261.83 1.3796
One advantage we might hope to see from the simulation-based approach is that we can make the bias of our variance estimator as small as we want: the Horvitz-Thompson estimator is unbiased, so any bias in our variance estimator comes from the estimation of the joint probabilities that go into it, and we can reduce that bias simply by throwing more simulations at the problem.
The Problem with Simulating Joint Probabilities
There’s one big catch however. With the Horvitz-Thompson estimator and other estimators that rely on knowing (or estimating) joint inclusion probabilities, the variance estimator will be undefined if the sampling design leads to joint inclusion probabilities of zero. And we can’t know whether the sampling design produces zero joint inclusion probabilities just from our selected sample: it’s a truism that selected samples will only ever contain pairs of units that have positive joint inclusion probabilities. On top of that, for many applications, simulating joint probabilities for the entire population is computationally infeasible.
So that’s a notable advantage in favor of other variance estimators that don’t rely on joint probabilities. The Deville-Tillé estimator works because it relies on a reasonable approximating model for the sampling variance, and that model can hold even when the sampling design gives some pairs a joint inclusion probability of zero. But the problem with the Deville-Tillé variance estimator is that it relies on a frequently unreasonable assumption about the quality of balancing achieved by the cube method. So whether we simulate joint probabilities or use the Deville-Tillé estimator, we have to rely on an assumption that may be flawed and may lead to bad inferences.
Using Simulation to Augment the Deville-Tillé Estimator
Fortunately, the Deville-Tillé estimator’s assumption of perfect balancing can be assessed using simulation. Instead of assuming that the cube method leads to perfect or near-perfect balancing on the auxiliary variables, we can use simulation to measure how much the weighted totals of the balancing variables actually vary from sample to sample.
The R code below runs 10,000 simulated cube samples and uses them to estimate the variance-covariance matrix of the estimated balancing-variable totals (and, for later reference, the true sampling variance of the estimated total of LIT_A).
# Initialize variance-covariance matrix for balancing variables
pop_totals <- balancing_vars |> colSums()
Y <- sampling_frame$LIT_A |> sum()
X_vcov <- matrix(0, nrow = length(pop_totals), ncol = length(pop_totals))
Yhat_vcov <- 0

# Initialize simulation counter
n_sim_samples <- 10000
sim_index <- 1

# Run simulations, updating estimate of X_vcov each time
while (sim_index <= n_sim_samples) {
  # Select a sample using the cube method
  sim_sample <- juliaGet(
    cube_module$fast_cube(
      sampling_frame[['SAMP_PROB']],
      balancing_vars,
      n_sims = 0
    )
  ) |>
    getElement("indicators") |>
    as.logical() |> which()

  # Get weighted totals of balancing variables
  X_hat <- colSums(
    balancing_vars[sim_sample,] / sampling_frame[['SAMP_PROB']][sim_sample]
  )

  # Get weighted total for outcome variable
  Yhat <- sum(
    sampling_frame$LIT_A[sim_sample] / sampling_frame[['SAMP_PROB']][sim_sample]
  )

  # Update variance estimates
  X_vcov <- X_vcov + (n_sim_samples - 1)^(-1) * tcrossprod(X_hat - pop_totals)
  Yhat_vcov <- Yhat_vcov + (n_sim_samples - 1)^(-1) * (Yhat - Y)^2

  sim_index <- sim_index + 1L
}
Now let’s look at the coefficient of variation of the estimated total for each of the balancing variables.
(X_vcov |> diag() |> sqrt()) / pop_totals
SAMP_PROB POVERTY_100 BLACK HISPANIC NO_INSURANCE OCC_SERVICE
0.0000000 0.1174322 0.1705357 0.3130446 0.2030957 0.1282019
So the assumption of perfect balancing isn’t right at all.
To assess what kind of impact that might have on the variance estimates, let’s consider the theoretical variance of the Horvitz-Thompson estimator of a population total. If we describe the outcome with a working regression model on the balancing variables, the variance of the estimated total splits into a component driven by the working-model residuals and a component driven by the variance of the estimated totals of the balancing variables.
The Deville-Tillé estimator assumes that the variance of the estimated balancing-variable totals is zero (perfect balance), so it only captures the residual component.
We can estimate the second component using a quadratic form: the estimated regression coefficients sandwiched around the simulation-based variance-covariance matrix X_vcov from above.
The working model below is fit with intercept = FALSE, because one of the balancing variables is simply the sampling probabilities, and so it acts as an intercept term for the working model.
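To make that concrete, here is the approximation (my notation) that the augmented estimator computed below relies on, ignoring the covariance between the residual total and the balancing-variable totals:

$$V\!\left(\hat{Y}_{HT}\right) \;\approx\; V\!\left(\hat{E}_{HT}\right) \;+\; \hat{\beta}^{\top}\, V\!\left(\hat{X}_{HT}\right) \hat{\beta},$$

where $\hat{E}_{HT}$ and $\hat{X}_{HT}$ are the Horvitz-Thompson totals of the working-model residuals and of the balancing variables; the first term is estimated by the Deville-Tillé estimator and the second by the quadratic form in X_vcov.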
working_model <- lm(
  data = selected_sample,
  formula = reformulate(
    response = "LIT_A",
    termlabels = colnames(balancing_vars),
    intercept = FALSE
  ),
  weights = selected_sample$SAMP_PROB^(-1)
)

beta_hat <- coef(working_model)
This gives us the following estimate for the component $\hat{\beta}^{\top}\, V\!\left(\hat{X}_{HT}\right) \hat{\beta}$.
b_X_b_var_est <- t(beta_hat) %*% X_vcov %*% beta_hat
And we can add this to the estimate of the first component, which is simply the value from the Deville-Tillé estimator.
augmented_dt_std_error <- sqrt(
  as.numeric(vcov(svytotal(x = ~ LIT_A, dt_rep_design))) +
    b_X_b_var_est
)

print(augmented_dt_std_error)
[,1]
[1,] 110029.1
Let’s compare that to the variance estimate we got from using a Horvitz-Thompson estimator with estimated joint probabilities.
ht_rep_design |> svytotal(x = ~ LIT_A)
total SE
LIT_A 730247 76125
Interesting. The Horvitz-Thompson type variance estimate is a lot larger.
And now let’s see how all of these variance estimates compare to the true sampling variance that we estimated using 10,000 simulations.
std_error_estimates <- data.frame(
  'True' = Yhat_vcov |> sqrt(),
  'HT - Joint Probabilities' = svytotal(x = ~ LIT_A, ht_rep_design) |> SE(),
  'Deville-Tillé' = svytotal(x = ~ LIT_A, dt_rep_design) |> SE(),
  'Augmented Deville-Tillé' = augmented_dt_std_error,
  check.names = FALSE
)

print(std_error_estimates)
True HT - Joint Probabilities Deville-Tillé Augmented Deville-Tillé
1 107591.2 76124.93 21053.16 110029.1
For simplicity, let’s scale these errors relative to the true standard error.
std_error_estimates / std_error_estimates[['True']]
True HT - Joint Probabilities Deville-Tillé Augmented Deville-Tillé
1 1 0.7075385 0.1956773 1.022658
The Deville-Tillé estimate was much too small. But the alternative simulation-based variance estimators yielded much better variance estimates.
The “augmented” Deville-Tillé estimator considered here might be able to combine the relative strengths of the simulation-based approach with the analytic variance estimator. On the one hand, we can use simulation to weaken the assumption of perfect balance required by the Deville-Tillé variance estimator, and thereby reduce its bias. On the other hand, by not relying on simulation to estimate joint probabilities for a Horvitz-Thompson type variance estimator, we don’t have to rely on the assumption that our sampling design has entirely positive joint probabilities, which is not guaranteed with the cube method.
And like the Deville-Tillé estimator, we can easily form replicate weights using Fay’s generalized replication method. I’m going to do some more work building out this idea and try to poke some holes in it, but I think it might have some promise.