Suppose we select a stratified simple random sample, with $H$ strata and $n_h$ sampling units in stratum $h$, for a total of $n = \sum_{h=1}^{H} n_h$ sampling units. Then the usual Horvitz-Thompson variance estimator has $n - H$ degrees of freedom, and we thus need at least $n - H$ replicates in order to obtain a fully efficient replicate estimator of the sampling variance. As a concrete example, suppose we have two strata ($H = 2$) with $n_h = 6$ for both strata. Then the Horvitz-Thompson variance estimator has $n - H = 10$ degrees of freedom, and to get a fully efficient variance estimator, we would need $n = 12$ replicates with the JKn jackknife method (i.e., the stratified, delete-one jackknife), or $10$ replicates with a more sophisticated method such as Fay's generalized replication method.
For illustration, here’s what it would look like to create the 12 JKn replicates for the two strata.
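Here is one way to do it, sketched with the 'survey' package (the object names are illustrative, and we assume unit base weights for simplicity):

```r
library(survey)

# Two strata with 6 PSUs each, and base weights of 1 for simplicity
sample_data <- data.frame(
  STRATUM = rep(1:2, each = 6),
  PSU     = 1:12
)

# The stratified delete-one jackknife (JKn) creates one replicate per PSU,
# so this design yields 12 replicates
full_jkn_design <- svydesign(
  data = sample_data, ids = ~ PSU, strata = ~ STRATUM, weights = ~ 1
) |>
  as.svrepdesign(type = "JKn", compress = FALSE)

dim(full_jkn_design$repweights) # 12 rows (PSUs), 12 columns (replicates)
```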
In practice, we might have computational constraints and can't simply make all of the replicates that we would need to get a fully efficient estimator. In the two-stratum example above, say, we might only be able to make $6$ replicates, even though we need at least $10$ to get a fully efficient estimator. So in practice, it's common to simply create replicate weights separately in each stratum and then stack them together. For example:
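Here's a minimal sketch of that stacking idea, building each stratum's delete-one jackknife adjustment factors by hand (the helper function `jkn_factors()` is ours, written for illustration, not from any package):

```r
# Delete-one jackknife adjustment factors for one stratum of n_h PSUs:
# column r deletes unit r (factor 0) and reweights its stratum-mates
# by n_h / (n_h - 1)
jkn_factors <- function(n_h) {
  m <- matrix(n_h / (n_h - 1), nrow = n_h, ncol = n_h)
  diag(m) <- 0
  m
}

# 6 replicates per stratum, stacked row-wise so that the design's
# 12 PSUs share just 6 columns of replicate weights
stacked_rep_weights <- rbind(jkn_factors(6), jkn_factors(6))
rownames(stacked_rep_weights) <- rep(c("Stratum 1", "Stratum 2"), each = 6)
dim(stacked_rep_weights) # 12 rows (PSUs), 6 columns (replicates)
```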
It turns out that there are good ways to stack replicates and there are also some pretty bad ways to stack replicates, depending on the replication method and the way the sample was selected. The example above would be fine for a stratified simple random sample, but it could be really bad for a systematic sample. The safest thing to do in general is to randomize the order of each stratum’s replicates before combining replicates across strata.
But why though? What separates a good method of stacking from a bad method of stacking?
Our Goal with Replication
One way of thinking about replication methods is that they are supposed to yield variance estimates for linear statistics (e.g., totals, or means in certain cases) that match some target variance estimator, such as the Horvitz-Thompson variance estimator. The easiest way to see this is to write down our variance estimator as a quadratic form, $v(\hat{T}) = \breve{y}^{\top} \boldsymbol{\Sigma} \breve{y}$, where $\breve{y}$ is the vector of weighted values $\breve{y}_i = w_i y_i$, for $i = 1, \dots, n$. For a stratified simple random sample with two strata and $n_h = 3$ units in each stratum, we can see below what the quadratic form matrix $\boldsymbol{\Sigma}$ looks like.
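As a sketch (assuming with-replacement sampling within strata, so no finite population correction), the matrix can be constructed directly in R:

```r
# Quadratic form matrix of the stratified variance estimator for a total:
# block-diagonal, with the block for a stratum of n_h units equal to
# (n_h / (n_h - 1)) * (I - J / n_h), where J is the all-ones matrix
stratum_block <- function(n_h) {
  (n_h / (n_h - 1)) * (diag(n_h) - matrix(1 / n_h, n_h, n_h))
}

# Two strata with 3 units each: 1 on the diagonal, -0.5 between units
# in the same stratum, and 0 between units in different strata
Sigma <- kronecker(diag(2), stratum_block(3))
dimnames(Sigma) <- list(
  rep(c("Stratum 1", "Stratum 2"), each = 3),
  rep(c("Stratum 1", "Stratum 2"), each = 3)
)
round(Sigma, 2)
```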
An important observation is that if units $i$ and $j$ are in different strata, then entry $(i, j)$ in the quadratic form matrix equals zero (that is, $\boldsymbol{\Sigma}_{ij} = 0$).
With replication, we create $R$ sets of replicate weights, where the $r$-th set of replicate weights is a vector of length $n$ denoted $\mathbf{w}^{(r)}$, whose $i$-th value is denoted $w_i^{(r)}$. This yields $R$ replicate estimates of the population total, $\hat{T}_r = \sum_{i=1}^{n} w_i^{(r)} y_i$, for $r = 1, \dots, R$, which can be used to estimate sampling variance.
The standard formula is:

$$
v_{R}\left(\hat{T}\right) = A \sum_{r=1}^{R} c_r \left(\hat{T}_r - \hat{T}\right)^2
$$

where $A$ and $c_r$ are constants that depend on the replication method.
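As a quick numeric illustration (a sketch with made-up data and unit base weights), the stratified delete-one jackknife's constants, $A = 1$ and $c_r = (n_h - 1)/n_h$, make this formula reproduce the usual with-replacement stratified variance estimator exactly:

```r
y      <- c(2, 5, 9, 1, 4, 6)  # made-up data, unit base weights
strata <- rep(1:2, each = 3)   # two strata of n_h = 3 units
T_hat  <- sum(y)               # estimated population total

# JKn replicate estimates: delete one unit, reweight its stratum by 3/2
T_hat_reps <- sapply(1:6, function(r) {
  w <- rep(1, 6)
  w[strata == strata[r]] <- 3 / 2
  w[r] <- 0
  sum(w * y)
})

# Replication formula with A = 1 and c_r = (n_h - 1) / n_h = 2/3
v_rep <- sum((2 / 3) * (T_hat_reps - T_hat)^2)

# Direct stratified variance estimator (with-replacement form)
v_direct <- sum(tapply(y, strata, function(y_h) {
  (3 / 2) * sum((y_h - mean(y_h))^2)
}))

all.equal(v_rep, v_direct) # TRUE
```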
Like the target variance estimator, the replication variance estimator can also be written as a quadratic form:

$$
v_{R}\left(\hat{T}\right) = \breve{y}^{\top} \boldsymbol{\Sigma}_R \breve{y}
$$

where

$$
\boldsymbol{\Sigma}_R = A \sum_{r=1}^{R} c_r \left(\mathbf{a}_r - \mathbf{1}\right)\left(\mathbf{a}_r - \mathbf{1}\right)^{\top}
$$

and $\mathbf{a}_r$ is the vector of replicate adjustment factors for the $r$-th replicate, whose $i$-th entry is $a_i^{(r)} = w_i^{(r)} / w_i$.
Ideally, we want the matrix $\boldsymbol{\Sigma}_R$ to exactly equal $\boldsymbol{\Sigma}$. That's possible if we use enough replicates. For instance, with the example we just looked at, if we create $R = 6$ replicates with the JKn method, which has $A = 1$ and $c_r = (n_{h_r} - 1)/n_{h_r}$ (where $h_r$ is the stratum whose unit is deleted to form replicate $r$), then we have the following replicate weights:
```r
library(survey) # for jknweights()

rep_weights <- jknweights(
  strata   = rep(1:2, each = 3),
  psu      = 1:6,
  compress = FALSE
) |>
  getElement("repweights")

rownames(rep_weights) <- rep(c("Stratum 1", "Stratum 2"), each = 3)
print(rep_weights)
```
The key thing to note is that the entries of the two matrices, $\boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_R$, exactly match for entries corresponding to sampling units in the same stratum, and any mismatches can only occur for entries corresponding to sampling units in different strata.
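To see this concretely, here's a sketch computing $\boldsymbol{\Sigma}_R$ for naively stacked replicates (three per stratum), using the JKn constant $(n_h - 1)/n_h = 2/3$:

```r
# Delete-one jackknife adjustment factors for a stratum of 3 units:
# column r deletes unit r (factor 0); its stratum-mates get factor 3/2
jkn_factors_3 <- matrix(3 / 2, nrow = 3, ncol = 3)
diag(jkn_factors_3) <- 0

# Naively stack the two strata's factors into 3 shared replicates
stacked <- rbind(jkn_factors_3, jkn_factors_3)

# Sigma_R for the stacked replicates
Sigma_R_stacked <- (2 / 3) * tcrossprod(stacked - 1)
round(Sigma_R_stacked, 2)
# The 3x3 within-stratum blocks match Sigma (1 on the diagonal, -0.5 off it),
# but the between-stratum blocks contain nonzero entries (1 and -0.5)
```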
Important
The downside of stacking replicate weights from different strata is that the between-stratum elements of $\boldsymbol{\Sigma}_R$ are incorrect: they should all be zero, but stacking causes some of them to be nonzero. This adds error to the resulting variance estimates.
Using Randomization to Prevent Bias
Because stacking replicate weights introduces error, we might want to make sure that this error is random rather than systematic, so that our variance estimates don't end up being biased one way or the other. Put another way, if we're going to content ourselves with $\boldsymbol{\Sigma}_R \neq \boldsymbol{\Sigma}$, we would probably feel better if we knew that $E\left(\boldsymbol{\Sigma}_R\right) = \boldsymbol{\Sigma}$, where the expectation is evaluated with respect to a randomization process used to form the replicates.
To make $E\left(\boldsymbol{\Sigma}_R\right) = \boldsymbol{\Sigma}$, we really only need to worry about the elements of $\boldsymbol{\Sigma}_R$ that correspond to pairs of units in different strata. That is, if sampling units $i$ and $j$ are in different strata, then we need to make sure $E\left(\boldsymbol{\Sigma}_{R, ij}\right) = 0$.
The simplest way to do this is just to randomly shuffle the order of the replicates in each stratum, before stacking the replicates.
We can see this using a Monte Carlo simulation. Below, our Monte Carlo estimate of $E\left(\boldsymbol{\Sigma}_R\right)$ shows that the entries corresponding to pairs of units within the same stratum are exactly correct (i.e., match the corresponding entries of $\boldsymbol{\Sigma}$), while the entries corresponding to pairs of units from different strata are correct on average.
```r
# Each stratum's JKn adjustment factors (3 replicates per stratum):
# the deleted unit's factor is 0; its stratum-mates' factors are 3/2
stratum_one_reps <- matrix(3 / 2, nrow = 3, ncol = 3)
diag(stratum_one_reps) <- 0
stratum_two_reps <- stratum_one_reps

expected_Sigma_R_stacked <- replicate(n = 100000, expr = {
  # Stack replicates from each stratum, after shuffling
  stacked_rep_weights <- rbind(
    stratum_one_reps[, sample(1:3, size = 3, replace = FALSE)],
    stratum_two_reps[, sample(1:3, size = 3, replace = FALSE)]
  )
  rownames(stacked_rep_weights) <- rep(c("Stratum 1", "Stratum 2"), each = 3)
  # Calculate Sigma_R
  Sigma_R_stacked <- (2 / 3) * tcrossprod(stacked_rep_weights - 1)
  return(Sigma_R_stacked)
}, simplify = FALSE) |>
  # Average across the Monte Carlo simulations
  Reduce(f = `+`) |>
  magrittr::divide_by(100000) |>
  round(2)

expected_Sigma_R_stacked
```
The error in the between-stratum elements of $\boldsymbol{\Sigma}_R$ is a problem if there is some systematic pattern behind the ordering of sampling units in each stratum. This can happen by accident more often than you might think. For example, maybe your sampling frame was sorted in some way and you accidentally retained that frame ordering in your sample. That can easily happen, like so:
```r
library(dplyr)    # For data manipulation
library(sampling) # For sampling

# Order the sampling frame of libraries from smallest to largest
sampling_frame <- svrep::library_census |>
  arrange(LIBRARIA)

# Select the sample
is_sampled <- sampling::srswor(n = 5, N = nrow(sampling_frame)) |>
  as.logical()

selected_sample <- sampling_frame[is_sampled, ]

# Inspect the sample data
selected_sample |> dplyr::select(LIBNAME, LIBRARIA)
```
# A tibble: 5 × 2
LIBNAME LIBRARIA
<chr> <dbl>
1 GERALDINE E. ANDERSON VILLAGE LIBRARY 0.7
2 SONOMA COUNTY PUBLIC LAW LIBRARY 1
3 GLADSTONE AREA SCHOOL & PUBLIC LIBRARY 1
4 CALAIS FREE LIBRARY 3.18
5 ABINGTON TWP PUBLIC LIBRARY 8.93
Or if you use unequal-probability sampling, your software might roughly order the sample data file by “number of hits”, which would mean that the sampling file is approximately sorted by the sampling weights.
Most of us probably wouldn’t think to double-check that our data file is actually unordered: we’re just happy that we got a random sample with the correct inclusion probabilities and weights, and we probably don’t think much about the order of the data file records. But if we are eventually going to form replicates using a stacking approach, we should be wary of the possibility that the data has some non-random order to it. So generally speaking, it’s a good idea to shuffle replicates before stacking.
Using the ‘svrep’ R Package to Stack Replicates
The ‘svrep’ package includes a few functions that are helpful for this. First, there’s the function shuffle_replicates(), which “shuffles” the order of the replicates in a survey design object.
```r
library(survey)
library(svrep)

# Create a replicate weights design object in R
sample_data <- data.frame(
  STRATUM = rep(1:2, each = 6),
  PSU     = 1:12
)

stratum_one_rep_design <- sample_data |>
  subset(STRATUM == 1) |>
  svydesign(data = _, ids = ~ PSU, weights = ~ 1) |>
  as.svrepdesign(type = "JK1", compress = FALSE)

# Inspect the replicate weights
stratum_one_rep_design$repweights
```
```r
# Shuffle the order of the replicates
stratum_one_shuffled_rep_design <- stratum_one_rep_design |>
  shuffle_replicates()

# Inspect the shuffled replicate weight columns
stratum_one_shuffled_rep_design$repweights
```
Second, there’s the function stack_replicate_designs(), which allows us to stack together two or more replicate designs, combining their columns of replicate weights, their data variables, and other metadata.
```r
set.seed(2023)

# Create replicates for a second stratum
stratum_two_rep_design <- sample_data |>
  subset(STRATUM == 2) |>
  svydesign(data = _, ids = ~ PSU, weights = ~ 1) |>
  as.svrepdesign(type = "JK1", compress = FALSE)

# Combine the replicate designs from both strata,
# after independently shuffling their replicates
stacked_rep_design <- stack_replicate_designs(
  stratum_one_rep_design |> shuffle_replicates(),
  stratum_two_rep_design |> shuffle_replicates()
)
```
This allowed us to create just 6 replicates for a design with 12 PSUs.
```r
# Create a data frame with the weights and specific variables of interest
stacked_rep_design |>
  as_data_frame_with_weights(
    vars_to_keep   = c("STRATUM", "PSU"),
    full_wgt_name  = "BASE_WGT",
    rep_wgt_prefix = "REP_WGT_"
  )
```