# Load R packages for survey analysis
library(survey)
library(svrep)
library(srvyr)
library(dplyr)
# Load an example dataset
data('lou_vax_survey', package = 'svrep')
# Create a JK1 survey design
lou_vax_survey <- as_survey_design(
  .data = lou_vax_survey,
  weights = SAMPLING_WEIGHT,
  ids = 1
) |> as_survey_rep(type = "JK1")
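A common form of nonresponse adjustment is to simply ‘redistribute’ weight from the nonrespondents to the respondents. For example, if the sum of weights among respondents is $W_R$ and the sum of weights among nonrespondents is $W_{NR}$, then each respondent’s weight is multiplied by $(W_R + W_{NR}) / W_R$ and each nonrespondent’s weight is set to zero, so that the respondents’ adjusted weights sum to the same total as the original weights for the full sample.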
This type of nonresponse adjustment is widespread in official statistics, and is usually motivated by discussions about survey response propensities. In this blog post, I show how this type of adjustment can be viewed as a form of regression-based imputation. The first part explains the mathematical equivalence, and the second part illustrates it with an example in R. At the end, I briefly discuss some of the implications of this insight.
Check out Rod Little’s 1984 paper “Survey Nonresponse Adjustments” for a more in-depth discussion of this idea.
The math
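This type of adjustment is described in mathematical notation below. Let $s$ denote the full sample, made up of the respondents $s_R$ and the nonrespondents $s_{NR}$. Let $w_i$ denote the base weight for case $i$ and $y_i$ the value of an outcome variable. The adjustment factor $f$ and the adjusted weights $w_i^{*}$ are:

$$f = \frac{\sum_{i \in s} w_i}{\sum_{i \in s_R} w_i}, \qquad w_i^{*} = \begin{cases} f \, w_i & \text{if } i \in s_R \\ 0 & \text{if } i \in s_{NR} \end{cases}$$

After adjusting the weights, we typically subset the data to only include respondents, and we use the adjusted weights to estimate a population total:

$$\hat{Y} = \sum_{i \in s_R} w_i^{*} y_i$$

We can rewrite this in terms of the adjustment factor, as follows:

$$\hat{Y} = f \sum_{i \in s_R} w_i y_i = \left(1 + \frac{\sum_{i \in s_{NR}} w_i}{\sum_{i \in s_R} w_i}\right) \sum_{i \in s_R} w_i y_i$$

And with a little more algebra:

$$\hat{Y} = \sum_{i \in s_R} w_i y_i + \sum_{i \in s_{NR}} w_i \bar{y}_R, \qquad \text{where } \bar{y}_R = \frac{\sum_{i \in s_R} w_i y_i}{\sum_{i \in s_R} w_i}$$

By rewriting it this way, we can see that our estimated population total is a base-weighted sum over the full sample, where for respondents we use the observed values, $y_i$, and for nonrespondents we use the base-weighted respondent mean, $\bar{y}_R$.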
In other words, our weight adjustment process is effectively mean-imputation: for nonrespondents, we impute the missing values using the weighted respondent mean.
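As a quick numeric check of this equivalence, here is a minimal sketch in base R. The six-case dataset `toy` and all of its values are hypothetical, made up purely for illustration; they are not taken from the survey used later in this post.

```r
# Hypothetical toy sample: six cases, two of which did not respond
toy <- data.frame(
  w         = c(10, 20, 30, 15, 25, 40),               # base weights
  responded = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  y         = c(1, 0, 1, 1, NA, NA)                    # outcome, missing for nonrespondents
)

# Approach 1: redistribute the nonrespondents' weight to the respondents
adj_factor <- sum(toy$w) / sum(toy$w[toy$responded])
w_adj <- ifelse(toy$responded, toy$w * adj_factor, 0)
total_weighting <- sum(w_adj[toy$responded] * toy$y[toy$responded])

# Approach 2: keep the base weights and impute the weighted respondent mean
resp_mean <- weighted.mean(toy$y[toy$responded], toy$w[toy$responded])
y_imp <- ifelse(toy$responded, toy$y, resp_mean)
total_imputation <- sum(toy$w * y_imp)

# The two estimated totals agree (up to floating-point error)
all.equal(total_weighting, total_imputation)
```

The adjusted weights and the imputed values are two ways of writing the same estimate: one moves the nonrespondents' weight onto respondents, the other moves the respondents' mean onto nonrespondents.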
Typically, nonresponse adjustments like this are conducted using weighting classes: the nonresponse adjustment factor is calculated separately for different groups, referred to as nonresponse adjustment cells. When we do this, we’re still using mean-imputation. It’s just that we do mean imputation separately within each nonresponse adjustment cell.
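The same check can be sketched with weighting classes. The example below is a base R illustration under assumed data: the two cells `"A"` and `"B"` and all values are hypothetical. The adjustment factor and the imputed mean are both computed within each cell.

```r
# Hypothetical toy sample with two nonresponse adjustment cells
toy <- data.frame(
  cell      = c("A", "A", "A", "B", "B", "B"),
  w         = c(10, 20, 30, 15, 25, 40),               # base weights
  responded = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  y         = c(1, 0, NA, 1, NA, 0)                    # missing for nonrespondents
)

# Per-cell sums, repeated on each row of the cell
cell_total_w <- ave(toy$w, toy$cell, FUN = sum)
cell_resp_w  <- ave(toy$w * toy$responded, toy$cell, FUN = sum)
cell_resp_wy <- ave(ifelse(toy$responded, toy$w * toy$y, 0), toy$cell, FUN = sum)

# Weighting-class adjustment: the factor is computed separately in each cell
w_adj <- ifelse(toy$responded, toy$w * cell_total_w / cell_resp_w, 0)
total_weighting <- sum(w_adj * ifelse(toy$responded, toy$y, 0))

# Mean imputation by cell: nonrespondents get their cell's weighted respondent mean
y_imp <- ifelse(toy$responded, toy$y, cell_resp_wy / cell_resp_w)
total_imputation <- sum(toy$w * y_imp)

all.equal(total_weighting, total_imputation)
```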
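This mean imputation process can be formulated as a regression model with one indicator (dummy) variable for each nonresponse adjustment cell and no intercept: $y_i = \sum_{c=1}^{C} \beta_c x_{ic} + \epsilon_i$, where $x_{ic}$ equals 1 if case $i$ belongs to cell $c$ and 0 otherwise. Each estimated coefficient $\hat{\beta}_c$ is then the weighted respondent mean in cell $c$.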
To get results equivalent to mean imputation, we fit this model using the base weights and only the data from respondents. The imputations are then generated using the estimated model coefficients.
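To make the regression formulation concrete, here is a small base R sketch. It uses `lm()` with a `weights` argument as a stand-in for a survey-weighted model, on the assumption that only the point estimates matter here. The dataset `toy` and its values are hypothetical.

```r
# Hypothetical toy sample with two nonresponse adjustment cells
toy <- data.frame(
  cell      = c("A", "A", "A", "B", "B", "B"),
  w         = c(10, 20, 30, 15, 25, 40),               # base weights
  responded = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  y         = c(1, 0, NA, 1, NA, 0)                    # missing for nonrespondents
)
respondents <- toy[toy$responded, ]

# Base-weighted regression on cell indicators only, with no intercept
fit <- lm(y ~ -1 + cell, data = respondents, weights = w)

# Base-weighted respondent mean in each cell, computed directly
cell_means <- sapply(
  split(respondents, respondents$cell),
  function(d) weighted.mean(d$y, d$w)
)

# The fitted coefficients are the weighted cell means
all.equal(unname(coef(fit)), unname(cell_means))

# Impute nonrespondents with model predictions, then estimate the total
# using the base weights and the full sample
y_imp <- ifelse(toy$responded, toy$y, predict(fit, newdata = toy))
total_regression <- sum(toy$w * y_imp)
```

Because the only predictors are cell indicators, the weighted least squares fit has nothing to estimate except one mean per cell, which is why the predictions match cell-level mean imputation.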
Illustration in R
In the sections below, we’ll use R to illustrate how we get equivalent results from the three approaches:
1. Nonresponse weighting adjustment
2. Mean imputation (by group)
3. Imputation using regression
Nonresponse weighting adjustment
In R, this weighting class adjustment can easily be done using the ‘svrep’ package.
The setup code at the top of this post loads a dataset from a survey measuring vaccination status in Louisville, KY, and creates jackknife replicate weights.
We can carry out a simple nonresponse adjustment by using the redistribute_weights() function from the ‘svrep’ package.
# Implement a nonresponse adjustment
nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent"
  )
Normally, though, we want to use nonresponse adjustment cells. In the example below, we create four cells by dividing the sample into four groups based on estimated propensity to respond to the survey.
# Fit a response propensity model
response_propensity_model <- lou_vax_survey |>
  mutate(IS_RESPONDENT = ifelse(RESPONSE_STATUS == "Respondent", 1, 0)) |>
  svyglm(formula = IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
         family = quasibinomial(link = 'logit'))
# Predict response propensities for individual cases
lou_vax_survey <- lou_vax_survey |>
  mutate(
    RESPONSE_PROPENSITY = predict(response_propensity_model,
                                  newdata = cur_svy(),
                                  type = "response")
  )
# Divide sample into propensity classes
lou_vax_survey <- lou_vax_survey |>
  mutate(NR_ADJ_CELL = ntile(x = RESPONSE_PROPENSITY, n = 4)) |>
  mutate(NR_ADJ_CELL = factor(NR_ADJ_CELL))
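To see what `ntile()` does when it forms the propensity classes, here is a tiny standalone sketch. The eight propensity values are hypothetical, chosen only so that the groups are easy to read off.

```r
library(dplyr)

# Hypothetical predicted response propensities for eight sampled cases
propensity <- c(0.9, 0.2, 0.5, 0.7, 0.1, 0.3, 0.8, 0.6)

# ntile() ranks the values and splits the ranks into n groups of
# (as near as possible) equal size; group 1 holds the lowest propensities
cells <- ntile(x = propensity, n = 4)

# Each of the four cells contains two cases here
table(cells)
```

Note that `ntile()` splits on counts of cases, not on sums of weights, so the cells contain equal numbers of sampled cases rather than equal shares of the weighted population.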
With the nonresponse adjustment cells thus defined, we can conduct the nonresponse adjustments.
# Conduct weight adjustment
nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent",
    by = "NR_ADJ_CELL"
  )
After adjusting the weights, we can subset our data to respondents and produce estimates that are hopefully similar to the estimates we would get if we had obtained a 100% response rate.
nr_adjusted_survey |>
  # Subset data to only include respondents
  filter(RESPONSE_STATUS == "Respondent") |>
  # Create a binary 0/1 variable for vaccination status
  mutate(VACCINATED = ifelse(VAX_STATUS == "Vaccinated", 1, 0)) |>
  # Estimate the total and percent of the population vaccinated
  summarize(
    TOTAL_VACCINATED = survey_total(VACCINATED, vartype = NULL),
    PERCENT_VACCINATED = survey_mean(VACCINATED, vartype = NULL)
  )
# A tibble: 1 × 2
TOTAL_VACCINATED PERCENT_VACCINATED
<dbl> <dbl>
1 316259. 0.530
Mean imputation
We can reproduce these estimates exactly using (weighted) mean imputation in each nonresponse adjustment cell.
Here’s a quick illustration of this in R. First, we calculate weighted means in each nonresponse adjustment cell.
lou_vax_survey <- lou_vax_survey |>
  # Create dummy variable for vaccination status
  mutate(VACCINATED = case_when(
    VAX_STATUS == "Vaccinated" ~ 1,
    VAX_STATUS == "Unvaccinated" ~ 0,
    RESPONSE_STATUS == "Nonrespondent" ~ NA_real_
  )) |>
  # Calculate base-weighted mean in each cell
  group_by(NR_ADJ_CELL) |>
  mutate(
    CELL_MEAN = weighted.mean(x = VACCINATED, w = SAMPLING_WEIGHT, na.rm = TRUE)
  ) |>
  ungroup() |>
  # Impute missing values for nonrespondents
  mutate(
    VACCINATED_IMP = case_when(
      is.na(VACCINATED) ~ CELL_MEAN,
      !is.na(VACCINATED) ~ VACCINATED
    )
  )
Next, we use the base weights to estimate the total and the mean using the full sample. Below, we can see that the estimates we get from imputation exactly match the estimates we got earlier from using the nonresponse-adjusted weights.
lou_vax_survey |>
  summarize(
    TOTAL_VACCINATED = survey_total(VACCINATED_IMP, vartype = NULL),
    PERCENT_VACCINATED = survey_mean(VACCINATED_IMP, vartype = NULL)
  )
# A tibble: 1 × 2
TOTAL_VACCINATED PERCENT_VACCINATED
<dbl> <dbl>
1 316259. 0.530
Imputation using regression
As noted earlier, this imputation can be expressed as a regression model, where the predictor variables are simply dummy variables for each propensity cell.
To return to our example in R, we first fit the imputation model using the data from respondents.
# Fit the imputation model
imputation_model <- lou_vax_survey |>
  filter(RESPONSE_STATUS == "Respondent") |>
  svyglm(formula = I(VAX_STATUS == "Vaccinated") ~ -1 + NR_ADJ_CELL,
         family = stats::gaussian())
Next, we use the model to make predictions for each case.
# Make imputations using the model
lou_vax_survey <- lou_vax_survey |>
  mutate(
    PREDICTION = predict(
      object = imputation_model,
      newdata = cur_svy()
    ) |> as.numeric()
  )
Then we use the predictions to impute missing values for nonrespondents.
# Impute missing values for nonrespondents
lou_vax_survey <- lou_vax_survey |>
  mutate(
    VACCINATED_IMP = case_when(
      RESPONSE_STATUS == "Respondent" ~ ifelse(VAX_STATUS == "Vaccinated", 1, 0),
      RESPONSE_STATUS == "Nonrespondent" ~ PREDICTION
    )
  )
And finally, we use the imputed variable to estimate the total and percent vaccinated using the entire sample (respondents and nonrespondents).
# Produce estimates using imputed data
lou_vax_survey |>
  summarize(
    TOTAL_VACCINATED = survey_total(VACCINATED_IMP, vartype = NULL),
    PERCENT_VACCINATED = survey_mean(VACCINATED_IMP, vartype = NULL)
  )
# A tibble: 1 × 2
TOTAL_VACCINATED PERCENT_VACCINATED
<dbl> <dbl>
1 316259. 0.530
Implications
Viewing the nonresponse adjustments in this way helps us better understand what this method is actually doing. It’s simply imputation using the same type of model and the same set of predictor variables for every outcome variable.
Typically, weighting class adjustments are discussed in terms of response propensities rather than outcome variables. In a typical application, the nonresponse adjustment cells are defined based on response propensities for the survey. The statistician develops some model predicting each sample person’s propensity to respond to the survey, and then nonresponse adjustment cells are defined by dividing the sample into groups with different response propensities. By discussing adjustments in this way, we don’t focus specifically on any single survey outcome variable (e.g., income). But ultimately, our weighting adjustments translate into an imputation model for each and every outcome variable.
An interesting idea raised by this insight is that perhaps it would be helpful to analyze our nonresponse adjustment process using prediction and imputation diagnostics. In his 1984 paper, Rod Little suggested using an overall