Nonresponse weighting class adjustments are model predictions

A short post explaining how nonresponse weighting class adjustments can be interpreted as model predictions. Includes a little bit of math and a little bit of R.

Categories: statistics, surveys, R

Published: December 2, 2022

A common form of nonresponse adjustment is to simply ‘redistribute’ weight from the nonrespondents to the respondents. For example, if the sum of weights among respondents is \(100\) and the sum of weights among nonrespondents is \(200\), then a basic nonresponse adjustment would set the weights among nonrespondents to \(0\) and multiply the weight for each respondent by an adjustment factor equal to \(1 + (200/100)\).
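
Here is a quick numerical sketch of this adjustment in R, using made-up weights rather than data from any real survey, to show that the redistribution preserves the overall sum of weights:

# Toy example: respondents' weights sum to 100, nonrespondents' to 200
resp_weights    <- c(40, 35, 25)
nonresp_weights <- c(120, 80)

# Adjustment factor applied to each respondent: 1 + (200/100) = 3
adj_factor <- 1 + sum(nonresp_weights) / sum(resp_weights)

# Set nonrespondents' weights to zero and scale up respondents' weights
adjusted_resp_weights    <- resp_weights * adj_factor
adjusted_nonresp_weights <- nonresp_weights * 0

# The total weight is unchanged: 300 before and after
sum(adjusted_resp_weights) + sum(adjusted_nonresp_weights)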

This type of nonresponse adjustment is widespread in official statistics, and is usually motivated by discussions about survey response propensities. In this blog post, I show how this type of adjustment can be viewed as a form of regression-based imputation. The first part of this post explains the mathematical equivalence, and the second part of this post uses an example in R to illustrate. At the end, I briefly discuss some of the implications of this insight.

Check out Rod Little’s 1984 paper “Survey Nonresponse Adjustments” for a more in-depth discussion of this idea.

The math

This type of adjustment is described in mathematical notation below.

\[\begin{aligned} w_i &= \textit{Original sampling weight for case } i \\ &= 1/\pi_i, \textit{ where } \pi_i \textit{ is the probability that case } i \textit{ was sampled}\\ f_{NR,i} &= \textit{Nonresponse adjustment factor for case } i \\ w^{*}_{i} &= w_i \times f_{NR,i} = \textit{Weight for case } i \textit{ after nonresponse adjustment} \\ \\ \hat{N}_{R} &= \sum_{i \in s_{resp}} w_i = \textit{Sum of sampling weights among respondents} \\ \hat{N}_{NR} &= \sum_{i \in s_{nonresp}} w_i = \textit{Sum of sampling weights among nonrespondents} \\ \\ f_{NR,i} &= \begin{cases} 0 & \textit{if case }i\textit{ is a nonrespondent} \\ 1 + \frac{\hat{N}_{NR}}{\hat{N}_{R}} & \textit{if case }i\textit{ is a respondent} \end{cases} \end{aligned}\]

After adjusting the weights, we typically subset the data to only include respondents, and we use the adjusted weights \(w^{*}\) for estimating population totals:

\[ \begin{aligned} \hat{Y} &= \sum_{i \in s_{resp}} w^{*}_i \times y_i \end{aligned} \]

We can rewrite this in terms of the adjustment factor, as follows:

\[ \begin{aligned} \hat{Y} &= \sum_{i \in s_{resp}} f_{NR,i} \times w_i \times y_i \\ &= \sum_{i \in s_{resp}} \left(1 + \frac{\hat{N}_{NR}}{\hat{N}_{R}}\right) \times w_i \times y_i \end{aligned} \]

And with a little more algebra:

\[ \begin{aligned} \hat{Y} &= \sum_{i \in s_{resp}} \left(1 + \frac{\hat{N}_{NR}}{\hat{N}_{R}}\right) \times w_i \times y_i \\ &= \left(\sum_{i \in s_{resp}} w_i \times y_i \right) + \frac{\hat{N}_{NR}}{\hat{N}_{R}}\left(\sum_{i \in s_{resp}} w_i \times y_i \right) \\ &= \left(\sum_{i \in s_{resp}} w_i \times y_i \right) + \hat{N}_{NR}\left(\frac{\sum_{i \in s_{resp}} w_i \times y_i}{\sum_{i \in s_{resp}} w_i} \right) \\ &= \left(\sum_{i \in s_{resp}} w_i \times y_i \right) + \hat{N}_{NR} \hat{\bar{Y}}_{R} \\ &= \left(\sum_{i \in s_{resp}} w_i \times y_i \right) + \left(\sum_{i \in s_{nonresp}} w_i \times \hat{\bar{Y}}_{R} \right) \end{aligned} \]

By rewriting it this way, we can see that our estimated population total is a base-weighted sum over the full sample, where for respondents we use the observed values, \(y_i\), and for nonrespondents we replace the unknown \(y_i\) with the weighted respondent mean, \(\hat{\bar{Y}}_{R}\).

In other words, our weight adjustment process is effectively mean-imputation: for nonrespondents, we impute the missing values using the weighted respondent mean.
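
To see this equivalence numerically, here is a small sketch with made-up data: the estimated total from the nonresponse-adjusted weights is identical to the estimated total from imputing the weighted respondent mean for the nonrespondents.

# Toy data: base weights, an outcome observed only for respondents
w <- c(10, 20, 30, 15, 25)
y <- c( 1,  0,  1, NA, NA)
is_resp <- !is.na(y)

# Approach 1: redistribute weight from nonrespondents to respondents
adj_factor <- 1 + sum(w[!is_resp]) / sum(w[is_resp])
total_from_weighting <- sum(adj_factor * w[is_resp] * y[is_resp])

# Approach 2: impute the weighted respondent mean for nonrespondents
y_imputed <- ifelse(is_resp, y, weighted.mean(y[is_resp], w[is_resp]))
total_from_imputation <- sum(w * y_imputed)

# The two totals match
all.equal(total_from_weighting, total_from_imputation)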

Typically, nonresponse adjustments like this are conducted using weighting classes: the nonresponse adjustment factor is calculated separately for different groups, referred to as nonresponse adjustment cells. When we do this, we’re still using mean-imputation. It’s just that we do mean imputation separately within each nonresponse adjustment cell.
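
In notation: if case \(i\) belongs to adjustment cell \(c(i)\), and \(\hat{N}_{R,c}\) and \(\hat{N}_{NR,c}\) denote the sums of base weights among respondents and nonrespondents in cell \(c\), then the adjustment factor becomes:

\[ f_{NR,i} = \begin{cases} 0 & \textit{if case }i\textit{ is a nonrespondent} \\ 1 + \frac{\hat{N}_{NR,c(i)}}{\hat{N}_{R,c(i)}} & \textit{if case }i\textit{ is a respondent} \end{cases} \]

and the same algebra as above shows that each nonrespondent's missing value is effectively replaced by the weighted respondent mean within their cell, \(\hat{\bar{Y}}_{R,c(i)}\).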

This mean imputation process can be formulated as a regression model whose predictors are \(0/1\) dummy variables defined by the adjustment cells, with the \(k\)-th coefficient corresponding to the \(k\)-th nonresponse adjustment cell.

\[ \begin{aligned} Y &= \beta_{1}\times \mathbf{1}(\text{NR Adjustment Cell} = 1) \\ &+ \beta_{2} \times \mathbf{1}(\text{NR Adjustment Cell} = 2) \\ &+ \beta_{3} \times \mathbf{1}(\text{NR Adjustment Cell} = 3) \\ &+ \dots \\ &+ \beta_{k} \times \mathbf{1}(\text{NR Adjustment Cell} = k) \\ &+ \varepsilon \end{aligned} \]

To get results equivalent to mean imputation, we simply fit this model with the base weights, using only the data from respondents. The imputations are then generated from the estimated model coefficients.

\[ \begin{aligned} \hat{y}_i &= \hat{\beta}_{1}\times \mathbf{1}(\text{NR Adjustment Cell} = 1) \\ &+ \hat{\beta}_{2} \times \mathbf{1}(\text{NR Adjustment Cell} = 2) \\ &+ \hat{\beta}_{3} \times \mathbf{1}(\text{NR Adjustment Cell} = 3) \\ &+ \dots \\ &+ \hat{\beta}_{k} \times \mathbf{1}(\text{NR Adjustment Cell} = k) \end{aligned} \]
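
Because this model has no intercept and one dummy variable per cell, the base-weighted least squares solution has a closed form: each coefficient is simply the base-weighted respondent mean within its cell, so the model's predictions reproduce the cell means used for mean imputation.

\[ \hat{\beta}_k = \frac{\sum_{i \in s_{resp},\, c(i) = k} w_i \, y_i}{\sum_{i \in s_{resp},\, c(i) = k} w_i} = \hat{\bar{Y}}_{R,k} \]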

Illustration in R

In the sections below, we’ll use R to illustrate that we get equivalent results from the following three approaches:

  • Nonresponse weighting adjustment

  • Mean imputation (by group)

  • Imputation using regression

Nonresponse weighting adjustment

In R, this weighting class adjustment can easily be done using the ‘svrep’ package.

In the example below, we load a dataset corresponding to a survey measuring vaccination status in Louisville, KY, and we create jackknife replicate weights.

# Load R packages for survey analysis
  library(survey)
  library(svrep)
  library(srvyr)
  library(dplyr)

# Load an example dataset
  data('lou_vax_survey', package = 'svrep')
  
# Create a JK1 survey design
  lou_vax_survey <- as_survey_design(
    .data = lou_vax_survey,
    weights = SAMPLING_WEIGHT,
    ids = 1
  ) |> as_survey_rep(type = "JK1")

We can carry out a simple nonresponse adjustment by using the redistribute_weights() function from the ‘svrep’ package.

# Implement a nonresponse adjustment
  nr_adjusted_survey <- lou_vax_survey |>
    redistribute_weights(
      reduce_if = RESPONSE_STATUS == "Nonrespondent",
      increase_if = RESPONSE_STATUS == "Respondent"
    )

Normally, though, we want to define nonresponse adjustment cells. In the example below, we create four adjustment cells by dividing the sample into quartiles of estimated response propensity.

# Fit a response propensity model
  response_propensity_model <- lou_vax_survey |>
    mutate(IS_RESPONDENT = ifelse(RESPONSE_STATUS == "Respondent", 1, 0)) |>
    svyglm(formula = IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
           family = quasibinomial(link = 'logit'))

# Predict response propensities for individual cases
  lou_vax_survey <- lou_vax_survey |>
    mutate(
      RESPONSE_PROPENSITY = predict(response_propensity_model,
                                    newdata = cur_svy(),
                                    type = "response")
    )
  
# Divide sample into propensity classes
  lou_vax_survey <- lou_vax_survey |>
    mutate(NR_ADJ_CELL = ntile(x = RESPONSE_PROPENSITY, n = 4)) |>
    mutate(NR_ADJ_CELL = factor(NR_ADJ_CELL))

With the nonresponse adjustment cells thus defined, we can conduct the nonresponse adjustments.

# Conduct weight adjustment
  nr_adjusted_survey <- lou_vax_survey |>
    redistribute_weights(
      reduce_if = RESPONSE_STATUS == "Nonrespondent",
      increase_if = RESPONSE_STATUS == "Respondent",
      by = "NR_ADJ_CELL"
    )

After adjusting the weights, we can subset our data to respondents and produce estimates that are hopefully similar to the estimates we would get if we had obtained a 100% response rate.

nr_adjusted_survey |>
  # Subset data to only include respondents
  filter(RESPONSE_STATUS == "Respondent") |>
  # Create a binary 0/1 variable for vaccination status
  mutate(VACCINATED = ifelse(VAX_STATUS == "Vaccinated", 1, 0)) |>
  # Estimate the total and percent of the population vaccinated
  summarize(
    TOTAL_VACCINATED = survey_total(VACCINATED, vartype = NULL),
    PERCENT_VACCINATED = survey_mean(VACCINATED, vartype = NULL)
  )
# A tibble: 1 × 2
  TOTAL_VACCINATED PERCENT_VACCINATED
             <dbl>              <dbl>
1          316259.              0.530

Mean imputation

We can reproduce these estimates exactly using (weighted) mean imputation in each nonresponse adjustment cell.

Here’s a quick illustration of this in R. First, we calculate weighted means in each nonresponse adjustment cell.

lou_vax_survey <- lou_vax_survey |>
  # Create dummy variable for vaccination status
  mutate(VACCINATED = case_when(
    VAX_STATUS == "Vaccinated" ~ 1,
    VAX_STATUS == "Unvaccinated" ~ 0,
    RESPONSE_STATUS == "Nonrespondent" ~ NA_real_
  )) |>
  # Calculate base-weighted mean in each cell
  group_by(NR_ADJ_CELL) |>
  mutate(
    CELL_MEAN = weighted.mean(x = VACCINATED, w = SAMPLING_WEIGHT, na.rm = TRUE)
  ) |>
  ungroup() |>
  # Impute missing values for nonrespondents
  mutate(
    VACCINATED_IMP = case_when(
      is.na(VACCINATED) ~ CELL_MEAN,
      !is.na(VACCINATED) ~ VACCINATED
    )
  )

Next, we use the base weights to estimate the total and percent vaccinated from the full sample. Below, we can see that the estimates based on imputation exactly match the estimates we got earlier using the nonresponse-adjusted weights.

lou_vax_survey |>
  summarize(
    TOTAL_VACCINATED = survey_total(VACCINATED_IMP, vartype = NULL),
    PERCENT_VACCINATED = survey_mean(VACCINATED_IMP, vartype = NULL)
  )
# A tibble: 1 × 2
  TOTAL_VACCINATED PERCENT_VACCINATED
             <dbl>              <dbl>
1          316259.              0.530

Imputation using regression

As noted earlier, this imputation can be expressed as a regression model, where the predictor variables are simply dummy variables for each propensity cell.

To return to our example in R, we first fit the imputation model using the data from respondents.

# Fit the imputation model
imputation_model <- lou_vax_survey |>
  filter(RESPONSE_STATUS == "Respondent") |>
  svyglm(formula = I(VAX_STATUS == "Vaccinated") ~ -1 + NR_ADJ_CELL,
         family = stats::gaussian())

Next, we use the model to make predictions for each case.

# Make imputations using the model
lou_vax_survey <- lou_vax_survey |>
  mutate(
    PREDICTION = predict(
      object = imputation_model,
      newdata = cur_svy()
    ) |> as.numeric()
  )

Then we use the predictions to impute missing values for nonrespondents.

# Impute missing values for nonrespondents
lou_vax_survey <- lou_vax_survey |>
  mutate(
    VACCINATED_IMP = case_when(
      RESPONSE_STATUS == "Respondent" ~ ifelse(VAX_STATUS == "Vaccinated", 1, 0),
      RESPONSE_STATUS == "Nonrespondent" ~ PREDICTION,
    )
  )

And finally, we use the imputed variable to estimate the total and percent vaccinated using the entire sample (respondents and nonrespondents).

# Produce estimates using imputed data
lou_vax_survey |>
  summarize(
    TOTAL_VACCINATED = survey_total(VACCINATED_IMP, vartype = NULL),
    PERCENT_VACCINATED = survey_mean(VACCINATED_IMP, vartype = NULL)
  )
# A tibble: 1 × 2
  TOTAL_VACCINATED PERCENT_VACCINATED
             <dbl>              <dbl>
1          316259.              0.530

Implications

Viewing nonresponse adjustments in this way helps us better understand what the method is actually doing: it is simply imputation, using the same type of model and the same set of predictor variables for every outcome variable (\(Y_1\), \(Y_2\), etc.). We don't normally talk about nonresponse weighting adjustments in this way, but that's what they're doing.

Typically, weighting class adjustments are discussed in terms of response propensities rather than outcome variables. In a typical application, the nonresponse adjustment cells are defined based on response propensities for the survey. The statistician develops some model predicting each sample person’s propensity to respond to the survey, and then nonresponse adjustment cells are defined by dividing the sample into groups with different response propensities. By discussing adjustments in this way, we don’t focus specifically on any single survey outcome variable (e.g., income). But ultimately, our weighting adjustments translate into an imputation model for each and every outcome variable.

An interesting idea raised by this insight is that it might be helpful to evaluate our nonresponse adjustment process using prediction and imputation diagnostics. In his 1984 paper, Rod Little suggested using an overall \(F\)-test from a linear regression of the survey outcome on the response propensity adjustment cells to assess whether the cells actually predict the outcome. Similarly, in a 2010 POQ paper, James Wagner suggested using imputation diagnostics to assess the quality of nonresponse adjustments, focusing on Donald Rubin's fraction-of-missing-information (FMI) statistic to estimate the impact of nonresponse adjustments on key survey estimates. Perhaps there are other diagnostic tools from statistical learning (e.g., cross-validation) that could help us improve our nonresponse adjustment process.
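
For example, in the spirit of Little's suggestion, one rough diagnostic is to fit a respondent-only regression of the outcome on the adjustment-cell indicators and jointly test those terms, for instance with regTermTest() from the 'survey' package. The sketch below reuses the objects from this post's example; treat it as one possible diagnostic rather than a full implementation of either paper's approach.

# Do the adjustment cells predict the outcome among respondents?
# (a Wald-test analogue of the overall F-test Little suggested)
outcome_model <- lou_vax_survey |>
  filter(RESPONSE_STATUS == "Respondent") |>
  svyglm(formula = I(VAX_STATUS == "Vaccinated") ~ NR_ADJ_CELL,
         family = stats::gaussian())

# Joint test of the adjustment-cell terms
survey::regTermTest(outcome_model, test.terms = ~ NR_ADJ_CELL, method = "Wald")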