Calculating R Indicators in R

A short post on calculating R indicators (and their sampling variance) in R.

statistics
surveys
R
Author

Ben Schneider

Published

May 15, 2022

Background: A Brief Introduction to R-Indicators

The phrase “representative sample” is a useful but slippery term. Sometimes it is meant to indicate that a sample meets strong general guarantees of representativeness, such as being selected at random using probability sampling, having a high response rate, or coming from a high-quality sampling frame. At other times, the term “representative sample” is meant to convey simply that a pseudo-scientific opt-in poll has top-line distributions of race and sex that roughly match benchmark numbers from the Census. The ceaseless decline in survey response rates in recent decades has led to efforts to define more precisely what is meant by the term “representative.” One prominent effort is the development of single-number summaries termed “representativeness indicators,” or “R-Indicators” for short.

The R-indicator is a tool for assessing representativeness among sampled respondents, and thus for measuring whether nonresponse causes a representative selected sample to become an unrepresentative responding sample. The basic premise of the R-indicator is that respondents are representative of a population if every person in a selected sample is equally likely to respond to the survey. The reason for choosing this definition of “representative” is that if all sampled individuals are equally likely to respond to the survey, then there is no risk of nonresponse bias, and nonresponse is only a problem insofar as it reduces the size of responding samples and thus increases variance. In other words, with this definition of “representative,” a sample with only a 10% response rate can still be said to be representative provided there is no possibility of nonresponse bias.

This definition of “representative” as having sampled persons be equally likely to respond to a survey suggests that the most unrepresentative survey is one in which half of the population is 100% likely to respond and the other half is 0% likely to respond. Everything between these two extremes can be considered an intermediate level of “representative.” The R-Indicator codifies this idea into a numeric summary measure ranging from 0 to 1, where a value of 0 means that half the population is 100% likely to respond and half the population is 0% likely to respond, while a value of 1 means that every person in the population is equally likely to respond to the survey.

In mathematical notation, let \(p_i\) represent the probability that person \(i\) in a population of \(N\) people would respond to a survey if they were selected as part of a random sample. In common usage, \(p_i\) is referred to as a “response propensity.” Then in the population, there is an average response propensity and a standard deviation of response propensities:

\[ \begin{aligned} \bar{p} &= \frac{1}{N} \sum_{i=1}^{N}p_i \\ S(p) &= \sqrt{\frac{1}{N-1} \sum_{i=1}^{N}(p_i - \bar{p})^2} \end{aligned} \]

If everyone in the population has an equal response propensity, then \(S(p)=0\). If half the population has a response propensity of 1 and the other half has a response propensity of 0, then \(S(p)=0.5\). We can thus define the R-Indicator, denoted \(R\), as follows:

\[ R = 1 - (2\times S(p)) \]
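As a quick sanity check, the two extreme cases above can be verified numerically in base R (a minimal sketch; the tiny discrepancy from 0 in the first case comes from the \(N-1\) denominator in \(S(p)\)):

```r
# Worst case: half the population always responds, half never responds
p_split <- c(rep(1, 500), rep(0, 500))
S_p     <- sd(p_split)       # approximately 0.5 for large N
R_split <- 1 - 2 * S_p       # approximately 0: least representative

# Best case: everyone shares the same response propensity
p_equal <- rep(0.1, 1000)
R_equal <- 1 - 2 * sd(p_equal)  # exactly 1: fully representative
```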

Calculations in R

Example Data

To illustrate how R-Indicators are calculated in R, we’ll load an example survey dataset from the {svrep} package. The lou_vax_survey dataset is a simulated survey measuring Covid-19 vaccination status and a handful of demographic variables, based on a simple random sample of 1,000 adult residents of Louisville, Kentucky with an approximately 50% response rate.

suppressPackageStartupMessages({
  library(svrep)
  library(survey)
  library(srvyr)
})
data("lou_vax_survey")

lou_vax_survey <- srvyr::as_survey_design(
  .data = lou_vax_survey,
  ids = 1, weights = SAMPLING_WEIGHT
)

summary(lou_vax_survey)
Independent Sampling design (with replacement)
Called via srvyr
Probabilities:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001676 0.001676 0.001676 0.001676 0.001676 0.001676 
Data variables:
[1] "RESPONSE_STATUS" "RACE_ETHNICITY"  "SEX"             "EDUC_ATTAINMENT"
[5] "VAX_STATUS"      "SAMPLING_WEIGHT"

The dataset includes a variable VAX_STATUS, which gives the Covid-19 vaccination status of survey respondents. The variable RESPONSE_STATUS gives the response status (“Respondent” or “Nonrespondent”) for the survey, and additional variables RACE_ETHNICITY, SEX, and EDUC_ATTAINMENT provide demographic information for both respondents and nonrespondents.

Estimating Response Propensities

To estimate response propensities, we can use a variety of predictive modeling approaches, such as logistic regression or tree-based classification models. For model selection, we can use K-fold cross-validation, suitably adapted for a complex survey design, with the help of the {surveyCV} package.

# Add an outcome variable to use for prediction
lou_vax_survey <- lou_vax_survey |>
  transform(
    IS_RESPONDENT = ifelse(
      RESPONSE_STATUS == "Respondent",
      1, 0
    )
  )

# Choose a list of predictors using 10-fold cross-validation
library(surveyCV)

cross_validation_results <- cv.svydesign(
  design_object = lou_vax_survey,
  method = "logistic",
  formulae = c(
    "IS_RESPONDENT ~ RACE_ETHNICITY + SEX + EDUC_ATTAINMENT",
    "IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT"
  ),
  nfolds = 10
)

cross_validation_results[which.min(cross_validation_results)]
 .Model_2 
0.6947332 
# Select a final response propensity model
resp_propensity_model <- svyglm(
  design = lou_vax_survey,
  formula = IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
  family = binomial(link = 'logit')
)

summary(resp_propensity_model)

Call:
svyglm(formula = IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT, 
    design = lou_vax_survey, family = binomial(link = "logit"))

Survey design:
update(`_data`, ...)

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                         0.0492     0.1804   0.273
RACE_ETHNICITYHispanic or Latino                   -0.3334     0.3430  -0.972
RACE_ETHNICITYOther Race, not Hispanic or Latino    0.0926     0.2993   0.309
RACE_ETHNICITYWhite alone, not Hispanic or Latino   0.2065     0.1690   1.222
EDUC_ATTAINMENTLess than high school               -0.3051     0.1321  -2.309
                                                  Pr(>|t|)  
(Intercept)                                         0.7851  
RACE_ETHNICITYHispanic or Latino                    0.3313  
RACE_ETHNICITYOther Race, not Hispanic or Latino    0.7571  
RACE_ETHNICITYWhite alone, not Hispanic or Latino   0.2221  
EDUC_ATTAINMENTLess than high school                0.0211 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1.001039)

Number of Fisher Scoring iterations: 4

Next, we’ll estimate response propensities for each member of the selected sample.

lou_vax_survey <- lou_vax_survey |>
  mutate(
    RESP_PROPENSITY = predict(resp_propensity_model,
                              type = "response") |>
      as.numeric()
  )

Calculating the R-Indicator Using the Estimated Response Propensities

After estimating response propensities, it’s straightforward to estimate the population standard deviation of response propensities.

# Estimate the population variance of response propensities
estimated_pop_variance <- lou_vax_survey |>
  svyvar(x = ~ RESP_PROPENSITY)

estimated_pop_variance
                variance    SE
RESP_PROPENSITY 0.002843 1e-04
# Take the square root to get the estimated population standard deviation
estimated_pop_std_dev <- sqrt(
  coef(estimated_pop_variance)
)

estimated_pop_std_dev
RESP_PROPENSITY 
     0.05332017 

From there, the R-Indicator is easy to calculate:

r_indicator <- 1 - 2*estimated_pop_std_dev

print(r_indicator)
RESP_PROPENSITY 
      0.8933597 

Variance Estimation Assuming Known Response Propensities

To get a linearization-based estimate of the R-Indicator’s standard error, we can use svycontrast() to calculate the R-Indicator along with a standard error estimated using the Delta method.

estimated_pop_variance <- lou_vax_survey |>
  svyvar(x = ~ RESP_PROPENSITY)

r_indicator_w_std_error <- svycontrast(
  stat = estimated_pop_variance,
  contrasts = list(
    'R-Indicator' = quote(1 - 2*sqrt(RESP_PROPENSITY))
  )
)

print(r_indicator_w_std_error)
              nlcon     SE
R-Indicator 0.89336 0.0022
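To see where that standard error comes from, note that svycontrast() applies the Delta method to the transformation \(g(V) = 1 - 2\sqrt{V}\), whose derivative is \(g'(V) = -1/\sqrt{V}\), so \(SE(g(V)) \approx |g'(V)| \times SE(V)\). A rough hand check using the rounded values printed above (the small mismatch with svycontrast’s 0.0022 is because the printed SE of the variance, 1e-04, is rounded):

```r
# Delta method by hand for g(V) = 1 - 2*sqrt(V)
V    <- 0.002843          # estimated variance of propensities (from svyvar)
se_V <- 1e-04             # its (rounded) standard error

# |g'(V)| = 1/sqrt(V), so SE(g(V)) ~ se_V / sqrt(V)
se_R <- se_V / sqrt(V)    # approximately 0.0019
```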

This approach assumes that the response propensities we used for estimation are fixed values rather than estimates which would have differed if we’d drawn a different random sample. However, we had to estimate the response propensities using sample data, so the variance estimation approach described above doesn’t tell the full story: there’s likely some extra variance in the estimate which we’ve failed to account for.

Variance Estimation, Accounting for Unknown Response Propensities

To account for the variance caused by having to estimate response propensities, we can use a replication method of variance estimation. With this approach, we refit the response propensity model separately for each replicate.

First, we’ll generate replicate weights appropriate to the survey design.

lou_vax_rep_design <- lou_vax_survey |> 
  as.svrepdesign(type = "bootstrap", replicates = 500)

Next, we can use the function withReplicates() from the {survey} package to estimate response propensities and calculate the R-Indicator separately for each replicate.

r_indicator_w_rep_std_error <- lou_vax_rep_design |>
  withReplicates(theta = function(weights, variables_data_frame) {
    
    # Create a design object that's only good for
    # getting weighted estimates
    design_to_get_point_estimates <- svydesign(
      ids = ~ 1, weights = weights,
      data = variables_data_frame
    )
    
    # Note that with `withReplicates()`, we can fit the model
    # with `glm()` using the current replicate's weights,
    # instead of `svyglm()`, since we only need point estimates
    resp_propensity_model <- glm(
      data = variables_data_frame,
      weights = weights,
      formula = IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
      family = stats::quasibinomial(link = 'logit')
    )
    
    # Add estimated response propensities to the data
    resp_propensities <- predict(resp_propensity_model,
                                 type = "response")
    
    design_to_get_point_estimates[['variables']][['RESP_PROPENSITY']] <- (
      resp_propensities
    )

    # Estimate the population variance of response propensities
    estimated_pop_variance <- svyvar(x = ~ RESP_PROPENSITY,
           design = design_to_get_point_estimates) |>
      coef()
    
    # Calculate the R-Indicator
    r_indicator <- 1 - 2*sqrt(estimated_pop_variance)
    return(r_indicator)
  })
Warning: glm.fit: algorithm did not converge

Let’s compare the two standard error estimates.

# The estimate, w/ std. error accounting for propensity estimation
r_indicator_w_rep_std_error
                  theta     SE
RESP_PROPENSITY 0.89336 0.0573
# The estimate which assumes propensities are fixed
r_indicator_w_std_error
              nlcon     SE
R-Indicator 0.89336 0.0022

Wow! After we take into account the model-fitting for the response propensities, the standard error estimate is roughly 25 times as large as the “naive” estimate. So if we want to be honest about the sampling variance of the R-Indicator, we need to take into account the estimation of response propensities, at least in this example. Of course, even here we’re only accounting for model-fitting; like most statisticians, we’re acting as if the model selection (choice of variables, functional form, etc.) was fixed rather than itself varying across random samples. But that’s an issue for another day…