What’s the margin of error for the employee engagement index?

The key performance metric of many employee engagement surveys, the Engagement Index, is a statistic whose margin of error is deceptively difficult to estimate. As a result, many organizations fail to present a margin of error at all. I discuss why this estimation problem is hard to solve with the standard tools of engagement researchers, and I suggest two solutions based on tools from survey sampling theory. I provide example R code and discuss how to estimate the margin of error in general-purpose, non-statistical software.

Categories: employee engagement, statistics, surveys, sampling, R

Published: November 1, 2021

The key performance metric in employee engagement surveys is usually a statistic called an “engagement score” or “engagement index”, which is a score that ranges from 0% to 100%. To calculate the engagement index, analysts follow a two-step process. First, for a group of, say, five key survey questions, they calculate for each question the percent of positive responses (referred to as the question’s “percent positive”). Next, they take an unweighted average of the “percent positive” of all the questions in the group (i.e. they add up the five percentages and then divide the total by the number of questions).

Such scores are ubiquitous in employee engagement research. A prominent example is the Employee Engagement Index used in the Federal Employee Viewpoint Survey (FEVS), a massive survey of the United States federal workforce. Another example is the Gallup Q12 score, which is an average of the percent positive responses from twelve specific questionnaire items. IQS Research, where I worked on employee surveys for several years, calculates index scores for many groups of survey questions, producing statistics such as a “Work Environment” score or a “Management” score. Another commonly-used vendor, Culture Amp, calculates these scores as well (using the name “factor score”, although the score’s calculation has nothing to do with factor analysis).

The engagement index is commonly interpreted as the percent of employees who are engaged in their work at an organization. This interpretation is not quite right, since the engagement index is not calculated as an average over employees but as an average over survey questions. This fact is inconvenient in two respects. First, averaging over questions rather than employees complicates our ability to interpret the score. Second, averaging over questions substantially complicates the calculation of margins of error for engagement scores. This second problem is the focus of this blog post: I explain why estimating the margin of error for index scores is not so simple, and I suggest a couple of solutions using tools from survey sampling theory.

An Illustrative Example

Here’s a totally made-up example of how an index score is calculated. Suppose we have a survey consisting of the following questions:

Please rate the following items on a scale from “1 - Not at all satisfied” to “5 - Completely satisfied”, where applicable.

  1. The instructions I receive from my manager.

  2. My work boots.

  3. My office computer.

We would then expect the survey responses to look something like the following:

Employee Type   Q1   Q2   Q3
Field            2   NA    5
Field            5   NA    4
Office           3    4   NA
Office           4    3   NA

To calculate the “percent positive” score for each item, we first classify each response as positive or not and then calculate the percent positive for each question (ignoring responses of “N/A”).

Question Percent Positive
Q1 50%
Q2 50%
Q3 100%

Finally, to calculate an index score, we take the unweighted average of the three ‘percent positive’ scores.

Index Score: 67% = (1/3) * (50% + 50% + 100%)
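This arithmetic is easy to reproduce in R. Here’s a minimal sketch using the toy data above (assuming, as before, that a response of ‘4’ or ‘5’ counts as positive):

# Toy data matching the example above ("NA" = question was not applicable)
responses <- data.frame(Q1 = c(2, 5, 3, 4),
                        Q2 = c(NA, NA, 4, 3),
                        Q3 = c(5, 4, NA, NA))

# Percent positive for each question, ignoring "NA" responses
pct_positive <- sapply(responses,
                       function(x) mean(x[!is.na(x)] %in% c(4, 5)))
# Q1 = 0.5, Q2 = 0.5, Q3 = 1.0

# Index score: the unweighted average of the percent-positive values
index_score <- mean(pct_positive)
# 0.6666667, i.e. about 67%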

Why sampling error is tricky to estimate using the standard tools of engagement researchers

Although engagement research is almost always done using surveys, employee engagement researchers rarely make use of methods from the specific branch of statistics which is referred to as “survey statistics”, “sampling theory”, or “finite population statistics.” Typically, engagement researchers are trained in statistical methods from psychology or marketing which, as a rule, assume simple random sampling from an infinitely large population. There’s essentially only one formula from survey sampling theory which is well-known to employee survey researchers: the margin of error at the 95% confidence level for a percentage estimated using a simple random sample of size \(n\) from a population of size \(N\), which is often accessed using popular online calculators such as that of Pollfish or Stat Trek.
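For reference, that formula (written here in the form that includes a finite population correction; exact details vary a bit from calculator to calculator) is:

\[ \text{Margin of error} = 1.96 \times \sqrt{\frac{p(1-p)}{n}} \times \sqrt{\frac{N-n}{N-1}} \]

where \(p\) is the estimated proportion of interest (often set to 0.5 to give a worst-case margin of error) and the second square root is the finite population correction.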

That formula doesn’t work for engagement scores. Clearly it could be used to estimate the margin of error for a single question’s “percent positive” score (assuming simple random sampling). But an index score is not a simple proportion: it’s an average of proportions.

Ok, no big deal. An index score is an average of random variables whose sampling variances we know how to calculate, and you can use a Stats 101 formula to get the variance of an average of random variables: \(Var(p_1+p_2)=Var(p_1)+Var(p_2)+2 \times Cov(p_1,p_2)\), where \(p_1\) is the “percent positive” score of one question and \(p_2\) is the percent-positive score of another question. This generalizes to \(M\) questions with “percent positive” scores \(p_1, p_2,\dots,p_M\) via the formula \(Var(\sum_{i=1}^{M} p_i)=\sum_{i=1}^{M}Var(p_i) + \sum_{i=1}^{M}\sum_{j \neq i} Cov(p_i, p_j)\). Since the index score is an average, we then calculate \(Var(\frac{1}{M} \sum_{i=1}^{M}p_i) = \frac{1}{M^2} Var(\sum_{i=1}^{M}p_i)\).

Pretty simple! Assuming our data are a simple random sample from a large population, and that the same set of respondents answers every pair of questions (whose respective scores are \(p_1\) and \(p_2\)), we can estimate the covariances using \(cov(p_1, p_2) = \frac{1}{n(n-1)}\sum_{i=1}^{n} (y_{1,i} - p_1)(y_{2,i} - p_2)\), where \(y_{1,i}\) and \(y_{2,i}\) are indicator variables for individual respondents (equal to 1 if respondent \(i\) gave a positive response to the item, such as a ‘4’ or ‘5’ on the five-point scale, and 0 otherwise).
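Here’s a small sketch of what that calculation looks like in R, using hypothetical simulated data in which every respondent answers both items (the sample size, the shared “engagement” factor, and the indicators y1 and y2 are all made up for illustration):

# Simulate hypothetical 0/1 indicators of a positive response to two items,
# correlated through a shared underlying "engagement" factor
set.seed(1)
n <- 200
engagement <- rnorm(n)
y1 <- as.integer(engagement + rnorm(n) > 0)
y2 <- as.integer(engagement + rnorm(n) > 0)

# Percent-positive estimates for the two items
p1 <- mean(y1)
p2 <- mean(y2)

# Estimated variances and covariance of the percent-positive estimates
var_p1  <- var(y1) / n
var_p2  <- var(y2) / n
cov_p12 <- sum((y1 - p1) * (y2 - p2)) / (n * (n - 1))

# Estimated variance and standard error of the two-item index score
var_index <- (1/2)^2 * (var_p1 + var_p2 + 2 * cov_p12)
se_index  <- sqrt(var_index)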

However, things become more complicated when the set of respondents differs between two items, which happens in almost every major survey. This happens for instance if the two questions apply to somewhat different sets of employees (e.g. “are you satisfied with your work boots?” applies to field workers while “are you satisfied with your computer monitor?” applies to office workers). The more common scenario which arises in essentially every survey is that the set of respondents can differ between pairs of questions because a respondent might happen to answer only one question and not the other. There’s no tool in the standard employee engagement researcher’s toolbox that tells us how to estimate the covariances between two items’ percent-positive scores in this case. And I suspect this is a big part of why so many engagement researchers never calculate the most basic of inferential statistics–a standard error–for engagement index scores, even when they report impressive-sounding statistics such as “Cronbach’s Alpha”, “Spearman’s Rho”, or factor loadings from a structural equation model used to analyze the questions used in the index. 1

And the problem becomes more complicated still when the sample design uses stratification or clustering, when we sample a significant percent of employees (say, greater than 10%), or when our estimates use weighting and non-response adjustments. How do we estimate covariances in these cases?

Solution #1: Resampling-based methods such as the jackknife

A relatively easy solution is to use resampling methods. In a nutshell, you treat your sample as if it were a population, repeatedly draw “replicate samples” from it, calculate the index score for each replicate sample, and use the variation of the index score across replicate samples to estimate the sampling variance of the index score. The most commonly-used and widely-applicable resampling methods are the bootstrap (which draws each replicate sample randomly, with replacement) and the jackknife (which forms each replicate sample by deleting observations), and the jackknife is in fact what is used in the Federal Employee Viewpoint Survey. The jackknife is deterministic (two people using the method will always get the same estimate), while the bootstrap involves random resampling (two people using the method will usually get slightly different estimates unless they use the same random seed). In practice, both methods are often implemented using replicate weights produced by a data provider and attached to the survey dataset, which allow multiple analysts to get the exact same variance estimates as one another when using the data. Once replicate weights have been produced, it is trivial to use them for variance estimation with any computational software. Thus, resampling methods are well-suited to the web applications often used to disseminate employee survey results. 2

Below is a tiny example to illustrate the use of the jackknife with replicate weights:

# Create fake data
survey_data <- data.frame(`Employee Type` = c('Field', 'Field', 'Office', 'Office'),
                          Q1 = c(2,5,3,4),
                          Q2 = c(NA, NA, 4, 3),
                          Q3 = c(5,4, NA, NA),
                          Weight = c(1,1,1,1),
                          check.names = FALSE)

# Create jackknife (JK1) replicate weights:
# for replicate i, drop respondent i (weight of 0) and scale the remaining
# respondents' weights up by a factor of n/(n-1)

n <- nrow(survey_data)
jackknife_weights <- array(dim = c(n, n))
for (i in 1:n) {
  jackknife_weights[i,i] <- 0
  jackknife_weights[-i,i] <- (n/(n-1)) * survey_data$Weight[-i]
}
colnames(jackknife_weights) <- paste0("JK_Wt_Set", 1:n)

survey_data <- cbind(survey_data, jackknife_weights)
Employee Type  Q1  Q2  Q3  Weight  JK_Wt_Set1  JK_Wt_Set2  JK_Wt_Set3  JK_Wt_Set4
Field           2  NA   5       1    0.000000    1.333333    1.333333    1.333333
Field           5  NA   4       1    1.333333    0.000000    1.333333    1.333333
Office          3   4  NA       1    1.333333    1.333333    0.000000    1.333333
Office          4   3  NA       1    1.333333    1.333333    1.333333    0.000000
# Create function to calculate an index score ----
calculate_index_score <- function(data, wts) {
  # Weighted percent-positive for each question: the sum of weights for
  # positive responses ('4' or '5') divided by the sum of weights for
  # all non-missing responses
  pct_pos_scores <- sapply(data[, c("Q1", "Q2", "Q3")],
                           function(x) {
                             sum(wts[x %in% c(4,5)]) / sum(wts[!is.na(x)])
                           })
  # The index score is the unweighted average of the percent-positive scores
  index_score <- c('index_score' = mean(pct_pos_scores))
  return(index_score)
}

# Calculate the estimated index score using the full sample ----
full_sample_estimate <- calculate_index_score(data = survey_data,
                                              wts = survey_data$Weight)


# Calculate the index score with each set of weights ----
library(dplyr)   # provides select()

rep_estimates <- vector('numeric', 0)

for (jk_weight_set in select(survey_data, JK_Wt_Set1:JK_Wt_Set4)) {
  jk_estimate <- calculate_index_score(data = survey_data,
                                       wts = jk_weight_set)
  rep_estimates <- c(rep_estimates, jk_estimate)
}

names(rep_estimates) <- paste0("Resample estimate #", 1:n)
Estimate from full sample: 67%
Resample estimate #1 Resample estimate #2 Resample estimate #3 
           0.7222222            0.6111111            0.5555556 
Resample estimate #4 
           0.7777778 

\[ \begin{aligned} \text{Variance estimate} &= \frac{n-1}{n} \sum_{j=1}^{n} (\hat{\theta}_{\textit{rep. }j} - \hat{\theta}_{\textit{full-sample}})^2 \\ &= \frac{3}{4} \left[ (0.72 - 0.67)^2 + (0.61 - 0.67)^2 + (0.56 - 0.67)^2 + (0.78 - 0.67)^2\right] \\ &= 0.0231481 \\ \text{Estimated standard error} &= \sqrt{\text{Variance estimate}} \\ &= 0.1521452 \\ \end{aligned} \]

# Use the replicate estimates to estimate the standard error ----
jackknife_var_estimate <- ((n-1)/n) * sum( (rep_estimates - full_sample_estimate)^2 )
jackknife_standard_error <- sqrt(jackknife_var_estimate)

Resampling procedures are provided in a number of common software tools (e.g. the R package named {boot} which facilitates bootstrap methods), but such off-the-shelf procedures often assume the use of simple random sampling and are usually inappropriate for survey data which has features such as large sampling fractions, stratification, clustering, or weight adjustments. The R package {survey} provides user-friendly and appropriate jackknife and bootstrap methods for surveys and can easily be used to produce variance estimates for arbitrary statistics such as the Engagement Index.

Below is a simple example of how to use the {survey} package to estimate standard errors with jackknife weights.

library(survey)
# Describe the survey design to R
# so it can figure out how to estimate variances
survey_design <- srvyr::as_survey_design(
  .data = survey_data,
  ids = 1,
  weights = Weight
)

# Create Jackknife weights
jackknife_design <- srvyr::as_survey_rep(
  .data = survey_design,
  type = "JK1"
)

# Use the Jackknife weights to estimate variances

index_score <- withReplicates(
  design = jackknife_design,
  theta = function(wts, data) {
    calculate_index_score(data = data, wts = wts)
  })

print(index_score)
              theta     SE
index_score 0.66667 0.1521
# Calculate a 95% confidence interval (based on the normal approximation)
confint(index_score, level = 0.95)
                2.5 %    97.5 %
index_score 0.3684676 0.9648657
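If you prefer the bootstrap to the jackknife, the workflow is identical; only the step that generates the replicate weights changes. Below is a sketch of what that might look like. Note that the data frame full_survey_data is hypothetical (a realistically-sized respondent file with the same columns as survey_data); with the four-person toy dataset above, some bootstrap replicates would contain no respondents for a given question. The choice of 500 replicates is also arbitrary, and because the bootstrap is random the resulting standard error will differ slightly from run to run.

# Hypothetical: 'full_survey_data' is a realistically-sized respondent file
# with the same columns as 'survey_data' above
base_design <- survey::svydesign(ids = ~ 1,
                                 weights = ~ Weight,
                                 data = full_survey_data)

# Create bootstrap replicate weights (500 replicates is an arbitrary choice)
bootstrap_design <- survey::as.svrepdesign(design = base_design,
                                           type = "bootstrap",
                                           replicates = 500)

# Use the bootstrap replicate weights to estimate the standard error
withReplicates(
  design = bootstrap_design,
  theta = function(wts, data) {
    calculate_index_score(data = data, wts = wts)
  })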

The sets of replicate weights created by R can easily be exported and used to estimate sampling errors with other, non-statistical software (Excel, SQL, etc.).

rep_weights <- jackknife_design$repweights$weights
colnames(rep_weights) <- paste0("JK_Weights_", 1:4)

data_with_rep_weights <- cbind(survey_data[,1:5], rep_weights)
Employee Type  Q1  Q2  Q3  Weight  JK_Weights_1  JK_Weights_2  JK_Weights_3  JK_Weights_4
Field           2  NA   5       1      0.000000      1.333333      1.333333      1.333333
Field           5  NA   4       1      1.333333      0.000000      1.333333      1.333333
Office          3   4  NA       1      1.333333      1.333333      0.000000      1.333333
Office          4   3  NA       1      1.333333      1.333333      1.333333      0.000000

Solution #2: Linearization-based methods

Two statistical concepts ubiquitous in survey sampling theory are helpful here.

First, there’s the notion of a ratio estimate, which is an estimate of a quantity such as \(R=Y/X\). A percent-positive score can be thought of as a ratio estimate, where \(Y\) is the number of employees who would answer a given question with a response of ‘4’ or ‘5’, and \(X\) is the number of employees who would answer the question at all. There is a subtle difference between a ratio estimate and the proportion \(p=y/n\) (with \(n\) being the number of survey respondents and \(y\) being the number of respondents who gave a positive response to a question). Only the proportion is typically taught in Stats 101. For the Stats 101 proportion \(p=y/n\), we regard \(n\) as fixed across all of the samples that might be realized. However, for a ratio estimate, the denominator \(X\) may vary across samples. This can occur, for example, if we want to calculate the engagement index score for a subset of employees (e.g. part-time employees) and the sample size of such employees is not fixed to some pre-determined number. The sampling variance of \(p=y/n\) is simple to calculate using methods from Stats 101, but the sampling variance of \(R=Y/X\) is not as straightforward.

This leads us to the concept of linearization. Linearization (also known as the “delta method”) is a tool for dealing with non-linear statistics, such as ratios, whose variances cannot generally be described using exact formulas. The essence of linearization is to take a non-linear statistic, approximate it with a linear one whose variance is easier to estimate, and then use the linear statistic’s variance estimate as an estimate of the non-linear statistic’s variance. The most common way to linearize a statistic is by using “Taylor series linearization”, which approximates a statistic using a first-order Taylor polynomial. Other methods, such as the “functional delta method”, are also used in practice (see recent blog posts 1 and 2 for examples and details).
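To make the idea concrete with formulas (these are standard results found in most sampling textbooks): if a statistic can be written as a smooth function \(f(\hat{T}_1, \dots, \hat{T}_K)\) of estimated totals, its first-order Taylor approximation is

\[ f(\hat{T}_1, \dots, \hat{T}_K) \approx f(T_1, \dots, T_K) + \sum_{k=1}^{K} \frac{\partial f}{\partial T_k}\left(\hat{T}_k - T_k\right) \]

Because the right-hand side is a constant plus a linear combination of estimated totals, its variance is straightforward to estimate. Applied to a single ratio \(\hat{R}=\hat{Y}/\hat{X}\), this yields the familiar approximation

\[ Var(\hat{R}) \approx \frac{1}{X^2}\left[ Var(\hat{Y}) - 2R\,Cov(\hat{Y}, \hat{X}) + R^2\,Var(\hat{X}) \right] \]

with the unknown \(X\) and \(R\) replaced by their sample estimates in practice.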

Setting up the notation

First, let’s use clear notation to describe the variables involved in our problem. We’ll start with the “percent-positive” scores, where for the sake of illustration a “positive” response means a response of “4” or “5” to a questionnaire item with a 5-point Likert scale (e.g. a question with response options ranging from “1 - Not at all satisfied” to “5 - Completely satisfied”).

\[ \begin{aligned} y_{k,i} &= \begin{cases}1~&{\text{ if person }}i~\text{responds }`\text{4' or }`\text{5' for questionnaire item }k~\\0~&{\text{ otherwise }}\end{cases} \\ d_{k,i} &= \begin{cases}1~&{\text{ if person }}i~\text{responds to questionnaire item }k~\\0~&{\text{ otherwise}}~\end{cases} \\ \\ Y_k &= \sum_{i=1}^{N}y_{k,i} \\ N_k &= \sum_{i=1}^{N} d_{k,i} \\ \end{aligned} \]

What we’re interested in is the index score (also referred to as a “dimension score”), denoted \(S\), which is the average of \(M\) percent-positive scores, \(R_1, \dots, R_M\).

\[ \text{The } ``\text{Dimension Score"} \\ S = \frac{1}{M}\sum_{k=1}^{M} \frac{Y_k}{N_k} \]

With a sample, we have an estimate of \(S\) based on estimates of the \(Y_k\) and \(N_k\) values.

\[ \text{The estimated } ``\text{Dimension Score"} \\ \hat{S} = \frac{1}{M}\sum_{k=1}^{M} \frac{\hat{Y}_k}{\hat{N}_k} \]

To estimate the variance of \(\hat{S}\) using linearization, we instead estimate the variance of a linear approximation of \(\hat{S}\), denoted \(\tilde{S}\), calculated from the estimated totals \(\hat{Y}_k\) and \(\hat{N}_k\) (in our three-question example, \(k \in \{1,2,3\}\)).

\[ \begin{aligned} \tilde{S} &= a_1 \hat{Y}_1 + b_1 \hat{N}_1 + a_2 \hat{Y}_2 + b_2 \hat{N}_2 + a_3 \hat{Y}_3 + b_3 \hat{N}_3 \\ Var(\tilde{S}) &\approx Var(\hat{S})\\ \end{aligned} \]

The coefficients \(a_1,\dots,a_3\) and \(b_1,\dots b_3\) are calculated based on the partial derivatives of \(S\) with respect to \(Y_1\), \(N_1\), etc. and are treated as fixed values; we won’t get into their specific values in this blog post, as they’re tedious and generally we’ll want to use software to figure them out for us. This gives us the following variance estimate:

\[ \begin{aligned} var(\tilde{S}) &= \sum_{k=1}^{3} a_k^2 \, var(\hat{Y}_k) + \sum_{k=1}^{3} b_k^2 \, var(\hat{N}_k) \\ & + \sum_{k=1}^{3}\sum_{j \neq k} a_k a_j \times cov(\hat{Y}_k, \hat{Y}_j) \\ & + \sum_{k=1}^{3}\sum_{j \neq k} b_k b_j \times cov(\hat{N}_k, \hat{N}_j) \\ & + 2\sum_{k=1}^{3}\sum_{j=1}^{3}a_k b_j \times cov(\hat{Y}_k, \hat{N}_j) \\ \end{aligned} \]

The variances and covariances for \(\hat{Y}_1, \dots \hat{Y}_3\) and \(\hat{N}_1, \dots, \hat{N}_3\) are simple to estimate, but care should be taken to make sure that the estimates are appropriate to the survey sample design and reflect factors such as large sampling fractions, stratification, clustering, or weighting adjustments.
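As an example of what “appropriate to the sample design” means in practice: if the survey were, say, a stratified sample of 300 of an organization’s 1,000 employees, stratified by department, declaring the design might look something like the sketch below (the data frame and column names here are hypothetical and are not used elsewhere in this post). The fpc argument accounts for the large sampling fraction, and the strata argument accounts for the stratification.

# Hypothetical stratified design with a large sampling fraction:
# 'stratified_sample_data' is assumed to contain one row per respondent,
# a 'Department' column (the stratification variable), a 'Department_Size'
# column (the number of employees in that department), and a 'Weight' column
stratified_design <- srvyr::as_survey_design(
  .data = stratified_sample_data,
  ids = 1,
  strata = Department,
  fpc = Department_Size,
  weights = Weight
)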

Computation in R

In practice, it’s easiest to just use statistical software such as the {survey} package in R. This package makes it easy to calculate the variance/covariance estimates for the totals \(\hat{Y}_1, \dots \hat{Y}_3\) and \(\hat{N}_1, \dots, \hat{N}_3\), and also makes it easy to estimate the variance of functions of these estimated totals.

First you declare a survey design object with your data, which tells R how to correctly estimate sampling variances for the totals.

library(survey)
library(srvyr)

survey_design <- srvyr::as_survey_design(
  .data = survey_data,
  ids = 1,
  weights = Weight
)

Next, we create the respondent-level indicator variables which are used to calculate the totals.

survey_design <- survey_design %>%
  mutate(Q1_Positive = ifelse(Q1 %in% c(4,5), 1, 0),
         Q1_Answered = ifelse(!is.na(Q1), 1, 0),
         Q2_Positive = ifelse(Q2 %in% c(4,5), 1, 0),
         Q2_Answered = ifelse(!is.na(Q2), 1, 0),
         Q3_Positive = ifelse(Q3 %in% c(4,5), 1, 0),
         Q3_Answered = ifelse(!is.na(Q3), 1, 0))

Based on those indicator variables, we use the svytotal() function to calculate the totals and their estimated variances and covariances.

necessary_totals <- svytotal(design = survey_design,
                             x = ~ Q1_Positive + Q1_Answered
                                 + Q2_Positive + Q2_Answered
                                 + Q3_Positive + Q3_Answered,
                             covmat = TRUE)
print(necessary_totals)
            total     SE
Q1_Positive     2 1.1547
Q1_Answered     4 0.0000
Q2_Positive     1 1.0000
Q2_Answered     2 1.1547
Q3_Positive     2 1.1547
Q3_Answered     2 1.1547
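Since we specified covmat = TRUE, the print-out above shows only the totals and their standard errors, but the full variance-covariance matrix of the six totals is stored in the object and can be inspected (or exported) with vcov():

# Extract the full 6 x 6 variance-covariance matrix of the estimated totals
vcov(necessary_totals)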

To implement Taylor series linearization, we can use the svycontrast() function. By supplying a formula for how to calculate the index score based on the totals, R figures out how to appropriately estimate the standard error using linearization.

index_score <- svycontrast(
  stat = necessary_totals,
  contrasts = list(
    'Index_Score' = quote( (1/3)*(  (Q1_Positive/Q1_Answered) 
                                  + (Q2_Positive/Q2_Answered)
                                  + (Q3_Positive/Q3_Answered)))
  ))

print(index_score)
              nlcon     SE
Index_Score 0.66667 0.0962

If we want a 95% confidence interval, we simply use the following:

confint(index_score, level = 0.95, df = degf(survey_design))
                2.5 %    97.5 %
Index_Score 0.3604356 0.9728977

Conclusions

There are three key takeaways from all this:

  1. Although the engagement index is quite simple to calculate, it’s surprisingly tricky to calculate an appropriate margin of error or confidence interval.
  2. Nonetheless, it’s quite possible to do this using methods from survey statistics which are easily implemented in free, open-source software.
  3. To calculate the margin of error in general purpose, non-statistical software (e.g. in a web application), the easiest solution is to use replicate methods such as the jackknife or bootstrap. These methods can be implemented using only a simple formula and a data file with replicate weights which are created once in R (or other appropriate software).

Footnotes

  1. Of course, the more important reasons are that: (a) clients often don’t care about the margin of error beyond the cursory survey-wide margin of error reported using some calculator that an analyst found with Google; (b) the arcane-sounding statistics (Cobblepot’s Gamma, or whatever) help impress clients, whereas quantifications of uncertainty do not; and (c) the models are readily available in SPSS and are taught in undergraduate-level psychology methods classes.↩︎

  2. To use replicate weights in a web application, the survey data and replicate weights would be provided to the web application. A formula would be used to calculate index scores from the survey data, and margins of error would be calculated from the survey data together with the replicate weights. Such formulas are quite simple and could easily be implemented by web programmers in JavaScript, SQL, and other common tools.↩︎