How are R and Stata (mis)handling singleton strata?

Analysts attempting to estimate variance contributions from “singleton strata” often use the “adjust”/“recentering” options implemented in R and Stata, whose meaning is unclear due to ambiguous documentation. In this post, I try to clarify what these options are actually doing in theory, and I show that in practice their software implementations have surprising bugs.

statistics
surveys
R
Stata
software
Author

Ben Schneider

Published

September 2, 2022

When analyzing stratified, multistage samples, it’s fairly common to encounter singleton strata: that is, strata with only one PSU (which Thomas Lumley dubbs “lonely PSUs”). The problem with singleton strata is that it’s generally impossible to produce exactly unbiased variance estimates from these strata. That’s why survey software usually throws a warning or error message when these strata are encountered. Here’s a sample of these warnings from R, Stata, and SAS:

R ‘survey’ package

Warning message:
Stratum (1) has only one PSU at stage 1

Stata

Note: Missing standard error because of stratum with single sampling unit.

SAS

NOTE: Only one cluster in a stratum for variable(s) y.
The estimate of variance for y will omit this stratum.

The problems with singleton strata

To understand why singleton strata are a problem, it’s useful to consider the standard formula used by software packages to estimate variances in stratified, multistage samples.

The estimated population total \(\hat{Y}\) is the sum of the estimated population totals in the \(H\) strata, with \(\hat{Y}=\sum_{h=1}^{H} \hat{Y}_h\), and the sampling variance estimate \(v(\hat{Y})\) is the sum of the strata sampling variances.

\[ \begin{aligned} &\textbf{The estimated sampling variance} \\ &\textbf{of the estimated population total, }\hat{Y} \\ &v(\hat{Y}) = \sum_{h=1}^{H} v_h(\hat{Y}_h) \\ &\text{where } h \text{ denotes one of the }H \text{ strata, and} \\ & \hat{Y_h} \text{ denotes the estimated population total in stratum } h \\ \end{aligned} \]

Each stratum’s sampling variance is estimated as follows:

\[ \begin{aligned} & v_h(\hat{Y_h}) = (1 - f_h) \times n_h \times \frac{1}{n_h-1} \sum_{i=1}^{n_h}(y_{hi.} - \bar{y}_{h..})^2 \\ &\text{where } n_h \text{ denotes the number of first-stage sampling units}, \\ & y_{hi.} \text{ is the weighted total of PSU } i, \\ & \bar{y}_{h...} = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi.} \text{ is the mean of the PSU weighted totals in stratum }h \\ &\text{and } f_h = n_h/N_h \text{ is the sampling fraction of first-stage sampling units in stratum }h \end{aligned} \]

The term \(\frac{1}{n_h-1} \sum_{i=1}^{n_h}(y_{hi.} - \bar{y}_{h..})^2\) is essentially an estimate of the population variance of weighted PSU totals in the stratum, commonly denoted \(S^2_h = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} (y_{hi.} - \bar{y}_{h..})^2\).

In a singleton stratum (let’s denote it \(h^{\star}\)), we have \(n_{h^{\star}}=1\) and the mean of the stratum PSU totals simply equals the total of the lone PSU in that stratum. That is, \(\bar{y}_{{h^{\star}}..} = y_{{h^{\star}}i.}\).

So our estimated variance contribution is thus:

\[ \begin{aligned} v_{h^{\star}}(\hat{Y_h^{\star}}) &= (1 - f_{h^{\star}}) \times n_{h^{\star}} \times \frac{1}{n_{h^{\star}}-1} \sum_{i=1}^{n_{h^{\star}}}(y_{{h^{\star}}i.} - \bar{y}_{{h^{\star}}..})^2 \\ &= (1 - f_{h^{\star}}) \times 1 \times \frac{1}{0} (y_{{h^{\star}}i.} - y_{{h^{\star}}i.})^2 \\ &= (1 - f_{h^{\star}}) \times 1 \times \frac{1}{0} (0)^2 \end{aligned} \]

This has two problems:

  • Problem 1: The term \(1/(n_{h^{\star}} - 1)\) is undefined. For the sake of estimation, we could use the factor \(1/n_{h^{\star}}\) instead of \(1/(n_{h^{\star}} - 1)\). That’s what Stata and R do.

  • Problem 2: The observed deviation from the stratum mean is \(0\). So even if we solve the first problem, we get a variance estimate of 0.

“Recentering” Solutions in R and Stata

For problem 2, Stata and R offer a “recentering” solution, where the empirical stratum mean for the singleton stratum \(h^{\star}\) is replaced by the “grand mean” or “population mean”. This is called recentering because in calculating the variance contribution from \(y_{h^{\star}i}\), the recentering methods replace the deviation of \(y_{h^{\star}i}\) from the central value of the stratum (\(\bar{y}_{{h^{\star}}..}\)) with the deviation of \(y_{h^{\star}i}\) from the central value of the entire sample (\(\hat{Y}/n\), for example).

Click for a summary of the motivation behind recentering

Intuitively, the idea is to impute the unknown stratum mean using an estimate based on the entire sample. If the singleton stratum is similar to other strata, then this gives a good estimate of the variance contribution from the singleton stratum. If the singleton stratum is very different from the other strata, then this gives an overestimate of the variance contribution from the singleton stratum, which is OK since in survey variance estimation we usually consider overestimation to be preferable to underestimation.

In R, the recentering method is used by running the command option(survey.lonely.psu = 'adjust'). In Stata, the recentering method is used by specifying the option singleunit(centered) when running the svyset command.

How does the software documentation describe these methods?

Unfortunately, the documentation for both Stata and R are not totally clear what they mean by “population mean” or “grand mean” and there are no formulas or literature references provided.

Stata

Here’s the Stata documentation on the recentering method, which is implemented with the command singleunit(centered):

singleunit(method) specifies how to handle strata with one sampling unit. … singleunit(centered) specifies that strata with one sampling unit are centered at the grand mean instead of the stratum mean.

Stata Manual for svyset: https://www.stata.com/manuals/svysvyset.pdf

When this option is actually used in Stata, the following note is displayed:

Note: Strata with single sampling unit centered at overall
      mean.

R

And here’s the documentation for the {survey} package, available by calling ?survey::survey.lonely.psu:

Handling of strata with a single PSU that are not certainty PSUs is controlled by options("survey.lonely.psu"). Use … “adjust” to center the stratum at the population mean rather than the stratum mean

Survey Package Documentation

Additionally, this {survey} package documentation from 2006 says the following:

With

options(survey.lonely.psu="adjust")

the data for the single-PSU stratum are centered at the sample grand mean rather than the stratum mean. This is conservative.

http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html

Ok, but what does that actually mean?

A couple reasonable interpretations of the recentering methods they describe are:

\[ \begin{aligned} \textbf{Option 1: } & \text{ Use the deviation from the average of PSUs across all strata} \\ & \left( y_{{h^{\star}}i} - \hat{Y}/n \right) \\ &\text{where } \hat{Y} \text{ is the estimated population total from the full sample} \\ &\text{and } n \text{ is the total number of PSUs in the sample} \\ \\ \textbf{Option 2: } & \text{ Use the deviation from the average stratum's mean} \\ & \left( y_{{h^{\star}}i} - \frac{1}{H}\sum_{h=1}^{H} \frac{\hat{Y}_h}{n_h} \right) \\ &\text{where } H \text{ is the total number of strata} \\ &\text{and } \hat{Y_h} \text{ is the estimated population total in stratum } h \\ &\text{and } n_h \text{ is the number of sampled PSUs in stratum } h \\ \end{aligned} \]

The term “population mean” seems to me to be more likely to refer to “the average of PSUs across all strata”, while the term “grand mean” seems to me more likely to refer to “the average stratum’s mean.” The use of the term “grand mean” is unfortunate since it is one of many overloaded terms in statistics: in some cases “grand mean” refers to the overall population mean (e.g., the average height in the United States), and in some cases the term “grand mean” refers to the simple mean of subgroup means (e.g., the average state’s average height).

It’s a bit confusing that both R and Stata use the terms “grand mean” in one place and “overall mean” in different pieces of documentation without clarifying the meaning or providing a formula. But I guess what I’d assume from the various pieces of documentation is that they’re referring to “the average of PSUs across all strata.”

What is the software actually doing?

Given the ambiguities in the documentation, it’s worth actually experimenting with the software to figure out what it’s doing. By setting up a toy example, we can figure out precisely what these two software options are doing and what they mean by “population mean” or “grand mean.”

Here’s an example dataset with a singleton stratum created by design, where sampling is treated as “with-replacement.”

Stratum PSU Design_Weight Sex Age
1 1 10.5 M 30.5
1 2 10.5 F 40.5
2 3 20.0 F 35.0
2 4 20.0 M 52.0
3 5 15.0 M 44.0

And here are key statistics used for estimating variance contributions:

\[ \textbf{Weighted PSU totals} \\ \begin{aligned} y_{11.} &= 320.25 \\ y_{12.} &= 425.25 \\ y_{21.} &= 700.00 \\ y_{22.} &= 1040.00 \\ y_{31.} &= 660.00 \\ \end{aligned} \]

\[ \textbf{Stratum Means}\\\textbf{(of weighted PSU totals)} \\ \begin{aligned} \bar{y}_{1..} &= 372.75 \\ \bar{y}_{2..} &= 870.00 \\ \bar{y}_{3..} &= 660.00 \\ \end{aligned} \]

\[ \begin{aligned} \textbf{Average}&\textbf{ PSU across all strata} \\ \hat{Y}/n &= 3145.5 / 5 = 629.1 \\ \textbf{Average}&\textbf{ stratum's}\textbf{ mean} \\ \frac{1}{H}\sum_{h=1}^{H} \frac{\hat{Y}_h}{n_h} &= \frac{372.75+870+660}{3} = 634.25 \end{aligned} \]

Expected results

If we ignore the singleton stratum, the estimated variance is:

\[ \begin{aligned} v_{naive}(\hat{Y}) &= 2 \times \left[(320.25 - 372.75)^2 + (425.25 - 372.75)^2\right] \\ &+ 2 \times \left[ (700 - 870)^2 + (1040 - 870)^2 \right] \\ &= 126,625 \end{aligned} \]

For the singleton stratum, the variance contribution can be estimated as either:

\[ \begin{aligned} \textbf{Option 1: } & \text{ Use the deviation from the average of PSUs across all strata} \\ & \left( y_{{h^{\star}}i} - \hat{Y}/n \right)^2 \\ &= (660 - 629.1)^2 \\ &= 954.81 \\ \textbf{Option 2: } & \text{ Use the deviation from the average stratum's mean} \\ & \left( y_{{h^{\star}}i} - \frac{1}{H}\sum_{h=1}^{H} \frac{\hat{Y}_h}{n_h} \right)^2 \\ &= (660 - 634.25)^2 \\ &= 663.0625 \end{aligned} \]

So for option 1, the estimated variance would be \(127,579.8\), while for option 2 the estimated variance would be \(127,288.1\).

What Stata actually does

Let’s load this example dataset into Stata:

* Write example dataset
input Stratum PSU Design_Weight Age
1 1 10.5 30.5
1 2 10.5 40.5
2 3 20.0 35.0
2 4 20.0 52.0
3 5 15.0 44.0
end

Now we’ll get the “naive” variance estimate by ignoring the variance contribution from the singleton stratum:

svyset PSU [pweight=Design_Weight], strata(Stratum) singleunit(certainty)

svy: total Age
. estat vce
Survey: Total estimation

Number of strata =       3        Number of obs   =          5
Number of PSUs   =       5        Population size =         76
                                  Design df       =          2

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         Age |     3145.5   355.8441      1614.427    4676.573
--------------------------------------------------------------
Note: Strata with single sampling unit treated as certainty
      units.

. . estat vce

Covariance matrix of coefficients of total model

        e(V) |        Age 
-------------+------------
         Age |     126625 

That matches what we got by hand (\(126,625\)). Now let’s use the recentering option:

svyset PSU [pweight=Design_Weight], strata(Stratum) singleunit(centered)

svy: total Age
. estat vce
Survey: Total estimation

Number of strata =       3        Number of obs   =          5
Number of PSUs   =       5        Population size =         76
                                  Design df       =          2

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         Age |     3145.5    526.837      878.7032    5412.297
--------------------------------------------------------------
Note: Strata with single sampling unit centered at overall
      mean.

. . estat vce

Covariance matrix of coefficients of total model

        e(V) |        Age 
-------------+------------
         Age |  277557.25 

What??? The variance estimate from Stata is humongous: it’s over twice as large as what we calculated by hand using either of the recentering options we described earlier. This output is saying that the estimated variance contribution from the singleton stratum is \(277,557.25-126,625\), or \(150,932.2\) What on earth is Stata doing?

After a little trial-and-error, I found that this estimated variance contribution is equal to the following:

\[ \begin{aligned} &(y_{31.} - \frac{\hat{Y}}{H})^2 \\ &= (660 - \frac{3145.5}{3})^2 \\ &= (660 - 1048.5)^2 \\ &= 150,932.2 \end{aligned} \]

In other words, here Stata isn’t recentering around the average PSU total or the average stratum mean; it’s recentering around the average stratum total. I can’t think of any reason why you would want to do that, and it’s certainly not what you would expect from the documentation.

What R actually does

Ok, well, let’s see if R does what we’d expect.

# Load example dataset
df_w_singleton <- data.frame(
  Stratum = c(1, 1, 2, 2, 3),
  PSU = c(1, 2, 3, 4, 5),
  Design_Weight = c(10.5, 10.5, 20, 20, 15),
  Sex = c("M", "F", "F", "M",  "M"),
  Age = c(30.5, 40.5, 35, 52, 44),
  Height = c(6.2, 5.0, 5.3, 5.7, 5.5)
)

library(survey) # Version 4,2

First, we’ll get the “naive” variance estimate.

options("survey.lonely.psu" = "remove")

survey::svydesign(
  data = df_w_singleton,
  ids = ~ PSU, strata = ~ Stratum,
  weights = ~ Design_Weight
) |> svytotal(x = ~ Age) |> vcov()
       Age
Age 126625

So far so good. Now let’s use the recentering option:

options("survey.lonely.psu" = "adjust")

survey::svydesign(
  data = df_w_singleton,
  ids = ~ PSU, strata = ~ Stratum,
  weights = ~ Design_Weight
) |> svytotal(x = ~ Age) |> vcov()
       Age
Age 562225

This variance estimate is somehow even more humongous than Stata’s. After some digging into the open-source code (a huge advantage of R over Stata), I found that the {survey} package isn’t recentering around any of the averages, but is instead recentering around zero. So the variance contribution from the singleton stratum is \(562,225 - 126,626=435,600\).

\[ \begin{aligned} &(y_{31.} - 0)^2 \\ &= (660 - 0)^2 \\ &= 435,600 \end{aligned} \]

I can’t see why we would want to do that, particularly when estimating the variance of an arbitrary variable’s estimated total. For influence functions and estimating equations of more complex statistics, one could argue that the estimated population total is in fact zero, so by recentering around zero we in fact are recentering around the estimated population total. But that logic doesn’t work for population totals of simple variables such as income. For an arbitrary variable’s total, the weighted sum of the influence functions is simply the estimated total, and so we do in fact need to recenter around something other than zero. So in short, there’s definitely a bug if you’re using svytotal(), but things are probably working OK if you’re using svymean() or svyglm().

Even so, I think it’s probably safest for the survey package’s general-purpose variance estimation functions (survey:::multistage() and friends) to in general actually recenter around something, even if it’s a small number such as 6.625559e-16.

Recap

In all, the R and Stata documentation describe what in theory is a reasonable method for estimating variance contributions from singleton strata, but the documentation is ambiguous about how exactly they implement that method. But we can make some reasonable guesses about how that method should be implemented. But neither R nor Stata do anything reasonable here when it comes to estimating variances of population totals, and what they do is not what you would guess from the documentation. However, R does seem to be doing the right thing for statistics other than totals.

With R, because it’s open-source we can precisely see what’s going on in the code to figure this out. With Stata, our only recourse was to do some trial-and-error. I’ve emailed Thomas Lumley (the author and maintainer of the {survey} package) to raise this issue. For Stata, I don’t know who to contact about this apparent bug. If any reader here knows who at Stata to contact, I’d appreciate it if you shared this with them.

My guess as to why these bugs have gone undiagnosed for so long is that few people understand what the recentering methods are actually supposed to do, since the documentation around them is ambiguous and these methods aren’t described in methods literature as far as I’m aware.