Systematic sampling as implicit stratification

This short post illustrates why systematic sampling is sometimes referred to as ‘implicit stratification’.

statistics
surveys
sampling
Author

Ben Schneider

Published

October 4, 2021

Almost two years ago, I started working on the Survey of Doctorate Recipients, sponsored by the National Science Foundation. Like many large national surveys, the Survey of Doctorate Recipients uses systematic sampling rather than the better-known method of simple random sampling, which is ubiquitous in statistics and data science training. In conversations about this sample design, statisticians on the project repeatedly referred to something called “implicit stratification”. The term was unfamiliar to me, and it apparently described something induced by the systematic sampling design.

This short post illustrates why systematic sampling is sometimes described as creating “implicit stratification”.

Suppose you have a list of \(N=20\) people from which you need to draw a sample of \(n=5\). To begin with, you’d sort the list by some factor such as age. Next, you’d randomly pick a starting point between 1 and 4 (i.e. between 1 and \(k=N/n\)). Your sample would consist of the person at this starting point and then every \(k\)-th person from there. This is illustrated in the following figure.

The figure is a sequence of 20 dots numbered 1 through 20, each dot representing a member of the population, which has been sorted by age from youngest to oldest. A series of arrows connect dots 3 and 7, 7 and 11, 11 and 15, and 15 and 19. The connected dots are an example of a sample selected using systematic sampling.
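The selection procedure can also be sketched in a few lines of Python. This is only a minimal illustration: the function name `systematic_sample` and the seed are my own choices for the example, and the sketch assumes the population size is an exact multiple of the sample size, as in the 20-person example.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Select n units from an already-sorted frame using a random start
    and a fixed sampling interval k = N / n."""
    N = len(frame)
    if N % n != 0:
        raise ValueError("this sketch assumes N is an exact multiple of n")
    k = N // n
    start = rng.randint(1, k)  # random start between 1 and k, inclusive
    # Take the unit at the start position, then every k-th unit after it.
    return [frame[position - 1] for position in range(start, N + 1, k)]


# Population of 20 people, labelled 1..20 and assumed sorted by age.
population = list(range(1, 21))
sample = systematic_sample(population, n=5, rng=random.Random(2021))
print(sample)
```

Whatever the random start turns out to be, the result is always five people spaced exactly four positions apart.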

As the following figure illustrates, there are only four possible samples that can be selected under this design. The first sample is the set of elements 1,5,9,13,17; the second sample is the set of elements 2,6,10,14,18; etc.

The figure is titled as follows: 'The set of all possible samples selected using the systematic sampling design.' The figure has four rows, each consisting of 20 dots numbered 1 through 20. Each dot represents a member of the population, which has been sorted by age from youngest to oldest. The four rows have the captions: 'Sample 1', 'Sample 2', 'Sample 3', and 'Sample 4'. For Sample 1, a series of curved lines connects points 1, 5, 9, 13, and 17. For Sample 2, a series of curved lines connects points 2, 6, 10, 14, and 18. For Sample 3, a series of curved lines connects points 3, 7, 11, 15, and 19. For Sample 4, a series of curved lines connects points 4, 8, 12, 16, and 20.
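Because the random start is the only source of randomness, the full set of possible samples can be listed by looping over the \(k\) possible starts. A small sketch (the variable names are just for illustration):

```python
N, n = 20, 5
k = N // n  # sampling interval: 4

# One possible sample per starting point 1..k.
all_samples = {start: list(range(start, N + 1, k)) for start in range(1, k + 1)}

for start, units in all_samples.items():
    print(f"Sample {start}: {units}")

# Every person appears in exactly one of the k possible samples,
# so each person's chance of selection is 1/k = n/N.
appearances = [u for units in all_samples.values() for u in units]
assert sorted(appearances) == list(range(1, N + 1))
```

This prints the four samples shown in the figure, starting with `Sample 1: [1, 5, 9, 13, 17]`.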

We can see in this illustration that in all of the possible samples, one and only one element is included from the set \(\{1,2,3,4\}\), one and only one element is included from the set \(\{5,6,7,8\}\), and so on. Therein lies the implicit stratification. These \(n=5\) groups of \(k=20/5=4\) elements are implicit strata induced by the systematic sampling design. This is more apparent in the following figure.

The figure is titled as follows: 'The set of all possible samples under the systematic sampling design, with implicit strata shown using vertical lines.' The figure has four rows, each consisting of 20 dots numbered 1 through 20. Each dot represents a member of the population, which has been sorted by age from youngest to oldest. The four rows have the captions: 'Sample 1', 'Sample 2', 'Sample 3', and 'Sample 4'. For Sample 1, a series of curved lines connects points 1, 5, 9, 13, and 17. For Sample 2, a series of curved lines connects points 2, 6, 10, 14, and 18. For Sample 3, a series of curved lines connects points 3, 7, 11, 15, and 19. For Sample 4, a series of curved lines connects points 4, 8, 12, 16, and 20. In each row, vertical lines are placed between dots 4 and 5, dots 8 and 9, dots 12 and 13, and dots 16 and 17.
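The same property can be checked directly. In the sketch below, `stratum_of` is a hypothetical helper that maps a position in the sorted list to its implicit stratum (positions 1 to 4 are stratum 1, positions 5 to 8 are stratum 2, and so on).

```python
N, n = 20, 5
k = N // n  # each implicit stratum holds k = 4 consecutive units


def stratum_of(position):
    """Implicit stratum (1..n) containing a 1-based position in the sorted list."""
    return (position - 1) // k + 1


# For every possible random start, record which strata the sample touches.
strata_hit = {
    start: [stratum_of(u) for u in range(start, N + 1, k)]
    for start in range(1, k + 1)
}

for start, strata in strata_hit.items():
    # Exactly one unit from each of the n implicit strata, in order.
    assert strata == [1, 2, 3, 4, 5]
    print(f"Sample {start} takes one unit from each of strata {strata}")
```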

This pattern is summarized nicely by William Cochran in the classic 1977 textbook “Sampling Techniques”:

In effect, [systematic sampling] stratifies the population into n strata, which consist of the first k units, the second k units, and so on. We might therefore expect the systematic sample to be about as precise as the corresponding stratified random sample with one unit per stratum.

(p. 205) Cochran’s “Sampling Techniques”, Third Edition
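To get a feel for Cochran's comparison, here is a rough sketch that computes the exact variance of the sample mean under three designs by listing every possible sample. The population is entirely hypothetical: I assume each person's value of the survey variable simply equals their position in the sorted list (a perfect linear trend), which is not data from any real survey.

```python
from fractions import Fraction
from itertools import combinations, product

N, n = 20, 5
k = N // n
units = range(1, N + 1)
# Hypothetical population: the value for each unit equals its sorted position.
y = {u: u for u in units}


def variance_of_mean(samples):
    """Exact variance of the sample mean across equally likely samples."""
    means = [Fraction(sum(y[u] for u in s), len(s)) for s in samples]
    center = sum(means) / len(means)
    return sum((m - center) ** 2 for m in means) / len(means)


# Systematic sampling: k equally likely samples, one per random start.
systematic = [tuple(range(start, N + 1, k)) for start in range(1, k + 1)]

# Stratified sampling with one unit drawn at random from each implicit stratum.
strata = [tuple(range(h * k + 1, h * k + k + 1)) for h in range(n)]
stratified = list(product(*strata))

# Simple random sampling without replacement: all subsets of size n.
srs = list(combinations(units, n))

v_sys = variance_of_mean(systematic)
v_strat = variance_of_mean(stratified)
v_srs = variance_of_mean(srs)
print(float(v_sys), float(v_strat), float(v_srs))  # 1.25 0.25 5.25
```

For this made-up linear population, the systematic design is much more precise than simple random sampling but less precise than drawing one unit at random per stratum. Other populations can order the three designs differently.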

In contrast to explicit stratification, the implicit stratification induced by systematic sampling has a few advantages:

  1. The strata don’t have to be defined using potentially arbitrary boundaries (e.g. Pew’s definitions of generations such as Millennials or Baby Boomers). The strata boundaries are implicitly determined by the overall population size and the overall sample size.

  2. As a result, it’s straightforward to create implicit strata based on many variables in a list sampling frame: you simply sort the sampling frame by many variables.

  3. Systematic sampling can be used even when you don’t have a list frame, such as in intercept surveys.

  4. Systematic sampling can in fact yield more precise estimates than stratified simple random sampling, even when the explicit strata in the stratified SRS are the same as the implicit strata in the systematic design. I won’t get into the details here, but Cochran’s book has a nice explanation on pages 205-207.
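The third point is easy to illustrate: selecting every \(k\)-th arrival only requires counting, not a list. The sketch below is an assumption-laden toy (the `every_kth` helper and the `visitor_` labels are invented for the example), with a generator standing in for people walking past an interviewer.

```python
import random
from itertools import islice


def every_kth(stream, k, start=None, rng=random):
    """Lazily select every k-th item from a stream, beginning at a random
    start between 1 and k. No list of the population is needed up front."""
    if start is None:
        start = rng.randint(1, k)
    return islice(stream, start - 1, None, k)


# Hypothetical stream of arrivals; in a real intercept survey these would be
# people arriving over time, not a pre-built sequence.
visitors = (f"visitor_{i}" for i in range(1, 21))
selected = list(every_kth(visitors, k=4, start=3))
print(selected)
# ['visitor_3', 'visitor_7', 'visitor_11', 'visitor_15', 'visitor_19']
```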

But implicit and explicit stratification aren’t mutually exclusive. In fact, it’s common to combine the two approaches by using systematic sampling within explicitly defined strata. That’s what the Survey of Doctorate Recipients (SDR) does. Explicit strata are formed based on PhD graduates’ field of study, gender, and race/ethnicity, and each stratum \(h\) is allocated a sample size \(n_h\) based on target margins of error for key estimates. Then within each explicitly defined stratum \(h\), the individuals listed in the sampling frame are sorted by several other variables (age, for example) and a sample of size \(n_h\) is selected using systematic sampling.
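As a rough sketch of how the two approaches combine, the Python below groups a frame into explicit strata, sorts within each stratum, and then selects systematically. Everything here is hypothetical: the frame, the field names, and the allocation are invented, and the fractional-interval selection rule (used so that \(N_h/n_h\) need not be a whole number) is a common textbook approach that I am assuming for illustration, not a description of the SDR's actual procedure.

```python
import random
from collections import defaultdict


def systematic_indices(N, n, rng):
    """Fractional-interval systematic selection of n of N positions (0-based).
    Works when N / n is not a whole number; requires n <= N."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    k = N / n
    r = rng.random() * k  # random start in [0, k)
    # min() guards against floating-point rounding at the upper edge.
    return [min(int(r + i * k), N - 1) for i in range(n)]


def stratified_systematic(frame, stratum_key, sort_key, allocation, rng):
    """Systematic sampling within explicit strata.
    allocation maps each stratum h to its sample size n_h."""
    by_stratum = defaultdict(list)
    for record in frame:
        by_stratum[stratum_key(record)].append(record)
    sample = []
    for h, n_h in allocation.items():
        # Sorting within the stratum creates the implicit strata.
        ordered = sorted(by_stratum[h], key=sort_key)
        picks = systematic_indices(len(ordered), n_h, rng)
        sample.extend(ordered[i] for i in picks)
    return sample


rng = random.Random(0)
# Hypothetical frame: 24 'biology' and 16 'physics' graduates with random ages.
frame = [
    {"id": i, "field": "biology" if i % 5 < 3 else "physics",
     "age": rng.randint(27, 70)}
    for i in range(40)
]
sample = stratified_systematic(
    frame,
    stratum_key=lambda rec: rec["field"],
    sort_key=lambda rec: rec["age"],
    allocation={"biology": 4, "physics": 3},
    rng=rng,
)
for rec in sample:
    print(rec)
```

Sorting by several variables at once only requires a tuple-valued `sort_key`, for example `lambda rec: (rec["age"], rec["id"])`.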