In the survey data context, the motivating idea for estimating regression model parameters is simple: you have a sample from a specific finite population, and you want to use your sample to understand what would happen if you fit your model using all the data in that finite population. If you have a sample of 100 from a population of 1,000 people, you use your sample to approximate the model you would fit to the full population of 1,000. This sounds simple, but it raises subtle questions. How and why would we fit a model to a population? And if we had the data from the entire population, would there even be any need for a variance estimate? Is there any variance or uncertainty left to account for at that point? The calculations used by survey software are implicitly based on particular answers to these questions.
In this post, I explain the reasoning behind why survey software estimates regression parameters and their standard errors the way it does. The only math/stats background you need for this post is basic familiarity with regression and Calculus 101, though some exposure to the concept of likelihood functions will also help.
Fitting a model to a population rather than a sample
In traditional statistics, this isn't even conceptually possible. Your population is some model, such as a Normal distribution characterized by a given mean and standard deviation, and the quantities you care about are the parameters of that model. There is no finite list of individuals you could ever survey in full.
However, in the real-life applications which are the focus of most surveys, your population isn't a model but, well, an actual population: a specific list of 4,328 students and teachers in a specific school, for example. It's actually conceivable that you could survey every single one of those people. If you did, what possible use would you have for a model?
Two good uses come to mind:
1. You want a simple summary of your population that clearly describes aggregate patterns. For example, you could use a regression model to provide a summary such as “average wages among similarly-employed and similarly-educated Black and White workers differ by 10%.”
2. You want to model a process or phenomenon that applies beyond a highly specific set of individuals at a highly specific point in time. For example, you might want to model job satisfaction as a function of wages, and use this model to inform decisions that will be made over the next five years.
Oftentimes, a model developed for one use can double as a good model for the other use. For example, if there's a strong theoretical model linking blood pressure to smoking (e.g. hypertension rates increase 5% for every year a person smokes), then that model might be a good description of aggregate patterns in a specific hospital's patient population at a specific point in time. Similarly, the aggregate pattern in the current specific hospital's patient population might be representative of aggregate patterns in another hospital's patients at a different point in time.
Survey statistics is primarily concerned with developing models for the first use, summarizing your specific population. As such, in survey analysis software, your usual tools for quantifying uncertainty (confidence intervals, standard errors, etc.) simply answer the following question: when you fit the model using sample data, how confident are you that your estimates are similar to the numbers you’d get if you fit your model to the entire finite population?
How do we determine which numeric values to use for our model parameters?
In traditional statistics, there are multiple ways of estimating an unknown parameter in a statistical model. For instance, if your model says the population is Normally distributed, you could estimate its mean using the sample mean, the sample median, or the maximum likelihood estimate. In practice, maximum likelihood is the workhorse method, especially for regression models, so the natural impulse is to use maximum likelihood to fit our model to the finite population, too.
Well, there’s a conceptual problem with that. All of the justification and theorems for maximum likelihood estimation are based on the idea that you’re making inference to a theoretical probability model, such as the standard probability model used in OLS regression:
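(written here in generic notation; the exact symbols are a matter of choice)

$$
y_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \text{Normal}(0, \sigma^2)
$$

Each outcome $y_i$ is a linear combination of the predictors in $x_i$ (including a 1 for the intercept), weighted by the coefficients $\beta$, plus an independent, Normally distributed error.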
Survey statisticians decades ago introduced a useful term to help think through this dilemma. Theoretical probability models are all we ever think about in traditional statistics (and so we just call them “the model”), but survey statisticians called these “superpopulation models” (or just “superpopulation” for short), to highlight the fact that they’re theoretical models of a broad, amorphous population rather than a numeric description of a specific finite population.
We can think of our specific finite population as being generated by a superpopulation model. Put differently, we can think of our finite population as being but a sample from an even-larger superpopulation which is described by a probability model. In this light, we might be interested in fitting the finite population model in such a way that it also acts as a good estimate for the superpopulation model, too.
As we know, maximum likelihood estimation is great for superpopulation models, where the beautiful mathematical theorems apply. Plus, in the important special case of ordinary linear regression, the maximum likelihood method gives results that match the intuitive least-squares method of fitting a regression model.
This suggests the following approach for fitting a descriptive model to our finite population: pick a probability model for a superpopulation that theoretically generated the values observed in our actual specific finite population, combine it with the actual specific data from our finite population to form a likelihood function, and then maximize that likelihood function.
In other words, we choose parameter values that maximize the likelihood of producing the values that exist in the entire finite population, if it had been generated by a hypothesized superpopulation model. Here’s an example for OLS regression.
The values in our finite population are the outcomes $y_1, \dots, y_N$ and the predictors $x_1, \dots, x_N$ for all $N$ of its members. Plugging every one of those values into the likelihood implied by the superpopulation model gives us a likelihood function for the entire finite population. This population likelihood function goes by a couple of names in the survey literature, including “census likelihood” and “pseudolikelihood.”
Because survey analysis regression software typically focuses only on describing a finite population, I’ll stick with the term “pseudolikelihood” for the rest of this post.
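For the OLS example, the pseudolikelihood is just the product of Normal density contributions from every member of the finite population (a sketch, in the generic notation from earlier):

$$
L_{\text{pop}}(\beta, \sigma) \;=\; \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{ -\frac{\big(y_i - x_i'\beta\big)^2}{2\sigma^2} \right\}
$$

Note that the product runs over all $N$ members of the finite population, not just the ones who end up in our sample.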
Maximizing a pseudolikelihood
How do we go about maximizing a pseudolikelihood? One convenient approach is to maximize the log-transformed version of the function instead, which we'll call the pseudo-loglikelihood.
Oftentimes the mathematics of this is easier (as we'll see in a minute), and the values that maximize the pseudo-loglikelihood are exactly the same values that maximize the pseudolikelihood itself.
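For the OLS example, taking the log turns the big product into a sum (again, a sketch in generic notation):

$$
\ell_{\text{pop}}(\beta, \sigma) \;=\; \log L_{\text{pop}}(\beta, \sigma) \;=\; -N\log\!\big(\sigma\sqrt{2\pi}\big) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - x_i'\beta\big)^2
$$

As far as $\beta$ is concerned, maximizing this is the same as minimizing the population sum of squared residuals, which is why pseudo maximum likelihood and least squares agree in this case.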
From basic calculus, we know that the maximum of a function can (often) be found by calculating the first derivatives of the function with respect to the parameters and finding values of the parameters such that the first derivatives equal zero. And in the case of many common likelihood functions used for regression, this is a sure-fire way to obtain the maximum. So that’s what we’ll do here. The first derivatives of the pseudo-loglikelihood with respect to the parameters are called the “score”, or “the score function.”
So the pseudo maximum likelihood parameters are the parameter values that set the score function to zero; the equation “score = 0” is called the score equation.
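In symbols (still using generic notation, and suppressing the error variance, which doesn't affect where the maximum over $\beta$ occurs), the score is

$$
U(\beta) \;=\; \frac{\partial\, \ell_{\text{pop}}(\beta)}{\partial \beta}
$$

and the finite population parameter values, which I'll write as $\beta_N$, are the solution to the score equation $U(\beta_N) = 0$.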
For simple linear regression, solving the score equation implies a simple closed-form formula for the finite population intercept and slope.
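For a single predictor, those finite population values (the components of $\beta_N$) are just the least-squares formulas computed on the entire population:

$$
B_1 \;=\; \frac{\sum_{i=1}^{N}\big(x_i - \bar{X}\big)\big(y_i - \bar{Y}\big)}{\sum_{i=1}^{N}\big(x_i - \bar{X}\big)^2}, \qquad B_0 \;=\; \bar{Y} - B_1\,\bar{X}
$$

where $\bar{X}$ and $\bar{Y}$ are the population means of the predictor and the outcome.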
Estimating the finite population model parameters with only a sample
If we only have a sample from the finite population, then we can't calculate the population score function exactly; we have to estimate the score function using the sample data.
Then our parameter estimates are given by solving the *estimated* score equation:
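(in the notation above)

$$
\hat{U}\big(\hat{\beta}\big) = 0
$$

Here $\hat{U}(\beta)$ denotes a sample-based estimate of the population score function $U(\beta)$, and $\hat{\beta}$ denotes the parameter estimates that solve this estimated score equation.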
Importantly, all of the uncertainty in our estimation of the model comes from having to estimate the score function and score equation rather than simply calculating them using the values from the entire finite population. In typical survey software, our confidence intervals and standard errors for the estimated model parameters describe only this source of uncertainty: the sampling variability that comes from observing a sample rather than the full finite population.
Solving the estimated score equation is straightforward: we use a numerical optimization algorithm or, in the case of simple linear regression, do some algebra to obtain a simple formula. But how do we estimate the score function itself? You might be surprised by how simple this usually is.
Estimating the score function from a sample
One of the easiest things to estimate is a population total. For example, let's say you drew a simple random sample of 100 people from a population of 1,000 and measured some variable for each sampled person. To estimate that variable's total across the whole population, you can add up the values in your sample and multiply the result by 10, since each sampled person stands in for 10 people in the population. More generally, you can estimate a population total by multiplying each sampled value by its sampling weight and summing.
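In symbols, with $w_i$ denoting the sampling weight for sampled element $i$:

$$
T \;=\; \sum_{i=1}^{N} y_i \qquad \text{is estimated by} \qquad \hat{T} \;=\; \sum_{i \,\in\, \text{sample}} w_i \, y_i
$$

For a simple random sample of 100 people from a population of 1,000, every weight is $w_i = 1{,}000/100 = 10$.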
It turns out that for many common models, we can write the population score as a population total and estimate it accordingly. For any particular values of the model parameters, the population score function is just a total: the sum, across all elements of the population, of each element's individual contribution to the score.
So we can simply estimate the population score function the way we'd estimate any other population total: take each sampled element's score contribution, multiply it by that element's sampling weight, and sum.
The usual intuition of sampling weights still applies: the weight for each observation in the sample represents the number of people in the population who are represented by that observation.
I’ll show you why this works.
Why the score function is simply a population total
For many of the models we’d like to fit, the pseudolikelihood function is motivated by a superpopulation model where observations are assumed to be generated independently. For likelihood functions based on assumed independence, a beautiful thing happens: the big likelihood function can be broken down into a product of likelihood functions for each value in the population:
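(writing $L_i(\beta)$ for population element $i$'s individual likelihood contribution)

$$
L_{\text{pop}}(\beta) \;=\; \prod_{i=1}^{N} L_i(\beta)
$$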
If the pseudolikelihood has this property, then we can also rewrite the pseudo-loglikelihood function in a much simpler way.
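Taking logs, the big product becomes a sum of element-level terms:

$$
\ell_{\text{pop}}(\beta) \;=\; \sum_{i=1}^{N} \log L_i(\beta)
$$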
Thanks to this and the sum rule for derivatives we learned in Calculus 101, we can write the score function for the population as a sum of the scores from each population element.
That’s why we can estimate the population score function with just the scores of the population elements in our sample:
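(writing $u_i(\beta)$ for the derivative of $\log L_i(\beta)$ with respect to $\beta$, i.e. element $i$'s score contribution)

$$
U(\beta) \;=\; \sum_{i=1}^{N} u_i(\beta) \qquad \text{is estimated by} \qquad \hat{U}(\beta) \;=\; \sum_{i \,\in\, \text{sample}} w_i \, u_i(\beta)
$$

The population score is a population total of the $u_i(\beta)$ terms, so we estimate it the way we estimate any other total: weight the sampled contributions and add them up.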
Estimating the sampling variance of parameter estimates
In all of this discussion, the entire source of uncertainty in our estimates is the fact that we only sample a subset of the finite population, and thus that we have to use a sample-based estimate of the finite population score function rather than the exact population score function itself. Because that estimate varies from sample to sample, so do the parameter estimates that solve the estimated score equation, and it's this sampling variance that our standard errors need to capture.
Now, how do we estimate this sampling variance?
Since our parameter estimates are defined implicitly, as the solution to the estimated score equation, rather than by an explicit formula, we can't write down their sampling variance directly. Instead, we'll rely on a first-order Taylor series approximation. We start by writing out approximately how our estimates relate to the finite population parameters and the estimated score function.
First, we’ll introduce some notation to make our lives easier:
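(these symbols are my own shorthand, consistent with the sketches above)

- $\beta_N$: the finite population parameter values, i.e. the solution to the population score equation $U(\beta_N) = 0$.
- $\hat{\beta}$: our sample-based estimates, i.e. the solution to the estimated score equation $\hat{U}(\hat{\beta}) = 0$.
- $\hat{U}(\beta)$: the estimated score function, the weighted sum of the sampled elements' score contributions.
- $\hat{J}(\beta)$: the matrix of first derivatives of $\hat{U}(\beta)$ with respect to the parameters.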
With this notation, we can write our Taylor series approximation as follows:
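(expanding the estimated score function to first order around $\beta_N$)

$$
\hat{U}\big(\hat{\beta}\big) \;\approx\; \hat{U}\big(\beta_N\big) \;+\; \hat{J}\big(\beta_N\big)\,\big(\hat{\beta} - \beta_N\big)
$$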
Now you might recall that the left-hand side of this equation equals zero; after all, the values of $\hat{\beta}$ were chosen precisely so that the estimated score evaluated at them equals zero.
Since the left-hand side equals zero, we can rearrange the approximation to express the gap between our estimates and the finite population parameters in terms of the estimated score function.
To use somewhat more traditional statistics notation, this means:
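(rearranging the approximation above)

$$
\hat{\beta} - \beta_N \;\approx\; -\,\hat{J}\big(\beta_N\big)^{-1}\,\hat{U}\big(\beta_N\big)
$$

In other words, the estimation error behaves approximately like a constant matrix multiplied by the random vector $\hat{U}(\beta_N)$, so its variability is driven by the variability of the estimated score.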
From some basic linear algebra, then, we have that:
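(applying the rule for the variance of a matrix-times-vector product, explained at the end of this post, and treating the derivative matrix as approximately constant across samples)

$$
\text{Var}\big(\hat{\beta}\big) \;\approx\; \hat{J}\big(\beta_N\big)^{-1}\;\text{Var}\!\big(\hat{U}(\beta_N)\big)\;\Big[\hat{J}\big(\beta_N\big)^{-1}\Big]'
$$

Here $\text{Var}\!\big(\hat{U}(\beta_N)\big)$ is the sampling variance-covariance matrix of the estimated score function, which is just an estimated population total.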
Up until now, we've proceeded as if we knew the finite population parameter values and could evaluate the pieces of this expression at them. Of course, in practice we don't know the finite population parameters; estimating them was the whole point.
So to actually estimate the sampling variance, we plug our estimates in wherever the finite population parameters appear, and we estimate the variance of the estimated score function using a standard survey-sampling variance estimator for an estimated population total.
Putting it all together, this gives us the standard estimator:
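(a sketch in the same notation, with $\hat{\beta}$ plugged in everywhere the unknown $\beta_N$ appeared)

$$
\widehat{\text{Var}}\big(\hat{\beta}\big) \;=\; \hat{J}\big(\hat{\beta}\big)^{-1}\;\widehat{\text{Var}}\!\big(\hat{U}(\hat{\beta})\big)\;\Big[\hat{J}\big(\hat{\beta}\big)^{-1}\Big]'
$$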
This is typically referred to as a “sandwich estimator”, with the two outer inverse-derivative matrices serving as the “bread” and the estimated variance of the estimated score serving as the “meat”.
More verbosely, this amounts to:
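(spelling out the pieces; the exact variance formula used for the “meat” depends on the sampling design)

$$
\widehat{\text{Var}}\big(\hat{\beta}\big) \;=\; A^{-1}\, M \,\big[A^{-1}\big]', \qquad A \;=\; \sum_{i \,\in\, \text{sample}} w_i\,\frac{\partial u_i(\beta)}{\partial \beta}\bigg|_{\beta = \hat{\beta}}, \qquad M \;=\; \widehat{\text{Var}}\!\left(\sum_{i \,\in\, \text{sample}} w_i\, u_i\big(\hat{\beta}\big)\right)
$$

The matrix $A$ is the “bread” and $M$ is the “meat”: a design-based variance estimate for a weighted total, computed with whatever formula matches the sampling design (stratification, clustering, and so on).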
And that's it: that's your standard variance estimator.
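If it helps to see the whole recipe in one place, here is a minimal sketch in Python/NumPy for the weighted least squares case, assuming a with-replacement approximation to the design with no strata, clusters, or finite population correction; real survey software handles those complications, but the shape of the calculation is the same.

```python
import numpy as np

def pseudo_mle_ols(X, y, w):
    """Pseudo-ML (survey-weighted) OLS with a linearization ("sandwich")
    variance estimate. Assumes a with-replacement approximation to the
    design: no strata, clusters, or finite population correction.

    X: (n, p) design matrix (include a column of 1s for the intercept)
    y: (n,) outcomes
    w: (n,) sampling weights
    """
    n = X.shape[0]

    # Solve the estimated score equation for OLS:
    #   sum_i w_i * x_i * (y_i - x_i'beta) = 0   =>   (X'WX) beta = X'Wy
    XtWX = X.T @ (w[:, None] * X)
    beta_hat = np.linalg.solve(XtWX, X.T @ (w * y))

    # Weighted score contributions w_i * u_i(beta_hat), where
    # u_i(beta) = x_i * (y_i - x_i'beta), up to a constant that cancels
    resid = y - X @ beta_hat
    wu = w[:, None] * X * resid[:, None]

    # "Meat": estimated variance of the estimated score, treating it as an
    # estimated population total under with-replacement sampling
    wu_centered = wu - wu.mean(axis=0)
    meat = n / (n - 1) * wu_centered.T @ wu_centered

    # "Bread": (minus) the derivative of the estimated score is X'WX;
    # the minus signs cancel in the sandwich
    bread = np.linalg.inv(XtWX)

    # Sandwich variance estimate for beta_hat
    var_beta = bread @ meat @ bread
    return beta_hat, var_beta
```

A fuller treatment would replace the simple with-replacement “meat” with a variance estimator matched to the actual design, or use a replication method as described in the footnote.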
An explanation of the matrix algebra
Recall that for two random variables $X$ and $Y$ and constants $a$ and $b$, the variance of the linear combination $aX + bY$ is $a^2 \, \text{Var}(X) + b^2 \, \text{Var}(Y) + 2ab \, \text{Cov}(X, Y)$.
For a larger number of random variables, we generalize this using matrix notation:
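(collecting all the variances and covariances of a random vector $\mathbf{x} = (X_1, \dots, X_p)'$ into one matrix)

$$
\text{Var}(\mathbf{x}) \;=\;
\begin{bmatrix}
\text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_p) \\
\text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_p) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_p, X_1) & \text{Cov}(X_p, X_2) & \cdots & \text{Var}(X_p)
\end{bmatrix}
$$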
When a random vector $\mathbf{x}$ is multiplied by a constant matrix $A$, the same logic gives $\text{Var}(A\mathbf{x}) = A \, \text{Var}(\mathbf{x}) \, A'$, where $A'$ denotes the transpose of $A$. The notation $\text{Var}(\mathbf{x})$ here refers to the entire variance-covariance matrix of the random vector $\mathbf{x}$, not a single number. This is the rule that turns the linear approximation for our estimates into the sandwich form of their variance.
Footnotes
At least, that’s the default. However, replication methods such as the bootstrap or balanced repeated replication can be used, too. In this approach, the regression model is refit to each replication subsample, and the variability of the estimates across the replication subsamples is used to estimate standard errors or confidence intervals.↩︎