What we estimate when we estimate regression models with survey data

I explain the reasoning behind why survey software estimates regression parameters and their standard errors the way it does.

statistics
regression
surveys
sampling
Author

Ben Schneider

Published

March 28, 2021

In the survey data context, the motivating idea for estimating regression model parameters is simple: you have a sample from a specific finite population, and you want to use your sample to understand what would happen if you fit your model using all the data in the specific finite population. If you have a sample of 100 from a population of 1,000 people, you use your sample to approximate the model you would fit to the full population of 1,000. This sounds simple, but it raises subtle questions. How and why would we fit a model to a population? And if we had the data from the entire population, would there even be any need for a variance estimate–is there even any variance or uncertainty to account for at that point? The calculations used by survey software are implicitly based on particular answers to these questions.

In this post, I explain the reasoning behind why survey software estimates regression parameters and their standard errors the way it does. The only math/stats background you need for this post is some basic familiarity with regression and Calculus 101, but some familiarity with the concept of likelihood functions would also be helpful.

Fitting a model to a population rather than a sample

In traditional statistics, this isn’t even conceptually possible. Your population is some model, such as a Normal distribution characterized by a given mean and standard deviation, \(\mathcal{N}(\mu,\sigma^2)\), which generates a set of observations that you call your sample, but which could always theoretically generate more observations.

However, in the real-life applications which are the focus of most surveys, your population isn’t a model but, well, an actual population: a specific list of 4,328 students teachers in a specific school, for example. It’s actually conceivable that you could survey every single one of those people. If you did, what possible use would you have for a model?

Two good uses come to mind:

  1. You want a simple summary of your population that clearly describes aggregate patterns. For example, you could use a regression model to provide a summary such as “average wages among similarly-employed and similarly-educated Black and White workers differ by 10%.”

  2. You want to model a process or phenomenon that applies beyond a highly specific set of individuals at a highly specific point in time. For example, you might want to model job satisfaction as a function of wages, and use this model to inform decisions that will be made over the next five years.

Oftentimes, a model developed for one use can double as a good model for the other use. For example, if there’s a strong theoretical model linking blood pressure to smoking (e.g. hypertension rates increase 5% for every year a person smokes), then that model might be a good description of aggregate patterns in a specific hospital’s patient population at a specific point in time. Similarly, the aggregate pattern in the current specific hospital’s patient population might be representative of aggregate patterns in another hospitals’ patients at a different point in time.

Survey statistics is primarily concerned with developing models for the first use, summarizing your specific population. As such, in survey analysis software, your usual tools for quantifying uncertainty (confidence intervals, standard errors, etc.) simply answer the following question: when you fit the model using sample data, how confident are you that your estimates are similar to the numbers you’d get if you fit your model to the entire finite population?

How do we determine which numeric values to use for our model parameters?

In traditional statistics, there are multiple ways of estimating an unknown parameter in a statistical model. For instance, if \(Y = \beta_0 + \beta_1X + \epsilon\) is a regression model (say, predicting individual’s wages given their age), we can estimate the unknown parameters \((\beta_0, \beta_1)\) using the method of ordinary least-squares. By far the most popular method of parameter estimation is maximum likelihood estimation, and for good reason. In addition to being relatively intuitive, statistical theory has shown that it has excellent properties under fairly-plausible assumptions. For large samples, the maximum likelihood estimate is likely to be close to the true model’s parameter value, and the probability of it being wrong by any given amount can be modeled reasonably well using a Normal distribution. So it seems like the maximum likelihood method might also be a good way to fit a model to our finite population too, right?

Well, there’s a conceptual problem with that. All of the justification and theorems for maximum likelihood estimation are based on the idea that you’re making inference to a theoretical probability model, such as the standard probability model used in OLS regression: \(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\) where \(\epsilon\) is generated by the Normal probability distribution, \(\mathcal{N}(0,\sigma^2)\). More importantly, these probability models supply the core ingredient of the maximum likelihood method: the likelihood function, which is a combination of data and a probability model. But if our primary target of inference is a numerical finite population summary rather than a theoretical model, then all we have is data and no probability model. So what are we to do?

Survey statisticians decades ago introduced a useful term to help think through this dilemma. Theoretical probability models are all we ever think about in traditional statistics (and so we just call them “the model”), but survey statisticians called these “superpopulation models” (or just “superpopulation” for short), to highlight the fact that they’re theoretical models of a broad, amorphous population rather than a numeric description of a specific finite population.

We can think of our specific finite population as being generated by a superpopulation model. Put differently, we can think of our finite population as being but a sample from an even-larger superpopulation which is described by a probability model. In this light, we might be interested in fitting the finite population model in such a way that it also acts as a good estimate for the superpopulation model, too.

As we know, maximum likelihood estimation is great for superpopulation models, where the beautiful mathematical theorems apply. Plus, in the important special case of ordinary linear regression, the maximum likelihood method gives results that match the intuitive least-squares method of fitting a regression model.

This suggests the following approach for fitting a descriptive model to our finite population: pick a probability model for a superpopulation that theoretically generated the values observed in our actual specific finite population, combine it with the actual specific data from our finite population to form a likelihood function, and then maximize that likelihood function.

In other words, we choose parameter values that maximize the likelihood of producing the values that exist in the entire finite population, if it had been generated by a hypothesized superpopulation model. Here’s an example for OLS regression.

\[ \begin{aligned} \textit{Superpopulation Probability Model:} \\ Y_i &= \beta_0 + \beta_1X_i + \epsilon_i \\ \epsilon &\sim \mathcal{N}(0,\sigma^2) \\ \\ \textit{Likelihood Function of the}&\textit{ Superpopulation:} \\ L(\beta_0,\beta_1) =L(\beta_0,\beta_1;y_1,\dots,y_N) = &\textit{Probability that }y_1,\dots,y_N\textit{ would be generated} \\ &\textit{by the superpopulation model} \\ &\textit{if the values of the unknown parameters were }\beta_0 \textit{ and } \beta_1 \\ \\ \textit{Maximum Likelihood Estimate:} \\ B_0,B_1 &\textit{ such that } L(B_0,B_1) \textit{ is greater } \\ &\textit{than for any other choice of values} \\ \\ \textit{Finite Population Descriptive Model:} \\ Y_i &= B_0 + B_1X_i + e_i\\ &\textit{where }e_i \textit{ is the error} \\ &\textit{with an average absolute value of say, }4.68 \end{aligned} \]

The values in our finite population \(y_1,\dots,y_N\) are fixed, and we maximize the likelihood by considering various values of \(\beta_0\) and \(\beta_1\), until we obtain values \(B_0\) and \(B_1\) that maximize the likelihood.

This population likelihood function \(L(\beta_0,\beta_1)\) is referred to as the “census likelihood function” since it’s calculated using a census of the entire finite population. Now, if we’re primarily using the model to describe our finite population–which is what survey analysis software assumes by default–then we call the population likelihood a “pseudolikelihood” function. Here, the adjective “pseudo” is used to emphasize the fact that the function is only a tool for producing a numeric description of the finite population. It would be a “true” likelihood if instead we genuinely thought of our finite population as a sample from some theoretical superpopulation model and were primarily trying to make inferences about that superpopulation model.

Because survey analysis regression software typically focuses only on describing a finite population, I’ll stick with the term “pseudolikelihood” for the rest of this post.

Maximizing a pseudo-likelihood

How do we go about maximizing a pseudo-likelihood? One way we could go about maximizing the function is to maximize the log-transformed version of the function, which we’ll call the pseudo-loglikelihood.

\[ \ell(\beta_0,\beta_1;y_1,\dots,y_N) = log( L(\beta_0,\beta_1;y_1,\dots,y_N)) \]

Oftentimes the mathematics of this is easier (as we’ll see in a minute), and the values \(\beta_0,\beta_1\) that maximize the pseudo-loglikelihood are guaranteed to be the same values that maximize the pseudo-likelihood. This is because of the nice property of logarithms that if \(f(x) < f(y)\), then \(log(f(x)) < log(f(y))\), and the logarithm is always defined since the likelihood function’s value is always positive.

From basic calculus, we know that the maximum of a function can (often) be found by calculating the first derivatives of the function with respect to the parameters and finding values of the parameters such that the first derivatives equal zero. And in the case of many common likelihood functions used for regression, this is a sure-fire way to obtain the maximum. So that’s what we’ll do here. The first derivatives of the pseudo-loglikelihood with respect to the parameters are called the “score”, or “the score function.”

\[ \begin{aligned} U\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) = \begin{bmatrix} \frac{\partial}{\partial \beta_0} \ell(\beta_0,\beta_1;y_1,\dots,y_N) \\ \frac{\partial}{\partial \beta_1} \ell(\beta_0,\beta_1;y_1,\dots,y_N)\end{bmatrix} \end{aligned} \]

So the pseudo maximum likelihood parameters \(B_0\) and \(B_1\) are given by the solution to the following “score equation”:

\[ \begin{aligned} B_0, B_1 \textit{ such that:}& \\ \\ U\left(\begin{bmatrix} B_0 \\ B_1\end{bmatrix}\right) &= \begin{bmatrix} 0 \\ 0\end{bmatrix} \end{aligned} \]

For simple linear regression, solving the score equation implies a simple formula for \(B_0\) and \(B_1\). However, for essentially every other common model (e.g. a logistic regression model), there is no such simple formula. Instead, a numeric optimization algorithm is used to find values \(B_0\) and \(B_1\) that solve the score equation.

Estimating the finite population model parameters with only a sample

If we only have a sample from the finite population, then we have to estimate the score function \(U\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right)\) with our sample data. That is, we need to estimate for each potential value of \(\beta_0\) and \(\beta_1\) what the value of the population score function would be for those values. This gives us an *estimated* score function: \(\hat{U}\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right)\).

Then our parameter estimates are given by solving the *estimated* score equation:

\[ \begin{aligned} &\hat{B}_0, \hat{B}_1 \textit{ such that:} \\ \\ &\hat{U}\left(\begin{bmatrix} \hat{B}_0 \\ \hat{B}_1 \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0\end{bmatrix} \\ &\textit{where }\hat{U} \textit{ is } \\ &\textit{the estimate of }\space U \\ &\textit{from the sample} \\ \end{aligned} \]

Importantly, all of the uncertainty in our estimation for the model comes from having to estimate the score function and score equation rather than simply calculate it using the values from the entire finite population. In typical survey software, our confidence intervals and standard errors for \(\hat{B}_0\) and \(\hat{B}_1\) only reflect uncertainty caused by substituting \(\hat{U}\) for \(U\). There’s no superpopulation model involved: if we had a 100% sample from our finite population, our margin of error would be zero. That said, let’s get back to focusing on how we estimate and solve the score equation.

Solving the estimated score equation is straightforward: we use a numeric optimization algorithm or–in the case of simple linear regression–we do some algebra to obtain a simple formula. But how do we estimate the score function? You’ll be surprised how simple this usually is.

Estimating the score function from a sample

One of the easiest things to estimate is a population total. For example, let’s say you drew a simple random sample of \(n\) people from a population of \(N\) people and you measured the income of each person in the sample and took the average, \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i\). Then you can estimate the total income in the population as \(\hat{Y} = N \times\bar{y} = \sum_{i=1}^{n} \frac{N}{n} y_i\). Easy as can be! If you have something other than a simple random sample, you can replace \(\frac{N}{n}\) in that formula with a weight specific to each person: \(\hat{Y} = \sum_{i=1}^{n}w_i \times y_i\). Still simple!

It turns out that for many common models, we can write the population score as a population total and estimate it accordingly. For any particular values of \(\beta_0\) and \(\beta_1\), the value of the population score function is the sum of the scores of each element in the population.

\[ \begin{aligned} U\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) = \sum_{i=1}^{N} u_i\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) \end{aligned} \]

So we can simply estimate \(U\) the same way we’d estimate any population total, using survey weights:

\[ \begin{aligned} \hat{U}\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) = \sum_{i=1}^{n} w_i \times u_i\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) \end{aligned} \]

The usual intuition of sampling weights still applies: the weight for each observation in the sample represents the number of people in the population who are represented by that observation.

I’ll show you why this works.

Why the score function is simply a population total

For many of the models we’d like to fit, the pseudolikelihood function is motivated by a superpopulation model where observations are assumed to be generated independently. For likelihood functions based on assumed independence, a beautiful thing happens: the big likelihood function can be broken down into a product of likelihood functions for each value in the population:

\[ L(\beta_0,\beta_1;y_1,\dots,y_N)=L(\beta_0,\beta_1;y_1)\times\dots\times L(\beta_0,\beta_1;y_N) \]

If the pseudolikelihood has this property, then we can also rewrite the pseudo-loglikelihood function in a much simpler way.

\[ \begin{aligned} \ell(\beta_0,\beta_1;y_1,\dots,y_N) &= log( L(\beta_0,\beta_1;y_1,\dots,y_N)) \\ &=log \left[ \space L(\beta_0,\beta_1;y_1)\times\dots\times L(\beta_0,\beta_1;y_N) \space \right] \\ \\ &\space\space\space\space\space\space\textit{And since }log(a\times b)=log(a)+log(b) \\ \\ &= log[L(\beta_0,\beta_1;y_1)]+\dots+ log(L(\beta_0,\beta_1;y_N) ) \\ &= \ell(\beta_0,\beta_1;y_1)]+\dots+ \ell(\beta_0,\beta_1;y_N) \\ &=\sum_{i=1}^{N} \ell(\beta_0,\beta_1;y_i) \end{aligned} \]

As a result of this and the sum rule for derivatives we learned in Calculus 101, we can thus write the score function for the population as a sum of the scores from each population element.

\[ \begin{aligned} U\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) = \begin{bmatrix} \frac{\partial}{\partial \beta_0} \ell(\beta_0,\beta_1;y_1,\dots,y_N) \\ \frac{\partial}{\partial \beta_1} \ell(\beta_0,\beta_1;y_1,\dots,y_N)\end{bmatrix} \\ \\ =\begin{bmatrix} \frac{\partial}{\partial \beta_0} \sum_{i=1}^{N} \ell(\beta_0,\beta_1;y_i) \\ \frac{\partial}{\partial \beta_1} \sum_{i=1}^{N} \ell(\beta_0,\beta_1;y_i)\end{bmatrix} \\ \\ =\begin{bmatrix} \sum_{i=1}^{N} \frac{\partial}{\partial \beta_0} \ell(\beta_0,\beta_1;y_i) \\ \sum_{i=1}^{N} \frac{\partial}{\partial \beta_1}\ell(\beta_0,\beta_1;y_i)\end{bmatrix} \end{aligned} \]

That’s why we can estimate the population score function with just the scores of the population elements in our sample:

\[ \begin{aligned} U\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) &= \begin{bmatrix} \sum_{i=1}^{N} \frac{\partial}{\partial \beta_0} \ell(\beta_0,\beta_1;y_i) \\ \sum_{i=1}^{N} \frac{\partial}{\partial \beta_1}\ell(\beta_0,\beta_1;y_i)\end{bmatrix} \\ \\ &\implies \\ \\ \hat{U}\left(\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix}\right) &= \begin{bmatrix} \sum_{i=1}^{n} w_i \times \frac{\partial}{\partial \beta_0} \ell(\beta_0,\beta_1;y_i) \\ \sum_{i=1}^{n} w_i \times \frac{\partial}{\partial \beta_1}\ell(\beta_0,\beta_1;y_i)\end{bmatrix} \\ \end{aligned} \]

Estimating the sampling variance of parameter estimates

In all of this discussion, the entire source of uncertainty in our samples is the fact that we only sample a subset of the finite population, and thus that we have to use a sample-based estimate of the finite population score function, \(\hat{U}\) instead of the actual finite population score function \(U\). When survey software calculates a confidence interval, variance, or standard error for a regression model parameter, that’s what it’s accounting for.

Now, how do we estimate this sampling variance?

Since \(\hat{U}\) is simply a population total, it’s very easy to estimate its sampling variance. Unfortunately, though, we’re not really interested in the standard error of \(\hat{U}\). What we really want is to quantify the uncertainty of \(\hat{B}_0\) and \(\hat{B}_1\). Intuitively, though, since the estimates \(\hat{B}_0\) and \(\hat{B}_1\) are in a sense byproducts of \(\hat{U}\), we know that if \(\hat{U}\) is estimated with a great deal of uncertainty, then the estimates \(\hat{B}_0\) and \(\hat{B}_1\) will be too. Knowing the sampling variance of \(\hat{U}\) tells us something about the sampling variance of the parameter estimates, and we can take advantage of that fact. If we know how our estimates of \(B_0\) and \(B_1\) change as a function of \(\hat{U}\), then we can plug the sampling variance of \(\hat{U}\) into a function that will approximately estimate the sampling variance of \(\hat{B}_0\) and \(\hat{B}_1\). That’s the basic idea for how survey analysis software estimates the variance.1

We start by writing out approximately how \(\hat{U}\) changes as a function of \(\hat{B}_0\) and \(\hat{B}_1\). To accomplish this, we’ll calculate a Taylor series approximation to the function \(\hat{U} \left(\begin{bmatrix}b_0 \\ b_1\end{bmatrix} \right)\) about the point \(\begin{bmatrix}B_0 \\ B_1\end{bmatrix}\), and then we’ll invert that approximation to determine how \(\hat{B}_0\) and \(\hat{B}_1\) change as a function of \(\hat{U}\).

First, we’ll introduce some notation to make our lives easier:

\[ \begin{aligned} \begin{bmatrix} \hat{U}_0(\hat{B}_0) \\ \hat{U}_1(\hat{B}_1) \end{bmatrix} &= \hat{U} \left(\begin{bmatrix}\hat{B}_0 \\ \hat{B}_1\end{bmatrix}\right) = \hat{U}(\mathbf{\hat{B}}) \\ \end{aligned} \]

\[ \begin{aligned} \textit{Denote: }\space J_{00} &=\left.{ \frac{\partial \hat{U}_0(b_0)}{\partial b_0}}\right|_{b_0=B_0} \\ \space J_{01} &=\left.{ \frac{\partial \hat{U}_0(b_1)}{\partial b_1}}\right|_{b_1=B_1} \\ \space J_{10} &=\left.{ \frac{\partial \hat{U}_1(b_0)}{\partial b_0}}\right|_{b_0=B_0} \\ J_{11} & =\left.{ \frac{\partial \hat{U}_1(b_1)}{\partial b_1}}\right|_{b_1=B_1} \\ \\ \mathbf{J_B} &= \begin{bmatrix} J_{00} \ J_{01} \\ J_{10} \ J_{11} \end{bmatrix} \end{aligned} \]

With this notation, we can write our Taylor series approximation as follows:

\[ \begin{aligned} \hat{U}(\mathbf{\hat{B}}) &\approx \hat{U}(\mathbf{B}) + \mathbf{J_B} (\mathbf{\hat{B} - B}) \\ \end{aligned} \]

Now you might recall that the left-hand side of this equation equals zero; after all, the values of \(\hat{B}_0\) and \(\hat{B}_1\) were chosen precisely for this reason. So based on this approximation, we have that:

\[ \begin{aligned} \mathbf{0} &\approx \hat{U}(\mathbf{B}) + \mathbf{J_B} (\mathbf{\hat{B} - B}) \\ \\ \hat{U}(\mathbf{B}) &\approx -\mathbf{J_B} (\mathbf{\hat{B} - B}) \\ \end{aligned} \]

Since \(\mathbf{B}\) and \(\mathbf{J_B}\) are constants, we have that:

\[ \begin{aligned} Var \left[ \hat{U} (\mathbf{B})\right] &\approx Var\left[ -\mathbf{J_B} (\mathbf{\hat{B} - B}) \right] \\ &= Var\left[ \mathbf{J_B} (\mathbf{\hat{B} - B}) \right] \\ &= Var\left[ \mathbf{J_B} \mathbf{\hat{B}} - \mathbf{J_B B} \right] \\ &= Var\left[ \mathbf{J_B} \mathbf{\hat{B}} \right] \\ &=\mathbf{J_B} Var \left[ \mathbf{\hat{B}} \right] \mathbf{J_B}^{T} \\ \end{aligned} \]

To use somewhat more traditional statistics notation, this means:

\[ \begin{aligned} \Sigma_{\hat{U}(\mathbf{B}) } \approx \mathbf{J_B} \Sigma_{\hat{B}} \mathbf{J_B}^T \end{aligned} \]

From some basic linear algebra, then, we have that:

\[ \Sigma_{\hat{B}} \approx [\mathbf{J}_{\mathbf{B}}^{-1}] \Sigma_{\hat{U}(\mathbf{B})} [\mathbf{J}_{\mathbf{B}}^{-1}]^T \]

Up until now, we’ve assumed that \(\Sigma_{\hat{U}(\mathbf{B})}\) and \(J_{\mathbf{B}}\) are known quantities. However, their values depend on the unknown true parameters:

\[\mathbf{B}=\begin{bmatrix}B_1 \\ \vdots \\B_p\end{bmatrix}\]

So to actually estimate \(\Sigma_{\hat{B}}\) using this approach, we need to plug in estimates of \(\mathbf{B}\). For this, we just plug in \(\hat{\mathbf{B}}\), obtained with the process we’ve seen above. For \(\mathbf{J_B}\), we evaluate the partial derivatives of \(\hat{U}\) at the point \(\hat{\mathbf{B}}\). For parametric models–such as OLS or the standard logistic regression model–the form of the partial derivative is easy to calculate.

Putting it all together, this gives us the standard estimator:

\[ \hat{\Sigma}_{\hat{B}} \approx [\mathbf{\hat{J}}_{\mathbf{B}}^{-1}] \hat{\Sigma}_{\hat{U}(\mathbf{B})} [\mathbf{\hat{J}}_{\mathbf{B}}^{-1}]^T \]

This is typically referred to as a “sandwich estimator”, with \(\mathbf{\hat{J}}_{\mathbf{B}}^{-1}\) acting as the ‘bread’ of the sandwich and \(\hat{\Sigma}_{\hat{U}(\mathbf{B})}\) forming the ‘meat’ of the sandwich..

More verbosely, this amounts to:

\[ \begin{aligned} \begin{bmatrix} var(\hat{B}_1) \ \dots \ \ cov(\hat{B}_1, \hat{B}_p) \\ \vdots \\ cov(\hat{B}_1, \hat{B}_p) \ \dots \ var(\hat{B}_p)\end{bmatrix} &\approx \left[\mathbf{\hat{J}}_{\mathbf{B}}^{-1} \right] \begin{bmatrix} var(\hat{U}_1(\hat{B}_1)) \ \dots \ \ cov(\hat{U}_1(\hat{B}_1), \hat{U}_p(\hat{B}_p)) \\ \vdots \\ cov(\hat{U}_1(\hat{B}_1), \hat{U}_p(\hat{B}_p)) \ \dots \ var(\hat{U}_p(\hat{B}_p))\end{bmatrix} \left[\mathbf{\hat{J}}_{\mathbf{B}}^{-1} \right]^T \\ \\ \\ \textit{Where:}&\\ \mathbf{\hat{J}}_{\mathbf{B}} &= \begin{bmatrix} \left.{ \frac{\partial \hat{U}_1(b_1)}{\partial b_1}}\right|_{b_1=\hat{B}_1} \ \dots \ \left.{ \frac{\partial \hat{U}_1(b_p)}{\partial b_p}}\right|_{b_p=\hat{B}_p} \\ \vdots \\ \left.{ \frac{\partial \hat{U}_p(b_1)}{\partial b_1}}\right|_{b_1=\hat{B}_1} \ \dots \ \left.{ \frac{\partial \hat{U}_p(b_p)}{\partial b_p}}\right|_{b_p=\hat{B}_p} \end{bmatrix} \end{aligned} \]

And that’s it- that’s your standard variance estimator.

(Click here for an explanation of the matrix algebra)

Recall that for two random variables \(X_1,X_2\) and two constants \(c_1,c_2\), with \(Y = c_1X_1 + c_2X_2\) we have the following property:

\[Var(Y) = Var(c_1X_1 + c_2X_2) = c_1^2 Var(X_1) + c_2^2Var(X_2) + 2c_1c_2Cov(X_1,X_2)\]

For a larger number of random variables, we generalize this using matrix notation:

\[ \begin{aligned} Var(Y) =Var(c_1X_1 + \dots + c_pX_p) &= \begin{bmatrix}c_1 \\ \vdots \\ c_p \end{bmatrix}^T \begin{bmatrix} Var(X_1) \ \dots \ Cov(X_1,X_p) \\ \vdots \\ Cov(X_p,X_1) \ \dots \ Var(X_p) \end{bmatrix} \begin{bmatrix}c_1 \\ \vdots \\ c_p \end{bmatrix} \end{aligned} \]

When \(Y\) is not just a single value but a vector of \(p\) values, we further generalize it using the matrix notation:

\[ \begin{aligned} Var(\mathbf{Y}) &= \begin{bmatrix} Var(Y_1) \\ \vdots \\ Var(Y_p) \end{bmatrix} = \begin{bmatrix} Var(c_{11}X_1 + \dots + c_{1p}X_p) \\ \vdots \\ Var(c_{p1}X_1 + \dots + c_{pp}X_p) \end{bmatrix} \\ \\ \\ &=\begin{bmatrix}c_{11} \ \dots \ c_{1p}\\ \vdots \\ c_{p1} \ \dots c_{pp} \end{bmatrix} \begin{bmatrix} Var(X_1) \ \dots \ Cov(X_1,X_p) \\ \vdots \\ Cov(X_p,X_1) \ \dots \ Var(X_p) \end{bmatrix} \begin{bmatrix}c_{11} \ \dots \ c_{1p}\\ \vdots \\ c_{p1} \ \dots c_{pp} \end{bmatrix}^T \\ &= \mathbf{C} \begin{bmatrix} Var(X_1) \ \dots \ Cov(X_1,X_p) \\ \vdots \\ Cov(X_p,X_1) \ \dots \ Var(X_p) \end{bmatrix}\mathbf{C}^T \\ &= \mathbf{C} \Sigma_{X}\mathbf{C}^T \end{aligned} \]

The notation \(\Sigma_X\) is commonly used to denote a matrix containing variances and covariances among a set of \(p\) random variables \(X_1, \dots, X_p\).

Footnotes

  1. At least, that’s the default. However, replication methods such as the bootstrap or balanced repeated replication can be used, too. In this approach, the regression model is refit to each replication subsample, and the variability of the estimates across the replication subsamples is used to estimate standard errors or confidence intervals.↩︎