Establishing the influence function method’s asymptotic validity

A math-heavy post giving background and a detailed proof of the theorem that justifies the method of linearization based on influence functions for sample surveys.

math
statistics
surveys
sampling
Author

Ben Schneider

Published

October 28, 2021

Recently, I’ve written a few posts about how survey sampling variances can be estimated using the method of linearization based on influence functions. This method is quite useful, but it is hard to find a complete proof of why it works. The first paper to contain a purported proof of the method (Deville, 1999) leaves a few key parts unexplained and, I think, has an unfortunate typo or two in its proof. Two later papers (Goga, Deville, and Ruiz-Gazen, 2009; Goga and Ruiz-Gazen, 2013) contain clearer proofs in their appendices.

All of the proofs above are fairly intimidating, and so the purpose of this blog post is to add more context and detail to the proof from Goga, Deville, and Ruiz-Gazen, 2009.

Notation and Background

The method of linearization using influence functions applies to a broad class of estimators used in finite population inference which are referred to as “substitution estimators”. As the name suggests, substitution estimators are estimators calculated by pretending your (weighted) sample is the population. The weighted sample mean, for example, is a substitution estimator.

This idea is formalized using the concept of functionals. Loosely speaking, a functional is a recipe that indicates how to calculate a quantity of interest (a mean, ratio, quantile, etc.) given a population and a “measure” which assigns weight to each person in the population. One example of a measure is the “population measure”, which equals ‘1’ for each person in the population and ‘0’ for persons outside the population of interest. Another example of a measure is the “sample measure”, which equals the value of a sampling weight for each person in the sample and ‘0’ for every person not in the sample. More formally, a functional is a function whose input is another function: that input, called a “measure”, assigns a weight to each value in a space.

Measures

Let’s give a more concrete example of what we mean by measure. Suppose we have a population of three individuals, with heights and shoe sizes as described in the following table.

| ID | Height | Shoe Size |
|----|--------|-----------|
| 1  | 5.5    | 7         |
| 2  | 6.2    | 9         |
| 3  | 6.3    | 8         |

We can represent this population with a “population measure”, which is a function \(M\) that looks at a set of values and outputs ‘1’ if those values appear in the population and ‘0’ otherwise. The values used as the function’s input are triples \((x,y,z)\) in \(\mathbb{R}^3\), such as \((1,2.3,4.8)\) or \((9.5, 8.4, 6.1)\).

\[ \begin{aligned} &\text{The population measure }M \text{ is a function on } \mathbb{R}^3 \\ &\text{such that:} \\ &M((x,y,z)) = \begin{cases}1~&\text{if }(x,y,z) \in \{(1,5.5,7), (2,6.2,9),(3,6.3,8)\}~\\0~&{\text{otherwise}}~\end{cases} \\ \end{aligned} \]

Now, suppose we draw a simple random sample without replacement of size two, resulting in us selecting persons 2 and 3. Then we can represent our sample with a “sample measure”, which is a function denoted \(\hat{M}\) that spits out the sampling weight for values that appear in the sample and ‘0’ otherwise. For this sample design, the sampling weight is the same for everyone (i.e. equals \(3/2\)).

\[ \begin{aligned} &\text{The sample measure }\hat{M} \text{ is a function on } \mathbb{R}^3 \\ &\text{such that:} \\ &\hat{M}((x,y,z)) = \begin{cases}3/2~&\text{if }(x,y,z) \in \{ (2,6.2,9),(3,6.3,8)\}~\\0~&{\text{otherwise}}~\end{cases} \\ \end{aligned} \]

In general, the sample measure is based on each sample member’s “probability of inclusion” in the sample, denoted \(\pi_k\) for person \(k\), and the weight for a given sample person is \(1/\pi_k\).
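Here is a minimal Python sketch of these two measures (my own illustration; the names `population_measure` and `sample_measure` are made up, not from any survey library):

```python
# Represent the population measure M and the sample measure M-hat for the
# toy population above as plain Python functions on (ID, height, shoe size).
POPULATION = {(1, 5.5, 7), (2, 6.2, 9), (3, 6.3, 8)}
SAMPLE = {(2, 6.2, 9), (3, 6.3, 8)}  # SRSWOR of size n = 2
WEIGHT = 3 / 2                       # 1 / pi_k, with pi_k = n / N = 2 / 3

def population_measure(point):
    """M: weight 1 for points in the population, 0 elsewhere."""
    return 1 if point in POPULATION else 0

def sample_measure(point):
    """M-hat: the sampling weight for sampled points, 0 elsewhere."""
    return WEIGHT if point in SAMPLE else 0

print(population_measure((2, 6.2, 9)))  # 1
print(sample_measure((2, 6.2, 9)))      # 1.5
print(sample_measure((1, 5.5, 7)))      # 0 -- person 1 was not sampled
```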

Functionals

With a clearer idea of what a measure is, we can now look at some examples of functionals. A functional \(T\) is a function whose input is a measure such as \(M\) or \(\hat{M}\); we’d write this as \(T(M)\) or \(T(\hat{M})\). It tells us how to calculate a quantity of interest (such as a mean, ratio, or quantile) using that measure and the set of values that the measure is defined over. In the example above, that was \((x,y,z) \in \mathbb{R}^3\), where \(x\) represents a person’s ID number, \(y\) represents a person’s height, and \(z\) represents a person’s shoe size.

\[ \begin{aligned} \text{Let }T \text{ be the "mean height" functional:}\\ \text{Population mean height: }T(M) &= \frac{\int y \, dM}{\int dM} = \frac{\sum_{i\in U}y_i}{N} \\ \text{Sample mean height: }T(\hat{M}) &= \frac{\int y \, d\hat{M}}{\int d\hat{M}} = \frac{\sum_{i \in S}w_iy_i}{\sum_{i \in S} w_i} \end{aligned} \]
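Continuing the sketch from the previous section (again my own illustration, with a made-up helper `mean_height`), the functional just forms a weighted mean over whatever weights the measure assigns:

```python
# A sketch of the "mean height" functional T: given a measure and the points
# it is defined over, return (integral of y dM) / (integral of dM).
POPULATION = {(1, 5.5, 7), (2, 6.2, 9), (3, 6.3, 8)}
SAMPLE = {(2, 6.2, 9), (3, 6.3, 8)}

def mean_height(measure, points):
    total_weight = sum(measure(p) for p in points)
    weighted_sum = sum(measure(p) * p[1] for p in points)  # p[1] is height
    return weighted_sum / total_weight

pop_measure = lambda p: 1 if p in POPULATION else 0
smp_measure = lambda p: 3 / 2 if p in SAMPLE else 0

print(mean_height(pop_measure, POPULATION))  # (5.5 + 6.2 + 6.3) / 3 = 6.0
print(mean_height(smp_measure, POPULATION))  # (6.2 + 6.3) / 2 = 6.25
```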

Influence Functions

The influence function is a function denoted \(IT(M;z)\), which is defined for a given functional \(T\) and measure \(M\), and which takes as its input a value \(z\) in the set for which \(M\) is defined. Loosely speaking, it can be interpreted as a measure of how much the value of the functional \(T(M)\) would change if the population included an additional person with values \(z\) (in other words, if \(M\) assigned twice the weight to the point \(z\)).

Mathematically, it is a derivative:

\[ \begin{aligned} IT(M;z) = &\lim_{t \rightarrow 0} \frac{T(M+t\delta_z) - T(M)}{t} \\ \\ \text{where }&\delta_z \text{ is the measure on }\mathbb{R}^3 \\ \\ &\delta_z( (x_1,x_2,x_3) ) = \begin{cases} 1~&\text{if }(x_1,x_2,x_3) = z~\\ 0~&{\text{otherwise}}~ \end{cases} \end{aligned} \]

For example, consider again our tiny population with values \(\{(1,5.5,7), (2,6.2,9),(3,6.3,8)\}\) associated with measure \(M\), and the “mean height” functional \(T\), where \(T(M)\) gives the population mean height.

Then the influence function at the point \(z=(1,5.5,7)\) is given by the following:

\[ \begin{aligned} IT(M;z) = &\lim_{t \rightarrow 0} \frac{T(M+t\delta_z) - T(M)}{t} \\ \\ \text{where }&\delta_z \text{ is the measure on }\mathbb{R}^3 \\ \\ &\delta_z( (x_1,x_2,x_3) ) = \begin{cases} 1~&\text{if }(x_1,x_2,x_3) = (1,5.5,7)~\\ 0~&{\text{otherwise}}~ \end{cases} \end{aligned} \]

And we can determine its value by noting:

\[ \begin{aligned} T(M) &= \frac{ 5.5 + 6.2 + 6.3}{3} \\ T(M+t\delta_z) &= \frac{((1+t) \times 5.5)+6.2+6.3}{3 + t} \end{aligned} \]
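Filling in the algebra that the papers leave to the reader, the difference quotient simplifies nicely:

\[ \begin{aligned} \frac{T(M+t\delta_z) - T(M)}{t} &= \frac{1}{t}\left\{ \frac{(5.5+6.2+6.3) + 5.5t}{3+t} - \frac{5.5+6.2+6.3}{3} \right\} \\ &= \frac{3 \times 5.5 - (5.5+6.2+6.3)}{3(3+t)} \xrightarrow[t \rightarrow 0]{} \frac{1}{3}\left(5.5 - \frac{5.5+6.2+6.3}{3}\right) \end{aligned} \]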

Taking the limit, we find that the influence function equals the following:

\[ IT(M;z) = \frac{1}{3}\left(5.5 - \frac{5.5+6.2+6.3}{3}\right) = -\frac{1}{6} \]

And in general, the influence function of a population mean, evaluated at the point for person \(k\), is given by \(\frac{1}{N}(y_k - \bar{Y})\).
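As a quick numerical sanity check (a throwaway script of my own, not part of the proof), we can approximate the limit with a small \(t\) and compare it to the closed form:

```python
# Approximate the influence function of the "mean height" functional
# at z = (1, 5.5, 7) by evaluating the difference quotient at a small t.
heights = [5.5, 6.2, 6.3]
N = len(heights)

def T_perturbed(t):
    """T(M + t*delta_z): the mean after adding weight t at the height 5.5."""
    return (sum(heights) + t * 5.5) / (N + t)

t = 1e-6
numeric = (T_perturbed(t) - T_perturbed(0)) / t
closed_form = (5.5 - sum(heights) / N) / N  # (1/N)(y_k - Ybar) = -1/6

print(numeric)      # approximately -0.16666...
print(closed_form)  # -0.16666...
```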

Assumptions of the Theorem

The first set of assumptions (1 through 4) concerns the sample design and must hold for each variable of interest, \(\mathcal{Z}\).

  • Assumption 1. We assume that \(\lim _{N \rightarrow \infty} N^{-1} n = f\in(0,1)\).

  • Assumption 2. We assume that \(\lim _{N \rightarrow \infty} N^{-1} \int \mathcal{Z} d M\) exists, for any variable of interest \(\mathcal{Z}\).

  • Assumption 3. As \(N \rightarrow \infty\), we have that \(N^{-1}\left(\int \mathcal{Z} d \hat{M}-\int \mathcal{Z} d M\right) \rightarrow 0\) in probability, for any variable of interest \(\mathcal{Z}\).

  • Assumption 4. As \(N \rightarrow \infty,\left\{\sqrt{n} N^{-1}\left(\int \mathcal{Z} d \hat{M}-\int \mathcal{Z} d M\right)\right\} \rightarrow N(0, \Sigma)\) in distribution, for any variable of interest \(\mathcal{Z}\).

The second set of assumptions, 5 through 7, applies to the functional \(T\), requiring it to be “smooth” and bounded in certain senses.

  • Assumption 5. We assume that \(T\) is homogeneous, in that there exists a real number \(\beta>0\) dependent on \(T\) such that \(T(r M)=r^{\beta} T(M)\) for any real \(r>0\) (a worked example follows this list).

  • Assumption 6. We assume that \(\lim _{N \rightarrow \infty} N^{-\beta} T(M)<\infty\).

  • Assumption 7. We assume that \(T\) is Fréchet differentiable.


    The following definition is adapted from Huber (1981), pages 34-35.
    Definition: The functional \(T: \mathbb{M} \rightarrow \mathbb{R}\) is Fréchet differentiable at \(M \in \mathbb{M}\) if
    there exists a functional \(T_{M}^{\prime}: \mathbb{M} \rightarrow \mathbb{R}\) (linear, continuous),
    such that for any function \(H \in \mathbb{M}\), \[ \left|T(H)-T(M)-T_{M}^{\prime}(H-M)\right| = o\left[d_{\star}(M,H)\right] \] where \(d_{\star}\) is some metric on the set \(\mathbb{M}\) of relevant measures for which the following conditions hold:

    • For all \(G \in \mathbb{M}\), \(\{ F \mid d_{\star}(G,F) < \epsilon \}\) is open for all \(\epsilon > 0\)
    • If \(F_t = (1-t)F_0 + tF_1\), then \(d_{\star}(F_t,F_s)=O\left(|t-s|\right)\).
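As a quick worked example of Assumptions 5 and 6 (my own illustration, not taken from the papers), consider the population-total functional:

\[ T(M) = \int y \, dM = \sum_{i \in U} y_i, \qquad T(rM) = \int y \, d(rM) = r \int y \, dM = r^{1} \, T(M), \]

so the total is homogeneous of degree \(\beta = 1\). Assumption 6 then requires that \(N^{-\beta}T(M) = N^{-1}\sum_{i \in U} y_i\) remain finite as \(N \rightarrow \infty\), which is exactly the convergence of the population mean guaranteed by Assumption 2.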

Formal Statement of the Theorem

Let \(u_k\) denote the influence function for element \(k\) in the population (i.e. if person \(k\) has values \(z_k\), then \(u_k=IT(M;z_k)\)), and let \(v_k\) denote the sampling weight for person \(k\) in the population. If person \(k\) is in a selected sample, then \(v_k=1/\pi_k\), where \(\pi_k\) is their probability of inclusion in the sample. For persons in the population but not in the selected sample, \(v_k=0\).

If Assumptions 1 through 7 hold, then:

\[ \begin{aligned}\frac{\sqrt{n}}{N^{\beta}}\{T(\hat{M})-T(M)\} &=\frac{\sqrt{n}}{N^{\beta}} \int IT(M ; z) d\left(\hat{M}-M\right)(z)+o_{p}(1) \\&=\frac{\sqrt{n}}{N^{\beta}} \left\{\sum_{k=1}^{N} u_{k}\left(v_{k}-1\right)\right\}+o_{p}(1) \end{aligned} \]

and so the asymptotic variance of \(T(\hat{M})\) is equal to the variance of \(\sum_{k=1}^{N} u_{k}\left(v_{k}-1\right)\).

In short, asymptotically, the sampling variance of the sample statistic \(T(\hat{M})\) is equal to the sampling variance of the weighted sample sum \(\sum_{k \in s}u_kv_k\).
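To see the conclusion in action, here is a small Python check (my own illustration) that enumerates every possible SRSWOR sample of size two from the toy population and compares the variance of \(T(\hat{M})\) with the variance of \(\sum_{k=1}^{N} u_k(v_k - 1)\). For this tiny design the two agree exactly, because the weighted mean reduces to a linear statistic when all the weights are equal; in general the agreement is only asymptotic.

```python
# Enumerate all SRSWOR samples of size n = 2 from N = 3 and compare the
# variance of the substitution estimator T(M-hat) (the weighted sample mean)
# to the variance of the linearized statistic sum_k u_k * (v_k - 1).
from itertools import combinations

heights = [5.5, 6.2, 6.3]
N, n = len(heights), 2
weight = N / n                         # 1 / pi_k, with pi_k = n / N
Ybar = sum(heights) / N
u = [(y - Ybar) / N for y in heights]  # influence function values u_k

def variance(values):
    """Variance over equally likely outcomes."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

samples = list(combinations(range(N), n))  # each sample has probability 1/3
T_hat = [sum(weight * heights[k] for k in s) / (weight * n) for s in samples]
linearized = [sum(u[k] * ((weight if k in s else 0) - 1) for k in range(N))
              for s in samples]

print(variance(T_hat))       # 0.03166...
print(variance(linearized))  # 0.03166... -- identical for this linear case
```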

Proof

From Assumption 5 (homogeneity of \(T\)), we have that \(N^{-\beta}T(M)=T(M/N)\). From Assumption 6, we have that \(T(M/N)=N^{-\beta}T(M) < \infty\) for sufficiently large \(N\). Taken together, these imply that for sufficiently large \(N\), the following holds:

\[ \begin{aligned}{N^{-\beta}}\{T(\hat{M})-T(M)\} &= T(\frac{\hat{M}}{N}) - T(\frac{M}{N}) \\ \end{aligned} \]

Let us equip the space of measures on \(\mathbb{R}^p\) with a metric \(\tilde{d}\) satisfying \(\tilde{d}(Q/N, M/N) \rightarrow 0\) if and only if \(N^{-1} \{ \int Z dQ(z) - \int Z dM(z) \} \rightarrow 0\) for any variable of interest \(Z\) defined on \(\mathbb{R}^p\). This means that the distance between the sample’s measure \(\hat{M}\) and the population’s measure \(M\) goes to zero if and only if the scaled distance between the population total (i.e. \(\int Z dM(z)= \sum_{k \in U} z_k\)) and the Horvitz-Thompson estimator of the total (i.e. \(\int Z d\hat{M}(z)= \sum_{k \in s} z_k/\pi_k\)) goes to zero for any variable of interest.

From Assumption 4, then, we have that \(\tilde{d}(\hat{M}/N, M/N) = O_p(n^{-1/2})\).

Why?

Assumption 4 states that \(X_N =\sqrt{n} N^{-1}\left(\int \mathcal{Z} d \hat{M}-\int \mathcal{Z} d M\right)\) converges in distribution to a Normal distribution with mean zero and finite variance. Convergence in distribution implies that \(X_N\) is bounded in probability: for any \(\epsilon > 0\), there is a constant \(K_{\epsilon}\) such that \(P(|X_N| > K_{\epsilon}) < \epsilon\) for all sufficiently large \(N\); in other words, \(X_N = O_p(1)\). Noting that \(N^{-1}\left(\int \mathcal{Z} d \hat{M}-\int \mathcal{Z} d M\right)=n^{-1/2}X_N\), we thus have that \(N^{-1}\left(\int \mathcal{Z} d \hat{M}-\int \mathcal{Z} d M\right) = O_p(n^{-1/2})\).

Because the metric \(\tilde{d}(Q/N,M/N)\) and \(N^{-1} \{ \int Z dQ(z) - \int Z dM(z)\}\) are equivalent in terms of convergence properties, this implies that \(\tilde{d}(\hat{M}/N, M/N)=O_p(n^{-1/2})\).
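For intuition about this \(O_p(n^{-1/2})\) rate (a quick simulation of my own, not part of the proof), we can check under SRSWOR that the scaled error \(\sqrt{n}\,N^{-1}(\int \mathcal{Z}\,d\hat{M} - \int \mathcal{Z}\,dM)\) has roughly constant spread as \(N\) grows with the sampling fraction held fixed:

```python
# Simulate sqrt(n) * N^{-1} * (Horvitz-Thompson total - population total)
# under SRSWOR with a fixed sampling fraction f = n/N. If Assumption 4 holds,
# the spread of this scaled error should stabilize as N grows.
import random
import statistics

random.seed(42)
f = 0.5  # sampling fraction n / N

for N in [100, 1_000, 10_000]:
    n = int(f * N)
    z = [random.gauss(10, 2) for _ in range(N)]  # an arbitrary study variable
    total = sum(z)
    scaled_errors = []
    for _ in range(1_000):
        sample = random.sample(range(N), n)
        ht_total = (N / n) * sum(z[k] for k in sample)  # Horvitz-Thompson
        scaled_errors.append(n ** 0.5 * (ht_total - total) / N)
    print(N, statistics.stdev(scaled_errors))  # roughly constant (~1.4) across N
```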

Now because \(T\) is Fréchet differentiable by Assumption 7, we can write the following first-order von Mises expansion (see Huber, 1981, pages 35-39 for details):

\[ \begin{aligned}T(\frac{\hat{M}}{N}) &= T(\frac{M}{N}) + \int IT(\frac{M}{N} ; z) d\left(\frac{\hat{M}}{N}-\frac{M}{N}\right)(z) + o\{ \tilde{d}(\hat{M}/N, M/N) \} \\ \end{aligned} \]

What is a first-order von Mises expansion?

The following definition is loosely taken from “Robust Statistics” by Hampel et al. (1986), Cabrera and Fernholz (1999), and Huber (1981).

Von Mises expansion. For distributions \(G\) and \(F\), the first-order von Mises expansion of \(T\) at \(F\) (which is derived from a Taylor series) evaluated at \(G\) is given by \[ T\left( G \right)=T\left( F \right)+ \int \operatorname{IF}(F; x) d(G - F)(x) + o(\tilde{d}(G,F)) \] for a suitable distance metric \(\tilde{d}(G,F)\).

Next, we note that for a functional of degree \(\beta\), \(IT(\frac{M}{N};z)=N^{1-\beta} IT(M;z)\) (proof in Deville, 1999). We thus have that:

\[ \begin{aligned}T(\frac{\hat{M}}{N}) - T(\frac{M}{N}) &= N^{-\beta} \int IT(M ; z) d\left(\hat{M}-M\right)(z) + o\{ \tilde{d}(\hat{M}/N, M/N) \} \\ \end{aligned} \]

Denote the remainder term by \(R_n\). Since \(\tilde{d}(\hat{M}/N, M/N)\) is \(O_p(n^{-1/2})\) and \(R_n\) is \(o\left(\tilde{d}(\hat{M}/N, M/N)\right)\), we must have that \(R_n=o_p(n^{-1/2})\).

And so we have that:

\[ \begin{aligned} T(\frac{\hat{M}}{N}) - T(\frac{M}{N}) &= N^{-\beta} \int IT(M ; z) d\left(\hat{M}-M\right)(z) + o_p( n^{-1/2} ) \\ \end{aligned} \]

Multiplying both sides by \(\sqrt{n}\), we thus have that:

\[ \sqrt{n} \left( T(\frac{\hat{M}}{N}) - T(\frac{M}{N}) \right) = \sqrt{n}N^{-\beta} \int IT(M ; z) d\left(\hat{M}-M\right)(z) + o_p(1) \]

This gives us the desired result:

\[ \begin{aligned} \sqrt{n} N^{-\beta}\left( T(\hat{M}) - T(M) \right) &= \sqrt{n}N^{-\beta} \int IT(M ; z) d\left(\hat{M}-M\right)(z) + o_p(1) \\ &= \sqrt{n}N^{-\beta} \sum_{k=1}^{N} u_k(v_k - 1) + o_p(1) \end{aligned} \]