Establishing the influence function method’s asymptotic validity

A math-heavy post giving a background and detailed proof of the theorem used to justify the method of linearization based on influence functions for sample surveys.

Author: Ben Schneider

Published: October 28, 2021

Recently, I’ve written a few posts about how survey sampling variances can be estimated using the method of linearization based on influence functions. This method is quite useful, but it is hard to find a complete proof of why it works. The first paper to contain a purported proof of the method (Deville, 1999) leaves a few key steps unexplained and, I think, also has an unfortunate typo or two in its proof. A couple of later papers (Goga, Deville, and Ruiz-Gazen, 2009; Goga and Ruiz-Gazen, 2013) both contain clearer proofs in their appendices.

All of the proofs above are fairly intimidating, so the purpose of this blog post is to add more context and detail to the proof from Goga, Deville, and Ruiz-Gazen (2009).

Notation and Background

The method of linearization using influence functions applies to a broad class of estimators used in finite population inference which are referred to as “substitution estimators”. As the name suggests, substitution estimators are estimators calculated by pretending your (weighted) sample is the population. The weighted sample mean, for example, is a substitution estimator.

This idea is formalized using the concept of functionals. Loosely speaking, a functional is a recipe that indicates how to calculate a quantity of interest (a mean, ratio, quantile, etc.) given a population and a “measure” which assigns weight to each person in the population. One example of a measure is a population measure, which equals ‘1’ for each person in the population and 0 for persons outside the population of interest. Another example of a measure is the “sample measure”, which equals the value of a sample weight for each person in the sample and zero for every person not in the sample. More formally, a functional is a function whose input is another function: the input is called a “measure”, which is a function that assigns a weight to values in a space.

Measures

Let’s give a more concrete example of what we mean by measure. Suppose we have a population of three individuals, with heights and shoe sizes as described in the following table.

ID   Height   Shoe Size
1    5.5      7
2    6.2      9
3    6.3      8

We can represent this population with a “population measure”, which is a function $M$ that looks at a set of values and outputs ‘1’ if those values appear in the population and ‘0’ otherwise. The function’s inputs are triples $(x, y, z)$ in $\mathbb{R}^3$, such as $(1, 2.3, 4.8)$ or $(9.5, 8.4, 6.1)$.

The population measure $M$ is a function on $\mathbb{R}^3$ such that:

$$M((x,y,z)) = \begin{cases} 1 & \text{if } (x,y,z) \in \{(1, 5.5, 7),\ (2, 6.2, 9),\ (3, 6.3, 8)\} \\ 0 & \text{otherwise} \end{cases}$$

Now, suppose we draw a simple random sample without replacement of size two, in which we select persons 2 and 3. Then we can represent our sample with a “sample measure”, which is a function denoted $\hat{M}$ that spits out the sampling weight for values that appear in the sample and ‘0’ otherwise. For this sample design, the sampling weight is the same for everyone (i.e. equals $N/n = 3/2$).

The sample measure $\hat{M}$ is a function on $\mathbb{R}^3$ such that:

$$\hat{M}((x,y,z)) = \begin{cases} 3/2 & \text{if } (x,y,z) \in \{(2, 6.2, 9),\ (3, 6.3, 8)\} \\ 0 & \text{otherwise} \end{cases}$$

In general, the sample measure is based on each sample member’s “probability of inclusion” in the sample, denoted $\pi_k$ for person $k$, and the weight for a given sample person is $1/\pi_k$.
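To make this concrete, here is a minimal Python sketch of the two measures for the toy example above. The representation of measures as plain functions over tuples is my own illustration, not notation from any of the cited papers.

```python
# A minimal sketch of the toy example's two measures.
POPULATION = {(1, 5.5, 7), (2, 6.2, 9), (3, 6.3, 8)}
SAMPLE = {(2, 6.2, 9), (3, 6.3, 8)}
SAMPLING_WEIGHT = 3 / 2  # 1/pi_k, where pi_k = n/N = 2/3 under this design

def M(point):
    """Population measure: weight 1 for population members, 0 otherwise."""
    return 1 if point in POPULATION else 0

def M_hat(point):
    """Sample measure: the sampling weight for sampled members, 0 otherwise."""
    return SAMPLING_WEIGHT if point in SAMPLE else 0

print(M((1, 5.5, 7)))      # 1
print(M_hat((1, 5.5, 7)))  # 0   (person 1 was not sampled)
print(M_hat((3, 6.3, 8)))  # 1.5 (person 3's sampling weight)
```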

Functionals

With a clearer idea of what a measure is, we can now look at some examples of functionals. A functional $T$ is a function whose input is a measure such as $M$ or $\hat{M}$; we’d write this as $T(M)$ or $T(\hat{M})$. It tells us how to calculate a quantity of interest (such as a mean, ratio, or quantile) using that measure and the set of values that the measure is defined over. In the example above, that was $(x, y, z) \in \mathbb{R}^3$, where $x$ represents a person’s ID number, $y$ represents a person’s height, and $z$ represents a person’s shoe size.

Let $T$ be the “mean height” functional:

$$\text{Population mean height: } T(M) = \frac{\int y \, dM}{\int dM} = \frac{\sum_{i \in U} y_i}{N}$$

$$\text{Sample mean height: } T(\hat{M}) = \frac{\int y \, d\hat{M}}{\int d\hat{M}} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i=1}^{n} w_i}$$
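Continuing the hypothetical sketch from above, the mean-height functional can be evaluated against either measure by summing over the points in the population. The function `T` and the lambda representations of the measures are my own toy illustration:

```python
# Evaluating the "mean height" functional T against a measure.
# Height is the second coordinate of each (id, height, shoe_size) triple.
POPULATION = [(1, 5.5, 7), (2, 6.2, 9), (3, 6.3, 8)]

def T(measure):
    """Substitution estimator: the weighted mean of height under `measure`."""
    total_weight = sum(measure(p) for p in POPULATION)
    weighted_sum = sum(measure(p) * p[1] for p in POPULATION)
    return weighted_sum / total_weight

M = lambda p: 1                                  # population measure
M_hat = lambda p: 1.5 if p[0] in (2, 3) else 0   # sample measure (persons 2 and 3)

print(T(M))      # (5.5 + 6.2 + 6.3) / 3 = 6.0
print(T(M_hat))  # (1.5*6.2 + 1.5*6.3) / 3.0 = 6.25
```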

Influence Functions

The influence function is a function denoted $IT(M; z)$, which is defined for a given functional $T$ and measure $M$, and which takes as its input a value $z$ in the set for which $M$ is defined. Loosely speaking, it can be interpreted as a measure of how much the value of the functional $T(M)$ would change if the population included an additional person with values $z$ (in other words, if $M$ assigned twice the weight to the point $z$).

Mathematically, it is a derivative:

$$IT(M; z) = \lim_{t \to 0} \frac{T(M + t\delta_z) - T(M)}{t}$$

where $\delta_z$ is the measure on $\mathbb{R}^3$ given by:

$$\delta_z((x_1, x_2, x_3)) = \begin{cases} 1 & \text{if } (x_1, x_2, x_3) = z \\ 0 & \text{otherwise} \end{cases}$$

For example, consider again our tiny population with values $\{(1, 5.5, 7),\ (2, 6.2, 9),\ (3, 6.3, 8)\}$ associated with measure $M$, and the “mean height” functional $T$, where $T(M)$ gives the population mean height.

Then the influence function at the point $z = (1, 5.5, 7)$ is given by the following:

$$IT(M; z) = \lim_{t \to 0} \frac{T(M + t\delta_z) - T(M)}{t}, \quad \text{where } \delta_z((x_1, x_2, x_3)) = \begin{cases} 1 & \text{if } (x_1, x_2, x_3) = (1, 5.5, 7) \\ 0 & \text{otherwise} \end{cases}$$

And we can determine its value by noting:

$$T(M) = \frac{5.5 + 6.2 + 6.3}{3}, \qquad T(M + t\delta_z) = \frac{\left((1 + t) \times 5.5\right) + 6.2 + 6.3}{3 + t}$$

If we work it out, we’d find that the influence function equals the following:

$$IT(M; z) = \frac{1}{3}\left(5.5 - \frac{1}{3}(5.5 + 6.2 + 6.3)\right) = \frac{1}{3}(5.5 - 6.0) = -\frac{1}{6}$$

And in general, the influence function of a population mean at the point corresponding to person $k$ is given by $\frac{1}{N}(y_k - \bar{Y})$.
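As a sanity check, here is a small numerical sketch (my own, with a finite difference standing in for the limit) confirming this value:

```python
# Approximate IT(M; z) for the "mean height" functional at z = (1, 5.5, 7)
# with a finite difference, and compare with the closed form (1/N)(y_k - Ybar).
heights = [5.5, 6.2, 6.3]
N = len(heights)

def T_perturbed(t):
    """T(M + t*delta_z): person 1's weight goes from 1 to 1 + t."""
    return ((1 + t) * heights[0] + heights[1] + heights[2]) / (N + t)

t = 1e-8
numeric_influence = (T_perturbed(t) - T_perturbed(0)) / t
closed_form = (heights[0] - sum(heights) / N) / N

print(numeric_influence)  # approximately -0.1667
print(closed_form)        # -1/6 = -0.1666...
```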

Assumptions of the Theorem

The first set of assumptions (1 through 4) applies to the sample design and must hold for each variable of interest, $Z$.

  • Assumption 1. We assume that $\lim_{N \to \infty} N^{-1} n = f \in (0, 1)$.

  • Assumption 2. We assume that $\lim_{N \to \infty} N^{-1} \int Z \, dM$ exists, for any variable of interest $Z$.

  • Assumption 3. As $N \to \infty$, we have that $N^{-1}\left(\int Z \, d\hat{M} - \int Z \, dM\right) \to 0$ in probability, for any variable of interest $Z$.

  • Assumption 4. As $N \to \infty$, $\left\{\sqrt{n} \, N^{-1}\left(\int Z \, d\hat{M} - \int Z \, dM\right)\right\} \to N(0, \Sigma)$ in distribution, for any variable of interest $Z$.

The second set of assumptions, 5 through 7, applies to the functional $T$, requiring it to be “smooth” and bounded in certain senses.

  • Assumption 5. We assume that $T$ is homogeneous, in that there exists a real number $\beta > 0$, dependent on $T$, such that $T(rM) = r^{\beta} T(M)$ for any real $r > 0$ (illustrated in the sketch after this list).

  • Assumption 6. We assume that $\lim_{N \to \infty} N^{-\beta} T(M) < \infty$.

  • Assumption 7. We assume that $T$ is Fréchet differentiable.

    The following definition is adapted from Huber (1981), pages 34-35.

    Definition: The functional $T: \mathcal{M} \to \mathbb{R}$ is Fréchet differentiable at $M \in \mathcal{M}$ if there exists a linear, continuous functional $T'_M: \mathcal{M} \to \mathbb{R}$ such that for any $H \in \mathcal{M}$, $$T(H) - T(M) - T'_M(H - M) = o\left[d(M, H)\right]$$ where $d$ is some metric on the set $\mathcal{M}$ of relevant measures for which the following conditions hold:

    • For all $G \in \mathcal{M}$, the set $\{F : d(G, F) < \epsilon\}$ is open for all $\epsilon > 0$.
    • If $F_t = (1 - t)F_0 + tF_1$, then $d(F_t, F_s) = O(|t - s|)$.
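To build intuition for Assumption 5, here is a hypothetical check (my own toy code, not from the papers) of the homogeneity degree $\beta$ for one common functional: the population total, which is homogeneous of degree $\beta = 1$.

```python
# Checking the homogeneity degree (Assumption 5) for the population-total
# functional T(M) = integral of y dM, which satisfies T(rM) = r^1 * T(M).
heights = [5.5, 6.2, 6.3]

def total(weights):
    """T(M) = sum over the population of y_k times the measure's weight."""
    return sum(w * y for w, y in zip(weights, heights))

M = [1, 1, 1]            # population measure: weight 1 per person
r = 2.5
rM = [r * w for w in M]  # the scaled measure rM

beta = 1
print(total(rM))           # 45.0
print(r**beta * total(M))  # 45.0, confirming T(rM) = r^beta * T(M)
```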

Formal Statement of the Theorem

Let $u_k$ denote the influence function for element $k$ in the population (i.e. if person $k$ has values $z_k$, then $u_k = IT(M; z_k)$), and let $v_k$ denote the sampling weight for person $k$ in the population. If person $k$ is in a selected sample, then $v_k = 1/\pi_k$, where $\pi_k$ is their probability of inclusion in the sample. For persons in the population but not in the selected sample, $v_k = 0$.

If Assumptions 1 through 7 hold, then:

$$\sqrt{n} \, N^{-\beta}\left\{T(\hat{M}) - T(M)\right\} = \sqrt{n} \, N^{-\beta} \int IT(M; z) \, d(\hat{M} - M)(z) + o_p(1) = \sqrt{n} \, N^{-\beta} \left\{\sum_{k=1}^{N} u_k (v_k - 1)\right\} + o_p(1)$$

and so the asymptotic variance of $T(\hat{M})$ is equal to the variance of $\sum_{k=1}^{N} u_k (v_k - 1)$.

In short, asymptotically, the sampling variance of the sample statistic $T(\hat{M})$ is equal to the sampling variance of the weighted sample sum $\sum_{i=1}^{n} u_i v_i$ (which differs from $\sum_{k=1}^{N} u_k (v_k - 1)$ only by the constant $\sum_{k=1}^{N} u_k$, which does not vary across samples).
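In practice, this licenses a simple recipe: compute each sampled unit’s estimated influence value and feed the products $u_k v_k$ into a standard variance estimator for a weighted total. Below is a minimal hypothetical sketch for the mean under simple random sampling without replacement; the function name and the SRSWOR variance formula are my own illustration, not notation from the papers.

```python
import numpy as np

def linearized_var_of_mean(y, N):
    """Variance estimate for the weighted sample mean via linearization,
    under SRSWOR, where every sampled unit has weight v_k = N/n."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    w = N / n
    N_hat = n * w                    # estimated population size (= N here)
    y_bar = np.sum(w * y) / N_hat    # the substitution estimator T(M_hat)
    u_hat = (y - y_bar) / N_hat      # estimated influence values u_k
    # Variance of the Horvitz-Thompson total of u under SRSWOR:
    return N**2 * (1 - n / N) * np.var(u_hat, ddof=1) / n

# For the mean, this reproduces the textbook SRS formula (1 - n/N) * s^2 / n:
y = [6.2, 6.3]
print(linearized_var_of_mean(y, N=3))
print((1 - 2 / 3) * np.var(y, ddof=1) / 2)
```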

Proof

From Assumption 5 (homogeneity of $T$), we have that $N^{-\beta} T(M) = T(M/N)$. From Assumption 6, we have that $T(M/N) = N^{-\beta} T(M) < \infty$ for sufficiently large $N$. Taken together, this implies that for sufficiently large $N$, the following holds:

$$N^{-\beta}\left\{T(\hat{M}) - T(M)\right\} = T\left(\frac{\hat{M}}{N}\right) - T\left(\frac{M}{N}\right)$$

Let us equip the space $(\mathbb{R}^p, \mathcal{M})$ with a metric $\tilde{d}$ satisfying $\tilde{d}(Q/N, M/N) \to 0$ if and only if $N^{-1}\left\{\int Z \, dQ(z) - \int Z \, dM(z)\right\} \to 0$ for any variable of interest $Z$ defined on $\mathbb{R}^p$. This means that the distance between the sample’s measure $\hat{M}$ and the population’s measure $M$ goes to zero if and only if the difference between the population total (i.e. $\int Z \, dM(z) = \sum_{k \in U} z_k$) and the Horvitz-Thompson estimator of the total (i.e. $\int Z \, d\hat{M}(z) = \sum_{k \in s} z_k / \pi_k$), scaled by $N^{-1}$, goes to zero for any variable of interest.

From Assumption 4, then, we have that $\tilde{d}(\hat{M}/N, M/N) = O_p(n^{-1/2})$.

Why?

Assumption 4 states that $X_N = \sqrt{n} \, N^{-1}\left(\int Z \, d\hat{M} - \int Z \, dM\right)$ converges in distribution to a Normal distribution with mean zero and finite variance. Convergence in distribution implies that $X_N$ is bounded in probability: for any $\epsilon > 0$, we can choose a constant $C$ such that $P(|X_N| > C) < \epsilon$ for all sufficiently large $N$; in other words, $X_N = O_p(1)$. Noting that $N^{-1}\left(\int Z \, d\hat{M} - \int Z \, dM\right) = n^{-1/2} X_N$, we thus have that $N^{-1}\left(\int Z \, d\hat{M} - \int Z \, dM\right) = O_p(n^{-1/2})$.

Because of the equivalence between the metric $\tilde{d}(Q/N, M/N)$ and $N^{-1}\left\{\int Z \, dQ(z) - \int Z \, dM(z)\right\}$ in terms of convergence properties, this implies that $\tilde{d}(\hat{M}/N, M/N) = O_p(n^{-1/2})$.

Now, because $T$ is Fréchet differentiable by Assumption 7, we can write the following first-order von Mises expansion (see Huber 1981, pages 35-39 for details):

$$T\left(\frac{\hat{M}}{N}\right) = T\left(\frac{M}{N}\right) + \int IT\left(\frac{M}{N}; z\right) d\left(\frac{\hat{M}}{N} - \frac{M}{N}\right)(z) + o\left\{\tilde{d}(\hat{M}/N, M/N)\right\}$$

What is a first-order von Mises expansion?

The following definition is loosely taken from Hampel et al. (1986, “Robust Statistics”), Cabrera and Fernholz (1999), and Huber (1981).

Von Mises expansion. For distributions $G$ and $F$, the first-order von Mises expansion of $T$ at $F$ (which is derived from a Taylor series), evaluated at $G$, is given by $$T(G) = T(F) + \int IT(F; x) \, d(G - F)(x) + o\left(\tilde{d}(G, F)\right)$$ for a suitable distance metric $\tilde{d}(G, F)$.
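As a quick illustration (my own, not from the cited texts): for the mean functional $T(F) = \int x \, dF(x)$ over probability distributions, the influence function is $IT(F; x) = x - T(F)$, and the first-order expansion holds exactly, with zero remainder:

$$T(F) + \int \left(x - T(F)\right) d(G - F)(x) = T(F) + \left(T(G) - T(F)\right) - T(F) \underbrace{\int d(G - F)(x)}_{= \, 0} = T(G)$$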

Next, we note that for a functional of degree $\beta$, $IT(M/N; z) = N^{1 - \beta} IT(M; z)$ (proof in Deville, 1999), and that $d\left(\frac{\hat{M}}{N} - \frac{M}{N}\right)(z) = N^{-1} \, d(\hat{M} - M)(z)$, so the two factors of $N$ combine into $N^{-\beta}$. We thus have that:

$$T\left(\frac{\hat{M}}{N}\right) - T\left(\frac{M}{N}\right) = N^{-\beta} \int IT(M; z) \, d(\hat{M} - M)(z) + o\left\{\tilde{d}(\hat{M}/N, M/N)\right\}$$

Denote the remainder term $R_n$. Now, since we have that $\tilde{d}(\hat{M}/N, M/N)$ is $O_p(n^{-1/2})$, and $R_n$ is $o\left(\tilde{d}(\hat{M}/N, M/N)\right)$, we must have that $R_n = o_p(n^{-1/2})$.

And so we have that:

$$T\left(\frac{\hat{M}}{N}\right) - T\left(\frac{M}{N}\right) = N^{-\beta} \int IT(M; z) \, d(\hat{M} - M)(z) + o_p(n^{-1/2})$$

Multiplying both sides by $\sqrt{n}$, we thus have that:

$$\sqrt{n}\left(T\left(\frac{\hat{M}}{N}\right) - T\left(\frac{M}{N}\right)\right) = \sqrt{n} \, N^{-\beta} \int IT(M; z) \, d(\hat{M} - M)(z) + o_p(1)$$

Recalling from the start of the proof that $N^{-\beta}\left\{T(\hat{M}) - T(M)\right\} = T(\hat{M}/N) - T(M/N)$, this gives us the desired result:

$$\sqrt{n} \, N^{-\beta}\left(T(\hat{M}) - T(M)\right) = \sqrt{n} \, N^{-\beta} \int IT(M; z) \, d(\hat{M} - M)(z) + o_p(1) = \sqrt{n} \, N^{-\beta} \sum_{k=1}^{N} u_k (v_k - 1) + o_p(1)$$
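To close, here is a small Monte Carlo sketch (my own construction, not from the papers) checking the theorem’s practical upshot for a nonlinear substitution estimator, the ratio of two totals: across repeated SRSWOR samples, the empirical variance of $T(\hat{M})$ should approximately match that of $\sum_{k=1}^{N} u_k (v_k - 1)$.

```python
import numpy as np

rng = np.random.default_rng(42)

# A synthetic population and a nonlinear substitution estimator:
# the ratio T = total(y) / total(x), whose influence values are
# u_k = (y_k - R * x_k) / total(x), with R the population ratio.
N, n = 1000, 100
x = rng.uniform(1, 10, size=N)
y = 2.0 * x + rng.normal(0, 1, size=N)
R = y.sum() / x.sum()
u = (y - R * x) / x.sum()

T_hat, lin_sums = [], []
for _ in range(5000):
    s = rng.choice(N, size=n, replace=False)
    v = np.zeros(N)
    v[s] = N / n                  # v_k = 1/pi_k on the sample, 0 otherwise
    T_hat.append((v[s] * y[s]).sum() / (v[s] * x[s]).sum())
    lin_sums.append((u * (v - 1)).sum())

print(np.var(T_hat))     # empirical sampling variance of T(M_hat)
print(np.var(lin_sums))  # variance of sum_k u_k (v_k - 1): should be close
```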