Energy Balancing Weights for Surveys – Practical Significance

This week, Nate Cohn at the New York Times announced that the Times/Siena Poll is updating its weighting methods to use an approach called “energy balancing.” Shira Mitchell wrote a nice, short post about this over at the Gelman and friends blog. The energy balancing method is fairly new as far as weighting methods go (the paper proposing it was published in 2024, though its preprint dates back to 2020). Yet it has already generated a great deal of interest among researchers. Part of the reason is that it immediately had nice software implementations available, most notably in the excellent R package WeightIt by Noah Greifer. Noah’s written a fair bit about the method’s pros and cons in a few places: in the R package documentation, in Stack Exchange posts, and in a recent paper.

Aside from the Times/Siena Poll, I’m not aware of any other surveys that have adopted the energy balancing method, though the New York Times announcement will certainly spur interest. In the last few years, survey statisticians have been exploring more and more alternatives to raking, our beloved, trusty old weighting method. For example, to deal with pandemic-related spikes in nonresponse, statisticians at the U.S. Census Bureau used “entropy balancing” weighting in the American Community Survey (ACS) and the Current Population Survey Annual Social and Economic Supplement (CPS ASEC). And recent papers like Longford (2024) have proposed weighting methods to adjust for larger numbers of variables than raking can typically support.

The energy balancing method differs from raking and related methods in a fundamental way that’s worth considering. Raking and similar calibration methods are based on balancing means or totals for specific variables: for example, ensuring that the average age in a weight-adjusted sample matches the average age of the target population. That basic approach can be extended in many ways, for instance by calibrating on interaction variables such as age crossed with gender, or by calibrating on quantities other than the mean, or by using special transformed variables such as influence functions. The energy balancing method does something different: it calibrates based on an entire multivariate distribution, as measured by an empirical cumulative distribution function (ECDF).¹
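To make the contrast concrete, here is a minimal Python sketch of the energy distance, the quantity that energy balancing minimizes over the weights. It is not taken from the Times/Siena report or from WeightIt; the function names and the toy numbers are illustrative assumptions. The two toy datasets have identical means, so a mean-based comparison sees no difference between them, while the energy distance picks up the difference in spread.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def energy_distance(sample, benchmark, sample_weights=None, benchmark_weights=None):
    """Weighted energy distance between a sample and a benchmark dataset.

    Weights are normalized to sum to 1; equal weights are used when none are given.
    """
    x = np.asarray(sample, dtype=float)
    y = np.asarray(benchmark, dtype=float)
    # Treat 1-D input as a single variable observed on many units.
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    w = np.ones(len(x)) if sample_weights is None else np.asarray(sample_weights, dtype=float)
    q = np.ones(len(y)) if benchmark_weights is None else np.asarray(benchmark_weights, dtype=float)
    w = w / w.sum()
    q = q / q.sum()
    between = w @ pairwise_distances(x, y) @ q
    within_sample = w @ pairwise_distances(x, x) @ w
    within_benchmark = q @ pairwise_distances(y, y) @ q
    return 2 * between - within_sample - within_benchmark

# Toy data (hypothetical): same mean, very different spread.
sample_values = np.array([-2.0, -1.0, 1.0, 2.0])
benchmark_values = np.array([-0.5, -0.25, 0.25, 0.5])

mean_gap = sample_values.mean() - benchmark_values.mean()
distance = energy_distance(sample_values, benchmark_values)
print(f"difference in means: {mean_gap:.3f}")
print(f"energy distance:     {distance:.4f}")
```

Energy balancing searches for the sample weights that make this distance as small as possible, typically subject to constraints such as nonnegative weights. Passing `sample_weights` to the function above lets you check how a candidate set of weights changes the distance.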

This is actually more involved than weighting on, say, combinations of age and gender rather than just age and gender separately. Raking can do that, too, of course. What’s different is that the entire shape of the distribution matters, not just summaries of it like means. And what’s being minimized in the weighting process isn’t differences in the means between sample and population, but instead a global measure of differences in the multivariate distributions for the sample and population. When you’re calibrating on a handful of categorical variables, the distinction isn’t so noticeable. But when you have continuous variables like income with potentially hundreds or thousands of distinct values, the distinction is quite noticeable and energy balancing might not be computationally feasible. So in practice you might need to do something like collapse continuous variables into, say, deciles.
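If you do collapse a continuous variable, one way to go about it is sketched below in Python: compute decile cut points from the benchmark data and apply those same cut points to the survey, so both datasets share one set of bins. The income values are simulated and the helper name `to_decile_bins` is hypothetical.

```python
import numpy as np

def to_decile_bins(values, cut_points):
    """Assign each value a bin from 0 (lowest decile) to 9 (highest decile)."""
    return np.searchsorted(cut_points, values, side="right")

rng = np.random.default_rng(0)

# Simulated incomes (hypothetical): a large benchmark file and a smaller survey.
benchmark_income = rng.lognormal(mean=11.0, sigma=0.8, size=10_000)
survey_income = rng.lognormal(mean=11.2, sigma=0.8, size=500)

# Nine cut points taken from the benchmark define the ten bins.
cut_points = np.quantile(benchmark_income, np.linspace(0.1, 0.9, 9))

benchmark_bins = to_decile_bins(benchmark_income, cut_points)
survey_bins = to_decile_bins(survey_income, cut_points)

# Each benchmark bin holds about 10% of records; the survey bins need not.
print(np.bincount(benchmark_bins, minlength=10))
print(np.bincount(survey_bins, minlength=10))
```

Taking the cut points from the benchmark rather than from the survey keeps the bins tied to the population you are weighting toward, though that is a design choice and not something the method requires.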

With energy balancing, the benefits of calibrating on an entire multivariate distribution come at the practical cost of having to calibrate on an entire multivariate distribution. Calibrating on a multivariate distribution requires much richer data than the other calibration methods. With raking and similar methods you have a lot of flexibility: you can easily mix and match totals from different data sources, such as a table of American Community Survey estimates and another table of estimates from the National Public Opinion Reference Survey (NPORS). With energy balancing, you need a single benchmark dataset with all your weighting variables.² Ideally, you would have a rich microdata file with all the variables already on it, for example if you have access to a nice list sampling frame for a survey. But alternatively, you could construct a synthetic benchmark dataset, such as one developed by Pew Research in its 2018 study on weighting methods for online opt-in panel surveys.

If you want to calibrate on variables from different benchmark data sources (e.g., household income from the American Community Survey, and political party affiliation from NPORS), then with energy balancing you have to make precise assumptions about the joint distribution of the variables, which you’re going to end up imposing on your survey’s data. For example, in order to create a benchmark dataset with both income and political party affiliation, you might for simplicity assume that the two variables are independent, which allows you to easily generate a synthetic benchmark dataset. Unfortunately, when you then weight your survey’s data, you’d end up forcing your survey data to eliminate or attenuate genuine correlations between income and political party affiliation. You can do better by making more careful modeling assumptions, but you still have to make precise assumptions and do more complex modeling compared to raking and the like.
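A toy Python illustration of the independence problem, with made-up numbers throughout: the survey shows a strong association between income and party, and the benchmark is built by multiplying two margins together. For two binary variables, any method that fully balances the joint distribution ends up at the benchmark's cell shares, so exact cell matching stands in for energy balancing here. Raking to the same two margins is shown for comparison. The function names are hypothetical.

```python
import numpy as np

def odds_ratio(table):
    """Odds ratio of a 2x2 table; 1.0 means no association."""
    return (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

def rake(table, row_targets, col_targets, iterations=200):
    """Iterative proportional fitting of a table's shares to row and column targets."""
    shares = table / table.sum()
    for _ in range(iterations):
        shares = shares * (row_targets / shares.sum(axis=1))[:, None]
        shares = shares * (col_targets / shares.sum(axis=0))[None, :]
    return shares

# Hypothetical respondent counts: rows = income (low, high), columns = party (A, B).
survey_counts = np.array([[40.0, 10.0],
                          [10.0, 40.0]])

# Hypothetical margins, imagined as coming from two separate sources.
income_targets = np.array([0.6, 0.4])
party_targets = np.array([0.45, 0.55])

# Synthetic benchmark built by assuming income and party are independent.
benchmark_shares = np.outer(income_targets, party_targets)

# Matching the full joint distribution: weighted cell shares equal the benchmark's.
survey_shares = survey_counts / survey_counts.sum()
cell_weights = benchmark_shares / survey_shares
joint_matched = survey_shares * cell_weights

# Raking uses only the two margins and leaves the within-table association alone.
raked = rake(survey_counts, income_targets, party_targets)

print("odds ratio, unweighted survey:", odds_ratio(survey_counts))
print("odds ratio, raked to margins: ", round(odds_ratio(raked), 3))
print("odds ratio, joint matched:    ", round(odds_ratio(joint_matched), 3))
```

Raking rescales whole rows and whole columns, which leaves the survey's odds ratio intact, while matching the independence-based benchmark drives the odds ratio to 1 and erases the association the respondents actually showed.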

If you read through the full Times/Siena methodology report describing the use of energy balancing (among other aspects of the poll), you’ll see that the benchmark data is quite a rich dataset. In fact, it started out rich (it’s microdata with several weighting variables available), and it was enriched to a weapons-grade weighting dataset through a quite involved process of modeling and data integration. You can see that in the following excerpt from the methodology report. Note that the report uses the term “target population” in a nonstandard way; in the report, it refers to the dataset used for benchmarking/calibrating the weights.

Excerpt from the Times/Siena methodology report:

… the sample was weighted using energy balancing, which finds weights by minimizing the energy distance — a statistical measure of the difference between two multivariate distributions — between the survey and the target population.

The target population was a stratified sample (n=40,000) of the likely electorate, drawn from the L2 voter file’s list of active registered voters. The probability that an active registered voter will turn out in the 2026 midterm election is based on a model of validated turnout in the 2022 and 2018 midterm elections.

The survey was weighted to balance the joint distribution of the following characteristics:

- Party (Party registration if available in the state, else classification based on participation in partisan primaries if available in the state, else classification based on a model of vote choice in prior Times/Siena polls)
- Education (four categories of self-reported education level, based on a model of self-reported education in Times/Siena polls, adjusted to match census-based targets for the registered voter population)
- Age (self-reported age, or voter file age if the respondent refused)
- Gender (L2 data)
- Race (except in Maine) (NYT model of race)
- Turnout history (NYT classifications based on L2 data)
- State region (NYT classifications)
- Synthetic 2024 past vote (for the respondent, synthetic 2024 past vote is self-reported recall vote among validated voters, with major party vote imputed among validated voters who do not report supporting a major party candidate. For the target population, synthetic 2024 past vote is the simulated binomial outcome of major party vote choice in the 2024 election among validated voters, based on a model of vote choice in past Times/Siena polls. Vote choice was modeled as a function of information available on the L2 voter file and self-reported education, adjusted to match precinct-level election results)
- Modeled probability of presidential vote choice in the 2024 election (NYT model of vote choice in past Times/Siena polls based on information available on the L2 voter file and adjusted to precinct-level election results)

The difficulty of finding or constructing a rich benchmark dataset with all the important variables is a major obstacle to implementing this method in practice. But at least in the evaluation done by the Times/Siena folks, the juice was worth the squeeze.

One common setting where we do have fairly rich multivariate microdata for weighting is in nonresponse adjustments for probability sample surveys: that is, in the scenario where we can weight a survey’s respondents to the full selected sample of individuals who were invited to the survey. In fact, that’s the precise scenario of the Times/Siena Poll, which samples from the L2 voter file and uses that file as its benchmark dataset. However, even in this kind of scenario, we often have variables for our survey respondents that aren’t in the rich benchmark microdata, but which we want to calibrate to another external data source, like published Census Bureau estimates. In that case, we might be able to use energy balancing to conduct a nonresponse adjustment, and then afterwards we could use raking or a similar method to calibrate the nonresponse-adjusted weights to external benchmark data.
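A rough Python sketch of that second step: start from nonresponse-adjusted weights (here just made-up numbers standing in for the output of energy balancing) and rake them to external margins. The variables `tenure` and `region`, the target shares, and the function `rake_weights` are all hypothetical.

```python
import numpy as np

def rake_weights(base_weights, categories, targets, iterations=100):
    """Rake unit-level weights so weighted category shares match external targets.

    categories: dict mapping a variable name to integer codes, one per respondent.
    targets:    dict mapping the same names to arrays of target shares.
    """
    weights = np.asarray(base_weights, dtype=float).copy()
    weights /= weights.sum()
    for _ in range(iterations):
        for name, codes in categories.items():
            target = np.asarray(targets[name], dtype=float)
            current = np.bincount(codes, weights=weights, minlength=len(target))
            # Scale every respondent by the adjustment factor for their category.
            weights *= (target / current)[codes]
    return weights

# Step 1 stand-in: nonresponse-adjusted weights for six respondents (made up).
nonresponse_weights = np.array([1.2, 0.8, 1.0, 1.5, 0.5, 1.0])

# Respondent-only variables that are not on the benchmark microdata (hypothetical).
categories = {
    "tenure": np.array([0, 0, 1, 1, 0, 1]),  # 0 = owner, 1 = renter
    "region": np.array([0, 1, 0, 1, 1, 0]),  # 0 = urban, 1 = rural
}

# External target shares, e.g. from published estimates (made-up values).
targets = {
    "tenure": np.array([0.65, 0.35]),
    "region": np.array([0.40, 0.60]),
}

final_weights = rake_weights(nonresponse_weights, categories, targets)
for name, codes in categories.items():
    print(name, np.bincount(codes, weights=final_weights).round(4))
```

Because raking starts from the nonresponse-adjusted weights and only rescales them by category, the final weights stay as close to the first-step weights as the external margins allow, although the second step can partly undo the distributional balance achieved in the first.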

Footnotes

¹ Santra, Chen, and Park (2026) generalize this idea and provide some useful theory for this class of weighting methods.
² In principle, you actually could implement energy balancing without a benchmark dataset, but instead using a precisely specified benchmark distribution which is sort of a platonic ideal of a benchmark dataset. But that doesn’t really impact the points I’m making here in this blog post.