Practical Significance - Ben Schneider’s Blog

Cube sampling provides a useful generalization of several sampling methods, including stratified sampling. This post demonstrates how we can use cube sampling to accomplish the goals we often have when using stratified sampling.

Sep 11, 2023

Ben Schneider

Avoiding Negative Weights in the Generalized Bootstrap

sampling

statistics

The generalized bootstrap is a promising tool for survey data, but a key practical challenge is that it can generate negative replicate weights. This post describes the problems posed by negative weights and demonstrates an overlooked flaw in the main solutions proposed to deal with them.

Aug 7, 2023

Ben Schneider

An Optimization-based Bootstrap

sampling

statistics

Creating bootstrap weights for survey data can be viewed as a constrained optimization problem: obtain low-error variance estimates with as few replicates as possible using nonnegative weights. The standard solution to this problem is to use an effective but inefficient random resampling method. But what if we tried using an optimization algorithm?

May 23, 2023

Ben Schneider

Visualizing the H-1B Boom

miscellaneous

visualization

Recent articles on the huge increase in H-1B registrations have included a lot of numbers but no visualization. This post provides some much-needed visualization.

Apr 29, 2023

Ben Schneider

Bootstrapping all the things

The new version of the ‘svrep’ package on CRAN helps you implement bootstrap methods for surveys, even those with particularly complex sample designs.

Dec 18, 2022

Ben Schneider

Nonresponse weighting class adjustments are model predictions

statistics

surveys

A short post explaining how nonresponse weighting class adjustments can be interpreted as model predictions. Includes a little bit of math and a little bit of R.

Dec 2, 2022

Ben Schneider

More on Speeding up the Survey Package: Adding the New C++ Functions to the Package

statistics

surveys

survey package

Following up on an earlier post about speeding up the {survey} package using {Rcpp} and {RcppArmadillo}, I show how the new, faster functions can be incorporated into the package along with accompanying unit tests. I also show how the speed of the original base R functions and the {Rcpp} versions scale as survey designs increase in size.

Oct 21, 2022

Ben Schneider

How are R and Stata (mis)handling singleton strata?

statistics

surveys

Stata

software

Analysts attempting to estimate variance contributions from “singleton strata” often use the “adjust”/“recentering” options implemented in R and Stata, whose meaning is unclear due to ambiguous documentation. In this post, I try to clarify what these options are actually doing in theory, and I show that in practice their software implementations have surprising bugs.

Sep 2, 2022

Ben Schneider

Calculating R Indicators in R

statistics

surveys

A short post on calculating R indicators (and their sampling variance) in R.

May 15, 2022

Ben Schneider

Fitting a basic Fay-Herriot model in Stan and JAGS

statistics

bayes

MCMC

Stan

JAGS

A short post showing how to fit a very basic, Bayesian Fay-Herriot model in Stan and JAGS. This post focuses on model specification, fitting, and visualization.

May 1, 2022

Ben Schneider

Making the {survey} package hundreds of times faster using {Rcpp}

statistics

surveys

survey package

A common objection to using the {survey} package instead of SAS or Stata is the computational time it requires. In this post, I show that we can easily obtain hundred-fold speed improvements in its core functions by using the {Rcpp} and {RcppArmadillo} packages. To illustrate, I show how we can make svytotal() run over 500 times faster. Finally, I offer thoughts about how to incorporate these {Rcpp}-based functions in either the {survey} package or a potential add-on R package.

Dec 14, 2021

Ben Schneider

Understanding the code used to implement the {survey} package’s recursive variance estimation for multistage samples

statistics

surveys

survey package

Unlike SAS and SPSS, the {survey} package in R can properly estimate variances for designs with multistage sampling and significant sampling fractions. The estimation is implemented using a recursive algorithm, whose basic idea is well-documented but whose R code is a bit intimidating to understand. This post walks through the R code used to implement this algorithm.

Dec 5, 2021

Ben Schneider

What’s the margin of error for the employee engagement index?

employee engagement

statistics

surveys

sampling

The key performance metric of many employee engagement surveys, the Engagement Index, is a statistic whose margin of error is deceptively difficult to estimate. As a result, many organizations fail to present a margin of error at all. I discuss why this estimation problem is difficult using the standard tools of engagement researchers, and I suggest two solutions based on tools from survey sampling theory. I provide example R code and discuss how to estimate the margin of error in general-purpose, non-statistical software.

Nov 1, 2021

Ben Schneider

Establishing the influence function method’s asymptotic validity

math

statistics

surveys

sampling

A math-heavy post giving background on, and a detailed proof of, the theorem used to justify the method of linearization based on influence functions for sample surveys.

Oct 28, 2021

Ben Schneider

Systematic sampling as implicit stratification

statistics

surveys

sampling

This short post illustrates why systematic sampling is sometimes referred to as ‘implicit stratification’.

Oct 4, 2021

Ben Schneider

How correlated are survey estimates from overlapping groups?

statistics

surveys

sampling

simulation

I examine one approach for estimating the correlation of survey estimates from overlapping groups, such as 30-50 year olds and 40-60 year olds, using the method of linearization by influence functions. I provide a walk-through in R and use simulations to evaluate the method’s performance.

Sep 21, 2021

Ben Schneider