--- title: "Bootstrap Methods for Surveys" output: prettydoc::html_pretty: theme: architect highlight: github toc: true toc_depth: 4 vignette: > %\VignetteIndexEntry{Bootstrap Methods for Surveys} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} \usepackage[utf8]{inputenc} bibliography: references.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, message=FALSE, warning=FALSE, echo=FALSE} library(dplyr) # For data manipulation library(survey) # For complex survey analysis library(srvyr) # For complex survey analysis with dplyr syntax library(svrep) ``` This vignette provide guidance on when the bootstrap can be used for complex survey data and how to choose a bootstrap method and number of bootstrap replicates for a survey. It further demonstrates how to use functions from the "svrep" package to implement bootstrap methods and help decide the number of bootstrap replicates needed for your data. The functions covered in this vignette include: **Functions for Creating Bootstrap Weights**: - `as_bootstrap_design()`, which converts a survey design object to a replicate survey design object with bootstrap replicate weights. - `make_rwyb_bootstrap_weights()`, which creates bootstrap replicate weights for general survey designs using the method of Rao-Wu-Yue-Beaumont. The survey designs may potentially include stratification, multistage clustering, and sampling with or without replacement with unequal probabilities. **Functions for Creating Generalized Survey Bootstrap Weights**: - `as_gen_boot_design()`, which creates a replicate survey design using the generalized survey bootstrap method as described by @beaumont2012. - `make_gen_boot_factors()`, which creates a matrix of replicate weights using the generalized survey bootstrap method. This function will usually be used in conjunction with the functions `get_design_quad_form()` or `make_quad_form_matrix()`, which return the quadratic form matrix representation of common variance estimators. **Functions to Help Choose the Number of Bootstrap Replicates**: - `estimate_boot_sim_cv()`, which estimates the simulation error of a bootstrap estimate. - `estimate_boot_reps_for_target_cv()`, which estimates the number of bootstrap replicates needed to reduce a bootstrap's simulation error to a target level. # Choosing a Bootstrap Method Essentially every bootstrap method commonly used for surveys can be used for simple random sampling with replacement and can easily be applied to stratified sampling (simply repeat the method separately for each stratum). However, things become more complicated for other types of sampling and care is needed to use a bootstrap method appropriate to the survey design. For many common designs used in practice, it is possible (and easy!) to use one of the bootstrap methods described in the section of this vignette titled "Basic Bootstrap Methods." If the design isn't appropriate for one of those basic bootstrap methods, then it may be possible to use the generalized survey bootstrap described in a later section of this vignette. The generalized survey bootstrap method can be used for especially complex designs, such as systematic sampling or two-phase sampling designs. The interested reader is encouraged to read @mashreghi2016 for an overview of bootstrap methods developed for survey samples. # Basic Bootstrap Methods For most sample designs used in practice, there are three basic survey design features that must be considered when choosing a bootstrap method: - Whether there are multiple stages of sampling - Whether the design uses without-replacement sampling with large sampling fractions - Whether the design uses unequal-probability sampling (commonly referred to as "probability proportional to size (PPS)" sampling in statistics jargon) The 'svrep' and 'survey' packages implement four basic bootstrap methods, each of which can handle one or more of these survey design features. Of the four methods, the Rao-Wu-Yue-Beaumont bootstrap method [@beaumont2022] is the only one able to directly handle all three of these design features and is thus the default method used in the function `as_bootstrap_design()`.[^2] The following table summarizes these four basic bootstrap methods and their appropriateness for each of the common design features described earlier. [^2]: Preston's method can handle both multistage samples and without-replacement sampling with large sample fractions, but it is not strictly applicable to designs with unequal probability sampling. The Rao-Wu method is a special case of the Rao-Wu-Yue-Beaumont method and only applies to designs where the first stage of sampling is with replacement or where the first-stage sample fraction is small enough to be ignored. The Canty-Davison method is only applicable to single-stage designs with simple random sampling (with or without replacement) within strata. | Method | Handles Multistage Samples? | Appropriate for Without-Replacement Sampling with Large Sample Fractions? | Unequal Probability Sampling (PPS)? | |------------------|------------------|------------------|------------------| | Rao-Wu-Yue-Beaumont | Yes | Yes | Yes | | Rao-Wu | Only if first-stage sampling is without replacement or has a small sampling fraction | No | No | | Preston | Yes | Yes | No | | Canty-Davison | Only if first-stage sampling is without replacement or has a small sampling fraction | Yes | No | : Designs Covered by Each Bootstrap Method | Method | Strata IDs | Cluster IDs | Final Sampling Weights | Finite Population Corrections | Selection Probabilities by Stage | |------------|------------|------------|------------|------------|------------| | Rao-Wu-Yue-Beaumont | Yes | Yes | Yes | Yes, if sampling is without-replacement | If the design has unequal probability sampling (PPS) | | Rao-Wu | Yes (first stage only) | Yes (first stage only) | Yes | No | No | | Preston | Yes | Yes | Yes | Yes, if sampling is without replacement | No | | Canty-Davison | Yes (first-stage only) | Yes (first stage only) | Yes | Yes (first stage only), if sampling is without replacement | No | : Data Required for Each Bootstrap Method ## Implementation To implement these basic bootstrap methods, we can create a survey design object with the `svydesign()` function from the survey package, and then convert this object to a bootstrap replicate design using `as_bootstrap_design()`. This method can be used for multistage, stratified designs with one or more different kinds of sampling, provided the "Rao-Wu-Yue-Beaumont" method is used. **Example 1: Multistage Simple Random Sampling without Replacement (SRSWOR)** ```{r} library(survey) # For complex survey analysis library(svrep) set.seed(2022) # Load an example dataset from a multistage sample, with two stages of SRSWOR data("mu284", package = 'survey') multistage_srswor_design <- svydesign(data = mu284, ids = ~ id1 + id2, fpc = ~ n1 + n2) bootstrap_rep_design <- as_bootstrap_design(multistage_srswor_design, type = "Rao-Wu-Yue-Beaumont", replicates = 500) svytotal(x = ~ y1, design = multistage_srswor_design) svytotal(x = ~ y1, design = bootstrap_rep_design) ``` **Example 2: Single-stage unequal probability sampling without replacement** ```{r, eval=FALSE} # Load example dataset of U.S. counties and states with 2004 Presidential vote counts data("election", package = 'survey') pps_wor_design <- svydesign(data = election_pps, pps = HR(), fpc = ~ p, # Inclusion probabilities ids = ~ 1) bootstrap_rep_design <- as_bootstrap_design(pps_wor_design, type = "Rao-Wu-Yue-Beaumont", replicates = 100) svytotal(x = ~ Bush + Kerry, design = pps_wor_design) svytotal(x = ~ Bush + Kerry, design = bootstrap_rep_design) ``` **Example 3: Multistage Sampling with Different Sampling Methods at Each Stage** For designs that use different sampling methods at different stages, we can use the argument `samp_method_by_stage` to ensure that the correct method is used to form the bootstrap weights. In general, if a multistage design uses unequal probability sampling for only some stages, then when creating the initial design object, the stage-specific sampling probabilities should be supplied to the `fpc` argument of the `svydesign()` function, and the user should specify `pps = "brewer"`. ```{r} # Declare a multistage design # where first-stage probabilities are PPSWOR sampling # and second-stage probabilities are based on SRSWOR multistage_design <- svydesign( data = library_multistage_sample, ids = ~ PSU_ID + SSU_ID, probs = ~ PSU_SAMPLING_PROB + SSU_SAMPLING_PROB, pps = "brewer" ) # Convert to a bootstrap replicate design boot_design <- as_bootstrap_design( design = multistage_design, type = "Rao-Wu-Yue-Beaumont", samp_method_by_stage = c("PPSWOR", "SRSWOR"), replicates = 1000 ) # Compare variance estimates svytotal(x = ~ TOTCIR, na.rm = TRUE, design = multistage_design) svytotal(x = ~ TOTCIR, na.rm = TRUE, design = boot_design) ``` # Generalized Survey Bootstrap For sample designs with additional complex features beyond the three highlighted above, the generalized survey bootstrap method can be used. This is especially useful for systematic samples, two-phase samples, or complex designs where one wishes to use a general-purpose estimator such as the Horvitz-Thompson or Yates-Grundy estimator. ## Statistical Background The generalized survey bootstrap is based on a remarkable observation from @fay1984, summarized nicely by @dippo1984: > *...there is no variance estimator based on sums of squares and cross-products that cannot be represented by a resampling plan.* > > \-- @dippo1984 In other words, if a sample design has a textbook variance estimator for totals that can be represented as a quadratic form (i.e., sums of squares and cross-products), then we can make a replication estimator out of it. Fay developed a general methodology for producing replication estimators from a textbook estimator's quadratic form, encompassing the jackknife, bootstrap, and balanced repeated replication as special cases. Within this framework, the "generalized survey bootstrap" developed by @bertail1997 is one specific strategy for making bootstrap replication estimators out of textbook variance estimators. See @beaumont2012 for a clear overview of the generalized survey bootstrap. The starting point for implementing the generalized survey bootstrap method is to choose a textbook variance estimator appropriate to the sampling design which can be represented as a quadratic form. Luckily, there are many useful variance estimators which can be represented as quadratic forms. We highlight a few prominent examples below: - For stratified, multistage cluster samples: - The usual multistage variance estimator used in the 'survey' package, based on adding variance contributions from each stage. The estimator can be used for any number of sampling stages. - Highly-general variance estimators which work for any 'measurable' survey design (i.e., designs where every pair of units in the population has a nonzero probability of appearing in the sample). This covers most designs used in practice, with the primary exceptions being "one-PSU-per-stratum" designs and systematic sampling designs. - The Horvitz-Thompson estimator - The Sen-Yates-Grundy estimator - For systematic samples: - The SD1 and SD2 successive-differences estimators, which are the basis of the commonly-used "successive-differences replication" (SDR) estimator (see @ash2014 for an overview of SDR). - For two-phase samples: - The double-expansion variance estimator described in Section 9.3 of @sarndalModelAssistedSurvey1992. Once the textbook variance estimator has been selected and its quadratic form identified, the generalized survey bootstrap method consists of randomly generating each set of replicate weights from a multivariate distribution whose expectation is the $n$-vector $\mathbf{1}_n$ and whose variance-covariance matrix is the matrix of the quadratic form used for the textbook variance estimator. This ensures that, in expectation, the bootstrap variance estimator for a total equals the textbook variance estimator and thus inherits properties such as design-unbiasedness and design-consistency. ## Details and Notation for the Generalized Survey Bootstrap Method In this section, we describe the generalized survey bootstrap in greater detail, using the notation of @beaumont2012. ### Quadratic Forms Let $v( \hat{T_y})$ be the textbook variance estimator for an estimated population total $\hat{T}_y$ of some variable $y$. The base weight for case $i$ in our sample is $w_i$, and we let $\breve{y}_i$ denote the weighted value $w_iy_i$. Suppose we can represent our textbook variance estimator as a quadratic form: $v(\hat{T}_y) = \breve{y}\Sigma\breve{y}^T$, for some $n \times n$ matrix $\Sigma$. Our only constraint on $\Sigma$ is that, for our sample, it must be symmetric and positive semi-definite (in other words, it should never lead to a negative variance estimate, no matter what the value of $\breve{y}$ is). For example, the popular Horvitz-Thompson estimator based on first-order inclusion probabilities $\pi_k$ and second-order inclusion probabilities $\pi_{kl}$ can be represented as a positive semi-definite matrix with entries $(1-\pi_k)$ along the main diagonal and entries $(1 - \frac{\pi_k \pi_l}{\pi_{kl}})$ everywhere else. An illustration for a sample with $n=3$ is shown below: $$ \Sigma_{HT} = \begin{bmatrix} (1-\pi_1) & (1 - \frac{\pi_1 \pi_2}{\pi_{12}}) & (1 - \frac{\pi_1 \pi_3}{\pi_{13}}) \\ (1 - \frac{\pi_2 \pi_1}{\pi_{21}}) & (1 - \pi_2) & (1 - \frac{\pi_2 \pi_3}{\pi_{23}}) \\ (1 - \frac{\pi_3 \pi_1}{\pi_{31}}) & (1 - \frac{\pi_3 \pi_2}{\pi_{32}}) & (1 - \pi_3) \end{bmatrix} $$ As another example, the successive-difference variance estimator for a systematic sample can be represented as a positive semi-definite matrix whose diagonal entries are all $1$, whose superdiagonal and subdiagonal entries are all $-1/2$, and whose top right and bottom left entries are $-1/2$ [@ash2014]. An illustration for a sample with $n=5$ is shown below: $$ \Sigma_{SD2} = \begin{bmatrix} 1 & -1/2 & 0 & 0 & -1/2\\ -1/2 & 1 & -1/2 & 0 & 0 \\ 0 & -1/2 & 1 & -1/2 & 0 \\ 0 & 0 & -1/2 & 1& -1/2 \\ -1/2 & 0 & 0 & -1/2 & 1 \end{bmatrix} $$ To obtain a quadratic form matrix for a variance estimator, we can use the function `make_quad_form_matrix()`, which takes as inputs the name of a variance estimator and relevant survey design information. For example, the following code produces the quadratic form matrix of the "SD2" variance estimator we saw earlier. ```{r} make_quad_form_matrix( variance_estimator = "SD2", cluster_ids = c(1,2,3,4,5) |> data.frame(), strata_ids = c(1,1,1,1,1) |> data.frame(), sort_order = c(1,2,3,4,5) ) ``` In the following example, we use this method to estimate the variance for a stratified systematic sample of U.S. public libraries. First, we create a quadratic form matrix to represent the SD2 successive-difference estimator. This can be done by using the `svydesign()` function to describe the survey design and then using `get_design_quad_form()` to obtain the quadratic form for a specified variance estimator. ```{r, eval=TRUE} # Load an example dataset of a stratified systematic sample data('library_stsys_sample', package = 'svrep') # First, sort the rows in the order used in sampling library_stsys_sample <- library_stsys_sample |> dplyr::arrange(SAMPLING_SORT_ORDER) # Create a survey design object survey_design <- svydesign( data = library_stsys_sample, ids = ~ 1, strata = ~ SAMPLING_STRATUM, fpc = ~ STRATUM_POP_SIZE ) # Obtain the quadratic form for the target estimator sd2_quad_form <- get_design_quad_form( design = survey_design, variance_estimator = "SD2" ) class(sd2_quad_form) dim(sd2_quad_form) ``` Next, we estimate the sampling variance of the estimated total for the `TOTCIR` variable using the quadratic form. ```{r} # Obtain weighted values wtd_y <- as.matrix(library_stsys_sample[['LIBRARIA']] / library_stsys_sample[['SAMPLING_PROB']]) wtd_y[is.na(wtd_y)] <- 0 # Obtain point estimate for a population total point_estimate <- sum(wtd_y) # Obtain the variance estimate using the quadratic form variance_estimate <- t(wtd_y) %*% sd2_quad_form %*% wtd_y std_error <- sqrt(variance_estimate[1,1]) # Summarize results sprintf("Estimate: %s", round(point_estimate)) sprintf("Standard Error: %s", round(std_error)) ``` ### Forming Adjustment Factors Our goal is to form $B$ sets of bootstrap weights, where the $b$-th set of bootstrap weights is a vector of length $n$ denoted $\mathbf{a}^{(b)}$, whose $k$-th value is denoted $a_k^{(b)}$. This gives us $B$ replicate estimates of the population total, $\hat{T}_y^{*(b)}=\sum_{k \in s} a_k^{(b)} \breve{y}_k$, for $b=1, \ldots B$, from which we can easily calculate an estimate of the sampling variance. $$ v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B} $$ We can write this bootstrap variance estimator as a quadratic form: $$ \begin{aligned} v_B\left(\hat{T}_y\right) &=\mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}} \\ \textit{where}& \\ \boldsymbol{\Sigma}_B &= \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B} \end{aligned} $$ Note that if the vector of adjustment factors $\mathbf{a}^{(b)}$ has expectation $\mathbf{1}_n$ and variance-covariance matrix $\boldsymbol{\Sigma}$, then we have the bootstrap expectation $E_{*}\left( \boldsymbol{\Sigma}_B \right) = \boldsymbol{\Sigma}$. Since the bootstrap process takes the sample values $\breve{y}$ as fixed, the bootstrap expectation of the variance estimator is $E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right)= \mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}$. Thus, we can produce a bootstrap variance estimator with the same expectation as the textbook variance estimator simply by randomly generating $\mathbf{a}^{(b)}$ from a distribution with the following two conditions: [**Condition 1**]{.underline}: $\quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n$ [**Condition 2**]{.underline}: $\quad \mathbf{E}_*\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}=\mathbf{\Sigma}$ The simplest, general way to generate the adjustment factors is to simulate from a multivariate normal distribution $\mathbf{a} \sim MVN(\mathbf{1}_n, \boldsymbol{\Sigma})$, which is the method used in this package. However, this method can lead to negative adjustment factors and hence negative bootstrap weights, which--while perfectly valid for variance estimation--may be undesirable from a practical point of view. Thus, in the following subsection, we describe one method for adjusting the replicate factors so that they are nonnegative and $E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right) =\mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}$. ### Adjusting Generalized Survey Bootstrap Replicates to Avoid Negative Weights Let $\mathbf{A} = \left[ \mathbf{a}^{(1)} \cdots \mathbf{a}^{(b)} \cdots \mathbf{a}^{(B)} \right]$ denote the $(n \times B)$ matrix of bootstrap adjustment factors. To eliminate negative adjustment factors, @beaumont2012 propose forming a rescaled matrix of nonnegative replicate factors $\mathbf{A}^S$ by rescaling each adjustment factor $a_k^{(b)}$ as follows: $$ \begin{aligned} a_k^{S,(b)} &= \frac{a_k^{(b)} + \tau - 1}{\tau} \\ \textit{where } \tau &\geq 1 - a_k^{(b)} \geq 1 \\ &\textit{for all }k \textit{ in } \left\{ 1,\ldots,n \right\} \\ &\textit{and all }b \textit{ in } \left\{1, \ldots, B\right\} \\ \end{aligned} $$ The value of $\tau$ can be set based on the realized adjustment factor matrix $\mathbf{A}$ or by choosing $\tau$ prior to generating the adjustment factor matrix $\mathbf{A}$ so that $\tau$ is likely to be large enough to prevent negative bootstrap weights. If the adjustment factors are rescaled in this manner, it is important to adjust the scale factor used in estimating the variance with the bootstrap replicates, which becomes $\frac{\tau^2}{B}$ instead of $\frac{1}{B}$. $$ \begin{aligned} \textbf{Prior to rescaling: } v_B\left(\hat{T}_y\right) &= \frac{1}{B}\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2 \\ \textbf{After rescaling: } v_B\left(\hat{T}_y\right) &= \frac{\tau^2}{B}\sum_{b=1}^B\left(\hat{T}_y^{S*(b)}-\hat{T}_y\right)^2 \\ \end{aligned} $$ When sharing a dataset that uses rescaled weights from a generalized survey bootstrap, the documentation for the dataset should instruct the user to use replication scale factor $\frac{\tau^2}{B}$ rather than $\frac{1}{B}$ when estimating sampling variances. ## Implementation There are two ways to implement the generalized survey bootstrap. ### Option 1: Convert an existing design to a generalized bootstrap design The simplest method is to convert an existing survey design object to a generalized bootstrap design. With this approach, we create a survey design object using the `svydesign()` function. This allows us to represent information about stratification and clustering (potentially at multiple stages), as well as information about finite population corrections. ```{r} # Load example data from stratified systematic sample data('library_stsys_sample', package = 'svrep') # First, ensure data are sorted in same order as was used in sampling library_stsys_sample <- library_stsys_sample[ order(library_stsys_sample$SAMPLING_SORT_ORDER), ] # Create a survey design object design_obj <- svydesign( data = library_stsys_sample, strata = ~ SAMPLING_STRATUM, ids = ~ 1, fpc = ~ STRATUM_POP_SIZE ) ``` Next, we convert the survey design object to a replicate design using the function `as_gen_boot_design()`. The function argument `variance_estimator` allows us to specify the name of a variance estimator to use as the basis for creating replicate weights. ```{r} # Convert to generalized bootstrap replicate design gen_boot_design_sd2 <- as_gen_boot_design( design = design_obj, variance_estimator = "SD2", replicates = 2000 ) # Estimate sampling variances svymean(x = ~ TOTSTAFF, na.rm = TRUE, design = gen_boot_design_sd2) ``` For a PPS design that uses the Horvitz-Thompson or Yates-Grundy estimator, we can create a generalized bootstrap estimator with the same expectation. In the example below, we create a PPS design with the 'survey' package and convert it to a generalized bootstrap design. ```{r, eval=FALSE} # Load example data of a PPS survey of counties and states data('election', package = 'survey') # Create survey design object pps_design_ht <- svydesign( data = election_pps, id = ~1, fpc = ~p, pps = ppsmat(election_jointprob), variance = "HT" ) # Convert to generalized bootstrap replicate design gen_boot_design_ht <- pps_design_ht |> as_gen_boot_design(variance_estimator = "Horvitz-Thompson", replicates = 5000, tau = "auto") # Compare sampling variances from bootstrap vs. Horvitz-Thompson estimator svytotal(x = ~ Bush + Kerry, design = pps_design_ht) svytotal(x = ~ Bush + Kerry, design = gen_boot_design_ht) ``` We can also use the generalized bootstrap for designs that use multistage, stratified simple random sampling without replacement. ```{r} library(dplyr) # For data manipulation # Create a multistage survey design multistage_design <- svydesign( data = library_multistage_sample |> mutate(Weight = 1/SAMPLING_PROB), ids = ~ PSU_ID + SSU_ID, fpc = ~ PSU_POP_SIZE + SSU_POP_SIZE, weights = ~ Weight ) # Convert to a generalized bootstrap design multistage_boot_design <- as_gen_boot_design( design = multistage_design, variance_estimator = "Stratified Multistage SRS" ) # Compare variance estimates svytotal(x = ~ TOTCIR, na.rm = TRUE, design = multistage_design) svytotal(x = ~ TOTCIR, na.rm = TRUE, design = multistage_boot_design) ``` Unless specified otherwise, `as_gen_boot_design()` automatically selects a rescaling value $\tau$ to use for eliminating negative adjustment factors. The `scale` attribute of the resulting replicate survey design object is thus set to equal $\tau^2/B$. The specific value of $\tau$ can be retrieved from the replicate design object, as follows. ```{r} # View overall scale factor overall_scale_factor <- multistage_boot_design$scale print(overall_scale_factor) # Check that the scale factor was calculated correctly tau <- multistage_boot_design$tau print(tau) B <- ncol(multistage_boot_design$repweights) print(B) print( (tau^2) / B ) ``` ### Option 2: Create the quadratic form matrix and then use it to create bootstrap weights The generalized survey bootstrap can be implemented with a two-step process: - **Step 1**: Use `make_quad_form_matrix()` to represent the variance estimator as a quadratic form's matrix. ```{r} # Load an example dataset of a stratified systematic sample data('library_stsys_sample', package = 'svrep') # Represent the SD2 successive-difference estimator as a quadratic form, # and obtain the matrix of that quadratic form sd2_quad_form <- make_quad_form_matrix( variance_estimator = 'SD2', cluster_ids = library_stsys_sample |> select(FSCSKEY), strata_ids = library_stsys_sample |> select(SAMPLING_STRATUM), strata_pop_sizes = library_stsys_sample |> select(STRATUM_POP_SIZE), sort_order = library_stsys_sample |> pull("SAMPLING_SORT_ORDER") ) ``` - **Step 2**: Use `make_gen_boot_factors()` to generate replicate factors based on the target quadratic form. The function argument `tau` can be used to avoid negative adjustment factors using the previously-described method. ```{r} rep_adj_factors <- make_gen_boot_factors( Sigma = sd2_quad_form, num_replicates = 500, tau = "auto" ) ``` The actual value of `tau` used can be extracted from the function's output using the `attr()` function. ```{r} tau <- attr(rep_adj_factors, 'tau') B <- ncol(rep_adj_factors) ``` For convenience, the values to use for the `scale` and `rscales` arguments of `svrepdesign()` are included as attributes of the adjustment factors created by `make_gen_boot_factors()`. ```{r} # Retrieve value of 'scale' rep_adj_factors |> attr('scale') # Compare to manually-calculated value (tau^2) / B # Retrieve value of 'rscales' rep_adj_factors |> attr('rscales') |> head() # Only show first 5 values ``` Using the adjustment factors thus created, we can create a replicate survey design object by using the function `svrepdesign()`, with arguments `type = "other"` and specifying the `scale` argument to use the factor $\tau^2/B$. ```{r} gen_boot_design <- svrepdesign( data = library_stsys_sample |> mutate(SAMPLING_WEIGHT = 1/SAMPLING_PROB), repweights = rep_adj_factors, weights = ~ SAMPLING_WEIGHT, combined.weights = FALSE, type = "other", scale = attr(rep_adj_factors, 'scale'), rscales = attr(rep_adj_factors, 'rscales') ) ``` This allows us to estimate sampling variances, even for quite complex sampling designs. ```{r} gen_boot_design |> svymean(x = ~ TOTSTAFF, na.rm = TRUE, deff = TRUE) ``` # Choosing the Number of Bootstrap Replicates The bootstrap suffers from unavoidable "simulation error" (also referred to as "Monte Carlo" error) caused by using a finite number of replicates to simulate the ideal bootstrap estimate we would obtain if we used an infinite number of replicates. In general, the simulation error can be reduced by using a larger number of bootstrap replicates. ## General Strategy While there are many rule-of-thumb values for the number of replicates that should be used (some say 500, others say 1,000), it is advisable to instead use a principled strategy for choosing the number of replicates. One general strategy proposed by @beaumont2012 is as follows: - **Step 1**: Determine the largest acceptable level of simulation error for key survey estimates. For example, one might determine that, on average, the bootstrap standard error estimate should be no more than $\pm 5\%$ different than the ideal bootstrap estimate. - **Step 2**: Estimate key statistics of interest using a large number of bootstrap replicates (such as 5,000) and save the estimates from each bootstrap replicate. This can be conveniently done using a function from the 'survey' package such as `svymean(..., return.replicates = TRUE)` or `withReplicates(..., return.replicates = TRUE)`. - **Step 3**: Estimate the minimum number of bootstrap replicates needed to reduce the level of simulation error to the target level. This can be done using the 'svrep' function `estimate_boot_reps_for_target_cv()`. ## Measuring and Estimating Simulation Error Simulation error can be measured as a **"simulation coefficient of variation" (CV)**, which is the ratio of the standard error of a bootstrap estimator to the expectation of that bootstrap estimator, where the expectation and standard error are evaluated with respect to the bootstrapping process given the selected sample. For a statistic $\hat{\theta}$, the simulation CV of the bootstrap variance estimator $v_{B}(\hat{\theta})$ based on $B$ replicate estimates $\hat{\theta}^{\star}_1,\dots,\hat{\theta}^{\star}_B$ is defined as follows: ```{=tex} \begin{aligned} CV_{\star}(v_{B}(\hat{\theta})) &= \frac{\sqrt{var_{\star}(v_B(\hat{\theta}))}}{E_{\star}(v_B(\hat{\theta}))} = \frac{CV_{\star}(E_2)}{\sqrt{B}} \\ \textit{where:}& \\ E_2 &= (\hat{\theta}^{\star} - \hat{\theta})^2 \\ CV_{\star}(E_2) &= \frac{\sqrt{var_{\star}(E_2)}}{E_{\star}(E_2)} \\ \textit{and }& var_{\star} \text{ and } E_{\star} \textit{ are evaluated} \\ & \textit{with respect to the bootstrapping process} \\ & \textit{given the selected sample} \end{aligned} ``` The simulation CV of a statistic, denoted $CV_{\star}(v_{B}(\hat{\theta}))$, can be estimated for a given number of replicates $B$ by estimating $CV_{\star}(E_2)$ using observed values and dividing this by $\sqrt{B}$. As a result, one can thereby estimate the number of bootstrap replicates needed to obtain a target simulation CV, which is a useful strategy for determining the number of bootstrap replicates to use for a survey. With the 'svrep' package, it is possible to estimate the number of bootstrap replicates required to obtain a target simulation CV for a statistic. ```{r} library(survey) data('api', package = 'survey') # Declare a bootstrap survey design object ---- boot_design <- svydesign( data = apistrat, weights = ~pw, id = ~1, strata = ~stype, fpc = ~fpc ) |> svrep::as_bootstrap_design(replicates = 5000) # Produce estimates of interest, and save the estimate from each replicate ---- estimated_means_and_proportions <- svymean(x = ~ api00 + api99 + stype, design = boot_design, return.replicates = TRUE) # Estimate the number of replicates needed to obtain a target simulation CV ---- estimate_boot_reps_for_target_cv( svrepstat = estimated_means_and_proportions, target_cv = c(0.01, 0.05, 0.10) ) ``` To estimate the simulation CV for the current number of replicates used, it is possible to use the function `estimate_boot_sim_cv()`. ```{r} estimate_boot_sim_cv(estimated_means_and_proportions) ``` # The Bootstrap vs. Other Replication Methods > Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise. > > \-- John Tukey > > Ok, but what if the approximate answer is, like, *really* approximate or requires a whole lot of computing? > > \-- Survey sampling statisticians Survey bootstrap methods are directly applicable to a wider variety of sample designs than the jackknife or balanced repeated replication (BRR). Nonetheless, complex survey designs are often shoehorned into jackknife or BRR variance estimation by pretending that the actual survey design was something simpler. The BRR method, for instance, is only applicable to samples with two clusters sampled from each stratum, but statisticians frequently use it for designs with three or more sampled clusters by grouping the actual clusters into two pseudo-clusters. For designs with a large number of sampling units in a stratum, the exact jackknife (JK1 or JKn) requires a large number of replicates and so is often replaced by the "delete-a-group jackknife" (DAGJK) with clusters randomly grouped into larger pseudo-clusters. Why do statisticians go to all this effort to shoehorn their variance estimation problem into jackknife or BRR methods when they could just use the bootstrap? The simple answer is that bootstrap methods generally require many more replicates than other methods in order to obtain a stable variance estimate. And using a large number of replicates can be a problem if you have to do a large amount of computing or if your dataset is large and you're concerned about storage costs. Statistical agencies are particularly sensitive to these concerns when they publish microdata, since agencies often serve a large number of end-users with varying computational resources. **So why use the bootstrap?** 1. **The bootstrap tends to works well for a larger class of statistics than the jackknife.** For example, for estimating the sampling variance of an estimated median or other quantiles, the jackknife tends to perform poorly but bootstrap methods at least do an adequate job. 2. **Bootstrap methods enable different options for forming confidence intervals.** For all of the standard replication methods (BRR, Jackknife, etc.), confidence intervals are generally formed using a Wald interval ($\hat{\theta} \pm \hat{se}(\hat{\theta}) \times z_{1-\frac{\alpha}{2}}$).[^1] But for certain bootstrap methods, it is possible to also form confidence intervals using other approaches, such as the bootstrap percentile method. 3. **You can analyze your design rather than an approximation of your design, which can both reduce costs and better control errors**. To use BRR for general survey designs, you have to approximate your actual survey design with a "two PSUs per stratum" design. This works surprisingly well in many cases, but it requires careful work on the part of a specially-trained statistician. For the jackknife with a large number of sampling units, you either end up with the same number of replicates as a bootstrap method or you have to randomly group your sampling units into a smaller number so you can use the DAGJK method to essentially approximate your actual survey design with a simpler one. Again, this takes careful work on the part of a specially-trained statistician. 1. By analyzing your design for what it is, **you don't have to pay a specially-trained statistician to meticulously approximate a design so that it can be shoehorned into a jackknife or BRR variance estimation problem**, which is perhaps not the best use of a limited budget. 2. **If the variance estimation is based on a bootstrap method tailored to the actual survey design, then the replication error in the variance estimates for key statistics is unbiased and can be quantified and controlled as a function of the number of replicates**. In contrast, when variance estimation is based on approximating a design so that it can be shoehorned into a jackknife or BRR variance estimation problem, then the replication error in the variance estimates is more difficult to quantify and can consist of both noise and bias. 4. **For most statisticians, it's probably easier to learn.** The bootstrap is the most well-known replication method among general statisticians, to the point that it's often taught in first-year undergraduate statistics courses. So the basic idea is already familiar even to statisticians with only passing familiarity with complex survey sampling. BRR, in contrast, takes specialized training to learn and entails pre-requisite concepts such as Hadamard matrices, partial balancing, and so on. Outside of survey statistics, the jackknife tends to be much less used (and taught) compared to the bootstrap, due to its limitations with non-smooth statistics and the complexity required to make it work efficiently for large sample sizes. [^1]: Non-Wald methods are typically only used in practice for confidence intervals for proportions. These non-Wald methods are normally based on the point estimate, the standard error, and a measure of effective sample size. # References