--- title: "Nonresponse Adjustments" output: prettydoc::html_pretty: theme: architect highlight: github toc: true toc_depth: 4 vignette: > %\VignetteIndexEntry{Nonresponse Adjustments} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} \usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` In this short vignette, we'll demonstrate how the svrep package can be used to implement weighting adjustments for nonresponse in survey samples. For illustration, we'll use an example survey measuring Covid-19 vaccination status and a handful of demographic variables, based on a simple random sample of 1,000 residents of Louisville, Kentucky. ```{r, message=FALSE, warning=FALSE} library(dplyr) # For data manipulation library(survey) # For complex survey analysis library(srvyr) # For complex survey analysis with dplyr syntax library(svrep) # Load and inspect the data data("lou_vax_survey", package = 'svrep') head(lou_vax_survey) colnames(lou_vax_survey) ``` This vaccination survey has an overall response rate of 50.2%, which means that estimated vaccination rates may be significantly biased by nonresponse. We'll use nonresponse weighting adjustments to try and reduce potential nonresponse bias. ```{r} lou_vax_survey |> count(RESPONSE_STATUS) |> mutate(pct = n/sum(n)) ``` # Creating initial replicate weights To begin with, we'll create bootstrap replicate weights. In most cases, we can do this simply by describing the survey design using the `svydesign()` function and then using a function to create appropriate replicate weights. The function `as.svrepdesign()` from the 'survey' package can be used to create several types of replicate weights, using the argument `type` (with options `'JK1'`, `'JKn'`, `'bootstrap'`, `'BRR'`, `'Fay'`, etc.) In addition, the function `as_bootstrap_design()` can be used to create bootstrap weights using additional methods not supported in the 'survey' package. ```{r} # Describe the survey design lou_vax_survey <- svydesign(ids = ~ 1, weights = ~ SAMPLING_WEIGHT, data = lou_vax_survey) print(lou_vax_survey) # Create appropriate replicate weights lou_vax_survey <- lou_vax_survey |> as_bootstrap_design(replicates = 100, mse = TRUE, type = "Rao-Wu-Yue-Beaumont") print(lou_vax_survey) ``` For convenience, we'll convert this survey design object into an object with class `tbl_svy`, which allows us to use convenient tidyverse/dplyr syntax (`group_by()`, `summarize()`, etc.) as well as other helpful functions from the srvyr package. ```{r} lou_vax_survey <- lou_vax_survey |> as_survey() print(lou_vax_survey) ``` # Redistributing weight from nonrespondents to respondents A common form of nonresponse adjustment is to simply 'redistribute' weight from the nonrespondents to the respondents. In other words, the weight for each nonrespondent is set to $0$, and the weight for each respondent is increased by a factor greater than one so that the sum of adjusted weights in the sample of respondents equals the sum of unadjusted weights from the full sample. For example, if the sum of weights among respondents is $299,544.4$ and the sum of weights among nonrespondents is $297,157.6$, then a basic nonresponse adjustment would set the weights among nonrespondents to $0$ and multiply the weight for each respondent by an adjustment factor equal to $1 + (297,157.6/299,544.4)$. This type of adjustment is succinctly described in mathematical notation below. 
This type of adjustment is succinctly described in mathematical notation below.

```{=tex}
\begin{aligned}
w_i &= \textit{Original sampling weight for case }i \\
&= 1/\pi_i, \textit{ where } \pi_i \textit{ is the probability that case } i \textit{ had of being sampled}\\
f_{NR,i} &= \textit{Nonresponse adjustment factor for case } i \\
w_{NR, i} &= w_i \times f_{NR,i} = \textit{Weight for case }i \textit{ after nonresponse adjustment} \\
\\
\sum_{i \in s_{resp}} w_i &= \textit{Sum of sampling weights among respondents} \\
\sum_{i \in s_{nonresp}} w_i &= \textit{Sum of sampling weights among nonrespondents} \\
\\
f_{NR,i} &= \begin{cases}
0 & \textit{if case }i\textit{ is a nonrespondent} \\
1 + \frac{\sum_{i \in s_{nonresp}} w_i}{\sum_{i \in s_{resp}} w_i} & \textit{if case }i\textit{ is a respondent}
\end{cases}
\end{aligned}
```

We'll illustrate this type of adjustment with the Louisville vaccination survey. First, we'll inspect the sum of the sampling weights for respondents, nonrespondents, and the overall sample.

```{r}
# Weights before adjustment
lou_vax_survey |>
  group_by(RESPONSE_STATUS) |>
  cascade(
    `Sum of Weights` = sum(cur_svy_wts()),
    .fill = "TOTAL"
  )
```

Next, we'll redistribute weight from nonrespondents to respondents using the `redistribute_weights()` function, which adjusts the full-sample weights as well as each set of replicate weights. To specify which subset of data should have its weights reduced, we supply a logical expression to the argument `reduce_if`; to specify which subset should have its weights increased, we supply a logical expression to the argument `increase_if`.

```{r}
# Conduct a basic nonresponse adjustment
nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if   = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent"
  )
```

After making the adjustment, we can check that all of the weight from nonrespondents has been redistributed to respondents.

```{r}
# Check the sum of full-sample weights by response status
nr_adjusted_survey |>
  group_by(RESPONSE_STATUS) |>
  cascade(
    `Sum of Weights` = sum(cur_svy_wts()),
    .fill = "TOTAL"
  )
```

```{r}
# Check sums of replicate weights by response status
nr_adjusted_survey |>
  summarize_rep_weights(
    type = "specific",
    by = "RESPONSE_STATUS"
  ) |>
  arrange(Rep_Column, RESPONSE_STATUS) |>
  head(10)
```

# Conducting weighting class adjustments

Nonresponse bias is liable to occur if subpopulations differ systematically both in their response rates to the survey and in the outcome the survey is trying to measure (in this case, vaccination status). In our example, we can see some fairly large differences in response rates across race/ethnicity groups.

```{r}
lou_vax_survey |>
  group_by(RACE_ETHNICITY) |>
  summarize(Response_Rate = mean(RESPONSE_STATUS == "Respondent"),
            Sample_Size   = n(),
            n_Respondents = sum(RESPONSE_STATUS == "Respondent"))
```

Weighting adjustments may be able to help reduce nonresponse bias caused by these differences in response rates. One standard form of adjustment, known as **weighting class adjustment**, is to redistribute weights from nonrespondents to respondents separately within categories of auxiliary variables (such as race/ethnicity). The survey analysis textbook by Heeringa, West, and Berglund (2017) provides an excellent overview of weighting class adjustments.
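Conceptually, within each weighting class the respondents' weights are scaled up by a class-specific factor, exactly as in the basic adjustment but computed class by class. The sketch below (shown with `eval=FALSE` since it is purely illustrative, not a required step) spells out that factor using functions already introduced in this vignette; `redistribute_weights()` computes and applies it for us.

```{r, eval=FALSE}
# Illustration only: the class-specific factor applied to each respondent's
# weight is
#   (sum of all sampling weights in the class) /
#   (sum of sampling weights among the class's respondents),
# which equals 1 + (nonrespondent weight sum / respondent weight sum).
lou_vax_survey |>
  group_by(RACE_ETHNICITY) |>
  summarize(
    adjustment_factor = sum(cur_svy_wts()) /
      sum(cur_svy_wts() * (RESPONSE_STATUS == "Respondent"))
  )
```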
To implement a weighting class adjustment with the svrep package, we can simply use the `by` argument of `redistribute_weights()`.

```{r}
nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if   = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent",
    by = c("RACE_ETHNICITY")
  )
```

Multiple grouping variables may be supplied to the `by` argument. For example, one can specify `by = c("STRATUM", "RACE_ETHNICITY")` to redistribute weights separately by combinations of stratum and race/ethnicity category.

## Propensity cell adjustment

The popular method of forming weighting classes based on estimated response propensities (known as **propensity cell adjustment**) can also be used, for example by adding a variable `PROPENSITY_CELL` to the data and using `redistribute_weights(..., by = "PROPENSITY_CELL")`.

```{r}
# Fit a response propensity model
response_propensity_model <- lou_vax_survey |>
  mutate(IS_RESPONDENT = ifelse(RESPONSE_STATUS == "Respondent", 1, 0)) |>
  svyglm(formula = IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
         family = quasibinomial(link = 'logit'))

# Predict response propensities for individual cases
lou_vax_survey <- lou_vax_survey |>
  mutate(
    RESPONSE_PROPENSITY = predict(response_propensity_model,
                                  newdata = cur_svy(),
                                  type = "response")
  )

# Divide sample into propensity classes
lou_vax_survey <- lou_vax_survey |>
  mutate(PROPENSITY_CELL = ntile(x = RESPONSE_PROPENSITY, n = 5))

lou_vax_survey |>
  group_by(PROPENSITY_CELL) |>
  summarize(n    = n(),
            min  = min(RESPONSE_PROPENSITY),
            mean = mean(RESPONSE_PROPENSITY),
            max  = max(RESPONSE_PROPENSITY))

# Redistribute weights by propensity class
nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if   = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent",
    by = "PROPENSITY_CELL"
  )

# Inspect weights before adjustment
lou_vax_survey |>
  summarize_rep_weights(type = "specific",
                        by = c("PROPENSITY_CELL")) |>
  arrange(Rep_Column, PROPENSITY_CELL) |>
  select(PROPENSITY_CELL, Rep_Column, N_NONZERO, SUM) |>
  head(10)

# Inspect weights after adjustment
nr_adjusted_survey |>
  summarize_rep_weights(type = "specific",
                        by = c("PROPENSITY_CELL", "RESPONSE_STATUS")) |>
  arrange(Rep_Column, PROPENSITY_CELL, RESPONSE_STATUS) |>
  select(PROPENSITY_CELL, RESPONSE_STATUS, Rep_Column, N_NONZERO, SUM) |>
  head(10)
```

# Saving the final weights to a data file

Once we're satisfied with the weights, we can create a data frame with the analysis variables and columns of replicate weights. This format is easy to export to data files that can be loaded into R or other software later.

```{r}
data_frame_with_nr_adjusted_weights <- nr_adjusted_survey |>
  as_data_frame_with_weights(
    full_wgt_name  = "NR_ADJ_WGT",
    rep_wgt_prefix = "NR_ADJ_REP_WGT_"
  )

# Preview first few column names
colnames(data_frame_with_nr_adjusted_weights) |> head(12)
```

```{r, eval=FALSE}
# Write the data to a CSV file
write.csv(
  x = data_frame_with_nr_adjusted_weights,
  file = "survey-data-with-nonresponse-adjusted-weights.csv"
)
```
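If you later want to analyze the exported file in R, a replicate-weights design can be rebuilt with `svrepdesign()` from the 'survey' package. The sketch below is one way to do this, assuming the CSV file and the column names written above; rather than guessing the variance-scaling constants, it copies the `scale` and `rscales` values stored in the original design object.

```{r, eval=FALSE}
# Sketch: re-create a replicate-weights design from the exported CSV file.
# The scale constants must match how the replicate weights were created,
# so we copy them from the original design object instead of guessing.
reloaded_data <- read.csv("survey-data-with-nonresponse-adjusted-weights.csv")

reloaded_design <- svrepdesign(
  data       = reloaded_data,
  weights    = ~ NR_ADJ_WGT,
  repweights = "NR_ADJ_REP_WGT_",  # Regular expression matching the replicate-weight columns
  type       = "bootstrap",
  scale      = nr_adjusted_survey$scale,
  rscales    = nr_adjusted_survey$rscales,
  mse        = TRUE
)
```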
# Statistical background

The motivation for making this adjustment is that standard methods of statistical inference assume that every person in the population has a known, nonzero probability of participating in the survey (i.e., a nonzero chance of being sampled and a nonzero chance of responding if sampled), denoted $p_{i,overall}$. Basic results in survey sampling theory guarantee that if this assumption is true, we can produce unbiased estimates of population means and totals by weighting the data from each respondent by $1/p_{i,overall}$.

Crucially, the overall probability of participation $p_{i,overall}$ is the product of two components: the probability that a person is sampled (denoted $\pi_i$), and the probability that a person would respond to the survey if sampled (denoted $p_i$ and referred to as the **response propensity**). The sampling probability $\pi_i$ is known, since we control the method of sampling, but the response propensity $p_i$ is unknown and can only be estimated.

$$
\begin{aligned}
w^{*}_i &= 1/p_{i,overall} \text{ (weights needed for unbiased estimation)} \\
p_{i,overall} &= \pi_i \times p_i \\
\pi_i &= \textbf{Sampling probability} \\
&\textit{i.e. the probability that case }i\textit{ is randomly sampled } \text{ (}\textit{Known}\text{)} \\
p_i &= \textbf{Response propensity} \\
&\textit{i.e. the probability that case }i\textit{ responds, if sampled } \text{ (}\textit{Unknown}\text{)} \\
\end{aligned}
$$

The component $p_i$ must be estimated from data (yielding the estimate $\hat{p}_i$); nonresponse-adjusted weights for respondents can then be formed as $w_{NR,i} = 1/(\pi_i \times \hat{p}_i)$ and used to obtain approximately unbiased estimates of population means and totals. In our earlier notation, the nonresponse adjustment factor for respondents, $f_{NR,i}$, is defined using $1/\hat{p}_i$.

$$
\begin{aligned}
w_i &= \textit{Original sampling weight for case }i \\
&= 1/\pi_i, \textit{ where } \pi_i \textit{ is the probability that case } i \textit{ had of being sampled}\\
w_{NR, i} &= w_i \times f_{NR,i} = \textit{Weight for case }i \textit{ after nonresponse adjustment} \\
\\
f_{NR,i} &= \begin{cases}
0 & \text{if case } i \text{ is a nonrespondent} \\
1 / \hat{p}_i & \text{if case } i \text{ is a respondent} \\
\end{cases} \\
\hat{p}_i &= \textbf{Estimated response propensity}
\end{aligned}
$$

In essence, different nonresponse weighting adjustment methods differ in how they estimate $p_i$. The basic weight redistribution method in effect treats $p_i$ as constant across all $i$ and estimates it with the overall weighted response rate, $\hat{p}_i = \frac{\sum_{i \in s_{resp}}w_i}{\sum_{i \in s}w_i}$, which is then used to form the adjustment factor $f_{NR,i}$. Weighting class adjustments and propensity cell adjustments are more refined ways of forming $f_{NR,i}$: they estimate $p_i$ with a more realistic model in which $p_i$ is not constant across the entire sample but instead varies among weighting classes or propensity cells.

The reason for applying weighting adjustments not only to the full-sample weights but also to the replicate weights is to account for the nonresponse adjustment process when estimating sampling variances and inferential statistics such as confidence intervals. Because of random sampling, the adjustment factors used in the nonresponse adjustment will vary from one sample to the next, and applying the adjustments separately to each replicate reflects this variability. As we've seen in this vignette, the `redistribute_weights()` function handles this for us: the weight in each replicate is redistributed in the same manner as the full-sample weights.
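In practice, the adjusted design is used for estimation like any other replicate design, and the standard errors computed from the adjusted replicate weights then reflect the nonresponse adjustment. The sketch below assumes the vaccination status column in `lou_vax_survey` is named `VAX_STATUS`; substitute the actual variable name if it differs.

```{r, eval=FALSE}
# Sketch: estimate vaccination rates from the nonresponse-adjusted design.
# Assumes the vaccination status variable is named `VAX_STATUS`.
# Standard errors come from the adjusted replicate weights, so they
# reflect the variability of the nonresponse adjustment itself.
nr_adjusted_survey |>
  filter(RESPONSE_STATUS == "Respondent") |>  # Vaccination status is only observed for respondents
  group_by(VAX_STATUS) |>
  summarize(proportion = survey_mean())
```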
# Recommended Reading

- See Chapter 2, Section 2.7.3 of "Applied Survey Data Analysis" for a statistical explanation of the weighting adjustments described in this vignette.

> Heeringa, S., West, B., and Berglund, P. (2017). *Applied Survey Data Analysis*, 2nd edition. Boca Raton, FL: CRC Press.

- Chapter 13 of "Practical Tools for Designing and Weighting Survey Samples" also provides an excellent overview of nonresponse adjustment methods.

> Valliant, R., Dever, J., and Kreuter, F. (2018). *Practical Tools for Designing and Weighting Survey Samples*, 2nd edition. New York: Springer.