Add a column of cluster survey weights to a data frame.

Use this for surveys in which the sample was drawn from a larger source population using a cluster survey design.

add_weights_cluster(
  x,
  cl,
  eligible,
  interviewed,
  cluster_x = NULL,
  cluster_cl = NULL,
  household_x = NULL,
  household_cl = NULL,
  ignore_cluster = TRUE,
  ignore_household = TRUE,
  surv_weight = "surv_weight",
  surv_weight_ID = "surv_weight_ID"
)

Arguments

x: a data frame of survey data
cl: a data frame containing a list of clusters and the number of households in each.
eligible: the column in x which specifies the number of people eligible to be interviewed in that household (e.g. the total number of children).
interviewed: the column in x which specifies the number of people actually interviewed in that household.
cluster_x: the column in x that indicates which cluster each row belongs to. Ignored if ignore_cluster is TRUE.
cluster_cl: the column in cl that lists all possible clusters. Ignored if ignore_cluster is TRUE.
household_x: the column in x that indicates a unique household identifier. Ignored if ignore_household is TRUE.
household_cl: the column in cl that lists the number of households per cluster. Ignored if ignore_household is TRUE.
ignore_cluster: If TRUE (default), set the weight for clusters to 1. This assumes that your sample of clusters was taken in a way that closely approximates a simple random sample. Inputs from cluster_cl and cluster_x are ignored.
ignore_household: If TRUE (default), set the weight for households to 1. This assumes that your sample of households was taken in a way that closely approximates a simple random sample. Inputs from household_cl and household_x are ignored.
surv_weight: the name of the new column to store the weights. Defaults to "surv_weight".
surv_weight_ID: the name of the new ID column to be created. Defaults to "surv_weight_ID".

Details

The weight for each row is the product of three inverse selection probabilities: that of a cluster being selected, that of a household being selected within its cluster, and that of an individual being selected within their household. That is:

((clusters available) / (clusters surveyed)) *
((households in each cluster) / (households surveyed in each cluster)) *
((individuals eligible in each household) / (individuals interviewed))

When both ignore_cluster and ignore_household are TRUE, this reduces to:

1 * 1 * ((individuals eligible in each household) / (individuals interviewed))
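
As a rough illustration, the base R sketch below works through the three factors by hand, using a cut-down copy of the data from the Examples section. It is not the function's actual implementation. The names cluster_w, household_w, individual_w and manual_weight are invented for this illustration, and the sketch assumes that "clusters surveyed" and "households surveyed" are counted as the distinct values that appear in x.

```r
# Cut-down copies of the example data (only the columns needed for weighting)
x <- data.frame(
  cluster      = c(rep("Village A", 5), rep("Village B", 3)),
  household_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  eligible_n   = c(6, 6, 6, 6, 6, 3, 3, 3),
  surveyed_n   = c(4, 4, 4, 4, 4, 3, 3, 3),
  stringsAsFactors = FALSE
)
cl <- data.frame(
  cluster  = c("Village A", "Village B", "Village C", "Village D"),
  n_houses = c(23, 42, 56, 38),
  stringsAsFactors = FALSE
)

# Cluster level: (clusters available) / (clusters surveyed)
# Assumption: surveyed clusters are the distinct clusters present in x
cluster_w <- nrow(cl) / length(unique(x$cluster))

# Household level: (households in each cluster) / (households surveyed in each cluster)
# Assumption: surveyed households are the distinct household IDs seen per cluster
hh_surveyed <- tapply(x$household_id, x$cluster, function(h) length(unique(h)))
hh_total    <- cl$n_houses[match(x$cluster, cl$cluster)]
household_w <- hh_total / as.vector(hh_surveyed[x$cluster])

# Individual level: (individuals eligible) / (individuals interviewed)
individual_w <- x$eligible_n / x$surveyed_n

# Product of the three factors, one value per row of x
manual_weight <- cluster_w * household_w * individual_w
manual_weight
```

For Village A this gives (4 / 2) * (23 / 2) * (6 / 4) = 34.5, and for Village B (4 / 2) * (42 / 1) * (3 / 3) = 84, which agrees with the surv_weight column in the first example output. Setting cluster_w and household_w to 1 leaves only the individual-level factor, 1.5 and 1 respectively.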

Author

Alex Spina, Zhian N. Kamvar, Lukas Richter

Examples



# define a fake dataset of survey data
# including household and individual information
x <- data.frame(stringsAsFactors=FALSE,
         cluster = c("Village A", "Village A", "Village A", "Village A",
                     "Village A", "Village B", "Village B", "Village B"),
    household_id = c(1, 1, 1, 1, 2, 2, 2, 2),
      eligible_n = c(6, 6, 6, 6, 6, 3, 3, 3),
      surveyed_n = c(4, 4, 4, 4, 4, 3, 3, 3),
   individual_id = c(1, 2, 3, 4, 4, 1, 2, 3),
         age_grp = c("0-10", "20-30", "30-40", "50-60", "50-60", "20-30",
                     "50-60", "30-40"),
             sex = c("Male", "Female", "Male", "Female", "Female", "Male",
                     "Female", "Female"),
         outcome = c("Y", "Y", "N", "N", "N", "N", "N", "Y")
)

# define a fake dataset of cluster listings
# including cluster names and number of households
cl <- tibble::tribble(
     ~cluster, ~n_houses,
  "Village A",        23,
  "Village B",        42,
  "Village C",        56,
  "Village D",        38
)


# add weights to a cluster sample
# include weights for cluster, household and individual levels
add_weights_cluster(x, cl = cl,
                    eligible = eligible_n,
                    interviewed = surveyed_n,
                    cluster_cl = cluster, household_cl = n_houses,
                    cluster_x = cluster,  household_x = household_id,
                    ignore_cluster = FALSE, ignore_household = FALSE)
#>     cluster household_id eligible_n surveyed_n individual_id age_grp    sex
#> 1 Village A            1          6          4             1    0-10   Male
#> 2 Village A            1          6          4             2   20-30 Female
#> 3 Village A            1          6          4             3   30-40   Male
#> 4 Village A            1          6          4             4   50-60 Female
#> 5 Village A            2          6          4             4   50-60 Female
#> 6 Village B            2          3          3             1   20-30   Male
#> 7 Village B            2          3          3             2   50-60 Female
#> 8 Village B            2          3          3             3   30-40 Female
#>   outcome surv_weight surv_weight_ID
#> 1       Y        34.5    Village A_1
#> 2       Y        34.5    Village A_1
#> 3       N        34.5    Village A_1
#> 4       N        34.5    Village A_1
#> 5       N        34.5    Village A_2
#> 6       N        84.0    Village B_2
#> 7       N        84.0    Village B_2
#> 8       Y        84.0    Village B_2


# add weights to a cluster sample
# ignore weights for cluster and household level (set equal to 1)
# only include weights at individual level
add_weights_cluster(x, cl = cl,
                    eligible = eligible_n,
                    interviewed = surveyed_n,
                    cluster_cl = cluster, household_cl = n_houses,
                    cluster_x = cluster,  household_x = household_id,
                    ignore_cluster = TRUE, ignore_household = TRUE)
#>     cluster household_id eligible_n surveyed_n individual_id age_grp    sex
#> 1 Village A            1          6          4             1    0-10   Male
#> 2 Village A            1          6          4             2   20-30 Female
#> 3 Village A            1          6          4             3   30-40   Male
#> 4 Village A            1          6          4             4   50-60 Female
#> 5 Village A            2          6          4             4   50-60 Female
#> 6 Village B            2          3          3             1   20-30   Male
#> 7 Village B            2          3          3             2   50-60 Female
#> 8 Village B            2          3          3             3   30-40 Female
#>   outcome surv_weight surv_weight_ID
#> 1       Y         1.5    Village A_1
#> 2       Y         1.5    Village A_1
#> 3       N         1.5    Village A_1
#> 4       N         1.5    Village A_1
#> 5       N         1.5    Village A_2
#> 6       N         1.0    Village B_2
#> 7       N         1.0    Village B_2
#> 8       Y         1.0    Village B_2