Based on a dictionary generator like msf_dict() or msf_dict_survey(),
this function will generate a randomized data set based on values defined in
the dictionaries. The randomized data set produced should mimic an Excel
export from DHIS2 for outbreaks and a Kobo export for surveys.
Arguments
- dictionary
Specify which dictionary you would like to use.
- varnames
Specify the name of the column that contains variable names. If dictionary is a survey, varnames needs to be "name".
- numcases
Specify the number of cases you want (default is 300).
- org
The organization the dictionary belongs to. Currently, only MSF is available. In the future, dictionaries from WHO and other organizations may become available.
Value
A data frame with cases in rows and variables in columns. The number of columns will vary from dictionary to dictionary, so please use the dictionary functions to generate a corresponding dictionary.
Examples
if (require("dplyr") && require("matchmaker")) {
withAutoprint({
# You will often want to use MSF dictionaries to translate codes to human-
# readable variables. Here, we generate a data set of 20 cases:
dat <- gen_data(
dictionary = "Cholera",
varnames = "data_element_shortname",
numcases = 20,
org = "MSF"
)
print(dat)
# We want the expanded dictionary, so we will select `compact = FALSE`
dict <- msf_dict(disease = "Cholera", long = TRUE, compact = FALSE, tibble = TRUE)
print(dict)
# Now we can use matchmaker to translate the codes in the data:
dat_clean <- matchmaker::match_df(dat, dict,
from = "option_code",
to = "option_name",
by = "data_element_shortname",
order = "option_order_in_set"
)
print(dat_clean)
})
}
#> Loading required package: dplyr
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
#> Loading required package: matchmaker
#> > dat <- gen_data(dictionary = "Cholera", varnames = "data_element_shortname",
#> + numcases = 20, org = "MSF")
#> > print(dat)
#> # A tibble: 20 × 45
#> case_number date_of_c…¹ patie…² age_y…³ age_m…⁴ age_d…⁵ sex pregn…⁶ trime…⁷
#> <chr> <date> <chr> <int> <int> <int> <fct> <fct> <fct>
#> 1 A1 2018-02-09 Villag… 83 NA NA M NA NA
#> 2 A2 2018-04-07 Villag… 78 NA NA U NA NA
#> 3 A3 2018-04-01 Villag… 30 NA NA F N NA
#> 4 A4 2018-04-23 Villag… 21 NA NA M NA NA
#> 5 A5 2018-03-18 Villag… 13 NA NA F Y 1
#> 6 A6 2018-03-18 Villag… 73 NA NA M NA NA
#> 7 A7 2018-03-27 Villag… 53 NA NA M NA NA
#> 8 A8 2018-02-14 Villag… 27 NA NA U NA NA
#> 9 A9 2018-04-10 Villag… 63 NA NA U NA NA
#> 10 A10 2018-04-16 Villag… 40 NA NA U NA NA
#> 11 A11 2018-02-23 Villag… 18 NA NA M NA NA
#> 12 A12 2018-01-12 Villag… 30 NA NA U NA NA
#> 13 A13 2018-03-28 Villag… 10 NA NA F Y 3
#> 14 A14 2018-04-24 Villag… 20 NA NA F N NA
#> 15 A15 2018-01-27 Villag… NA 9 NA F NA NA
#> 16 A16 2018-02-20 Villag… 57 NA NA M NA NA
#> 17 A17 2018-02-15 Villag… NA 12 NA U NA NA
#> 18 A18 2018-01-01 Villag… 15 NA NA U NA NA
#> 19 A19 2018-04-10 Villag… 34 NA NA F W NA
#> 20 A20 2018-03-20 Villag… 8 NA NA F N NA
#> # … with 36 more variables: foetus_alive_at_admission <fct>, exit_status <fct>,
#> # date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> # previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> # readmission <fct>, msf_involvement <fct>,
#> # cholera_treatment_facility_type <fct>, residential_status_brief <fct>,
#> # date_of_last_vaccination <date>, prescribed_zinc_supplement <fct>,
#> # prescribed_antibiotics <fct>, ors_consumed_litres <int>, …
#> > dict <- msf_dict(disease = "Cholera", long = TRUE, compact = FALSE, tibble = TRUE)
#> > print(dict)
#> # A tibble: 182 × 11
#> data_elemen…¹ data_…² data_…³ data_…⁴ data_…⁵ data_…⁶ used_…⁷ optio…⁸ optio…⁹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AafTlSwliVQ egen_0… case_n… Anonym… TEXT Case n… NA NA NA
#> 2 OTGOtWBz39J egen_0… date_o… Date p… DATE Date o… NA NA NA
#> 3 wnmMr2V3T3u egen_0… patien… Locati… ORGANI… Patien… NA NA NA
#> 4 sbgqjeVwtb8 egen_0… age_ye… Age of… INTEGE… Age in… NA NA NA
#> 5 eXYhovYyl61 egen_0… age_mo… Age of… INTEGE… Age in… NA NA NA
#> 6 UrYJSk2Wp46 egen_0… age_da… Age of… INTEGE… Age in… NA NA NA
#> 7 D1Ky5K7pFN6 egen_0… sex Sex of… TEXT Sex orgc5Y… M Male
#> 8 D1Ky5K7pFN6 egen_0… sex Sex of… TEXT Sex orgc5Y… F Female
#> 9 D1Ky5K7pFN6 egen_0… sex Sex of… TEXT Sex orgc5Y… U Unknow…
#> 10 dTm5R53YYXC egen_0… pregna… Pregna… TEXT Pregna… IEjzG2… N Not cu…
#> # … with 172 more rows, 2 more variables: option_uid <chr>,
#> # option_order_in_set <dbl>, and abbreviated variable names
#> # ¹data_element_uid, ²data_element_name, ³data_element_shortname,
#> # ⁴data_element_description, ⁵data_element_valuetype, ⁶data_element_formname,
#> # ⁷used_optionset_uid, ⁸option_code, ⁹option_name
#> > dat_clean <- matchmaker::match_df(dat, dict, from = "option_code", to = "option_name",
#> + by = "data_element_shortname", order = "option_order_in_set")
#> > print(dat_clean)
#> # A tibble: 20 × 45
#> case_number date_of_c…¹ patie…² age_y…³ age_m…⁴ age_d…⁵ sex pregn…⁶ trime…⁷
#> <chr> <date> <chr> <int> <int> <int> <fct> <fct> <fct>
#> 1 A1 2018-02-09 Villag… 83 NA NA Male Not ap… NA
#> 2 A2 2018-04-07 Villag… 78 NA NA Unkn… Not ap… NA
#> 3 A3 2018-04-01 Villag… 30 NA NA Fema… Not cu… NA
#> 4 A4 2018-04-23 Villag… 21 NA NA Male Not ap… NA
#> 5 A5 2018-03-18 Villag… 13 NA NA Fema… Yes, c… 1st tr…
#> 6 A6 2018-03-18 Villag… 73 NA NA Male Not ap… NA
#> 7 A7 2018-03-27 Villag… 53 NA NA Male Not ap… NA
#> 8 A8 2018-02-14 Villag… 27 NA NA Unkn… Not ap… NA
#> 9 A9 2018-04-10 Villag… 63 NA NA Unkn… Not ap… NA
#> 10 A10 2018-04-16 Villag… 40 NA NA Unkn… Not ap… NA
#> 11 A11 2018-02-23 Villag… 18 NA NA Male Not ap… NA
#> 12 A12 2018-01-12 Villag… 30 NA NA Unkn… Not ap… NA
#> 13 A13 2018-03-28 Villag… 10 NA NA Fema… Yes, c… 3rd tr…
#> 14 A14 2018-04-24 Villag… 20 NA NA Fema… Not cu… NA
#> 15 A15 2018-01-27 Villag… NA 9 NA Fema… Not ap… NA
#> 16 A16 2018-02-20 Villag… 57 NA NA Male Not ap… NA
#> 17 A17 2018-02-15 Villag… NA 12 NA Unkn… Not ap… NA
#> 18 A18 2018-01-01 Villag… 15 NA NA Unkn… Not ap… NA
#> 19 A19 2018-04-10 Villag… 34 NA NA Fema… Was pr… NA
#> 20 A20 2018-03-20 Villag… 8 NA NA Fema… Not cu… NA
#> # … with 36 more variables: foetus_alive_at_admission <fct>, exit_status <fct>,
#> # date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> # previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> # readmission <fct>, msf_involvement <fct>,
#> # cholera_treatment_facility_type <fct>, residential_status_brief <fct>,
#> # date_of_last_vaccination <date>, prescribed_zinc_supplement <fct>,
#> # prescribed_antibiotics <fct>, ors_consumed_litres <int>, …