Given a dictionary generator such as msf_dict() or msf_dict_survey(), this function generates a randomized data set from the values defined in that dictionary. The result is meant to mimic an Excel export from DHIS2 for outbreak dictionaries and a Kobo export for survey dictionaries.
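
To make the idea of dictionary-driven random data concrete, here is a minimal base R sketch. It is not the implementation used by gen_data(): the toy dictionary, its values, and the helper `toy_gen_data()` are all hypothetical, and the real function handles dates, ages, and other value types that this sketch ignores.

```r
# Toy dictionary in long format: one row per (variable, option) pair.
# All names and values are hypothetical, not taken from a real MSF dictionary.
toy_dict <- data.frame(
  data_element_shortname = c("sex", "sex", "sex", "exit_status", "exit_status"),
  option_code            = c("M", "F", "U", "DD", "DH"),
  stringsAsFactors       = FALSE
)

# Hypothetical helper: draw `numcases` random codes for every variable
# listed in the `varnames` column of the dictionary.
toy_gen_data <- function(dictionary,
                         varnames = "data_element_shortname",
                         numcases = 300) {
  # Group the allowed codes by variable name
  codes_by_var <- split(dictionary$option_code, dictionary[[varnames]])
  # Sample with replacement from each variable's allowed codes
  cols <- lapply(codes_by_var, function(codes) {
    factor(sample(codes, numcases, replace = TRUE), levels = codes)
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

set.seed(1)
toy <- toy_gen_data(toy_dict, numcases = 20)
str(toy)
```

Every value in the toy data set is one of the codes the dictionary allows for that variable, which is the property that makes the output usable for testing recoding and analysis scripts.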

Usage

gen_data(
  dictionary,
  varnames = "data_element_shortname",
  numcases = 300,
  org = "MSF"
)

Arguments

dictionary

Specify which dictionary you would like to use.

varnames

Specify the name of the column that contains variable names. If the dictionary is a survey, varnames needs to be "name".

numcases

Specify the number of cases you want (default is 300).

org

The organization the dictionary belongs to. Currently, only MSF is supported. In the future, dictionaries from WHO and other organizations may become available.

Value

A data frame with cases in rows and variables in columns. The number of columns varies from dictionary to dictionary, so use the dictionary functions to generate the corresponding dictionary.
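
Because the column set depends on the dictionary, it can help to compare the generated columns against the dictionary's variable names before analysis. The sketch below uses small hypothetical stand-ins (`toy_data`, `toy_dict`) rather than real gen_data() and msf_dict() output, so it makes no claim about how closely the real objects line up.

```r
# Hypothetical stand-ins for a generated data set and its dictionary
toy_data <- data.frame(
  case_number      = c("A1", "A2"),
  sex              = c("M", "F"),
  stringsAsFactors = FALSE
)
toy_dict <- data.frame(
  data_element_shortname = c("case_number", "sex", "sex", "age_years"),
  option_code            = c(NA, "M", "F", NA),
  stringsAsFactors       = FALSE
)

# One entry per variable described by the dictionary
dict_vars <- unique(toy_dict$data_element_shortname)

# Data columns that the dictionary does not describe
not_in_dict <- setdiff(names(toy_data), dict_vars)

# Dictionary variables that are missing from the data
not_in_data <- setdiff(dict_vars, names(toy_data))

print(not_in_dict)
print(not_in_data)
```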

Examples


if (require("dplyr") && require("matchmaker")) {
  withAutoprint({

    # You will often want to use MSF dictionaries to translate codes to human-
    # readable variables. Here, we generate a data set of 20 cases:
    dat <- gen_data(
      dictionary = "Cholera",
      varnames = "data_element_shortname",
      numcases = 20,
      org = "MSF"
    )
    print(dat)

    # We want the expanded dictionary, so we set `compact = FALSE`
    dict <- msf_dict(disease = "Cholera", long = TRUE, compact = FALSE, tibble = TRUE)
    print(dict)

    # Now we can use matchmaker to translate the codes into readable labels:
    dat_clean <- matchmaker::match_df(dat, dict,
      from = "option_code",
      to = "option_name",
      by = "data_element_shortname",
      order = "option_order_in_set"
    )
    print(dat_clean)

  })
}
#> Loading required package: dplyr
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: matchmaker
#> > dat <- gen_data(dictionary = "Cholera", varnames = "data_element_shortname", 
#> +     numcases = 20, org = "MSF")
#> > print(dat)
#> # A tibble: 20 × 45
#>    case_number date_of_c…¹ patie…² age_y…³ age_m…⁴ age_d…⁵ sex   pregn…⁶ trime…⁷
#>    <chr>       <date>      <chr>     <int>   <int>   <int> <fct> <fct>   <fct>  
#>  1 A1          2018-02-09  Villag…      83      NA      NA M     NA      NA     
#>  2 A2          2018-04-07  Villag…      78      NA      NA U     NA      NA     
#>  3 A3          2018-04-01  Villag…      30      NA      NA F     N       NA     
#>  4 A4          2018-04-23  Villag…      21      NA      NA M     NA      NA     
#>  5 A5          2018-03-18  Villag…      13      NA      NA F     Y       1      
#>  6 A6          2018-03-18  Villag…      73      NA      NA M     NA      NA     
#>  7 A7          2018-03-27  Villag…      53      NA      NA M     NA      NA     
#>  8 A8          2018-02-14  Villag…      27      NA      NA U     NA      NA     
#>  9 A9          2018-04-10  Villag…      63      NA      NA U     NA      NA     
#> 10 A10         2018-04-16  Villag…      40      NA      NA U     NA      NA     
#> 11 A11         2018-02-23  Villag…      18      NA      NA M     NA      NA     
#> 12 A12         2018-01-12  Villag…      30      NA      NA U     NA      NA     
#> 13 A13         2018-03-28  Villag…      10      NA      NA F     Y       3      
#> 14 A14         2018-04-24  Villag…      20      NA      NA F     N       NA     
#> 15 A15         2018-01-27  Villag…      NA       9      NA F     NA      NA     
#> 16 A16         2018-02-20  Villag…      57      NA      NA M     NA      NA     
#> 17 A17         2018-02-15  Villag…      NA      12      NA U     NA      NA     
#> 18 A18         2018-01-01  Villag…      15      NA      NA U     NA      NA     
#> 19 A19         2018-04-10  Villag…      34      NA      NA F     W       NA     
#> 20 A20         2018-03-20  Villag…       8      NA      NA F     N       NA     
#> # … with 36 more variables: foetus_alive_at_admission <fct>, exit_status <fct>,
#> #   date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> #   previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> #   readmission <fct>, msf_involvement <fct>,
#> #   cholera_treatment_facility_type <fct>, residential_status_brief <fct>,
#> #   date_of_last_vaccination <date>, prescribed_zinc_supplement <fct>,
#> #   prescribed_antibiotics <fct>, ors_consumed_litres <int>, …
#> > dict <- msf_dict(disease = "Cholera", long = TRUE, compact = FALSE, tibble = TRUE)
#> > print(dict)
#> # A tibble: 182 × 11
#>    data_elemen…¹ data_…² data_…³ data_…⁴ data_…⁵ data_…⁶ used_…⁷ optio…⁸ optio…⁹
#>    <chr>         <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#>  1 AafTlSwliVQ   egen_0… case_n… Anonym… TEXT    Case n… NA      NA      NA     
#>  2 OTGOtWBz39J   egen_0… date_o… Date p… DATE    Date o… NA      NA      NA     
#>  3 wnmMr2V3T3u   egen_0… patien… Locati… ORGANI… Patien… NA      NA      NA     
#>  4 sbgqjeVwtb8   egen_0… age_ye… Age of… INTEGE… Age in… NA      NA      NA     
#>  5 eXYhovYyl61   egen_0… age_mo… Age of… INTEGE… Age in… NA      NA      NA     
#>  6 UrYJSk2Wp46   egen_0… age_da… Age of… INTEGE… Age in… NA      NA      NA     
#>  7 D1Ky5K7pFN6   egen_0… sex     Sex of… TEXT    Sex     orgc5Y… M       Male   
#>  8 D1Ky5K7pFN6   egen_0… sex     Sex of… TEXT    Sex     orgc5Y… F       Female 
#>  9 D1Ky5K7pFN6   egen_0… sex     Sex of… TEXT    Sex     orgc5Y… U       Unknow…
#> 10 dTm5R53YYXC   egen_0… pregna… Pregna… TEXT    Pregna… IEjzG2… N       Not cu…
#> # … with 172 more rows, 2 more variables: option_uid <chr>,
#> #   option_order_in_set <dbl>, and abbreviated variable names
#> #   ¹​data_element_uid, ²​data_element_name, ³​data_element_shortname,
#> #   ⁴​data_element_description, ⁵​data_element_valuetype, ⁶​data_element_formname,
#> #   ⁷​used_optionset_uid, ⁸​option_code, ⁹​option_name
#> > dat_clean <- matchmaker::match_df(dat, dict, from = "option_code", to = "option_name", 
#> +     by = "data_element_shortname", order = "option_order_in_set")
#> > print(dat_clean)
#> # A tibble: 20 × 45
#>    case_number date_of_c…¹ patie…² age_y…³ age_m…⁴ age_d…⁵ sex   pregn…⁶ trime…⁷
#>    <chr>       <date>      <chr>     <int>   <int>   <int> <fct> <fct>   <fct>  
#>  1 A1          2018-02-09  Villag…      83      NA      NA Male  Not ap… NA     
#>  2 A2          2018-04-07  Villag…      78      NA      NA Unkn… Not ap… NA     
#>  3 A3          2018-04-01  Villag…      30      NA      NA Fema… Not cu… NA     
#>  4 A4          2018-04-23  Villag…      21      NA      NA Male  Not ap… NA     
#>  5 A5          2018-03-18  Villag…      13      NA      NA Fema… Yes, c… 1st tr…
#>  6 A6          2018-03-18  Villag…      73      NA      NA Male  Not ap… NA     
#>  7 A7          2018-03-27  Villag…      53      NA      NA Male  Not ap… NA     
#>  8 A8          2018-02-14  Villag…      27      NA      NA Unkn… Not ap… NA     
#>  9 A9          2018-04-10  Villag…      63      NA      NA Unkn… Not ap… NA     
#> 10 A10         2018-04-16  Villag…      40      NA      NA Unkn… Not ap… NA     
#> 11 A11         2018-02-23  Villag…      18      NA      NA Male  Not ap… NA     
#> 12 A12         2018-01-12  Villag…      30      NA      NA Unkn… Not ap… NA     
#> 13 A13         2018-03-28  Villag…      10      NA      NA Fema… Yes, c… 3rd tr…
#> 14 A14         2018-04-24  Villag…      20      NA      NA Fema… Not cu… NA     
#> 15 A15         2018-01-27  Villag…      NA       9      NA Fema… Not ap… NA     
#> 16 A16         2018-02-20  Villag…      57      NA      NA Male  Not ap… NA     
#> 17 A17         2018-02-15  Villag…      NA      12      NA Unkn… Not ap… NA     
#> 18 A18         2018-01-01  Villag…      15      NA      NA Unkn… Not ap… NA     
#> 19 A19         2018-04-10  Villag…      34      NA      NA Fema… Was pr… NA     
#> 20 A20         2018-03-20  Villag…       8      NA      NA Fema… Not cu… NA     
#> # … with 36 more variables: foetus_alive_at_admission <fct>, exit_status <fct>,
#> #   date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> #   previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> #   readmission <fct>, msf_involvement <fct>,
#> #   cholera_treatment_facility_type <fct>, residential_status_brief <fct>,
#> #   date_of_last_vaccination <date>, prescribed_zinc_supplement <fct>,
#> #   prescribed_antibiotics <fct>, ors_consumed_litres <int>, …
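
The recoding step in the example above can be pictured with a small base R sketch. This is an illustration of a dictionary lookup under simplifying assumptions, not how matchmaker::match_df() is implemented; `toy_data` and `toy_dict` are hypothetical and only borrow the column names used in the example.

```r
# Hypothetical coded data and a matching long-format dictionary
toy_data <- data.frame(
  case_number      = c("A1", "A2", "A3"),
  sex              = c("M", "U", "F"),
  stringsAsFactors = FALSE
)
toy_dict <- data.frame(
  data_element_shortname = c("sex", "sex", "sex"),
  option_code            = c("M", "F", "U"),
  option_name            = c("Male", "Female", "Unknown"),
  option_order_in_set    = c(1, 2, 3),
  stringsAsFactors       = FALSE
)

recoded <- toy_data
# Only recode columns that the dictionary describes
for (v in intersect(names(recoded), toy_dict$data_element_shortname)) {
  d <- toy_dict[toy_dict$data_element_shortname == v, ]
  d <- d[order(d$option_order_in_set), ]
  # Look up each code and replace it with its label
  labels <- d$option_name[match(recoded[[v]], d$option_code)]
  # Factor levels follow the order given in the dictionary
  recoded[[v]] <- factor(labels, levels = d$option_name)
}
print(recoded)
```

Columns without dictionary options, such as `case_number` here, pass through unchanged, which mirrors what the printed `dat_clean` shows for the identifier, date, and age columns.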