vignettes/intersectional_outbreak_recode_guide.Rmd
This guide was written by Applied Epi. Feedback and suggestions are welcome at the GitHub issues page.
This guide accompanies the Intersectional Outbreak Data Recode .Rmd file, which is intended to recode data into an intersectional data format for an outbreak report. This means that column names, classes, values, and formats will match those of the MSF intersectional linelist. While the code is generic, the examples in this guidance document refer to the measles report.
Once the data has been recoded, the disease-specific outbreak report .Rmd file can be used to create a report for a particular outbreak.
Note that the Rmd file needs significant editing to be fit for purpose. Specific cleaning code needs to be written, to suit the raw data, the disease, and therefore the relevant data requirements.
Running this code will clean and save your data. Knitting this .Rmd will also produce a Word document, which can be kept as a log to show how the data was changed. The sections are for:
Each section of the code in the Intersectional Outbreak Data Recode .Rmd template is explained in the ‘Detailed Guide’ section below.
With the help of this guide (specifically section 3), you should recode your data by:
eval = TRUE on line 24, and click “knit” at the
top, which will produce a document as a record of your data
cleaning process.
Note that the template contains comments that help you and point to the relevant sections in this guide. The comments look like this:
<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Comments are shown between these specifically formatted lines (and will not appear in the sitrep output)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
setup chunk
This chunk first sets up key preferences for this R Markdown file, in
opts_chunk$set(). By default it is set to not show code,
errors, or warning messages in the output.
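As a rough illustration, a setup chunk with these defaults could contain a call like the one below. The template's exact options are not reproduced in this guide, so the argument list here is an assumption.

```r
## Assumed defaults: hide code, errors, and warnings in the knitted document
knitr::opts_chunk$set(
  echo    = FALSE,  # do not show the R code
  error   = FALSE,  # halt knitting on an error rather than printing it
  warning = FALSE   # do not show warning messages
)
```

Individual chunks can override these defaults in their own chunk header, which is useful while you are still debugging your cleaning code.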
The read_data chunk loads linelist data into RStudio.
There are six options you can pick between to load data. Delete or
comment out the code that you do not need, and make sure the code that
you do need is recognised as code (not commented-out):
When you load real data, you need to specify the right file name and
path. The placeholders assume you are working in an RStudio project, and
that the linelist is in a subfolder called Data.
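The sketch below shows the general idea of reading a file from a Data subfolder. It is not the template's own import code: the file name is hypothetical, base R's read.csv() is used only to keep the example dependency-free, and a temporary folder stands in for the RStudio project root so that the example runs anywhere.

```r
## A temporary folder stands in for the project root (assumption for this sketch)
project_root <- tempdir()
dir.create(file.path(project_root, "Data"), showWarnings = FALSE)

## Write a tiny stand-in linelist; in practice your file already exists
example_path <- file.path(project_root, "Data", "linelist_example.csv")
write.csv(
  data.frame(`Case Number` = c("A1", "A2"), Sex = c("m", "FEMALE"),
             check.names = FALSE),
  example_path,
  row.names = FALSE
)

## Load the linelist from the Data subfolder, keeping the raw column names
linelist_raw <- read.csv(example_path, check.names = FALSE)
```

In a real project the path would point at your own file inside the Data subfolder, and you would keep whichever of the template's loading options matches your file type.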
The inspect_data chunk will show you some basic
information about your linelist: the dimensions and the column
names.
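A minimal sketch of this kind of inspection, using a toy data frame in place of your loaded linelist (the object name linelist_raw is an assumption):

```r
## Toy linelist standing in for your loaded data
linelist_raw <- data.frame(
  case_number = c("A1", "A2", "A3"),
  sex         = c("M", "F", "m")
)

dim(linelist_raw)    # number of rows, then number of columns
names(linelist_raw)  # column names
```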
There is no R chunk for this step, but to ensure you clean your data correctly, make sure you are aware of the:
You can use the function msf_dict() from the
sitrep package for this. The package contains data
dictionaries and simulated datasets for measles, meningitis, AJS,
cholera, and diphtheria. The msf_dict() function will show
you the data dictionary for your selected disease - see the example for
measles below.
In the resulting dictionary, the variable column lists the expected
column names, and the values_short and values_long columns list the
permitted values for each variable.
values_short has the shortened value and
values_long has the full-text value. You will need to
recode your current linelist values into the “values_short”
content.
## get MSF standard dictionary for measles
recode_dict <- msf_dict("measles", compact = FALSE) |>
  select("variable" = data_element_shortname,
         "values_short" = option_code,
         "values_long" = option_name)

## browse dictionary
View(recode_dict)

clean_column_names chunk
This step fixes the column names. This is done in two steps. First, the
clean_names() function from the package
janitor automatically standardises column names as per
good coding practice (lower case, no spaces or punctuation). Second,
individual columns are renamed to match the expected linelist format,
for example a column sex that we want to rename as sex_id. The
syntax for this is rename(data, NEW_NAME = OLD_NAME).
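The two steps can be sketched on a toy data frame as follows. The raw column names are invented for illustration, and the object names are assumptions rather than the template's own.

```r
library(dplyr)
library(janitor)

## Toy linelist with untidy column names (hypothetical)
linelist_raw <- data.frame(
  `Case Number` = c("A1", "A2"),
  Sex           = c("M", "F"),
  check.names   = FALSE
)

linelist_cleaned <- linelist_raw |>
  clean_names() |>       # "Case Number" becomes case_number, "Sex" becomes sex
  rename(sex_id = sex)   # NEW_NAME = OLD_NAME

names(linelist_cleaned)
```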
To facilitate this step, you can also use the function
msf_dict_rename_helper() to create a template based again
on the data dictionary held for that disease in the
sitrep package. Do this with the following steps:
Run msf_dict_rename_helper("xxxx"), where “xxxx” refers
to the disease name. For instance, you can run
msf_dict_rename_helper("Measles"). This will copy a rename
command to your clipboard.
Here is an example of what the pasted code looks like, which you can then edit so that the names of the columns in your data are on the right-hand side.
## Add the appropriate column names after the equals signs
linelist_cleaned <- rename(linelist_cleaned,
  acute_otitis_media             =   , # BOOLEAN (REQUIRED)
  age_days                       =   , # INTEGER_POSITIVE (REQUIRED)
  age_months                     =   , # INTEGER_POSITIVE (REQUIRED)
  age_years                      =   , # INTEGER_POSITIVE (REQUIRED)
  candidiasis                    =   , # BOOLEAN (REQUIRED)
  case_number                    =   , # TEXT (REQUIRED)
  cough                          =   , # BOOLEAN (REQUIRED)
  croup                          =   , # BOOLEAN (REQUIRED)
  date_of_consultation_admission =   , # DATE (REQUIRED)
  residential_status             =   , # TEXT (optional)
  residential_status_brief       =   , # TEXT (optional)
  treatment_facility_name        =   , # TEXT (optional)
  treatment_facility_site        =   , # TEXT (optional)
  treatment_location             =   , # ORGANISATION_UNIT (optional)
  trimester                      =     # TEXT (optional)
)

standardise_capitalisation chunk
Before browsing data, you can standardise the capitalisation of categorical values. This minimises the number of corrections that are needed later on in the code.
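One possible way to do this is sketched below. It assumes sentence case is the target and uses str_to_sentence() from stringr; the template may standardise capitalisation differently, and the data frame is a toy example.

```r
library(dplyr)
library(stringr)

## Toy data with inconsistent capitalisation (hypothetical values)
linelist_processing <- data.frame(
  sex_id  = c("MALE", "female", "Female"),
  outcome = c("sent home", "Sent Home", "SENT HOME")
)

## Convert every character column to sentence case
linelist_processing <- linelist_processing |>
  mutate(across(where(is.character), str_to_sentence))

linelist_processing
```

After this step the three spellings of "sent home" collapse into one value, so only one correction is needed later instead of three.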
browse_data chunk
You’ll want to look at your data, to know what errors and typos exist in the column values. This chunk shows you a few ways to explore.
The tbl_summary() function in particular will show you all the values within categorical columns.
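A few ways of exploring values are sketched below on toy data. The object and column names are assumptions, and the tbl_summary() call is only run if the gtsummary package is installed.

```r
library(dplyr)

## Toy data containing typical inconsistencies (hypothetical values)
linelist_processing <- data.frame(
  sex_id  = c("M", "m", "F", "FEMALE", NA),
  outcome = c("Sent home", "Home", "DOA", "Left", "Home")
)

## Frequency of each value in one column, including missing values
sex_counts <- count(linelist_processing, sex_id)
sex_counts

## Distinct values of another column
unique(linelist_processing$outcome)

## Summary table of all columns (only if gtsummary is installed)
if (requireNamespace("gtsummary", quietly = TRUE)) {
  summary_table <- gtsummary::tbl_summary(linelist_processing)
}
```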
recode_factor_vars chunk
This chunk is for recoding factor (categorical) variables. You will
need to edit this section to recode the values in your dataset to suit
the values in the expected linelist format. You can look at the data
dictionary object (recode_dict) and the outputs from the
browse_data chunk to write the correct code.
There is an initial example in this code which shows you how to fix
misspellings in the columns sex_id and outcome.
List the incorrect spellings that need correction inside c(), on the
left-hand side of the ~, and put the correct value on the right-hand
side. Several incorrect spellings can be listed within the same c().
Values that match none of the listed spellings are kept unchanged by
.default. For example:
linelist_processing <- linelist_processing |>
  mutate(sex_id = case_match(
    sex_id,
    c("M", "m")      ~ "Male",
    c("F", "FEMALE") ~ "Female",
    .default = sex_id)) |>
  mutate(outcome = case_match(
    outcome,
    c("Dead in facility - short")                 ~ "Dead in facility (<4h)",
    c("Dead in facility - long")                  ~ "Dead in facility (>4h)",
    c("Sent home", "Home")                        ~ "Discharged home",
    c("Death in community", "Dead in community")  ~ "Dead in community",
    c("DOA")                                      ~ "Dead on arrival",
    c("Left")                                     ~ "Left against medical advice",
    c("Transferred - MSF")                        ~ "Transferred (to an MSF facility)",
    c("Transferred - External")                   ~ "Transferred (to External Facility)",
    .default = outcome))
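
After recoding, it can help to check whether any values still fall outside the expected set. The sketch below is not part of the template: it recodes a toy column with case_match() (which requires dplyr 1.1.0 or later) and then lists leftover values. The allowed_sex vector is a hypothetical stand-in for the values you would take from recode_dict.

```r
library(dplyr)

## Toy column with one spelling the recode does not cover (hypothetical)
linelist_processing <- data.frame(sex_id = c("M", "m", "F", "FEMALE", "Femal"))

linelist_processing <- linelist_processing |>
  mutate(sex_id = case_match(
    sex_id,
    c("M", "m")      ~ "Male",
    c("F", "FEMALE") ~ "Female",
    .default = sex_id
  ))

## Hypothetical permitted values; in practice take these from the dictionary
allowed_sex <- c("Male", "Female")

## Values present in the data but absent from the permitted set
unmatched <- setdiff(unique(linelist_processing$sex_id), allowed_sex)
unmatched
```

Any value listed in unmatched can then be added to the relevant c() in the recode and the chunk run again.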