This guide was written by Applied Epi. Feedback and suggestions are welcome at the GitHub issues page.
This guide accompanies the Cholera outbreak report .Rmd file, which can be used to create an automated outbreak report for cholera.
This guide and the sitrep code are intended for individuals who already have some familiarity with R but want ready-made code to make the report production process faster. You need to be able to edit and troubleshoot code.
The report will contain basic information on person, place, and time, specifically:
The outbreak report Rmd is split up into sections with chunks which relate to:
Only the outputs from the fourth section on analysis will be visible in the report when rendering it.
Note there are comments throughout the Rmd file which refer to the relevant sections in this guide. The comments look like this:
<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Comments are shown in the code between these special lines. They will not appear in the report output.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
With the help of this guide (specifically section 3), you should produce the report via the following steps:
definitions chunk, to correctly define the reporting date and other key objects. Make sure the date here matches the one in the YAML.
fig.height option in chunk labels to change the height of figures.
You will need:
If you need to change your data into this format, do so with the outbreak report recode Rmd file first.
At the very top of the Rmd file, make sure you specify the following by changing the “XXX” content within the quotation marks:
- The title (default is “cholera outbreak report”)
- The location/area that the outbreak report concerns
- The MSF office
- The date of the report
Make sure you do not remove the quotation marks or make edits outside of the quotation marks.
setup chunk
⚠️ You will need to edit this chunk.
This chunk first sets up key preferences for this R Markdown file, in
opts_chunk$set(). By default it is set to not show code, to
show errors, or warning messages in the output.
This chunk also installs/loads packages. Several packages are required for different aspects of analysis with R. You will need to install these before starting.
definitions chunk
⚠️ You will need to edit this chunk.
This chunk is for inputting information so that the report is suited to this dataset.
Set what the three geographical objects
adm1_residence_name, adm2_residence_name,
adm3_residence_name, mean. For example, they may represent
the province, district, and village of each case respectively, but this
will differ depending on the data.
Set the date of the report, assuming the report is updated on a weekly basis.
yearweek function if necessary, so that it correctly defines the week that this sitrep is reporting on, e.g. “2024 W52”. The default argument for week start inside yearweek is 1, which means it assumes reporting weeks start on Mondays. You can change it to 2 for Tuesday, 3 for Wednesday, etc.
read_population_data chunk
⚠️ You will need to edit this chunk.
This report template uses population data to calculate attack rates.
This chunk creates two objects that are used later in the code:
There are three options for creating each object:
gen_population() function from the sitrep package to combine into a table.
gen_population() function to generate sub-group specific counts.
Comment out or delete the options you are not using, and edit the one you will use.
* Note this is for calculating rates for the adm2 level. If you need
rates for adm1 or adm3 geographic levels, load the relevant population
data in the appropriate way, and change the object name for clarity
(e.g. to population_data_adm3), or create additional,
appropriately named denominator data objects.
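As a rough illustration of how a population table feeds into attack rates, the sketch below types adm2 population counts in by hand and joins them to case counts. The object name population_data_adm2, the district names, the counts, the column names, and the per-10,000 multiplier are all assumptions made for illustration; the template's own objects, and the output of gen_population(), may be structured differently.

```r
# Hypothetical adm2 population table, typed in by hand
population_data_adm2 <- data.frame(
  adm2_residence = c("District A", "District B", "District C"),
  population     = c(12000, 8500, 20000)
)

# Hypothetical case counts per adm2 value
case_counts <- data.frame(
  adm2_residence = c("District A", "District B", "District C"),
  cases          = c(36, 17, 10)
)

# Join counts to denominators, then compute the attack rate per 10,000 population
attack_rates <- merge(case_counts, population_data_adm2, by = "adm2_residence")
attack_rates$ar_per_10000 <- round(
  attack_rates$cases / attack_rates$population * 10000, 1
)

print(attack_rates)
```

The join only works if the adm2 names in the population table match the names in the linelist exactly, so it is worth checking for spelling differences before relying on the rates.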
read_data chunk
⚠️ You will need to edit this chunk.
This chunk loads linelist data into RStudio. There are seven options you can pick between to load data. Delete or comment out the code that you do not need, and edit the code you do need by specifying the correct file name and location:
gen_data() function from the sitrep package. Do this if you want to see what the clean data should look like, or if you want to test the outbreak template.
browse_data chunk
You’ll want to look at your data.
This chunk shows you a few ways you can explore, including printing out a table which shows all values in your columns (excluding the case ID or health facility IDs which would be too many to tabulate) so that you can identify non-standardised or incorrect values.
It is recommended that you also check other details more comprehensively.
Note that these tables will not be printed when you run the report.
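One way to spot non-standardised values is to tabulate every column except the identifiers. The following is a minimal base R sketch on a made-up linelist; the column names and values are assumptions and not the template's data, and the template's own chunk may use different functions.

```r
# Toy linelist with deliberately inconsistent values
linelist <- data.frame(
  case_id     = 1:6,
  sex         = c("F", "f", "M", "Male", NA, "M"),
  dehydration = c("None", "Some", "some", "Severe", "None", NA),
  stringsAsFactors = FALSE
)

# Tabulate every column except the ID, keeping missing values visible
cols_to_check <- setdiff(names(linelist), "case_id")
value_tables  <- lapply(linelist[cols_to_check], table, useNA = "ifany")

print(value_tables)
```

Values such as "f" and "Male" appearing next to "F" and "M" are the kind of inconsistency to note down for the cleaning chunks.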
All your cleaning and variable creation should happen in these chunks:
| Step | Description |
|---|---|
| standardise_dates | Sets up and cleans dates, and creates new columns on reporting weeks. This includes creating a new date_combined column which prioritises onset date and uses notification date if onset date is not available |
| create_age_group | Creates the age group column from age (and some intermediary columns). For cholera, two age group columns are created for a choice of table outputs in the analysis section |
| standardise_values | Cleans the values of categorical variables (e.g., typo correction) and removes illogical values. |
| create_vars | Creates additional columns based on other columns, and converts categorical columns into ordered factors so that all values are presented in the correct order in table outputs |
| duplicates | Removes duplicate rows. |
| remove_unused_data | Removes rows (and columns) that are not required for analysis. |
| save_cleaned_data | Saves the clean data as a back-up. |
⚠️ You will likely need to edit some of these chunks.
standardise_dates chunk
As the data should already be in the right format, you should not need to change this section. If you have imported an RDS file, this code might not be needed, but it will be useful for excel files and csv files etc.
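Before going through the chunk step by step, here is a small sketch of why the choice of parsing function matters when dates arrive as text. It assumes the lubridate package; the data frame and its values are made up for illustration, and the base R loop over columns is a stand-in for whatever the template uses internally.

```r
library(lubridate)

# The same text gives different dates under different order assumptions
text_date <- "01-02-2025"
as_dmy <- dmy(text_date)  # day-month-year: 1 February 2025
as_mdy <- mdy(text_date)  # month-day-year: 2 January 2025

# Toy linelist with dates stored as year-month-day text
linelist <- data.frame(
  case_id            = c("A1", "A2"),
  date_notification  = c("2025-02-03", "2025-02-10"),
  date_symptom_start = c("2025-02-01", NA),
  stringsAsFactors   = FALSE
)

# Convert every column whose name starts with "date" using ymd()
date_cols <- grep("^date", names(linelist), value = TRUE)
linelist[date_cols] <- lapply(linelist[date_cols], ymd)

str(linelist)
```

If parsed dates look shifted by several months, the order assumption is the first thing to check.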
The code does the following:
Changes all columns starting with the word ‘date’ (i.e. the
date_notification, date_symptom_start, and
date_hospitalisation_end columns) to be recognised as dates
by R. The function ymd is used to recognise that the text
has dates written in the order of year, month, and day (e.g. 2025-02-01
or 2025 Feb 01 or 2025 February 1). ⚠️ Change this if the dates are written
in a different way in your data, e.g. to dmy() if
written in day, month, year (e.g. 01-02-2025 or 1 Feb 2025).
Fixes logically inconsistent dates: removes symptom onset dates if they are after notification dates. ⚠️ Inspect and edit data if you can, or change the rule to fit the circumstances of your data.
Creates a new date column, date_combined, to
maximise date completeness, by using the coalesce()
function that prioritises symptom onset date, and substitutes with
notification date if onset date is not available. It also creates a
date_source column to label if the value in the
date_combined column is the onset date or the notification
date.
Creates six new week columns:
epiweek_symptom_start_num, epiweek_notification_num, and epiweek_combined_num: These are week representations of the onset date, notification date, and date_combined columns, and are represented with the year and the week number, e.g. “2025 W10”. It uses the yearweek function in which you state which day the week starts (default 1 for Monday). These are not the default columns used in epicurves but these can be switched to.
epiweek_symptom_start, epiweek_notification, and epiweek_combined: This will create corresponding columns with the date representing the start of the week, e.g. “2025-03-03”. Note the selected date will depend on the start day provided in the yearweek function.
create_age_group chunk
This chunk creates a categorical variable for age groups, as typically analysed for the disease.
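A minimal sketch of this kind of grouping, using base R cut(). The breaks and labels below are assumptions chosen for illustration; the template's actual age bands, and the function it uses to create them, may differ.

```r
# Hypothetical ages in years, including a missing value
age_years <- c(0, 3, 5, 14, 15, 29, 30, 44, 45, 80, NA)

# Assumed breaks; right = FALSE makes each interval include its lower bound,
# so age 5 falls in "5-14" rather than "0-4"
age_group <- cut(
  age_years,
  breaks = c(0, 5, 15, 30, 45, Inf),
  labels = c("0-4", "5-14", "15-29", "30-44", "45+"),
  right  = FALSE
)

print(table(age_group, useNA = "ifany"))
```

Because cut() returns a factor with levels in the order of the labels, the groups appear in age order in later tables rather than alphabetically.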
standardise_values chunk
Edit the code as needed to make sure values are standardised and
correct. The checks you did in the browse_data chunk will
inform this section.
Two examples are provided to start with:
This template does not include code for all possible errors as this will depend on your data, so you may need to write your own code/consult someone who can write code in R to make sure your linelist data is fully ready for analysis.
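As a hedged illustration of what such cleaning code can look like (these are not the template's own two examples), the base R sketch below standardises a sex column with a lookup table and blanks out implausible ages. The column names, the variants in the lookup, and the 0 to 120 plausibility range are assumptions.

```r
# Toy linelist with typos and illogical values
linelist <- data.frame(
  sex       = c("F", "f", "Femal", "M", "male", "unknown"),
  age_years = c(25, 4, 130, 61, -1, 33),
  stringsAsFactors = FALSE
)

# Map known variants onto standard values; anything not listed becomes NA
sex_lookup <- c("f" = "F", "femal" = "F", "female" = "F",
                "m" = "M", "male" = "M")
linelist$sex <- unname(sex_lookup[tolower(trimws(linelist$sex))])

# Treat implausible ages as missing (the 0-120 range is an assumption)
implausible <- !is.na(linelist$age_years) &
  (linelist$age_years < 0 | linelist$age_years > 120)
linelist$age_years[implausible] <- NA

print(linelist)
```

An explicit lookup is used so that unexpected values surface as NA and can be reviewed, rather than being silently guessed.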
create_vars chunk
This chunk creates other columns used in analysis, and converts categorical columns into a factor class so that all categories are displayed in the correct order in tables later in the analysis. You can edit this section for more columns.
New columns are:
died: binary (TRUE/FALSE) column labeling if a case died or not.
These columns are changed into factors. Note that values not specified as valid categories in the code will be converted to NA:
facility_type
origin
dehydration
lab_result_rdt
lab_result_culture
lab_result_pcr
vacci_status
vacci_doses
exit_status
date_source
duplicates chunk
This chunk removes duplicate cases, presenting two options. You can edit the code to use whichever unique identifiers you consider relevant.
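Two common ways of de-duplicating are sketched below in base R: dropping rows that are identical in every column, and keeping one row per chosen identifier. Whether these correspond to the template's two options is an assumption, as are the column names and values.

```r
# Toy linelist: A1 is entered twice identically, A3 twice with different dates
linelist <- data.frame(
  case_id           = c("A1", "A1", "A2", "A3", "A3"),
  date_notification = as.Date(c("2025-03-03", "2025-03-03", "2025-03-04",
                                "2025-03-05", "2025-03-06")),
  sex               = c("F", "F", "M", "F", "F"),
  stringsAsFactors  = FALSE
)

# Approach 1: drop rows that are identical across every column
dedup_all <- linelist[!duplicated(linelist), ]

# Approach 2: keep the first row for each value of the chosen identifier(s)
id_cols     <- c("case_id")
dedup_by_id <- linelist[!duplicated(linelist[id_cols]), ]

print(nrow(dedup_all))
print(nrow(dedup_by_id))
```

The second approach is stricter: it also removes the second A3 row, which differs only in notification date, so the choice of identifier columns determines which records survive.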
remove_unused_data chunk
This step filters out data that are not appropriate to include in the analysis, for instance:
Data with onset after the reporting week: This removes cases that are not feasible for this report
Other ineligible/anomalous data: E.g. you might want to remove rows with missing essential data. Use this section to make other edits as needed (please do not change column names or formats).
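A minimal sketch of this kind of filter in base R, assuming a reporting_date object and a date_combined column as described elsewhere in this guide. The values are made up, and dropping rows with a missing date is an illustrative choice rather than the template's rule.

```r
# Hypothetical last day of the reporting week
reporting_date <- as.Date("2025-03-09")

linelist <- data.frame(
  case_id          = c("A1", "A2", "A3", "A4"),
  date_combined    = as.Date(c("2025-03-01", "2025-03-08", "2025-03-12", NA)),
  stringsAsFactors = FALSE
)

# Keep rows dated on or before the reporting date;
# rows with a missing date are also dropped in this sketch
keep <- !is.na(linelist$date_combined) &
  linelist$date_combined <= reporting_date
linelist_clean <- linelist[keep, ]

print(linelist_clean)
```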
⚠️ You will need to edit this section either in the code or final output.
This is a placeholder section for you to add comments in consultation with the appropriate team/expertise.
At the start of the epi description, there are some short bullet points describing the number of cases and key epi points.
epicurve chunk
This chunk starts by creating the objects all_weeks and
all_weeks_date with all weeks, spanning from the earliest
epiweek_combined value to the reporting_date.
This is used across epicurves in this report to define the range in the
x axis.
Then the epicurve is created using the coalesced column
epiweek_combined. The source of the date is indicated by
the fill colour of the bars, based on the date_source
column. Note that the title of this epicurve specifies onset date, so
this analysis and code assumes that the epiweek_combined
column is indeed mostly onset date and only sometimes approximated by
notification date. ⚠️ Change
the week column used if not appropriate, e.g. if there is high
missingness in onset date.
Change the fig.height chunk option for smaller or larger
figures, and change the breaks argument in the
scale_x_date() function to specify the time difference
between the x axis date labels. For example, you can specify
breaks = "1 week" or breaks = "2 months".
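The pieces described above can be sketched end to end on a toy linelist, assuming the dplyr, lubridate, and ggplot2 packages. The data, the reporting date, the labels, and the padded axis limits are assumptions for illustration. lubridate's floor_date() is used here as a stand-in for the template's yearweek-based week columns, so this is not the template's own code.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

reporting_date <- as.Date("2025-03-30")  # hypothetical reporting date

linelist <- tibble(
  date_symptom_start = as.Date(c("2025-03-03", NA, "2025-03-11",
                                 "2025-03-12", NA, "2025-03-25")),
  date_notification  = as.Date(c("2025-03-05", "2025-03-06", "2025-03-12",
                                 "2025-03-14", "2025-03-20", "2025-03-27"))
) %>%
  mutate(
    # Prefer onset date; fall back to notification date when onset is missing
    date_combined = coalesce(date_symptom_start, date_notification),
    date_source   = if_else(is.na(date_symptom_start),
                            "Notification date", "Onset date"),
    # Monday that starts the week containing date_combined
    epiweek_combined = floor_date(date_combined, unit = "week", week_start = 1)
  )

# Every week from the earliest case week to the reporting week
all_weeks_date <- seq(
  min(linelist$epiweek_combined),
  floor_date(reporting_date, unit = "week", week_start = 1),
  by = "week"
)

epicurve <- ggplot(linelist, aes(x = epiweek_combined, fill = date_source)) +
  geom_bar() +
  scale_x_date(
    breaks      = "1 week",   # distance between x axis labels
    date_labels = "%d %b",
    # pad the limits by one week so the outer bars are not cut off
    limits      = c(min(all_weeks_date) - 7, max(all_weeks_date) + 7)
  ) +
  labs(x = "Week of onset", y = "Cases", fill = "Date used")

print(epicurve)
```

In this toy data two of the six cases fall back to notification date, which the fill colour makes visible; if most bars were coloured that way, a title referring to onset date would be misleading.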
This section produces:
total_props_agegroup_sex chunk, a table showing age distribution overall and by sex
age_pyramid chunk: an age-sex pyramid using the age_pyramid() function, which produces a ggplot object so can be further edited with ggplot2 code if needed (e.g. themes, labels, and scales).
This section calculates total case counts, cases in the last 14 days (calculated for the 14 days prior to reporting date), deaths, and (for two tables) the CFR. The first two tables link to the population tables imported or produced in the read_population_data chunk:
attack_rate_by_agegroup chunk creates a table by broad age group using the age_group column
attack_rate_by_adm2 chunk creates a table by unique value for adm2_residence.
The following three tables exclude the CFR as there is no relevant denominator data:
attack_rate_by_facility chunk creates a table by
facility_type
attack_rate_by_facilityname chunk creates a table
by facility_name
attack_rate_by_dehydration chunk creates a table by dehydration status
Across four chunks, four tables are created with tbl_summary() on vaccine history, laboratory results, and detailed outcomes. They use the same formatting as all other tbl_summary() outputs.
Two chunks produce the following:
describe_by_adm2 chunk: A table with geographical breakdowns, for the adm2 level, by age group.
epicurve_by_adm2 chunk: This epicurve uses the same code as the first main epicurve, with the addition of a facet_wrap() function to split the plot into several mini plots, one per adm2 value. You may want to edit the figure height and width within the chunk so that it fits on the page. As with the main epicurve, you may also want to make sure you use the right week column, as this analysis and code assumes that the epiweek_combined column is mostly onset date and only sometimes approximated by notification date. ⚠️ Change the week column used if not appropriate, e.g. if there is high missingness in onset date.
Note that the meaning of adm2, e.g. District vs Region, should be set in the definitions chunk so that the titles and labels within these tables and figures are correct.
The report produces three maps to show:
The map production is split up into several chunks:
read_shapefiles: To create maps, you need to have a shapefile of the area (note that a shapefile typically consists of several files, of which one ends in .shp). This chunk gives you the option of generating a fake shapefile with the gen_polygon function. Otherwise, you can read in the shapefile. Often, the MSF GIS unit can provide shapefiles and advice on how to use them. Your shapefile can be a polygon or points. Polygons do not need to be contiguous. The names of the polygons or points MUST match the names in your linelist. Finally, your coordinate reference system needs to be WGS84.
chloropleth_map_prep: This chunk builds on the detailed attack rate table produced in the attack_rate_by_adm2 chunk, by converting counts and attack rates into categories suitable for mapping. The find_breaks() function is used to dynamically define the boundaries of the groupings. Edit the breaks = X and snap = X arguments to change how many and how wide the subgroups are. For example, if there are 5000 total cases, you may want four categories (X=4) that snap to the closest 500 (snap=500, for example for categories 0, 1-1000, 1001-2000, 2001-3000, and 3001+). This chunk also links to the map object created in the read_shapefiles chunk to bring in the geometries for mapping. ⚠️ You will likely need to edit this chunk to change the categories that appear on the map/legend.
chloropleth_map chunks: Three chunks create one map each.
The describe_by_adm2_adm3 chunk creates a table showing the distribution of cases by adm2 and adm3. It has the potential to be a long table so it is at the end of the report.
The meanings of adm2 and adm3 are set at the top in the
definitions chunk, which by default are set to mean
district and area respectively. Go
back and change this chunk if the table titles are incorrect.