Normalize Disease Hierarchy — normalize_disease

Take a tidy data set with a potentially complex disease hierarchy and flatten that hierarchy so that, within any given context (by default, a particular time period and location), all diseases in the `disease` column have the same `nesting_disease`.
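
As a rough illustration of the target state, the base R sketch below checks whether a toy table is already flat in this sense, that is, whether each context contains exactly one distinct `nesting_disease`. The toy values and the helper `is_flat()` are hypothetical and not part of the package. Treating a missing `nesting_disease` as its own distinct value is an assumption made only for this example.

```r
# Toy tidy data; disease names and dates are invented for illustration.
toy <- data.frame(
  period_start_date = as.Date(c("1900-01-01", "1900-01-01",
                                "1900-01-08", "1900-01-08")),
  period_end_date   = as.Date(c("1900-01-07", "1900-01-07",
                                "1900-01-14", "1900-01-14")),
  location          = c("A", "A", "A", "A"),
  disease           = c("typhoid", "paratyphoid", "enteric fever", "typhoid"),
  nesting_disease   = c("enteric fever", "enteric fever", NA, "enteric fever"),
  stringsAsFactors  = FALSE
)

# Hypothetical helper: TRUE when every context (one combination of the
# grouping columns) holds a single distinct nesting_disease value.
is_flat <- function(data, grouping_columns) {
  # Build one key per row by pasting the grouping columns together.
  context <- do.call(
    paste,
    c(unname(as.list(data[grouping_columns])), sep = "|")
  )
  n_parents <- tapply(
    data$nesting_disease, context,
    function(x) length(unique(x))
  )
  all(n_parents == 1)
}

grouping_columns <- c("period_start_date", "period_end_date", "location")

# First week only: both rows share the same nesting_disease.
is_flat(toy[1:2, ], grouping_columns)

# All rows: the second week mixes a parent disease with one of its children.
is_flat(toy, grouping_columns)
```

In this sketch the first call reports a flat table and the second does not, because the second week reports a parent disease alongside one of its children.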

Usage

normalize_disease_hierarchy(
  data,
  disease_lookup,
  grouping_columns = c("period_start_date", "period_end_date", "location"),
  basal_diseases_to_prune = character(),
  find_unaccounted_cases = TRUE,
  specials_pattern = "_unaccounted$"
)

Arguments

data: A tidy data set with at least the following columns: `disease`, `nesting_disease`, `basal_disease`, `period_start_date`, `period_end_date`, and `location`. The last three can be replaced with other columns via `grouping_columns`.
disease_lookup: A lookup table with `disease` and `nesting_disease` columns that describes a global disease hierarchy. This hierarchy is applied locally to flatten the disease hierarchy at each point in time and space in `data`.
grouping_columns: Character vector of column names used for grouping; each combination of their values defines one context. Defaults to `c("period_start_date", "period_end_date", "location")`.
basal_diseases_to_prune: Character vector of `disease` names to remove from `data`. Defaults to an empty vector, so nothing is removed.
find_unaccounted_cases: Logical. If `TRUE` (the default), make new records for instances when the sum of leaf diseases is less than the reported total for their basal disease.
specials_pattern: Optional regular expression used to match `disease` names in `data` that should be added to the lookup table. This is useful for disease names that are not historical but were produced for harmonization purposes. The most common example is `"_unaccounted$"`, which is the default. Setting this argument to `NULL` avoids adding any special disease names to the lookup table.
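
The idea behind `find_unaccounted_cases` and `specials_pattern` can be sketched in base R on a single context. This is only an illustration under stated assumptions, not the function's implementation: the `count` column name, the toy values, and the `"enteric fever_unaccounted"` naming are all hypothetical, chosen to match the default pattern.

```r
# Toy data for one context; the `count` column name is an assumption,
# since the required columns listed above do not include a case count.
toy <- data.frame(
  location         = c("A", "A", "A"),
  disease          = c("enteric fever", "typhoid", "paratyphoid"),
  nesting_disease  = c(NA, "enteric fever", "enteric fever"),
  basal_disease    = c("enteric fever", "enteric fever", "enteric fever"),
  count            = c(100, 60, 25),
  stringsAsFactors = FALSE
)

# Compare the reported basal total with the sum of its leaf diseases.
is_basal    <- toy$disease == toy$basal_disease
basal_total <- sum(toy$count[is_basal])
leaf_total  <- sum(toy$count[!is_basal])
unaccounted <- basal_total - leaf_total

# If the leaves fall short of the total, record the difference as a new row.
# The "<basal>_unaccounted" name is a hypothetical convention.
if (unaccounted > 0) {
  new_row <- data.frame(
    location         = "A",
    disease          = "enteric fever_unaccounted",
    nesting_disease  = "enteric fever",
    basal_disease    = "enteric fever",
    count            = unaccounted,
    stringsAsFactors = FALSE
  )
  toy <- rbind(toy, new_row)
}

# The default specials_pattern picks out such synthetic disease names,
# which would otherwise be missing from a historical lookup table.
specials_pattern <- "_unaccounted$"
specials <- grep(specials_pattern, toy$disease, value = TRUE)
specials
```

Here the leaves sum to 85 against a reported total of 100, so one synthetic record carries the remaining 15 cases, and the pattern matches only that record's name.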