Severity pipeline: linelist data #158

CarmenTamayo · 2024-10-24T13:55:40Z

PR with severity pipeline

The skeleton is structured to accommodate either linelist or aggregated data, and to accept 3 different types of data structures:

a) 1 column with onset dates and 1 column with death dates
b) 1 column with onset dates and unknown dates of death (we only know the total number of deaths)
c) 1 column with onset dates, 1 column with dates of outcome, and 1 column that indicates the type of outcome

The params in the skeleton are used for the user to indicate the file path, disease name, whether data are aggregated or linelist, distinguish between the 3 types of data described above, provide an onset-death delay or use the epiparameter database for this purpose, and whether to use cfr or Epinow2 for CFR estimation.

At the moment the most useful review would be about how the template is structured, i.e., the balance between parameters and .Rmd chunks. This will help determine what the best way is to include options for aggregated data or to be able to indicate the level of data aggregation. Other components like the method for cfr estimation or the specifics for the time-varying cfr will be updated in further PRs.

…onset and total deaths

Bisaloo

Thanks for your work on this! I think this should be merged as soon as the comments are resolved as it already provides a good basis to add the other cases one by one.

I left comments mostly around the report structure, the choice of parameters, and the trade-off "automagically guessing" vs "black box process"

inst/rmarkdown/templates/transmissibility/skeleton/skeleton.Rmd

inst/rmarkdown/templates/severity/skeleton/skeleton.Rmd

Bisaloo · 2024-12-02T16:54:06Z

inst/rmarkdown/templates/severity/skeleton/skeleton.Rmd

+o_d_params <- get_parameters(onset_death_epiparameter)
+plot(onset_death_epiparameter)


These two lines are here whether we use_epiparameter_database or not, so I would put them in a separate chunk that is always evaluated

Updated in 6a54405
But could I ask why it's better to add separately? Originally I thought it was better to leave it in both to have less chunks of code (thinking of the output html/pdf) so that there'd be less bits to hide/show and make it look cleaner, also in case we end up using rmdscaffold or some approach like that, would it be better if there were less chunks that are more self-contained? or more individual bits that you can tag and rearrange?

But could I ask why it's better to add separately?

The idea is to generally avoid having duplicated code (see the DRY principle in software development).

By writing this twice when you could write it just once:

it takes slightly more effort to update the code in the future (e.g., to change the plot colours) since we have to edit two places instead of one

more importantly, we run the risk of forgetting to update one of the occurrences and now have inconsistent results (e.g., different plot colours) depending on the situation

As always, these general principles need to be approached and applied with a grain of salt. Here, the concerns mentioned above are somewhat reduced by the fact that the duplicated code is in the same location of the document. It would be more of a problem if code was duplicated, e.g., across two child documents.

I still think it's a (albeit minor) net benefit so I think it's worth doing it.

Bisaloo · 2024-12-02T16:54:45Z

inst/rmarkdown/templates/severity/skeleton/rmdchunks/id_key_data_linelist.Rmd

@@ -0,0 +1,65 @@
+## Identifying key data


For each one of these cases, can you say to which example dataset it corresponds please?

Sorry Hugo, do you mean to write it on here or to indicate it within the code chunks?
The example datasets work with different methods, it's a bit complicated but I summarised it here:

This illustration is useful and should probably be provided to the users somewhere.

My specific request here is to say in each chunk which dataset is used to fill in the defaults you provided.

From the other end of the looking glass, if I want to see a specific situation in action (e.g., the first chunk) here, which dataset do I need to set as input in my report parameters.

Okay I see, so the thing is that, as the image indicates, the template works on either of the datasets when selecting certain parameter options, and there'll be cases where both datasets are usable. I was thinking the datasets were only there to act as examples to test the templates?
I agree 100% that it should be mentioned, either in the documentation or on the chunks themselves (or both) that these are the defaults being used and where they can be used

inst/rmarkdown/templates/severity/skeleton/skeleton.Rmd

Bisaloo · 2024-12-02T17:42:37Z

inst/rmarkdown/templates/severity/skeleton/skeleton.Rmd

+  data_columns:
+    label: "Columns present in your dataset"
+    value: "Dates of onset and total deaths only"
+    choices: ["Dates of onset and total deaths only", "Dates of onset and death", "Dates of onset and whether case died"]


I would probably replace this parameter by two params:

cases

deaths

Which can take a column name or a number, and from this, we determine automatically in which situation we are. This means we can skip most of id_key_data_linelist.Rmd.

Does this make sense?

I had thought of something like that initially but the issue I'm having is that for each of the options there can be more than 2 options to indicate, hence the inclusion of variables within the code chunks later on, i.e.:

Dates of onset and total deaths:

Column with onset dates

Variable with total number of deaths

Dates of onset and deaths

Column with onset (or other 1ary event) dates

Column with death dates

Dates of onset and whether case died

Column with onset (or other 1ary event) dates

Column with 2ary event dates

Column with type of outcome

How would it be possible to explain/indicate in the params that these are the 3 options available? 🤔

This is partially related to other discussion items so maybe let's try to resolve the rest first and come back to this later.

Bisaloo · 2024-12-02T17:45:19Z

inst/rmarkdown/templates/severity/skeleton/rmdchunks/id_key_data_linelist.Rmd

+```{r define-key-variables3, eval= params$data_columns == "Dates of onset and whether case died", echo= params$data_columns == "Dates of onset and whether case died"}
+# This code identifies key variables for analysis in the input dataset.
+group_var <- "transmission"
+# Leave count_var as NULL if your data is really a linelist/patient-level data.
+# Update count_var to a character string with the name of the column that
+# contains case counts if your data is already aggregated.
+count_var <- NULL
+# Enter the geographical location where cases in your data took place
+country <- "Equatorial Guinea"
+cases <- "disease_onset"
+# Indicate which column provides the date of outcome of the case
+outcome_date <- "disease_ended"
+# Indicate the column containing the case's outcome, e.g., "died" or "recovered"
+outcome_type <- "status"
+
+dat_raw <- dat_raw %>%
+  mutate(deaths = if_else(dat_raw[[outcome_type]] == "died", dat_raw[[outcome_date]], NA))
+
+deaths <- "deaths"
+```


This seems to be almost equivalent to the 2nd case (with a one-line transformation done in L 48).

But it makes the template much more complicated. I think the convenience of doing this single step for the user is probably outweighed by the costs in terms of decreased usability and maintainability.

What do you think?

So you think it might be better for the user to do this themselves first? My thoughts initially was that the point of the templates is to be able to read in data that are most likely to be available to the user and I believe this to be quite a common one (with the column indicating the outcome of the case)
Would the only downside the sustainability?

Would the only downside the sustainability?

Usability as well since as it currently stands, users are going to have to have (minimal) interactions with the underlying code, if only to change the name of the variables of interests.

I think we need to strike a balance between number of cases & ability to address all cases without any changes from the users.

Indeed, we would realistically never be able to cover all cases. What if status if "recovery" & "death" instead of "recovered" & "died".

Here, even if presented slightly differently, these 2 cases are conceptually the same in my opinion. I would group cases on what types of information is present in the data rather than the exact format.

Update parameter name to "use_epiparameter_database" Co-authored-by: Hugo Gruson <[email protected]>

…after setting deaths variable to "NULL" in id_key_data child document

…de chunk to convert raw data to incidence for all data types

Co-authored-by: Hugo Gruson <[email protected]>

Bisaloo and others added 10 commits November 14, 2023 16:21

Create skeleton for severity pipeline

089b495

changes to tx report to update the use of epiparameter

3276a8d

update epiparameter usage pt 2

0d86d62

Updating .gitignore

7555df0

Draft of severity pipeline with 3 different data sources

14ac977

Adding data for Marburg to test script

e8b31d2

Removing unused images

72be1d9

Rmd chunk to add rolling cfr only for data types other than dates of …

3fba718

…onset and total deaths

Rmd chunk to identify relevant variables with individual-level data

49f29d0

Adding image of severity pipeline

70deaa5

CarmenTamayo requested a review from Bisaloo October 24, 2024 13:55

Bisaloo reviewed Dec 2, 2024

View reviewed changes

CarmenTamayo and others added 7 commits December 6, 2024 10:22

Update inst/rmarkdown/templates/transmissibility/skeleton/skeleton.Rmd

0bdfddb

Update parameter name to "use_epiparameter_database" Co-authored-by: Hugo Gruson <[email protected]>

Deleting unused variables

0f15de7

Adding separate chunk to plot epiparamter object

6a54405

Merging chunk to convert raw data to incidence for all types of data …

4f8a9eb

…after setting deaths variable to "NULL" in id_key_data child document

Setting deaths variable to "NULL" so it's possible to use only one co…

1bba60d

…de chunk to convert raw data to incidence for all data types

Update inst/rmarkdown/templates/severity/skeleton/skeleton.Rmd

8f03faf

Co-authored-by: Hugo Gruson <[email protected]>

Removing unused parameter for cfr estimation for this PR

a6346f0

Co-authored-by: Hugo Gruson <[email protected]>

		o_d_params <- get_parameters(onset_death_epiparameter)
		plot(onset_death_epiparameter)

Severity pipeline: linelist data #158

Are you sure you want to change the base?

Severity pipeline: linelist data #158

Uh oh!

Conversation

CarmenTamayo commented Oct 24, 2024

Uh oh!

Bisaloo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!