Skip to content

Severity pipeline: linelist data #158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Conversation

CarmenTamayo
Copy link
Contributor

PR with severity pipeline

The skeleton is structured to accommodate either linelist or aggregated data, and to accept 3 different types of data structures:

a) 1 column with onset dates and 1 column with death dates
b) 1 column with onset dates and unknown dates of death (we only know the total number of deaths)
c) 1 column with onset dates, 1 column with dates of outcome, and 1 column that indicates the type of outcome

The params in the skeleton are used for the user to indicate the file path, disease name, whether data are aggregated or linelist, distinguish between the 3 types of data described above, provide an onset-death delay or use the epiparameter database for this purpose, and whether to use cfr or Epinow2 for CFR estimation.

At the moment the most useful review would be about how the template is structured, i.e., the balance between parameters and .Rmd chunks. This will help determine what the best way is to include options for aggregated data or to be able to indicate the level of data aggregation. Other components like the method for cfr estimation or the specifics for the time-varying cfr will be updated in further PRs.

@CarmenTamayo CarmenTamayo requested a review from Bisaloo October 24, 2024 13:55
Copy link
Member

@Bisaloo Bisaloo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this! I think this should be merged as soon as the comments are resolved as it already provides a good basis to add the other cases one by one.

I left comments mostly around the report structure, the choice of parameters, and the trade-off "automagically guessing" vs "black box process"

Comment on lines +226 to +227
o_d_params <- get_parameters(onset_death_epiparameter)
plot(onset_death_epiparameter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are here whether we use_epiparameter_database or not, so I would put them in a separate chunk that is always evaluated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 6a54405
But could I ask why it's better to add separately? Originally I thought it was better to leave it in both to have less chunks of code (thinking of the output html/pdf) so that there'd be less bits to hide/show and make it look cleaner, also in case we end up using rmdscaffold or some approach like that, would it be better if there were less chunks that are more self-contained? or more individual bits that you can tag and rearrange?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But could I ask why it's better to add separately?

The idea is to generally avoid having duplicated code (see the DRY principle in software development).

By writing this twice when you could write it just once:

  • it takes slightly more effort to update the code in the future (e.g., to change the plot colours) since we have to edit two places instead of one
  • more importantly, we run the risk of forgetting to update one of the occurrences and now have inconsistent results (e.g., different plot colours) depending on the situation

As always, these general principles need to be approached and applied with a grain of salt. Here, the concerns mentioned above are somewhat reduced by the fact that the duplicated code is in the same location of the document. It would be more of a problem if code was duplicated, e.g., across two child documents.

I still think it's a (albeit minor) net benefit so I think it's worth doing it.

@@ -0,0 +1,65 @@
## Identifying key data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each one of these cases, can you say to which example dataset it corresponds please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry Hugo, do you mean to write it on here or to indicate it within the code chunks?
The example datasets work with different methods, it's a bit complicated but I summarised it here:
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This illustration is useful and should probably be provided to the users somewhere.

My specific request here is to say in each chunk which dataset is used to fill in the defaults you provided.

From the other end of the looking glass, if I want to see a specific situation in action (e.g., the first chunk) here, which dataset do I need to set as input in my report parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I see, so the thing is that, as the image indicates, the template works on either of the datasets when selecting certain parameter options, and there'll be cases where both datasets are usable. I was thinking the datasets were only there to act as examples to test the templates?
I agree 100% that it should be mentioned, either in the documentation or on the chunks themselves (or both) that these are the defaults being used and where they can be used

Comment on lines +21 to +24
data_columns:
label: "Columns present in your dataset"
value: "Dates of onset and total deaths only"
choices: ["Dates of onset and total deaths only", "Dates of onset and death", "Dates of onset and whether case died"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably replace this parameter by two params:

  • cases
  • deaths

Which can take a column name or a number, and from this, we determine automatically in which situation we are. This means we can skip most of id_key_data_linelist.Rmd.

Does this make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought of something like that initially but the issue I'm having is that for each of the options there can be more than 2 options to indicate, hence the inclusion of variables within the code chunks later on, i.e.:

Dates of onset and total deaths:

  • Column with onset dates
  • Variable with total number of deaths

Dates of onset and deaths

  • Column with onset (or other 1ary event) dates
  • Column with death dates

Dates of onset and whether case died

  • Column with onset (or other 1ary event) dates
  • Column with 2ary event dates
  • Column with type of outcome

How would it be possible to explain/indicate in the params that these are the 3 options available? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is partially related to other discussion items so maybe let's try to resolve the rest first and come back to this later.

Comment on lines +33 to +52
```{r define-key-variables3, eval= params$data_columns == "Dates of onset and whether case died", echo= params$data_columns == "Dates of onset and whether case died"}
# This code identifies key variables for analysis in the input dataset.
group_var <- "transmission"
# Leave count_var as NULL if your data is really a linelist/patient-level data.
# Update count_var to a character string with the name of the column that
# contains case counts if your data is already aggregated.
count_var <- NULL
# Enter the geographical location where cases in your data took place
country <- "Equatorial Guinea"
cases <- "disease_onset"
# Indicate which column provides the date of outcome of the case
outcome_date <- "disease_ended"
# Indicate the column containing the case's outcome, e.g., "died" or "recovered"
outcome_type <- "status"

dat_raw <- dat_raw %>%
mutate(deaths = if_else(dat_raw[[outcome_type]] == "died", dat_raw[[outcome_date]], NA))

deaths <- "deaths"
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be almost equivalent to the 2nd case (with a one-line transformation done in L 48).

But it makes the template much more complicated. I think the convenience of doing this single step for the user is probably outweighed by the costs in terms of decreased usability and maintainability.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you think it might be better for the user to do this themselves first? My thoughts initially was that the point of the templates is to be able to read in data that are most likely to be available to the user and I believe this to be quite a common one (with the column indicating the outcome of the case)
Would the only downside the sustainability?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the only downside the sustainability?

Usability as well since as it currently stands, users are going to have to have (minimal) interactions with the underlying code, if only to change the name of the variables of interests.

I think we need to strike a balance between number of cases & ability to address all cases without any changes from the users.

Indeed, we would realistically never be able to cover all cases. What if status if "recovery" & "death" instead of "recovered" & "died".

Here, even if presented slightly differently, these 2 cases are conceptually the same in my opinion. I would group cases on what types of information is present in the data rather than the exact format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants