---
title: "Google Symptoms Dataset"
author: "Addison"
output:
  github_document:
    toc: true
---

```{r import_statements, echo = FALSE, message = FALSE}
library(tidyverse)
```

In this notebook, we look at California data in order to obtain a
general impression of the dataset.

```{r base_urls, echo = FALSE, message = FALSE}
BASE_DAILY_URL = paste0(
  'https://raw.githubusercontent.com/google-research/open-covid-19-data/',
  'master/data/exports/search_trends_symptoms_dataset/',
  'United%20States%20of%20America/subregions/{state}/',
  '2020_US_{state}_daily_symptoms_dataset.csv')
BASE_WEEKLY_URL = paste0(
  'https://raw.githubusercontent.com/google-research/open-covid-19-data/',
  'master/data/exports/search_trends_symptoms_dataset/',
  'United%20States%20of%20America/subregions/{state}/',
  '2020_US_{state}_weekly_symptoms_dataset.csv')

CA_DAILY_URL = str_replace_all(BASE_DAILY_URL,
                               fixed('{state}'), 'California')
CA_WEEKLY_URL = str_replace_all(BASE_WEEKLY_URL,
                                fixed('{state}'), 'California')
```

```{r read_data, echo = FALSE, message = FALSE}
ca_daily_df = read_csv(CA_DAILY_URL)
ca_weekly_df = read_csv(CA_WEEKLY_URL)
```


### Temporal availability
Data is available at a daily resolution, starting on January 1, 2020.
Google also provides weekly rollups every Monday, starting on January 6, 2020.
The weekly rollups are useful because the larger sample size diminishes the
effect of the noise added for differential privacy; in some circumstances they
may also let us obtain information in regions where data is too sparse to be
reported at a consistent daily resolution.

Daily availability:
```{r daily_availability, echo = TRUE}
print(head(unique(ca_daily_df$date)))
print(tail(unique(ca_daily_df$date)))
print(length(unique(ca_daily_df$date)))
```

Weekly availability:
```{r weekly_availability, echo = TRUE}
print(unique(ca_weekly_df$date))
```
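
As a rough check on the claim that the weekly rollups are less sparse, we can
compare the overall share of non-missing symptom values in the two frames (a
minimal sketch; it assumes the weekly file uses the same `symptom:` column
prefix as the daily file):
```{r compare_sparsity, echo = TRUE}
# Share of symptom cells that are populated in the daily vs. weekly frames.
mean(!is.na(as.matrix(select(ca_daily_df, contains('symptom:')))))
mean(!is.na(as.matrix(select(ca_weekly_df, contains('symptom:')))))
```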

### Spatial availability
Data is available at the county level, which is an improvement over our
original Google Health Trends signal. However, there are varying degrees of
missingness, in line with Google's policy of not reporting data when counts
are too small, in order to protect users' privacy.

```{r spatial_availability, echo = TRUE}
print(unique(ca_daily_df$open_covid_region_code))
```
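
To see how the small-count suppression plays out across regions, we can
compute the share of populated symptom cells per region code (a sketch over
the same daily frame):
```{r regional_availability, echo = TRUE}
# Share of populated symptom cells for each region, most complete first.
ca_daily_df %>%
  select(open_covid_region_code, contains('symptom:')) %>%
  pivot_longer(-open_covid_region_code) %>%
  group_by(open_covid_region_code) %>%
  summarise(prop_filled = mean(!is.na(value))) %>%
  arrange(desc(prop_filled)) %>%
  head()
```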

### Symptom availability
The dataset provides normalized search volumes for 422 distinct "symptoms".
Note, however, that a single search can count towards multiple symptoms, per
Google's documentation. The normalization is a linear scaling of the raw
search volumes; the details are described in the PDF documentation that
accompanies the dataset.

```{r extract_symptom_columns, echo = TRUE}
symptom_cols = colnames(ca_daily_df)[
  str_detect(colnames(ca_daily_df), 'symptom:')]
symptom_names = str_replace(symptom_cols, fixed('symptom:'), '')
```
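
A quick sanity check that the column count matches the 422 symptoms mentioned
above:
```{r count_symptom_columns, echo = TRUE}
# Number of symptom columns in the daily frame.
length(symptom_cols)
```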

Although there are hundreds of topics included, note that neither
`covid` nor `corona` appears as a substring of any symptom name.

```{r grep_covid_corona, echo = TRUE}
sum(str_detect(symptom_names, fixed('covid', ignore_case = TRUE)))
sum(str_detect(symptom_names, fixed('corona', ignore_case = TRUE)))
```

```{r calculate_availability, echo = FALSE}
data_matrix = ca_daily_df %>%
  filter(date >= '2020-03-15') %>%
  select(contains('symptom:')) %>%
  as.matrix()
availability_vector = apply(!is.na(data_matrix), 2, mean)
names(availability_vector) = symptom_names
```

The large set of symptoms for which data is provided spans those that are
available almost every day to those available on only about 10% of days.
If my understanding is correct, we can think of data availability as roughly
corresponding to whether search incidence exceeds a certain minimum threshold.

```{r plot_availability, echo = FALSE}
plot(sort(availability_vector),
     main = 'Availability of data across symptoms',
     ylab = 'Prop. of days that a symptom reported data')
```
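
A few quantiles of the availability vector give a numeric companion to the
plot:
```{r availability_quantiles, echo = TRUE}
# Distribution of per-symptom availability since 2020-03-15.
quantile(availability_vector, probs = c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1))
```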

### Symptoms by degree of availability
Because 422 topics are too many to parse simultaneously, I organize them
by their availability (1 - missingness) level, computed from March 15, 2020
onward (a soft start point for the coronavirus pandemic in the United States).
```{r print_symptoms_by_availability, echo = TRUE}
for (idx in 9:0) {
  cat(sprintf('Symptoms that reported data on %d%% to %d%% of days:',
              idx*10, (idx+1)*10), '\n')
  # Half-open bins [idx/10, (idx+1)/10) so each symptom lands in exactly one
  # bin; the top bin also includes symptoms with 100% availability.
  in_bin = availability_vector >= idx/10 &
    (availability_vector < (idx+1)/10 | (idx == 9 & availability_vector >= 1))
  print(names(availability_vector[in_bin]))
  cat('\n')
}
```
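
For a compact overview, we can also count how many symptoms fall into each
decile bin (a sketch using `cut` with right-closed intervals, so the exact bin
edges differ slightly from the listing above):
```{r availability_bin_counts, echo = TRUE}
# Number of symptoms per availability decile.
table(cut(availability_vector, breaks = seq(0, 1, by = 0.1),
          include.lowest = TRUE))
```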

### All symptoms
Here we print the entire symptom list, ordered by the columns in Google's
dataset:
```{r print_all_symptoms, echo = TRUE}
print(symptom_names)
```

```{r comment_block, echo = FALSE}
# Further directions
# Filter to counties [states] for each
# Concatenate all counties [states]
# Export

# * Missingness in data by column (symptom type)
# * Need a better way of visualizing given that there are over 400 symptoms
# * Many of these symptoms are not necessarily related to COVID; most seem like
#   general health terms, such as "Weight gain" and "Wart"
# * Some way of imputing missing values
# * Find principal components (maybe some subset of the columns that are densely
#   populated?), use PC1 as the initial signal
# * Run correlation analysis against the initial version of the signal
```
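
As a rough sketch of the principal-components idea in the notes above, one
could restrict to symptom columns that are fully observed since March 15 and
take the first principal component of the state-level rows as a candidate
signal. This is only an illustration under assumptions (that `US-CA` is the
state-level region code and that the retained columns have non-zero variance),
so the chunk is not evaluated here:
```{r pc1_sketch, echo = TRUE, eval = FALSE}
# Sketch: PCA on fully observed symptom columns; PC1 as a candidate signal.
dense_names = names(availability_vector)[availability_vector == 1]
state_matrix = ca_daily_df %>%
  filter(date >= '2020-03-15',
         open_covid_region_code == 'US-CA') %>%  # assumed state-level code
  select(all_of(paste0('symptom:', dense_names))) %>%
  as.matrix()
pca = prcomp(state_matrix, center = TRUE, scale. = TRUE)
pc1_signal = pca$x[, 1]  # one value per day in the filtered date range
```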