r4ds · jonthegeek · May 20, 2024 · May 18, 2024 · May 20, 2024 · May 20, 2024
diff --git a/10_disease-risk-modeling.Rmd b/10_disease-risk-modeling.Rmd
@@ -2,12 +2,269 @@
 
 **Learning objectives:**
 
-- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
+- how to model disease risk
+- how to map disease risk
+
+
+## Introduction
+
+- Bayesian spatial models have been used to understand geographic patterns and risk factors of childhood overweight and obesity prevalence in Costa Rica (Gómez et al. 2023), and mosquito-borne diseases in Brazil (Pavani, Bastos, and Moraga 2023). 
+
+- Spatial methods can also be extended to analyze areal data that are both spatially and temporally referenced. 
+
+
+## Modeling of lung cancer risk in Pennsylvania
+
+Bayesian spatial model to estimate the risk of lung cancer and assess its relationship with smoking in Pennsylvania, USA, in 2002.
+
+```{r message=FALSE, warning=FALSE}
+library(tidyverse)
+library(SpatialEpi)
+data(pennLC)
+names(pennLC)
+```
+
+
+```{r message=FALSE, warning=FALSE}
+head(pennLC$data)
+```
+
+```{r}
+library(sf)
+map <- st_as_sf(pennLC$spatial.polygon)
+countynames <- sapply(slot(pennLC$spatial.polygon, "polygons"),
+                      function(x){slot(x, "ID")})
+map$county <- countynames
+head(map)
+```
+
+```{r}
+d <- group_by(pennLC$data, county) %>% 
+  summarize(Y = sum(cases))
+head(d)
+```
+```{r}
+p1 <- map %>%
+  right_join(d, by = c("county" = "county")) %>%
+  ggplot()+
+  geom_sf(aes(fill=Y), color="navy")+
+  scale_fill_continuous(low="white", high="navy")+
+  geom_sf_text(aes(label=county,alpha=Y,color="orange"), size=2)+
+  coord_sf()+
+  labs(title = "Lung cancer cases",
+       subtitle = "Pennsylvania 1990-1994",
+       caption = "Source: PennLC data",
+       fill = "Cases")+
+  theme_minimal()
+p1
+```
+
+
+### Expected cases
+
+$$
+E_i = \sum_{j=1}^m r_j^{(s)} n_j^{(s)}
+$$
+
+where $r_j^{(s)}$ is the rate of disease in stratum $j$ and $n_j^{(s)}$ is the population in stratum $j$.
+
+
+```{r}
+pennLC$data <- pennLC$data[order(pennLC$data$county,
+                                 pennLC$data$race, 
+                                 pennLC$data$gender, 
+                                 pennLC$data$age), ]
+
+E <- expected(population = pennLC$data$population,
+              cases = pennLC$data$cases, 
+ # 2 races, 2 genders, and 4 age groups (2×2×4 = 16)
+              n.strata = 16)
+d$E <- E
+head(d)
+```
+
+Let's visualize the difference between observed and expected cases.
+```{r message=FALSE, warning=FALSE}
+map %>%
+  right_join(d, by = c("county" = "county")) %>%
+  pivot_longer(cols = c("Y", "E"), 
+               names_to = "type", 
+               values_to = "value") %>%
+  mutate(value=round(value))%>%
+  ggplot()+
+  geom_sf(aes(fill=value), color="navy")+
+  scale_fill_continuous(low="white", high="navy")+
+  geom_sf_text(aes(label=value,
+                   alpha=value,
+                   color="orange"), 
+               fontface="bold",
+               size=2)+
+  coord_sf()+
+  facet_wrap(~type)+
+  guides(fill = guide_colorbar(title = "Cases"),alpha="none",color="none")+
+  labs(title = "Lung cancer Observed and Expected cases",
+       subtitle = "Pennsylvania 1990-1994",
+       caption = "Source: PennLC data",
+       fill = "Type")+
+  theme_minimal()+
+  theme(axis.text = element_blank(),
+        axis.title = element_blank(),
+        legend.position = "bottom")
+```
+
+### Standardized Mortality Ratios
+
+$$
+SMR_i = \frac{Y_i}{E_i}
+$$
+
+if $SMR_i > 1$, then the county has more cases than expected, and if $SMR_i < 1$, then the county has fewer cases than expected.
+
+```{r}
+# add the smoking data
+d <- dplyr::left_join(d, pennLC$smoking, by = "county")
+d$SMR <- d$Y/d$E
+head(d)
+```
+
+```{r message=FALSE, warning=FALSE}
+ggplot(d, aes(x=smoking, y=SMR))+
+  geom_point(aes(color=smoking))+
+  geom_smooth(se=FALSE)+
+  labs(title = "Lung cancer SMR vs Smoking",
+       subtitle = "Pennsylvania 1990-1994",
+       caption = "Source: PennLC data",
+       x = "Smoking",
+       y = "SMR")+
+  theme_minimal()
+```
+
+```{r}
+map <- dplyr::left_join(map, d, by = "county")
+```
+
+```{r message=FALSE, warning=FALSE}
+map %>%
+  ggplot()+
+  geom_sf(aes(fill=SMR), color="navy")+
+  scale_fill_continuous(low="white", high="navy")+
+  geom_sf_text(aes(label=county,
+                   alpha=SMR,
+                   color="orange"), 
+               size=2)+
+  coord_sf()+
+  guides(fill = guide_colorbar(title = "Cases"),
+         alpha="none",
+         color="none")+
+  labs(title = "Lung cancer SMR",
+       subtitle = "Pennsylvania 1990-1994",
+       caption = "Source: PennLC data",
+       fill = "SMR")+
+  theme_minimal()
+```
+
+
+## Modeling diseased risk
+
+
+$$
+Y_i| \theta_i \sim Poisson(E_i \times \theta_i)
+$$
+
+where 
+
+- $\theta_i$ is the **relative risk** for county $i$,
+
+- $u_i$ is the **structured random effect** for county $i$ modeled with an intrinsic conditional autoregressive (ICAR) model, 
+
+$$
+u_i|u_{-i} \sim N(\bar{u}_{\delta_i} \frac{1}{\tau_u n_{\delta_i}})
+$$
+
+
+- $v_i$ is the random effect for stratum $i$
+
+$$
+v_i \sim N(0, \frac{1}{\tau_v})
+$$
+
+
+### Neighborhood structure
+
+
+```{r message=FALSE, warning=FALSE}
+library(spdep)
+library(INLA)
+nb <- poly2nb(map)
+nb2INLA("map.adj", nb)
+g <- inla.read.graph(filename = "map.adj")
+
+```
+
+### Model
+
+```{r}
+map$re_u <- 1:nrow(map)
+map$re_v <- 1:nrow(map)
+```
+
+
+```{r}
+formula <- Y ~ smoking +
+  f(re_u, 
+    model = "besag", 
+    graph = g, 
+    scale.model = TRUE) +
+  f(re_v, model = "iid")
+```
+
+
+```{r eval=FALSE}
+res <- inla(formula, 
+            family = "poisson", 
+            data = map, 
+            E = E,
+            control.predictor = list(compute = TRUE),
+            control.compute = list(return.marginals.predictor = TRUE))
+```
+
+
+     res$summary.fixed
+                    mean     sd 0.025quant 0.5quant
+     (Intercept) -0.3235 0.1498   -0.61925  -0.3233
+     smoking      1.1546 0.6226   -0.07569   1.1560
+                 0.975quant    mode       kld
+     (Intercept)   -0.02877 -0.3234 3.534e-08
+     smoking        2.37845  1.1563 3.545e-08
+
+
+### Relative Risk
+
+     res$summary.fitted.values[1:3, ]
+                           mean      sd 0.025quant 0.5quant
+     fitted.Predictor.01 0.8781 0.05808     0.7648   0.8778
+     fitted.Predictor.02 1.0597 0.02750     1.0072   1.0592
+     fitted.Predictor.03 0.9646 0.05089     0.8604   0.9657
+                         0.975quant   mode
+     fitted.Predictor.01     0.9936 0.8778
+     fitted.Predictor.02     1.1150 1.0582
+     fitted.Predictor.03     1.0622 0.9681
+
+
+     # relative risk
+     map$RR <- res$summary.fitted.values[, "mean"]
+
+
+     # lower and upper limits 95% CI
+     map$LL <- res$summary.fitted.values[, "0.025quant"]
+     map$UL <- res$summary.fitted.values[, "0.975quant"]
+
+
+See the map here:     
+
+<https://www.paulamoraga.com/book-spatial/disease-risk-modeling.html#mapping-smr>
 
-## SLIDE 1 {-}
 
-- ADD SLIDES AS SECTIONS (`##`).
-- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
 
 ## Meeting Videos {-}
 

diff --git a/11_areal-data-issues.Rmd b/11_areal-data-issues.Rmd
@@ -2,12 +2,21 @@
 
 **Learning objectives:**
 
-- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
+- Type of Areal data issues
+
+
+## Areal data issues
+
+Areal data issues are common in spatial epidemiology. The most common are:
+
+
+1. **Misaligned Data Problem (MIDP)**: which refers to a situation where the spatial data being analyzed are at a different scale than the one at which they were originally collected. This can lead to biased results because the data are not aligned with the scale of the analysis.
+
+2. **Modifiable Areal Unit Problem (MAUP)**: The MAUP is a statistical bias that occurs when data are aggregated into different areal units. The MAUP can lead to different results depending on the scale of the analysis. For example, if you aggregate data at the county level, you may get different results than if you aggregate data at the state level.
+
+3. **Special case of the MAUP**: **ecological fallacy**, which occurs when you make inferences about individuals based on group-level data. For example, if you find a correlation between income and disease at the county level, you may incorrectly assume that the same correlation exists at the individual level.
 
-## SLIDE 1 {-}
 
-- ADD SLIDES AS SECTIONS (`##`).
-- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
 
 ## Meeting Videos {-}
 

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -16,6 +16,7 @@ Imports:
     mapview,
     rmarkdown,
     sf,
+    SpatialEpi,
     spData,
     spdep,
     tidyverse