The opioid epidemic is one of the most pressing public health issues in the United States today. The state of Ohio, including major cities such as Cincinnati, has been hit particularly hard by the opioid epidemic, and as greater national attention and funding have been directed toward the epidemic, the state and its municipalities have responded in a more concerted way. In the summer of 2016, the city of Cincinnati noticed a spike in heroin overdoses and responded by dedicating resources and capacity to the Heroin Overdose Responses dashboard, which allows city officials to view historical overdose data across space and time. These data visualizations can be very helpful for developing a reactive response to the epidemic, facilitating better positioning of EMS services and reducing overdose response times.
The city, however, lacks a proactive response to the persistent issue of heroin overdoses. In this analysis, we develop such a proactive proposal to help the city address the ongoing heroin epidemic in Cincinnati. We build a predictive model that estimates counts of heroin overdoses across space in the city. Such a model will allow public health officials to understand the spatial distribution of risk across the city. We pair this prediction with the identification of potential intervention sites to suggest one example of a proactive response to the epidemic – that is, intervening in high-risk areas before overdoses occur.
All of our methods will be displayed and explained to facilitate replication of this analysis. Our analysis includes many subjective decisions – such as the list of predictive features in our model and the choice of grid cell size for our spatial distribution – and future iterations of this analysis could modify these inputs. This project should therefore serve as an initial prediction and example method of intervention, with broad possibilities for future iterations and improvements.
The video presentation of this analysis can be found here.
To begin, we will set up our analysis by loading all the R libraries we will use and defining some helper functions. Most of the helper functions are for plotting and mapping, but the nn_function is an analytic function that we will use later to measure the average distance between a given point and the nearest n points around it.
library(tidyverse)
library(spatstat)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(corrplot)
library(jsonlite)
library(tidyr)
library(lubridate)
library(stargazer)
library(viridis)
library(data.table)
library(raster)
options(scipen = 999)
setwd("~/Fall 2019/CPLN 590/Final")
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
#The analytic nearest neighbor function: for each point in measureFrom, returns the mean distance to its k nearest points in measureTo
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
Next, we’ll start pulling the data from Cincinnati’s Open Data portal. Our primary data source, which contains the heroin overdoses with location information, is the Cincinnati Fire Incidents dataset, which includes EMS response data. The data is downloaded here as a CSV and then imported.
Once the entire dataset is loaded, we’ll start cleaning the data. We will remove duplicate and cancelled EMS reports based on the DISPOSITION_TEXT field. We’ll also create a new data frame containing just the heroin overdoses.
#Full load
all_fire_data <- fread("Cincinnati_Fire_Incidents.csv", header=TRUE, sep=",", fill=TRUE)
#Clean out some duplicate data
all_fire_data <-
subset(all_fire_data,DISPOSITION_TEXT!="CANCEL INCIDENT" & DISPOSITION_TEXT!=
"DUPLICATE INCIDENT" & DISPOSITION_TEXT!= "EMS FALSE" & DISPOSITION_TEXT!=
"MEDIC TR RESPONDED FALSE")
#Create another data frame that is just heroin overdoses.
heroin_ods <- filter(all_fire_data, grepl('HEROIN', CFD_INCIDENT_TYPE_GROUP))
Now that we have a dataset of the heroin overdoses, we can start doing some exploratory analysis. From our background research, we know that heroin overdoses have received significant attention from government agencies across the United States, including Cincinnati. In the summer of 2016, the Cincinnati government noticed a large spike in heroin overdoses and organized a response. Let’s look at a time series of heroin overdoses below.
We’ll do some datetime handling at the beginning in order to make our plot. More specifically, we’re converting CREATE_TIME_INCIDENT to a datetime.
#Convert the date field we're going to use to a datetime
all_fire_data$CREATE_TIME_INCIDENT = mdy_hms(all_fire_data$CREATE_TIME_INCIDENT)
#Aggregate the time field up to a date field
all_fire_data$CREATE_DATE_INCIDENT = lubridate::date(all_fire_data$CREATE_TIME_INCIDENT)
#Convert the date field we're going to use to a datetime
heroin_ods$CREATE_TIME_INCIDENT = mdy_hms(heroin_ods$CREATE_TIME_INCIDENT)
#Aggregate the time field up to a date field
heroin_ods$CREATE_DATE_INCIDENT = lubridate::date(heroin_ods$CREATE_TIME_INCIDENT)
#Make the timeseries
ggplot(data = heroin_ods %>%
group_by(CREATE_DATE_INCIDENT) %>%
tally(),
aes(x = CREATE_DATE_INCIDENT, y = n))+
geom_line(color = "#C20009", size = .4) +
geom_vline(aes(xintercept = as.numeric(as.Date("2017-01-01"))),
linetype="dashed", size = .75) +
labs(title="Time series of Heroin overdoses. Cincinnati, 07/2015-Present",
x="Date",
y="Number of overdoses")
We see a clear spike of heroin overdoses in the summer of 2016 – the increase that the city government noticed. Overdoses seem to be less common starting in 2017 and even less so since 2018. When thinking about developing our model, we want to have data that will help us make an accurate and generalizable model for the future. We hypothesize that overdoses seen in 2016 are higher than what we will see in the future and therefore we will only be using data since 2017 for this analysis. One could also argue that 2017 overdoses are higher than 2018 and 2019; we have elected to keep 2017 data in our analysis so we preserve a large enough dataset to conduct the necessary analyses and model tests.
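As a quick, hedged sanity check of that hypothesis, we could tabulate overdoses by year (the exact counts will depend on when the data were pulled):
#Hedged sketch: annual overdose counts, supporting the decision to drop pre-2017 data
heroin_ods %>%
mutate(year = lubridate::year(CREATE_DATE_INCIDENT)) %>%
count(year)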
Below, we filter out data from before 1/1/2017.
#We're only going to look at data from 2017 on
all_fire_data <-
subset(all_fire_data, CREATE_DATE_INCIDENT >= ymd("20170101"))
heroin_ods <-
subset(heroin_ods, CREATE_DATE_INCIDENT >= ymd("20170101"))
Now that we have a dataset for our dependent variable, we’ll start creating the datasets that will eventually help us predict heroin overdoses. We’ll pull a crimes dataset, a police use of force dataset, and a police calls dataset from Cincinnati Open Data. We’ll also create a dataset of attempted suicides from the base Fire/EMS data. We’ll then create specific sf objects for each of the events we’re going to turn into predictors (this is a geospatial analysis after all!). We also make sf objects for the Fire/EMS dataset and heroin overdoses here.
Note that a couple of the datasets are created using the grepl command, which searches for a word-part match in the specified field. The naming conventions in the data seem inconsistent, and we want to make sure we’re getting all relevant matches, so we’ll search for word parts instead of exact string matches, as the short example below illustrates.
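Here is a tiny, hypothetical illustration of the difference (the strings below are made up for demonstration and are not drawn from the actual data):
#Hypothetical strings showing why word-part matching casts a wider net than exact matching
incident_types <- c("HEROIN OVERDOSE", "HEROIN OD - F", "OVERDOSE HEROIN", "ASSAULT")
incident_types == "HEROIN OVERDOSE" #exact match: TRUE FALSE FALSE FALSE
grepl("HEROIN", incident_types) #word-part match: TRUE TRUE TRUE FALSE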
All data will be filtered to only include observations since 1/1/17, consistent with our overdose data.
#Convert the base Fire/EMS data to an sf object
all_fire.sf <-
all_fire_data %>%
na.omit(cols=c("LONGITUDE_X", "LATITUDE_X")) %>%
st_as_sf(coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326, agr = "constant") %>%
st_transform(3735)
#Crime data from COD
all_crimes <- fread("PDI__Police_Data_Initiative__Crime_Incidents.csv", header=TRUE, sep=",", fill=TRUE)
#Convert to datetime and filter to 2017 onward
all_crimes$DATE_REPORTED = mdy_hms(all_crimes$DATE_REPORTED)
all_crimes <-
subset(all_crimes, DATE_REPORTED >= ymd("20170101"))
#Project the crime data
crimes.sf <-
all_crimes %>%
na.omit(cols=c("LONGITUDE_X", "LATITUDE_X")) %>%
st_as_sf(coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326, agr = "constant") %>%
st_transform(3735)
#Create a robbery dataset
robbery.sf <- filter(crimes.sf, grepl('ROBBERY', OFFENSE)) %>%
mutate(Legend = "Robbery")
#Police calls data from COD.
police_calls <- fread("PDI__Police_Data_Initiative__Police_Calls_for_Service.csv", header=TRUE, sep=",", fill=TRUE)
#Convert to datetime and filter to 2017 onward.
police_calls$CREATE_TIME_INCIDENT = mdy_hms(police_calls$CREATE_TIME_INCIDENT)
police_calls$CREATE_DATE_INCIDENT = lubridate::date(police_calls$CREATE_TIME_INCIDENT)
police_calls <-
subset(police_calls, CREATE_DATE_INCIDENT >= ymd("20170101"))
#Create dataset for just drug activity
drug_activity <-
police_calls %>%
filter(str_detect(INCIDENT_TYPE_DESC, "DRUG ACTIVITY|INV DRUG COMPLAINT"))
#Project the drug activity
drug_activity.sf <-
drug_activity %>%
na.omit(cols=c("LONGITUDE_X", "LATITUDE_X")) %>%
st_as_sf(coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326, agr = "constant") %>%
st_transform(3735) %>%
mutate(Legend = "DrugActivity")
#Police use of force data from COD
police_force <- read.csv("PDI__Police_Data_Initiative__Use_of_Force.csv")
#Convert to datetime and filter to 2017 onward
police_force$INCIDENT_DATE = mdy_hms(police_force$INCIDENT_DATE)
police_force <-
subset(police_force, INCIDENT_DATE >= ymd("20170101"))
#Project the police use of force data
police_force.sf <-
police_force %>%
na.omit(cols=c("LONGITUDE_X", "LATITUDE_X")) %>%
st_as_sf(coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326, agr = "constant") %>%
st_transform(3735) %>%
mutate(Legend = "PoliceForce")
#Project the heroin overdose data
heroin_ods.sf <-
heroin_ods %>%
na.omit(cols=c("LONGITUDE_X", "LATITUDE_X")) %>%
st_as_sf(coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326, agr = "constant") %>%
st_transform(3735)
#Create a new projected suicide dataset from the fire data
suicide.sf <- filter(all_fire.sf, grepl('SUICIDE', INCIDENT_TYPE_DESC)) %>%
mutate(Legend = "Suicide")
Now that we’ve built our datasets, we’ll conduct some exploratory data analysis to begin understanding distributions of our variables and spatial relationships.
First, we need to establish a spatial framework for our study area. Because we are more interested in the spatial trends of heroin overdose risk (rather than the precise location of each overdose), we’ll overlay a fishnet on the city – a set of uniformly sized grid cells. Below, we’ll build a fishnet with grid cells 400 feet on a side. At the beginning, we’ll also read in a Cincinnati boundary file that we’ll use to snap the fishnet.
#Read in city boundary file
cin_boundary <- st_read("Cincinnati_City_Boundary.shp") %>%
st_transform(3735)
#Make our fishnet of 400 ft grid cells
fishnet <-
st_make_grid(cin_boundary, cellsize = 400) %>%
st_sf()
#Snap the fishnet to the city boundary file
fishnet <-
fishnet[cin_boundary,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)
Now that we have the base fishnet, we can create a fishnet distribution of our dependent variable. Below we’ll join the heroin data to the fishnet to create heroin_net, and then plot it.
#Create a new fishnet that has total counts of heroin overdoses per gridcell
heroin_net <-
heroin_ods.sf %>%
dplyr::select() %>%
mutate(countHeroin = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countHeroin = ifelse(is.na(countHeroin), 0, countHeroin),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = heroin_net, aes(fill = countHeroin)) +
scale_fill_viridis() +
labs(title = "Count of heroin overdoses for the fishnet") +
mapTheme()
We see clear spatial autocorrelation of heroin overdoses in Cincinnati. The highest counts of overdoses are located in the center south of the city, with smaller clusters in adjacent areas to the west and northeast of the main cluster. There also seems to be some lower-level clustering in the far north of the city (between the two unincorporated areas that are surrounded by the city). This map also shows how concentrated the epidemic really is: the majority of the city’s area has zero or very few heroin overdoses over the nearly three full years since 1/1/17. That clustering can also be quantified with a quick global Moran’s I test, sketched next.
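Below is a minimal, hedged sketch using spdep: it builds a throwaway queen weights matrix for heroin_net (a more careful weights matrix for final_net is constructed later in this walkthrough):
#Hedged sketch: global Moran's I for heroin counts on the fishnet
heroin_net.nb <- poly2nb(heroin_net, queen=TRUE)
heroin_net.weights <- nb2listw(heroin_net.nb, style="W", zero.policy=TRUE)
moran.test(heroin_net$countHeroin, heroin_net.weights, zero.policy=TRUE)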
Next, we’ll make a separate fishnet that combines counts of the predictors we created in the previous section. vars_net will include counts of the four predictors per grid cell. We’ll then do some data restructuring and create four maps showing the grid cell-level counts of each predictor.
#Create another fishnet that has total counts of each of our other variables per gridcell
vars_net <-
rbind(suicide.sf %>% dplyr::select(geometry, Legend),
robbery.sf %>% dplyr::select(geometry, Legend),
drug_activity.sf %>% dplyr::select(geometry, Legend),
police_force.sf %>% dplyr::select(geometry, Legend)) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
#Do some restructuring and manipulation to make visualizing our variable grid cell counts easier.
vars_net.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
#Take a look at all the plots
do.call(grid.arrange,c(mapList, ncol =2, top = "Risk Factors by Fishnet"))
These maps suggest that some of the predictors may be more useful than others. For example, DrugActivity seems to have a high cluster in the central south where the highest observed counts of heroin overdoses were, along with some clusters we did not observe in the overdoses, such as in the central western part of the city. PoliceForce has a cluster in the center south, but also one northeast of that main cluster, which may align with the smaller cluster we observed in the heroin overdoses, though we can’t be sure from these maps alone. Robbery is clearly a much more prevalent event, but we still see some consistent clustering patterns in the center south and southeast, with additional clusters in the northwest. Suicide, a rare event, has higher counts only in the center south. This last predictor may be a particularly good candidate for conversion to a distance-based predictor given the rarity of the event.
So that’s what we’ll do now: build the nearest neighbor predictors from each of our predictor datasets. For each of the predictors we’ll use n = 3, except suicide attempt – given the rarity of the event, we’ll use n = 2. At the end we’ll do some additional data manipulation and map the nearest neighbor predictors in the fishnet.
vars_net$PoliceForce.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(police_force.sf), 3)
vars_net$Robbery.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(robbery.sf), 3)
vars_net$Suicide.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(suicide.sf), 2)
vars_net$DrugActivity.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(drug_activity.sf), 3)
vars_net.long.nn <-
vars_net %>%
dplyr::select(ends_with(".nn")) %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange, c(mapList, ncol = 2, top = "Nearest Neighbor Risk Factors by Fishnet"))
The first thing to note about these maps is that the unusual shape of the city of Cincinnati leads to substantial edge effects. For example, the western peninsula is very far from most of the city, which makes it more likely to have higher nearest neighbor values. This effect is especially evident for attempted suicides. We see a similar phenomenon in the east of the city, though to a lesser extent, for example with the drug activity predictor.
For police force, robbery, and drug activity, we see strong clustering in the center south and southeast (not including the peninsula). Drug activity is more consistently distributed in the north of the city, robbery somewhat less so, and police force shows the most distinct clustering in the north of the city, with some areas showing much longer average distances from instances of police force. Suicide again proves to be a rare event, with clusters in the center south and the northeast. We should be wary of this suicide map, though, given the rarity of the event.
With some EDA completed, we’ll start to more closely evaluate the spatial process of heroin overdoses in Cincinnati. We began this earlier by looking at the distribution of observed overdoses by grid cell, but here we’ll do some more analysis to help us understand how heroin overdoses occur across space in the city.
As a first step we’ll bring together the two fishnets (one for heroin overdoses, one for the predictors) into a single final_net.
#Joining the nets
final_net <-
left_join(heroin_net, st_set_geometry(vars_net, NULL), by="uniqueID")
Next, we’ll analyze the spatial process of our dependent variable. To do this, we’ll first create a queen weights matrix, which establishes a spatial relationship between each grid cell and all grid cells that share an edge or corner with it. After that, we compute a local indicator of spatial autocorrelation (LISA) to determine the degree to which heroin overdoses are clustered across the city. We’ll also perform significance tests with a p-value threshold of 0.05 to determine whether the clustering is statistically significant. We’ll present some maps at the bottom that help us evaluate the spatial process of our dependent variable.
#spatial structure
final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
#LISA
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countHeroin, final_net.weights)),
as.data.frame(final_net, NULL)) %>%
st_sf() %>%
dplyr::select(Heroin_Count = countHeroin,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange, c(varList, ncol = 2, top = "Local Moran's I statistics, Heroin overdoses"))
These maps confirm the preliminary visual interpretations from our earlier maps: heroin overdose clustering is happening in the center south and the southwest of the city, with a smaller cluster in the northeast. The significant hotspots map in particular helps us more easily see where clustering is happening – for example, smaller clusters in the central north and southeast of the city are difficult to see on the map of heroin counts, but the significant hotspots map highlights those areas as having significantly high overdose counts compared to their respective surrounding areas.
Having established that there is a spatial process at work in our dependent variable, we want to incorporate the spatial distribution into our overdose model as a predictor. We’ll do this by creating two variables in our final_net: 1) isSig, a binary indicating whether a given grid cell has significant clustering, and 2) isSig.dist, which measures the distance from a given grid cell to the nearest grid cell that does have significant clustering. We’ll build each of these variables below.
#Distance to highly significant hotspots
final_net <-
final_net %>%
mutate(heroin.isSig = ifelse(localmoran(final_net$countHeroin,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(heroin.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, heroin.isSig == 1))), 1 ))
Now, let’s take a look at the distribution of distances to highly significant local hotspots of heroin overdoses (isSig.dist).
#Fishnet of distribution of distances to highly sig. local overdose hotspots
ggplot() +
geom_sf(data = final_net, aes(fill = heroin.isSig.dist)) +
scale_fill_viridis(direction=-1) +
labs(title = "Distance to highly significant local heroin hotspots") +
mapTheme()
The first thing this map suggests is that there is a lot of significant clustering of heroin overdoses across the city – something we saw in our earlier map of significant clusters. We also see the impact of the edge effect again here, especially in the western peninsula of the city. This map again shows that clustering occurs throughout the south and southwest, with additional clustering in the northeast and southeast.
Now that we understand more about our variables (dependent and predictors) and have examined the spatial process of overdoses, we’re ready to start building our model.
We have already established that we’ll be looking at heroin overdoses as a count event that occurs across space, set up as a uniformly sized grid cell fishnet. This framework calls for a Poisson regression, which is the method we’ll use for our analysis; a minimal sketch of what such a model looks like follows.
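The sketch below fits the core model once on the full fishnet, without cross-validation; the predictor list here is illustrative, and the final selection is made only after the correlation plots that follow:
#Hedged sketch: a single Poisson regression on the full fishnet (no cross-validation)
poisson_sketch <- glm(countHeroin ~ Suicide.nn + DrugActivity + Robbery + PoliceForce,
family = "poisson", data = st_set_geometry(final_net, NULL))
summary(poisson_sketch)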
Before we build the model, let’s take a look at some correlation plots between each of our predictors (including the spatial process ones) and our dependent variable. This will help us determine which predictors to include in our predictive model, particularly when we are trying to choose between raw counts of the events and the nearest neighbor predictions.
correlation.long <-
st_set_geometry(final_net, NULL) %>%
dplyr::select(countHeroin, PoliceForce.nn, Robbery.nn, Suicide.nn, DrugActivity.nn, heroin.isSig, heroin.isSig.dist, PoliceForce, Robbery, Suicide, DrugActivity) %>%
gather(Variable, Value, -countHeroin)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countHeroin, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countHeroin)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Heroin count as a function of risk factors")
We see that for drug activity, police force, and robbery, the count-based predictors are more strongly correlated with heroin overdoses, and so we will use those predictors in our model. For suicide, the nearest neighbor variable is more strongly correlated, so we will use it instead of the counts. We will use both of the spatial process predictors given their strong correlations with heroin overdoses.
Now we’re ready to build our model. In the section below, we’ll first read in a census tracts file and join it to our final_net – we’ll later use this for spatial cross-validation of our model. We’ll then cross-validate our model with a spatial leave-one-group-out (LOGO) procedure using the census tracts, ensuring that the model performs well for all areas of the city. We’ve elected a spatial LOGO-CV approach over a time-based one because we subsetted our data at the outset: since we are only looking at data from 2017 on, temporal fluctuations in the data are reduced, so we have decided to focus on spatial generalizability.
We’re using census tracts because all of the neighborhood boundary files on Cincinnati Open Data resulted in part of our fishnet being truncated (i.e., cells fell outside the neighborhood boundaries). Census tracts were the best spatial boundary file available that preserved the entire fishnet.
cin_tract_10 <- st_read("Census_Tracts_2010.shp") %>%
st_transform(3735)
final_net <-
st_centroid(final_net) %>%
st_join(., dplyr::select(cin_tract_10, TRACTCE10)) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
reg.vars <- c("Suicide.nn", "DrugActivity", "Robbery", "PoliceForce",
"heroin.isSig", "heroin.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countHeroin ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
#just the cv models
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countHeroin",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countHeroin, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "TRACTCE10",
dependentVariable = "countHeroin",
indVariables = reg.vars) %>%
dplyr::select(cvID = TRACTCE10, countHeroin, Prediction, geometry)
Next, we’ll compare the predictions from our spatial LOGO-CV regression against the actual observed heroin overdoses.
#Create the regression summary
reg.summary <-
rbind(
mutate(reg.spatialCV, Error = countHeroin - Prediction,
Regression = "Spatial LOGO-CV")) %>%
st_sf()
filter(reg.summary,
Regression == "Spatial LOGO-CV") %>%
ggplot() +
geom_sf(aes(fill = Prediction)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Predicted Heroin overdoses by Regression") +
mapTheme()
Our predictions are heavily concentrated in the center south. If you look closely, you can see light clustering across the south, including around the main clusters. There is some additional clustering in the northeast, but it’s hard to see. Unfortunately, our small grid cell size makes this map hard to interpret. The next section will apply some more robust testing to our model.
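Before moving to that comparison, a quick, hedged numeric check of the fold errors can complement the map:
#Hedged sketch: mean error and mean absolute error of the spatial LOGO-CV predictions
reg.summary %>%
st_set_geometry(NULL) %>%
group_by(Regression) %>%
summarize(Mean_Error = mean(Error), MAE = mean(abs(Error)))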
To better understand how well our model performs, let’s evaluate our predictions against a widely used practice of resource distribution in government: event hot spotting. In this next section we’ll create a kernel density map (i.e., hotspots) to see how our model performs against the current best practice of distributing resources to where the problem has occurred historically.
Here, we create a kernel density using a radius of 1,000 feet, then create a map using a sample of 1,000 heroin overdoses.
heroin_ppp <- as.ppp(st_coordinates(heroin_ods.sf), W = st_bbox(final_net))
heroin_KD <- spatstat::density.ppp(heroin_ppp, 1000)
as.data.frame(heroin_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(heroin_ods.sf, 1000), size = .05) +
labs(title = "Kernel Density Map of Heroin overdoses") +
scale_fill_viridis() +
mapTheme()
There seems to be a set of grid cells that are not correctly generating a kernel density value (shown in grey). The kernel density surface still appears to capture the spatial distribution of risk, though.
The next logical step is to compare the kernel density map to our prediction map. In the next section, we’ll create a goodness of fit indicator that classifies the risk of a given grid cell under each method (our prediction model and kernel density). Placing each grid cell into one of 5 risk levels will help us more directly compare each model’s performance.
First we’ll do this for kernel density. Below we’ll place the KD risk of each grid cell into one of 5 risk categories and plot them on a map, along with a sample of overdoses.
# Compute kernel density
her_ppp <- as.ppp(st_coordinates(heroin_ods.sf), W = st_bbox(final_net))
her_KD <- spatstat::density.ppp(her_ppp, 1000)
# Convert kernel density to grid cells taking the mean
her_KDE_sf <- as.data.frame(her_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set heroin counts are spatially joined to the fishnet.
bind_cols(
aggregate(
dplyr::select(heroin_ods.sf) %>% mutate(heroinCount = 1), ., length) %>%
mutate(heroinCount = replace_na(heroinCount, 0))) %>%
#Select the fields we need
dplyr::select(label, Risk_Category, heroinCount)
#Map the KD risk
rbind(her_KDE_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(heroin_ods.sf, 1500), size = .1, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
Again we see that many of the cells are missing a risk category because something is going wrong with the kernel density generation. Here those missing data are shown as NA values (white). We will proceed.
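As a quick, hedged check of how widespread the issue is, we can count and proportion the affected cells:
#Hedged sketch: how many fishnet cells failed to receive a kernel density risk category
sum(is.na(her_KDE_sf$Risk_Category))
mean(is.na(her_KDE_sf$Risk_Category))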
Next we’ll do the same for our spatial LOGO regression.
her_risk_sf <-
filter(reg.summary, Regression == "Spatial LOGO-CV") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
bind_cols(
aggregate(
dplyr::select(heroin_ods.sf) %>% mutate(heroinCount = 1), ., length) %>%
mutate(heroinCount = replace_na(heroinCount, 0))) %>%
dplyr::select(label,Risk_Category, heroinCount)
Now we can plot a direct comparison of the two maps to evaluate our prediction model against the kernel density model.
rbind(her_KDE_sf, her_risk_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(heroin_ods.sf, 1500), size = .1, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
Finally, we’ll create some bar plots to see how our model compares to kernel density. Below, for each model, we look at the share of observed overdoses that fall into each risk category. We would prefer to see a higher share in the highest risk categories (i.e., the model correctly flags the areas that experienced overdoses as high risk).
rbind(her_KDE_sf, her_risk_sf) %>%
st_set_geometry(NULL) %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countHeroin = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_ODs = countHeroin / sum(countHeroin)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_ODs)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE)
The substantially higher share of overdoses in the highest risk category under our model suggests that it is a better way of predicting risk than a kernel density map. These results indicate that our model is an improvement over the business-as-usual practice of kernel density. We note, however, the large number of grid cells that were not successfully categorized under the kernel density model – this could certainly be affecting our results. Because those NAs appeared to be relatively evenly distributed across the grid, we will proceed with the above interpretations and conclusions.
Finally, let’s return to our use case. At the beginning of this project, we stated that we wanted to help public health officials respond more proactively to the heroin epidemic. To that end, we have built a model that helps predict where overdoses will occur. The next logical question is: what do we do with that information? While we are not best positioned to answer this question – after all, we are not public health officials – we do have some sample scenarios we would like to show here.
In this next part, we overlay some potential “intervention sites,” where prevention outreach could be conducted and resources distributed. As an example, we have identified places of worship in Cincinnati (using this dataset) as potential places where public health officials could contact members of at-risk populations. In the next map, we overlay places of worship (105 total in the city) on top of our prediction map as a suggested tool for public health officials.
worship.sf <- st_read("All_Places_of_Worship.shp") %>%
st_transform(3735) %>%
st_intersection(.,st_union(cin_boundary))
filter(reg.summary, Regression == "Spatial LOGO-CV")%>%
ggplot() +
geom_sf(aes(fill = Prediction)) +
geom_sf(data = worship.sf, color="red", size = .75) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Heroin overdose Predictions with Places of Worship") +
mapTheme()
This is just one example of how our analysis could be leveraged. Depending on what city officials and community organizations determine to be optimal sites for programming, we could modify this type of use case and make it more sophisticated by, for example, taking in operational information relevant to the spaces and more precisely matching feasible resources with high-risk areas. One simple extension, sketched below, would be to rank candidate sites by the predicted overdose counts around them.
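The sketch below is one hedged way to do that: buffering each place of worship by roughly a quarter mile (1,320 ft in the feet-based CRS) and summing the predicted overdose counts of the intersecting grid cells. The buffer distance and the site_id field are illustrative choices, not part of the original analysis.
#Hedged sketch: rank candidate intervention sites by nearby predicted overdoses
worship_ranked <-
worship.sf %>%
mutate(site_id = row_number()) %>% #illustrative identifier; substitute the site name field if preferred
st_buffer(1320) %>% #quarter-mile buffer; distance is illustrative
st_join(filter(reg.summary, Regression == "Spatial LOGO-CV")) %>%
st_set_geometry(NULL) %>%
group_by(site_id) %>%
summarize(predicted_ODs_nearby = sum(Prediction, na.rm = TRUE)) %>%
arrange(desc(predicted_ODs_nearby))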
The small size of our grid cells probably made some of the analysis difficult to visually interpret; we saw this especially with the prediction map. A larger grid cell size should be tested in future iterations of this analysis, and it would be helpful to discuss the unit of analysis with city officials when considering the optimal grid cell size. In future applications of this analysis, we recommend replicating the model with a larger grid cell fishnet (sketched below). One thing to keep in mind, however, is that a larger grid cell size may lose local nuance and wash out some of the spatial autocorrelation of heroin overdoses. As with all data analysis and model building, every decision we make involves tradeoffs.
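As a hedged illustration, rebuilding the fishnet at a coarser resolution only requires changing the cellsize argument; the 800 ft value below is illustrative, not a recommendation:
#Hedged sketch: a coarser fishnet for a future iteration (800 ft grid cells, chosen arbitrarily)
fishnet_large <-
st_make_grid(cin_boundary, cellsize = 800) %>%
st_sf()
fishnet_large <-
fishnet_large[cin_boundary,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)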
We also noted the impact of edge effects at many points in this analysis, due to the shape of Cincinnati. We would suggest exploring methods to control for this effect, perhaps by evaluating a larger area for the analysis and then clipping the results to the city boundary as a final step before presenting to public health officials (a rough sketch of this idea follows). These method-determining decisions should be made in close consultation with project stakeholders.
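One hedged way to operationalize that suggestion: build the fishnet over a buffered study area, run the analysis on it, and clip the final outputs back to the city boundary. The half-mile (2,640 ft) buffer below is an arbitrary illustration.
#Hedged sketch: buffered study area to soften edge effects; buffer distance is illustrative
study_area <- st_buffer(cin_boundary, 2640)
fishnet_buffered <-
st_make_grid(study_area, cellsize = 400) %>%
st_sf()
fishnet_buffered <-
fishnet_buffered[study_area,] %>%
mutate(uniqueID = rownames(.))
#...run the analysis on fishnet_buffered, then clip final outputs back to the city, e.g.:
#st_intersection(predictions, st_union(cin_boundary))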
We also recommend drawing on the knowledge of public health experts when developing features for future models. Given the extensive research on opioid addiction, there are surely features missing from our model that could make it more robust.