Crime is a major issue in Chicago and has been the focus of national attention. Law enforcement resources are extremely strained, leading police departments across the country to look for new and improved ways to make decisions about police resource allocation. Predictive policing tools have become popular over the past decade and, more recently, have come under intense scrutiny for the way they perpetuate the bias embedded in existing police practices and the data those practices generate.
In this project, we develop a method to geospatially predict crime, measure its effectiveness against an existing resource allocation method (kernel density mapping), and evaluate the ability of our method to generalize across the city of Chicago, paying particular attention to its performance with respect to racial demographics.
First, we have some basic setup: creating visualization themes and base maps that include the City of Chicago (excluding O’Hare airport), police districts, and police beats. We include the police spatial boundaries because our subject here is crime.
We also create the fishnet that we’ll use as the spatial framework for our predictions. The fishnet divides the entire city into uniform 500 ft. by 500 ft. grid cells. We will ultimately predict crime counts within each grid cell.
library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(raster)
setwd("~/Fall 2019/CPLN 590/Week9/riskPrediction_data/riskPrediction_data")
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 14, colour = "black"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill = NA, size = 2)
  )
}
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform(crs=102271) %>%
dplyr::select(District = dist_num)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform(crs=102271) %>%
dplyr::select(District = beat_num)
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
chicagoBoundary <-
st_read("chicagoBoundary.shp") %>%
st_transform(crs=102271)
fishnet <-
st_make_grid(chicagoBoundary, cellsize = 500) %>%
st_sf()
fishnet <-
fishnet[chicagoBoundary,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)
Now we’re ready to start mapping our outcome, or dependent variable. For this analysis, we have selected a specific crime: robberies committed with a handgun. We know that crime reporting displays significant selection and reporting biases, and we theorize that more serious, violent crimes (e.g., those involving a firearm) may suffer from these biases less than lower-level crimes (e.g., nonviolent drug offenses).
We’ll use 2017 crime data from Chicago Open Data. In the code below, we download the data and map it to evaluate the spatial distribution of this crime in 2017.
armed_robbery <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "ROBBERY" &
Description == "ARMED: HANDGUN") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform(102271) %>%
distinct()
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = armed_robbery, colour="red", size=0.1, show.legend = "point") +
labs(title= "Armed Robberies, Chicago - 2017") +
mapTheme()
The map suggests that there is definitely some clustering of armed robbery in Chicago. The highest concentration appears to be in the north-central/west part of the city, with another large cluster in south-central Chicago and consistent clustering along the entire eastern edge of the city. Notably, armed robbery is almost completely absent from the far southeast and far northwest of the city. Based on what we know about Chicago and crime more generally, we can speculate that there is both 1) selection bias in these data, meaning that certain areas experience more of this crime perhaps because those who commit it understand there are places where they can get away with it, and 2) reporting bias, meaning that certain areas are more or less likely to communicate the occurrence of these events through formal channels. Particularly in places like Chicago where community-police relationships are often fraught, we can imagine situations where crime goes unreported because of a lack of trust in the police and a desire to resolve these types of conflicts through alternative methods outside formal law enforcement.
Next, we will join these crime data into our fishnet to set up our ability to understand this distribution within that geospatial framework.
crime_net <-
armed_robbery %>%
dplyr::select() %>%
mutate(countarmedrob = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countarmedrob = ifelse(is.na(countarmedrob), 0, countarmedrob),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = crime_net, aes(fill = countarmedrob)) +
scale_fill_viridis() +
labs(title = "Count of Armed Robberies for the Fishnet") +
mapTheme()
The fishnet map is largely consistent with our original map, further underscoring the clustering of armed robberies throughout the city. This map seems to highlight the northeastern cluster as the most concentrated area of armed robbery in the city, with the south-central cluster perhaps larger but not quite reaching the upper threshold (i.e., there don’t appear to be any cells with 20+ robberies).
Now we’re ready to start building some predictors as part of our feature engineering process. In this section, we use data about the built environment as predictors of armed robbery. The data below come from Chicago Open Data.
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
asbestos <-
read.socrata("https://data.cityofchicago.org/Environment-Sustainable-Development/CDPH-Asbestos-and-Demolition-Notification/qhb4-qx8k/data") %>%
mutate(year = substr(start_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Asbestos")
publichousing <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Affordable-Rental-Housing-Developments/s6ha-ppgi/data") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Public_Housing")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ == "Front" |
where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Cur rent-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
Now that we’ve made the features, we’ll join them into our existing fishnet. First, we’ll aggregate all the predictors into a single fishnet, vars_net. Once they are combined, we’ll convert each predictor into a nearest neighbor feature. A nearest neighbor feature is the average distance from the centroid of a given grid cell to the nearest k instances of that predictor. For example, the nearest neighbor (k = 3) value for graffiti in a given grid cell is the average of the distances from the cell’s centroid to the three nearest graffiti incidents.
Nearest neighbor features typically display higher predictive power than raw counts of predictors.
vars_net <-
rbind(streetLightsOut,publichousing, liquorRetail, abandonCars, sanitation, asbestos, abandonBuildings, graffiti) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
nn_function <- function(measureFrom, measureTo, k) {
  # coordinate matrices for the origins (grid cell centroids) and destination points
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  # distances from each origin to its k nearest destination points
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%  # average of the k distances
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}
vars_net$Abandoned_Buildings.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonBuildings), 3)
vars_net$Graffiti.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(graffiti), 3)
vars_net$Sanitation.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)
vars_net$Abandoned_Cars.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonCars), 3)
vars_net$Liquor_Retail.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)
vars_net$Street_Lights_Out.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(streetLightsOut), 3)
vars_net$publichousing.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(publichousing), 3)
vars_net$asbestos.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(asbestos), 3)
vars_net.long.nn <-
vars_net %>%
dplyr::select(ends_with(".nn")) %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange, c(mapList, ncol = 2, top = "Nearest Neighbor Risk Factors by Fishnet"))
The above maps give us an idea of how several features might act as predictors of, or deterrents to, crime. We can hypothesize that street light outages, abandoned buildings, abandoned cars, sanitation complaints, graffiti, and liquor stores might be located close to crime. Given the history of public housing and the lack of associated funding, crime might also be located nearby; the map of that predictor suggests that the two hotspots identified in our outcome may overlap with areas of public housing. Our last feature is asbestos notices, with the hypothesis that structures still contaminated with asbestos may be in disrepair and proximate to areas where crime occurs.
Next, we’ll examine the local spatial autocorrelation of armed robberies in Chicago by looking at a few Local Moran’s I maps. At the end of the code block, we’ll also create two metrics: a flag for grid cells that belong to statistically significant clusters and the distance from each cell to the nearest significant cluster.
In this section, we are primarily interested in understanding where armed robbery is clustered in a statistically significant way relative to the surrounding area.
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
final_net <-
left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID")
final_net <-
st_centroid(final_net) %>%
st_join(., dplyr::select(neighborhoods, name)) %>%
st_join(., dplyr::select(policeDistricts, District)) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countarmedrob, final_net.weights)),
as.data.frame(final_net, NULL)) %>%
st_sf() %>%
dplyr::select(ArmedRob_Count = countarmedrob,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange, c(varList, ncol = 2, top = "Local Moran's I statistics, Armed Robbery"))
These maps suggest statistically significant clustering of armed robberies in the areas that looked like hotspots on the original dependent variable maps at the beginning of the analysis. This interpretation is further supported by higher Local Moran’s I values in those same areas, meaning that, relative to the surrounding areas, armed robberies cluster where Local Moran’s I is high. Statistical significance is represented by the latter two maps (i.e., dark purple in the bottom left map and yellow in the bottom right map).
Next, we’ll evaluate the correlations between armed robbery and each predictor (and the associated nearest neighbor predictors).
In this section we will also create two variables to capture the underlying spatial process in our dependent variable: 1) armedrob.isSig, a binary indicating whether a given grid cell belongs to a statistically significant cluster, and 2) armedrob.isSig.dist, which measures the distance from a given grid cell to the nearest grid cell that does belong to a significant cluster. Each of these variables will be displayed in the correlation plots as well.
final_net <-
final_net %>%
mutate(armedrob.isSig = ifelse(localmoran(final_net$countarmedrob,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(armedrob.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, armedrob.isSig == 1))), 1 ))
correlation.long <-
st_set_geometry(final_net, NULL) %>%
dplyr::select(-uniqueID, -cvID, -name, -District) %>%
gather(Variable, Value, -countarmedrob)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countarmedrob, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countarmedrob)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Armed robbery count as a function of risk factors")
Unfortunately, public housing seems to be a pretty good predictor of crime. Asbestos notices seem to be a relatively decent predictor as well. For the majority of the predictors, the nearest neighbor approach demonstrates a stronger relationship with the dependent variable. We’ll use the nearest neighbor predictors.
Next, let’s look at a histogram of our dependent variable.
ggplot(final_net, aes(countarmedrob)) +
geom_histogram(binwidth = 1) +
labs(title = "Distribution of Armed Robbery by grid cell")
Clearly, the distribution of our dependent variable is heavily right-skewed, and many of the fishnet cells have zero crimes. These features of our outcome variable make Poisson regression an appropriate method for modeling these data.
Next, we’ll build a Poisson regression model using the predictor variables. In the following block, we fit a regression that includes the spatial process predictors, cross-validated through a spatial leave-one-group-out (LOGO) process. Spatial LOGO-CV uses each neighborhood in the city as the test set once, thereby validating the model k times, where k is the number of neighborhoods.
With the model established, we’ll map the errors to evaluate the model’s performance across the city.
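The code that fits this model and assembles reg.summary (used in the maps and tables that follow) is not reproduced here, so below is a minimal sketch of the spatial LOGO-CV step under stated assumptions: the predictor set is assumed to be the nearest neighbor features plus the two spatial process variables, and the names reg.ss.vars, folds, and hood are illustrative rather than anything defined earlier in this post.
reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn",
                 "Liquor_Retail.nn", "Sanitation.nn", "Street_Lights_Out.nn",
                 "publichousing.nn", "asbestos.nn",
                 "armedrob.isSig", "armedrob.isSig.dist")   # assumed predictor list
# leave-one-group-out cross-validation: each neighborhood ("name") is held out once
folds <- lapply(unique(final_net$name), function(hood) {
  fold.train <- filter(final_net, name != hood)
  fold.test  <- filter(final_net, name == hood)
  # Poisson regression fit on every neighborhood except the held-out one
  reg <- glm(reformulate(reg.ss.vars, response = "countarmedrob"),
             family = "poisson",
             data = st_set_geometry(fold.train, NULL))
  # predict counts for the held-out neighborhood
  mutate(fold.test, Prediction = predict(reg, fold.test, type = "response"))
})
reg.summary <-
  do.call(rbind, folds) %>%
  mutate(Error = Prediction - countarmedrob,
         Regression = "Spatial LOGO-CV: Spatial Structure") %>%
  dplyr::select(Regression, countarmedrob, Prediction, Error, name, geometry)
Holding out whole neighborhoods rather than random grid cells is a stricter test of spatial generalizability, since each prediction is made for an area the model has never seen.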
grid.arrange(
reg.summary %>%
ggplot() +
geom_sf(aes(fill = Prediction)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Predicted armed robberies by Regression") +
mapTheme() + theme(legend.position="bottom"),
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
ggplot() +
geom_sf(aes(fill = countarmedrob)) +
scale_fill_viridis() +
labs(title = "Observed armed robberies\n") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
Our model seems to accurately detect the cluster of armed robberies in the central north part of the city, as well as a less concentrated cluster in the south-central part of the city. These maps suggest that we may be overpredicting in the southern cluster, but we’ll have to conduct some more evaluations to better determine our level of accuracy.
To begin assessing the accuracy of our model, let’s look at a map of errors (i.e., the difference between predicted and observed counts). Note: the errors below are raw errors, not absolute values.
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
ggplot() +
geom_sf(aes(fill = Error)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Armed Robbery prediction errors") +
mapTheme()
This error map does not suggest that there is much clustering of errors in our model. We do see some hotspots of higher error in the north-central area, but without conducting further tests it’s difficult to say whether this is real clustering. Frankly, this error map does not provide decisive conclusions on its own. Let’s look at two summary metrics, the mean absolute error (MAE) and the standard deviation of the absolute error (SD_MAE), to get a better understanding of how the model performs in aggregate.
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
summarize(MAE = round(mean(abs(Prediction - countarmedrob), na.rm = T),2),
SD_MAE = round(sd(abs(Prediction - countarmedrob), na.rm = T),2)) %>%
kable(caption = "Regression accuracy metrics") %>%
kable_styling("striped", full_width = F)
Regression | MAE | SD_MAE |
---|---|---|
Spatial LOGO-CV: Spatial Structure | 1.18 | 1.29 |
An MAE of 1.18 indicates that, on average, our predicted count of armed robberies for a given cell is 1.18 above or below the observed value. The SD_MAE of 1.29 suggests that the absolute errors are fairly tightly distributed around the MAE.
With some accuracy measures established, we’ll next evaluate the generalizability of our model. We’re particularly interested in the ability of our model to generalize across racial boundaries, because of the context we’re working in. Biased policing in the United States has disproportionately harmed nonwhite communities, so we will evaluate how our model performs in majority nonwhite areas as compared to majority white areas. We use 2017 Census (ACS) data to conduct this analysis.
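The tract-level processing that produces final_reg.tracts is likewise not shown above; the sketch below is one plausible version, assuming ACS table B01001 (B01001_001 for total population, B01001A_001 for White alone population), a simple 50% threshold for raceContext, and that a Census API key has already been registered with tidycensus. The object name tracts17 is an assumption as well.
tracts17 <-
  get_acs(geography = "tract",
          variables = c(TotalPop = "B01001_001", NumberWhites = "B01001A_001"),
          year = 2017, state = "IL", county = "Cook",
          geometry = TRUE, output = "wide") %>%   # assumes census_api_key() has been set
  st_transform(st_crs(final_net)) %>%
  # classify each tract by whether it is majority white
  mutate(percentWhite = NumberWhitesE / TotalPopE,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  dplyr::select(raceContext)
# attach each grid cell's tract-level race context via its centroid
final_reg.tracts <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
  st_centroid() %>%
  st_join(tracts17) %>%
  na.omit()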
st_set_geometry(final_reg.tracts, NULL) %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
Regression | Majority_Non_White | Majority_White |
---|---|---|
Spatial LOGO-CV: Spatial Structure | 0.1758018 | -0.1726721 |
Our model overpredicts crime in majority nonwhite areas and underpredicts it in majority white areas, which is highly problematic given the history of segregationist policing practices in the United States. Although it is very difficult to remove this bias from our model given the substantial bias in the data, it is an important issue to address in future models.
The next step is to compare our model to traditional policing practices to see whether it offers an improvement. Below, we’ll create a hotspot (kernel density) map that represents the traditional resource allocation method used by police departments nationwide. We’ll plot the observed armed robberies on that map and do the same for a map based on our risk predictions for comparison.
final_reg <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
mutate(uniqueID = rownames(.))
AR_ppp <- as.ppp(st_coordinates(armed_robbery), W = st_bbox(final_net))
AR_KD <- spatstat::density.ppp(AR_ppp, 1000)
AR_KDE_sf <- as.data.frame(AR_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
bind_cols(
aggregate(
dplyr::select(armed_robbery) %>% mutate(ARCount = 1), ., length) %>%
mutate(ARCount = replace_na(ARCount, 0))) %>%
dplyr::select(label, Risk_Category, ARCount)
AR_risk_sf <-
filter(final_reg, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
bind_cols(
aggregate(
dplyr::select(armed_robbery) %>% mutate(ARCount = 1), ., length) %>%
mutate(ARCount = replace_na(ARCount, 0))) %>%
dplyr::select(label,Risk_Category, ARCount)
rbind(AR_KDE_sf, AR_risk_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(armed_robbery, 1500), size = .1, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
Just from looking at the maps, it appears that our model is performing better; more of the observed points fall within the highest-risk areas under the risk prediction model. To better understand the differences, let’s evaluate how many of the observed crimes fall into grid cells that we deem high risk (i.e., the 90% to 100% category).
rbind(AR_KDE_sf, AR_risk_sf) %>%
st_set_geometry(NULL) %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countarmedrob = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countarmedrob / sum(countarmedrob)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE)
Our model places more of the grid cells with observed crimes into higher risk categories. The larger share of observed crimes in the 90% to 100% risk category under the risk predictions (yellow bars) as compared to kernel density (purple bars) is promising. These results suggest that our model is better at predicting crime than the traditional approach.
Despite the improvements this model offers over traditional hotspot policing, we should be very cautious about implementing it. Crime data are extremely biased. Even with the past decade of pushback against policing practices in the US, we know that certain neighborhoods are systematically overpoliced and others underpoliced. The result is that our observed data are heavily biased. Therefore, all of the methods at our disposal for evaluating the effectiveness of the model, including the ones in this report, are judged against these biased data, which makes me reluctant to put too much faith in the goodness-of-fit metrics generated here. Other organizations have taken alternative approaches that use 911 call data instead of police-reported data, on the theory that publicly reported data contain less bias than the data police collect themselves.
Yet nihilism does not feel particularly useful in this situation. If the alternative resource allocation method is the hotspotting approach discussed in this report, then I think the answer is clear that this model should replace it. At the very least, this model incorporates a handful of predictors that make the crime location question more nuanced than simply: where has crime happened before? Ultimately, any policy recommendation must be situated within an organizational context. Is this our one chance to implement a new model for resource allocation? Is the bureaucratic inertia of this particular city government and police department so strong that we won’t get another chance to iterate? If so, I would say we need to improve this model further and explore other dependent variable options, such as the 911 data previously mentioned. On the other hand, if the client appears more agile and is currently using the hotspotting method, this model could be a viable first step toward an optimized approach to resource allocation.