This week’s #tidytuesday data is the incarceration trends data set from the Vera Institute of Justice, which contains county-level jail data (1970-2015) and prison data (1983-2015). Vera is a non-profit research and policy organisation focusing on “the causes and consequences of mass incarceration, racial disparities, and the loss of public trust in law enforcement” (see the Wikipedia article for more).
I’ve been thinking about exploring maps in R but never gotten around to it, so this seemed like a great opportunity. I ended up checking out the following alternatives:
- `ggmap`: a package that leverages mapping services like Google Maps to obtain map tiles.
- `tmap`: a library for drawing thematic maps, with an API similar to `ggplot2`.
- `tigris`: allows users to directly download and use TIGER/Line shapefiles from the US Census Bureau.
- `urbnmapr`: a package providing state and county shapefiles that are compatible with `ggplot2`.
Ultimately, I found myself spending way too much time reading documentation instead of plotting interesting stuff, and was almost about to abandon the entire project for this week. Thankfully, David Robinson's #tidytuesday screencast (once again) came to the rescue.
It turns out that `ggplot2` comes with a handy function that takes data from the `maps` package (which I was unaware of) and makes it suitable for plotting. Let's load the relevant packages and show an example.
Plotting maps with ggplot2
```r
# Loading useful packages
library(tidyverse)
library(gganimate)
library(maps)
library(knitr)
library(kableExtra)

# Loading data from GitHub
raw_data <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-22/prison_population.csv")
```
Let’s plot a map of the contiguous United States:
```r
map_data("usa") %>%
  tbl_df() %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  coord_map()
```
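To make the grouping concrete, we can peek at the first few rows of the data frame that `map_data()` returns (a quick sketch, assuming `ggplot2` and `dplyr` are loaded):

```r
library(ggplot2)  # provides map_data()
library(dplyr)

# Each row is one vertex of a polygon; the group column tells
# geom_polygon() which vertices belong to the same outline
map_data("usa") %>%
  as_tibble() %>%
  head()
# Columns: long, lat, group, order, region, subregion
```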
Essentially, `map_data()` returns a data frame consisting of a series of points (longitudes and latitudes) which are grouped together. The grouping tells the `geom_polygon` layer which dots to connect in order to get an outline of the US. To include the states, we simply pass `"state"` as an argument…
```r
map_data("state") %>%
  tbl_df() %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  coord_map()
```
… and likewise with counties:
```r
map_data("county") %>%
  tbl_df() %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  coord_map()
```
There are of course other maps available (see the documentation for more info). Also, if you're interested in some more details about how `map_data` works, please check out this excellent guide.
Visualising missing data
While I was (unsuccessfully) experimenting with `tmap`, it became clear that plotting maps wasn't my only challenge this week. The data has a lot of missing values:
```r
# Checking which columns have missing values
raw_data %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  select_if(function(x) x > 0) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))
```
population | prison_population |
---|---|
273093 | 751787 |
In fact, over half the rows have missing values for prison populations. This is clearly a problem, and may require a significant amount of time to deal with (David briefly describes techniques for dealing with missing data that I found quite useful).
Instead of dealing with the intricacies of the missing values problem, I decided to turn it on its head and visualise the missing values themselves as a function of time. First I calculated the percentage of missing values for each state in every year:
```r
state_missing_data <- raw_data %>%
  filter(pop_category == "Total") %>%
  group_by(year, state) %>%
  summarise(missing_data = round(100 * mean(is.na(prison_population)), 2)) %>%
  ungroup()
```
We can take a closer look at e.g. Texas to get an idea of what the results of the previous chunk are:
```r
state_missing_data %>%
  filter(state == "TX") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))
```
year | state | missing_data |
---|---|---|
1970 | TX | 100.00 |
1971 | TX | 100.00 |
1972 | TX | 100.00 |
1973 | TX | 100.00 |
1974 | TX | 100.00 |
1975 | TX | 100.00 |
1976 | TX | 100.00 |
1977 | TX | 100.00 |
1978 | TX | 100.00 |
1979 | TX | 100.00 |
1980 | TX | 100.00 |
1981 | TX | 100.00 |
1982 | TX | 100.00 |
1983 | TX | 8.27 |
1984 | TX | 6.69 |
1985 | TX | 5.51 |
1986 | TX | 4.33 |
1987 | TX | 3.54 |
1988 | TX | 5.91 |
1989 | TX | 4.72 |
1990 | TX | 5.91 |
1991 | TX | 5.91 |
1992 | TX | 5.12 |
1993 | TX | 4.72 |
1994 | TX | 4.72 |
1995 | TX | 3.15 |
1996 | TX | 85.04 |
1997 | TX | 85.04 |
1998 | TX | 64.57 |
1999 | TX | 85.04 |
2000 | TX | 85.04 |
2001 | TX | 64.57 |
2002 | TX | 85.04 |
2003 | TX | 85.04 |
2004 | TX | 2.36 |
2005 | TX | 2.36 |
2006 | TX | 2.76 |
2007 | TX | 2.36 |
2008 | TX | 1.57 |
2009 | TX | 2.76 |
2010 | TX | 2.36 |
2011 | TX | 1.57 |
2012 | TX | 1.57 |
2013 | TX | 1.18 |
2014 | TX | 0.79 |
2015 | TX | 1.18 |
2016 | TX | 100.00 |
There are no values for the prison population in Texas until 1983. This is followed by a relatively stable period with much more data, lasting until 1996. From then on there is a dramatic increase in the amount of missing data, which persists until 2003. Things then look quite good for a while, until we reach the last year of the data set, for which there are once again no recorded values for the prison population.
Next, I used `gganimate` to visualise the percentage of missing values for each state by year:
```r
state_missing_data %>%
  mutate(region = str_to_lower(state.name[match(state, state.abb)])) %>% # Cool trick!
  right_join(map_data("state"), by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = missing_data)) +
  geom_polygon() +
  ggthemes::theme_map() +
  coord_map() +
  scale_fill_viridis_c(guide = guide_legend(title = "Percent")) +
  transition_manual(year) +
  labs(title = "Percentage of counties with missing data (per state)",
       subtitle = "Year: {current_frame}") +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
```
Note the cool little trick I picked up (once again) from David's screencast: in order to join the prison data with the map data, he makes clever use of the `match` function and the base R `state` datasets (`state.abb` and `state.name`). Try it out for yourself to see how it works!
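As a quick sketch of how the trick works, using only base R:

```r
# state.abb and state.name are parallel vectors of the 50 US states,
# ordered alphabetically by state name
state.abb[1:3]                        # "AL" "AK" "AZ"

# match() finds the position of an abbreviation in state.abb,
# which we then use to index into state.name
state.name[match("TX", state.abb)]    # "Texas"

# Lower-casing the result matches map_data()'s region column
tolower(state.name[match("TX", state.abb)])  # "texas"
```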
That’s it for this week’s #tidytuesday. Thanks for reading!