This week’s #tidytuesday data is the incarceration trends data set from the Vera Institute of Justice, which contains county-level jail data (1970-2015) and prison data (1983-2015). Vera is a non-profit research and policy organisation focusing on “the causes and consequences of mass incarceration, racial disparities, and the loss of public trust in law enforcement” (see the Wikipedia article for more).
I’ve been thinking about exploring maps in R but never gotten around to it, so this seemed like a great opportunity. I ended up checking out the following alternatives:
- `ggmap`: a package that leverages mapping services like Google Maps to obtain map tiles.
- `tmap`: a library for drawing thematic maps, with an API similar to `ggplot2`.
- `tigris`: allows users to directly download and use TIGER/Line shapefiles from the US Census Bureau.
- `urbnmapr`: a package providing state and county shapefiles that are compatible with `ggplot2`.
Ultimately, I found myself spending way too much time reading documentation instead of plotting interesting stuff, and was almost about to abandon the entire project for this week. Thankfully, David Robinson's #tidytuesday screencast (once again) came to the rescue.
It turns out that `ggplot2` comes with a handy function that takes data from the `maps` package (which I was unaware of) and makes it suitable for plotting. Let's load the relevant packages and show an example.
Plotting maps with ggplot2
```r
# Loading useful packages
library(tidyverse)
library(gganimate)
library(maps)
library(knitr)
library(kableExtra)

# Loading data from GitHub
raw_data <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-22/prison_population.csv")
```
Let’s plot a map of the contiguous United States:
```r
map_data("usa") %>%
  tbl_df() %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  coord_map()
```
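To make the grouping concrete, we can peek at the first few rows of the data frame that `map_data()` returns (a quick sketch, assuming `ggplot2` and `dplyr` are loaded):

```r
library(ggplot2)  # provides map_data()
library(dplyr)

# Each row is one vertex of a polygon; the group column tells
# geom_polygon() which vertices belong to the same outline
map_data("usa") %>%
  as_tibble() %>%
  head()
# Columns: long, lat, group, order, region, subregion
```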
Essentially, `map_data()` returns a data frame consisting of a series of points (longitudes and latitudes) which are grouped together. The grouping tells the `geom_polygon` layer which dots to connect in order to get an outline of the US. To include the states, we simply pass `"state"` as an argument…
```r
map_data("state") %>%
  tbl_df() %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  coord_map()
```
… and likewise with counties:
```r
map_data("county") %>%
  tbl_df() %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  coord_map()
```
There are of course other maps available (see the documentation for more info). Also, if you're interested in some more details about how `map_data` works, please check out this excellent guide.
Visualising missing data
While I was (unsuccessfully) experimenting with `tmap`, it became clear that plotting maps wasn't my only challenge this week. The data has a lot of missing values:
```r
# Checking which columns have missing values
raw_data %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  select_if(function(x) x > 0) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))
```
population | prison_population |
---|---|
273093 | 751787 |
In fact, over half the rows have missing values for prison populations. This is clearly a problem, and may require a significant amount of time to deal with (David briefly describes techniques for dealing with missing data that I found quite useful).
Instead of dealing with the intricacies of the missing values problem, I decided to turn it on its head and visualise the missing values themselves as a function of time. First I calculated the percentage of missing values for each state in every year:
```r
state_missing_data <- raw_data %>%
  filter(pop_category == "Total") %>%
  group_by(year, state) %>%
  summarise(missing_data = round(100 * mean(is.na(prison_population)), 2)) %>%
  ungroup()
```
We can take a closer look at e.g. Texas to get an idea of what the results of the previous chunk are:
```r
state_missing_data %>%
  filter(state == "TX") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))
```
year | state | missing_data |
---|---|---|
1970 | TX | 100.00 |
1971 | TX | 100.00 |
1972 | TX | 100.00 |
1973 | TX | 100.00 |
1974 | TX | 100.00 |
1975 | TX | 100.00 |
1976 | TX | 100.00 |
1977 | TX | 100.00 |
1978 | TX | 100.00 |
1979 | TX | 100.00 |
1980 | TX | 100.00 |
1981 | TX | 100.00 |
1982 | TX | 100.00 |
1983 | TX | 8.27 |
1984 | TX | 6.69 |
1985 | TX | 5.51 |
1986 | TX | 4.33 |
1987 | TX | 3.54 |
1988 | TX | 5.91 |
1989 | TX | 4.72 |
1990 | TX | 5.91 |
1991 | TX | 5.91 |
1992 | TX | 5.12 |
1993 | TX | 4.72 |
1994 | TX | 4.72 |
1995 | TX | 3.15 |
1996 | TX | 85.04 |
1997 | TX | 85.04 |
1998 | TX | 64.57 |
1999 | TX | 85.04 |
2000 | TX | 85.04 |
2001 | TX | 64.57 |
2002 | TX | 85.04 |
2003 | TX | 85.04 |
2004 | TX | 2.36 |
2005 | TX | 2.36 |
2006 | TX | 2.76 |
2007 | TX | 2.36 |
2008 | TX | 1.57 |
2009 | TX | 2.76 |
2010 | TX | 2.36 |
2011 | TX | 1.57 |
2012 | TX | 1.57 |
2013 | TX | 1.18 |
2014 | TX | 0.79 |
2015 | TX | 1.18 |
2016 | TX | 100.00 |
There are no values for the prison population in Texas until 1983. This is followed by a relatively stable period with much more data, lasting until 1996. From then on there is a dramatic increase in the amount of missing data, which persists until 2003. Things then look quite good for a while, until we reach the last year of the data set, for which there are once again no recorded values for the prison population.
Next, I used `gganimate` to visualise the percentage of missing values for each state by year:
```r
state_missing_data %>%
  mutate(region = str_to_lower(state.name[match(state, state.abb)])) %>% # Cool trick!
  right_join(map_data("state"), by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = missing_data)) +
  geom_polygon() +
  ggthemes::theme_map() +
  coord_map() +
  scale_fill_viridis_c(guide = guide_legend(title = "Percent")) +
  transition_manual(year) +
  labs(title = "Percentage of counties with missing data (per state)",
       subtitle = "Year: {current_frame}") +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
```
Note the cool little trick I picked up (once again) from David's screencast: in order to join the prison data with the map data, he makes clever use of the `match` function and the base R `state` datasets (`state.abb` and `state.name`). Try it out for yourself to see how it works!
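As a quick sketch of how the trick works, using only base R:

```r
# state.abb and state.name are parallel vectors of the 50 US states,
# ordered alphabetically by state name
state.abb[1:3]                        # "AL" "AK" "AZ"

# match() finds the position of an abbreviation in state.abb,
# which we then use to index into state.name
state.name[match("TX", state.abb)]    # "Texas"

# Lower-casing the result matches map_data()'s region column
tolower(state.name[match("TX", state.abb)])  # "texas"
```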
That’s it for this week’s #tidytuesday. Thanks for reading!