This week’s post will mark the start of my experimentation with purrr and regular expressions (regex). I’ll also take a look at a package called skimr, which looks like a pretty efficient way to get a quick overview of our data.

In case you don’t know, purrr essentially abstracts away the need for many kinds of for loops used to iterate over lists or atomic vectors. As Hadley writes in R for Data Science, the apply family of functions solve a similar problem, but purrr provides a more consistent and easier-to-learn approach.

So, the goal for the next few weeks is to try to use a couple of different functions from the purrr package every time I write a post. There’s a nice list of purrr resources here. Whenever possible, I’ll also try to make use of regexes in the data wrangling process.

Housing Price Index data

This week we’ll take a look at the Housing Price Index data set provided by Freddie Mac, which

is a broad measure of the movement of single-family house prices. The HPI is a weighted, repeat-sales index, meaning that it measures average price changes in repeat sales or refinancings on the same properties.

We’ll also take a look at the accompanying mortgage rates and recession data. We’ll use the map function from purrr in conjunction with skim from skimr to concisely write code that lets us get an overview of many data sets simultaneously.

# Loading useful packages
library(tidyverse)
library(lubridate)
library(skimr)

PATH <- "~/Documents/GitHub/tidytuesday/data/2019/2019-02-05"
files <- list.files(path = PATH, pattern = "*.csv")
data <- map(file.path(PATH, files), read_csv)

# Data
mortgages <- data[[1]]
recessions <- data[[2]]
hpi_index <- data[[3]]

map applies a function to every element of a list or a vector and returns a list. In the preceding chunk I’ve mapped the function read_csv to every element of the vector files, which (after appended to PATH with the file.path function) contains the paths to this week’s data. The result is a list where each element is a tibble.

Skimming the data

There are many different kinds of map functions in purrr. One that seems particularly useful is map_if, which applies a function to every element of a list or a vector if some condition is met. So, for example, skim won’t work if a data set is empty. To make sure that skim is only mapped to appropriate data sets, we can use map_if to make sure that it is applied only to tibbles that contain one or more rows:

# Previewing data using purrr and skimr
map_if(data, ~ nrow(.x) > 0, skim)

The Markdown output in blogdown is kind of messy, so the following is what you would see in your console if you ran the chunk above:

I think this is really cool. As you can see, skimr groups our data by variable type and summarises them accordingly. One line of code gives us an overview of three different data sets (corresponding to the elements of files), allowing us to diagnose problems pretty quickly. For instance, it’s clear that the character variables in our first data set (mortgage.csv) probably should be numerics. mutate_at lets us apply as.numeric to several columns:

mortgages_fixed <- mortgages %>% 
  mutate_at(vars(-date, -fixed_rate_30_yr, -fees_and_pts_30_yr), as.numeric)

Let’s skim the data again:

mortgages_fixed %>% 
  skim()

Much better!

Regex and recessions

I’d like to try to make some interesting visualisations of the price index and recession data. Clearly, recessions needs some work:

recessions %>% 
  select(name, period_range) %>% 
  knitr::kable() %>% 
    kableExtra::kable_styling(bootstrap_options = c("condensed", "striped"))
name period_range
Great Depression Aug 1929-Mar 1933ct 1929-Dec 1941
Recession of 1937–1938 1937May 1937-June 1938
Recession of 1945 1945Feb 1945-Oct 1945
Recession of 1949 1948Nov 1948-Oct 1949
Recession of 1953 1953July 1953-May 1954
Recession of 1958 1957Aug 1957-April 1958
Recession of 1960–61 1960Apr 1960-Feb 1961
Recession of 1969–70 1969Dec 1969-Nov 1970
1973–75 recession 1973Nov 1973-Mar 1975
1980 recession 1980Jan 1980-July 1980
1981–1982 recession 1981July 1981-Nov 1982
Early 1990s recession in the United States 1990July 1990-Mar 1991
Early 2000s recession 2001Mar 2001-Nov 2001
Great Recession 2007Dec 2007-June 2009[69][70]

I’d like to retain the name and period columns of recessions, because I’d like to use geom_rect to visualise the periods when plotting our time series. We see that each row in period has a relevant interval of the format [Month abbreviation Year]-[Month abbreviation Year]. We can use regexes to extract these.

I didn’t really know anything about regexes prior to writing this post, so a small disclaimer is appropriate before we move on: This isn’t meant to be a regex tutorial, for that I would advise you to start with the resources listed below and google your way from there.

I particularly recommend the regex crosswords. They really force you to think about what’s happening. I used this online regex tester to check my reasoning, but was recently made aware of a similar functionality in RStudio:

This seems pretty useful, I’ll have to check it out later.

Matching strings

After some trial and error, I ended up with the following regex to extract the relevant parts of the strings in the period columns:

(?:[A-Z][a-z]+\\s[0-9]{4})-(?:[A-Z][a-z]+\\s[0-9]{4})

Please note, there are probably simpler and more clever approaches than mine. My general philosophy is to make things work first, and try to be clever later.

So, let’s break this down. The parenthesis ( starts to group an expression and captures it so that it can be referenced later. We won’t bother with that here. In fact, that is why I start with ?:. This explicitly tells the regex machine to not capture the grouped match. Next, the regex machine searches for a single upper case letter ([A-Z]) followed by at least one lower case letter ([a-z]+). Next the regex machine searches for a single whitespace (\\s) followed by exactly four numbers ([0-9]{4}). The grouped expression is closed with ), and the search for a single dash - starts. Then, we repeat the grouped pattern. This should give use what we want:

regex <- "(?:[A-Z][a-z]+\\s[0-9]{4})-(?:[A-Z][a-z]+\\s[0-9]{4})"

recessions_clean <- recessions %>% 
  mutate_at("period_range", funs(str_match(., regex))) %>% 
  select(name, period_range)

recessions_clean %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling(bootstrap_options = c("condensed", "striped"))
name period_range
Great Depression Aug 1929-Mar 1933
Recession of 1937–1938 May 1937-June 1938
Recession of 1945 Feb 1945-Oct 1945
Recession of 1949 Nov 1948-Oct 1949
Recession of 1953 July 1953-May 1954
Recession of 1958 Aug 1957-April 1958
Recession of 1960–61 Apr 1960-Feb 1961
Recession of 1969–70 Dec 1969-Nov 1970
1973–75 recession Nov 1973-Mar 1975
1980 recession Jan 1980-July 1980
1981–1982 recession July 1981-Nov 1982
Early 1990s recession in the United States July 1990-Mar 1991
Early 2000s recession Mar 2001-Nov 2001
Great Recession Dec 2007-June 2009

Cool! This looks a lot better. Now we can do some plotting.

Plotting recessions and loan rates

I’m going to create two date columns by separating period into a start date and an end date. Since I don’t know the exact date a recession starts or ends, I’ll simply set the dates to be the middle of the month. Furthermore, I’m only interested in the recessions after 1971, so I filter out those before. I’ll use ymd from lubridate to transform the strings to dates:

recessions_clean <- recessions_clean %>% 
  separate(period_range, into = c("start", "end"), sep = "-") %>% 
  mutate_at(vars(start, end), funs(dmy(paste("15", .)))) %>% 
  mutate(name = str_replace(name, "Early 1990s recession in the United States", "Early 1990s recession (US)")) %>% 
  filter(year(start) >= 1971) %>% 
  select(name, start, end)

recessions_clean %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling(bootstrap_options = c("condensed", "striped"))
name start end
1973–75 recession 1973-11-15 1975-03-15
1980 recession 1980-01-15 1980-07-15
1981–1982 recession 1981-07-15 1982-11-15
Early 1990s recession (US) 1990-07-15 1991-03-15
Early 2000s recession 2001-03-15 2001-11-15
Great Recession 2007-12-15 2009-06-15

Looking good! First, let’s plot the HPI. I really liked the simplicity and clarity of exposition of the BBC’s bbplot package, so I’ll use it this week as well.

hpi_index %>% 
  unite(date, year, month, sep = "-") %>% 
  mutate(date = ymd(paste(date, "1", sep="-"))) %>% 
  ggplot() +
  geom_rect(data = recessions_clean,
          aes(xmin = start, xmax = end,
              ymin = 0, ymax = Inf,
              group = name, fill = name), alpha = 0.2) +
  geom_line(aes(x = date, y = price_index, group = state), lwd = 2, alpha = 0.6, color = "#1380A1") +
  geom_line(aes(x = date, y = us_avg), color = "#FAAB18", lwd = 2) +
  annotate(geom = "text", x = ymd("2014-01-01"), y = 45, 
           label = "US average", size = 6) +
  geom_curve(aes(x = ymd("2010-12-01"), y = 57,
                 xend = ymd("2010-12-01"), yend = 115),
             size=0.5, curvature = -0.2,
             arrow = arrow(length = unit(0.03, "npc"))) + 
  labs(title = "Bubbles and recessions", subtitle = "Housing Price Index (HPI), 1975-2018") +
  bbplot::bbc_style()

The blue lines are the HPIs for the different US states and the orange line is the national average. The housing bubble of 2007 is clearly visible. The collapse of the housing market was followed by a major credit crisis which ultimately led to the Great Recession. A more dramatic way of visualising this is to approximate the derivative of the HPI by calculating the monthly differences:

hpi_index %>% 
  group_by(state, year) %>% 
  mutate(index_change = price_index - lag(price_index), # Approx. the derivative
         us_avg_change = us_avg - lag(us_avg)) %>% # Approx. the derivative
  ungroup() %>% 
  na.omit() %>% 
  unite(date, year, month, sep = "-") %>% 
  mutate(date = ymd(paste(date, "1", sep="-"))) %>% 
  ggplot() +
  geom_rect(data = recessions_clean,
          aes(xmin = start, xmax = end,
              ymin = -Inf, ymax = Inf,
              group = name, fill = name), alpha = 0.2) +
  geom_line(aes(x = date, y = index_change, group = state), lwd = 1, alpha = 0.4, color = "#1380A1") +
  geom_hline(yintercept = 0, lwd = 0.5) +
  geom_line(aes(x = date, y = us_avg_change), color = "#FAAB18", lwd = 2) +
  labs(title = "Rate of change of HPI", subtitle = "Monthly finite differences, 1975-2018") +
  bbplot::bbc_style() 

That’s it for this week. Thanks for reading, I hope you enjoyed it!