Neat and Tidy: Mapping COVID-19 Data with Tidyverse

Animating choropleths with ‘urbnmapr’ and ‘gganimate’

Alexander Carlton

2020-04-13, Updated 2020-11-03

This page is part of the Neat and Tidy project, with multiple articles on different ways to use the Tidyverse to manage real-world data.

COVID-19 has triggered many questions. One of the most basic questions with any viral outbreak is “Where?”

The good folks at the Johns Hopkins University CSSE provide one of the most commonly cited views of the current COVID-19 situation: https://coronavirus.jhu.edu/map.html

But since they have also released the underlying data via GitHub, https://github.com/CSSEGISandData/COVID-19, anyone can create customized views of this information and produce maps to suit their specific interests.

Set Up

Environment

A good tool improves the way you work. A great tool improves the way you think. – Jeff Duntemann

library(tidyverse)
library(gganimate)
library(ggthemes)
library(urbnmapr)
library(scales)
library(gifski)

Configuration

A few parameters for controlling or adapting this script to suit local needs.

# Define a time-period to animate
start_date <- as.Date("2020-01-22")
end_date <- as.Date("2020-11-03")

# This file will be created at the end
output_filename <- "animation.gif"

# The following are based on the local copy of the data
jhu_directory <- "COVID-19/csse_covid_19_data/"

# Minimum population to include in map (rates for tiny populations are too noisy to read)
min_pop <- 10000

Inputs

Amongst the many files provided in this repository, our immediate focus is the time_series files, specifically the one tracking Confirmed Cases in the US.

The COVID-19 repository is a great resource for sane, fresh data on a very difficult problem. The university's Center for Systems Science and Engineering (JHU CSSE) has become famous for its dashboard, but making the raw data available on GitHub has enabled so many others to learn. We are very grateful.

jhu_uscases_filename <- "csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
# Fetch the US cases data
jhu_uscases_cols <- cols(
    .default = col_double(),
    UID = col_character(),
    iso2 = col_character(),
    iso3 = col_character(),
    FIPS = col_character(),
    Admin2 = col_character(),
    Province_State = col_character(),
    Country_Region = col_character(),
    Combined_Key = col_character()
)

jhu_uscases_data <- read_csv(
    paste0(jhu_directory, jhu_uscases_filename),
    col_types = jhu_uscases_cols
) %>%
    select(
        -UID, -iso2, -iso3, -code3, -Lat, -Long_
    ) %>%
    rename(
        Admin1 = "Province_State",
        Admin0 = "Country_Region"
    )

The data in these time_series files has new columns added each day. For our use here we need to transform these spreadsheets into tidy data.

A little further on, we will also process the “FIPS” values (a set of standardized identifiers for each US county) into a form that joins easily with our source of geographic mapping info.
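
As a tiny illustration (a sketch; the real conversion happens further down, just before the map join), the FIPS code gets zero-padded into the 5-character string form that urbnmapr uses as its county key:

# Hypothetical example of the FIPS padding used later in this script
sprintf("%05d", as.integer("6037"))   # "06037" -- Los Angeles County, CA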

# Turn the JHU data into a 'tidy' form (with a proper Date field)
tidy_cases_data <- jhu_uscases_data %>% 
    pivot_longer(
        cols = contains("/"),
        names_to = "date_str",
        values_to = "Cases"
    ) %>%
    mutate(
        # Convert date field into a native form
        Date = as.Date(date_str, "%m/%d/%y"),
        date_str = NULL
    )

Then once we have the data in tidy form, it is easy to manipulate it and learn from the results.
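
For instance, a quick sanity check becomes a one-pipeline affair (a sketch; the date picked here is an arbitrary example): total the confirmed cases by state on a single day and list the largest.

# Quick sanity check on the tidy data (sketch; the date is an arbitrary example)
tidy_cases_data %>%
    filter(Date == as.Date("2020-11-01")) %>%
    group_by(Admin1) %>%
    summarise(Cases = sum(Cases), .groups = "drop") %>%
    arrange(desc(Cases)) %>%
    head(5)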

Now, in order to normalize the case data by county population, we need to join in population information from a lookup table that is also provided in the Johns Hopkins dataset.

# Get lookup data for population counts
jhu_lookup_filename <- "UID_ISO_FIPS_LookUp_Table.csv"
jhu_lookup_cols <- cols(
    UID = col_double(),
    iso2 = col_character(),
    iso3 = col_character(),
    code3 = col_double(),
    FIPS = col_double(),  # override, was: col_character(),
    Admin2 = col_character(),
    Province_State = col_character(),
    Country_Region = col_character(),
    Lat = col_double(),
    Long_ = col_double(),
    Combined_Key = col_character(),
    Population = col_double()
)

jhu_lookup_data <- read_csv(
    paste0(jhu_directory, jhu_lookup_filename),
    col_types = jhu_lookup_cols
) %>%
    select(
        -UID, -iso2, -iso3, -code3,
        -Lat, -Long_, -Combined_Key,
        Admin0 = "Country_Region",
        Admin1 = "Province_State"
    ) %>%
    filter(
        Admin0 == "US",
        !is.na(Admin2)  # the county-level rows carry the county populations
    ) %>%
    select(
        Admin1, Admin2, Population
    )

jhu_workset <-
    # Join with Lookup data to get county Population
    left_join(
        tidy_cases_data, jhu_lookup_data,
        by = c("Admin1", "Admin2")
    ) %>%
    select(
        FIPS, Admin1, Admin2, Population, Date, Cases
    )

Operate

Since this exercise is focused on mapmaking, we can remove the entries that don’t fit into our maps.

Yes, there’s a degree of inaccuracy created by simply dropping this data, but at the scales used here the “unassigned” and out-of-state cases are not significant enough to change the picture. Note: this assumption is not safe, and needs to be tested before one does any serious analysis.
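
One rough way to test it (a sketch using the columns already in jhu_workset; it only covers the “unassigned” and out-of-state rows, not the population cutoff) is to measure what share of cases sits in the rows about to be dropped:

# Rough test of that assumption (sketch): what share of cases on the most
# recent date sits in rows the filter below is about to drop?
jhu_workset %>%
    filter(Date == max(Date)) %>%
    mutate(
        dropped = is.na(Admin2) |
            Admin2 == "Unassigned" |
            str_detect(Admin2, "Out of ")
    ) %>%
    group_by(dropped) %>%
    summarise(Cases = sum(Cases, na.rm = TRUE), .groups = "drop") %>%
    mutate(share = Cases / sum(Cases))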

dataset <- jhu_workset %>%
    filter(
        Admin2 != "Unassigned",
        !str_detect(Admin2, "Out of "),
        Population > min_pop # Drops the cruise ships and tiny principalities
    ) %>%
    mutate(
        Value = if_else(
            Cases != 0,
            Cases / (Population / 1000000),
            NA_real_  # For display purposes, convert '0' to 'na' (white out)
        )
    )

Having made the (gross) simplification of dropping out the problematic elements, we can now focus on making maps.

The first step is to generate a minimal dataset, stripping out anything not used in these maps.

The size of the tibbles multiplies quickly once we join with all the map geometry, so the more we eliminate now, the less the plot has to grind through when rendering each image. That matters when we need to produce an animation with hundreds of frames, each built from thousands of counties.

# prepare the data for joining with the map definitions
dataset <- dataset %>%
    filter(
        Date >= start_date,
        Date <= end_date
    ) %>%
    mutate(
        region = tolower(Admin1),
        subregion = tolower(Admin2),
        # Convert FIPS to a form that will join with the map data
        county_fips = sprintf("%05d", as.integer(FIPS))
    ) %>%
    select(
        Date, Value, region, subregion, county_fips
    )

The last operation is to join in the map points.

Again, we are working along the potentially sharp boundary between accuracy and usefulness.

For this exercise, the county-level information provided by urbnmapr (the very handy US map assistant from the Urban Institute team) gives just what we need: border points precalculated at both the county and the state level, with a transform like the old-school classroom maps of the USA that pulls Alaska and Hawaii across the Pacific to sit where the real world has Baja California. Not necessarily accurate, but effective and very recognizable for a US audience.

state_info <- get_urbn_map("states", sf = TRUE)
county_info <- get_urbn_map("counties", sf = TRUE)
plotdata <- dataset %>%
    right_join(county_info, by = "county_fips")
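
To revisit the earlier point about the tibbles multiplying in size, a quick before-and-after comparison (a sketch; exact figures depend on the date range and map resolution) shows what the plotting code will have to carry:

# Compare row counts and memory before and after the map join (sketch)
nrow(dataset)
nrow(plotdata)
format(object.size(dataset), units = "MB")
format(object.size(plotdata), units = "MB")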

Display

Lots of functions with lots of arguments, but mostly this is just for style.

The one tricky item here is the coloring of the counties. The chart uses a continuous three-color palette, but with a “log10” transformation (equal distances along the color scale represent order-of-magnitude changes in cases per million population) so that the few locations with very high rates don’t force everywhere else into a dull monotone.

As was touched on when dropping the “unassigned” values above, this is another area where one needs to be careful. The log10 function is an exaggeration, and while exaggeration is a powerful tool to pull attention to the selected area of interest, there comes a point where it is difficult (if not impossible) to differentiate between an overdone exaggeration and a simple lie.

In this chart, very low counts per million will be in shades of yellow; the colors then shift through shades of orange as case rates rise into the hundreds of cases per million in the local population, and finally into deeper and deeper shades of red as rates reach tens of thousands of cases per million inhabitants.
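
A quick look at the spread of values helps justify that choice (a sketch over the prepared dataset): if the quantiles span several orders of magnitude, a linear scale would wash most counties out into a single color.

# How skewed are the rates on the final date? (sketch)
dataset %>%
    filter(Date == max(Date)) %>%
    pull(Value) %>%
    quantile(probs = c(0.05, 0.25, 0.5, 0.75, 0.95), na.rm = TRUE)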

# Use the last date with values as a proxy for a timestamp for the dataset
last_date <- max(plotdata$Date)

# Build the plot
gg <- ggplot() +
    # County-level data
    geom_sf(
        data = plotdata,
        mapping = aes(fill = Value, geometry = geometry),
        color = NA
    ) +
    # Colored brightly
    scale_fill_distiller(
        palette = "YlOrRd",
        direction = 1,
        na.value = "#ffffdd",
        trans = "log10",
        n.breaks = 5,
        labels = label_comma(),
        name = "Cases per Million",
        guide = "colorbar"
    ) +
    # State outlines
    geom_sf(
        data = state_info,
        mapping = aes(geometry = geometry),
        fill = NA,
        color = "black",
        size = 0.4
    ) +
    # Hack the legend to be wide across the bottom of the chart
    guides(
        fill = guide_colorbar(
            barwidth = unit(0.666, "npc"),
            barheight = unit(0.01, "npc")
        )
    ) +
    # Labels (with "frame_time" to be supplied by gganimate below)
    labs(
        title = "Rate of Confirmed COVID-19 Cases By County",
        subtitle = "Cases per Million Population as of {frame_time}",
        caption = paste("Data from Johns Hopkins University CSSE,", last_date),
        x = NULL,
        y = NULL
    ) +
    # Clean theme, but do push the legend/guide to be displayed below chart
    theme_map() +
    theme(
        legend.title = element_text(vjust = 1),
        legend.position = "bottom"
    ) 
print(gg)

And we have our base chart.

Actually, this is a display of how things will look at the end of the sequence of frames. The map will look very different as we move forward in time.
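
If a static view of some earlier moment is wanted, one rough approach (a sketch; the date is arbitrary and only the essential layers are redrawn, though the full styling above could be factored into a shared function) is to filter plotdata down to a single Date:

# Static peek at an arbitrary earlier date (sketch; minimal styling only)
snapshot_date <- as.Date("2020-07-01")
ggplot() +
    geom_sf(
        data = filter(plotdata, Date == snapshot_date),
        mapping = aes(fill = Value, geometry = geometry),
        color = NA
    ) +
    scale_fill_distiller(
        palette = "YlOrRd", direction = 1,
        na.value = "#ffffdd", trans = "log10"
    ) +
    geom_sf(
        data = state_info,
        mapping = aes(geometry = geometry),
        fill = NA, color = "black", size = 0.4
    ) +
    labs(title = paste("Cases per Million Population as of", snapshot_date)) +
    theme_map()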

Animate

Now that we have the base chart, the gganimate library makes animation very easy. Effectively, all we need to do is declare which variable steps through the dataset, then specify a few basic parameters in the call to animate(). Here the parameters are basically the defaults, with one exception: we end each animation loop by holding the (unchanging) final image for three seconds before repeating.

The one trick here is that we’ve chosen to use “{frame_time}” in the subtitle of the chart, so that during animation it is replaced by the current date generated by the transition_time() function.

# Note: this can take a long time to execute
# -- many days of data, lots and lots of geometry points in every image

# At one frame for each day, count up how many frames we need
nframes <- n_distinct(plotdata$Date)

# Animate by Date
anim <- gg +
    transition_time(Date, range = c(start_date, end_date))
loop <- animate(
    anim,
    fps = 10,           # 10 frames per second
    nframes = nframes,  # animate with 1 frame each day
    end_pause = 30      # hold still for last 30 frames (3 secs)
)
anim_save(output_filename, animation = loop)

All of which produces this animation.

“How did you go bankrupt?” Bill asked. “Two ways,” Mike said. “Gradually and then suddenly.” – Ernest Hemingway, The Sun Also Rises

loop