Neat and Tidy: Using Tidyverse to Manage COVID-19 Data

Generating Small-Multiple Charts of Cases in the US Using the JHU CSSE COVID-19 Dataset

Alexander Carlton

2020-04-09, Updated 2020-11-01

Introduction

COVID-19 has wreaked havoc all over the world, but it has also demonstrated the value of data – especially well-managed, timely data.

The good folks at the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) provided one of the most obvious early examples of an immediately useful dataset – and, wonderfully, they have shared it all openly so that everyone can work with the numbers and find their own best ways to make sense of all this.

Neat and Tidy is a place where I have collected a variety of lessons I have learned as I have explored the power of the Tidyverse.

This write-up comes from one of my own learning exercises using the powerful tools of the tidyverse library for the R environment. The author is a lifelong student who has recently found the Tidyverse to be an effective means of working with and displaying this data. Note that this is not meant to be a work of in-depth analysis; it is simply a review of one attempt to use tidyverse functions to do good work with an interesting dataset.

Our goal is to look at how COVID-19 is affecting different states across the US. The intention was to build a small-multiple graphic that allows comparing caseload curves while also appreciating the differences (if any) around the country.

Setup

As always, there are assumptions built into any working code. The details of the setup described here should give some idea of the assumptions that underlie this code.

Configuration

The key configuration variable is the path to the local copy of the data, but other values below can be configured to experiment with alternative views.

Local Files

The first set of variables holds the location of a local copy of the dataset.

The method used here was to clone the COVID-19 repository provided by the JHU CSSE team. Cloning the repository results in a directory hierarchy of files underneath the COVID-19 directory.

# The following are based on the local copy of the data
jhu_directory <- "COVID-19/csse_covid_19_data/"

Script Parameters

For our visualization we are defining a couple of variables that can be used to adjust the look of the resulting chart.

A 7-day moving average serves to smooth the lines, and the choice of exactly seven days minimizes the effect of weekday-vs-weekend differences in reporting.

The upper-bound variable is used as part of the calculations that normalize the views of each chart in the set. Each state’s results are mapped to values relative to that state’s latest result, and the visualization then displays a set of charts running from zero to this upper bound so that the shapes of the curves are comparable (a tiny numeric sketch follows the code below).

# Size of the moving window for averaging
DESIRED_DAYS <- 7
before_count <- DESIRED_DAYS - 1

# Upper Bound of Y-Axis
UPPER_BOUND <- 2
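
As a tiny numeric sketch (with made-up values), this relative mapping is just division by the state’s final value, so the last day always lands at 1.0:

# Hypothetical normalized averages for one state, ending at 120 cases/million
nrm <- c(60, 90, 120)
nrm / dplyr::last(nrm)
# [1] 0.50 0.75 1.00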

Environment

The tidyverse library is the main dependency. The slider library does the hard work of properly calculating moving averages. We use lubridate just for a trivial way to calculate “one month ago”. ggrepel makes the labels in each chart both simple to add and easy to read. Finally, geofacet provides an interesting way to lay out the many charts in this visualization.

“If I have seen further it is by standing on the shoulders of Giants” – Isaac Newton

library(tidyverse)
library(lubridate)
library(geofacet)
library(ggrepel)
library(slider)

Inputs

This particular analysis is based on the time-series provided for the US, which has day by day values for cumulative confirmed cases categorized by state and county.

These files have several columns which define or label the locations, and then a bunch of columns for the values (one column for each date).
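
For reference, the raw header looks roughly like this – a set of label columns followed by one column per reporting date:

# First columns of the time-series file, followed by one column per date:
# UID, iso2, iso3, code3, FIPS, Admin2, Province_State, Country_Region,
# Lat, Long_, Combined_Key, 1/22/20, 1/23/20, ...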

jhu_uscases_filename <- "csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
jhu_uscases_cols <- cols(
    .default = col_double(),
    iso2 = col_character(),
    iso3 = col_character(),
    Admin2 = col_character(),
    Province_State = col_character(),
    Country_Region = col_character(),
    Combined_Key = col_character()
)

To make this data useful for the tidyverse functions, once it is read in we transform it into a “tidy” format. We also do a relatively trivial summarization to calculate case totals for each state, and once that is done we simplify the dataset down to just the few items we need.

Some may find the tidy approach awkward, but in an environment with many (sometimes arcane) methods to achieve any result, the tidyverse stands out as a sane system whose methods can be surprisingly effective. However, I will admit the new pivot functions make moving data in and out of tidy form a whole lot easier than the earlier “gather” and “spread” functions – and for me this makes the entire tidyverse much easier to approach.
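
As a toy sketch of that reshape (hypothetical values, not the real dataset), pivot_longer() turns the one-column-per-date layout into one row per location and date:

wide <- tibble::tibble(
    Combined_Key = "Anywhere, US",
    `3/01/20` = 5,
    `3/02/20` = 8
)
tidyr::pivot_longer(
    wide, -Combined_Key,
    names_to = "date_str", values_to = "Cases"
)
# A tibble: 2 x 3 -- one row per (Combined_Key, date_str) pair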

jhu_uscases_data <- read_csv(
    paste0(jhu_directory, jhu_uscases_filename),
    col_types = jhu_uscases_cols
) %>%
    # Drop the fields we are not going to use, simplify a couple of names
    select(
        -UID, -iso2, -iso3, -code3,
        -Lat, -Long_,
        Admin0 = "Country_Region",
        Admin1 = "Province_State"
    ) %>%
    # Convert the raw data into a tidy form
    pivot_longer(
        -c(FIPS, Admin0, Admin1, Admin2, Combined_Key),
        names_to = "date_str",
        values_to = "Cases"
    ) %>%
    mutate(
        Date = as.Date(date_str, "%m/%d/%y"),
        date_str = NULL
    ) %>%
    # Calculate statewide totals
    group_by(Admin0, Admin1, Date) %>%
    summarise(StateTotal = sum(Cases), .groups = "drop_last") %>%
    rename(Cases = StateTotal) %>%
    # Reduce to just the info we need
    ungroup() %>%
    select(
        Admin1, Date, Cases
    )

To aid comparisons between different states it can be useful to normalize case counts by the population of each state; for this purpose I use the number of cases per million of population in each state.

Thankfully, the Johns Hopkins team also provides a “lookup table” for all locations they track, which includes Population (as well as geographic coordinates and the like).

# Get lookup data for population counts
jhu_lookup_filename <- "UID_ISO_FIPS_LookUp_Table.csv"
jhu_lookup_cols <- cols(
    UID = col_double(),
    iso2 = col_character(),
    iso3 = col_character(),
    code3 = col_double(),
    FIPS = col_double(),  # override, was: col_character(),
    Admin2 = col_character(),
    Province_State = col_character(),
    Country_Region = col_character(),
    Lat = col_double(),
    Long_ = col_double(),
    Combined_Key = col_character(),
    Population = col_double()
)

jhu_lookup_data <- read_csv(
    paste0(jhu_directory, jhu_lookup_filename),
    col_types = jhu_lookup_cols
) %>%
    select(
        -UID, -iso2, -iso3, -code3,
        -Lat, -Long_, -Combined_Key,
        Admin0 = "Country_Region",
        Admin1 = "Province_State"
    ) %>%
    filter(
        # Keep only US state-level rows; county rows carry an Admin2 value
        Admin0 == "US",
        is.na(Admin2)
    ) %>%
    select(
        Admin1, Population
    )

Operate

Frankly, there is not much analysis in this exercise; the operations here really just use the many tools of the tidyverse to strip all this data down to the chart we seek.

Join Datasets And Calculate Moving Averages

Joining the datasets enables us to calculate the normalized case counts, and as part of the same operation we can then calculate a 7-day moving average (which helps smooth out the situations where some states tend to report fewer cases on Saturday and Sunday).

jhu_workset <-
    # Join with Lookup data to get Population
    left_join(
        jhu_uscases_data, jhu_lookup_data,
        by = "Admin1"
    ) %>%
    select(
        Name = Admin1, Population, Date, Cases
    ) %>%
    # Convert running totals into daily new cases and generate running average
    group_by(Name) %>%
    arrange(Date) %>%
    mutate(
        NewCases = Cases - lag(Cases),
        AvgCases = slide_dbl(
            NewCases, mean,
            .before = before_count, .complete = TRUE
        ),
        NrmAvgCases = AvgCases / (Population / 1000000),
        RelCases = NrmAvgCases / last(NrmAvgCases)
    ) %>%
    # Reduce to simplify
    ungroup() %>%
    select(
        Date, Name, AvgCases, NrmAvgCases, RelCases
    )
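
As a quick sketch of the window behavior (toy numbers, not our data): with .before set to 6 and .complete = TRUE, slide_dbl() returns NA until a full seven values are available, which is why the first week of each state’s series carries no average:

slider::slide_dbl(1:10, mean, .before = 6, .complete = TRUE)
# [1] NA NA NA NA NA NA  4  5  6  7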

Min and Max Cases

First, the lubridate tools in the Tidyverse help keep date calculations clean, though it is more out of habit than necessity that I use the library just to figure out the date exactly one month ago.

The “neat” trick is in the second step: using the new slice functions in the updated dplyr library to neatly extract exactly one date each for when the min and max values were reached.
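
As a quick illustration on a built-in dataset before applying it to our data, slice_min() with with_ties = FALSE guarantees exactly one row per group, even when several rows share the minimum value:

# One lowest-mpg row for each value of cyl, ties broken arbitrarily
mtcars %>%
    group_by(cyl) %>%
    slice_min(mpg, with_ties = FALSE)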

# Filter down to just the dates that are of interest
latest_date <- max(jhu_workset$Date)
chart_start_date <- latest_date - months(1, FALSE)
viewset <- jhu_workset %>%
    filter(
        NrmAvgCases > 0,
        Date > chart_start_date
    )

# Find the min and max for each state now that we've removed unwanted dates
min_dates <- viewset %>%
    group_by(Name) %>%
    slice_min(RelCases, with_ties = FALSE) %>%
    ungroup() %>%
    select(Name, MinDate = Date, MinVal = AvgCases)
max_dates <- viewset %>%
    group_by(Name) %>%
    slice_max(RelCases, with_ties = FALSE) %>%
    ungroup() %>%
    select(Name, MaxDate = Date, MaxVal = AvgCases)

# Join the min/max to our data, and generate a label for those min/max dates
viewset <- viewset %>%
    left_join(min_dates, by = "Name") %>%
    left_join(max_dates, by = "Name") %>%
    # Generate a label for the datapoints on the specific min/max dates
    group_by(Name) %>%
    mutate(
        Label = if_else(
            Date == MaxDate,
            as.character(round(MaxVal)),
            if_else(
                Date == MinDate,
                as.character(round(MinVal)),
                ""
            )
        )
    ) %>%
    # Reduce back down to just what we need
    ungroup() %>%
    select(
        -AvgCases,
        -MinVal, -MaxVal, -MinDate, -MaxDate
    )
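
As an aside, the nested if_else() calls above could be written with dplyr’s case_when() instead, which some may find flatter and easier to scan. Here is a self-contained toy (made-up dates and values) showing the equivalent logic:

# Toy data standing in for one state's rows
toy <- tibble::tibble(
    Date = as.Date("2020-10-01") + 0:2,
    MinDate = as.Date("2020-10-01"),
    MaxDate = as.Date("2020-10-03"),
    MinVal = 10.2,
    MaxVal = 29.8
)
toy %>%
    mutate(
        Label = case_when(
            Date == MaxDate ~ as.character(round(MaxVal)),
            Date == MinDate ~ as.character(round(MinVal)),
            TRUE ~ ""
        )
    )
# Labels "10" and "30" appear only on the min and max dates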

Display

Using the many features of ggplot() we can build a very complex chart showing small multiples of each state’s case rates over time.

Simple Faceting

Each chart displays the same sequence of days across the X-axis, and each displays the same Y-axis relative to the last day’s value (the Y-axis runs from 0 to 2, with the last day’s value centered at 1.0 and all other points displayed relative to that last point). In this way the shapes of the curves are easily compared.

The shape of the curve in each chart provides a sense of each state’s recent trend, but the color of the datapoints can provide an idea of how serious the situation might be: the “hotter” the color, the higher the number of cases per million residents of that state. The legend has been tweaked pretty seriously, using a log10 transform of the normalized case counts with the vivid “magma” variant of the viridis color scales – reworking the scale by orders of magnitude makes it possible to get a reasonable sense of each state’s outbreak even in the tiny subplots of the small-multiple view. The legend itself has been hacked with calls to guides() and theme() to render a long, thin legend along the bottom of the chart, which makes it easier to read the colors of the states while taking up only a bit of visual room.

To supply some detailed information for each state, the min and max values in each chart are noted, with the exact values printed in small type for those who may want to squint.

gg <- ggplot(
    viewset,
    aes(
        x = Date,
        y = RelCases,
        label = Label,
        color = NrmAvgCases
        )
) +
    # lay down one reference line
    geom_hline(
        yintercept = 1.0,
        color = "gray90",
        size = 0.5
    ) +
    # add our data
    geom_point(
        size = 1,
        na.rm = TRUE
    ) +
    # Style the chart
    coord_cartesian(
        ylim = c(0, UPPER_BOUND)
    ) +
    scale_x_date(
        date_breaks = "1 month",
        date_labels = "%b"
    ) +
    scale_color_viridis_c(
        trans = "log10",
        n.breaks = 7, breaks = waiver(),
        option = "A", direction = 1
    ) +
    guides(
        color = guide_colorbar(
            barwidth = unit(0.75, "npc"),
            barheight = unit(0.01, "npc"),
            title = "Cases/Million (log scale)"
        )
    ) +
    geom_text_repel(
        min.segment.length = unit(0, 'lines'), 
        size = 2,
        color = "black"
    ) +
    facet_wrap(~ Name) +
    labs(
        title = "COVID By US State: Last Month of New Cases Reported per Day",
        subtitle = paste(
            "Daily Cases Plotted Relative to Latest Counts,",
            "Colored by Daily Cases per Million Population"
        ),
        y = "New Cases Reported (7-day moving average)",
        x = "",
        caption = paste(
            "Visualization by Alexander Carlton;",
            "Data from Johns Hopkins University CSSE,",
            "including data through",
            format(latest_date, "%B %e, %Y")
        )
    ) +
    theme_minimal() +
    theme(
        plot.background = element_rect(
            fill = "gray90", color = "gray90"
        ),
        strip.background = element_rect(
            fill = "white", color = "white"
        ),
        panel.background = element_rect(
            fill = "white", color = "white"
        ),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5),
        axis.text.y = element_blank(),
        legend.position = "bottom"
    )

print(gg)

These charts are all drawn normalized to each other, with the last date always falling at the mid-point of the Y-axis. Unfortunately, almost all of these charts are rising toward that last date, showing that the peak values of the last month have been reached only recently. Hawaii is the one case where it is easy to see that the trend all month has been downward.

Adapting the Layout to Provide More Information

Displaying these charts alphabetically is space-efficient, but a bit hard to appreciate visually. We can take advantage of people’s familiarity with the basic layout of the states in America to re-arrange these charts in such a way that one can get a quick sense of regional differences.

The geofacet library can be fun to work with. Obviously there are severe limits to the accuracy of any attempt to replicate geographic complexity with nothing more than a set of identically sized rectangles, but even with all the distortions a decent layout of these charts can provide a good means of delivering some insights about regional behavior.

Defining a Layout

The geofacet library comes with many pre-existing layouts, but depending upon one’s needs sometimes it is better to define your own (or perhaps just tweak one of the existing layouts to better suit your data).

# Define an arrangement of states into a grid layout for facet_geo()
unitedstates_grid <- data.frame(
    row = c(
                                   1, 1, 1,
        2, 2, 2, 2, 2, 2, 2,    2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
           5, 5, 5, 5, 5, 5, 5,
                 6, 6, 6, 6, 6,          6,
        7, 7,    7,          7,          7
    ),
    col = c(
                                   10, 11, 12,
        1, 2, 3, 4, 5, 6, 7,    9, 10, 11, 12, 
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
           2, 3, 4, 5, 6, 7, 8,
                 4, 5, 6, 7, 8,            12, 
        1, 2,    4,          8,            12
    ),
    code = c(
                                                              "VT", "NH", "ME",
        "WA", "ID", "MT", "ND", "MN", "WI", "MI",       "NY", "CT", "RI", "MA", 
        "OR", "NV", "WY", "SD", "IA", "IL", "IN", "OH", "PA", "NJ",
        "CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE",
              "AZ", "NM", "KS", "AR", "TN", "NC", "SC",
                          "OK", "LA", "MS", "AL", "GA",                   "DC",
        "AK", "HI",       "TX",                   "FL",                   "PR"
    ),
    name = c(
        "Vermont", "New Hampshire", "Maine",
        "Washington", "Idaho", "Montana", "North Dakota", "Minnesota",
            "Wisconsin", "Michigan", "New York", "Connecticut", "Rhode Island", 
            "Massachusetts",
        "Oregon", "Nevada", "Wyoming", "South Dakota", "Iowa", "Illinois",
            "Indiana", "Ohio", "Pennsylvania", "New Jersey",
        "California", "Utah", "Colorado", "Nebraska", "Missouri", "Kentucky",
            "West Virginia", "Virginia", "Maryland", "Delaware",
        "Arizona", "New Mexico", "Kansas", "Arkansas", "Tennessee", 
            "North Carolina", "South Carolina",
        "Oklahoma", "Louisiana", "Mississippi", "Alabama", "Georgia",
            "District of Columbia",
        "Alaska", "Hawaii", "Texas", "Florida", "Puerto Rico"
    ),
    stringsAsFactors = FALSE
)
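
Before committing a custom grid to a full chart, geofacet can draw a quick schematic of the layout, which is handy for catching row/column typos. (I believe the grid_preview() helper has been available in geofacet for a while; treat this as an optional sanity check.)

# Render a quick schematic of the grid to verify row/col placement
grid_preview(unitedstates_grid)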

Rebuilding the Chart

Below we repeat the same small-multiples chart, just substituting a call to facet_geo() for the earlier facet_wrap().

# Generate the plot
geo <- ggplot(
    viewset,
    aes(
        x = Date,
        y = RelCases,
        label = Label,
        color = NrmAvgCases
        )
) +
    # lay down one reference line
    geom_hline(
        yintercept = 1.0,
        color = "gray90",
        size = 0.5
    ) +
    # add our data
    geom_point(
        size = 1,
        na.rm = TRUE
    ) +
    # Style the chart
    coord_cartesian(
        ylim = c(0, UPPER_BOUND)
    ) +
    scale_x_date(
        date_breaks = "1 month",
        date_labels = "%b"
    ) +
    scale_color_viridis_c(
        trans = "log10",
        n.breaks = 7, breaks = waiver(),
        option = "A", direction = 1
    ) +
    guides(
        color = guide_colorbar(
            barwidth = unit(0.75, "npc"),
            barheight = unit(0.01, "npc"),
            title = "Cases/Million (log scale)"
        )
    ) +
    geom_text_repel(
        min.segment.length = unit(0, 'lines'), 
        size = 2,
        color = "black"
    ) +
    facet_geo(
        ~ Name,
        grid = unitedstates_grid,
        move_axes = FALSE
    ) +
    labs(
        title = "COVID By US State: Last Month of New Cases Reported per Day",
        subtitle = paste(
            "Daily Cases Plotted Relative to Latest Counts,",
            "Colored by Daily Cases per Million Population"
        ),
        y = "New Cases Reported (7-day moving average)",
        x = "",
        caption = paste(
            "Visualization by Alexander Carlton;",
            "Data from Johns Hopkins University CSSE,",
            "including data through",
            format(latest_date, "%B %e, %Y")
        )
    ) +
    theme_minimal() +
    theme(
        plot.background = element_rect(
            fill = "gray90", color = "gray90"
        ),
        strip.background = element_rect(
            fill = "white", color = "white"
        ),
        panel.background = element_rect(
            fill = "white", color = "white"
        ),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5),
        axis.text.y = element_blank(),
        legend.position = "bottom"
    )

print(geo)

Clearly this is not an accurate map of the United States, but it is enough to show that the overall upward trend holds almost coast to coast. Northern New England and the West Coast states both look different from their neighboring regions, and the Dakotas appear to be at the center of an area currently experiencing a rapidly rising caseload.

Closing Thoughts

At least for me, working through this exercise proved to be a very good opportunity to become more familiar with several tidyverse features that hadn’t yet become a part of my usual processes.

I hope at least something here was helpful. This and some of my other exercises are part of a Neat And Tidy project with code shared on my GitHub.