Annotating and grouping

Materials for class on Tuesday, November 27, 2018

Slides
Data to download
Live code
Louisville animal bites
Iterative design + grouping and annotating
Clearest and muddiest things

Slides

Download the slides from today’s lecture.

Data to download

Download these and put them in a folder named “data” in an RStudio project:

World happiness: I collected this data from the UN and the World Bank. If you’re interested, you can see the R script I used to create this dataset here.
Louisville animal bites: See complete column descriptions. The data is released under a public domain license and hosted originally at Kaggle.

Live code

Use this link to see the code that I’m actually typing:

https://andhs.co/live-code

I’ve saved the R script to Dropbox, and that link goes to a live version of that file. Refresh or re-open the link as needed to copy/paste code I type up on the screen.

Louisville animal bites

Use some of this code to help you get started. You don’t have to use it. It counts dog, cat, and other bites for each year from 2010 through 2017. Feel free to do whatever you want. You’re iterating here!

library(tidyverse)
library(lubridate)

bites_raw <- read_csv("data/Health_AnimalBites.csv")

# Or directly from the internet if you want
# bites_raw <- read_csv("https://datavizf18.classes.andrewheiss.com/data/Health_AnimalBites.csv")

# Add a year column, collapse species into dog/cat/other, and keep 2010-2017
bites <- bites_raw %>%
  mutate(year = year(bite_date)) %>%
  mutate(species = case_when(
    SpeciesIDDesc == "CAT" ~ "Cat",
    SpeciesIDDesc == "DOG" ~ "Dog",
    TRUE ~ "Other"  # Everything else, including rows with a missing species
  )) %>%
  mutate(species = factor(species, levels = c("Dog", "Cat", "Other"), ordered = TRUE)) %>%
  filter(year < 2018, year >= 2010)

# Count the number of bites for each species in each year
bites_species_year <- bites %>%
  filter(!is.na(species)) %>%
  group_by(year, species) %>%
  summarize(total_bites = n())
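
If you want a first plot to iterate on, here is one possible starting point. It is only a sketch: toy_bites is a made-up data frame with the same three columns as bites_species_year (the counts are not real), and the annotation text and position are placeholders.

```r
library(tidyverse)

# Made-up counts with the same columns as bites_species_year
toy_bites <- tribble(
  ~year, ~species, ~total_bites,
  2010, "Dog", 50,
  2011, "Dog", 60,
  2012, "Dog", 55,
  2010, "Cat", 20,
  2011, "Cat", 25,
  2012, "Cat", 22
)

# One line per species, plus a single text annotation placed by hand
plot_bites <- ggplot(toy_bites, aes(x = year, y = total_bites, color = species)) +
  geom_line() +
  geom_point() +
  annotate("text", x = 2011, y = 63, label = "Hypothetical note") +
  scale_x_continuous(breaks = 2010:2012) +
  labs(x = NULL, y = "Total bites", color = NULL)

plot_bites
```

Swap toy_bites for bites_species_year once the real data is loaded, and adjust the breaks and the annotation to match.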

Iterative design + grouping and annotating

Here are some fairly polished plots based on the world happiness index and other UN and World Bank data, all arranged in a nice 3-panel figure with patchwork. This is the final output. Getting to this point took a while and went through lots of different iterations, which is the creative process in action.

library(tidyverse)
library(ggrepel)
library(broom)  # For dealing with models as data frames
library(patchwork)
library(ggbeeswarm)  # For cool dot plots

happiness <- read_csv("data/world_happiness.csv")

happiness_clean <- happiness %>% 
  mutate(in_asia = region == "East Asia & Pacific") %>% 
  mutate(label_to_plot = ifelse(in_asia, country, NA)) %>% 
  mutate(region_big = case_when(
    region == "East Asia & Pacific" ~ "Asia",
    region == "Europe & Central Asia" ~ "Europe",
    region == "Latin America & Caribbean" ~ "North & South America",
    region == "North America" ~ "North & South America",
    region == "South Asia" ~ "Asia",
    TRUE ~ region
  )) %>% 
  mutate(region_big = factor(region_big, 
                             levels = c("North & South America", "Europe", 
                                        "Middle East & North Africa", "Asia", 
                                        "Sub-Saharan Africa"), 
                             ordered = TRUE))

Happiness explained by life expectancy

Here’s the relationship between life expectancy and national happiness, with East Asian and Oceanic countries highlighted with redundant shapes. Note how instead of using annotate(), I make a separate data frame called extra_labels and then use geom_text() to plot it. This might be overkill here, since I’m only plotting two labels, but it allows for more flexibility later if I want to add more labels without stacking up even more annotate() layers.
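
The label-data-frame idea can be sketched roughly like this. This is not the exact code behind the plot: toy stands in for happiness_clean with made-up values, life_expectancy and happiness_score are assumed column names, and the label text and coordinates are placeholders.

```r
library(tidyverse)

# Toy stand-in for happiness_clean; column names other than country and
# in_asia are assumptions, and all values are made up
toy <- tibble(
  country = c("A", "B", "C", "D", "E", "F"),
  in_asia = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  life_expectancy = c(62, 68, 71, 75, 79, 83),
  happiness_score = c(4.1, 4.9, 5.2, 5.8, 6.6, 7.1)
)

# One row per label: where it goes and what it says
extra_labels <- tribble(
  ~x, ~y, ~label,
  65, 6.8, "Hypothetical label 1",
  80, 4.5, "Hypothetical label 2"
)

plot_labels <- ggplot(toy, aes(x = life_expectancy, y = happiness_score)) +
  # Color and shape both encode in_asia (redundant coding)
  geom_point(aes(color = in_asia, shape = in_asia), size = 3) +
  # A single geom_text() layer draws every row of extra_labels
  geom_text(data = extra_labels, aes(x = x, y = y, label = label),
            inherit.aes = FALSE) +
  labs(x = "Life expectancy", y = "Happiness score")

plot_labels
```

Adding a third label means adding one more row to extra_labels, not another layer.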

Happiness explained by life expectancy, colored by region

Here I collapsed some of the regions with case_when() up above, and then generated a palette of five perceptually uniform and colorblind-friendly colors at iWantHue.

The other cool thing about this plot is final_predicted_points, which runs a linear regression model on each region and then determines the final predicted point for each line, which I then use with geom_text_repel() to put region names directly on the plot.
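
Here is a rough sketch of how something like final_predicted_points could be built. It is one way to do it, not necessarily the code used for the plot: toy has made-up values, life_expectancy and happiness_score are assumed column names, fit_region() is a hypothetical helper, and the two hex colors are placeholders, not the iWantHue palette.

```r
library(tidyverse)
library(broom)
library(ggrepel)

# Toy stand-in for happiness_clean with two regions and made-up values
toy <- tibble(
  region_big = rep(c("Asia", "Europe"), each = 4),
  life_expectancy = c(60, 66, 72, 78, 70, 74, 78, 82),
  happiness_score = c(4.0, 4.6, 5.1, 5.9, 5.2, 5.9, 6.3, 7.0)
)

# Hypothetical helper: fit a linear model to one region's rows and return
# those rows with the fitted values attached in a .fitted column
fit_region <- function(df, key) {
  model <- lm(happiness_score ~ life_expectancy, data = df)
  augment(model, data = df)
}

# Run the model once per region, then keep the row with the largest x value,
# which is where each regression line ends
final_predicted_points <- toy %>%
  group_by(region_big) %>%
  group_modify(fit_region) %>%
  slice_max(life_expectancy, n = 1, with_ties = FALSE) %>%
  ungroup()

plot_regions <- ggplot(toy, aes(x = life_expectancy, y = happiness_score,
                                color = region_big)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Put each region name at the end of its line instead of in a legend
  geom_text_repel(data = final_predicted_points,
                  aes(y = .fitted, label = region_big),
                  nudge_x = 1, direction = "y") +
  scale_color_manual(values = c("Asia" = "#b5543c", "Europe" = "#5b7bb4")) +
  guides(color = "none")

plot_regions
```

Labeling at .fitted, not at the observed happiness score, keeps each name attached to the end of its line instead of to the last dot.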

Happiness by region

Here I just plot happiness scores (i.e. no comparison with life expectancy or anything else) by region. I use geom_quasirandom() from the ggbeeswarm package, which jitters points in cool shapes.
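
A minimal sketch of that kind of plot, assuming a happiness_score column and using made-up values in place of the real data:

```r
library(tidyverse)
library(ggbeeswarm)

# Toy stand-in for happiness_clean; values are made up
toy <- tibble(
  region_big = factor(rep(c("Europe", "Asia"), each = 6),
                      levels = c("Europe", "Asia")),
  happiness_score = c(5.9, 6.1, 6.3, 6.4, 6.9, 7.2,
                      4.3, 4.8, 5.0, 5.1, 5.6, 6.0)
)

# geom_quasirandom() spreads points sideways within each region so that
# overlapping values stay visible
plot_beeswarm <- ggplot(toy, aes(x = region_big, y = happiness_score,
                                 color = region_big)) +
  geom_quasirandom(size = 3) +
  guides(color = "none") +
  labs(x = NULL, y = "Happiness score")

plot_beeswarm
```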

Combined mega plot with patchwork

Finally, I put all of these together in a final combined plot using the patchwork package.

Note how I make some adjustments to plot1, plot2, and plot3, like shrinking the titles and adding tags. Also note that I use / and + and * and & to combine the plots in the right configuration. I figured this out by reading the README at patchwork’s GitHub repository.
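
To get a feel for the operators, here is a small sketch with three placeholder plots. The layout is only an example, not the configuration of the final figure, and the plots, titles, and tags are made up.

```r
library(tidyverse)
library(patchwork)

toy <- tibble(x = 1:10, y = (1:10)^2)

# Three placeholder plots, each with a title and a tag
plot1 <- ggplot(toy, aes(x, y)) + geom_point() + labs(title = "Plot 1", tag = "A")
plot2 <- ggplot(toy, aes(x, y)) + geom_line() + labs(title = "Plot 2", tag = "B")
plot3 <- ggplot(toy, aes(x, y)) + geom_col() + labs(title = "Plot 3", tag = "C")

# + puts plots side by side, / stacks them, and & applies a change
# (here, smaller titles) to every plot in the figure
combined <- (plot1 + plot2) / plot3 &
  theme(plot.title = element_text(size = rel(0.9)))

combined
```

The * operator works like &, but it only touches plots at the current nesting level, so it skips anything inside a nested group.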

Clearest and muddiest things

Go to this form and answer these two questions:

What was the muddiest thing from class today? What are you still wondering about?
What was the clearest thing from class today? What was the most exciting thing you learned?

I’ll compile the questions and send out answers after class.