Uncertainty

Materials for class on Tuesday, October 9, 2018

Slides
Live code
Code from today
Clearest and muddiest things

Slides

Download the slides from today’s lecture.

Live code

Use this link to see the code that I’m actually typing:

https://andhs.co/live-code

I’ve saved the R script to Dropbox, and that link goes to a live version of that file. Refresh or re-open the link as needed to copy/paste code I type up on the screen.

Code from today

In class we looked at the uncertainty in weather patterns in Provo (specifically 40.248752, -111.649216) for 2017.

First, we load the libraries we’ll be using:

library(tidyverse)
library(lubridate)
library(ggridges)
library(scales)

# Make the randomness reproducible
set.seed(1234)

Then we load the data. I used the darksky R package to download this historical data from Dark Sky, and then I saved the CSV file online.

weather_provo_raw <- read_csv("https://andhs.co/provoweather")

Then we wrangle the data, or make adjustments to it so that it’s easier to use and plot. Here we extract the month and weekday of each row as new columns:

# Wrangle data
weather_provo_2017 <- weather_provo_raw %>% 
  mutate(Month = month(date, label = TRUE, abbr = FALSE),
         Day = wday(date, label = TRUE, abbr = FALSE))

Now we can do stuff with it!

Histograms

With histograms, you don’t need to specify the y aesthetic, since ggplot will calculate the frequency for you.

ggplot(weather_provo_2017, aes(x = windSpeed)) +
  geom_histogram(binwidth = 1)

We can look at monthly histograms by using fill, but this is horrendous:

ggplot(weather_provo_2017, aes(x = windSpeed, fill = Month)) +
  geom_histogram(binwidth = 1)

You can also facet, which is probably better (here I turn off the fill legend, since there’s no reason to keep it—it’s obvious which color goes to which month)

ggplot(weather_provo_2017, aes(x = windSpeed, fill = Month)) +
  geom_histogram(binwidth = 1) +
  guides(fill = FALSE) +
  facet_wrap(~ Month)

Density plots

Density plots use the same syntax as histograms (i.e. you don’t need to specify a y aesthetic):

ggplot(weather_provo_2017, aes(x = windSpeed)) +
  geom_density(fill = "#24af53")

We can also incorporate months into density plots, with facets…

ggplot(weather_provo_2017, aes(x = windSpeed, fill = Month)) +
  geom_density() +
  guides(fill = FALSE) +
  facet_wrap(~ Month)

… or by plotting each density curve on top of the other and adding transparency (here I move the legend to the bottom of the plot):

ggplot(weather_provo_2017, aes(x = windSpeed, fill = Month)) +
  geom_density(alpha = 0.4) +
  theme(legend.position = "bottom")

Box plots

We can also make box plots. These are plotted vertically, so you need to put the numeric variable you’re interested in along the y-axis:

ggplot(weather_provo_2017, aes(y = windSpeed)) +
  geom_boxplot()

You can also fill or facet or whatever, just like before. Here I use day of the week instead of month, just for fun:

ggplot(weather_provo_2017, aes(y = windSpeed, x = Day, fill = Day)) +
  geom_boxplot() +
  guides(fill = FALSE)

Violin plots

Violin plots use the same syntax as box plots, with the numeric variable on the y-axis. You can use all the other aesthetics too, just like density plots and box plots.

ggplot(weather_provo_2017, aes(y = windSpeed, x = Day, fill = Day)) +
  geom_violin() +
  guides(fill = FALSE)

With violin plots, it’s often helpful to add additional geoms, like points:

ggplot(weather_provo_2017, aes(y = windSpeed, x = Day, fill = Day)) +
  geom_violin() +
  geom_point(size = 0.5, position = position_jitter(width = 0.1)) +
  guides(fill = FALSE)

Ridge plots

The ggridges package has a ton of options for plotting stacked density plots. The documentation in R itself is somewhat sparse, but the package’s main webpage has many examples of how to use the different geoms.

Here’s the density of windspeed by month (scale = x adjusts how much the density plots overlap; fct_rev() reverses the month variable so that January goes on top):

ggplot(weather_provo_2017, aes(x = windSpeed, y = fct_rev(Month),
                               fill = Month)) +
  geom_density_ridges(scale = 5) +
  guides(fill = FALSE) +
  labs(x = "Wind speed", y = NULL) +
  theme_minimal()

## Picking joint bandwidth of 0.52

Here’s temperature over time. You can add quantile lines with quantile_lines. Specifying 2 quantiles will split each density in half (giving you the median):

ggplot(weather_provo_2017, aes(x = temperatureHigh, y = fct_rev(Month),
                               fill = Month)) +
  geom_density_ridges(scale = 5, quantile_lines = TRUE, quantiles = 2) +
  guides(fill = FALSE) +
  labs(x = "Daily high temperature", y = NULL) +
  theme_minimal()

## Picking joint bandwidth of 3.48

We can also fill each density with a gradient to look extra cool. The ..x.. thing in the fill aesthetic is just how geom_density_ridges_gradient() works. The only way you’d know that (and the only way I knew that) was to look at the example page.

ggplot(weather_provo_2017, aes(x = temperatureHigh, y = Month, fill = ..x..)) +
  geom_density_ridges_gradient(quantile_lines = TRUE, quantiles = 2) + 
  scale_fill_viridis_c(option = "plasma", name = "Temp")

## Picking joint bandwidth of 3.48

We can plot the high and low temperatures at the same time if we manipulate the data a little. Here we use gather() to take the temperatureLow and temperatureHigh columns and rearrange them so that there’s a column for the actual temperature and a column indicating if it’s high or low

weather_provo_2017_plot <- weather_provo_2017 %>% 
  gather(temperature_type, temperature, temperatureLow, temperatureHigh) %>% 
  mutate(temperature_type = recode(temperature_type, 
                                   temperatureHigh = "High",
                                   temperatureLow = "Low"))

# Print out the first few rows
weather_provo_2017_plot %>% 
  select(date, temperature_type, temperature) %>% 
  head()

## # A tibble: 6 x 3
##   date                temperature_type temperature
##   <dttm>              <chr>                  <dbl>
## 1 2017-01-01 12:00:00 Low                    25.0 
## 2 2017-01-02 12:00:00 Low                    24.1 
## 3 2017-01-03 12:00:00 Low                    31.0 
## 4 2017-01-04 12:00:00 Low                    19.1 
## 5 2017-01-05 12:00:00 Low                     3.5 
## 6 2017-01-06 12:00:00 Low                     9.02

With the data in this format we can plot both high and low at the same time and show them differently with linetype, which will change the border around each density. There are a bunch of other adjustments here to make the plot more CRAPy.

ggplot(weather_provo_2017_plot, 
       aes(x = temperature, y = fct_rev(Month), fill = ..x.., linetype = temperature_type)) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis_c(option = "plasma") +
  guides(linetype = guide_legend(title = NULL),
         fill = guide_colorbar(title = NULL)) +
  labs(x = NULL, y = NULL,
       title = "Low and high temperatures in Provo, Utah",
       subtitle = "January 1, 2017-December 31, 2017",
       caption = "Source: Dark Sky") +
  theme_minimal(base_family = "Roboto Condensed") +
  theme(legend.position = "bottom",
        legend.key.width = unit(3, "lines"),
        legend.key.height = unit(0.5, "lines"))

## Picking joint bandwidth of 3.2

Subjective probabilities

Here’s a full example of another ridgeplot, based on a survey done on Reddit about what each of these subjective probabilities mean in numbers.

probability_raw <- read_csv("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv")

# This data is wide; each column is one of the subjective probabilities. 
# Here we make this long.
probability <- probability_raw %>% 
  gather(phrase, value) %>% 
  mutate(value = value / 100)

# Here we calculate the media probability for each phrase and put them in order
probability_order <- probability %>% 
  group_by(phrase) %>% 
  summarize(median_prob = median(value)) %>% 
  arrange(median_prob)

# This is the data frame that we'll use to plot. We use the order of the phrases
# from `probability_order` and make the phrase column an ordered factor based on
# that order. Without this, the phrases will plot in alphabetic order
probability_plot <- probability %>% 
  mutate(phrase = factor(phrase, levels = probability_order$phrase, ordered = TRUE))

ggplot(probability_plot, aes(x = value, y = phrase, fill = phrase)) +
  geom_density_ridges(scale = 5, jittered_points = TRUE,
                      quantile_lines = TRUE, quantiles = 2,
                      point_size = 0.5, point_alpha = 0.3) +
  scale_fill_viridis_d(option = "viridis", guide = FALSE) +
  coord_cartesian(xlim = c(0, 1)) +
  scale_x_continuous(labels = percent) +
  theme_minimal(base_family = "Roboto Condensed")

## Picking joint bandwidth of 0.0343

\(•◡•)/

Clearest and muddiest things

Go to this form and answer these two questions:

What was the muddiest thing from class today? What are you still wondering about?
What was the clearest thing from class today? What was the most exciting thing you learned?

I’ll compile the questions and send out answers after class.