Problem set 5

Due by 11:59 PM on Tuesday, November 20, 2018

Task 0: Setting things up

Create a new RStudio project somewhere on your computer. Open that new folder in Windows File Explorer or macOS Finder (however you navigate around the files on your computer), and create subfolders there named output and data.
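If you would rather create the two subfolders from the R console than from your file browser, the following is a minimal sketch. It assumes your working directory is the root of the new project, which is the default when you open an RStudio project.

```r
# Create the data and output subfolders in the current working directory.
# Assumption: the working directory is the root of your RStudio project.
for (folder in c("data", "output")) {
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}

# Check that both folders now exist
dir.exists(c("data", "output"))
```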

Download this R Markdown file and place it in the root of your newly-created project. You’ll probably have to right click on the link and choose “Save link as…”.

It contains a basic outline/skeleton of the tasks you’ll do in this assignment. Like before, it doesn’t have a lot.

Download these files and place them in your data folder:

In the end, the structure of your new project directory should look something like this:

      (and all the other ne_110m_admin_0_countries.* files)

Task 1: RIAA music revenues

The music landscape in the United States has seen multiple tectonic shifts over the past four decades. Use data from the RIAA to plot music revenues by format from 1977 to 2017. Figure out the best way to plot this (geom_area(), geom_line(), something else, etc.) and tell a story about the music industry.
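As a starting point, here is a rough sketch of a stacked area chart. The data frame, its column names (year, format, revenue), and all of its values are invented placeholders so that the code runs on its own. The real RIAA file will look different, and geom_area() is only one of the options worth trying.

```r
library(ggplot2)

# Placeholder data, invented purely to make this sketch runnable.
# These are NOT real RIAA figures, and the column names are assumptions.
toy_revenue <- data.frame(
  year    = rep(c(2000, 2005, 2010), times = 2),
  format  = rep(c("Format A", "Format B"), each = 3),
  revenue = c(10, 6, 2, 1, 4, 8)
)

# One filled area per format, stacked on top of each other
revenue_plot <- ggplot(toy_revenue,
                       aes(x = year, y = revenue, fill = format)) +
  geom_area() +
  labs(x = "Year", y = "Revenue (made-up units)", fill = "Format")

revenue_plot
```

Swapping geom_area() for geom_line() (and fill for color) gives an unstacked view of the same data, which can make individual formats easier to compare.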

Task 2: World map

Make a map showing the proportion of individuals in each country that had access to the internet in 2015. If you want to be super cool, make a second map showing that same proportion in 2000.

Some hints:

Task 3: Personal map

Draw your own map with your own points. This could be a map of places you’ve lived, or a map of places you’ve visited, or a map of places you want to visit. Anything!

The only requirement is that you find an appropriate shapefile (states, counties, world, etc.), collect latitude and longitude data from Google Maps, and plot the points (with or without labels) on a map.

Hint: Basically follow the code from class in the section named “Making your own geoencoded data”
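The point-plotting step might look roughly like the sketch below. The two places and their approximate coordinates are only examples, and the object names are hypothetical. In your own map you would also add a geom_sf() layer for your shapefile underneath the points.

```r
library(sf)
library(ggplot2)

# Example points with approximate coordinates; replace with your own places
my_places <- data.frame(
  place = c("New York City", "London"),
  lat   = c(40.71, 51.51),
  long  = c(-74.01, -0.13)
)

# Convert the plain data frame to an sf object.
# Longitude is the x coordinate and latitude is the y coordinate;
# EPSG 4326 is the WGS 84 system used by Google Maps coordinates.
my_places_sf <- st_as_sf(my_places, coords = c("long", "lat"), crs = 4326)

# Plot the points with labels nudged slightly above each one
places_plot <- ggplot() +
  geom_sf(data = my_places_sf, size = 3) +
  geom_sf_text(data = my_places_sf, aes(label = place), nudge_y = 2)

places_plot
```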

Task 4: Word frequencies

Download the entire corpus (or 6+ books) of some author on Project Gutenberg: Jane Austen, Victor Hugo, Emily Brontë, Lucy Maud Montgomery, Arthur Conan Doyle, Mark Twain, Henry David Thoreau, Fyodor Dostoyevsky, Leo Tolstoy, or anyone else. Just make sure it’s all from the same author.

Make the following plots and describe what each tells you about your corpus or author:

  1. Top 10 most frequent words in each book
  2. Top 10 most unique words in each book (see tf-idf)
  3. The most distinctive “he X” vs. “she X” bigrams in the author’s entire corpus

Hint: Pretty much all the code for this is at the class webpage. Adapt that code to fit your corpus.
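For orientation, the counting and tf-idf steps can be sketched on a tiny made-up corpus. The sentences, book names, and object names below are invented placeholders, not Project Gutenberg text, and the code from class may be organized differently.

```r
library(dplyr)
library(tidytext)

# Tiny invented corpus: one row per line of text, labeled by book
toy_books <- tibble(
  book = c("Book A", "Book A", "Book B"),
  text = c("The whale swam past the ship",
           "The ship followed the whale",
           "The garden was quiet and the garden was green")
)

# Split the text into one word per row and count words within each book
word_counts <- toy_books %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

# Most frequent words per book, after dropping common stop words
top_words <- word_counts %>%
  anti_join(stop_words, by = "word") %>%
  group_by(book) %>%
  slice_max(n, n = 10) %>%
  ungroup()

# tf-idf: words that are frequent in one book but rare in the others
book_tf_idf <- word_counts %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))
```

Stop words are kept for the tf-idf step because words that appear in every book get a tf-idf of zero anyway.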


When you’re done, submit a knitted PDF or Word file of your analysis on Learning Suite. As always, it’s best if the final knitted document is clean and free of warnings and messages (so if a chunk is creating messages, like wherever you run library(tidyverse), add message=FALSE, warning=FALSE to the chunk options).
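If many chunks produce messages, an alternative to editing each chunk is to set the options once for the whole document. This is a sketch of that approach using knitr’s opts_chunk$set(); it would go in a setup chunk near the top of your R Markdown file.

```r
# Hide messages and warnings in every chunk that follows.
# Individual chunks can still override these in their own chunk options.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```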

Optional extra fun tasks

Try doing one or more of the following: