Introduction to R: Plotting

Kevin Reuning

Goals for Today

  • Understand how to make a variety of basic plots (histograms and scatter plots)
  • Manipulate data (relabeling)
  • Export graphs

Data for Today

Today we are going to keep using a subset of country data from The Quality of Government Institute.

library(tidyverse)
setwd("images")
df <- read_csv("country_data.csv")

From Last Week

Our groups could have better names than 0, 1, 2, and 3.

df %>% filter(!is.na(br_elect)) %>% 
    group_by(br_elect) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())
# A tibble: 4 × 3
  br_elect   mean     n
     <dbl>  <dbl> <int>
1        0 52084.    10
2        1 11348.     7
3        2 11188.    57
4        3 21459.   118

Changing variables

mutate()

If we want to change the values of a variable, we use mutate().

General format:

mutate(new_var = f(old_var))

What function, though? br_elect is a small set of numbers where each number represents a category. In R, this kind of variable is a factor.

Applying factor()

factor() creates a factor with a set of numeric levels and a label for each level, so it needs a few things:

  • The variable you are changing (the first argument).
  • The set of numeric levels in the variable. levels=c( )
  • Labels you want to apply to the levels. labels=c( )
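A minimal sketch of how these three pieces fit together, using a small made-up vector (the name codes and its values are illustrative, not from the course data). One thing worth noticing is that any value not listed in levels= becomes NA:

```r
# Hypothetical codes; 9 is deliberately not one of the listed levels
codes <- c(0, 3, 2, 9, 1)

labelled <- factor(codes,
                   levels = c(0, 1, 2, 3),
                   labels = c("None", "Single-party",
                              "Non-dem multi-party", "Democratic"))

labelled          # the value 9 shows up as <NA>
levels(labelled)  # the labels, in the order they were given
```

This is why it is worth checking the values in a variable before turning it into a factor.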

Example of Factor

library(tidyverse)
df %>% mutate(br_elect_label = 
        factor(br_elect, levels=c(0, 1, 2, 3), 
                labels=c("None", "Single-party", 
                "Non-dem multi-party", "Democratic"))) %>%
    pull(br_elect_label)
  [1] Non-dem multi-party Democratic          Non-dem multi-party
  [4] <NA>                Non-dem multi-party Democratic         
  [7] Non-dem multi-party Democratic          Democratic         
 [10] Democratic          Democratic          Non-dem multi-party
 [13] Non-dem multi-party Democratic          Democratic         
 [16] Democratic          Democratic          Democratic         
 [19] Non-dem multi-party Democratic          Democratic         
 [22] Democratic          Democratic          None               
 [25] Democratic          Non-dem multi-party Non-dem multi-party
 [28] Non-dem multi-party Non-dem multi-party Non-dem multi-party
 [31] Democratic          Democratic          Non-dem multi-party
 [34] Democratic          Non-dem multi-party Democratic         
 [37] None                Democratic          Democratic         
 [40] Democratic          Non-dem multi-party Non-dem multi-party
 [43] Democratic          Democratic          Single-party       
 [46] Democratic          Democratic          Democratic         
 [49] Democratic          Democratic          Democratic         
 [52] Democratic          Democratic          Single-party       
 [55] Non-dem multi-party None                Democratic         
 [58] Non-dem multi-party Democratic          Democratic         
 [61] Non-dem multi-party Non-dem multi-party Democratic         
 [64] Democratic          Democratic          Democratic         
 [67] Democratic          Democratic          Democratic         
 [70] Democratic          Democratic          Democratic         
 [73] Non-dem multi-party Non-dem multi-party Democratic         
 [76] Democratic          Democratic          Democratic         
 [79] Single-party        Non-dem multi-party Democratic         
 [82] Democratic          Democratic          Non-dem multi-party
 [85] Democratic          Democratic          Non-dem multi-party
 [88] Non-dem multi-party Non-dem multi-party Single-party       
 [91] Democratic          Non-dem multi-party Democratic         
 [94] Single-party        Non-dem multi-party Democratic         
 [97] Democratic          Democratic          Non-dem multi-party
[100] Democratic          Democratic          Democratic         
[103] Non-dem multi-party Democratic          Democratic         
[106] Non-dem multi-party Democratic          Democratic         
[109] Non-dem multi-party Democratic          Democratic         
[112] Democratic          Democratic          Democratic         
[115] Non-dem multi-party Non-dem multi-party Non-dem multi-party
[118] Non-dem multi-party Non-dem multi-party Democratic         
[121] Democratic          Democratic          Democratic         
[124] Democratic          Non-dem multi-party Democratic         
[127] Democratic          Democratic          Democratic         
[130] Democratic          Democratic          Democratic         
[133] Democratic          Democratic          Democratic         
[136] Democratic          Democratic          Democratic         
[139] Democratic          Democratic          Democratic         
[142] None                Democratic          Non-dem multi-party
[145] Non-dem multi-party Democratic          Democratic         
[148] Democratic          <NA>                Democratic         
[151] None                Democratic          Democratic         
[154] Democratic          Democratic          Non-dem multi-party
[157] Democratic          Single-party        Democratic         
[160] None                Non-dem multi-party Non-dem multi-party
[163] Democratic          None                Non-dem multi-party
[166] Democratic          Non-dem multi-party Democratic         
[169] Democratic          Non-dem multi-party Non-dem multi-party
[172] None                Non-dem multi-party Democratic         
[175] Democratic          None                Democratic         
[178] Non-dem multi-party Non-dem multi-party Democratic         
[181] Non-dem multi-party Democratic          Democratic         
[184] Non-dem multi-party Democratic          Non-dem multi-party
[187] Democratic          Non-dem multi-party Democratic         
[190] Non-dem multi-party Single-party        Non-dem multi-party
[193] None                Democratic         
Levels: None Single-party Non-dem multi-party Democratic

Making better labels

df %>% filter(!is.na(br_elect)) %>% 
    mutate(br_elect_label = 
            factor(br_elect, levels=c(0, 1, 2, 3), 
                    labels=c("None", "Single-party", 
                    "Non-dem multi-party", "Democratic"))) %>% 
    group_by(br_elect_label) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())
# A tibble: 4 × 3
  br_elect_label        mean     n
  <fct>                <dbl> <int>
1 None                52084.    10
2 Single-party        11348.     7
3 Non-dem multi-party 11188.    57
4 Democratic          21459.   118

More Complicated Situations

Often there are variables we want to coarsen (take an interval variable and put its values into categories). Here we can use the cut() function to cut our variable up into categories.

There are a variety of ways to use cut(). The most common is to give it a set of “breaks” that tell it where you want the edges of the bins to be.

set.seed(1) ## Lets me recreate the exact random variables
vector <- rnorm(100) # creating a bunch of random variables 
cut_vector <- cut(vector, breaks=c(-4,-1, -0.5, 0, 0.5, 1, 4))
table(cut_vector)
cut_vector
  (-4,-1] (-1,-0.5]  (-0.5,0]   (0,0.5]   (0.5,1]     (1,4] 
       11        14        21        20        19        15 
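The bin names in this output use interval notation: (-1,-0.5] leaves out -1 and includes -0.5. One consequence, sketched below with made-up values (scores is a hypothetical vector), is that a value exactly equal to the lowest break falls outside every bin and becomes NA unless you add include.lowest=TRUE:

```r
scores <- c(0, 10, 33, 34, 100)  # made-up values for illustration

# Default: intervals are open on the left, so 0 is not in (0,33]
default_bins <- cut(scores, breaks = c(0, 33, 66, 100))

# include.lowest = TRUE also closes the first interval on the left
closed_bins <- cut(scores, breaks = c(0, 33, 66, 100),
                   include.lowest = TRUE)

is.na(default_bins)  # TRUE only for the first value
is.na(closed_bins)   # FALSE for every value
```

If a variable can take the value of your lowest break, either start the breaks a little lower or use include.lowest=TRUE.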

Adding Labels

You can also add labels to the bins, but remember that N breaks create N-1 bins, so you need N-1 labels.

set.seed(2)
vector <- rnorm(100)
cut_vector <- cut(vector, breaks=c(-4,-1, -0.5, 0, 0.5, 1, 4), 
    labels=c("Lowest", "Low", "Mid-Low", 
            "Mid-High", "High", "Highest"))
table(cut_vector)
cut_vector
  Lowest      Low  Mid-Low Mid-High     High  Highest 
      20       18       16       15       10       21 

Using Cut in Mutate

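df %>% mutate(
    # ti_cpi is scored so that higher values mean less perceived corruption,
    # so "Low" here is a low CPI score, not a low level of corruption
    corrupt = cut(ti_cpi, breaks=c(0, 33, 66, 100), 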
            labels=c("Low", "Mid", "High"))
            ) %>%
    group_by(corrupt) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())
# A tibble: 4 × 3
  corrupt   mean     n
  <fct>    <dbl> <int>
1 Low      7008.    64
2 Mid     18948.    89
3 High    45831.    26
4 <NA>      NaN     15

Check

There is a variable that captures the fertility rate. Use mutate() to turn it into a variable with 3 categories: one around the replacement rate (2.1), one below it, and one above it.

Then group_by() that new variable and count how many observations fall in each bin.

(It might help to use range() before choosing the bins.)

My Solution

df %>% mutate(
    fert_groups = cut(wdi_fertility, breaks=c(0, 1.8, 2.4, Inf), 
            labels=c("Below", "Replacement", "Above"))
            ) %>%
    group_by(fert_groups) %>% 
    summarize(n = n())
# A tibble: 4 × 2
  fert_groups     n
  <fct>       <int>
1 Below          61
2 Replacement    38
3 Above          86
4 <NA>            9
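As a side note, grouping and then counting is common enough that dplyr has a shortcut for it, count(). The sketch below uses a small made-up data frame (toy and its column fert are hypothetical) rather than the course data:

```r
library(dplyr)

# Made-up fertility rates, including one missing value
toy <- tibble(fert = c(1.2, 1.9, 2.1, 3.5, 4.8, NA))

toy_counts <- toy %>%
    mutate(fert_groups = cut(fert, breaks = c(0, 1.8, 2.4, Inf),
            labels = c("Below", "Replacement", "Above"))) %>%
    count(fert_groups)  # same result as group_by() + summarize(n = n())

toy_counts
```

As in the solution above, the missing value gets its own NA row in the result.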

ggplot2

ggplot2 Introduction

We are going to use another library today: ggplot2

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

library(tidyverse) # will load ggplot2 as well 
#library(ggplot2) # to just load ggplot 

ggplot2 Basics

Every plot is going to start with a call to ggplot() with the arguments that identify the data you want to use and the aesthetics you want to use from that data.

We then tell it how to plot that data by adding geom_*() functions.
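The two steps can be sketched with R's built-in mtcars data so that the example runs without the course file (mtcars and its columns wt and mpg are stand-ins, not part of today's data):

```r
library(ggplot2)

# Step 1: ggplot() names the data and maps columns to aesthetics
base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: a geom_*() layer says how to draw those aesthetics
p_points <- base + geom_point()

p_points  # printing the object draws the plot
```

Printing base on its own gives an empty panel with axes, because no geom has been added yet.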

Simple Plot

ggplot(df, 
    aes(x=mad_gdppc)) +
    geom_histogram() 

Setting the data frame we are accessing

Identifying that we are going to use mad_gdppc as the x variable by using the aes() within ggplot()

Displaying that data as a histogram by adding geom_histogram()

Remember things are linked together with +

Modifying the Style and Labels

We should always provide human-readable labels on plots. You add labs() to set them. It has arguments for the x axis (x=), the y axis (y=), and the title (title=).

ggplot(df, 
    aes(x=mad_gdppc)) +
    geom_histogram() + 
    labs(y="Frequency", x="GDP Per Capita", 
        title="Histogram of GDP per Capita")

Themes

ggplot has a large number of themes that you can use to change the style of your plot.

ggplot(df, 
    aes(x=mad_gdppc)) +
    geom_histogram() + 
    labs(y="Frequency",
        x="GDP Per Capita", 
        title="Histogram of GDP per Capita") +
    theme_minimal()

Different Theme

ggplot(df, 
    aes(x=mad_gdppc)) +
    geom_histogram() + 
    labs(y="Frequency", 
        x="GDP Per Capita", 
        title="Histogram of GDP per Capita") +
    theme_dark()

Adding more Aesthetics

We can also set the color and transparency of the bars to specific values by adding them to the geom_histogram() call.

ggplot(df, 
    aes(x=mad_gdppc)) +
    geom_histogram(fill="green", color="gray", 
            alpha=.5) + 
    labs(y="Frequency", x="GDP Per Capita", 
        title="Histogram of GDP per Capita") + 
    theme_minimal()

Some Common Aesthetics

  • color= the color of lines or points.
  • fill= the fill in color.
  • alpha= how transparent an object is (0 to 1, with 0 as invisible)
  • linewidth= how thick lines are
  • linetype= the types of lines

More info on this page
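The histogram examples here do not use linewidth= or linetype=, so here is a rough sketch that applies them to the bar outlines. It assumes ggplot2 3.4.0 or later (where linewidth= was introduced) and uses the built-in mtcars data as a stand-in for the course data:

```r
library(ggplot2)

p_hist <- ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 10,            # number of bars
            fill = "gray",               # inside of the bars
            color = "black",             # outline of the bars
            alpha = 0.7,                 # slightly transparent
            linewidth = 1,               # thicker outline
            linetype = "dashed")         # dashed outline

p_hist
```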

Check

Pick a different variable, make a histogram of it, fix the labels, and change some of the other aesthetics.

My Answer

ggplot(df, 
    aes(x=vdem_academ)) +
    geom_histogram(fill="gray", color="black") + 
    labs(y="Frequency", x="Academic Freedom (V-Dem)", 
        title="Histogram of Academic Freedom") + 
    theme_minimal()

Scatter Plots and Scales

Saving a Plot in R

Before going further it is helpful to know that you can save your plot (p <-) and then add more things to it:

p1 <- ggplot(df, aes(x=vdem_academ)) 
p1 <- p1 + geom_histogram(fill="gray", color="black") 
p1 <- p1 + labs(y="Frequency", x="Academic Freedom (V-Dem)", 
    title="Histogram of Academic Freedom") 
p1 + theme_minimal() ## Calling the plot with a theme attached

Modifying a Plot

We can then try out other things more easily:

p1 + theme_classic() #Switching to a new theme

Scatter Plots

geom_point() plots a point at each pair of x and y values that you give it:

ggplot(df, aes(x=vdem_academ, y=mad_gdppc)) + 
    geom_point()

Aesthetics

There are again a bunch of aesthetics. Some of the new ones are:

  • shape= the shape of the point (try some different numbers)
  • size= the size of the point

More info

Aesthetics Example

ggplot(df, aes(x=vdem_academ, y=mad_gdppc)) + 
    geom_point(color="purple", shape=4)

Aesthetics with Other Variables

We can also make any of these aesthetics reflect another variable by mapping it inside aes():

ggplot(df, aes(x=vdem_academ, y=mad_gdppc, 
    color=fh_polity2)) + 
    geom_point(size=3)

We have a new label to fix.

Scales and Guides

Each aesthetic has associated functions that can be used to modify the scale it uses. scale_color_gradient() switches the colors to a gradient defined by a high color and a low color.

ggplot(df, aes(x=vdem_academ, y=mad_gdppc, 
    color=fh_polity2)) + 
    geom_point(size=3) + 
    scale_color_gradient(low="pink", high="black") 

Scales and Guides Labels

The scale_*_*() functions all take a name argument as well. It is the first argument they expect, so you can pass it first without naming it.

ggplot(df, aes(x=vdem_academ, y=mad_gdppc, 
    color=fh_polity2)) + 
    geom_point(size=3) + 
    scale_color_gradient("Democracy", 
    low="pink", high="black") 

Other Scales

There are a lot of scales and often there are similar ones that apply to different aesthetics:

  • scale_color_gradient2() and scale_fill_gradient2()
    • Allows you to set a midpoint
  • scale_x_log10() and scale_y_log10()
    • Log transformation of the scale on the x or y axis.
  • scale_color_manual() and scale_fill_manual()
    • Manually pick colors for different categories
  • scale_color_brewer() and scale_fill_brewer()
    • A selection of palettes for categorical data.
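Of the scales in this list, the manual ones may be the least obvious, so here is a hedged sketch of scale_color_manual(). It uses the built-in mtcars data as a stand-in (cyl is turned into a factor so it is treated as categorical), and the colors are arbitrary choices:

```r
library(ggplot2)

p_manual <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point(size = 3) +
    # One color per category; the names must match the category values
    scale_color_manual(name = "Cylinders",
            values = c("4" = "orange", "6" = "purple", "8" = "black"))

p_manual
```

Naming the colors this way ties each one to a category, so the colors stay the same even if the order of the categories changes.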

Adding More Scales

ggplot(df, aes(x=vdem_academ, y=mad_gdppc, color=fh_polity2)) + 
    geom_point(size=3) + 
    scale_y_log10() +
    scale_color_gradient2("Democracy", 
        low="pink", mid="gray", high="green", midpoint=5) + 
    theme_minimal() + 
    labs(y="GDP per Cap\n(Log scale)", x="Academic Freedom", 
        title="GDP vs Academic Freedom")

Check

Pick two interval variables and an additional variable (any type). Make a scatter plot of the two interval variables, using the third variable to color the points.

Label things nicely.

Combining Mutate and ggplot

Making Our Guides Nice

As discussed before, often the values of our variables don’t make for good labels. The easiest way to relabel these is by adjusting the data prior to making a plot.

We can then pipe that data into our new plot.

Example

df %>% filter(!is.na(br_elect)) %>% 
    mutate(br_elect_label = factor(br_elect, levels=c(0, 1, 2, 3), 
        labels=c("None", "Single-party", "Non-dem multi-party", "Democratic")))

Example

df %>% filter(!is.na(br_elect)) %>% 
    mutate(br_elect_label = factor(br_elect, levels=c(0, 1, 2, 3), 
        labels=c("None", "Single-party", "Non-dem multi-party", "Democratic"))) %>% 
    ggplot(aes(x=vdem_academ, y=mad_gdppc, color=br_elect_label)) + 
    geom_point() + 
    scale_y_log10() + # log scale, to match the y-axis label below
    scale_color_brewer("Election Type", type="qual", palette=3) + 
    theme_minimal() + 
    labs(y="GDP per Cap\n(Log scale)", x="Academic Freedom", 
        title="GDP vs Academic Freedom")

Saving Plots

Two ways to save plots:

  • RStudio has an “Export” button above the plots that you can use.
  • Call ggsave() with a file name:
p <- ggplot(...
ggsave(filename="file.png", plot=p, width=6, 
    height=4, units="in")

This saves plot p as a PNG file that is 6 by 4 inches.
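A complete version of the ggsave() call might look like the sketch below. The plot is a stand-in built from the built-in mtcars data, and the file is written to a temporary location (tempfile() and the dpi= value are choices made for this illustration) so the example does not touch your project folder:

```r
library(ggplot2)

# Stand-in plot; replace with your own
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# ggsave() picks the file type from the extension of the file name
out_file <- tempfile(fileext = ".png")
ggsave(filename = out_file, plot = p, width = 6, height = 4,
        units = "in", dpi = 300)

file.exists(out_file)  # check that the file was written
```

In your own work you would normally give a file name inside your project, such as the "file.png" used above.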