Today we are going to keep using a subset of country data from The Quality of Government Institute.
Load the data with read_csv(), and if you need more help check out the slides from the first day.
Our groups could have better names than 0, 1, 2, and 3.
mutate()
If we want to change the values of a variable we use mutate()
General format:
mutate(new_var = f(old_var))
What function though? br_elect is a small set of numbers where each number represents a category. This is a factor.
factor()
factor() creates a factor with a certain set of numeric levels and labels, so it needs two things:
levels=c( )
labels=c( )
library(tidyverse)
df %>% mutate(br_elect_label =
    factor(br_elect, levels=c(0, 1, 2, 3),
      labels=c("None", "Single-party",
        "Non-dem multi-party", "Democratic"))) %>%
  pull(br_elect_label)
[1] Non-dem multi-party Democratic Non-dem multi-party
[4] <NA> Non-dem multi-party Democratic
[7] Non-dem multi-party Democratic Democratic
[10] Democratic Democratic Non-dem multi-party
[13] Non-dem multi-party Democratic Democratic
[16] Democratic Democratic Democratic
[19] Non-dem multi-party Democratic Democratic
[22] Democratic Democratic None
[25] Democratic Non-dem multi-party Non-dem multi-party
[28] Non-dem multi-party Non-dem multi-party Non-dem multi-party
[31] Democratic Democratic Non-dem multi-party
[34] Democratic Non-dem multi-party Democratic
[37] None Democratic Democratic
[40] Democratic Non-dem multi-party Non-dem multi-party
[43] Democratic Democratic Single-party
[46] Democratic Democratic Democratic
[49] Democratic Democratic Democratic
[52] Democratic Democratic Single-party
[55] Non-dem multi-party None Democratic
[58] Non-dem multi-party Democratic Democratic
[61] Non-dem multi-party Non-dem multi-party Democratic
[64] Democratic Democratic Democratic
[67] Democratic Democratic Democratic
[70] Democratic Democratic Democratic
[73] Non-dem multi-party Non-dem multi-party Democratic
[76] Democratic Democratic Democratic
[79] Single-party Non-dem multi-party Democratic
[82] Democratic Democratic Non-dem multi-party
[85] Democratic Democratic Non-dem multi-party
[88] Non-dem multi-party Non-dem multi-party Single-party
[91] Democratic Non-dem multi-party Democratic
[94] Single-party Non-dem multi-party Democratic
[97] Democratic Democratic Non-dem multi-party
[100] Democratic
[ reached getOption("max.print") -- omitted 94 entries ]
Levels: None Single-party Non-dem multi-party Democratic
df %>% filter(!is.na(br_elect)) %>%
  mutate(br_elect_label =
    factor(br_elect, levels=c(0, 1, 2, 3),
      labels=c("None", "Single-party",
        "Non-dem multi-party", "Democratic"))) %>%
  group_by(br_elect_label) %>%
  summarize(mean=mean(mad_gdppc, na.rm=TRUE), n = n())
# A tibble: 4 × 3
br_elect_label mean n
<fct> <dbl> <int>
1 None 52084. 10
2 Single-party 11348. 7
3 Non-dem multi-party 11188. 57
4 Democratic 21459. 118
Often there are variables we want to coarsen (take an interval variable and put its values into categories). Here we can use the cut() function to cut our variable up into categories.
There are a variety of ways to use cut(). The most common is giving it a set of “breaks” that tell it where you want the bin edges to be.
You can also add labels, but remember that N breaks create N-1 bins, so you need N-1 labels.
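As a rough sketch of how breaks and labels line up, the example below bins a made-up vector (the values and label names are assumptions for illustration, not course data). Four breaks give three bins, so three labels are supplied.

```r
# Made-up values for illustration only
x <- c(1, 4, 6, 9)

# 4 breaks -> 3 bins: (0,3], (3,7], (7,10]
binned <- cut(x, breaks = c(0, 3, 7, 10),
              labels = c("Low", "Medium", "High"))

# Count how many values landed in each bin
table(binned)
```

By default each bin includes its upper edge but not its lower one, so a value equal to the lowest break comes back as NA unless you add include.lowest = TRUE.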
There is a variable that captures the fertility rate. Use mutate() to turn it into a variable with 3 categories: one that is around the replacement rate (2.1), one that is below it, and one that is above it.
Then group_by() that new variable and count how many observations are in each bin.
(It might help to use range() before identifying the bins.)
We are going to use another library today: ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. 1
Every plot is going to start with a call to ggplot(), with arguments that identify the data you want to use and the aesthetics you want to use from that data.
We then tell it how to plot that data by adding on geom_*() functions.
Setting the data frame we are accessing
Identifying that we are going to use mad_gdppc as the x variable by using aes() within ggplot()
Displaying that data as a histogram by adding geom_histogram()
Remember things are linked together with +
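The three steps above might look like the sketch below. The data frame here is a small made-up stand-in (named toy, an assumption) that only borrows the mad_gdppc column name; with the course data you would pass df instead.

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p <- ggplot(toy, aes(x = mad_gdppc)) +  # set the data and map x
  geom_histogram(bins = 5)              # draw it as a histogram

p
```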
We should always provide human-readable labels on plots. You add labs() to set them. This has arguments for the x axis (x=), y axis (y=), and the title (title=).
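A minimal sketch of labs() on a histogram, again using a made-up data frame (toy is an assumption; the label text is only an example):

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p <- ggplot(toy, aes(x = mad_gdppc)) +
  geom_histogram(bins = 5) +
  labs(x = "GDP per capita",
       y = "Number of countries",
       title = "Distribution of GDP per capita")

p
```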
ggplot has a large number of themes that you can use to change the style of your plot.
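Themes are added to a plot like any other layer. The sketch below uses theme_minimal(), one of the built-in themes, on a made-up data frame (toy is an assumption); theme_bw() and theme_classic() can be swapped in the same way.

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p <- ggplot(toy, aes(x = mad_gdppc)) +
  geom_histogram(bins = 5) +
  theme_minimal()  # swap for theme_bw() or theme_classic() to compare

p
```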
We can also modify the color and transparency of the bars to specific values by adding them to the geom_histogram() call:
color= the color of lines or points.
fill= the fill color.
alpha= how transparent an object is (0 to 1, with 0 as invisible).
linewidth= how thick lines are.
linetype= the type of line.
Identify a different variable, make a histogram of it, correct the labels, and change some of the other aesthetics.
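As an illustration of the fixed aesthetics listed above, here is a sketch that sets each of them inside geom_histogram(). The data frame and the particular colors are assumptions chosen for the example.

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p <- ggplot(toy, aes(x = mad_gdppc)) +
  geom_histogram(bins = 5,
                 color = "black",      # outline of each bar
                 fill = "steelblue",   # inside of each bar
                 alpha = 0.7,          # slightly transparent
                 linewidth = 0.5,      # outline thickness
                 linetype = "dashed")  # outline style

p
```

Because these are set outside aes(), every bar gets the same value.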
Before going further it is helpful to know that you can save your plot (p <-) and then add more things to it.
We can then try out other things more easily:
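A small sketch of that workflow, using a made-up data frame (toy, p_labeled, and p_classic are names invented for this example):

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

# Save the base plot once
p <- ggplot(toy, aes(x = mad_gdppc)) +
  geom_histogram(bins = 5)

# Then try variations without retyping the base plot
p_labeled <- p + labs(x = "GDP per capita")
p_classic <- p + theme_classic()

p_labeled
p_classic
```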
geom_point() is going to plot points at the x and y values that you give it:
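A minimal scatter plot sketch. The data frame is made up and only borrows the vdem_academ and mad_gdppc column names; with the course data you would pass df instead.

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(
  vdem_academ = c(0.2, 0.5, 0.7, 0.9, 0.95),
  mad_gdppc   = c(3000, 9000, 15000, 30000, 45000)
)

p <- ggplot(toy, aes(x = vdem_academ, y = mad_gdppc)) +
  geom_point()  # one point per row

p
```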
There are again a bunch of aesthetics. Some of the new ones are:
shape= the shape of the point (try some different numbers).
size= the size of the point.
We can also set any of these aesthetics to reflect another variable.
We have a new label to fix.
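The sketch below sets shape and size to fixed values, maps color to a third variable inside aes(), and then fixes the new legend label with labs(). The data frame and the life_exp column are hypothetical, invented only for this example.

```r
library(ggplot2)

# Made-up data; life_exp is a hypothetical third variable
toy <- data.frame(
  vdem_academ = c(0.2, 0.5, 0.7, 0.9, 0.95),
  mad_gdppc   = c(3000, 9000, 15000, 30000, 45000),
  life_exp    = c(58, 66, 71, 78, 82)
)

p <- ggplot(toy, aes(x = vdem_academ, y = mad_gdppc, color = life_exp)) +
  geom_point(shape = 17, size = 3) +  # fixed values: same for every point
  labs(color = "Life expectancy")     # relabel the legend for the mapped aesthetic

p
```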
Each aesthetic has associated functions that can be used to modify the scale it uses. scale_color_gradient() switches the colors to a gradient defined by a high color and a low color.
The scale_*_*() functions all take a name argument as well. It is the first argument they expect, so you can just put it first.
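As a sketch, here is scale_color_gradient() with the legend name given as the first argument and explicit low and high colors. The data frame, the life_exp column, and the color choices are assumptions for illustration.

```r
library(ggplot2)

# Made-up data; life_exp is a hypothetical third variable
toy <- data.frame(
  vdem_academ = c(0.2, 0.5, 0.7, 0.9, 0.95),
  mad_gdppc   = c(3000, 9000, 15000, 30000, 45000),
  life_exp    = c(58, 66, 71, 78, 82)
)

p <- ggplot(toy, aes(x = vdem_academ, y = mad_gdppc, color = life_exp)) +
  geom_point() +
  # first argument is the legend name; low/high define the gradient ends
  scale_color_gradient("Life expectancy", low = "blue", high = "red")

p
```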
There are a lot of scales, and often there are similar ones that apply to different aesthetics:
scale_color_gradient2() and scale_fill_gradient2()
scale_x_log10() and scale_y_log10()
scale_color_manual() and scale_fill_manual()
scale_color_brewer() and scale_fill_brewer()
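To illustrate one of these, the sketch below puts the y axis on a log scale with scale_y_log10(). The data frame is made up for the example.

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(
  vdem_academ = c(0.2, 0.5, 0.7, 0.9, 0.95),
  mad_gdppc   = c(3000, 9000, 15000, 30000, 45000)
)

p <- ggplot(toy, aes(x = vdem_academ, y = mad_gdppc)) +
  geom_point() +
  scale_y_log10("GDP per capita (log scale)")  # name first, as with other scales

p
```

scale_x_log10() works the same way for the x axis.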
Pick two interval variables, and an additional variable (any type). Make a scatter plot of the variables using the third variable to color in the points.
Label things nicely.
As discussed before, often the values of our variables don’t make for good labels. The easiest way to relabel these is by adjusting the data prior to making a plot.
We can then pipe that data into our new plot
df %>% filter(!is.na(br_elect)) %>%
  mutate(br_elect_label = factor(br_elect, levels=c(0, 1, 2, 3),
    labels=c("None", "Single-party", "Non-dem multi-party", "Democratic"))) %>%
  ggplot(aes(x=vdem_academ, y=mad_gdppc, color=br_elect_label)) +
  geom_point() +
  scale_y_log10() +  # log-scale the y axis, matching the label below
  scale_color_brewer("Election Type", type="qual", palette = 3) +
  theme_minimal() +
  labs(y="GDP per Cap\n(Log scale)", x="Academic Freedom",
    title="GDP vs Academic Freedom")
Use ggsave() with a file name to save a plot. For example, you can save plot p as a png with a size of 6 by 4 inches.
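A sketch of such a call is below. The file name, the temporary directory, and the toy data are assumptions for this example; in practice you would give a path inside your project.

```r
library(ggplot2)

# Made-up stand-in for the course data
toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p <- ggplot(toy, aes(x = mad_gdppc)) +
  geom_histogram(bins = 5)

# Hypothetical output path; the .png extension picks the file type
out_file <- file.path(tempdir(), "gdp_histogram.png")

# width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)
```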