Today we are going to keep using a subset of country data from The Quality of Government Institute.
Load the data with read_csv(), and if you need more help check out the first day of slides.
If the tidyverse is not installed yet, run install.packages(c("tidyverse")) first.
We are going to use a better measure of electoral type: br_elect, which creates 4 categories (No elections = 0, Single-party elections = 1, Non-democratic multi-party elections = 2, Democratic elections = 3).
mutate()
If we want to change the values of a variable we use mutate()
General format:
mutate(new_var = f(old_var))
What function though? br_elect is a small set of numbers where each represents a category. This is a factor.
factor()
factor() creates a factor with a given set of numeric levels and labels, so it needs two things:
levels=c( )
labels=c( )
[1] Non-dem multi-party Democratic Non-dem multi-party
[4] <NA> Non-dem multi-party Democratic
[7] Non-dem multi-party Democratic Democratic
[10] Democratic Democratic Non-dem multi-party
[13] Non-dem multi-party Democratic Democratic
[16] Democratic Democratic Democratic
[19] Non-dem multi-party Democratic Democratic
[22] Democratic Democratic None
[25] Democratic Non-dem multi-party Non-dem multi-party
[28] Non-dem multi-party Non-dem multi-party Non-dem multi-party
[31] Democratic Democratic Non-dem multi-party
[34] Democratic Non-dem multi-party Democratic
[37] None Democratic Democratic
[40] Democratic Non-dem multi-party Non-dem multi-party
[43] Democratic Democratic Single-party
[46] Democratic Democratic Democratic
[49] Democratic Democratic Democratic
[52] Democratic Democratic Single-party
[55] Non-dem multi-party None Democratic
[58] Non-dem multi-party Democratic Democratic
[61] Non-dem multi-party Non-dem multi-party Democratic
[64] Democratic Democratic Democratic
[67] Democratic Democratic Democratic
[70] Democratic Democratic Democratic
[73] Non-dem multi-party Non-dem multi-party Democratic
[76] Democratic Democratic Democratic
[79] Single-party Non-dem multi-party Democratic
[82] Democratic Democratic Non-dem multi-party
[85] Democratic Democratic Non-dem multi-party
[88] Non-dem multi-party Non-dem multi-party Single-party
[91] Democratic Non-dem multi-party Democratic
[94] Single-party Non-dem multi-party Democratic
[97] Democratic Democratic Non-dem multi-party
[100] Democratic Democratic Democratic
[103] Non-dem multi-party Democratic Democratic
[106] Non-dem multi-party Democratic Democratic
[109] Non-dem multi-party Democratic Democratic
[112] Democratic Democratic Democratic
[115] Non-dem multi-party Non-dem multi-party Non-dem multi-party
[118] Non-dem multi-party Non-dem multi-party Democratic
[121] Democratic Democratic Democratic
[124] Democratic Non-dem multi-party Democratic
[127] Democratic Democratic Democratic
[130] Democratic Democratic Democratic
[133] Democratic Democratic Democratic
[136] Democratic Democratic Democratic
[139] Democratic Democratic Democratic
[142] None Democratic Non-dem multi-party
[145] Non-dem multi-party Democratic Democratic
[148] Democratic <NA> Democratic
[151] None Democratic Democratic
[154] Democratic Democratic Non-dem multi-party
[157] Democratic Single-party Democratic
[160] None Non-dem multi-party Non-dem multi-party
[163] Democratic None Non-dem multi-party
[166] Democratic Non-dem multi-party Democratic
[169] Democratic Non-dem multi-party Non-dem multi-party
[172] None Non-dem multi-party Democratic
[175] Democratic None Democratic
[178] Non-dem multi-party Non-dem multi-party Democratic
[181] Non-dem multi-party Democratic Democratic
[184] Non-dem multi-party Democratic Non-dem multi-party
[187] Democratic Non-dem multi-party Democratic
[190] Non-dem multi-party Single-party Non-dem multi-party
[193] None Democratic
Levels: None Single-party Non-dem multi-party Democratic
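Output like the above comes from a factor() call with both levels= and labels= set. Here is a minimal sketch on a toy vector. The seven values are made up for illustration, while the levels and labels are the ones used for br_elect in these slides.

```r
# Toy stand-in for the br_elect column (values are made up for illustration)
br_elect <- c(2, 3, 2, NA, 0, 1, 3)

br_elect_label <- factor(
  br_elect,
  levels = c(0, 1, 2, 3),
  labels = c("None", "Single-party", "Non-dem multi-party", "Democratic")
)

br_elect_label                          # prints the labels instead of the numbers
levels(br_elect_label)                  # the four labels, in level order
table(br_elect_label, useNA = "ifany")  # counts per category, including NA
```

Missing values stay missing: the NA in the toy vector is printed as <NA>, just as in the output above.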
# A tibble: 4 × 3
  br_elect_label        mean     n
  <fct>                <dbl> <int>
1 None                52084.    10
2 Single-party        11348.     7
3 Non-dem multi-party 11188.    57
4 Democratic          21459.   118
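A summary table shaped like the one above can be built by creating the factor with mutate(), then grouping and summarising. This is a sketch on toy data. The output does not say which variable was averaged, so using mad_gdppc here is an assumption, and all values are made up.

```r
library(dplyr)

# Toy data standing in for the course data frame; all values are made up
toy <- tibble(
  br_elect  = c(0, 1, 2, 2, 3, 3, 3, NA),
  mad_gdppc = c(5000, 3000, 4000, 6000, 20000, 30000, 25000, 1000)
)

by_type <- toy %>%
  filter(!is.na(br_elect)) %>%                       # drop rows with no election type
  mutate(br_elect_label = factor(
    br_elect,
    levels = c(0, 1, 2, 3),
    labels = c("None", "Single-party", "Non-dem multi-party", "Democratic")
  )) %>%
  group_by(br_elect_label) %>%
  summarise(mean = mean(mad_gdppc, na.rm = TRUE),    # assumed summary variable
            n = n())

by_type
```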
Often there are variables we want to coarsen (take them from interval values and put them into categories). Here we can use the case_when() function to find cases that match particular rules and replace their values.
The general format of case_when:
First var is checked to see if it is below VALUE, and if so “New Label” is returned. Then, if var is greater than VALUE_2, “New Label” is returned. If nothing is found to be true, the default of NA is used.
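The general format described above might look like the following sketch. The names var, VALUE and VALUE_2 are the placeholders from the text. The concrete numbers and the two label strings are made up so the example runs.

```r
library(dplyr)

# Placeholder cutoffs and data; the numbers are made up for illustration
VALUE   <- 10
VALUE_2 <- 20
toy <- tibble(var = c(5, 15, 25, NA))

toy <- toy %>%
  mutate(new_var = case_when(
    var < VALUE   ~ "Below VALUE",    # checked first
    var > VALUE_2 ~ "Above VALUE_2"   # checked only if the first rule was not TRUE
  ))

toy$new_var
```

The 15 matches neither rule and the NA cannot be compared, so both end up as NA in new_var.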
# A tibble: 4 × 3
  corrupt   mean     n
  <chr>    <dbl> <int>
1 High    45204.    27
2 Low      7043.    61
3 Mid     18351.    91
4 <NA>      NaN     15
Because case_when() goes in order, only those that are greater than 33 will get to the ti_cpi < 66 check.
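To see the ordering at work, here is a sketch that coarsens ti_cpi with cutoffs at 33 and 66. The exact rules and which label goes with which range are assumptions, since only the cutoffs and the labels High, Mid and Low appear in the slides. The toy scores are made up.

```r
library(dplyr)

# Toy scores; values are made up for illustration
toy <- tibble(ti_cpi = c(20, 33, 50, 66, 90, NA))

toy <- toy %>%
  mutate(cpi_group = case_when(
    ti_cpi <= 33 ~ "Low",   # assumed first rule
    ti_cpi < 66  ~ "Mid",   # only reached by scores greater than 33
    ti_cpi >= 66 ~ "High"   # everything left that is not missing
  ))

toy %>% count(cpi_group)
```

The second rule does not need a lower bound, because any score of 33 or less was already labelled by the first rule.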
There is a variable that captures the fertility rate. Mutate it into a variable with 3 categories: one that is around the replacement rate (between 1.9 and 2.3), one that is below that, and one that is above that (the variable goes from 0 to almost 7).
Then group_by() that new variable and count up how many observations are in each bin.
We are going to use another library today: ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. 1
Every plot is going to start with a call to ggplot() with the arguments that identify the data you want to use and the aesthetics you want to use from that data.
We then tell it how to plot that data by adding on geom_*() functions.
Setting the data frame we are accessing
Identifying that we are going to use mad_gdppc as the x variable by using aes() within ggplot()
Displaying that data as a histogram by adding geom_histogram()
Remember things are linked together with +
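Put together, the steps above might look like this sketch. The data frame df_toy and its values are made up so the example runs on its own, and bins = 5 is chosen only because the toy data is tiny.

```r
library(ggplot2)

# Toy data frame standing in for the course data; values are made up
df_toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p_hist <- ggplot(df_toy, aes(x = mad_gdppc)) +  # data frame + x aesthetic
  geom_histogram(bins = 5)                      # display it as a histogram

p_hist
```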
We should always provide human readable labels on plots. You add labs() to set them. This has arguments for the x axis (x=), y axis (y=) and the title (title=).
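As a sketch, labs() is added to the plot like any other layer. The data and the label text below are made up for illustration.

```r
library(ggplot2)

# Toy data frame; values are made up
df_toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p_labeled <- ggplot(df_toy, aes(x = mad_gdppc)) +
  geom_histogram(bins = 5) +
  labs(x = "GDP per capita",                     # x axis label
       y = "Number of countries",                # y axis label
       title = "Distribution of GDP per capita") # plot title

p_labeled
```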
ggplot2 has a large number of themes that you can use to change the style of your plot.
We can also modify the color and transparency of the bars to specific values by adding them to the geom_histogram() call:
color= the color of lines or points.
fill= the fill in color.
alpha= how transparent an object is (0 to 1, with 0 as invisible).
linewidth= how thick lines are.
linetype= the types of lines.
Identify a different variable, make a histogram of it, correct the labels, and change some of the other aesthetics.
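For reference, fixed aesthetics such as color=, fill= and alpha= go inside the geom call, outside of aes(). The sketch below uses toy data, and the particular colors and values are arbitrary choices.

```r
library(ggplot2)

# Toy data frame; values are made up
df_toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

p_styled <- ggplot(df_toy, aes(x = mad_gdppc)) +
  geom_histogram(
    bins      = 5,
    color     = "black",      # outline of each bar
    fill      = "steelblue",  # inside of each bar
    alpha     = 0.6,          # 0 = invisible, 1 = fully opaque
    linewidth = 0.5,          # thickness of the outline
    linetype  = "dashed"      # style of the outline
  )

p_styled
```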
Before going further it is helpful to know that you can save your plot (p <- ) and then add more things to it:
We can then try out other things more easily:
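A sketch of that workflow, again on made-up data: the base plot is stored once in p, and different themes are then tried by adding them to p.

```r
library(ggplot2)

# Toy data frame; values are made up
df_toy <- data.frame(mad_gdppc = c(1200, 3500, 8000, 15000, 22000, 41000, 56000))

# Save the base plot once
p <- ggplot(df_toy, aes(x = mad_gdppc)) +
  geom_histogram(bins = 5)

# Then add to it without retyping everything
p_minimal <- p + theme_minimal()
p_bw      <- p + theme_bw()

p_minimal
```

Adding to p does not change p itself, so each variant starts from the same base plot.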
geom_point() is going to plot points at the x and y values that you give it:
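A minimal scatter plot sketch, using the variable names vdem_academ and mad_gdppc from these slides with made-up values:

```r
library(ggplot2)

# Toy data frame; values are made up
df_toy <- data.frame(
  vdem_academ = c(0.10, 0.35, 0.50, 0.70, 0.85, 0.95),
  mad_gdppc   = c(2500, 4000, 9000, 15000, 30000, 45000)
)

p_points <- ggplot(df_toy, aes(x = vdem_academ, y = mad_gdppc)) +
  geom_point()  # one point per row at (x, y)

p_points
```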
There are again a bunch of aesthetics, some of the new ones are:
shape= the shape of the point (try some different numbers)
size= the size of the point
We can also set any of these aesthetics to reflect another variable
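The difference between a fixed aesthetic and one that reflects a variable is where it is written. In the sketch below, shape and size are fixed inside geom_point(), while color is mapped to a variable inside aes(). Using ti_cpi as the third variable is an assumption, and the values are made up.

```r
library(ggplot2)

# Toy data frame; values are made up
df_toy <- data.frame(
  vdem_academ = c(0.10, 0.35, 0.50, 0.70, 0.85, 0.95),
  mad_gdppc   = c(2500, 4000, 9000, 15000, 30000, 45000),
  ti_cpi      = c(20, 30, 45, 55, 70, 85)
)

p_mapped <- ggplot(df_toy, aes(x = vdem_academ, y = mad_gdppc,
                               color = ti_cpi)) +  # color follows the data
  geom_point(shape = 17, size = 3)                 # same shape and size for all points

p_mapped
```

The legend that appears is titled with the raw variable name until it is relabelled.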
We have a new label to fix.
Each aesthetic has associated functions that can be used to modify the scale it uses. scale_color_gradient() switches the colors to a gradient defined by a high color and a low color.
The scale_*_*() functions all take a name argument as well. It is the first thing they expect, so you can just put it first.
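As a sketch, the scale name goes first and then the low and high colors. The data, the legend title and the two colors are made up for illustration.

```r
library(ggplot2)

# Toy data frame; values are made up
df_toy <- data.frame(
  vdem_academ = c(0.10, 0.35, 0.50, 0.70, 0.85, 0.95),
  mad_gdppc   = c(2500, 4000, 9000, 15000, 30000, 45000),
  ti_cpi      = c(20, 30, 45, 55, 70, 85)
)

p_gradient <- ggplot(df_toy, aes(x = vdem_academ, y = mad_gdppc, color = ti_cpi)) +
  geom_point(size = 3) +
  scale_color_gradient("Corruption Perceptions Index",  # legend title (name)
                       low = "red", high = "blue")      # ends of the gradient

p_gradient
```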
There are a lot of scales and often there are similar ones that apply to different aesthetics:
scale_color_gradient2() and scale_fill_gradient2()
scale_x_log10() and scale_y_log10()
scale_color_manual() and scale_fill_manual()
scale_color_brewer() and scale_fill_brewer()
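Two of the scales listed above can be combined in one plot, as in this sketch. The grouping column group, its values A and B, and the chosen colors are all hypothetical.

```r
library(ggplot2)

# Toy data frame; `group` is a hypothetical categorical variable
df_toy <- data.frame(
  vdem_academ = c(0.10, 0.35, 0.50, 0.70, 0.85, 0.95),
  mad_gdppc   = c(2500, 4000, 9000, 15000, 30000, 45000),
  group       = c("A", "A", "B", "A", "B", "B")
)

p_scales <- ggplot(df_toy, aes(x = vdem_academ, y = mad_gdppc, color = group)) +
  geom_point(size = 3) +
  scale_y_log10() +                                       # log scale on the y axis
  scale_color_manual("Group",
                     values = c(A = "darkorange", B = "purple"))  # one color per category

p_scales
```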
Pick two interval variables, and an additional variable (any type). Make a scatter plot of the variables using the third variable to color in the points.
Label things nicely.
As discussed before, often the values of our variables don’t make for good labels. The easiest way to relabel these is by adjusting the data prior to making a plot.
We can then pipe that data into our new plot
df %>%
  filter(!is.na(br_elect)) %>%
  # Turn the numeric codes into a labelled factor before plotting
  mutate(br_elect_label = factor(br_elect, levels = c(0, 1, 2, 3),
                                 labels = c("None", "Single-party", "Non-dem multi-party", "Democratic"))) %>%
  ggplot(aes(x = vdem_academ, y = mad_gdppc, color = br_elect_label)) +
  geom_point() +
  scale_y_log10() +  # the y label says log scale, so put the axis on one
  scale_color_brewer("Election Type", type = "qual", palette = 3) +
  theme_minimal() +
  labs(y = "GDP per Cap\n(Log scale)", x = "Academic Freedom",
       title = "GDP vs Academic Freedom")
ggsave() with a file name saves a plot to disk, for example plot p as a png with a size of 6 by 4 inches.
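A sketch of such a call follows. The file name is hypothetical and the plot is built from made-up data. The .png extension tells ggsave() which file type to write, and width and height are in inches.

```r
library(ggplot2)

# Toy data frame; values are made up
df_toy <- data.frame(
  vdem_academ = c(0.10, 0.35, 0.50, 0.70, 0.85, 0.95),
  mad_gdppc   = c(2500, 4000, 9000, 15000, 30000, 45000)
)

p <- ggplot(df_toy, aes(x = vdem_academ, y = mad_gdppc)) +
  geom_point()

# Hypothetical file name; written to the current working directory
ggsave("gdp_vs_academic_freedom.png", plot = p,
       width = 6, height = 4, units = "in")
```

If plot = is left out, ggsave() saves the last plot that was displayed.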