Run:
install.packages("tidyverse")
There might be a popup asking about installing packages from “source”; you can answer no.
Today we are going to use a subset of country data from the Quality of Government Institute.
The data are loaded with read_csv().
There is a description of all the variables I’ve included here.
For now, though, we are going to use a few of them:
bmr_demdur is how long the country has been in the same regime type category.
top_top1_income_share is the proportion of income that goes to the top 1%.
Often you want to select just specific rows of data that meet certain requirements.
We need to include some more operators to do this:
< less than and > greater than
<= less than or equal to and >= greater than or equal to
== equal to and != not equal to
We can do the same thing but using a variable from our dataset:
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[13] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[37] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 94 entries ]
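To see a logical check on a column in isolation, here is a minimal sketch on a made-up data frame. The name `toy` and all of its values are invented for illustration; they are not taken from the course data. Comparing a column to a number gives back one TRUE or FALSE per row.

```r
# Hypothetical toy data; the real df has many more rows and columns
toy <- data.frame(
  cname = c("A", "B", "C", "D"),
  bmr_demdur = c(120, 15, 101, 60)
)

# One logical value per row: is the regime duration above 100?
long_regime <- toy$bmr_demdur > 100
long_regime

# Summing a logical vector counts how many values are TRUE
n_long <- sum(long_regime)
n_long
```

Counting the TRUE values this way is a quick check on how many rows a filter would keep.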
We can use logical checks to filter our data.
The filter() function is part of the dplyr package in the tidyverse.
Note
Within the filter() call you do not need to use data$ before the variable name; it already knows you are using the data you put in the first argument.
library(tidyverse)
# df <- read_csv("country_data.csv") ## remember I did this already
filter(df, bmr_demdur>100)
# A tibble: 19 × 31
cname ccode ti_cpi vdem_…¹ wdi_f…² wdi_afp bl_as…³ wdi_e…⁴ wdi_e…⁵ wef_iu
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghani… 4 16 0.560 4.47 2.64 4.83 4.06 NA NA
2 Austral… 36 77 0.847 1.74 0.438 12.5 5.12 62.9 86.5
3 Belgium 56 75 0.967 1.62 0.619 11.6 6.41 6.11 88.7
4 Bhutan 64 68 0.541 1.98 NA NA 6.85 NA NA
5 Canada 124 81 0.919 1.50 0.356 12.9 NA 9.84 91
6 China 156 39 0.0960 1.69 0.343 8.71 3.51 70.3 54.3
7 Finland 246 85 0.947 1.41 0.919 11.3 6.38 8.30 88.9
8 Haiti 332 20 0.684 2.93 0.00991 5.32 2.78 0 32.5
9 Iceland 352 76 0.925 1.71 0 9.94 7.66 0 99.0
10 Iran (I… 364 28 0.122 2.14 2.34 9.36 3.96 0.165 70.0
11 Luxembo… 442 81 0.946 1.38 0.628 12.0 3.57 0 97.1
12 Oman 512 52 0.229 2.89 1.80 NA NA 0 80.2
13 Netherl… 528 82 0.930 1.59 0.449 11.8 5.18 38.7 94.7
14 New Zea… 554 87 0.897 1.71 0.341 11.0 6.28 4.25 90.8
15 Norway 578 84 0.934 1.56 0.830 12.7 7.91 0.105 96.5
16 Sweden 752 85 0.964 1.76 0.281 12.0 7.57 0.667 92.1
17 Switzer… 756 85 0.959 1.52 0.433 12.2 5.13 0 89.7
18 United … 826 80 0.926 1.68 0.432 12.9 5.44 22.8 94.9
19 United … 840 71 0.910 1.73 0.833 12.8 NA 34.2 87.3
# … with 21 more variables: wdi_foodins <dbl>, ht_colonial <dbl>,
# lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>, cai_physical <dbl>,
# ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>, wdi_homicides <dbl>,
# ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>, br_elect <dbl>,
# van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>, vdem_polyarchy <dbl>,
# mad_gdppc <dbl>, top_top1_income_share <dbl>, wef_sp <dbl>, and abbreviated
# variable names ¹vdem_academ, ²wdi_fertility, ³bl_asymf, ⁴wdi_expedu, …
What if we want to check whether our rows meet multiple conditions? Then we need logical operators.
! means "not" (e.g. !TRUE is FALSE)
& means "and"
| means "or" (shift + backslash)
& returns TRUE if both values are TRUE
| returns TRUE if at least one value is TRUE
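A small sketch of the three operators in base R. The vector `x` is made up for illustration. On vectors the operators work element by element, which is what lets us use them on a column of data.

```r
# & needs both sides to be TRUE
and_result <- TRUE & FALSE

# | needs at least one side to be TRUE
or_result <- TRUE | FALSE

# ! flips a logical value
not_result <- !TRUE

# Element-by-element checks on a made-up vector
x <- c(5, 50, 150)
between <- x > 10 & x < 100   # larger than 10 AND smaller than 100
outside <- x < 10 | x > 100   # smaller than 10 OR larger than 100
between
outside
```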
We can then combine logical checks together.
Let's collect countries with more than 100 years of the same regime type where more than 10% of income goes to the top 1%.
[1] "Afghanistan"
[2] "Australia"
[3] "Bhutan"
[4] "Canada"
[5] "China"
[6] "Finland"
[7] "Haiti"
[8] "Iran (Islamic Republic of)"
[9] "Luxembourg"
[10] "Oman"
[11] "New Zealand"
[12] "Norway"
[13] "Switzerland"
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"
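One way a combined check like this could be written, sketched on a made-up data frame rather than the course data (the `toy` object and its values are assumptions for illustration). Both conditions go inside filter(), joined with &.

```r
library(dplyr)

# Hypothetical toy data with the two variables used in the check
toy <- data.frame(
  cname = c("A", "B", "C", "D"),
  bmr_demdur = c(120, 15, 101, 130),
  top_top1_income_share = c(0.15, 0.20, 0.08, NA)
)

# Keep rows that pass BOTH checks
both <- filter(toy, bmr_demdur > 100 & top_top1_income_share > 0.1)
both$cname
```

Country D has a long regime duration but a missing income share, so the combined check is NA for that row, and filter() drops rows where the condition is NA.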
Create two new datasets.
[1] "Angola" "Central African Republic (the)"
[3] "Chile" "Malawi"
[5] "Mexico" "Mozambique"
[7] "Oman"
[1] "Algeria"
[2] "Austria"
[3] "Barbados"
[4] "Botswana"
[5] "Myanmar"
[6] "Cambodia"
[7] "Cameroon"
[8] "Chad"
[9] "Colombia"
[10] "Congo (the)"
[11] "Congo (the Democratic Republic of the)"
[12] "Costa Rica"
[13] "Cuba"
[14] "Denmark"
[15] "Dominican Republic (the)"
[16] "Equatorial Guinea"
[17] "France"
[18] "Gabon"
[19] "Guinea"
[20] "India"
[21] "Iraq"
[22] "Ireland"
[23] "Israel"
[24] "Italy"
[25] "Côte d'Ivoire"
[26] "Jamaica"
[27] "Japan"
[28] "Jordan"
[29] "Korea (the Democratic People's Republic of)"
[30] "Kuwait"
[31] "Lao People's Democratic Republic (the)"
[32] "Libya"
[33] "Malaysia"
[34] "Malta"
[35] "Mauritania"
[36] "Mauritius"
[37] "Morocco"
[38] "Nauru"
[39] "Rwanda"
[40] "San Marino"
[41] "Saudi Arabia"
[42] "Singapore"
[43] "Zimbabwe"
[44] "Eswatini"
[45] "Syrian Arab Republic (the)"
[46] "Togo"
[47] "Trinidad and Tobago"
[48] "Egypt"
[49] "Tanzania, the United Republic of"
[50] "Samoa"
Tidyverse syntax makes use of pipes to chain multiple functions together.
You put a pipe (%>%) in between each step.
For example (in pseudo-code):
Output <- Step 1(Input) %>% Step 2() %>% Step 3()
Translation: Take the Input, apply Step 1 to it, then take the output of Step 1 and apply Step 2 to it, then take the output of Step 2 and apply Step 3 to it, and finally store the output of Step 3 as Output.
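To make the translation concrete, here is a sketch that writes the same two steps once as a nested call and once with pipes. The `toy` data frame is invented for illustration. Both versions give the same result; the piped one reads top to bottom instead of inside out.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  cname = c("A", "B", "C"),
  bmr_demdur = c(120, 15, 101)
)

# Nested: read from the inside out
nested <- pull(filter(toy, bmr_demdur > 100), cname)

# Piped: take toy, then filter it, then pull out one column
piped <- toy %>%
  filter(bmr_demdur > 100) %>%
  pull(cname)

identical(nested, piped)
```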
[1] "Afghanistan"
[2] "Australia"
[3] "Bhutan"
[4] "Canada"
[5] "China"
[6] "Finland"
[7] "Haiti"
[8] "Iran (Islamic Republic of)"
[9] "Luxembourg"
[10] "Oman"
[11] "New Zealand"
[12] "Norway"
[13] "Switzerland"
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"
What does the pull() function do? It is another way to access a certain column in your data.
[1] "Afghanistan"
[2] "Australia"
[3] "Bhutan"
[4] "Canada"
[5] "China"
[6] "Finland"
[7] "Haiti"
[8] "Iran (Islamic Republic of)"
[9] "Luxembourg"
[10] "Oman"
[11] "New Zealand"
[12] "Norway"
[13] "Switzerland"
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"
[1] "Afghanistan"
[2] "Australia"
[3] "Bhutan"
[4] "Canada"
[5] "China"
[6] "Finland"
[7] "Haiti"
[8] "Iran (Islamic Republic of)"
[9] "Luxembourg"
[10] "Oman"
[11] "New Zealand"
[12] "Norway"
[13] "Switzerland"
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"
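A short sketch comparing the ways of getting at a column, using a made-up `toy` data frame (an assumption for illustration). pull() and $ both hand back a plain vector, while select() keeps a data frame with one column.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  cname = c("A", "B", "C"),
  bmr_demdur = c(120, 15, 101)
)

by_dollar <- toy$cname               # base R
by_pull <- pull(toy, cname)          # tidyverse, data as first argument
by_pull_pipe <- toy %>% pull(cname)  # same thing at the end of a pipe

# select() returns a one-column data frame, not a vector
one_col <- toy %>% select(cname)
class(one_col)
```

Because pull() takes the data as its first argument, it fits at the end of a pipe chain where $ would be awkward.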
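%>% has been around for a while in the tidyverse.
Newer versions of R (4.1 and later) also have a built-in pipe that you can use instead: |>
For what we do here, %>% is the same as |>
Yes this is all kind of silly and strange.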
One of the most useful tidyverse functions is summarize().
summarize() transforms data by applying a function(s) to columns in the data.
What if we want to figure out the mean regime type length for our data?
# A tibble: 1 × 1
`mean(bmr_demdur)`
<dbl>
1 47.4
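A sketch of what a summarize() call of this kind can look like, on a made-up `toy` data frame (the object and its values are assumptions, so the numbers differ from the output above). Naming the result inside summarize() gives a cleaner column name than the default.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(bmr_demdur = c(120, 15, 101, 60))

# Unnamed: the column is called `mean(bmr_demdur)`
toy %>% summarize(mean(bmr_demdur))

# Named: pick your own column names, and ask for more than one statistic
named <- toy %>%
  summarize(mean_dur = mean(bmr_demdur),
            median_dur = median(bmr_demdur))
named
```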
What if we want to calculate other statistics?
You generally want to use functions that only return 1 value. Why?
What if we want to figure out the average income share for the top 1% for countries that have had the same regime type for more than 100 years?
# A tibble: 1 × 1
`mean(top_top1_income_share)`
<dbl>
1 0.154
There is also a function specifically for the number of observations: n()
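A sketch that puts filter(), summarize(), and n() together on a made-up `toy` data frame (invented values, not the course data). n() takes no arguments and counts the rows left after the filter.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  bmr_demdur = c(120, 15, 101, 130),
  top_top1_income_share = c(0.15, 0.20, 0.08, 0.10)
)

res <- toy %>%
  filter(bmr_demdur > 100) %>%
  summarize(mean_share = mean(top_top1_income_share),
            n = n())  # number of rows that passed the filter
res
```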
Find the mean and median regime type duration for countries that have more than 0.05 of their income going to the top 1%. Include the number of observations as well.
Then find the mean and median GDP per capita (mad_gdppc) along with the number of observations for countries with a regime type duration of less than 50 years. There are missing values in this variable. What do we do to ignore them?
df %>% filter(top_top1_income_share > 0.05) %>%
  summarize(mean = mean(bmr_demdur), median = median(bmr_demdur))
# A tibble: 1 × 2
mean median
<dbl> <dbl>
1 49.0 34
df %>% filter(bmr_demdur < 50) %>%
  summarize(mean = mean(mad_gdppc, na.rm = TRUE),
            median = median(mad_gdppc, na.rm = TRUE),
            n = n())
# A tibble: 1 × 3
mean median n
<dbl> <dbl> <int>
1 15144. 10907. 124
Note
You can use multiple lines with pipes. It is common to put the pipe at the end of each line and indent the next line.
Often we want to provide summaries of groups within the data. For example: how does the GDP vary by type of political regime?
Here we’ll use the group_by() function to create groups of our data.
group_by() alone
group_by() expects the variable(s) that you want to use to group your dataset:
# A tibble: 194 × 31
# Groups: br_elect [5]
cname ccode ti_cpi vdem_…¹ wdi_f…² wdi_afp bl_as…³ wdi_e…⁴ wdi_e…⁵ wef_iu
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanis… 4 16 0.560 4.47 2.64 4.83 4.06 NA NA
2 Albania 8 36 0.876 1.62 0.643 11.0 3.61 0 71.8
3 Algeria 12 35 0.338 3.02 2.52 7.71 NA 0 59.6
4 Andorra 20 NA NA NA NA NA 3.25 NA NA
5 Angola 24 19 0.440 5.52 0.921 NA NA 0 14.3
6 Antigua … 28 NA NA 1.99 NA NA NA NA NA
7 Azerbaij… 31 25 0.0770 1.73 1.61 NA 2.46 0 79.8
8 Argentina 32 40 0.935 2.26 0.512 10.2 5.46 2.03 74.3
9 Australia 36 77 0.847 1.74 0.438 12.5 5.12 62.9 86.5
10 Austria 40 76 0.973 1.47 0.497 10.8 5.36 8.23 87.7
# … with 184 more rows, 21 more variables: wdi_foodins <dbl>,
# ht_colonial <dbl>, lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>,
# cai_physical <dbl>, ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>,
# wdi_homicides <dbl>, ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>,
# br_elect <dbl>, van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>,
# vdem_polyarchy <dbl>, mad_gdppc <dbl>, top_top1_income_share <dbl>,
# wef_sp <dbl>, and abbreviated variable names ¹vdem_academ, …
The only change is the addition of # Groups: br_elect [5] (the grouping variable and the number of groups).
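A sketch showing that group_by() on its own leaves the rows untouched and only records the grouping, which summarize() then uses. The `toy` data frame is made up for illustration; group_vars() and n_groups() are dplyr helpers for inspecting the grouping.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  br_elect = c(0, 3, 3, 2, 3),
  mad_gdppc = c(50000, 20000, 30000, 10000, 25000)
)

grouped <- toy %>% group_by(br_elect)

# Same rows as before; only the grouping information is new
nrow(grouped) == nrow(toy)
group_vars(grouped)
n_groups(grouped)

# summarize() now returns one row per group
by_group <- grouped %>%
  summarize(mean_gdp = mean(mad_gdppc), n = n())
by_group
```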
Let's chain together group_by() and summarize():
# A tibble: 5 × 3
br_elect mean n
<dbl> <dbl> <int>
1 0 52084. 10
2 1 11348. 7
3 2 11188. 57
4 3 21459. 118
5 NA NaN 2
What is ugly about this?
is.na() checks if something is missing or not.
df %>% filter(!is.na(br_elect)) %>%
  group_by(br_elect) %>%
  summarize(mean = mean(mad_gdppc, na.rm = TRUE), n = n())
# A tibble: 4 × 3
br_elect mean n
<dbl> <dbl> <int>
1 0 52084. 10
2 1 11348. 7
3 2 11188. 57
4 3 21459. 118
Tip
The drop_na( ) tidyverse function (from the tidyr package) can replace filter(!is.na( )).
There are several variables that can be used to group countries. Pick one of them, pick an interval variable that you think might vary by the group, and then calculate the number of observations, mean, and median for each group.
br_pvote
van_part
df %>%
  drop_na(br_pvote) %>%
  group_by(br_pvote) %>%
  summarize(n = n(), mean = mean(van_part, na.rm = TRUE),
            median = median(van_part, na.rm = TRUE))
Data that I am using
Filtering out observations that are missing a value for br_pvote
Grouping the data frame by br_pvote
Summarizing (number of observations, mean of van_part, median of van_part)
Our groups could have better names than 0, 1, 2, and 3.
mutate()
If we want to change the values of a variable, or create a new variable from an old one, we use mutate().
General format:
mutate(new_var = f(old_var))
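A minimal sketch of the general format with a concrete function filled in. The `toy` data frame and the new column name `gdppc_thousands` are invented for illustration. Here f() is simply division by 1000.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  cname = c("A", "B"),
  mad_gdppc = c(52000, 11000)
)

# new_var = f(old_var): GDP per capita expressed in thousands
toy2 <- toy %>%
  mutate(gdppc_thousands = mad_gdppc / 1000)
toy2
```

The old column stays in the data; mutate() adds the new one alongside it. Using the old name on the left-hand side would overwrite the column instead.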
What function though? br_elect is a small set of numbers where each number represents a category. This is a factor.
factor()
factor() creates a factor with a certain set of numeric levels and labels, and so it needs some things:
levels=c( ) the values the variable takes
labels=c( ) the names for those values, in the same order
library(tidyverse)
df %>% mutate(br_elect_label =
factor(br_elect, levels=c(0, 1, 2, 3),
labels=c("None", "Single-party",
"Non-dem multi-party", "Democratic"))) %>%
pull(br_elect_label)
[1] Non-dem multi-party Democratic Non-dem multi-party
[4] <NA> Non-dem multi-party Democratic
[7] Non-dem multi-party Democratic Democratic
[10] Democratic Democratic Non-dem multi-party
[13] Non-dem multi-party Democratic Democratic
[16] Democratic Democratic Democratic
[19] Non-dem multi-party Democratic Democratic
[22] Democratic Democratic None
[25] Democratic Non-dem multi-party Non-dem multi-party
[28] Non-dem multi-party Non-dem multi-party Non-dem multi-party
[31] Democratic Democratic Non-dem multi-party
[34] Democratic Non-dem multi-party Democratic
[37] None Democratic Democratic
[40] Democratic Non-dem multi-party Non-dem multi-party
[43] Democratic Democratic Single-party
[46] Democratic Democratic Democratic
[49] Democratic Democratic Democratic
[52] Democratic Democratic Single-party
[55] Non-dem multi-party None Democratic
[58] Non-dem multi-party Democratic Democratic
[61] Non-dem multi-party Non-dem multi-party Democratic
[64] Democratic Democratic Democratic
[67] Democratic Democratic Democratic
[70] Democratic Democratic Democratic
[73] Non-dem multi-party Non-dem multi-party Democratic
[76] Democratic Democratic Democratic
[79] Single-party Non-dem multi-party Democratic
[82] Democratic Democratic Non-dem multi-party
[85] Democratic Democratic Non-dem multi-party
[88] Non-dem multi-party Non-dem multi-party Single-party
[91] Democratic Non-dem multi-party Democratic
[94] Single-party Non-dem multi-party Democratic
[97] Democratic Democratic Non-dem multi-party
[100] Democratic
[ reached getOption("max.print") -- omitted 94 entries ]
Levels: None Single-party Non-dem multi-party Democratic
df %>% filter(!is.na(br_elect)) %>%
  mutate(br_elect_label =
           factor(br_elect, levels = c(0, 1, 2, 3),
                  labels = c("None", "Single-party",
                             "Non-dem multi-party", "Democratic"))) %>%
  group_by(br_elect_label) %>%
  summarize(mean = mean(mad_gdppc, na.rm = TRUE), n = n())
# A tibble: 4 × 3
br_elect_label mean n
<fct> <dbl> <int>
1 None 52084. 10
2 Single-party 11348. 7
3 Non-dem multi-party 11188. 57
4 Democratic 21459. 118
Often there are variables we want to coarsen (take them from interval and put them into categories). Here we can use the cut() function to cut our variable up into categories.
There are a variety of ways to use cut(). The most common way is giving it a set of “breaks” where you tell it where you want the edges of the bins to be.
You can also add labels to it, but remember that N breaks create N-1 bins, so you need N-1 labels.
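A small sketch of cut() with breaks and labels on a made-up vector (`x` and the label names are assumptions for illustration). Four breaks make three bins, so there are three labels. By default each bin includes its upper edge but not its lower edge.

```r
# Hypothetical values to bin
x <- c(1, 3, 5, 10)

# 4 breaks -> 3 bins: (0, 3], (3, 7], (7, Inf]
binned <- cut(x,
              breaks = c(0, 3, 7, Inf),
              labels = c("Low", "Middle", "High"))
binned

# Count how many values fall into each bin
bin_counts <- table(binned)
bin_counts
```

The value 3 sits exactly on a break and lands in the lower bin because bins are closed on the right by default. Anything outside the outermost breaks becomes NA.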
There is a variable that captures the fertility rate. Mutate it into a variable with 3 categories: one that is around the replacement rate (2.1), one that is below it, and one that is above it.
Then group_by() that new variable and count up how many observations are in each bin.
(It might help to use range() before identifying the bins.)
df %>% mutate(
fert_groups = cut(wdi_fertility, breaks=c(0, 1.8, 2.4, Inf),
labels=c("Below", "Replacement", "Above"))
) %>%
group_by(fert_groups) %>%
summarize(n = n())
# A tibble: 4 × 2
fert_groups n
<fct> <int>
1 Below 61
2 Replacement 38
3 Above 86
4 <NA> 9