Introduction to R: Data Manipulation and Summary

Kevin Reuning

Goals for Today

  • Manipulate data (filter specific rows, select columns).
  • pipes
  • Summarizing data

Data for today

Today we are going to use a subset of country data from The Quality of Governance Institute.

library(readr)
setwd("images")
df <- read_csv("country_data.csv")

Variables

There is a description of all the variables I’ve included here.

For now though we are going to use a few of them:

  • bmr_demdur is how long the country has been in the same regime type category
  • top_top1_income_share is the proportion of income that goes to the top 1%.

Filtering Data

Often you want to select just specific rows of data that meet certain requirements.

Logical Checks

We need to include some more operators to do this:

  • < less than and > greater than
  • <= less than or equal to and >= greater than or equal to
  • == equal to and != not equal to
43 < 4
[1] FALSE
(4*pi)^2 > 5
[1] TRUE

Logical Values

  • The output from these checks is another form of variable called a logical.
  • We can have vectors of logical values
names <- c("Kevin", "Anne", "Sophie")
names == "Kevin"
[1]  TRUE FALSE FALSE

Logical Checks with Data

We can do the same thing but using a variable from our dataset:

## Returns true if dem_dur (regime type length) is more than 100.
df$bmr_demdur > 100 
  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [37]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[121] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[169]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE

Filtering Data

We can use logical checks to filter our data.

  • The filter() function is part of the plyr package in the tidyverse.
  • The first argument will be the data you want to filter.
  • The second argument will be the logical check.

Note

Within the filter() call you do not need to use data$ before the variable name, it already knows you are using the data you put in the first argument.

Filtering Data - Example

library(tidyverse)

# df <- read_csv("country_data.csv") ## remember I did this already
filter(df, bmr_demdur>100)
# A tibble: 19 × 31
   cname     ccode ti_cpi vdem_academ wdi_fertility  wdi_afp bl_asymf wdi_expedu
   <chr>     <dbl>  <dbl>       <dbl>         <dbl>    <dbl>    <dbl>      <dbl>
 1 Afghanis…     4     16      0.560           4.47  2.64        4.83       4.06
 2 Australia    36     77      0.847           1.74  0.438      12.5        5.12
 3 Belgium      56     75      0.967           1.62  0.619      11.6        6.41
 4 Bhutan       64     68      0.541           1.98 NA          NA          6.85
 5 Canada      124     81      0.919           1.50  0.356      12.9       NA   
 6 China       156     39      0.0960          1.69  0.343       8.71       3.51
 7 Finland     246     85      0.947           1.41  0.919      11.3        6.38
 8 Haiti       332     20      0.684           2.93  0.00991     5.32       2.78
 9 Iceland     352     76      0.925           1.71  0           9.94       7.66
10 Iran (Is…   364     28      0.122           2.14  2.34        9.36       3.96
11 Luxembou…   442     81      0.946           1.38  0.628      12.0        3.57
12 Oman        512     52      0.229           2.89  1.80       NA         NA   
13 Netherla…   528     82      0.930           1.59  0.449      11.8        5.18
14 New Zeal…   554     87      0.897           1.71  0.341      11.0        6.28
15 Norway      578     84      0.934           1.56  0.830      12.7        7.91
16 Sweden      752     85      0.964           1.76  0.281      12.0        7.57
17 Switzerl…   756     85      0.959           1.52  0.433      12.2        5.13
18 United K…   826     80      0.926           1.68  0.432      12.9        5.44
19 United S…   840     71      0.910           1.73  0.833      12.8       NA   
# ℹ 23 more variables: wdi_elprodcoal <dbl>, wef_iu <dbl>, wdi_foodins <dbl>,
#   ht_colonial <dbl>, lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>,
#   cai_physical <dbl>, ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>,
#   wdi_homicides <dbl>, ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>,
#   br_elect <dbl>, van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>,
#   vdem_polyarchy <dbl>, mad_gdppc <dbl>, top_top1_income_share <dbl>,
#   wef_sp <dbl>

If we want we can save the new data as well:

sub_df <- filter(df, bmr_demdur>100)

Checking Multiple Things

What about if we want to check if our rows meet multiple condition? Then we need logical operators.

Logical Operators

  • We can reverse a logical value with ! (e.g. !TRUE == FALSE)
  • We have and and or operators to check multiple logical values.
    • and is &
    • or is | (shift + backslash)
  • & returns TRUE if both values are TRUE
  • | returns TRUE if at least one value is TRUE
TRUE & TRUE 
TRUE | FALSE 
TRUE | TRUE 

All would return TRUE

Combining Logical Check

We can then combine logical checks together.

val <- pi^(2/3)
(val < 1) | (val > 4) # Is it less than 0 or greater than 4?
[1] FALSE

Multiple Checks in Filtering

Lets collect countries with 100 years of the same regime type where more than 10% of income goes to the top 1%

sub_df <- filter(df, bmr_demdur > 100 & top_top1_income_share > .1)
sub_df$cname
 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"                            

Check

Create two new datasets.

  1. Only countries that have an income share greater than 25%
  2. Countries that have had the same regime type for 50 to 100 years.

How I did it

sub_df1 <- filter(df, top_top1_income_share > .25)
sub_df1$cname
[1] "Angola"                         "Central African Republic (the)"
[3] "Chile"                          "Malawi"                        
[5] "Mexico"                         "Mozambique"                    
[7] "Oman"                          
sub_df2 <- filter(df, bmr_demdur > 50 & bmr_demdur < 100)
sub_df2$cname
 [1] "Algeria"                                    
 [2] "Austria"                                    
 [3] "Barbados"                                   
 [4] "Botswana"                                   
 [5] "Myanmar"                                    
 [6] "Cambodia"                                   
 [7] "Cameroon"                                   
 [8] "Chad"                                       
 [9] "Colombia"                                   
[10] "Congo (the)"                                
[11] "Congo (the Democratic Republic of the)"     
[12] "Costa Rica"                                 
[13] "Cuba"                                       
[14] "Denmark"                                    
[15] "Dominican Republic (the)"                   
[16] "Equatorial Guinea"                          
[17] "France"                                     
[18] "Gabon"                                      
[19] "Guinea"                                     
[20] "India"                                      
[21] "Iraq"                                       
[22] "Ireland"                                    
[23] "Israel"                                     
[24] "Italy"                                      
[25] "Côte d'Ivoire"                              
[26] "Jamaica"                                    
[27] "Japan"                                      
[28] "Jordan"                                     
[29] "Korea (the Democratic People's Republic of)"
[30] "Kuwait"                                     
[31] "Lao People's Democratic Republic (the)"     
[32] "Libya"                                      
[33] "Malaysia"                                   
[34] "Malta"                                      
[35] "Mauritania"                                 
[36] "Mauritius"                                  
[37] "Morocco"                                    
[38] "Nauru"                                      
[39] "Rwanda"                                     
[40] "San Marino"                                 
[41] "Saudi Arabia"                               
[42] "Singapore"                                  
[43] "Zimbabwe"                                   
[44] "Eswatini"                                   
[45] "Syrian Arab Republic (the)"                 
[46] "Togo"                                       
[47] "Trinidad and Tobago"                        
[48] "Egypt"                                      
[49] "Tanzania, the United Republic of"           
[50] "Samoa"                                      

Pipes %>%

Tidyverse syntax makes use of pipes to chain multiple functions together.

  • You use the pipe operator (%>%) in between each step.
  • This operator is like saying “take the output from the previous function and put it in the next function”

For example (in pseudo-code):

Output <- Step 1(Input) %>% Step 2() %>% Step 3()

Translation: Take the Input, apply Step 1 to it, then take the output of Step 1 and apply Step 2 to it, then take the output of Step 2 and apply Step 3 to it, and finally store the output of Step 3 as Output.

Example

filter(df, bmr_demdur > 100 & top_top1_income_share > .1) %>% pull(cname)
 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"                            

What does the pull() function do? It is another way to access a certain column in your data.

How else could we have done this?

pull(filter(df, bmr_demdur > 100 & top_top1_income_share > .1), cname)
 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"                            
filter(df, bmr_demdur > 100 & top_top1_income_share > .1)$cname
 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"                            
sub_df <- filter(df, bmr_demdur > 100 & top_top1_income_share > .1)
sub_df$cname
 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"                            

A Note of Caution

  • The %>% has been around for a while in the tidyverse.
  • R recently added its own version of this to base R BUT they use |> instead.
  • In most cases %>% is the same as |>

Yes this is all kind of silly and strange.

Summarizing Data

One of the most useful tidyverse functions is summarize().

  • summarize() transforms data by applying a function(s) to columns in the data.
  • The first argument will be the data, the rest of the arguments will be functions you want to apply to it.
  • The output will be a smaller data frame where the columns are the output from each function it applied.

Simple Examples

What if we want to figure out the mean regime type length for our data?

summarize(df, mean(bmr_demdur))
# A tibble: 1 × 1
  `mean(bmr_demdur)`
               <dbl>
1               47.4

What if we want to calculate other statistics?

summarize(df, mean(bmr_demdur), sd(bmr_demdur), median(bmr_demdur))
# A tibble: 1 × 3
  `mean(bmr_demdur)` `sd(bmr_demdur)` `median(bmr_demdur)`
               <dbl>            <dbl>                <dbl>
1               47.4             43.5                   34

Caution - Multiple Return Values

You generally want to use functions that only return 1 value. Why?

summarize(df, mean(bmr_demdur), sd(bmr_demdur), median(bmr_demdur), range(bmr_demdur))
# A tibble: 2 × 4
  `mean(bmr_demdur)` `sd(bmr_demdur)` `median(bmr_demdur)` `range(bmr_demdur)`
               <dbl>            <dbl>                <dbl>               <dbl>
1               47.4             43.5                   34                   1
2               47.4             43.5                   34                 219

Filtering and Summarizing

What if we want to figure out the average income share for the top 1% for countries that have been around for more than 100 years?

df %>% filter(bmr_demdur > 100) %>% summarize(mean(top_top1_income_share))
# A tibble: 1 × 1
  `mean(top_top1_income_share)`
                          <dbl>
1                         0.154

We can improve the output by changing the column name: summarize(col_name = mean(variable))

df %>% filter(bmr_demdur > 100) %>% summarize(mean = mean(top_top1_income_share))
# A tibble: 1 × 1
   mean
  <dbl>
1 0.154

Number of Observations

There is also a function specifically for the number of observations: n()

df %>% filter(bmr_demdur > 100) %>% summarize(n())
# A tibble: 1 × 1
  `n()`
  <int>
1    19

Check

Find the mean and median regime type duration for countries that have more than 0.05 of their income going to the top 1%. Include the number of observations as well.

Then find the mean and median GDP per capita (mad_gdppc) along with the number of observations for countries with a regime type duration of less than 50 years. There are missing values in this variable, what do we do to ignore them?

My Solutions

df %>% filter(top_top1_income_share > .05) %>% 
    summarize(mean=mean(bmr_demdur), median=median(bmr_demdur))
# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1  49.0     34
df %>% filter(bmr_demdur < 50) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), 
              median=median(mad_gdppc, na.rm=T),
              n=n())
# A tibble: 1 × 3
    mean median     n
   <dbl>  <dbl> <int>
1 15144. 10907.   124

Note

You can use multiple lines with pipes, it is common to put the pipe at the end of each line and indent the next line.

Grouping

Often we want to provide summaries of groups within the data. For example: how does the GDP vary by type of political regime?

Here we’ll use the group_by() function to create groups of our data.

group_by() alone

group_by() expects variable(s) that you want to use to group your dataset:

df %>% group_by(br_elect) 
# A tibble: 194 × 31
# Groups:   br_elect [5]
   cname      ccode ti_cpi vdem_academ wdi_fertility wdi_afp bl_asymf wdi_expedu
   <chr>      <dbl>  <dbl>       <dbl>         <dbl>   <dbl>    <dbl>      <dbl>
 1 Afghanist…     4     16      0.560           4.47   2.64      4.83       4.06
 2 Albania        8     36      0.876           1.62   0.643    11.0        3.61
 3 Algeria       12     35      0.338           3.02   2.52      7.71      NA   
 4 Andorra       20     NA     NA              NA     NA        NA          3.25
 5 Angola        24     19      0.440           5.52   0.921    NA         NA   
 6 Antigua a…    28     NA     NA               1.99  NA        NA         NA   
 7 Azerbaijan    31     25      0.0770          1.73   1.61     NA          2.46
 8 Argentina     32     40      0.935           2.26   0.512    10.2        5.46
 9 Australia     36     77      0.847           1.74   0.438    12.5        5.12
10 Austria       40     76      0.973           1.47   0.497    10.8        5.36
# ℹ 184 more rows
# ℹ 23 more variables: wdi_elprodcoal <dbl>, wef_iu <dbl>, wdi_foodins <dbl>,
#   ht_colonial <dbl>, lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>,
#   cai_physical <dbl>, ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>,
#   wdi_homicides <dbl>, ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>,
#   br_elect <dbl>, van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>,
#   vdem_polyarchy <dbl>, mad_gdppc <dbl>, top_top1_income_share <dbl>, …

Only change is the addition of # Groups: by_elect [5] (grouping variable, and number of groups).

Group and Summarize

Lets chain together group_by() and summarize()

df %>% group_by(br_elect) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())
# A tibble: 5 × 3
  br_elect   mean     n
     <dbl>  <dbl> <int>
1        0 52084.    10
2        1 11348.     7
3        2 11188.    57
4        3 21459.   118
5       NA   NaN      2

What is ugly about this?

Adding in Filtering

is.na() checks if something is missing or not.

df %>% filter(!is.na(br_elect)) %>% 
    group_by(br_elect) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())
# A tibble: 4 × 3
  br_elect   mean     n
     <dbl>  <dbl> <int>
1        0 52084.    10
2        1 11348.     7
3        2 11188.    57
4        3 21459.   118

Tip

The drop_na( ) tidyverse function can replace filter(!is.na( ))

Check

There are several variables that can be used to group countries. Pick one of them, pick an interval variable that you think might vary by the group, and then calculate the number of observations, mean, and median for each group.

My Solution

  • Grouping variable: br_pvote
  • Interval variable: van_part
  • Expectation: Countries with proportional representation (1) will have higher participation
df %>% 
    drop_na(br_pvote) %>% 
    group_by(br_pvote) %>%
    summarize(n=n(), mean=mean(van_part, na.rm=T),
            median=median(van_part, na.rm=T))

Data that I am using

Filtering out observations that are missing a value for br_pvote

Grouping the data frame by br_pvote

Summarizing (number of observations, mean of van_part, median of van_part)

Saving Results

There are two ways to save our summary results. Both can be helpful depending on what you are doing:

  • write_csv(): Writes to a CSV file.
  • Creating an exportable table.

Huxtable

We are going to install one more package: huxtable (install.packages("huxtable"))

The huxtable package is a a lot so we are not going to get to it all but iet lets you do a lot of things:

  • Convert dataframe into a table.
  • Format tables (borders, colors, alignment)
  • Export that table in a lot of formats (html, docx, excel, latex…)

Get a Table to Export

tab_out <- df %>% 
    drop_na(br_pvote) %>% 
    group_by(br_pvote) %>%
    summarize(n=n(), mean=mean(van_part, na.rm=T),
            median=median(van_part, na.rm=T))

Using Huxtable

The function hux() will create a table object that we can then modify. Lets see what happens when we make a table.

library(huxtable)
ht <- hux(tab_out)
ht
br_pvote n mean median
0 95 33.4 33.2
1 93 44.7 44.4

Using Huxtable

We can then modify the style by pipping it into functions like a theme_basic() function and set_align()

library(huxtable)
ht <- hux(tab_out) %>% theme_basic() %>%
    set_align(2:3, 1, "center") # 2:3 is the rows, 1 the columns to modify.
ht
br_pvote n mean median
0 95 33.4 33.2
1 93 44.7 44.4

Using Huxtable

Finally we can modify values using the set_contents() function

library(huxtable)
ht <- hux(tab_out) %>% theme_basic() %>%
    set_align(2:3, 1, "center") %>%
    set_contents(1, 1:4, 
    c("Prop Voting?", "N", "Mean", "Median")) %>%
    set_contents(2:3, 1, c("No", "Yes"))
ht
Prop Voting? N Mean Median
No 95 33.4 33.2
Yes 93 44.7 44.4

Saving the Document

And now we export it with quick_docx()

#install.packages(c("flextable", "officer"))
quick_docx(ht, file="images/My First Summary Table.docx")

You can download my file if you want

Huxtable Options and Check

There are a lot of options to modify your table here.

I want you to make a table, where at least one cell has non-black text color, one cell is italic, and you have a border on the very bottom of it.

My Solution

Using my table from before:

ht %>% set_italic(2:3, 3:4) %>%
    set_text_color(1, 1:4, "green") %>%
    set_bottom_border(3, 1:4) 
Prop Voting? N Mean Median
No 95 33.4 33.2
Yes 93 44.7 44.4