Introduction to R: Data Manipulation and Summary

Kevin Reuning

Goals for Today

  • Manipulate data (filter specific rows, select columns).
  • pipes
  • Summarizing data

Data for today

Today we are going to use a subset of country data from The Quality of Governance Institute.

library(readr)
setwd("images")
df <- read_csv("country_data.csv")

Variables

There is a description of all the variables I’ve included here.

For now though we are going to use a few of them:

  • bl_asymf average schooling years, females and males between 15 and 64 years old.
  • wdi_expedu general government expenditure on education (current, capital, and transfers) is expressed as a percentage of GDP

Filtering Data

Often you want to select just specific rows of data that meet certain requirements.

Logical Checks

We need to include some more operators to do this:

  • < less than and > greater than
  • <= less than or equal to and >= greater than or equal to
  • == equal to and != not equal to
43 < 4
[1] FALSE
(4*pi)^2 > 5
[1] TRUE

Logical Values

  • The output from these checks is another form of variable called a logical.
  • We can have vectors of logical values
names <- c("Kevin", "Anne", "Sophie")
names == "Kevin"
[1]  TRUE FALSE FALSE

Logical Checks with Data

We can do the same thing but using a variable from our dataset:

## Returns true if bl_asymf (average schooling) is more than 10.
df$bl_asymf > 10 
  [1] FALSE  TRUE FALSE    NA    NA    NA    NA  TRUE  TRUE  TRUE    NA FALSE
 [13] FALSE  TRUE FALSE  TRUE    NA FALSE    NA  TRUE FALSE  TRUE    NA FALSE
 [25]  TRUE FALSE FALSE    NA FALSE FALSE  TRUE    NA FALSE  TRUE    NA  TRUE
 [37] FALSE  TRUE FALSE    NA FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 [49]  TRUE    NA FALSE FALSE FALSE    NA    NA    NA  TRUE  TRUE  TRUE  TRUE
 [61]    NA FALSE    NA FALSE  TRUE FALSE    NA  TRUE    NA FALSE    NA FALSE
 [73] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
 [85]  TRUE  TRUE  TRUE  TRUE FALSE    NA  TRUE FALSE  TRUE FALSE    NA FALSE
 [97]  TRUE FALSE FALSE    NA  TRUE  TRUE    NA FALSE  TRUE FALSE FALSE  TRUE
[109] FALSE FALSE FALSE    NA FALSE  TRUE    NA FALSE FALSE    NA FALSE    NA
[121] FALSE  TRUE    NA  TRUE FALSE FALSE    NA  TRUE    NA    NA    NA FALSE
[133]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE    NA    NA FALSE  TRUE  TRUE
[145] FALSE    NA    NA    NA    NA    NA  TRUE FALSE  TRUE    NA FALSE  TRUE
[157]  TRUE FALSE  TRUE    NA  TRUE FALSE  TRUE    NA FALSE    NA FALSE  TRUE
[169]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA    NA
[181] FALSE  TRUE    NA FALSE  TRUE FALSE  TRUE    NA FALSE    NA  TRUE    NA
[193] FALSE FALSE

Filtering Data

We can use logical checks to filter our data.

  • The filter() function is part of the plyr package in the tidyverse.
  • The first argument will be the data you want to filter.
  • The second argument will be the logical check.

Note

Within the filter() call you do not need to use data$ before the variable name, it already knows you are using the data you put in the first argument.

Filtering Data - Example

library(tidyverse)

# df <- read_csv("country_data.csv") ## remember I did this already
filter(df, bl_asymf>10)
cnameccodeti_cpivdem_academwdi_fertilitywdi_afpbl_asymfwdi_expeduwdi_elprodcoalwef_iuwdi_foodinsht_coloniallp_legorcai_foetalcai_mentalcai_physicalccp_initiatccp_marketh_jwdi_homicidesccp_strikewdi_lfprbr_pvotebr_electvan_partbmr_demdurfh_polity2vdem_polyarchymad_gdppctop_top1_income_sharewef_sp
Albania8360.8761.62 0.64311  3.610    71.810  031111102.29 168.31352.8 228.08 0.52 1.11e+040.090857.3     
Argentina32400.9352.26 0.51210.25.462.03 74.312.9220011205.32 169.21358.6 368.92 0.7791.86e+040.153 365       
Australia36770.8471.74 0.43812.55.1262.9  86.53.8011112210.892378.10360.8 11810    0.8654.98e+040.129 852       
Austria40760.9731.47 0.49710.85.368.23 87.71.1041111210.967376.61358.5 7310    0.8464.3e+04 0.0992579       
Armenia51350.8  1.75 3.92 11.82.710    64.71.1011129601.69 161.61351.7 16.75 0.4721.15e+040.178 162       
Belgium56750.9671.62 0.61911.66.416.11 88.71.1021112211.69 368.61368.2 1259.5  0.8913.98e+040.086 704       
Botswana72610.8742.87 0.85810.3   96.4  47  21.551111220    373.10332.4 538.25 0.6861.58e+040.227 93.7     
Belize84    2.31 0.86311.37.56        51112237.8  367.50341.1 38               0.197        
Bulgaria100420.9061.56 1.11 11.24.0946.2  64.81.9031112101.3  171.71340.7 298.92 0.6151.84e+040.182 221       
Canada124810.9191.5  0.35612.9   9.84 91  0.7011112211.76 378.50350.4 15210    0.8494.49e+040.149 1.03e+03
Sri Lanka144380.7332.2  3.65 11.12.1233.7  34.1  510002202.42 357.91357.1 46.92 0.6281.17e+040.206 148       
Chile152670.9591.65 1.3  10.65.4237.1  82.33.6220002214.4  269  1337.7 2910    0.8632.21e+040.265 319       
Taiwan (Province of China)158630.897        12.4       92.8  04121    3  0352.8 2310    0.84 4.47e+040.145 439       
Croatia191480.8731.47 1.01 12  3.9220.6  72.70.901112110.577166.61351.6 199.33 0.7322.2e+04 0.104 239       
Cuba192470.1171.62 1.49 11.1   0        231111105.05 364.30164.4 661.67 0.1828.33e+030.145        
Cyprus196590.9581.33 2.56 11.95.780    84.4  511112211.26 174.11331   4310    0.8562.72e+040.117 170       
Czechia203590.9421.71 0.40812.93.8553.1  80.70  01112210.62 176.81348.5 269.75 0.8123.07e+040.102 397       
Denmark208880.9411.73 0.48612.97.8224.5  97.61.10511196211.01 378.21363.9 7410    0.9134.63e+040.124 662       
Estonia233730.97 1.67 0.94912.44.975.33 89.40.901112212.12 179.11343.9 289.75 0.9012.74e+040.13  235       
Fiji242550.3572.77 1.12 10.2         2  51111221    260.41255.4 56.33 0.415                   
Finland246850.9471.41 0.91911.36.388.3  88.92  051111211.63 377.81354.4 10210    0.88 3.89e+040.105 571       
France250720.8811.88 1    10.35.452.16 82  0.7021111211.2  372  0343.7 739.58 0.88 3.85e+040.09661.03e+03
Germany276800.9711.57 0.41612.34.9144.3  89.70.701112110.948378.51357.3 2910    0.8784.62e+040.129 1.13e+03
Greece300450.8541.35 3.08 11.1   42.7  73  2.3021112210.941168.41363.5 459.58 0.8582.35e+040.109 434       
Hungary348460.4671.55 0.84112  4.6719.5  76.10.80311119612.49 171.91359.2 298.33 0.4892.56e+040.12  391       
Ireland372730.94 1.75 0.36312.43.5117.3  84.53.5010002210.872373.21344.8 9710    0.88 6.47e+040.12  451       
Israel376610.9453.09 4.33 12.76.0945.4  81.61.70111111.49 72.41350.8 717.75 0.7  3.3e+04 0.165 624       
Italy380520.9671.29 1.31 11  4.0416.1  74.41.1021111210.569265.71358.1 7310    0.8673.44e+040.0913897       
Jamaica388440.9431.98 0.40410.65.410    55.1  5101122043.9  370.40330.2 578.92 0.8127.27e+030.197 89.7     
Japan392730.7111.42 0.38212.83.1833.2  84.60.7040012210.263379.10344.9 6710    0.8273.87e+040.131 919       
Kazakhstan398310.3382.84 0.78611.42.6271.6  78.90  0111225.06 176.61252.7 281.83 0.2362.53e+040.154 83.7     
Jordan400490.3262.76 3.93 10.23.030    66.8  521112211.36 341.8028.95733.42 0.27 1.15e+040.174 143       
Korea (the Republic of)410570.8360.9772.15 12.84.3343.1  95.90  041112210.604168.90355.7 318.67 0.8683.79e+040.149 579       
Kyrgyzstan417290.6213.3  0.82711  6.0313.2  38  0.8011119612.19 162.51327.6 286.58 0.4655.18e+030.145 57.7     
Latvia428580.9651.6  0.69 11.74.4 0    83.60.601111214.36 178.11343.8 268.67 0.8332.43e+040.0969141       
Lithuania440590.9381.63 2.69 11.83.810    79.71.101111214.57 177.60344.3 2710    0.8242.74e+040.113 182       
Luxembourg442810.9461.38 0.62812  3.570    97.10.9021112210.338170.81339.5 12910    0.8745.74e+040.101 153       
Malaysia458470.5042    0.87611.44.4842.3  81.26.751011221    368.50338.9 626.75 0.3832.48e+040.149 251       
Malta470540.8951.23 0.73911.84.820    81.40.8020002211.59 373.21367.6 55    0.7563.2e+04 0.0961104       
Moldova (the Republic of)498330.8461.26 0.69811.25.440    76.14  01112114.1  143.71346   287.67 0.5266.75e+030.097497.7     
Netherlands (the)528820.93 1.59 0.44911.85.1838.7  94.71.7021112210.586380.31361.7 12210    0.8764.75e+040.0699895       
New Zealand554870.8971.71 0.34111  6.284.25 90.84.4011112210.744381.11345.3 16210    0.8923.53e+040.119 461       
Norway578840.9341.56 0.83 12.77.910.10596.51.1051112210.468377.81355.8 11910    0.8898.46e+040.109 532       
Panama591370.9012.46 1.28 10.1   6.92 57.9  2210096209.39 171.41352.3 289.33 0.7562.26e+040.197 174       
Poland616600.9431.46 1.07 11.84.5680.9  77.50.5031111210.73 170.41340.3 309.17 0.6952.75e+040.147 481       
Romania642470.91 1.76 1.4  11.43.1 27.6  70.73.4031111101.28 167.91343.8 288.92 0.6722.01e+040.137 228       
Russian Federation (the)643280.3761.58 1.97 12.14.6914.8  80.90.501112208.21 174.40243.7 203.92 0.27 2.47e+040.215 503       
Saudi Arabia682490.0742.32 1.8  10.1   0    93.3  010112201.27 357.5000   930    0.0165.03e+040.209 274       
Serbia688390.7251.49 0.99211.73.5972.4  73.42  01111101.23 167.41353.5 137.83 0.3481.41e+040.108 180       
Singapore702850.4661.14 1.69 12.8   1.2  88.21.4511112210.156377  0240.6 544.5  0.3876.84e+040.142 494       
Slovakia703500.9451.54 0.57612  3.9412.5  80.70.801111111.14 172.51344.8 269.58 0.8152.71e+040.0785242       
Slovenia705600.9531.61 0.69912.24.7829.6  79.70.501111110.481175.11339.2 2810    0.8382.92e+040.0803255       
South Africa710430.7722.4  0.39110.26.1692.7  56.219.35111122035.9  160  1234.5 258.92 0.7381.22e+040.193 392       
Spain724580.95 1.26 0.85110.94.2119    86.11.8021111110.621174.1352   4210    0.86 3.15e+040.125 776       
Sweden752850.9641.76 0.28112  7.570.66792.11.2051112211.08 9683  1364.9 10810    0.9094.55e+040.094 779       
Switzerland756850.9591.52 0.43312.25.130    89.70.70411196210.586184.21346   17110    0.8966.14e+040.11  868       
Tajikistan762250.0873.59 0.72310.85.231.53 22    0111220    342.10246.1 282.17 0.17 4.44e+030.149 38       
Tonga776    3.56     11.3         6  5100022    349.30318.3 49                           
Trinidad and Tobago780410.8161.73 0.60511.2   0    77.3  5101122030.6  368.70354   579.17 0.7522.85e+040.197 88.3     
United Arab Emirates (the)784700.1231.41 0.93311.7   0    98.5  511112210.464382.8000   480.9170.0957.64e+040.158 171       
Ukraine804320.4481.3  1.45 11.65.4134.6  58.91.601112216.18 166.61340.8 286.42 0.4059.81e+030.0978229       
United Kingdom of Great Britain and Northern Ireland (the)826800.9261.68 0.43212.95.4422.8  94.91.3011112211.2  377.70348.8 1349.5  0.8743.81e+040.13  1.29e+03
United States of America (the)840710.91 1.73 0.83312.8   34.2  87.30.8011112214.96 372.60370   2199.08 0.8315.53e+040.19  2.09e+03
Venezuela (Bolivarian Republic of)862180.2592.27 2.72 10.2   0    72    220001236.7  164.71129.4 142.17 0.2151.07e+040.197 193       

If we want we can save the new data as well:

sub_df <- filter(df, bl_asymf>10)

Checking Multiple Things

What about if we want to check if our rows meet multiple condition? Then we need logical operators.

Logical Operators

  • We can reverse a logical value with ! (e.g. !TRUE == FALSE)
  • We have and and or operators to check multiple logical values.
    • and is &
    • or is | (shift + backslash)
  • & returns TRUE if both values are TRUE
  • | returns TRUE if at least one value is TRUE
TRUE & TRUE 
TRUE | FALSE 
TRUE | TRUE 

All would return TRUE

Combining Logical Check

We can then combine logical checks together.

val <- pi^(2/3)
(val < 1) | (val > 4) # Is it less than 0 or greater than 4?
[1] FALSE

Multiple Checks in Filtering

Lets collect countries with more than 10 years of average education but spend less than 5% of their GDP on education ::: {.cell}

sub_df <- filter(df, bl_asymf > 10 & wdi_expedu < 5)
sub_df$cname
 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"                   

:::

Check

Create two new datasets.

  1. Only countries that spent more than 10% of their GDP on education
  2. Countries that have an average education between 5 and 8 years

How I did it

sub_df1 <- filter(df, wdi_expedu > 10)
sub_df1$cname
[1] "Micronesia (Federated States of)"
sub_df2 <- filter(df, bl_asymf > 5 & bl_asymf < 8)
sub_df2$cname
 [1] "Algeria"                               
 [2] "Bangladesh"                            
 [3] "Myanmar"                               
 [4] "Cameroon"                              
 [5] "Congo (the)"                           
 [6] "Congo (the Democratic Republic of the)"
 [7] "Benin"                                 
 [8] "El Salvador"                           
 [9] "Guatemala"                             
[10] "Haiti"                                 
[11] "Honduras"                              
[12] "India"                                 
[13] "Iraq"                                  
[14] "Kenya"                                 
[15] "Kuwait"                                
[16] "Lao People's Democratic Republic (the)"
[17] "Lesotho"                               
[18] "Malawi"                                
[19] "Maldives"                              
[20] "Mauritania"                            
[21] "Morocco"                               
[22] "Namibia"                               
[23] "Nepal"                                 
[24] "Nicaragua"                             
[25] "Pakistan"                              
[26] "Rwanda"                                
[27] "Viet Nam"                              
[28] "Eswatini"                              
[29] "Syrian Arab Republic (the)"            
[30] "Togo"                                  
[31] "Turkey"                                
[32] "Uganda"                                
[33] "Egypt"                                 
[34] "Tanzania, the United Republic of"      
[35] "Zambia"                                

Pipes %>%

Tidyverse syntax makes use of pipes to chain multiple functions together.

  • You use the pipe operator (%>%) in between each step.
  • This operator is like saying “take the output from the previous function and put it in the next function”

For example (in pseudo-code):

Output <- Step 1(Input) %>% Step 2() %>% Step 3()

Translation: Take the Input, apply Step 1 to it, then take the output of Step 1 and apply Step 2 to it, then take the output of Step 2 and apply Step 3 to it, and finally store the output of Step 3 as Output.

Example

filter(df, bl_asymf > 10 & wdi_expedu < 5) %>% pull(cname)
 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"                   

What does the pull() function do? It pulls out a column from your data.

How else could we have done this?

pull(filter(df, bl_asymf > 10 & wdi_expedu < 5), cname)
 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"                   
filter(df, bl_asymf > 10 & wdi_expedu < 5)$cname
 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"                   
sub_df <- filter(df, bl_asymf > 10 & wdi_expedu < 5)
sub_df$cname
 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"                   

A Note of Caution

  • The %>% has been around for a while in the tidyverse.
  • R added its own version of this to base R BUT they use |> instead.
  • In most cases %>% is the same as |>

Yes this is all kind of silly and strange.

Summarizing Data

One of the most useful tidyverse functions is summarize().

  • summarize() transforms data by applying a function(s) to columns in the data.
  • The first argument will be the data, the rest of the arguments will be functions you want to apply to it.
  • The output will be a smaller data frame where the columns are the output from each function it applied.

Simple Examples

What if we want to figure out the average average education for all countries in our data?

summarize(df, mean(bl_asymf, na.rm=TRUE))
mean(bl_asymf, na.rm = TRUE)
9.11

What if we want to calculate other statistics?

summarize(df, mean(bl_asymf, na.rm=TRUE), 
            sd(bl_asymf, na.rm=TRUE), 
            median(bl_asymf, na.rm=TRUE))
mean(bl_asymf, na.rm = TRUE)sd(bl_asymf, na.rm = TRUE)median(bl_asymf, na.rm = TRUE)
9.112.719.52

Caution - Multiple Return Values

You generally want to use functions that only return 1 value. Why?

summarize(df, mean(bl_asymf, na.rm=TRUE), 
            sd(bl_asymf, na.rm=TRUE), 
            median(bl_asymf, na.rm=TRUE), 
            range(bl_asymf, na.rm=TRUE))
mean(bl_asymf, na.rm = TRUE)sd(bl_asymf, na.rm = TRUE)median(bl_asymf, na.rm = TRUE)range(bl_asymf, na.rm = TRUE)
9.112.719.522.43
9.112.719.5212.9 

Filtering and Summarizing

What if we want to figure out the average education for countries that spend less than 5% of their GDP on education?

df %>% filter(wdi_expedu < 5) %>% summarize(mean(bl_asymf, na.rm=T))
mean(bl_asymf, na.rm = T)
8.81

We can improve the output by changing the column name: summarize(col_name = mean(variable))

df %>% filter(wdi_expedu < 5) %>% 
    summarize("Mean"=mean(bl_asymf, na.rm=T))
Mean
8.81

Note

You can use multiple lines with pipes, it is common to put the pipe at the end of each line and indent the next line.

Number of Observations

There is also a function specifically for the number of observations: n()

df %>% filter(wdi_expedu < 5) %>% summarize(n())
n()
89

Check

Find the mean and median average education and education expenditure for countries with a GDP per capita (mad_gdppc) of more than 10,000.

My Solutions

df %>% filter(mad_gdppc > 10000) %>% 
    summarize(mean_exp=mean(wdi_expedu, na.rm=TRUE), 
        median_exp=median(wdi_expedu, na.rm=TRUE), 
        mean_ed=mean(bl_asymf, na.rm=TRUE), 
        median_ed=median(bl_asymf, na.rm=TRUE))
mean_expmedian_expmean_edmedian_ed
4.664.5410.811.1

Grouping

Often we want to provide summaries of groups within the data. For example: how does the GDP vary by election type? br_pvote is an indicator for having proportional representation.

Here we’ll use the group_by() function to create groups of our data.

group_by() alone

group_by() expects variable(s) that you want to use to group your dataset:

df %>% group_by(br_pvote) 
# A tibble: 194 × 31
# Groups:   br_pvote [3]
   cname      ccode ti_cpi vdem_academ wdi_fertility wdi_afp bl_asymf wdi_expedu
   <chr>      <dbl>  <dbl>       <dbl>         <dbl>   <dbl>    <dbl>      <dbl>
 1 Afghanist…     4     16      0.560           4.47   2.64      4.83       4.06
 2 Albania        8     36      0.876           1.62   0.643    11.0        3.61
 3 Algeria       12     35      0.338           3.02   2.52      7.71      NA   
 4 Andorra       20     NA     NA              NA     NA        NA          3.25
 5 Angola        24     19      0.440           5.52   0.921    NA         NA   
 6 Antigua a…    28     NA     NA               1.99  NA        NA         NA   
 7 Azerbaijan    31     25      0.0770          1.73   1.61     NA          2.46
 8 Argentina     32     40      0.935           2.26   0.512    10.2        5.46
 9 Australia     36     77      0.847           1.74   0.438    12.5        5.12
10 Austria       40     76      0.973           1.47   0.497    10.8        5.36
# ℹ 184 more rows
# ℹ 23 more variables: wdi_elprodcoal <dbl>, wef_iu <dbl>, wdi_foodins <dbl>,
#   ht_colonial <dbl>, lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>,
#   cai_physical <dbl>, ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>,
#   wdi_homicides <dbl>, ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>,
#   br_elect <dbl>, van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>,
#   vdem_polyarchy <dbl>, mad_gdppc <dbl>, top_top1_income_share <dbl>, …

Only change is the addition of # Groups: br_pvote [3] (grouping variable, and number of groups).

Group and Summarize

Lets chain together group_by() and summarize()

df %>% group_by(br_pvote) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())
br_pvotemeann
01.79e+0495
11.97e+0493
1.39e+046

What is ugly about this?

Adding in Filtering

is.na() checks if something is missing or not.

df %>% filter(!is.na(br_pvote)) %>% 
    group_by(br_pvote) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())
br_pvotemeann
01.79e+0495
11.97e+0493

Tip

The drop_na( ) tidyverse function can replace filter(!is.na( ))

Check

There are several variables that can be used to group countries. Pick one of them, pick an interval variable that you think might vary by the group, and then calculate the number of observations, mean, and median for each group.

There is a description of all the variables I’ve included here.

My Solution

  • Grouping variable: br_pvote
  • Interval variable: van_part
  • Expectation: Countries with proportional representation (1) will have higher participation
df %>% 
    drop_na(br_pvote) %>% 
    group_by(br_pvote) %>%
    summarize(n=n(), mean=mean(van_part, na.rm=T),
            median=median(van_part, na.rm=T))

Data that I am using

Filtering out observations that are missing a value for br_pvote

Grouping the data frame by br_pvote

Summarizing (number of observations, mean of van_part, median of van_part)

Saving Results

There are two ways to save our summary results. Both can be helpful depending on what you are doing:

  • write_csv(): Writes to a CSV file.
  • Creating an exportable table.

gt

We are going to use: gt

The gt package is a a lot so we are not going to get to it all but iet lets you do a lot of things:

  • Convert dataframe into a table.
  • Format tables (borders, colors, alignment)
  • Export that table in a lot of formats (html, docx, excel, latex…)

Get a Table to Export

tab_out <- df %>% 
    drop_na(br_pvote) %>% 
    group_by(br_pvote) %>%
    summarize(n=n(), mean=mean(van_part, na.rm=T),
            median=median(van_part, na.rm=T))

Using gt

The function gt() will create a table object that we can then modify. Lets see what happens when we make a table.

library(gt)
tab_fmted <- gt(tab_out)
tab_fmted
br_pvote n mean median
0 95 33.36337 33.235
1 93 44.73456 44.355

Using gt - Styling

We can then modify the style by pipping it into functions like a opt_stylize() function and cols_align()

library(gt)
tab_fmted <- gt(tab_out) %>% 
    opt_stylize(style=1, color="green") %>%
    cols_align(align="center", columns="br_pvote") 
tab_fmted
br_pvote n mean median
0 95 33.36337 33.235
1 93 44.73456 44.355

Using gt - Labels

We can also change the labeling of our columns easily using the actual name and what we want it to be called

library(gt)
tab_fmted <- gt(tab_out) %>% 
    opt_stylize(style=1, color="green") %>%
    cols_align(align="center", columns="br_pvote") %>%
    cols_label(
        br_pvote = "Proportional?", 
        n = "N", mean = "Mean", 
        median = "Median"
    )
tab_fmted
Proportional? N Mean Median
0 95 33.36337 33.235
1 93 44.73456 44.355

Using gt - Modifying Contents

We can also modify values using the text_case_match() function. This one will check if a cell matches what is on the left side of the tilde (~) and replace it with the right side.

tab_fmted <- tab_fmted %>% 
    text_case_match(
        "1" ~ "Yes",
        "0" ~ "No"
    )
tab_fmted
Proportional? N Mean Median
No 95 33.36337 33.235
Yes 93 44.73456 44.355

Using gt - Table Headers and Notes

Finally, tab_header() and tab_source_note() can be used to add other information about your table:

tab_fmted <- tab_fmted %>% 
    tab_header("Countries by Election Type") %>% 
    tab_source_note("Data from QoGs")
tab_fmted
Countries by Election Type
Proportional? N Mean Median
No 95 33.36337 33.235
Yes 93 44.73456 44.355
Data from QoGs

Saving the Document

And now we export it with gtsave()

gtsave(tab_fmted, file="Table.docx")

gt Options and Check

There are a lot of options to modify your table here.

Try to see if you can change the mean and median to be listed in percentages and then make the number of observations bold.

My Solution

Using my table from before:

tab_fmted %>%
    fmt_percent(columns=c(mean, median), scale_values=FALSE) %>% 
    tab_style(locations=cells_body(n), 
        style=cell_text(weight="bold"))
Countries by Election Type
Proportional? N Mean Median
No 95 33.36% 33.24%
Yes 93 44.73% 44.36%
Data from QoGs