Wrangling and Summarizing Data in R

Kevin Reuning

Goals for Today

Manipulate data (filter specific rows, select columns).
pipes
Summarizing data

Tidyverse

A lot of data science work is done using the Tidyverse suite of packages.
We can install the entire suite using: install.package()

Run:

install.packages("tidyverse")

There might be a popup asking about installing things from “Source” you can hit no on it.

Data for today

Today we are going to use a subset of country data from The Quality of Governance Institute.

Download the data we’ll be using here
You can open it with read_csv()

library(readr)
setwd("../r_bootcamp/images/")
df <- read_csv("country_data.csv")

Variables

There is a description of all the variables I’ve included here.

For now though we are going to use a few of them:

bmr_demdur is how long the country has been in the same regime type category
top_top1_income_share is the proportion of income that goes to the top 1%.

Filtering Data

Often you want to select just specific rows of data that meet certain requirements.

Logical Checks

We need to include some more operators to do this:

< less than and > greater than
<= less than or equal to and >= greater than or equal to
== equal to and != not equal to

43 < 4

[1] FALSE

(4*pi)^2 > 5

[1] TRUE

Logical Values

The output from these checks is another form of variable called a logical.
We can have vectors of logical values

names <- c("Kevin", "Anne", "Sophie")
names == "Kevin"

[1]  TRUE FALSE FALSE

Logical Checks with Data

We can do the same thing but using a variable from our dataset:

## Returns true if dem_dur (regime type length) is more than 100.
df$bmr_demdur > 100

  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [37]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[121] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[169]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE

Filtering Data

We can use logical checks to filter our data.

The filter() function is part of the plyr package in the tidyverse.
The first argument will be the data you want to filter.
The second argument will be the logical check.

Note

Within the filter() call you do not need to use data$ before the variable name, it already knows you are using the data you put in the first argument.

Filtering Data - Example

library(tidyverse)

# df <- read_csv("country_data.csv") ## remember I did this already
filter(df, bmr_demdur>100)

# A tibble: 19 × 31
   cname     ccode ti_cpi vdem_academ wdi_fertility  wdi_afp bl_asymf wdi_expedu
   <chr>     <dbl>  <dbl>       <dbl>         <dbl>    <dbl>    <dbl>      <dbl>
 1 Afghanis…     4     16      0.560           4.47  2.64        4.83       4.06
 2 Australia    36     77      0.847           1.74  0.438      12.5        5.12
 3 Belgium      56     75      0.967           1.62  0.619      11.6        6.41
 4 Bhutan       64     68      0.541           1.98 NA          NA          6.85
 5 Canada      124     81      0.919           1.50  0.356      12.9       NA   
 6 China       156     39      0.0960          1.69  0.343       8.71       3.51
 7 Finland     246     85      0.947           1.41  0.919      11.3        6.38
 8 Haiti       332     20      0.684           2.93  0.00991     5.32       2.78
 9 Iceland     352     76      0.925           1.71  0           9.94       7.66
10 Iran (Is…   364     28      0.122           2.14  2.34        9.36       3.96
11 Luxembou…   442     81      0.946           1.38  0.628      12.0        3.57
12 Oman        512     52      0.229           2.89  1.80       NA         NA   
13 Netherla…   528     82      0.930           1.59  0.449      11.8        5.18
14 New Zeal…   554     87      0.897           1.71  0.341      11.0        6.28
15 Norway      578     84      0.934           1.56  0.830      12.7        7.91
16 Sweden      752     85      0.964           1.76  0.281      12.0        7.57
17 Switzerl…   756     85      0.959           1.52  0.433      12.2        5.13
18 United K…   826     80      0.926           1.68  0.432      12.9        5.44
19 United S…   840     71      0.910           1.73  0.833      12.8       NA   
# ℹ 23 more variables: wdi_elprodcoal <dbl>, wef_iu <dbl>, wdi_foodins <dbl>,
#   ht_colonial <dbl>, lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>,
#   cai_physical <dbl>, ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>,
#   wdi_homicides <dbl>, ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>,
#   br_elect <dbl>, van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>,
#   vdem_polyarchy <dbl>, mad_gdppc <dbl>, top_top1_income_share <dbl>,
#   wef_sp <dbl>

If we want we can save the new data as well:

sub_df <- filter(df, bmr_demdur>100)

Checking Multiple Things

What about if we want to check if our rows meet multiple condition? Then we need logical operators.

Logical Operators

We can reverse a logical value with ! (e.g. !TRUE == FALSE)
We have and and or operators to check multiple logical values.
- and is &
- or is | (shift + backslash)
& returns TRUE if both values are TRUE
| returns TRUE if at least one value is TRUE

TRUE & TRUE 
TRUE | FALSE 
TRUE | TRUE

All would return TRUE

Combining Logical Check

We can then combine logical checks together.

val <- pi^(2/3)
(val < 1) | (val > 4) # Is it less than 0 or greater than 4?

[1] FALSE

Multiple Checks in Filtering

Lets collect countries with 100 years of the same regime type where more than 10% of income goes to the top 1%

sub_df <- filter(df, bmr_demdur > 100 & top_top1_income_share > .1)
sub_df$cname

 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"

Check

Create two new datasets.

Only countries that have an income share greater than 25%
Countries that have had the same regime type for 50 to 100 years.

How I did it

sub_df1 <- filter(df, top_top1_income_share > .25)
sub_df1$cname

[1] "Angola"                         "Central African Republic (the)"
[3] "Chile"                          "Malawi"                        
[5] "Mexico"                         "Mozambique"                    
[7] "Oman"

sub_df2 <- filter(df, bmr_demdur > 50 & bmr_demdur < 100)
sub_df2$cname

 [1] "Algeria"                                    
 [2] "Austria"                                    
 [3] "Barbados"                                   
 [4] "Botswana"                                   
 [5] "Myanmar"                                    
 [6] "Cambodia"                                   
 [7] "Cameroon"                                   
 [8] "Chad"                                       
 [9] "Colombia"                                   
[10] "Congo (the)"                                
[11] "Congo (the Democratic Republic of the)"     
[12] "Costa Rica"                                 
[13] "Cuba"                                       
[14] "Denmark"                                    
[15] "Dominican Republic (the)"                   
[16] "Equatorial Guinea"                          
[17] "France"                                     
[18] "Gabon"                                      
[19] "Guinea"                                     
[20] "India"                                      
[21] "Iraq"                                       
[22] "Ireland"                                    
[23] "Israel"                                     
[24] "Italy"                                      
[25] "Côte d'Ivoire"                              
[26] "Jamaica"                                    
[27] "Japan"                                      
[28] "Jordan"                                     
[29] "Korea (the Democratic People's Republic of)"
[30] "Kuwait"                                     
[31] "Lao People's Democratic Republic (the)"     
[32] "Libya"                                      
[33] "Malaysia"                                   
[34] "Malta"                                      
[35] "Mauritania"                                 
[36] "Mauritius"                                  
[37] "Morocco"                                    
[38] "Nauru"                                      
[39] "Rwanda"                                     
[40] "San Marino"                                 
[41] "Saudi Arabia"                               
[42] "Singapore"                                  
[43] "Zimbabwe"                                   
[44] "Eswatini"                                   
[45] "Syrian Arab Republic (the)"                 
[46] "Togo"                                       
[47] "Trinidad and Tobago"                        
[48] "Egypt"                                      
[49] "Tanzania, the United Republic of"           
[50] "Samoa"

Pipes %>%

Tidyverse syntax makes use of pipes to chain multiple functions together.

You use the pipe operator (%>%) in between each step.
This operator is like saying “take the output from the previous function and put it in the next function”

For example (in pseudo-code):

Output <- Step 1(Input) %>% Step 2() %>% Step 3()

Translation: Take the Input, apply Step 1 to it, then take the output of Step 1 and apply Step 2 to it, then take the output of Step 2 and apply Step 3 to it, and finally store the output of Step 3 as Output.

Example

filter(df, bmr_demdur > 100 & top_top1_income_share > .1) %>% pull(cname)

 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"

What does the pull() function do? It is another way to access a certain column in your data.

How else could we have done this?

pull(filter(df, bmr_demdur > 100 & top_top1_income_share > .1), cname)

 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"

sub_df <- filter(df, bmr_demdur > 100 & top_top1_income_share > .1)
sub_df$cname

 [1] "Afghanistan"                                               
 [2] "Australia"                                                 
 [3] "Bhutan"                                                    
 [4] "Canada"                                                    
 [5] "China"                                                     
 [6] "Finland"                                                   
 [7] "Haiti"                                                     
 [8] "Iran (Islamic Republic of)"                                
 [9] "Luxembourg"                                                
[10] "Oman"                                                      
[11] "New Zealand"                                               
[12] "Norway"                                                    
[13] "Switzerland"                                               
[14] "United Kingdom of Great Britain and Northern Ireland (the)"
[15] "United States of America (the)"

A Note of Caution

The %>% has been around for a while in the tidyverse.
R recently added its own version of this to base R BUT they use |> instead.
In most cases %>% is the same as |>

Yes this is all kind of silly and strange.

Summarizing Data

One of the most useful tidyverse functions is summarize().

summarize() transforms data by applying a function(s) to columns in the data.
The first argument will be the data, the rest of the arguments will be functions you want to apply to it.
The output will be a smaller data frame where the columns are the output from each function it applied.

Simple Examples

What if we want to figure out the mean regime type length for our data?

summarize(df, mean(bmr_demdur))

# A tibble: 1 × 1
  `mean(bmr_demdur)`
               <dbl>
1               47.4

What if we want to calculate other statistics?

summarize(df, mean(bmr_demdur), sd(bmr_demdur), median(bmr_demdur))

# A tibble: 1 × 3
  `mean(bmr_demdur)` `sd(bmr_demdur)` `median(bmr_demdur)`
               <dbl>            <dbl>                <dbl>
1               47.4             43.5                   34

Caution - Multiple Return Values

You generally want to use functions that only return 1 value. Why?

summarize(df, mean(bmr_demdur), sd(bmr_demdur), median(bmr_demdur), range(bmr_demdur))

# A tibble: 2 × 4
  `mean(bmr_demdur)` `sd(bmr_demdur)` `median(bmr_demdur)` `range(bmr_demdur)`
               <dbl>            <dbl>                <dbl>               <dbl>
1               47.4             43.5                   34                   1
2               47.4             43.5                   34                 219

Filtering and Summarizing

What if we want to figure out the average income share for the top 1% for countries that have been around for more than 100 years?

df %>% filter(bmr_demdur > 100) %>% summarize(mean(top_top1_income_share))

# A tibble: 1 × 1
  `mean(top_top1_income_share)`
                          <dbl>
1                         0.154

We can improve the output by changing the column name: summarize(col_name = mean(variable))

df %>% filter(bmr_demdur > 100) %>% summarize(mean = mean(top_top1_income_share))

# A tibble: 1 × 1
   mean
  <dbl>
1 0.154

Number of Observations

There is also a function specifically for the number of observations: n()

df %>% filter(bmr_demdur > 100) %>% summarize(n())

# A tibble: 1 × 1
  `n()`
  <int>
1    19

Check

Find the mean and median regime type duration for countries that have more than 0.05 of their income going to the top 1%. Include the number of observations as well.

Then find the mean and median GDP per capita (mad_gdppc) along with the number of observations for countries with a regime type duration of less than 50 years. There are missing values in this variable, what do we do to ignore them?

My Solutions

df %>% filter(top_top1_income_share > .05) %>% 
    summarize(mean=mean(bmr_demdur), median=median(bmr_demdur))

# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1  49.0     34

df %>% filter(bmr_demdur < 50) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), 
              median=median(mad_gdppc, na.rm=T),
              n=n())

# A tibble: 1 × 3
    mean median     n
   <dbl>  <dbl> <int>
1 15144. 10907.   124

Note

You can use multiple lines with pipes, it is common to put the pipe at the end of each line and indent the next line.

Grouping

Often we want to provide summaries of groups within the data. For example: how does the GDP vary by type of political regime?

Here we’ll use the group_by() function to create groups of our data.

`group_by()` alone

group_by() expects variable(s) that you want to use to group your dataset:

df %>% group_by(br_elect)

# A tibble: 194 × 31
# Groups:   br_elect [5]
   cname      ccode ti_cpi vdem_academ wdi_fertility wdi_afp bl_asymf wdi_expedu
   <chr>      <dbl>  <dbl>       <dbl>         <dbl>   <dbl>    <dbl>      <dbl>
 1 Afghanist…     4     16      0.560           4.47   2.64      4.83       4.06
 2 Albania        8     36      0.876           1.62   0.643    11.0        3.61
 3 Algeria       12     35      0.338           3.02   2.52      7.71      NA   
 4 Andorra       20     NA     NA              NA     NA        NA          3.25
 5 Angola        24     19      0.440           5.52   0.921    NA         NA   
 6 Antigua a…    28     NA     NA               1.99  NA        NA         NA   
 7 Azerbaijan    31     25      0.0770          1.73   1.61     NA          2.46
 8 Argentina     32     40      0.935           2.26   0.512    10.2        5.46
 9 Australia     36     77      0.847           1.74   0.438    12.5        5.12
10 Austria       40     76      0.973           1.47   0.497    10.8        5.36
# ℹ 184 more rows
# ℹ 23 more variables: wdi_elprodcoal <dbl>, wef_iu <dbl>, wdi_foodins <dbl>,
#   ht_colonial <dbl>, lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>,
#   cai_physical <dbl>, ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>,
#   wdi_homicides <dbl>, ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>,
#   br_elect <dbl>, van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>,
#   vdem_polyarchy <dbl>, mad_gdppc <dbl>, top_top1_income_share <dbl>, …

Only change is the addition of # Groups: by_elect [5] (grouping variable, and number of groups).

Group and Summarize

Lets chain together group_by() and summarize()

df %>% group_by(br_elect) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())

# A tibble: 5 × 3
  br_elect   mean     n
     <dbl>  <dbl> <int>
1        0 52084.    10
2        1 11348.     7
3        2 11188.    57
4        3 21459.   118
5       NA   NaN      2

What is ugly about this?

Adding in Filtering

is.na() checks if something is missing or not.

df %>% filter(!is.na(br_elect)) %>% 
    group_by(br_elect) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())

# A tibble: 4 × 3
  br_elect   mean     n
     <dbl>  <dbl> <int>
1        0 52084.    10
2        1 11348.     7
3        2 11188.    57
4        3 21459.   118

Tip

The drop_na( ) tidyverse function can replace filter(!is.na( ))

Check

There are several variables that can be used to group countries. Pick one of them, pick an interval variable that you think might vary by the group, and then calculate the number of observations, mean, and median for each group.

My Solution

Grouping variable: br_pvote
Interval variable: van_part
Expectation: Countries with proportional representation (1) will have higher participation

df %>% 
    drop_na(br_pvote) %>% 
    group_by(br_pvote) %>%
    summarize(n=n(), mean=mean(van_part, na.rm=T),
            median=median(van_part, na.rm=T))

Data that I am using

Filtering out observations that are missing a value for br_pvote

Grouping the data frame by br_pvote

Summarizing (number of observations, mean of van_part, median of van_part)

Another Improvement

Our groups could have better names than 0, 1, 2, and 3.

df %>% filter(!is.na(br_elect)) %>% 
    group_by(br_elect) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())

# A tibble: 4 × 3
  br_elect   mean     n
     <dbl>  <dbl> <int>
1        0 52084.    10
2        1 11348.     7
3        2 11188.    57
4        3 21459.   118

Changing variables

`mutate()`

If we want to change the values of a variable we use mutate()

General format:

mutate(new_var = f(old_var))

What function though? br_elect is a small set of numbers where each represent a category. This is a factor.

Applying `factor()`

factor() creates a factor, with a certain set of numeric levels, and labels and so needs somethings:

The variable you are change (the first thing).
The set of numeric levels in the variable. levels=c( )
Labels you want to apply to the levels. labels=c( )

Example of Factor

library(tidyverse)
df %>% mutate(br_elect_label = 
        factor(br_elect, levels=c(0, 1, 2, 3), 
                labels=c("None", "Single-party", 
                "Non-dem multi-party", "Democratic"))) %>%
    pull(br_elect_label)

  [1] Non-dem multi-party Democratic          Non-dem multi-party
  [4] <NA>                Non-dem multi-party Democratic         
  [7] Non-dem multi-party Democratic          Democratic         
 [10] Democratic          Democratic          Non-dem multi-party
 [13] Non-dem multi-party Democratic          Democratic         
 [16] Democratic          Democratic          Democratic         
 [19] Non-dem multi-party Democratic          Democratic         
 [22] Democratic          Democratic          None               
 [25] Democratic          Non-dem multi-party Non-dem multi-party
 [28] Non-dem multi-party Non-dem multi-party Non-dem multi-party
 [31] Democratic          Democratic          Non-dem multi-party
 [34] Democratic          Non-dem multi-party Democratic         
 [37] None                Democratic          Democratic         
 [40] Democratic          Non-dem multi-party Non-dem multi-party
 [43] Democratic          Democratic          Single-party       
 [46] Democratic          Democratic          Democratic         
 [49] Democratic          Democratic          Democratic         
 [52] Democratic          Democratic          Single-party       
 [55] Non-dem multi-party None                Democratic         
 [58] Non-dem multi-party Democratic          Democratic         
 [61] Non-dem multi-party Non-dem multi-party Democratic         
 [64] Democratic          Democratic          Democratic         
 [67] Democratic          Democratic          Democratic         
 [70] Democratic          Democratic          Democratic         
 [73] Non-dem multi-party Non-dem multi-party Democratic         
 [76] Democratic          Democratic          Democratic         
 [79] Single-party        Non-dem multi-party Democratic         
 [82] Democratic          Democratic          Non-dem multi-party
 [85] Democratic          Democratic          Non-dem multi-party
 [88] Non-dem multi-party Non-dem multi-party Single-party       
 [91] Democratic          Non-dem multi-party Democratic         
 [94] Single-party        Non-dem multi-party Democratic         
 [97] Democratic          Democratic          Non-dem multi-party
[100] Democratic          Democratic          Democratic         
[103] Non-dem multi-party Democratic          Democratic         
[106] Non-dem multi-party Democratic          Democratic         
[109] Non-dem multi-party Democratic          Democratic         
[112] Democratic          Democratic          Democratic         
[115] Non-dem multi-party Non-dem multi-party Non-dem multi-party
[118] Non-dem multi-party Non-dem multi-party Democratic         
[121] Democratic          Democratic          Democratic         
[124] Democratic          Non-dem multi-party Democratic         
[127] Democratic          Democratic          Democratic         
[130] Democratic          Democratic          Democratic         
[133] Democratic          Democratic          Democratic         
[136] Democratic          Democratic          Democratic         
[139] Democratic          Democratic          Democratic         
[142] None                Democratic          Non-dem multi-party
[145] Non-dem multi-party Democratic          Democratic         
[148] Democratic          <NA>                Democratic         
[151] None                Democratic          Democratic         
[154] Democratic          Democratic          Non-dem multi-party
[157] Democratic          Single-party        Democratic         
[160] None                Non-dem multi-party Non-dem multi-party
[163] Democratic          None                Non-dem multi-party
[166] Democratic          Non-dem multi-party Democratic         
[169] Democratic          Non-dem multi-party Non-dem multi-party
[172] None                Non-dem multi-party Democratic         
[175] Democratic          None                Democratic         
[178] Non-dem multi-party Non-dem multi-party Democratic         
[181] Non-dem multi-party Democratic          Democratic         
[184] Non-dem multi-party Democratic          Non-dem multi-party
[187] Democratic          Non-dem multi-party Democratic         
[190] Non-dem multi-party Single-party        Non-dem multi-party
[193] None                Democratic         
Levels: None Single-party Non-dem multi-party Democratic

Making better labels

df %>% filter(!is.na(br_elect)) %>% 
    mutate(br_elect_label = 
            factor(br_elect, levels=c(0, 1, 2, 3), 
                    labels=c("None", "Single-party", 
                    "Non-dem multi-party", "Democratic"))) %>% 
    group_by(br_elect_label) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())

# A tibble: 4 × 3
  br_elect_label        mean     n
  <fct>                <dbl> <int>
1 None                52084.    10
2 Single-party        11348.     7
3 Non-dem multi-party 11188.    57
4 Democratic          21459.   118

More Complicated Situations

Often there are variables we want to coarsen (take from interval and put them into categories), here we can use the cut() function to cut our variable up into categories.

There are a variety of ways to cut() the most common way is giving it a set of “breaks” where you tell it where you want the bins to be.

set.seed(1) ## Lets me recreate the exact random variables
vector <- rnorm(100) # creating a bunch of random variables 
cut_vector <- cut(vector, breaks=c(-4,-1, -0.5, 0, 0.5, 1, 4))
table(cut_vector)

cut_vector
  (-4,-1] (-1,-0.5]  (-0.5,0]   (0,0.5]   (0.5,1]     (1,4] 
       11        14        21        20        19        15

Adding Labels

You can also add labels to it, but remember N breaks creates N-1 labels.

set.seed(2)
vector <- rnorm(100)
cut_vector <- cut(vector, breaks=c(-4,-1, -0.5, 0, 0.5, 1, 4), 
    labels=c("Lowest", "Low", "Mid-Low", 
            "Mid-High", "High", "Highest"))
table(cut_vector)

cut_vector
  Lowest      Low  Mid-Low Mid-High     High  Highest 
      20       18       16       15       10       21

Using Cut in Mutate

df %>% mutate(
    corrupt = cut(ti_cpi, breaks=c(0, 33, 66, 100), 
            labels=c("Low", "Mid", "High"))
            ) %>%
    group_by(corrupt) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())

# A tibble: 4 × 3
  corrupt   mean     n
  <fct>    <dbl> <int>
1 Low      7008.    64
2 Mid     18948.    89
3 High    45831.    26
4 <NA>      NaN     15

Check

There is a variable that captures the fertility rate, mutate it into a variable with 3 categories one that is around replacement rate (2.1), one that is below it, and one that is above it.

Then group_by() that new variable and count up how many observations in each bin.

(it might help to use range() before identifying the bins)

My Solution

df %>% mutate(
    fert_groups = cut(wdi_fertility, breaks=c(0, 1.8, 2.4, Inf), 
            labels=c("Below", "Replacement", "Above"))
            ) %>%
    group_by(fert_groups) %>% 
    summarize(n = n())

# A tibble: 4 × 2
  fert_groups     n
  <fct>       <int>
1 Below          61
2 Replacement    38
3 Above          86
4 <NA>            9

Wrangling and Summarizing Data in R

Goals for Today

Tidyverse

Data for today

Variables

Filtering Data

Logical Checks

Logical Values

Logical Checks with Data

Filtering Data

Filtering Data - Example

Checking Multiple Things

Logical Operators

Combining Logical Check

Multiple Checks in Filtering

Check

How I did it

Pipes %>%

Example

How else could we have done this?

A Note of Caution

Summarizing Data

Simple Examples

Caution - Multiple Return Values

Filtering and Summarizing

Number of Observations

Check

My Solutions

Grouping

group_by() alone

Group and Summarize

Adding in Filtering

Check

My Solution

Another Improvement

Changing variables

mutate()

Applying factor()

Example of Factor

Making better labels

More Complicated Situations

Adding Labels

Using Cut in Mutate

Check

My Solution

`group_by()` alone

`mutate()`

Applying `factor()`