Introduction to R: Data Manipulation and Summary

Kevin Reuning

Goals for Today

Manipulate data (filter specific rows, select columns).
Use pipes.
Summarize data.

Data for today

Today we are going to use a subset of country data from The Quality of Governance Institute.

Download the data we’ll be using here
You can open it with read_csv() and if you need more help check out the first day of slides.
There is a description of all the variables I’ve included here.
We need the following packages installed: install.packages(c("tidyverse", "gt", "rmarkdown"))

library(readr)
setwd("images")
df <- read_csv("country_data.csv")

Variables

There is a description of all the variables I’ve included here.

For now though we are going to use a few of them:

bl_asymf average schooling years, females and males between 15 and 64 years old.
wdi_expedu general government expenditure on education (current, capital, and transfers) is expressed as a percentage of GDP

Filtering Data

Often you want to select just specific rows of data that meet certain requirements.

Logical Checks

We need to include some more operators to do this:

< less than and > greater than
<= less than or equal to and >= greater than or equal to
== equal to and != not equal to

43 < 4

[1] FALSE

(4*pi)^2 > 5

[1] TRUE
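The two examples above only use < and >. A small sketch of the other operators follows; x is a made-up value used only for illustration.

```r
# x is a made-up value for illustration
x <- 5

x == 5   # TRUE: x is equal to 5
x != 5   # FALSE: x is not different from 5
x <= 5   # TRUE: "less than or equal to" includes 5 itself
x >= 6   # FALSE: 5 is neither greater than nor equal to 6
```

Note that a single = assigns a value, while the double == compares two values.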

Logical Values

The output from these checks is another form of variable called a logical.
We can have vectors of logical values

names <- c("Kevin", "Anne", "Sophie")
names == "Kevin"

[1]  TRUE FALSE FALSE

Logical Checks with Data

We can do the same thing but using a variable from our dataset:

## Returns true if bl_asymf (average schooling) is more than 10.
df$bl_asymf > 10

  [1] FALSE  TRUE FALSE    NA    NA    NA    NA  TRUE  TRUE  TRUE    NA FALSE
 [13] FALSE  TRUE FALSE  TRUE    NA FALSE    NA  TRUE FALSE  TRUE    NA FALSE
 [25]  TRUE FALSE FALSE    NA FALSE FALSE  TRUE    NA FALSE  TRUE    NA  TRUE
 [37] FALSE  TRUE FALSE    NA FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 [49]  TRUE    NA FALSE FALSE FALSE    NA    NA    NA  TRUE  TRUE  TRUE  TRUE
 [61]    NA FALSE    NA FALSE  TRUE FALSE    NA  TRUE    NA FALSE    NA FALSE
 [73] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
 [85]  TRUE  TRUE  TRUE  TRUE FALSE    NA  TRUE FALSE  TRUE FALSE    NA FALSE
 [97]  TRUE FALSE FALSE    NA  TRUE  TRUE    NA FALSE  TRUE FALSE FALSE  TRUE
[109] FALSE FALSE FALSE    NA FALSE  TRUE    NA FALSE FALSE    NA FALSE    NA
[121] FALSE  TRUE    NA  TRUE FALSE FALSE    NA  TRUE    NA    NA    NA FALSE
[133]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE    NA    NA FALSE  TRUE  TRUE
[145] FALSE    NA    NA    NA    NA    NA  TRUE FALSE  TRUE    NA FALSE  TRUE
[157]  TRUE FALSE  TRUE    NA  TRUE FALSE  TRUE    NA FALSE    NA FALSE  TRUE
[169]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA    NA
[181] FALSE  TRUE    NA FALSE  TRUE FALSE  TRUE    NA FALSE    NA  TRUE    NA
[193] FALSE FALSE
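The NA entries in this output come from countries with no recorded value for bl_asymf: comparing a missing value to anything gives a missing result. A minimal sketch, using a made-up vector (schooling) in place of the real column:

```r
# schooling is a made-up vector standing in for df$bl_asymf
schooling <- c(12.5, NA, 7.7)

schooling > 10        # TRUE NA FALSE: the missing value stays missing
is.na(schooling)      # FALSE TRUE FALSE: flags which values are missing
```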

Filtering Data

We can use logical checks to filter our data.

The filter() function is part of the dplyr package in the tidyverse.
The first argument will be the data you want to filter.
The second argument will be the logical check.

Note

Within the filter() call you do not need to use data$ before the variable name. It already knows you are using the data you put in the first argument.

Filtering Data - Example

library(tidyverse)

# df <- read_csv("country_data.csv") ## remember I did this already
filter(df, bl_asymf>10)

cname	ccode	ti_cpi	vdem_academ	wdi_fertility	wdi_afp	bl_asymf	wdi_expedu	wdi_elprodcoal	wef_iu	wdi_foodins	ht_colonial	lp_legor	cai_foetal	cai_mental	cai_physical	ccp_initiat	ccp_market	h_j	wdi_homicides	ccp_strike	wdi_lfpr	br_pvote	br_elect	van_part	bmr_demdur	fh_polity2	vdem_polyarchy	mad_gdppc	top_top1_income_share	wef_sp
Albania	8	36	0.876	1.62	0.643	11	3.61	0	71.8	10	0	3	1	1	1	1	1	0	2.29	1	68.3	1	3	52.8	22	8.08	0.52	1.11e+04	0.0908	57.3
Argentina	32	40	0.935	2.26	0.512	10.2	5.46	2.03	74.3	12.9	2	2	0	0	1	1	2	0	5.32	1	69.2	1	3	58.6	36	8.92	0.779	1.86e+04	0.153	365
Australia	36	77	0.847	1.74	0.438	12.5	5.12	62.9	86.5	3.8	0	1	1	1	1	2	2	1	0.892	3	78.1	0	3	60.8	118	10	0.865	4.98e+04	0.129	852
Austria	40	76	0.973	1.47	0.497	10.8	5.36	8.23	87.7	1.1	0	4	1	1	1	1	2	1	0.967	3	76.6	1	3	58.5	73	10	0.846	4.3e+04	0.0992	579
Armenia	51	35	0.8	1.75	3.92	11.8	2.71	0	64.7	1.1	0		1	1	1	2	96	0	1.69	1	61.6	1	3	51.7	1	6.75	0.472	1.15e+04	0.178	162
Belgium	56	75	0.967	1.62	0.619	11.6	6.41	6.11	88.7	1.1	0	2	1	1	1	2	2	1	1.69	3	68.6	1	3	68.2	125	9.5	0.891	3.98e+04	0.086	704
Botswana	72	61	0.874	2.87	0.858	10.3		96.4	47	21.5	5	1	1	1	1	2	2	0		3	73.1	0	3	32.4	53	8.25	0.686	1.58e+04	0.227	93.7
Belize	84			2.31	0.863	11.3	7.56				5		1	1	1	2	2		37.8	3	67.5	0	3	41.1	38				0.197
Bulgaria	100	42	0.906	1.56	1.11	11.2	4.09	46.2	64.8	1.9	0	3	1	1	1	2	1	0	1.3	1	71.7	1	3	40.7	29	8.92	0.615	1.84e+04	0.182	221
Canada	124	81	0.919	1.5	0.356	12.9		9.84	91	0.7	0	1	1	1	1	2	2	1	1.76	3	78.5	0	3	50.4	152	10	0.849	4.49e+04	0.149	1.03e+03
Sri Lanka	144	38	0.733	2.2	3.65	11.1	2.12	33.7	34.1		5	1	0	0	0	2	2	0	2.42	3	57.9	1	3	57.1	4	6.92	0.628	1.17e+04	0.206	148
Chile	152	67	0.959	1.65	1.3	10.6	5.42	37.1	82.3	3.6	2	2	0	0	0	2	2	1	4.4	2	69	1	3	37.7	29	10	0.863	2.21e+04	0.265	319
Taiwan (Province of China)	158	63	0.897			12.4			92.8		0	4				1	2	1		3		0	3	52.8	23	10	0.84	4.47e+04	0.145	439
Croatia	191	48	0.873	1.47	1.01	12	3.92	20.6	72.7	0.9	0		1	1	1	2	1	1	0.577	1	66.6	1	3	51.6	19	9.33	0.732	2.2e+04	0.104	239
Cuba	192	47	0.117	1.62	1.49	11.1		0			2	3	1	1	1	1	1	0	5.05	3	64.3	0	1	64.4	66	1.67	0.182	8.33e+03	0.145
Cyprus	196	59	0.958	1.33	2.56	11.9	5.78	0	84.4		5	1	1	1	1	2	2	1	1.26	1	74.1	1	3	31	43	10	0.856	2.72e+04	0.117	170
Czechia	203	59	0.942	1.71	0.408	12.9	3.85	53.1	80.7	0	0		1	1	1	2	2	1	0.62	1	76.8	1	3	48.5	26	9.75	0.812	3.07e+04	0.102	397
Denmark	208	88	0.941	1.73	0.486	12.9	7.82	24.5	97.6	1.1	0	5	1	1	1	96	2	1	1.01	3	78.2	1	3	63.9	74	10	0.913	4.63e+04	0.124	662
Estonia	233	73	0.97	1.67	0.949	12.4	4.97	5.33	89.4	0.9	0		1	1	1	2	2	1	2.12	1	79.1	1	3	43.9	28	9.75	0.901	2.74e+04	0.13	235
Fiji	242	55	0.357	2.77	1.12	10.2				2	5	1	1	1	1	2	2	1		2	60.4	1	2	55.4	5	6.33	0.415
Finland	246	85	0.947	1.41	0.919	11.3	6.38	8.3	88.9	2	0	5	1	1	1	1	2	1	1.63	3	77.8	1	3	54.4	102	10	0.88	3.89e+04	0.105	571
France	250	72	0.881	1.88	1	10.3	5.45	2.16	82	0.7	0	2	1	1	1	1	2	1	1.2	3	72	0	3	43.7	73	9.58	0.88	3.85e+04	0.0966	1.03e+03
Germany	276	80	0.971	1.57	0.416	12.3	4.91	44.3	89.7	0.7	0		1	1	1	2	1	1	0.948	3	78.5	1	3	57.3	29	10	0.878	4.62e+04	0.129	1.13e+03
Greece	300	45	0.854	1.35	3.08	11.1		42.7	73	2.3	0	2	1	1	1	2	2	1	0.941	1	68.4	1	3	63.5	45	9.58	0.858	2.35e+04	0.109	434
Hungary	348	46	0.467	1.55	0.841	12	4.67	19.5	76.1	0.8	0	3	1	1	1	1	96	1	2.49	1	71.9	1	3	59.2	29	8.33	0.489	2.56e+04	0.12	391
Ireland	372	73	0.94	1.75	0.363	12.4	3.51	17.3	84.5	3.5	0	1	0	0	0	2	2	1	0.872	3	73.2	1	3	44.8	97	10	0.88	6.47e+04	0.12	451
Israel	376	61	0.945	3.09	4.33	12.7	6.09	45.4	81.6	1.7	0	1	1	1	1			1	1.49		72.4	1	3	50.8	71	7.75	0.7	3.3e+04	0.165	624
Italy	380	52	0.967	1.29	1.31	11	4.04	16.1	74.4	1.1	0	2	1	1	1	1	2	1	0.569	2	65.7	1	3	58.1	73	10	0.867	3.44e+04	0.0913	897
Jamaica	388	44	0.943	1.98	0.404	10.6	5.41	0	55.1		5	1	0	1	1	2	2	0	43.9	3	70.4	0	3	30.2	57	8.92	0.812	7.27e+03	0.197	89.7
Japan	392	73	0.711	1.42	0.382	12.8	3.18	33.2	84.6	0.7	0	4	0	0	1	2	2	1	0.263	3	79.1	0	3	44.9	67	10	0.827	3.87e+04	0.131	919
Kazakhstan	398	31	0.338	2.84	0.786	11.4	2.62	71.6	78.9	0	0		1	1	1	2	2		5.06	1	76.6	1	2	52.7	28	1.83	0.236	2.53e+04	0.154	83.7
Jordan	400	49	0.326	2.76	3.93	10.2	3.03	0	66.8		5	2	1	1	1	2	2	1	1.36	3	41.8	0	2	8.95	73	3.42	0.27	1.15e+04	0.174	143
Korea (the Republic of)	410	57	0.836	0.977	2.15	12.8	4.33	43.1	95.9	0	0	4	1	1	1	2	2	1	0.604	1	68.9	0	3	55.7	31	8.67	0.868	3.79e+04	0.149	579
Kyrgyzstan	417	29	0.621	3.3	0.827	11	6.03	13.2	38	0.8	0		1	1	1	1	96	1	2.19	1	62.5	1	3	27.6	28	6.58	0.465	5.18e+03	0.145	57.7
Latvia	428	58	0.965	1.6	0.69	11.7	4.4	0	83.6	0.6	0		1	1	1	1	2	1	4.36	1	78.1	1	3	43.8	26	8.67	0.833	2.43e+04	0.0969	141
Lithuania	440	59	0.938	1.63	2.69	11.8	3.81	0	79.7	1.1	0		1	1	1	1	2	1	4.57	1	77.6	0	3	44.3	27	10	0.824	2.74e+04	0.113	182
Luxembourg	442	81	0.946	1.38	0.628	12	3.57	0	97.1	0.9	0	2	1	1	1	2	2	1	0.338	1	70.8	1	3	39.5	129	10	0.874	5.74e+04	0.101	153
Malaysia	458	47	0.504	2	0.876	11.4	4.48	42.3	81.2	6.7	5	1	0	1	1	2	2	1		3	68.5	0	3	38.9	62	6.75	0.383	2.48e+04	0.149	251
Malta	470	54	0.895	1.23	0.739	11.8	4.82	0	81.4	0.8	0	2	0	0	0	2	2	1	1.59	3	73.2	1	3	67.6	55		0.756	3.2e+04	0.0961	104
Moldova (the Republic of)	498	33	0.846	1.26	0.698	11.2	5.44	0	76.1	4	0		1	1	1	2	1	1	4.1	1	43.7	1	3	46	28	7.67	0.526	6.75e+03	0.0974	97.7
Netherlands (the)	528	82	0.93	1.59	0.449	11.8	5.18	38.7	94.7	1.7	0	2	1	1	1	2	2	1	0.586	3	80.3	1	3	61.7	122	10	0.876	4.75e+04	0.0699	895
New Zealand	554	87	0.897	1.71	0.341	11	6.28	4.25	90.8	4.4	0	1	1	1	1	2	2	1	0.744	3	81.1	1	3	45.3	162	10	0.892	3.53e+04	0.119	461
Norway	578	84	0.934	1.56	0.83	12.7	7.91	0.105	96.5	1.1	0	5	1	1	1	2	2	1	0.468	3	77.8	1	3	55.8	119	10	0.889	8.46e+04	0.109	532
Panama	591	37	0.901	2.46	1.28	10.1		6.92	57.9		2	2	1	0	0	96	2	0	9.39	1	71.4	1	3	52.3	28	9.33	0.756	2.26e+04	0.197	174
Poland	616	60	0.943	1.46	1.07	11.8	4.56	80.9	77.5	0.5	0	3	1	1	1	1	2	1	0.73	1	70.4	1	3	40.3	30	9.17	0.695	2.75e+04	0.147	481
Romania	642	47	0.91	1.76	1.4	11.4	3.1	27.6	70.7	3.4	0	3	1	1	1	1	1	0	1.28	1	67.9	1	3	43.8	28	8.92	0.672	2.01e+04	0.137	228
Russian Federation (the)	643	28	0.376	1.58	1.97	12.1	4.69	14.8	80.9	0.5	0		1	1	1	2	2	0	8.21	1	74.4	0	2	43.7	20	3.92	0.27	2.47e+04	0.215	503
Saudi Arabia	682	49	0.074	2.32	1.8	10.1		0	93.3		0	1	0	1	1	2	2	0	1.27	3	57.5	0	0	0	93	0	0.016	5.03e+04	0.209	274
Serbia	688	39	0.725	1.49	0.992	11.7	3.59	72.4	73.4	2	0		1	1	1	1	1	0	1.23	1	67.4	1	3	53.5	13	7.83	0.348	1.41e+04	0.108	180
Singapore	702	85	0.466	1.14	1.69	12.8		1.2	88.2	1.4	5	1	1	1	1	2	2	1	0.156	3	77	0	2	40.6	54	4.5	0.387	6.84e+04	0.142	494
Slovakia	703	50	0.945	1.54	0.576	12	3.94	12.5	80.7	0.8	0		1	1	1	1	1	1	1.14	1	72.5	1	3	44.8	26	9.58	0.815	2.71e+04	0.0785	242
Slovenia	705	60	0.953	1.61	0.699	12.2	4.78	29.6	79.7	0.5	0		1	1	1	1	1	1	0.481	1	75.1	1	3	39.2	28	10	0.838	2.92e+04	0.0803	255
South Africa	710	43	0.772	2.4	0.391	10.2	6.16	92.7	56.2	19.3	5	1	1	1	1	2	2	0	35.9	1	60	1	2	34.5	25	8.92	0.738	1.22e+04	0.193	392
Spain	724	58	0.95	1.26	0.851	10.9	4.21	19	86.1	1.8	0	2	1	1	1	1	1	1	0.621	1	74.1		3	52	42	10	0.86	3.15e+04	0.125	776
Sweden	752	85	0.964	1.76	0.281	12	7.57	0.667	92.1	1.2	0	5	1	1	1	2	2	1	1.08	96	83	1	3	64.9	108	10	0.909	4.55e+04	0.094	779
Switzerland	756	85	0.959	1.52	0.433	12.2	5.13	0	89.7	0.7	0	4	1	1	1	96	2	1	0.586	1	84.2	1	3	46	171	10	0.896	6.14e+04	0.11	868
Tajikistan	762	25	0.087	3.59	0.723	10.8	5.23	1.53	22		0		1	1	1	2	2	0		3	42.1	0	2	46.1	28	2.17	0.17	4.44e+03	0.149	38
Tonga	776			3.56		11.3				6	5	1	0	0	0	2	2			3	49.3	0	3	18.3	49
Trinidad and Tobago	780	41	0.816	1.73	0.605	11.2		0	77.3		5	1	0	1	1	2	2	0	30.6	3	68.7	0	3	54	57	9.17	0.752	2.85e+04	0.197	88.3
United Arab Emirates (the)	784	70	0.123	1.41	0.933	11.7		0	98.5		5	1	1	1	1	2	2	1	0.464	3	82.8	0	0	0	48	0.917	0.095	7.64e+04	0.158	171
Ukraine	804	32	0.448	1.3	1.45	11.6	5.41	34.6	58.9	1.6	0		1	1	1	2	2	1	6.18	1	66.6	1	3	40.8	28	6.42	0.405	9.81e+03	0.0978	229
United Kingdom of Great Britain and Northern Ireland (the)	826	80	0.926	1.68	0.432	12.9	5.44	22.8	94.9	1.3	0	1	1	1	1	2	2	1	1.2	3	77.7	0	3	48.8	134	9.5	0.874	3.81e+04	0.13	1.29e+03
United States of America (the)	840	71	0.91	1.73	0.833	12.8		34.2	87.3	0.8	0	1	1	1	1	2	2	1	4.96	3	72.6	0	3	70	219	9.08	0.831	5.53e+04	0.19	2.09e+03
Venezuela (Bolivarian Republic of)	862	18	0.259	2.27	2.72	10.2		0	72		2	2	0	0	0	1	2		36.7	1	64.7	1	1	29.4	14	2.17	0.215	1.07e+04	0.197	193

If we want we can save the new data as well:

sub_df <- filter(df, bl_asymf>10)
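It is worth knowing that filter() keeps only the rows where the check is TRUE, so rows where the check is NA are dropped along with the FALSE ones. A small sketch, assuming a hypothetical toy data frame in place of the country data:

```r
library(dplyr)

# toy is a hypothetical stand-in for the country data
toy <- tibble(
  cname    = c("A", "B", "C", "D"),
  bl_asymf = c(12.5, NA, 7.7, 10.8)
)

toy_sub <- filter(toy, bl_asymf > 10)

toy_sub$cname   # "A" "D": row B (NA) and row C (FALSE) are both gone
nrow(toy)       # 4
nrow(toy_sub)   # 2
```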

Checking Multiple Things

What if we want to check whether our rows meet multiple conditions? Then we need logical operators.

Logical Operators

We can reverse a logical value with ! (e.g. !TRUE is FALSE)
We have and and or operators to check multiple logical values.
- and is &
- or is | (shift + backslash)
& returns TRUE if both values are TRUE
| returns TRUE if at least one value is TRUE

TRUE & TRUE 
TRUE | FALSE 
TRUE | TRUE

All would return TRUE
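These operators also work element by element on logical vectors, which is what makes them useful for filtering. A short sketch with two made-up vectors (a and b) that covers every combination of TRUE and FALSE:

```r
# a and b are made-up logical vectors for illustration
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b   # TRUE FALSE FALSE FALSE: only TRUE where both are TRUE
a | b   # TRUE TRUE TRUE FALSE: TRUE where at least one is TRUE
!a      # FALSE FALSE TRUE TRUE: every value reversed
```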

Combining Logical Check

We can then combine logical checks together.

val <- pi^(2/3)
(val < 1) | (val > 4) # Is it less than 1 or greater than 4?

[1] FALSE

Multiple Checks in Filtering

Let's collect countries with more than 10 years of average education that spend less than 5% of their GDP on education:

sub_df <- filter(df, bl_asymf > 10 & wdi_expedu < 5)
sub_df$cname

 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"

Check

Create two new datasets.

Only countries that spent more than 10% of their GDP on education
Countries that have an average education between 5 and 8 years

How I did it

sub_df1 <- filter(df, wdi_expedu > 10)
sub_df1$cname

[1] "Micronesia (Federated States of)"

sub_df2 <- filter(df, bl_asymf > 5 & bl_asymf < 8)
sub_df2$cname

 [1] "Algeria"                               
 [2] "Bangladesh"                            
 [3] "Myanmar"                               
 [4] "Cameroon"                              
 [5] "Congo (the)"                           
 [6] "Congo (the Democratic Republic of the)"
 [7] "Benin"                                 
 [8] "El Salvador"                           
 [9] "Guatemala"                             
[10] "Haiti"                                 
[11] "Honduras"                              
[12] "India"                                 
[13] "Iraq"                                  
[14] "Kenya"                                 
[15] "Kuwait"                                
[16] "Lao People's Democratic Republic (the)"
[17] "Lesotho"                               
[18] "Malawi"                                
[19] "Maldives"                              
[20] "Mauritania"                            
[21] "Morocco"                               
[22] "Namibia"                               
[23] "Nepal"                                 
[24] "Nicaragua"                             
[25] "Pakistan"                              
[26] "Rwanda"                                
[27] "Viet Nam"                              
[28] "Eswatini"                              
[29] "Syrian Arab Republic (the)"            
[30] "Togo"                                  
[31] "Turkey"                                
[32] "Uganda"                                
[33] "Egypt"                                 
[34] "Tanzania, the United Republic of"      
[35] "Zambia"
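An alternative worth knowing for "between" checks is dplyr's between() helper. It is not what the solution above uses, and it behaves slightly differently: between() includes both endpoints, while the > and < version excludes them. A sketch on a hypothetical toy data frame:

```r
library(dplyr)

# toy is a hypothetical stand-in for the country data
toy <- tibble(
  cname    = c("A", "B", "C", "D"),
  bl_asymf = c(5, 6.2, 8, 9.1)
)

# Strict version, as in the solution: excludes exactly 5 and exactly 8
strict <- filter(toy, bl_asymf > 5 & bl_asymf < 8)$cname
strict      # "B"

# between() is inclusive: same as bl_asymf >= 5 & bl_asymf <= 8
inclusive <- filter(toy, between(bl_asymf, 5, 8))$cname
inclusive   # "A" "B" "C"
```

Which version is right depends on whether countries sitting exactly on a boundary should count.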

Pipes %>%

Tidyverse syntax makes use of pipes to chain multiple functions together.

You use the pipe operator (%>%) in between each step.
This operator is like saying “take the output from the previous function and put it in the next function”

For example (in pseudo-code):

Output <- Step 1(Input) %>% Step 2() %>% Step 3()

Translation: Take the Input, apply Step 1 to it, then take the output of Step 1 and apply Step 2 to it, then take the output of Step 2 and apply Step 3 to it, and finally store the output of Step 3 as Output.
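The pseudo-code can be made concrete with a few base R functions. This is only a sketch: the input vector and the choice of sqrt(), mean() and round() as the three steps are made up for illustration.

```r
library(dplyr)   # loading dplyr makes the %>% pipe available

# input is a made-up vector for illustration
input <- c(1, 4, 9)

# Without pipes: nested calls, read from the inside out
output_nested <- round(mean(sqrt(input)), 1)

# With pipes: the same three steps, read left to right
output_piped <- input %>% sqrt() %>% mean() %>% round(1)

output_nested   # 2
output_piped    # 2
```

Both versions do the same work. The piped one lists the steps in the order they happen.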

Example

filter(df, bl_asymf > 10 & wdi_expedu < 5) %>% pull(cname)

 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"

What does the pull() function do? It pulls out a column from your data.

How else could we have done this?

pull(filter(df, bl_asymf > 10 & wdi_expedu < 5), cname)

 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"

filter(df, bl_asymf > 10 & wdi_expedu < 5)$cname

 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"

sub_df <- filter(df, bl_asymf > 10 & wdi_expedu < 5)
sub_df$cname

 [1] "Albania"                  "Armenia"                 
 [3] "Bulgaria"                 "Sri Lanka"               
 [5] "Croatia"                  "Czechia"                 
 [7] "Estonia"                  "Germany"                 
 [9] "Hungary"                  "Ireland"                 
[11] "Italy"                    "Japan"                   
[13] "Kazakhstan"               "Jordan"                  
[15] "Korea (the Republic of)"  "Latvia"                  
[17] "Lithuania"                "Luxembourg"              
[19] "Malaysia"                 "Malta"                   
[21] "Poland"                   "Romania"                 
[23] "Russian Federation (the)" "Serbia"                  
[25] "Slovakia"                 "Slovenia"                
[27] "Spain"

A Note of Caution

The %>% pipe has been around for a while in the tidyverse.
Base R later added its own native pipe (in version 4.1), but it is written |> instead.
In most cases %>% behaves the same as |>.

Yes this is all kind of silly and strange.
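For a simple chain like the ones in these slides, the two pipes can be swapped freely. A sketch on a hypothetical toy data frame, assuming R 4.1 or newer so that |> is available:

```r
library(dplyr)

# toy is a hypothetical stand-in for the country data
toy <- tibble(
  cname    = c("A", "B", "C"),
  bl_asymf = c(12.5, 7.7, 10.8)
)

with_magrittr <- toy %>% filter(bl_asymf > 10) %>% pull(cname)
with_native   <- toy |> filter(bl_asymf > 10) |> pull(cname)

identical(with_magrittr, with_native)   # TRUE
```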

Summarizing Data

One of the most useful tidyverse functions is summarize().

summarize() transforms data by applying one or more functions to columns in the data.
The first argument will be the data, the rest of the arguments will be functions you want to apply to it.
The output will be a smaller data frame where the columns are the output from each function it applied.

Simple Examples

What if we want to figure out the average average education for all countries in our data?

summarize(df, mean(bl_asymf, na.rm=TRUE))

mean(bl_asymf, na.rm = TRUE)
9.11

What if we want to calculate other statistics?

summarize(df, mean(bl_asymf, na.rm=TRUE), 
            sd(bl_asymf, na.rm=TRUE), 
            median(bl_asymf, na.rm=TRUE))

mean(bl_asymf, na.rm = TRUE)	sd(bl_asymf, na.rm = TRUE)	median(bl_asymf, na.rm = TRUE)
9.11	2.71	9.52

Caution - Multiple Return Values

You generally want to use functions that only return 1 value. Why?

summarize(df, mean(bl_asymf, na.rm=TRUE), 
            sd(bl_asymf, na.rm=TRUE), 
            median(bl_asymf, na.rm=TRUE), 
            range(bl_asymf, na.rm=TRUE))

mean(bl_asymf, na.rm = TRUE)	sd(bl_asymf, na.rm = TRUE)	median(bl_asymf, na.rm = TRUE)	range(bl_asymf, na.rm = TRUE)
9.11	2.71	9.52	2.43
9.11	2.71	9.52	12.9
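The output above shows why: range() returns two values (the minimum and the maximum), so summarize() produces two rows and repeats the other statistics on each. Recent versions of dplyr also warn about this. One way around it, sketched here on a made-up toy data frame, is to ask for the minimum and maximum separately so that every function returns a single value:

```r
library(dplyr)

# toy is a hypothetical stand-in for the country data
toy <- tibble(bl_asymf = c(2.4, 9.5, NA, 12.9))

res <- summarize(toy,
                 mean_ed = mean(bl_asymf, na.rm = TRUE),
                 min_ed  = min(bl_asymf, na.rm = TRUE),
                 max_ed  = max(bl_asymf, na.rm = TRUE))

nrow(res)   # 1: one row, with the minimum and maximum in their own columns
```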

Filtering and Summarizing

What if we want to figure out the average education for countries that spend less than 5% of their GDP on education?

df %>% filter(wdi_expedu < 5) %>% summarize(mean(bl_asymf, na.rm=T))

mean(bl_asymf, na.rm = T)
8.81

We can improve the output by changing the column name: summarize(col_name = mean(variable))

df %>% filter(wdi_expedu < 5) %>% 
    summarize("Mean"=mean(bl_asymf, na.rm=T))

Mean
8.81

Note

You can spread a chain of pipes over multiple lines. It is common to put the pipe at the end of each line and indent the next line.

Number of Observations

There is also a function specifically for the number of observations: n()

df %>% filter(wdi_expedu < 5) %>% summarize(n())

n()
89

Check

Find the mean and median average education and education expenditure for countries with a GDP per capita (mad_gdppc) of more than 10,000.

My Solutions

df %>% filter(mad_gdppc > 10000) %>% 
    summarize(mean_exp=mean(wdi_expedu, na.rm=TRUE), 
        median_exp=median(wdi_expedu, na.rm=TRUE), 
        mean_ed=mean(bl_asymf, na.rm=TRUE), 
        median_ed=median(bl_asymf, na.rm=TRUE))

mean_exp	median_exp	mean_ed	median_ed
4.66	4.54	10.8	11.1

Grouping

Often we want to provide summaries of groups within the data. For example: how does the GDP vary by election type? br_pvote is an indicator for having proportional representation.

Here we’ll use the group_by() function to create groups of our data.

`group_by()` alone

group_by() expects variable(s) that you want to use to group your dataset:

df %>% group_by(br_pvote)

# A tibble: 194 × 31
# Groups:   br_pvote [3]
   cname      ccode ti_cpi vdem_academ wdi_fertility wdi_afp bl_asymf wdi_expedu
   <chr>      <dbl>  <dbl>       <dbl>         <dbl>   <dbl>    <dbl>      <dbl>
 1 Afghanist…     4     16      0.560           4.47   2.64      4.83       4.06
 2 Albania        8     36      0.876           1.62   0.643    11.0        3.61
 3 Algeria       12     35      0.338           3.02   2.52      7.71      NA   
 4 Andorra       20     NA     NA              NA     NA        NA          3.25
 5 Angola        24     19      0.440           5.52   0.921    NA         NA   
 6 Antigua a…    28     NA     NA               1.99  NA        NA         NA   
 7 Azerbaijan    31     25      0.0770          1.73   1.61     NA          2.46
 8 Argentina     32     40      0.935           2.26   0.512    10.2        5.46
 9 Australia     36     77      0.847           1.74   0.438    12.5        5.12
10 Austria       40     76      0.973           1.47   0.497    10.8        5.36
# ℹ 184 more rows
# ℹ 23 more variables: wdi_elprodcoal <dbl>, wef_iu <dbl>, wdi_foodins <dbl>,
#   ht_colonial <dbl>, lp_legor <dbl>, cai_foetal <dbl>, cai_mental <dbl>,
#   cai_physical <dbl>, ccp_initiat <dbl>, ccp_market <dbl>, h_j <dbl>,
#   wdi_homicides <dbl>, ccp_strike <dbl>, wdi_lfpr <dbl>, br_pvote <dbl>,
#   br_elect <dbl>, van_part <dbl>, bmr_demdur <dbl>, fh_polity2 <dbl>,
#   vdem_polyarchy <dbl>, mad_gdppc <dbl>, top_top1_income_share <dbl>, …

The only change is the addition of # Groups: br_pvote [3] (the grouping variable and the number of groups).

Group and Summarize

Let's chain together group_by() and summarize():

df %>% group_by(br_pvote) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())

br_pvote	mean	n
0	1.79e+04	95
1	1.97e+04	93
	1.39e+04	6

What is ugly about this?

Adding in Filtering

is.na() checks if something is missing or not.

df %>% filter(!is.na(br_pvote)) %>% 
    group_by(br_pvote) %>% 
    summarize(mean=mean(mad_gdppc, na.rm=T), n = n())

br_pvote	mean	n
0	1.79e+04	95
1	1.97e+04	93

Tip

The drop_na() function (from tidyr, part of the tidyverse) can replace filter(!is.na()).
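A quick sketch of the two approaches side by side, using a hypothetical toy data frame rather than the country data:

```r
library(dplyr)
library(tidyr)   # drop_na() lives in tidyr, which library(tidyverse) also loads

# toy is a hypothetical stand-in for the country data
toy <- tibble(
  cname    = c("A", "B", "C"),
  br_pvote = c(0, NA, 1)
)

via_filter  <- toy %>% filter(!is.na(br_pvote))
via_drop_na <- toy %>% drop_na(br_pvote)

via_filter$cname    # "A" "C"
via_drop_na$cname   # "A" "C"
```

If drop_na() is called without naming any columns, it drops rows that have a missing value in any column, which is often more than you want.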

Check

There are several variables that can be used to group countries. Pick one of them, pick an interval variable that you think might vary by the group, and then calculate the number of observations, mean, and median for each group.

There is a description of all the variables I’ve included here.

My Solution

Grouping variable: br_pvote
Interval variable: van_part
Expectation: Countries with proportional representation (1) will have higher participation

df %>% 
    drop_na(br_pvote) %>% 
    group_by(br_pvote) %>%
    summarize(n=n(), mean=mean(van_part, na.rm=T),
            median=median(van_part, na.rm=T))

Data that I am using

Filtering out observations that are missing a value for br_pvote

Grouping the data frame by br_pvote

Summarizing (number of observations, mean of van_part, median of van_part)

Saving Results

There are two ways to save our summary results. Both can be helpful depending on what you are doing:

write_csv(): Writes to a CSV file.
Creating an exportable table.
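The rest of these slides focus on the second option, so here is a minimal sketch of the first. The toy_summary data frame and the file name are made up for illustration, and the file is written to a temporary folder rather than the working directory.

```r
library(readr)

# toy_summary is a hypothetical stand-in for a summarize() result
toy_summary <- data.frame(br_pvote = c(0, 1), n = c(95, 93))

# Write to a temporary folder so the sketch does not clutter your project
out_file <- file.path(tempdir(), "summary_table.csv")
write_csv(toy_summary, out_file)

# Read it back in to confirm what was written
check <- read_csv(out_file, show_col_types = FALSE)
check
```

In practice you would pass your own summary and a path such as "summary_table.csv" so the file lands in your working directory.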

gt

We are going to use: gt

The gt package is a lot, so we are not going to get to it all, but it lets you do a lot of things:

Convert dataframe into a table.
Format tables (borders, colors, alignment)
Export that table in a lot of formats (html, docx, excel, latex…)

Get a Table to Export

tab_out <- df %>% 
    drop_na(br_pvote) %>% 
    group_by(br_pvote) %>%
    summarize(n=n(), mean=mean(van_part, na.rm=T),
            median=median(van_part, na.rm=T))

Using gt

The function gt() will create a table object that we can then modify. Let's see what happens when we make a table.

library(gt)
tab_fmted <- gt(tab_out)
tab_fmted

br_pvote	n	mean	median
0	95	33.36337	33.235
1	93	44.73456	44.355

Using gt - Styling

We can then modify the style by piping it into functions like opt_stylize() and cols_align().

library(gt)
tab_fmted <- gt(tab_out) %>% 
    opt_stylize(style=1, color="green") %>%
    cols_align(align="center", columns="br_pvote") 
tab_fmted

br_pvote	n	mean	median
0	95	33.36337	33.235
1	93	44.73456	44.355

Using gt - Labels

We can also change the labeling of our columns easily with cols_label(), giving the actual column name and what we want it to be called.

library(gt)
tab_fmted <- gt(tab_out) %>% 
    opt_stylize(style=1, color="green") %>%
    cols_align(align="center", columns="br_pvote") %>%
    cols_label(
        br_pvote = "Proportional?", 
        n = "N", mean = "Mean", 
        median = "Median"
    )
tab_fmted

Proportional?	N	Mean	Median
0	95	33.36337	33.235
1	93	44.73456	44.355

Using gt - Modifying Contents

We can also modify values using the text_case_match() function. This one will check if a cell matches what is on the left side of the tilde (~) and replace it with the right side.

tab_fmted <- tab_fmted %>% 
    text_case_match(
        "1" ~ "Yes",
        "0" ~ "No"
    )
tab_fmted

Proportional?	N	Mean	Median
No	95	33.36337	33.235
Yes	93	44.73456	44.355

Using gt - Table Headers and Notes

Finally, tab_header() and tab_source_note() can be used to add other information about your table:

tab_fmted <- tab_fmted %>% 
    tab_header("Countries by Election Type") %>% 
    tab_source_note("Data from QoGs")
tab_fmted

Countries by Election Type
Proportional?	N	Mean	Median
No	95	33.36337	33.235
Yes	93	44.73456	44.355
Data from QoGs

Saving the Document

And now we export it with gtsave()

gtsave(tab_fmted, filename="Table.docx")
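gtsave() picks the output format from the file extension, so the same call can produce other formats. A small sketch that saves an HTML version to a temporary folder. The toy_tab table and the file name are made up for illustration.

```r
library(gt)

# toy_tab is a hypothetical stand-in for tab_fmted
toy_tab <- gt(data.frame(group = c("No", "Yes"), n = c(95, 93)))

# The .html extension tells gtsave() to write an HTML file
gtsave(toy_tab, filename = "toy_table.html", path = tempdir())

file.exists(file.path(tempdir(), "toy_table.html"))
```

As far as I know, saving to .docx relies on the rmarkdown package, which is why it is in the install list at the start.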

gt Options and Check

There are a lot of options to modify your table here.

Try to see if you can change the mean and median to be listed in percentages and then make the number of observations bold.

My Solution

Using my table from before:

tab_fmted %>%
    fmt_percent(columns=c(mean, median), scale_values=FALSE) %>% 
    tab_style(locations=cells_body(n), 
        style=cell_text(weight="bold"))

Countries by Election Type
Proportional?	N	Mean	Median
No	95	33.36%	33.24%
Yes	93	44.73%	44.36%
Data from QoGs
