Kevin Reuning

Who am I

• I’m Kevin Reuning (ROY-ning).
• I’m an Assistant Professor in Political Science.
• Prior to grad school I had very little experience in coding.

Goals For this Bootcamp

• Not be afraid of R/RStudio.
• Able to load data in and calculate useful statistics with it.
• Make a variety of beautiful plots.

Where We Are Going

library(tidyverse)
setwd("images")
# Assumption: gdp_per_capita and military_pct are placeholder column names;
# substitute the GDP and military-share columns in your own data.
df %>% mutate(type = cut(fh_polity2, breaks=c(0, 3, 7, 10),
                         labels=c("Autocracy", "Anocracy", "Democracy"))) %>%
  ggplot(aes(x=gdp_per_capita, y=military_pct)) +
  geom_smooth(method="lm", color='black') +
  geom_point(color='orangered3') + facet_wrap(~type) +
  scale_x_log10(labels=scales::label_dollar()) +
  theme_minimal() + theme(strip.text=element_text(size=20)) +
  labs(y="Percent of Labor\nForce in Military",
       x="GDP per Capita\n(Log scale)")

Goals for Today

• Start using R and RStudio, realize you cannot break it.
• How to use R as a calculator.
• Understand the basics of variables and functions in R.
• Load data into R and calculate the average of different variables.

R and RStudio

• R is a statistical language used to do analysis.
• R is free.
• R makes it easy to create reproducible analysis.
• RStudio is an interface that sits on top of R and makes life easier.

Following along

You need to learn by doing. If you haven’t opened RStudio yet, do so now. You should have something like:

Rscripts

• R Scripts allow you to save and re-run everything you did to your data. This is incredibly helpful.
• To start a new R Script: File $\rightarrow$ New File $\rightarrow$ R Script.
• When you are done you can save the R Script.

Rscripts in RStudio

You should now have 4 panes.

Running things in R/RStudio

To run parts of your script in R you have to do two things: 1) indicate what you want to run, and 2) tell RStudio to run it.

• Indicating: In the Script part of the Window you will highlight large blocks of code or leave your cursor on the specific line you want to run.
• Running: Click the button that says “Run” at the top right of the script window, or hit Ctrl + Enter (Windows) or Command + Return (Mac).

Throughout this I will show you code and the output in R (Note: You can leave comments to yourself using “#” and R won’t run that line).

## The code is here
1
[1] 1

The output will be immediately below it. This should be similar to how you’ll have code in the RScript pane and the results in the pane below.
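As a small sketch of how comments behave (the variable name result is only for illustration): anything after a # on a line is ignored, so a comment can sit on its own line or after a piece of code.

```r
# This whole line is a comment, so R skips it
result <- 2 + 2  # a comment can also follow code on the same line
result           # prints [1] 4
# 2 + 100 is never run because the line starts with #
```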

R as a calculator

R can be used to add, subtract, multiply, and divide. Type something similar in the RScript pane, highlight it, and then click “Run”.

1 + 2
[1] 3
3 - 4
[1] -1
5 * 6
[1] 30
7/3
[1] 2.333333

R as a calculator

You can also exponentiate things and even access special numbers such as $\pi$

5^2
[1] 25
pi
[1] 3.141593
2*pi
[1] 6.283185
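A few more calculator operations that are not on the slides but are part of base R, shown as a quick sketch: order of operations, square roots, integer division, and remainders.

```r
2 + 3 * 4    # multiplication happens first: 14
(2 + 3) * 4  # parentheses change the order: 20
sqrt(16)     # square root: 4
7 %/% 3      # integer division: 2
7 %% 3       # remainder after division: 1
```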

• The hardest part of learning to code is that there are a lot of rules but also a lot of flexibility. Sometimes you have to be very precise and sometimes you don’t.
• You’ll learn these rules as you try different things. Don’t be afraid to try to break R.
• You can (and should) save scripts so you can re-run and change things. Every R script should be a self-contained world.

Spaces

Spaces (or not) between things often don’t matter

3 * 2 
[1] 6
3*2 
[1] 6
3          *      2
[1] 6

Check

At this point you should have RStudio open, and be able to type things in the RScript pane and run it in the console pane below. Your screen will look something like this.

Variables

In R you can store information in order to retrieve it later. These are called variables.

You use an arrow <- (less than and dash) to save a value as a variable.

a_variable <- 1 # Running this alone will return nothing
a_variable # Calling the variable by itself returns its value
[1] 1
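Once a value is stored, the variable can stand in for it in any calculation. A minimal sketch, with made-up names and values (price, quantity, and total are hypothetical):

```r
price <- 20                # hypothetical example values
quantity <- 3
total <- price * quantity  # use the variables just like numbers
total                      # prints [1] 60
```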

Variable Names

You can name a variable almost anything using letters, numbers, and a few symbols such as _ and .

therearenoreallimitsonhowlongavariablenameis <- 1
Variable_1 <- 4 

There are some limits: names cannot start with a number and cannot use some symbols.

1_variable <- 2 # variables cannot start with a number though 
Error: <text>:1:2: unexpected input
1: 1_
^
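A short sketch of names that do work, plus one rule not mentioned above: R is case sensitive, so myvalue and MyValue are two different variables. All names here are made up for illustration.

```r
my_variable <- 1
my.variable <- 2    # periods are allowed too
Variable2 <- 3      # numbers are fine anywhere except the first character
myvalue <- 10
MyValue <- 20       # a different variable: capitalization matters
myvalue + MyValue   # prints [1] 30
```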

RStudio and Variables

One of the benefits of RStudio is that it will show the stored variables in the top right pane.

Here I’ve saved the number 341.24 to the variable bank_account_balance

Overwriting Variables

There is nothing stopping you from saving on top of the variable with a new value.

bank_account_balance <- 341.24
bank_account_balance <- 341.24 + 100
bank_account_balance
[1] 441.24

You can do math to a variable and save it back to itself

bank_account_balance <- bank_account_balance - 1000
bank_account_balance
[1] -558.76

Warning

Modifying and then saving a variable as itself can lead to mistakes so be careful.
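One way to lower that risk, sketched here with hypothetical variable names, is to save each step under a new name so the earlier values stay available and re-running a line does not change the answer.

```r
starting_balance <- 341.24
after_deposit <- starting_balance + 100  # new name instead of overwriting
after_rent <- after_deposit - 1000
starting_balance  # still 341.24
after_rent        # prints [1] -558.76
```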

Types of Variables

There are two base types of variables you can use, plus a third that builds on them:

• Numeric/Doubles: These are just numbers. Write them without commas (1000, not 1,000).
• Strings: This is anything as long as it is surrounded by " " (quotation marks).
• Factors: This is a combination of the two and we’ll discuss it more later.
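If you are unsure what type something is, the base R function class() will tell you. This is a small sketch rather than part of the original slides; note that R's own word for a string is "character".

```r
class(341.24)                    # "numeric"
class("hello")                   # "character" (R's name for strings)
class(factor(c("yes", "no")))    # "factor"
```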

String Examples

hello <- "I'm learning how to do R"
hello 
[1] "I'm learning how to do R"

Anything with quotes around it will be treated as a string.

a_number <- "1234"
a_number 
[1] "1234"
a_number * 12
Error in a_number * 12: non-numeric argument to binary operator
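If a number has ended up stored as a string, the base R function as.numeric() converts it back. A minimal sketch using the same variable as above:

```r
a_number <- "1234"
as.numeric(a_number)        # prints [1] 1234, now without quotes
as.numeric(a_number) * 12   # prints [1] 14808
```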

Vectors

You can also store a series of numbers or characters. This is called a vector.

• To store a vector you surround everything with c( ) and put commas between each item in the vector.
• Everything in a vector has to be the same type (all numbers or all strings).

Vectors Examples

ages <- c(34,23,41,4,6)
ages
[1] 34 23 41  4  6
names <- c("Kevin", "Claire", "Mike", "Dominick", "Leona")
names
[1] "Kevin"    "Claire"   "Mike"     "Dominick" "Leona"   
names_ages <- c("Kevin", 34, "Claire", 23)
names_ages # What happens here? 
[1] "Kevin"  "34"     "Claire" "23"    

Warning

If any item in a vector is a string then R will make everything a string.
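You can confirm the conversion with class(), and avoid it by keeping names and numbers in separate vectors. A sketch under that assumption (people and people_ages are hypothetical names):

```r
names_ages <- c("Kevin", 34, "Claire", 23)
class(names_ages)   # "character": the numbers became strings

people <- c("Kevin", "Claire")   # strings in one vector
people_ages <- c(34, 23)         # numbers in another
mean(people_ages)                # prints [1] 28.5
```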

Vector Math

You can do math on vectors

ages <- c(34,23,41,4,6)
ages + 5
[1] 39 28 46  9 11
assets <- c(534, 1694)
debts <- c(100, 50)
assets - debts
[1]  434 1644
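One thing to watch for, shown here as a sketch with extra made-up values: when two vectors have different lengths, R silently reuses the shorter one. If the longer length is not a multiple of the shorter, R still returns a result but prints a warning.

```r
assets <- c(534, 1694, 200, 50)
debts <- c(100, 50)   # only two values
assets - debts        # debts is reused: prints [1]  434 1644  100    0
length(assets)        # prints [1] 4; checking lengths first avoids surprises
```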

Check

Using R as a calculator, calculate the volume of a sphere with a radius of 2 and store that value as the variable vol

• You can do this all in a single line.
• The formula is $\frac{4}{3} \pi \cdot \text{r}^3$
• Then use a vector to calculate the volume of 3 different spheres with radii 3, 6, and 8.

How I did it

vol <- pi * (2^3) * (4/3)
vol 
[1] 33.51032
rad <- c(3, 6, 8)
vol <- pi * (rad^3) * (4/3)
vol 
[1]  113.0973  904.7787 2144.6606

Functions

Functions in R take the form of function(X, Y, Z) where function is the function, and X, Y, Z are a bunch of arguments that give the function an input and/or tell it what specifically to do.

precinct_voters <- c(123,44,32,67)
sum(precinct_voters)
[1] 266

sum() is a pretty simple function: it takes a single vector and adds together all of its components.

Danger

Do not put a space between the function name and the parentheses. R will still run sum (x), but it is harder to read.

sum() has an additional argument that you can use to tell R what to do with missing values. First, how does R know a value is missing? These are recorded as NA.

precinct_voters <- c(123,44,32,67, NA) # missing the data for the last precinct
precinct_voters * 1.2 # 20% population growth 
[1] 147.6  52.8  38.4  80.4    NA

What if we sum together this new vector?

sum(precinct_voters)
[1] NA

sum() has a second argument na.rm that tells it what to do with missing values. If we want to ignore them we need to set this argument to TRUE:

sum(precinct_voters, na.rm=TRUE)
[1] 266

na.rm is a logical argument. It can either be TRUE or FALSE and so acts as a switch. If TRUE then missing values are ignored; if FALSE (the default) they are not ignored and so a missing value is returned.
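The same switch exists in other summary functions such as mean(), and the base R function is.na() lets you count how many values are missing. A short sketch using the vector from above:

```r
precinct_voters <- c(123, 44, 32, 67, NA)
is.na(precinct_voters)               # FALSE FALSE FALSE FALSE TRUE
sum(is.na(precinct_voters))          # number of missing values: 1
mean(precinct_voters, na.rm = TRUE)  # average of the four known values: 66.5
```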

Accessing the Manual

R has a manual for each function. These are a good place to look if you don’t know what arguments a function has, but they can take practice to read.

You can access the manual by typing ? followed by the function name in the console: ?sum

The manual itself will appear on the bottom right pane.

The manual can be hard to read at first. A few tips:

• The Description is often very general (to a point of sometimes not being useful).
• The Usage shows all the arguments and their defaults (if they have any). There is more info about the arguments in the Arguments section.
• At the very bottom there is usually an Examples section. You can often copy these into the script pane, run them, and see what happens.
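If you only want a quick reminder of a function's arguments, base R also has args(), which prints the same information as the Usage section. A small sketch (not from the original slides):

```r
args(sum)                               # prints: function (..., na.rm = FALSE)
sum_arguments <- names(formals(args(sum)))
sum_arguments                           # "..." and "na.rm"
# help("sum") opens the same manual page as ?sum
```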

Some Other Functions:

• mean()
• median()
• sd()
• range()

Take a moment now, and look at the manual of one of these functions.
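As a quick sketch of what these functions return, here they are applied to the ages vector from earlier:

```r
ages <- c(34, 23, 41, 4, 6)
mean(ages)           # average: 21.6
median(ages)         # middle value: 23
range(ages)          # smallest and largest: 4 41
round(sd(ages), 2)   # standard deviation, rounded: 16.47
```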

Libraries

• R is powerful/useful because anyone can extend it (add more functions).
• Bundles of functions are called libraries/packages.
• You can install a library with install.packages() and then tell R you want to use it with library().

Tidyverse

• A lot of data science work is done using the Tidyverse suite of packages.
• We can install the entire suite using: install.packages()

Run:

install.packages("tidyverse")

There might be a popup asking about installing things from “Source”; you can hit no on it.
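Installing only has to happen once per computer. If you are not sure whether a package is already installed, one way to check (a sketch using base R's requireNamespace(); has_tidyverse is a hypothetical name) is:

```r
# TRUE if the package is installed, FALSE if it is not
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)
has_tidyverse
# install.packages("tidyverse")  # only needed if the line above printed FALSE
```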

Using a Package

• To use a package you use the function library().
• It is a norm to load all the packages you use in a script at the top of a script.

• readr is a library used to load datasets.

Download this data and we are going to open it in R. It has data on the number of veterans in each county receiving disability benefits.

Importing Data with RStudio

• In the bottom right pane you can look through files. It shows the working directory.
• You can change the working directory by going to Session $\rightarrow$ Set Working Directory $\rightarrow$ Choose Directory…
• Find ‘disability_comp.csv’, click on it and select ‘Import Dataset…’
• The first time you do this there might again be a popup asking you to install something, click “Yes” on this one.

Importing Data with just R

We can do the same thing but just using R:

library(readr) # load readr package
setwd("images") # Set working directory
df <- read_csv("disability_comp.csv")

Note

On a Mac, file paths usually start with ~/ (your home folder); on Windows, they usually start with C:/.

If you write setwd("C:/") you can then hit tab and walk through the folders.
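If read_csv() cannot find the file, it usually means the working directory is not where you think it is. A few base R functions help you check; this is a sketch, and the results depend on your own computer:

```r
getwd()                              # prints the current working directory
list.files()                         # lists the files R can see in that folder
file.exists("disability_comp.csv")   # TRUE only if the file is in that folder
```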

Once you have the data loaded start by just running the data by itself. It will show you the first 10 rows of data.

df
# A tibble: 3,142 × 9
    FIPS State   County   Total Age_under_44 Age_45_65 Age_over_65  Male Female
   <dbl> <chr>   <chr>    <dbl>        <dbl>     <dbl>       <dbl> <dbl>  <dbl>
 1  1001 Alabama Autauga   2000          466       957         576  1687    313
 2  1003 Alabama Baldwin   5073          936      1553        2584  4648    425
 3  1005 Alabama Barbour    605           97       242         266   537     68
 4  1007 Alabama Bibb       278           56        95         127   252     26
 5  1009 Alabama Blount     771          159       217         395   724     47
 6  1011 Alabama Bullock    152           22        67          63   133     19
 7  1013 Alabama Butler     414           82       168         164   362     52
 8  1015 Alabama Calhoun   3228          540      1177        1511  2847    381
 9  1017 Alabama Chambers   663          127       201         335   601     62
10  1019 Alabama Cherokee   419           59       124         236   395     24
# … with 3,132 more rows
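A few base R functions give a quick overview of a dataset without printing it. The sketch below uses a small stand-in (df_small is a hypothetical name) built from the first three rows and four of the columns shown above, and uses data.frame() so it runs without the file; the same functions work on df.

```r
# Stand-in built from the first three rows printed above
df_small <- data.frame(
  State  = c("Alabama", "Alabama", "Alabama"),
  County = c("Autauga", "Baldwin", "Barbour"),
  Total  = c(2000, 5073, 605),
  Male   = c(1687, 4648, 537)
)
nrow(df_small)      # number of rows: 3
ncol(df_small)      # number of columns: 4
names(df_small)     # column names
head(df_small, 2)   # first two rows only
```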

Accessing individual columns

To access a specific column of data you’ll use the $: data$column.

df$Total
  [1]  2000  5073   605   278   771   152   414  3228   663   419   736   241
 [13]   485   269   190  2853  1071   274   239   862   290  1345  2629   879
 [25]   831  2659   735  1857   314   293   720   159   277   413  2696   724
 [37] 10724   222  1574   382  3999  2257   230   532 11686   428   398  1444
 [49]  8234   440  7674  1857   147    10   372   706   420  3191  1385  2970
 [61]   210  1482  1028  3658  1078   241   157   379    NA    35 10388    91
 [73]    18    65    43  4954    49    38   502  1426   255   383    28    16
 [85]  3760    66    34    33    57   111   138    14   259   194    47    NA
 [97]    44   753  7481  1871
 [ reached getOption("max.print") -- omitted 3042 entries ]

You can put this directly in a function like:

mean(df$Total)
[1] NA
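The NA appears because the column contains missing values. A minimal sketch of how to confirm that, using a small stand-in (df_small is a hypothetical name; the values are the first three totals from the data plus one NA):

```r
df_small <- data.frame(Total = c(2000, 5073, 605, NA))
mean(df_small$Total)         # NA, because one value is missing
sum(is.na(df_small$Total))   # counts the missing values: 1
```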

Check

Pick two numeric variables and calculate the mean, median and standard deviation for them.

The functions you’ll need are: mean(), median() and sd().

Tip

There is missing data so you’ll have to use the na.rm=TRUE argument.

How I did it

Total number of recipients:

mean(df$Total, na.rm=TRUE)
[1] 1610.01
median(df$Total, na.rm=TRUE)
[1] 438
sd(df$Total, na.rm=TRUE)
[1] 4284.202

Total number of male recipients:

mean(df$Male, na.rm=TRUE)
[1] 1745.597
median(df$Male, na.rm=TRUE)
[1] 580
sd(df$Male, na.rm=TRUE)
[1] 4076.84
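Typing one line per column gets repetitive. As a possible shortcut that is not covered in the slides, base R's sapply() applies one function to every column of a data frame. The sketch uses a stand-in (df_small, col_means, and col_medians are hypothetical names) built from the first three rows of the data, so the numbers differ from the full results above.

```r
df_small <- data.frame(
  Total = c(2000, 5073, 605),
  Male  = c(1687, 4648, 537)
)
col_means <- sapply(df_small, mean, na.rm = TRUE)      # one mean per column
col_medians <- sapply(df_small, median, na.rm = TRUE)  # one median per column
col_means      # Total 2559.333, Male 2290.667
col_medians    # Total 2000, Male 1687
```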