Jarrett Lovelett | April 21 2021 | PSYC60
Here I outline some benefits of thinking of data processing as a pipeline, at least for certain chunks of common processing tasks. I describe how to use the pipe operator %>% to send data frames between functions.
Students: This is intended as supplementary material to help frame and clarify what you have already learned in class. It introduces some new functions and ideas, which you are not responsible for knowing until/unless they come up in class outside this document.
A note on notation. Below (and in general when I write markdown text), thing (without parentheses) denotes an R variable called 'thing' (e.g. a data frame, vector, or character string). thing() (with parens) denotes an R function called 'thing' (something that takes an input and produces an output).
suppressMessages(library(tidyverse)) # load necessary packages, but don't display annoying output messages
suppressMessages(library(mosaic)) # Don't do this in your own R code unless we do it for you, since sometimes those messages are important!
psych_60_data_file <- 'class_data.csv' # this is a file in the same directory as this notebook
psyc60 <- read_csv(psych_60_data_file, # read in our class survey results
col_types = cols()) # this tells read_csv() to guess column types
# (which I use here to suppress its normal output of its guesses)
sample_n(psyc60, 3) # randomly print 3 rows
subID | years_in_college_range | years_in_college_numeric | num_siblings | birth_month | birth_year | num_times_left_home_country | target_age | num_classes | go_with_aliens | ⋯ | favorite_country | favorite_color | favorite_musical_genre | num_hours_studying_stats | living_on_campus | age | lark_owl | num_languages | the_dress_colors | home_region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | ⋯ | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <chr> | <dbl> | <chr> | <chr> |
sub6510167 | 3 - 4 | 3 | 2 | May | 1999 | 2 | 85 | 3 | Yes | ⋯ | Spain | Green | funk | 0 - 4 | No | 22 | neither | 2 | black & blue | North America |
sub8912365 | 0 - 1 | 0 | 0 | December | 2001 | 4 | 119 | 5 | Yes | ⋯ | Japan | Blue | Piano music | 8 - 12 | No | 19 | night owl | 2 | white & gold | East Asia |
sub4794784 | 0 - 1 | 0 | 1 | December | 2001 | 0 | 92 | 3 | No | ⋯ | I have not visited another country yet | Purple | r&b | 8 - 12 | No | 19 | night owl | 1 | black & blue | North America |
One of the most common uses for R is data processing (a.k.a. data cleaning, data wrangling). That is, taking some raw data and doing stuff to it until it is organized in the way you want it to be. That often involves things like dealing with missing data, correcting multiple spellings of the same response (e.g. "New York, NY" = "NYC" = "New York City"), calculating new variables based on those you have, splitting one column into many, grouping many columns into one, filtering out unneeded rows, merging with other data frames, and summarizing data (e.g. averaging).
We often ask you to calculate new variables, or summarize information over existing variables. In the wild world of data wrangling, functions take the data from its raw form to its final form (which could be, say, another now-cleaned data frame, or a plot, or a model). Sometimes one function is enough to do that. Say we want to know the distribution of birth months of students in PSYC60. We can use the purpose-built tally() for that:
tally(~birth_month, data = psyc60)
birth_month
    April    August  December  February   January      July      June     March
       10        10         6         4         9        10         6         7
      May  November   October September
        8         7         6         5
But let's say we want to know what proportion of students have the same favorite season (info which we also collected in the survey) as the season in which they were born. How would we go about calculating that? Below is one sequence of steps that will do the job. Can you come up with another?
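1. Select the relevant columns, birth_month and favorite_season.
2. Create a new column, birth_season, for the (estimated) season someone was born, based on birth_month.
3. Create a new column, birth_season_fave, for whether birth_season is the same as favorite_season.
4. Calculate the proportion of rows for which birth_season_fave is TRUE.

Note that these steps sketch out an approach, but leave some decisions to be made as to how to implement these operations.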
Below, I'll code up several ways of executing these operations. First though, here are some useful tidbits that will come up below:
- R has a built-in vector of month names, month.name. I'll use this below instead of manually creating that vector.
- case_when() is an R function that takes in a set of condition ~ result pairs, where condition is usually used to set multiple alternatives for a particular variable, and result is what the function returns if that condition is true. I use it below to map birth months to birth seasons for convenience (you could also use recode() but it'd be more tedious).

Here's one implementation of this process:
# get rid of unneeded columns:
seasons_dat <- select(psyc60, subID, birth_month, favorite_season)
# make birth_season column:
seasons_dat$birth_season <- case_when(seasons_dat$birth_month %in% month.name[c(12,1,2)] ~ 'Winter',
# ^ Call winter any birth month Dec - Feb
# (I like this better than Winter = Jan - March, I think)
seasons_dat$birth_month %in% month.name[3:5] ~ 'Spring',
# ^ Spring = Mar - May
seasons_dat$birth_month %in% month.name[6:8] ~ 'Summer',
# ^ June - Aug
seasons_dat$birth_month %in% month.name[9:11] ~ 'Fall') # Sep - Nov
# make birth_season_fave column:
seasons_dat$birth_season_fave <- seasons_dat$birth_season == seasons_dat$favorite_season
sample_n(seasons_dat,5) # display a few rows
subID | birth_month | favorite_season | birth_season | birth_season_fave |
---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <lgl> |
sub5679722 | August | Summer | Summer | TRUE |
sub9352389 | October | Summer | Fall | FALSE |
sub8685385 | May | Spring | Spring | TRUE |
sub7385393 | October | Winter | Fall | FALSE |
sub7423388 | May | Fall | Spring | FALSE |
sum(seasons_dat$birth_season_fave == TRUE) / nrow(seasons_dat)
# Looks like about .36 of the class's favorite season is the season they were born
# Quick aside - a nifty trick:
mean(seasons_dat$birth_season_fave)
# why does this work??? We'll explain it in an upcoming lab, but try to figure it out!
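To see how the case_when() and %in% combination used above behaves on its own, here is a minimal sketch on a made-up vector. The name toy_months and its contents are hypothetical and are not taken from the class survey; the last entry is deliberately not a real month.

```r
library(dplyr)  # provides case_when()

# Hypothetical toy input (not from the class data); "Smarch" is not in month.name
toy_months <- c("January", "April", "July", "October", "Smarch")

# Same month-to-season mapping as above: each condition ~ result pair is
# checked in order, and the first condition that is TRUE decides the result
toy_seasons <- case_when(
  toy_months %in% month.name[c(12, 1, 2)] ~ "Winter",
  toy_months %in% month.name[3:5]         ~ "Spring",
  toy_months %in% month.name[6:8]         ~ "Summer",
  toy_months %in% month.name[9:11]        ~ "Fall"
)

toy_seasons  # an entry that matches no condition comes back as NA
```

This is one reason it is worth checking for NA values after a case_when() step: a misspelled month in the raw data would quietly end up with no season.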
But we can do all this in one fell swoop, without creating any new variables or permanently adding columns along the way. (It's worth noting that sometimes it is a good idea to create intermediary variables -- for example after you've done all your cleaning, and then want to use that cleaned data frame for multiple plots or analyses. But often it's just redundant and clutters your workspace. A good heuristic is that if you'll only use a data frame once, there's no need to permanently save it as a variable. Instead, use pipes to modify it on-the-fly and send it to its final destination.)
We've seen the pipe operator %>% before, when combining together different aspects of plots (main plot, labels, theme, etc.).
But we can also use it to chain together data processing functions. The schematic below illustrates how it works.
data_frame %>%
step_1() %>%
step_2() %>%
# ... %>%
step_N()
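A runnable version of this schematic might look like the sketch below. The data frame toy and its columns are made up for illustration; the point is only that the piped form and the nested form give the same result.

```r
library(dplyr)  # provides %>%, filter(), and summarize()

# Hypothetical toy data frame (not the class survey)
toy <- data.frame(name  = c("a", "b", "c", "d"),
                  score = c(3, 8, 5, 10))

# Nested style: read from the inside out
nested <- summarize(filter(toy, score > 4), mean_score = mean(score))

# Pipe style: read top to bottom, one step per line
piped <- toy %>%
  filter(score > 4) %>%
  summarize(mean_score = mean(score))

identical(nested, piped)  # both styles compute the same one-row data frame
```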
Useful tidbits:
- One way to think of what %>% does is that it inserts the thing on its left as the first argument to the thing on its right. So this will only work with functions whose first argument is the data frame or vector they act on.
  - data %>% function() is the same as saying function(data), not function(somethingOtherThanData, data)
  - Functions that work this way include select(), filter(), mutate(), summarize(), and the gf_*() plotting functions
- mutate() is a function for modifying existing columns and creating new ones (if you name an existing column, it will modify that one, otherwise create a new one)
  - data %>% mutate(newVar = stuff) or mutate(data, newVar = stuff) are the pipe-compatible equivalents of data$newVar = stuff
- summarize() takes a data frame as input, and creates a (usually much) smaller data frame, containing only those columns you manually create (see below)

Below is the same process we coded up above, achieved with a pipeline of functions, each taking as input the result of the step before it. Try "commenting out" lower sections of this pipeline and examining the intermediary results -- this lets you easily check each step of the pipeline to make sure things are working as you expect. Note that when you do this, you'll have to make sure to also comment out any now-unnecessary bits of code at the end of the preceding line (e.g. the preceding %>%). You can highlight multiple lines and hit Ctrl + / (Windows) or Cmd + / (Mac) to (un-)comment them out.
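Before the full pipeline, the mutate() and summarize() tidbits can be tried out on a tiny made-up data frame. This is only a sketch: toy, doubled, total, and n_rows are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data frame (not the class survey)
toy <- data.frame(x = c(1, 2, 3, 4))

# Dollar-sign style: copy the data frame, then add a column to the copy
with_dollar <- toy
with_dollar$doubled <- with_dollar$x * 2

# Pipe style: mutate() returns a new data frame with the extra column
with_mutate <- toy %>% mutate(doubled = x * 2)

all.equal(with_dollar, with_mutate)  # compares the two results

# Naming an existing column modifies it instead of adding a new one
toy_shifted <- toy %>% mutate(x = x + 1)

# summarize() keeps only the columns created inside it
toy_summary <- toy %>% summarize(total = sum(x), n_rows = n())
toy_summary
```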
psyc60 %>% # send psyc60 to select(), and...
# select these three columns (subID just for posterity, not really needed here)
select(subID, birth_month, favorite_season) %>% # send just those three columns to...
# add birth season column:
mutate(birth_season = case_when(birth_month %in% month.name[c(12,1,2)] ~ 'Winter',
birth_month %in% month.name[3:5] ~ 'Spring',
birth_month %in% month.name[6:8] ~ 'Summer',
birth_month %in% month.name[9:11] ~ 'Fall')) %>%
# add favorite == birth column
mutate(fav_birth_season_same = birth_season == favorite_season) %>%
# summarize as proportion same:
summarize(prop_fave_birth_month = mean(fav_birth_season_same))
prop_fave_birth_month |
---|
<dbl> |
0.3636364 |
... and we get the same* result!
* OK, technically the first result was just a number, while this one is a data frame with one variable and one observation. The pipe style of processing data does encourage working with entire data frames. If for some reason we really needed to pull out just the number within that data frame, we could add %>% pull(prop_fave_birth_month) to the end of the pipeline above to extract that column (which only contains one number; in R, a vector containing one number is equivalent to that number itself).
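The difference between the one-cell data frame and the bare number can be checked on made-up data. In this sketch, toy, hours, and mean_hours are hypothetical names, not columns from the class survey.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(hours = c(2, 4, 6, 8))

# summarize() alone returns a data frame with one row and one column
as_df <- toy %>%
  summarize(mean_hours = mean(hours))

# Adding pull() extracts that column as a plain numeric vector
as_number <- toy %>%
  summarize(mean_hours = mean(hours)) %>%
  pull(mean_hours)

is.data.frame(as_df)      # a data frame
is.data.frame(as_number)  # not a data frame, just a number
```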
Note about (not) saving: The above pipeline just does the calculations and displays the result, but never saves anything, including the end result. If you want to save the result, add the typical newVariable <- to the beginning of your pipeline.
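As a small sketch of that difference (toy, group, value, and group_a_mean are all made-up names for illustration):

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(group = c("a", "a", "b"),
                  value = c(1, 3, 10))

# No assignment: the result is displayed, but nothing is stored
toy %>%
  filter(group == "a") %>%
  summarize(mean_value = mean(value))

# With assignment: the result is stored in group_a_mean and not displayed
group_a_mean <- toy %>%
  filter(group == "a") %>%
  summarize(mean_value = mean(value))

group_a_mean  # typing the variable name displays the saved result
```

Note that the assignment goes at the very top of the pipeline even though the value being saved is whatever comes out of the last step.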
Great, we've calculated the proportion of psyc60 students whose favorite season is the season in which they were born. But let's say we want to visualize the favorite season of students, separately by whether their favorite season is the one in which they were born. (For some reason.)
Now we have to add to the pipeline by splicing in a step in the middle, and by tacking more steps on at the end; and we have to delete the last step calling summarize() since we want to send all the data to the plot. New lines below are marked off with comments. Notice how easy it is to insert a new intermediate processing step! (I've also combined the previous mutate() calls into one; yes, this is a thing!)
psyc60 %>%
# first line of mutate is new! reorder birth_month.
mutate(birth_month = factor(birth_month, levels = month.name),
birth_season = case_when(birth_month %in% month.name[c(12,1,2)] ~ 'Winter',
birth_month %in% month.name[3:5] ~ 'Spring',
birth_month %in% month.name[6:8] ~ 'Summer',
birth_month %in% month.name[9:11] ~ 'Fall'),
fav_birth_season_same = birth_season == favorite_season) %>%
# mutate_at() is a special version of mutate() that does the same thing to multiple columns...
# Here I want to reorder the seasons for both season columns (birth and favorite)
mutate_at(c('favorite_season', 'birth_season'), factor, levels = c('Fall','Winter','Spring','Summer')) %>%
# And all the rest of this is new:
# make the basic plot
gf_bar(~favorite_season, fill = ~birth_season) %>%
# add labels
gf_labs(x = 'Favorite Season',
y = '# Students',
title = 'Favorite Seasons of Psyc60 Students',
# I used double quotes " instead of single ' because I needed an apostrophe within the text!
subtitle = "And whether it's the same as birth season",
fill = 'Birth Season') %>%
# facet on same birth & fave season
gf_facet_grid(~fav_birth_season_same,
# custom labels: need `` around TRUE / FALSE b/c they are special R values otherwise
labeller = labeller(fav_birth_season_same = c(`TRUE` = 'Favorite == Birth',
`FALSE` = 'Favorite != Birth'))) %>%
# tweaking a couple theme options
gf_theme(axis.text.x = element_text(angle=45, hjust = 1), # axis at diagonal
legend.position = c(.66,.67)) # relative coordinates c(0-1 x , 0-1 y)
Welp, now we have... that plot. Does it tell us anything interesting? You tell me!
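The factor-reordering step in the pipeline above can also be checked on its own. The sketch below uses a made-up data frame (toy and season_levels are hypothetical names) and, as an aside that assumes dplyr 1.0.0 or later, shows the same step written with across(), which the dplyr documentation now suggests in place of mutate_at().

```r
library(dplyr)

# The order we want the seasons to appear in (e.g. on a plot axis or legend)
season_levels <- c("Fall", "Winter", "Spring", "Summer")

# Hypothetical toy data frame (not the class survey)
toy <- data.frame(favorite_season = c("Summer", "Fall", "Winter"),
                  birth_season    = c("Spring", "Summer", "Fall"),
                  stringsAsFactors = FALSE)

# As in the pipeline above: apply factor(..., levels = season_levels) to both columns
via_mutate_at <- toy %>%
  mutate_at(c("favorite_season", "birth_season"), factor, levels = season_levels)

# The same step written with across() (assumes dplyr >= 1.0.0)
via_across <- toy %>%
  mutate(across(c(favorite_season, birth_season),
                ~ factor(.x, levels = season_levels)))

levels(via_mutate_at$favorite_season)  # levels follow season_levels, not the alphabet
all.equal(via_mutate_at, via_across)   # compares the two versions
```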