Chaining Operations Demo

Jarrett Lovelett | April 21 2021 | PSYC60

Overview

Here I outline some benefits of thinking of data processing as a pipeline, at least for certain chunks of common processing tasks. I describe how to use the pipe operator %>% to send data frames between functions.

Students: This is intended as supplementary material to help frame and clarify what you have already learned in class. It introduces some new functions and ideas, which you are not responsible for knowing until/unless they come up in class outside this document.

A note on notation: below (and in general when I write markdown text), thing (without parentheses) denotes an R variable called 'thing' (e.g. a data frame, vector, or character string), while thing() (with parentheses) denotes an R function called 'thing' (something that takes an input and produces an output).

Load in data & display sample

Why Chain Operations?

One of the most common uses for R is data processing (a.k.a. data cleaning, data wrangling). That is, taking some raw data and doing stuff to it until it is organized in the way you want it to be. That often involves things like dealing with missing data, correcting multiple spellings of the same response (e.g. "New York, NY" = "NYC" = "New York City"), calculating new variables based on those you have, splitting one column into many, grouping many columns into one, filtering out unneeded rows, merging with other data frames, and summarizing data (e.g. averaging).

We often ask you to calculate new variables, or summarize information over existing variables. In the wild world of data wrangling, functions take the data from its raw form to its final form (which could be, say, another now-cleaned data frame, or a plot, or a model). Sometimes one function is enough to do that. Say we want to know the distribution of birth months of students in PSYC60. We can use the purpose-built tally() for that:
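The original code chunk is not shown here, so the following is only a rough sketch of a one-function summary. It assumes a hypothetical toy data frame called survey with month names in birth_month, and it swaps in dplyr's count() as a stand-in for tally().

```r
library(dplyr)

# Hypothetical toy data standing in for the class survey (not the real responses)
survey <- data.frame(
  birth_month = c("January", "April", "July", "October", "December", "July"),
  favorite_season = c("Winter", "Summer", "Summer", "Fall", "Spring", "Winter"),
  stringsAsFactors = FALSE
)

# A single function call returns the number of students born in each month
month_counts <- count(survey, birth_month)
month_counts
```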

A more complex process...

But let's say we want to know what proportion of students have the same favorite season (info which we also collected in the survey) as the season in which they were born. How would we go about calculating that? Below is one sequence of steps that will do the job. Can you come up with another?

  1. Get rid of unneeded columns (all but birth_month and favorite_season).
  2. Add a column birth_season for the (estimated) season someone was born, based on birth_month.
  3. Add a column birth_season_fave for whether birth_season is the same as favorite_season.
  4. Divide the number of matches by the total number of observations (rows) to get the proportion.

Note that these steps sketch out an approach, but leave some decisions to be made as to how to implement these operations.

Below, I'll code up several ways of executing these operations. First though, here are some useful tidbits that will come up below:

Here's one implementation of this process:

Step-by-step: create/modify variables
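As a minimal sketch of the four steps, assuming a hypothetical toy data frame survey with month names in birth_month and one particular month-to-season mapping (December through February counted as winter, and so on), the step-by-step version might look like this. The names survey_small and prop_match are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the class survey (not the real responses)
survey <- data.frame(
  birth_month = c("January", "April", "July", "October", "December", "July"),
  favorite_season = c("Winter", "Summer", "Summer", "Fall", "Spring", "Winter"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the two columns we need
survey_small <- select(survey, birth_month, favorite_season)

# Step 2: estimate birth season from birth month (assumed mapping)
survey_small <- mutate(
  survey_small,
  birth_season = case_when(
    birth_month %in% c("December", "January", "February") ~ "Winter",
    birth_month %in% c("March", "April", "May") ~ "Spring",
    birth_month %in% c("June", "July", "August") ~ "Summer",
    TRUE ~ "Fall"
  )
)

# Step 3: flag rows where birth season and favorite season match
survey_small <- mutate(
  survey_small,
  birth_season_fave = birth_season == favorite_season
)

# Step 4: number of matches divided by number of rows
prop_match <- sum(survey_small$birth_season_fave) / nrow(survey_small)
prop_match
```

Each step overwrites survey_small, so the workspace ends up holding an intermediate data frame plus the final number.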

Pipes:

But we can do all this in one fell swoop, without creating any new variables or permanently adding columns along the way. (It's worth noting that sometimes it is a good idea to create intermediary variables -- for example after you've done all your cleaning, and then want to use that cleaned data frame for multiple plots or analyses. But often it's just redundant and clutters your workspace. A good heuristic is that if you'll only use a data frame once, there's no need to permanently save it as a variable. Instead, use pipes to modify it on-the-fly and send it to its final destination.)

We've seen the pipe operator %>% before, when combining together different aspects of plots (main plot, labels, theme, etc.). But we can also use it to chain together data processing functions. The schematic below illustrates how it works.

data_frame %>% 
    step_1() %>%
    step_2() %>%
    # ... %>%
    step_N()
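To make the schematic concrete, here is a small sketch using a made-up numeric vector called scores. The pipe hands whatever is on its left to the function on its right as that function's first argument, so the nested call and the piped call below should give the same answer.

```r
library(dplyr)  # loading dplyr makes %>% available

# Made-up numbers, just for illustration
scores <- c(4, 9, 16)

# Nested style: read from the inside out
nested <- round(mean(sqrt(scores)), 1)

# Piped style: read top to bottom, one step per line
piped <- scores %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```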

Useful tidbits:

Below is the same process we coded up above, achieved with a pipeline of functions, each taking as input the result of the step before it. Try "commenting out" lower sections of this pipeline and examining the intermediary results -- this lets you easily check each step of the pipeline to make sure things are working as you expect. Note that when you do this, you'll have to make sure to also comment out any now-unnecessary bits of code at the end of the preceding line (e.g. the preceding %>%). You can highlight multiple lines and hit Ctrl + / (Windows) or Cmd + / (Mac) to (un-)comment them out.
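A sketch of such a pipeline, under the same assumptions as before (a hypothetical toy data frame survey and an assumed month-to-season mapping), could look like the following. The result is stored under the made-up name prop_table only so that it can be printed and inspected.

```r
library(dplyr)

# Hypothetical toy data standing in for the class survey (not the real responses)
survey <- data.frame(
  birth_month = c("January", "April", "July", "October", "December", "July"),
  favorite_season = c("Winter", "Summer", "Summer", "Fall", "Spring", "Winter"),
  stringsAsFactors = FALSE
)

prop_table <- survey %>%
  # Step 1: keep only the needed columns
  select(birth_month, favorite_season) %>%
  # Step 2: estimate birth season from birth month (assumed mapping)
  mutate(birth_season = case_when(
    birth_month %in% c("December", "January", "February") ~ "Winter",
    birth_month %in% c("March", "April", "May") ~ "Spring",
    birth_month %in% c("June", "July", "August") ~ "Summer",
    TRUE ~ "Fall"
  )) %>%
  # Step 3: flag matches between birth season and favorite season
  mutate(birth_season_fave = birth_season == favorite_season) %>%
  # Step 4: proportion of matches, returned as a one-row data frame
  summarize(prop_fave_birth_month = sum(birth_season_fave) / n())

prop_table
```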

... and we get the same* result!

* OK, technically the first result was just a number, while this one is a data frame with one variable and one observation. The pipe style of processing data does encourage working with entire data frames. If for some reason we really needed to pull out just the number within that data frame, we could add %>% pull(prop_fave_birth_month) to the end of the pipeline above to extract that column (which only contains one number; in R, a vector containing one number is equivalent to that number itself).
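As a small illustration of pull(), here is a sketch that builds a one-row data frame by hand (the value 0.5 is made up) and extracts its single column as a plain number. The names summary_df and prop_number are hypothetical.

```r
library(dplyr)

# Hand-built stand-in for the one-row result of a summarize() call
summary_df <- data.frame(prop_fave_birth_month = 0.5)

# pull() extracts one column as a vector, here a single number
prop_number <- summary_df %>% pull(prop_fave_birth_month)

is.data.frame(summary_df)  # the summary itself is a data frame
is.numeric(prop_number)    # the pulled value is a plain numeric vector
```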

Note about (not) saving: The above pipeline just does the calculations and displays the result, but never saves anything, including the end result. If you want to save the result, add the typical newVariable <- to the beginning of your pipeline.
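The difference can be checked directly. In this sketch, which uses a hypothetical toy data frame and a made-up column called likes_summer, the first pipeline only displays its result, while the second stores it under the made-up name survey_flagged.

```r
library(dplyr)

# Hypothetical toy data standing in for the class survey (not the real responses)
survey <- data.frame(
  favorite_season = c("Winter", "Summer", "Summer", "Fall"),
  stringsAsFactors = FALSE
)

# Displays the modified data frame, but nothing is saved
survey %>%
  mutate(likes_summer = favorite_season == "Summer")

# The original data frame is unchanged: this is FALSE
"likes_summer" %in% names(survey)

# Adding an assignment at the start saves the end result
survey_flagged <- survey %>%
  mutate(likes_summer = favorite_season == "Summer")

# The saved copy has the new column: this is TRUE
"likes_summer" %in% names(survey_flagged)
```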

Returning to the pipeline

Great, we've calculated the proportion of PSYC60 students whose favorite season is the season in which they were born. But let's say we want to visualize the favorite season of students, separately by whether their favorite season is the one in which they were born. (For some reason.)

Now we have to add to the pipeline by splicing in a step in the middle and by tacking more steps on at the end. We also have to delete the last step calling summarize(), since we want to send all the data to the plot. New lines below are marked off with comments. Notice how easy it is to insert a new intermediate processing step! (I've also combined the previous mutate() calls into one. Yes, this is a thing!)
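The original chunk is not reproduced here, so the following is only a sketch of the general shape. It assumes the same hypothetical toy data and month-to-season mapping as before, and it swaps in ggplot2 (whose layers are combined with + rather than %>%) for whatever plotting functions the class version uses. The name season_plot is made up.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the class survey (not the real responses)
survey <- data.frame(
  birth_month = c("January", "April", "July", "October", "December", "July"),
  favorite_season = c("Winter", "Summer", "Summer", "Fall", "Spring", "Winter"),
  stringsAsFactors = FALSE
)

season_plot <- survey %>%
  select(birth_month, favorite_season) %>%
  # One mutate() call can create several columns; later ones can use earlier ones
  mutate(
    birth_season = case_when(
      birth_month %in% c("December", "January", "February") ~ "Winter",
      birth_month %in% c("March", "April", "May") ~ "Spring",
      birth_month %in% c("June", "July", "August") ~ "Summer",
      TRUE ~ "Fall"
    ),
    birth_season_fave = birth_season == favorite_season
  ) %>%
  # No summarize() here: every row goes on to the plot
  ggplot(aes(x = favorite_season)) +
  geom_bar() +
  # One panel per value of birth_season_fave (FALSE and TRUE)
  facet_wrap(~ birth_season_fave) +
  labs(x = "Favorite season", y = "Number of students")

# Type season_plot at the console to draw the plot
```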

Welp, now we have... that plot. Does it tell us anything interesting? You tell me!

The End