Set up

We will use the palmer penguin data so open your saved Rproj in Rstudio by clicking File menu then Open project and look for your .Rproj file. Click the Files tab in the bottom right pane, click the data folder and check the penguins.csv is in there. Then read it in.

penguins <- read.csv(file = "data/penguins.csv")

1 Data wrangling

Cartoon of a fuzzy monster with a cowboy hat and lasso, riding another fuzzy monster labeled “dplyr”, lassoing a group of angry / unruly looking creatures labeled “data”. Artwork by @allison_horst

Cleaning or wrangling data means many things to many researchers: we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R functions:

mean(penguins[penguins$species == "Adelie", "body_mass_g"], na.rm = TRUE)
mean(penguins[penguins$species == "Gentoo", "body_mass_g"], na.rm = TRUE)
mean(penguins[penguins$species == "Chinstrap", "body_mass_g"])

But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

1.1 The `dplyr` package

A hexagon with dplyr written in it and three colourful plyers on a space background

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to understand.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.

select()
filter()
group_by()
summarize()
mutate()

If you have not installed this package previously, use the packages tab, bottom right.

Then load the package:

library("dplyr")

1.2 select()

Text is select(data.frame, a, c). On the right are four columns of data named a, b, c and d. On the left are two columns a and c. One arrow curves from a to a and the other arrow curves c to c.

If we wanted to analyse only a few of the variables (columns) in our dataframe we could use the select() function. This will keep only the variables you select.

species_body <- select(penguins, species,body_mass_g,sex)

If we open up species_body (under the environment tab) we’ll see that it only contains the species, body mass and sex.

1.3 filter()

If we now wanted to use the columns selected above, but only with Adelie penguins, we can use filter

species_body_adelie <- filter(species_body, species =="Adelie")

Open species_body_adelie to check it only contains Adelie penguin data.

Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes.

To demonstrate a pipe (%>%) let’s repeat what we’ve done above by combining select and filter in one command.

species_body_adelie2 <- penguins %>%
    filter(species == "Adelie") %>%
    select(species, body_mass_g, sex)

First we assign the result to species_body_adelie2. Then, we call the penguins dataframe and pass it on, using the pipe symbol %>% to the filter() function. We don’t need to name which data object to use in the filter() function since we’ve piped penguins in. Then the resulting data frame is piped into the select() part of the code.

Pipes allow you to do many actions in one command, make code easier to read and avoid creating multiple data objects.

Tip: select() and filter()

Remember, select() chooses columns and filter() chooses rows

Challenge 1

Write a single command (which can span multiple lines and include pipes) that will produce a dataframe that has the Gentoo values for island, flipper_length_mm and year, but not for other penguins. How many rows does your dataframe have?

Solution to Challenge 1

gentoo_island_flipper_yr <- penguins %>%
    filter(species == "Gentoo") %>%
    select(island, flipper_length_mm, year)

You should have 124 rows.

Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable species since we would have removed it in the previous step

1.4 group_by() and summarize()

Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each species. Using filter(), will only pass observations that meet your criteria (in the above: species=="Adelie"). Instead, we can use group_by(), which will analyse all three species.

Text is data.frame, pipe operator, group_by(a). On the right are four columns of data named a, b, c and d. Arrows show these rows split into three so 1 and 1 rows are in one dataframe, 2 and 2 rows in another and 3 and 3 rows together.

1.4.1 Using group_by() with summarize()

group_by() is more easily understood when it’s used with something like summarize(). By using the group_by() function, we split our original data frame into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().

penguins %>%
    group_by(species) %>%
    summarize(mean_body = mean(body_mass_g, na.rm=TRUE))

## # A tibble: 3 × 2
##   species   mean_body
##   <chr>         <dbl>
## 1 Adelie        3701.
## 2 Chinstrap     3733.
## 3 Gentoo        5076.

Tip: To assign or not

Starting the code with species_body_means <- will assign the results as a new data frame called species_body_means. This species_body_means will then be listed under the environment tab so you can view it. Good for large results data frames.

As was done in the last example, you can choose not to assign as an object if the resulting data frame is small. The results are not saved but given in your console window.

The function group_by() allows us to group by multiple variables. Let’s group by species and sex.

penguins %>%
    group_by(species, sex) %>%
    summarize(mean_body = mean(body_mass_g, na.rm=TRUE))

And you’re not limited to analysing only one variable with only one function in summarize().

penguins %>%
    group_by(species, sex) %>%
    summarize(mean_body = mean(body_mass_g),
              sd_body = sd(body_mass_g),
              mean_flipper = mean(flipper_length_mm),
              sd_flipper = sd(flipper_length_mm))

The filter() function can remove the penguins who have NA under sex. (Read the exclamation mark as “not”)

penguins %>%
    group_by(species, sex) %>%
    filter(!is.na(sex)) %>%
    summarize(mean_body = mean(body_mass_g),
              sd_body = sd(body_mass_g),
              mean_flipper = mean(flipper_length_mm),
              sd_flipper = sd(flipper_length_mm))

Challenge 2

For each species and sex, calculate the mean and sd of bill length and bill depth.

Solution to Challenge 2

penguins %>%
    group_by(species, sex) %>%
    filter(!is.na(sex)) %>%
    summarize(mean_billlength = mean(bill_length_mm),
              sd_billlength = sd(bill_length_mm),
              mean_billdepth = mean(bill_depth_mm),
              sd_billdepth = sd(bill_depth_mm))

1.5 count() and n()

If we wanted to check the number of penguins included in the dataset for the year 2007, we can use filter() then the count() function:

penguins %>%
    filter(year == 2007) %>%
    count

If we need to use the total number of observations in calculations, the n() function is useful. For instance, if we wanted the standard error of body mass per species:

penguins %>%
    group_by(species) %>%
    summarize(se_body = sd(body_mass_g)/sqrt(n()))

This works better if we filter out the NAs.

penguins %>%
    group_by(species) %>%
    filter(!is.na(body_mass_g)) %>%
    summarize(se_body = sd(body_mass_g)/sqrt(n()))

Tip: Standard error

To calculate standard error (se), the standard deviation (sd) is divided by the square root of the sample size (n). \(se = sd/\sqrt n\)

1.6 mutate()

Cartoon of cute fuzzy monsters dressed up as different X-men characters, working together to add a new column to an existing data frame. Stylized title text reads “dplyr::mutate - add columns, keep existing. Artwork by @allison_horst

mutate() creates new variables (columns) from existing variables.

penguins_body_kg <- penguins %>%
    mutate(body_weight_kg = body_mass_g / 1000)

1.6.1 Connect mutate with case_when

We can combine mutate() with case_when() to filter data in the moment of creating a new variable. (Note: case_when is an alternative to the older ifelse() function.)

We can make a new column of data that categorises each Gentoo penguin as either large or small and all other penguins as “not Gentoo”. To do this, we use information in the species and body mass columns.

penguin_gentoo_bodysize <- penguins %>%
  mutate(
    size = case_when(
      species == "Gentoo" & body_mass_g >= 5000 ~ "large",
      species == "Gentoo" & body_mass_g < 5000 ~ "small",
      TRUE ~ "not Gentoo"
                    )
        )

Size is our name for the new column. You can read ~ as “then write”. The last line of code can be read as “if none of the conditions are TRUE then write is not Gentoo”.

Advanced Challenge

Write code to select only the species, island, flipper length and sex variables. Filter to only the Adelie penguins. Filter out the penguins with NA under flipper length. Group the data by island. Use mutate to create a new flipper length column in cm by dividing the flipper length by 100. Summarise the mean and sd of flipper lengths for each of the islands.

Solution to Advanced Challenge

penguins %>%
    select(species, island, flipper_length_mm, sex) %>%
    filter(species == "Adelie") %>%
    filter(!is.na(flipper_length_mm)) %>%
    group_by(island) %>%
    mutate(flipper_cm = flipper_length_mm/100) %>%
    summarise(flipper_mean_cm = mean(flipper_cm),
              flipper_sd = sd(flipper_cm))

## # A tibble: 3 × 3
##   island    flipper_mean_cm flipper_sd
##   <chr>               <dbl>      <dbl>
## 1 Biscoe               1.89     0.0673
## 2 Dream                1.90     0.0659
## 3 Torgersen            1.91     0.0623

1.7 Other great resources

Source

Adapted from R for Reproducible Scientific Analysis licensed CC_BY 4.0 by The Carpentries

dplyr