Data Management using R

Leykun (MSc)¹, Tesfamichael (MSc)² & Yebelay (MSc)³

¹NDMC, EPHI; ²SPH, AAU; ³DMU & C4ED

October 14 - 17, 2025

Data Manipulation and Cleaning using `dplyr` package

What is Tidyverse?

The tidyverse is a collection of R packages designed for data science.
- All packages share an underlying design philosophy, grammar, and data structures.
Installing the tidyverse package automatically installs all of its member packages.
Install the complete tidyverse with:

Code

install.packages("tidyverse")

Load the core tidyverse to make it available in your current R session:

Code

library(tidyverse)

Cont.

To see the packages included in the tidyverse:

Code

tidyverse_packages()

Some tidyverse packages are considered core packages, which library(tidyverse) loads; the others are sometimes called friend packages and must be loaded individually.

Core tidyverse

tibble, for tibbles, a modern re-imagining of data frames
readr, for data import
tidyr, for data tidying
ggplot2, for data visualization
dplyr, for data manipulation
stringr, for strings
forcats, for factors
purrr, for functional programming

More Data Import/Export Tools

readxl, for xls and xlsx files
haven, for SPSS, SAS, and Stata files
jsonlite, for JSON
xml2, for XML
httr, for web APIs
rvest, for web scraping
DBI, for databases

Tools for date wrangling

lubridate and hms, for dates and times (lubridate has been loaded as a core package since tidyverse 2.0.0)

Tools for modeling

modelr, for modelling within a pipeline; broom, for turning models into tidy data

Intro to `dplyr` package

dplyr is part of tidyverse and provides a grammar (the verbs) for data manipulation.

The key operator and the essential verbs are:

Function	Description	Operates on
`filter()`	pick rows matching criteria	rows
`slice()`	pick rows using indices	rows
`arrange()`	reorder rows	rows
`select()`	pick columns by name	columns
`mutate()`	add new variables	columns
`summarise()`	reduce variables to values	groups of rows
`relocate()`	to change column positions	columns

… many more.

|> (the native pipe) or %>% (the magrittr pipe) : the “pipe” operator used to connect multiple verb actions together into a pipeline.
Note: Base R pipe |> is available in R 4.1+; magrittr pipe %>% requires the magrittr/dplyr packages.

In RStudio, the pipe keyboard shortcut can be set to insert the native pipe: Tools → Global Options → Code → Editing → Use Native Pipe Operator (|>)

Select Columns from a Dataset `select()`:

select(): To extract variables

select() $\sim$ columns
select columns (variables)
no quotes needed around variable names
can be used to rearrange columns
uses special syntax that is flexible and has many options

Note that the column names are not quoted; you access the column name as if you are calling the name of an object or variable

Ways to Use `select()` in dplyr

Method	Description	Example
By name	Select specific columns by their names.	`select(col1, col2)`
By position	Select columns by their positions.	`select(1, 3)`
Using a range	Use `:` to select columns	`select(col1:col5); select(2:4)`
Exclude columns	Use `-` or `!` to exclude specific columns	`select(-col3); select(-(2:4)); select(!starts_with("A"))`
Use pattern	Use helper functions based on patterns.	`select(starts_with("prefix"))` `select(ends_with("suffix"))` `select(contains("text"))`
Select by type	Use `where()` to select based on type or condition.	`select(where(is.numeric))`
Select all columns except some	Use `everything()` to re-order or select all columns except specific ones.	`select(col3, everything())` `select(-starts_with("temp"))`
Rearrange columns	Move specific columns to the front while retaining all others.	`select(col1, col3, everything())`

About the data

Data from the CDC’s Youth Risk Behavior Surveillance System (YRBSS)

complex survey data
national school-based survey conducted by CDC
monitors six categories of health-related behaviors
- that contribute to the leading causes of death and disability among youth and adults
- including alcohol & drug use, unhealthy & dangerous behaviors, sexuality, and physical activity

Code

library(readr)
yrb_data <- read_csv("data/yrbss.csv")

We can have a look at the data and its structure by using the glimpse() function from the dplyr package.
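A minimal sketch of glimpse(), using a small hypothetical stand-in for yrb_data so that it runs without the CSV file (the toy_yrb name and its values are illustrative assumptions, not the real survey data). On the real data the call would simply be glimpse(yrb_data).

```r
library(dplyr)

# Hypothetical three-row stand-in for yrb_data (illustrative values only)
toy_yrb <- tibble(
  record = c(1, 2, 3),
  age    = c("15 years old", "17 years old", "14 years old"),
  sex    = c("Female", "Female", "Male"),
  bmi    = c(17.2, 20.2, NA)
)

# glimpse() prints the number of rows and columns, then one line per column
# showing its name, its type, and the first few values
glimpse(toy_yrb)
```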

Pipe operator (`|>`)

Pipes in R look like |> and string together commands to be performed sequentially.
The pipe takes the output of the function right before it and passes it as the first argument of the function right after it.

Code

third(second(first(x)))

This nesting is not a natural way to think about a sequence of operations.
The |> operator allows you to string operations in a left-to-right fashion.

Code

first(x) |>
  second() |> third()
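A small runnable sketch of the same idea with real base R functions (the vector x is just an illustrative assumption). Note that the native pipe needs a function call with parentheses on its right-hand side.

```r
# The native pipe |> requires R 4.1 or later
x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from left to right; each result becomes the
# first argument of the next call
piped <- x |> sqrt() |> mean() |> round(1)

identical(nested, piped)
```

Both forms compute the same value; only the reading order differs.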

Advantages of the pipe operator

Pipes reduce the number of intermediate steps, which can be hard to keep track of.
Less redundant code
Easy to read and write because functions are executed in order
- Nested function calls become difficult to read when too many are combined
Compare the three syntaxes:

Code

data1 <- filter(mydata, Age > 15)
data2 <- select(data1, Sex, Weight1, Age)

Code

non_piped <- select(filter(mydata, Age > 15), Sex, Weight1, Age)

Code

pipeddata <- mydata |> filter(Age > 15) |> select(Sex, Weight1, Age)

Select columns by name: `select(col1, col2, col3, ...)`

Code

library(dplyr)
yrb_data1 <- yrb_data |> 
  select(age, sex, grade)
yrb_data1

# A tibble: 20,000 × 3
   age                   sex    grade
   <chr>                 <chr>  <chr>
 1 15 years old          Female 10th 
 2 17 years old          Female 12th 
 3 18 years old or older Male   11th 
 4 15 years old          Male   10th 
 5 14 years old          Male   9th  
 6 17 years old          Male   9th  
 7 16 years old          Male   11th 
 8 17 years old          Male   12th 
 9 18 years old or older Male   12th 
10 14 years old          Male   10th 
# ℹ 19,990 more rows

Selecting column ranges with `:`

The : operator selects a range of consecutive variables:

Code

yrb_data |>  select(age:race4) |>  head(3)

# A tibble: 3 × 4
  age                   sex    grade race4          
  <chr>                 <chr>  <chr> <chr>          
1 15 years old          Female 10th  White          
2 17 years old          Female 12th  White          
3 18 years old or older Male   11th  Hispanic/Latino

We can also specify a range with column numbers:

Code

yrb_data |> select(1:4) |> head(3)

# A tibble: 3 × 4
  record age                   sex    grade
   <dbl> <chr>                 <chr>  <chr>
1 931897 15 years old          Female 10th 
2 333862 17 years old          Female 12th 
3  36253 18 years old or older Male   11th

Excluding columns with `!` or `-`

The exclamation point negates a selection:

Code

yrb_data |> select(!record) |> head(2)

# A tibble: 2 × 7
  age          sex    grade race4 race7   bmi stweight
  <chr>        <chr>  <chr> <chr> <chr> <dbl>    <dbl>
1 15 years old Female 10th  White White  17.2     54.4
2 17 years old Female 12th  White White  20.2     57.2

To drop a range of consecutive columns, we use, for example, !age:grade:

Code

yrb_data |> select(!age:grade) |> head(2)

# A tibble: 2 × 5
  record race4 race7   bmi stweight
   <dbl> <chr> <chr> <dbl>    <dbl>
1 931897 White White  17.2     54.4
2 333862 White White  20.2     57.2

To drop several non-consecutive columns, place them inside !c():

Code

yrb_data |> select(!c(race4, race7)) |> head(3)

Helper functions: `starts_with()`, `ends_with()` and `contains()`

These helpers work exactly as their names suggest!

`starts_with()`

Code

yrb_data |> select(starts_with("r")) |> head(2)

# A tibble: 2 × 3
  record race4 race7
   <dbl> <chr> <chr>
1 931897 White White
2 333862 White White

`ends_with()`

Code

yrb_data |> select(ends_with("e")) |> head(3)

# A tibble: 3 × 2
  age                   grade
  <chr>                 <chr>
1 15 years old          10th 
2 17 years old          12th 
3 18 years old or older 11th

`contains()`

contains() helps select columns that contain a certain string:

Code

yrb_data |> select(sex, contains("r")) |> head()

# A tibble: 6 × 5
  sex     record grade race4                     race7                    
  <chr>    <dbl> <chr> <chr>                     <chr>                    
1 Female  931897 10th  White                     White                    
2 Female  333862 12th  White                     White                    
3 Male     36253 11th  Hispanic/Latino           Hispanic/Latino          
4 Male   1095530 10th  Black or African American Black or African American
5 Male   1303997 9th   All other races           Multiple - Non-Hispanic  
6 Male    261619 9th   All other races           <NA>

Another helper function, `everything()`

everything() matches all variables; placed after other selections, it adds every column that has not yet been selected.

Code

## First, `bmi`, then every other column.
yrb_data |> select(bmi, everything()) |> head(3)

# A tibble: 3 × 8
    bmi record age                   sex    grade race4           race7 stweight
  <dbl>  <dbl> <chr>                 <chr>  <chr> <chr>           <chr>    <dbl>
1  17.2 931897 15 years old          Female 10th  White           White     54.4
2  20.2 333862 17 years old          Female 12th  White           White     57.2
3  NA    36253 18 years old or older Male   11th  Hispanic/Latino Hisp…     NA

It is often useful for establishing the order of columns.

Listing every remaining column by hand would be painful for larger data frames. In such a case, we can use everything().

This helper can be combined with many others.

Code

## Bring columns that start with "r" to the front of the data frame
yrb_data |> select(starts_with("r"), everything()) |> head(3)

# A tibble: 3 × 8
  record race4           race7           age          sex   grade   bmi stweight
   <dbl> <chr>           <chr>           <chr>        <chr> <chr> <dbl>    <dbl>
1 931897 White           White           15 years old Fema… 10th   17.2     54.4
2 333862 White           White           17 years old Fema… 12th   20.2     57.2
3  36253 Hispanic/Latino Hispanic/Latino 18 years ol… Male  11th   NA       NA

You can also select columns based on their data type using select(where(...)).
Commonly used type predicates are: is.character, is.double, is.factor, is.integer, is.logical, is.numeric.

Code

yrb_data |>  select(where(is.numeric)) |>
  glimpse()  # numeric data types only selected (here: integer or double)

Rows: 20,000
Columns: 3
$ record   <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1309…
$ bmi      <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.2555,…
$ stweight <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, NA,…

Summary of the select() function

There are five ways to select variables in select(data, ...):

By position: yrb_data |> select(1, 2, 4) or yrb_data |> select(1:2).
By name: yrb_data |> select(age, sex), or yrb_data |> select(age:race4).
By function of name: yrb_data |> select(starts_with("r")), or yrb_data |> select(ends_with("e")).
By type: yrb_data |> select(where(is.numeric)), or yrb_data |> select(where(is.character)).
By any combination of the above using the Boolean operators !, &, and |:
- yrb_data |> select(!where(is.numeric)): selects all non-numeric variables.
- yrb_data |> select(where(is.numeric) & contains("i")): selects all numeric variables whose names contain "i".
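The Boolean combinations can be tried on a small hypothetical tibble whose column names mimic yrb_data (the toy object and its values are assumptions made for illustration):

```r
library(dplyr)

# Hypothetical two-row table with the same column names as part of yrb_data
toy <- tibble(
  record   = c(1, 2),
  age      = c("15 years old", "17 years old"),
  bmi      = c(17.2, 20.2),
  stweight = c(54.4, 57.2)
)

# ! negates: every column that is NOT numeric
non_numeric <- toy |> select(!where(is.numeric))

# & intersects: numeric columns whose names contain "i"
numeric_with_i <- toy |> select(where(is.numeric) & contains("i"))

# | unions: numeric columns, plus the age column
numeric_or_age <- toy |> select(where(is.numeric) | age)

names(non_numeric)
names(numeric_with_i)
names(numeric_or_age)
```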

Filter cases from the dataset `filter()`

`filter()`: To extract cases

The function filter() is used to filter the dataset to return a subset of all rows that meet one or more specific conditions.

filter(dataframe, logical statement 1, logical statement 2, ...)

Ways to Use `filter()` in dplyr

Method	Description	Example
By specific value	Filter rows where a column equals a specific value.	`filter(col1 == "value")` `filter(col1 != "value")`
By inequality	Filter rows based on inequality conditions.	`filter(col1 > 10)`
Using multiple conditions	Filter rows that satisfy multiple conditions.	`filter(col1 > 10, col2 == "A")`
With logical operators	Combine conditions with AND (`&`) or OR (`|`).	`filter(col1 > 10 & col2 == "A")` `filter(col1 > 10 | col2 == "A")`
By range	Filter rows within a range of values using `between()`.	`filter(between(col1, 10, 20))`
By missing values	Filter rows with or without missing values.	`filter(is.na(col1))` `filter(!is.na(col1))`

Filtering based on exact character variable matches

Note the use of the double equal sign == rather than the single equal sign =.

Code

yrb_data |> select(sex, grade ) |>  filter(grade=="9th")|>  head(3)

# A tibble: 3 × 2
  sex   grade
  <chr> <chr>
1 Male  9th  
2 Male  9th  
3 Male  9th

Code

yrb_data |> select(sex, grade)|> filter(sex == "Male") |> head(3)

# A tibble: 3 × 2
  sex   grade
  <chr> <chr>
1 Male  11th 
2 Male  10th 
3 Male  9th

Similarly you can use the other operators:
- filter(grade != "9th") will select everything except the grade 9 rows.

If you want to select more than one category value you can use the %in% operator.

Code

yrb_data |> 
  select(sex, age, grade ) |> 
  filter(grade %in% c("9th", "11th")) |> head(3)

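The %in% operator can also be used to exclude certain groups by negating the whole test with !, for example filter(!grade %in% c("9th", "11th")). Note that !%in% on its own is not valid R syntax.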
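A minimal sketch of excluding categories with a negated %in% test, on a hypothetical toy table (names and values are illustrative assumptions). The ! is placed in front of the whole comparison.

```r
library(dplyr)

# Hypothetical toy data, including one missing grade
toy <- tibble(
  sex   = c("Male", "Female", "Male", "Female"),
  grade = c("9th", "10th", "11th", NA)
)

# Keep every row whose grade is NOT "9th" or "11th"
kept <- toy |> filter(!grade %in% c("9th", "11th"))
kept
```

Because NA %in% c("9th", "11th") is FALSE, its negation is TRUE, so the row with a missing grade is kept here. Add !is.na(grade) as a second condition if such rows should be dropped.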
To select all individuals with a bmi between 22 and 30, use:

Code

yrb_data |> 
  select(sex, age, bmi) |> 
  filter(between(bmi, 22, 30))

Code

yrb_data |> 
  select(sex, age, bmi) |> 
  filter(bmi >= 22, bmi <= 30)

Filtering based on multiple conditions

The filter option also allows AND and OR style filters:
filter(condition1, condition2) will return rows where both conditions are met.
filter(condition1 & condition2) will also return rows where both conditions are met.
filter(condition1, !condition2) will return all rows where condition one is true but condition 2 is not.
filter(condition1 | condition2) will return rows where condition 1 and/or condition 2 is met.

Code

yrb_data |> select(sex, age, bmi, stweight, grade) |> 
  filter(bmi > 20, (stweight > 50 | grade != "12th")) |> 
  head(3)

# A tibble: 3 × 5
  sex    age            bmi stweight grade
  <chr>  <chr>        <dbl>    <dbl> <chr>
1 Female 17 years old  20.2     57.2 12th 
2 Male   15 years old  28.0     85.7 10th 
3 Male   14 years old  24.5     66.7 9th

The following code selects the bmi and stweight columns from yrb_data and removes rows with missing bmi values:

Code

yrb_data |>  
  select(bmi, stweight) |> 
  filter(!is.na(bmi)) |> 
  head(4)

# A tibble: 4 × 2
    bmi stweight
  <dbl>    <dbl>
1  17.2     54.4
2  20.2     57.2
3  28.0     85.7
4  24.5     66.7

Adding or Modifying columns using `mutate()`

Another common task is creating a new column based on values in existing columns.
In dplyr, mutate() adds new variables to a data frame while retaining the existing ones; assigning to an existing column name modifies that column.
Example: add a new column called height_m. Since BMI = weight (kg) / height (m)^2, height can be recovered as sqrt(stweight / bmi), assuming stweight is recorded in kilograms.

Code

yrb_data %>% 
  mutate(height_m = sqrt(stweight / bmi)) |>    # use = (not <- or ==) to define new variable
  head(3)

# A tibble: 3 × 9
  record age                   sex    grade race4  race7   bmi stweight height_m
   <dbl> <chr>                 <chr>  <chr> <chr>  <chr> <dbl>    <dbl>    <dbl>
1 931897 15 years old          Female 10th  White  White  17.2     54.4     1.78
2 333862 17 years old          Female 12th  White  White  20.2     57.2     1.68
3  36253 18 years old or older Male   11th  Hispa… Hisp…  NA       NA      NA

We can use the relocate() function to put it before our bmi column:

Code

yrb_data |> 
  mutate(height_m = sqrt(stweight / bmi)) |> 
  relocate(height_m, .before = bmi) |> 
  head(3)

# A tibble: 3 × 9
  record age                   sex    grade race4  race7 height_m   bmi stweight
   <dbl> <chr>                 <chr>  <chr> <chr>  <chr>    <dbl> <dbl>    <dbl>
1 931897 15 years old          Female 10th  White  White     1.78  17.2     54.4
2 333862 17 years old          Female 12th  White  White     1.68  20.2     57.2
3  36253 18 years old or older Male   11th  Hispa… Hisp…    NA     NA       NA

Sort rows with `arrange()`

Re-order rows by a particular column, by default in ascending order

Use desc() for descending order.

arrange(data, variable1, desc(variable2), ...)

Example: Arrange by BMI in descending order

Code

# Example: Arrange by BMI in descending order
yrb_data %>%
  arrange(desc(bmi))

# A tibble: 20,000 × 8
    record age                   sex    grade race4         race7   bmi stweight
     <dbl> <chr>                 <chr>  <chr> <chr>         <chr> <dbl>    <dbl>
 1  324452 16 years old          Male   11th  Black or Afr… Blac…  53.9     91.2
 2 1310082 18 years old or older Male   11th  Black or Afr… Blac…  53.5    160. 
 3  328160 18 years old or older Male   <NA>  Black or Afr… Blac…  53.4    128. 
 4 1315913 17 years old          Female 12th  Black or Afr… Blac…  53.3    142. 
 5 1094597 13 years old          Male   9th   All other ra… Asian  52.9    181. 
 6 1305503 15 years old          Male   9th   All other ra… Am I…  52.4    134. 
 7  770391 16 years old          Female 11th  All other ra… Mult…  52.4    161. 
 8  634138 17 years old          Male   12th  All other ra… Nati…  52.3    160. 
 9 1312697 15 years old          Female 10th  Black or Afr… Blac…  52.3     95.3
10 1099468 17 years old          Male   9th   Black or Afr… Blac…  52.0    174. 
# ℹ 19,990 more rows
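Sorting by more than one key follows the same pattern. The sketch below uses a hypothetical toy table (illustrative values only) to show a two-key sort and where missing values end up.

```r
library(dplyr)

# Hypothetical toy data with one missing bmi
toy <- tibble(
  sex = c("Male", "Female", "Male", "Female"),
  bmi = c(24.5, 20.2, NA, 17.2)
)

# Descending bmi: missing values are placed last, even with desc()
by_bmi <- toy |> arrange(desc(bmi))

# Two keys: sex ascending first, then bmi descending within each sex
by_sex_bmi <- toy |> arrange(sex, desc(bmi))

by_bmi
by_sex_bmi
```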

`group_by()` and `summarise()`

The dplyr verbs become especially powerful when they are combined using the pipe operator |>.
The following dplyr functions allow us to split our data frame into groups on which we can perform operations individually
group_by(): group data frame by a factor for downstream operations (usually summarise)
summarise(): summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc…)

`dplyr` - Split-Apply-Combine

The group_by function is key to the Split-Apply-Combine strategy
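To make the three steps concrete, here is a sketch that performs split, apply, and combine by hand in base R and then with group_by() and summarise(). The toy table is a hypothetical example, not the survey data.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  sex      = c("Female", "Male", "Female", "Male"),
  stweight = c(54.4, 85.7, 57.2, 66.7)
)

# Base R, step by step
pieces <- split(toy$stweight, toy$sex)  # split: one vector per sex
means  <- sapply(pieces, mean)          # apply mean(), combine into a named vector

# dplyr: group_by() does the split, summarise() applies and combines
by_sex <- toy |>
  group_by(sex) |>
  summarise(mean_weight = mean(stweight))

means
by_sex
```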

The `summarize()` function

The summarize() function collapses a data frame into a single row of summary values, or into one row per group.
To summarize by group, first group the observations by one or more categorical variables with group_by().
summarize() then applies the summary functions to the grouped or ungrouped data.

`dplyr::summarize()` Function

To calculate the mean bmi in base R vs. with summarize():

Code

mean(yrb_data$bmi, na.rm = TRUE)

[1] 23.49541

summarize(new_column = summary_function(column))

Code

yrb_data %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi))

# A tibble: 1 × 1
  mean_bmi
     <dbl>
1     23.5

Multiple Summary Statistics

You can calculate multiple statistics in one summarize():

Code

yrb_data %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi), 
            median_bmi = median(bmi))

# A tibble: 1 × 2
  mean_bmi median_bmi
     <dbl>      <dbl>
1     23.5       22.3

Grouped summaries with `dplyr::group_by()`

group_by() groups data by one or more variables.

Example 1: Mean weight by Sex

Code

yrb_data %>% 
  filter(!is.na(stweight)) |> 
  group_by(sex) |> 
  summarize(mean_weight = mean(stweight))

Example 2: Maximum and Minimum Weights

Calculate the min, max and mean weights for each sex. The function n() counts the number of rows in each group:

Code

yrb_data %>% 
  filter(!is.na(stweight)) %>%
  group_by(sex) %>%  
  summarize(max_weight = max(stweight), 
            min_weight = min(stweight),
            mean_weight = mean(stweight),
            n = n())

# A tibble: 2 × 5
  sex    max_weight min_weight mean_weight     n
  <chr>       <dbl>      <dbl>       <dbl> <int>
1 Female       181.       27.7        61.7  6542
2 Male         181.       35.4        73.1  6901

Why `summarize()` Matters

The combination of group_by() and summarize() allows highly informative grouped summaries of datasets with minimal code.
- Producing such summaries is an essential data analysis skill.

Grouping by Multiple Variables (Nested Grouping)

To group by more than one variable, list both in group_by():

Code

yrb_data %>% filter(!is.na(bmi)) %>% group_by(sex, grade) %>%    
  summarize(mean_bmi = mean(bmi)) %>% head(4)

# A tibble: 4 × 3
# Groups:   sex [1]
  sex    grade mean_bmi
  <chr>  <chr>    <dbl>
1 Female 10th      23.0
2 Female 11th      23.4
3 Female 12th      23.9
4 Female 9th       22.8

You can swap the order of the grouping variables in group_by(), and sort the result with the arrange() function:

Code

yrb_data |>  filter(!is.na(bmi))  |>  group_by(grade, sex) |>      
  summarize(mean_bmi = mean(bmi))  |>  arrange(mean_bmi) |> head()

# A tibble: 6 × 3
# Groups:   grade [4]
  grade sex    mean_bmi
  <chr> <chr>     <dbl>
1 9th   Male       22.8
2 9th   Female     22.8
3 10th  Female     23.0
4 <NA>  Female     23.1
5 11th  Female     23.4
6 10th  Male       23.5

Ungrouping Data

After group_by() and summarize(), the resulting data frame may still be grouped.
To avoid unintended behaviors, use ungroup():

Code

yrb_data %>% filter(!is.na(bmi)) |> group_by(sex, grade) %>%   
  summarize(mean_bmi = mean(bmi)) %>% ungroup() %>% head()

# A tibble: 6 × 3
  sex    grade mean_bmi
  <chr>  <chr>    <dbl>
1 Female 10th      23.0
2 Female 11th      23.4
3 Female 12th      23.9
4 Female 9th       22.8
5 Female <NA>      23.1
6 Male   10th      23.5

Why is `ungroup()` Needed?

Grouped data frames behave differently with other dplyr functions such as select(), filter(), or mutate(). For example, select() always keeps the grouping columns:

Code

# Unexpected behavior when grouped 
yrb_data %>% group_by(sex, grade) %>% filter(!is.na(bmi)) %>%    
  summarize(mean_bmi = mean(bmi)) %>% select(mean_bmi) %>% head(4)

# A tibble: 4 × 2
# Groups:   sex [1]
  sex    mean_bmi
  <chr>     <dbl>
1 Female     23.0
2 Female     23.4
3 Female     23.9
4 Female     22.8

By ungrouping, we get the expected output:

Code

yrb_data %>% group_by(sex, grade) %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi)) %>%   
  ungroup() %>% select(mean_bmi) %>% head(3)

# A tibble: 3 × 1
  mean_bmi
     <dbl>
1     23.0
2     23.4
3     23.9
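As an alternative to a separate ungroup() step, summarise() accepts a .groups argument. The sketch below, on a hypothetical toy table, compares the default behavior (the last grouping variable is dropped) with .groups = "drop".

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  sex   = c("Female", "Female", "Male", "Male"),
  grade = c("9th", "10th", "9th", "10th"),
  bmi   = c(22.8, 23.0, 22.8, 23.5)
)

# "drop_last" is what summarise() does by default for this kind of summary:
# the result stays grouped by sex
kept <- toy |>
  group_by(sex, grade) |>
  summarise(mean_bmi = mean(bmi), .groups = "drop_last")

# "drop" removes all grouping, like calling ungroup() afterwards
dropped <- toy |>
  group_by(sex, grade) |>
  summarise(mean_bmi = mean(bmi), .groups = "drop")

group_vars(kept)
is_grouped_df(dropped)
```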

Counting Rows

Use n() inside summarize() to count rows:

Code

yrb_data %>%    
  group_by(sex) %>%    
  summarize(count = n())

# A tibble: 3 × 2
  sex    count
  <chr>  <int>
1 Female  9592
2 Male   10177
3 <NA>     231

You can combine counts with other summary statistics:

Code

yrb_data %>%    
  group_by(sex) %>%    
  summarize(count = n(), 
            mean_bmi = mean(bmi, na.rm = TRUE))

# A tibble: 3 × 3
  sex    count mean_bmi
  <chr>  <int>    <dbl>
1 Female  9592     23.3
2 Male   10177     23.7
3 <NA>     231    NaN

Counting Rows with Conditions

To count rows that meet specific conditions, wrap the condition in sum():

Code

yrb_data %>% group_by(race7) %>% filter(!is.na(bmi)) %>%  
  summarize(count_above50 = sum(bmi > 50))

# A tibble: 8 × 2
  race7                     count_above50
  <chr>                             <int>
1 Am Indian / Alaska Native             1
2 Asian                                 1
3 Black or African American             7
4 Hispanic/Latino                       2
5 Multiple - Non-Hispanic               3
6 Native Hawaiian/other PI              1
7 White                                 2
8 <NA>                                  0

For logical values, TRUE counts as 1 and FALSE as 0, so sum() of a condition returns the number of rows where it holds.
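A tiny base R sketch of this counting trick, using hypothetical bmi values:

```r
# Hypothetical bmi values
bmi <- c(17.2, 20.2, 51.3, 28.0)

above_25 <- bmi > 25      # a logical vector: FALSE FALSE TRUE TRUE

n_above    <- sum(above_25)   # number of TRUE values
prop_above <- mean(above_25)  # proportion of TRUE values

n_above
prop_above
```

The same idea gives proportions inside summarize(), for example mean(bmi > 25) after removing missing values.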

Counting Missing Values

To count NAs:

Code

yrb_data %>% group_by(sex) %>% 
  summarize(unknown_bmi = sum(is.na(bmi)))

# A tibble: 3 × 2
  sex    unknown_bmi
  <chr>        <int>
1 Female        2970
2 Male          3257
3 <NA>           231

To count known (non-missing) values:

Code

yrb_data %>%    group_by(sex) %>%    
  summarize(known_bmi = sum(!is.na(bmi)))

# A tibble: 3 × 2
  sex    known_bmi
  <chr>      <int>
1 Female      6622
2 Male        6920
3 <NA>           0

Using `dplyr::count()`

count() simplifies counting observations by group:

Code

yrb_data %>% count(race4)

# A tibble: 5 × 2
  race4                         n
  <chr>                     <int>
1 All other races            4713
2 Black or African American  4093
3 Hispanic/Latino            4670
4 White                      5814
5 <NA>                        710

This is equivalent to:

Code

yrb_data %>% group_by(race4) %>%    
  summarize(n = n())

# A tibble: 5 × 2
  race4                         n
  <chr>                     <int>
1 All other races            4713
2 Black or African American  4093
3 Hispanic/Latino            4670
4 White                      5814
5 <NA>                        710

You can count by multiple variables:

Code

yrb_data %>% count(sex, grade)

# A tibble: 15 × 3
   sex    grade     n
   <chr>  <chr> <int>
 1 Female 10th   2332
 2 Female 11th   2365
 3 Female 12th   2277
 4 Female 9th    2492
 5 Female <NA>    126
 6 Male   10th   2539
 7 Male   11th   2496
 8 Male   12th   2263
 9 Male   9th    2684
10 Male   <NA>    195
11 <NA>   10th     36
12 <NA>   11th     30
13 <NA>   12th     37
14 <NA>   9th      43
15 <NA>   <NA>     85

Summarizing Multiple Columns with `across()`

A common task is to apply the same summary function (e.g., mean()) to multiple columns. Instead of writing the code for each column, we can use the powerful across() helper function inside summarize().

The basic syntax is summarize(across(columns, function)).

Example 1: Summarize specific columns

Let’s get the mean for the Wind and Temp columns in the airquality dataset.

Code

airquality |>
   summarize(across(c(Wind, Temp), ~mean(.x, na.rm = TRUE)))

      Wind     Temp
1 9.957516 77.88235

How it works:

c(Wind, Temp) tells across() which columns to use.
~mean(.x, na.rm = TRUE) is a shorthand way to write the function to apply. The . or .x is a placeholder for each column selected by across().

Example 2: Summarize columns that match a pattern

We can use select() helpers like where() or starts_with() inside across()! Let’s calculate the mean for all numeric columns.

Code

# Calculating the mean for every numeric column
airquality |>
  summarize(across(where(is.numeric), ~mean(.x, na.rm = TRUE)))

     Ozone  Solar.R     Wind     Temp    Month      Day
1 42.12931 185.9315 9.957516 77.88235 6.993464 15.80392

This across() pattern is a fundamental tool in modern data manipulation with dplyr.
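across() can also apply several functions at once when they are supplied as a named list. The sketch below is one possible way to do this; the output names come from the .names pattern, and "{.col}_{.fn}" is written out explicitly for clarity.

```r
library(dplyr)

# Mean and standard deviation of Wind and Temp in one call
stats <- airquality |>
  summarise(across(
    c(Wind, Temp),
    list(
      mean = \(x) mean(x, na.rm = TRUE),
      sd   = \(x) sd(x, na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"   # column name, underscore, function name
  ))

names(stats)
```

The \(x) form is the anonymous-function shorthand available in R 4.1 and later; it plays the same role as the ~ and .x shorthand used above.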

Recoding Variables

Use recode() inside a mutate() statement.

Example of Recoding

Code

library(tibble)
data_diet <- tibble(diet = rep(c("A", "B", "B"), times = 4), 
                    gender = c("Male","m","f","F","Female","M",
                               "f","M","Man","f","F","female"), 
                    weight_start = sample(100:250, size = 12),
                    weight_change = sample(-10:20, size = 12))
head(data_diet)

# A tibble: 6 × 4
  diet  gender weight_start weight_change
  <chr> <chr>         <int>         <int>
1 A     Male            227             8
2 B     m               226            19
3 B     f               219            15
4 A     F               177            -7
5 B     Female          198            14
6 B     M               136            17

Say we have some data about samples in a diet study but this needs lots of recoding.

Example Cont.

Code

library(dplyr)
data_diet |>
  count(gender)

# A tibble: 8 × 2
  gender     n
  <chr>  <int>
1 F          2
2 Female     1
3 M          2
4 Male       1
5 Man        1
6 f          3
7 female     1
8 m          1

`dplyr` can help!

Using Excel to find all of the different ways gender has been coded could be hectic!

The recode() function inside mutate() is perfect for this. The syntax is recode(variable_to_fix, "old_value" = "new_value", "another_old_value" = "new_value").

Code

data_diet |> 
  mutate(gender_recoded = recode(gender, 
                                 "M" = "Male", "m" = "Male", "Man" = "Male",
                                 "F" = "Female", "f" = "Female", "female" = "Female")) |>
  count(gender_recoded)

# A tibble: 2 × 2
  gender_recoded     n
  <chr>          <int>
1 Female             7
2 Male               5

Or you can use `case_when()`

The case_when() function of dplyr can help us to do this as well.

Note that values not matched by any condition in case_when() become NA unless otherwise specified.

Code

data_diet |> 
  mutate(gender = case_when(gender == "M" ~ "Male"))

# A tibble: 12 × 4
   diet  gender weight_start weight_change
   <chr> <chr>         <int>         <int>
 1 A     <NA>            227             8
 2 B     <NA>            226            19
 3 B     <NA>            219            15
 4 A     <NA>            177            -7
 5 B     <NA>            198            14
 6 B     Male            136            17
 7 A     <NA>            123             3
 8 B     Male            174            -3
 9 B     <NA>            182            16
10 A     <NA>            232            -9
11 B     <NA>            130             6
12 B     <NA>            195             0

Use of `case_when()` without automatic `NA`

Here we use the original values of gender to replace all values of gender that do not meet the condition == "M".

Code

data_diet |> 
  mutate(gender = case_when(gender == "M" ~ "Male", TRUE ~ gender))

# A tibble: 12 × 4
   diet  gender weight_start weight_change
   <chr> <chr>         <int>         <int>
 1 A     Male            227             8
 2 B     m               226            19
 3 B     f               219            15
 4 A     F               177            -7
 5 B     Female          198            14
 6 B     Male            136            17
 7 A     f               123             3
 8 B     Male            174            -3
 9 B     Man             182            16
10 A     f               232            -9
11 B     F               130             6
12 B     female          195             0
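In dplyr 1.1.0 and later, the fallback can also be written with the .default argument instead of TRUE ~ .... A minimal sketch on a hypothetical character vector (this assumes a sufficiently recent dplyr is installed):

```r
library(dplyr)

# Hypothetical gender codes
gender <- c("Male", "m", "M", "f")

# Values that match no condition fall back to the original value
recoded <- case_when(
  gender == "M" ~ "Male",
  .default = gender
)

recoded
```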

More complicated case_when()

Code

data_diet |> 
  mutate(gender = case_when(
    gender %in% c("M", "male", "Man", "m", "Male") ~ "Male",
    gender %in% c("F", "Female", "f", "female") ~ "Female")) |> head()

# A tibble: 6 × 4
  diet  gender weight_start weight_change
  <chr> <chr>         <int>         <int>
1 A     Male            227             8
2 B     Male            226            19
3 B     Female          219            15
4 A     Female          177            -7
5 B     Female          198            14
6 B     Male            136            17

Another reason for `case_when()`

case_when can do very sophisticated comparisons

Code

data_diet1 <-data_diet |> 
      mutate(effect = case_when(weight_change > 0 ~ "increase",
                                weight_change == 0 ~ "same",
                                weight_change < 0 ~ "decrease"))
head(data_diet1)

# A tibble: 6 × 5
  diet  gender weight_start weight_change effect  
  <chr> <chr>         <int>         <int> <chr>   
1 A     Male            227             8 increase
2 B     m               226            19 increase
3 B     f               219            15 increase
4 A     F               177            -7 decrease
5 B     Female          198            14 increase
6 B     M               136            17 increase

Code

data_diet1 |> 
  count(diet, effect)

# A tibble: 5 × 3
  diet  effect       n
  <chr> <chr>    <int>
1 A     decrease     2
2 A     increase     2
3 B     decrease     1
4 B     increase     6
5 B     same         1

Creating new discrete column with `two levels`

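The ifelse() function can be used to turn a numeric column into a discrete one with two levels. Note that in the example below a weight_change of exactly 0 falls into the "decreased" category.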

Code

data_diet |>
  mutate(temp_cat = ifelse(weight_change > 0, "increased", "decreased")) |>
  head()

# A tibble: 6 × 5
  diet  gender weight_start weight_change temp_cat 
  <chr> <chr>         <int>         <int> <chr>    
1 A     Male            227             8 increased
2 B     m               226            19 increased
3 B     f               219            15 increased
4 A     F               177            -7 decreased
5 B     Female          198            14 increased
6 B     M               136            17 increased

`case_when()` improved with `stringr`

Code

data_diet |> 
  mutate(gender = case_when(
    gender %in% c("M", "male", "Man", "m", "Male") ~ "Male",
    gender %in% c("F", "Female", "f", "female")~ "Female")) |> count(gender)

# A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 Female     7
2 Male       5

`case_when()` improved with `stringr`

^ anchors the pattern to the beginning of a character string
$ anchors the pattern to the end
The pattern "^m|^M" therefore matches any value that starts with m or M

Code

library(stringr)
data_diet |> 
  mutate(gender = case_when(
    str_detect(string = gender, pattern = "^m|^M") ~ "Male",
    str_detect(string = gender, pattern = "^f|^F") ~ "Female")) |>
  count(gender)

# A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 Female     7
2 Male       5
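The effect of the anchors is easier to see on a small vector. This is a minimal sketch on hypothetical free-text entries (`gender_raw` is made up for illustration); it uses `regex(ignore_case = TRUE)` from stringr as one possible alternative to spelling out both cases in the pattern.

```r
library(stringr)

# Hypothetical free-text gender entries
gender_raw <- c("Male", "m", "female", "F", "Woman")

# Unanchored: "m" appears somewhere in "female" and "Woman" as well
any_m <- str_detect(gender_raw, regex("m", ignore_case = TRUE))

# "^m" only matches entries that start with m or M
starts_m <- str_detect(gender_raw, regex("^m", ignore_case = TRUE))

# "e$" only matches entries that end with a lowercase e
ends_e <- str_detect(gender_raw, "e$")

# "^f$" must match the whole string, so only a lone f or F qualifies
only_f <- str_detect(gender_raw, regex("^f$", ignore_case = TRUE))
```

Without the `^` anchor, "female" and "Woman" would be picked up by the male pattern, which is why the anchored version is the safer choice for this kind of recoding.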

Data merging

Why Merging Matters in Public Health

Combine demographic & clinical data
Link longitudinal health records
Merge survey responses with medical data
Integrate multiple data sources for cohort studies

The 4 mutating join verbs:
left_join()
right_join()
inner_join()
full_join()

The 2 binding verbs:
bind_rows()
bind_cols()

The 2 filtering join verbs:
semi_join()
anti_join()
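Filtering joins keep or drop rows of the first table without adding any columns from the second. A minimal sketch, assuming two small hypothetical tables (`patients` and `visits` are made up for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical patient list and lab visits, for illustration only
patients <- tibble(patient_id = c("P001", "P002", "P003"))
visits <- tibble(
  patient_id = c("P002", "P002", "P004"),
  sbp = c(120, 118, 130)
)

# semi_join(): patients with at least one visit; no columns from visits are
# added, and P002 is not duplicated even though it has two visits
with_visit <- semi_join(patients, visits, by = "patient_id")

# anti_join(): patients with no visit at all
without_visit <- anti_join(patients, visits, by = "patient_id")
```

In practice anti_join() is a convenient way to list records that failed to link, for example patients with no lab result on file.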

The 3 set operations:
intersect()
union()
setdiff()

All the mutating joins have this basic syntax: *_join(x, y, by = NULL, suffix = c(".x", ".y"))
x = the first (left) table
y = the second (right) table
by = what columns to match on. If you leave this blank, it will match on all columns with the same names in the two tables.
suffix = if columns have the same name in the two tables, but you aren’t joining by them, they get a suffix to make them unambiguous.
This defaults to “.x” and “.y”, but you can change it to something more meaningful.
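The `by` and `suffix` arguments can be sketched on two small tables that share a non-key column name. The tables and the suffix labels below are hypothetical, chosen only to illustrate the arguments described above.

```r
library(dplyr)
library(tibble)

# Hypothetical baseline and follow-up tables that both contain a "weight" column
baseline <- tibble(id = c(1, 2, 3), weight = c(70, 82, 65))
followup <- tibble(id = c(1, 2, 4), weight = c(68, 85, 90))

# Join on id only; the two weight columns get meaningful suffixes
joined <- left_join(baseline, followup, by = "id",
                    suffix = c("_baseline", "_followup"))
names(joined)

# When the key has a different name in each table, map the names in `by`
followup2 <- rename(followup, pid = id)
joined2 <- left_join(baseline, followup2, by = c("id" = "pid"),
                     suffix = c("_baseline", "_followup"))
```

Both calls return one row per baseline record, with weight_followup set to NA for id 3, which has no follow-up row.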

Sample Health Datasets

Patient Demographics (Synthetic)

Code

library(tibble)
clinic_data <- tibble(
  patient_id = c("P001", "P002", "P003", "P004"),
  age = c(35, 28, 42, 31),
  bmi = c(22.1, 26.5, 29.8, 24.3),
  smoking_status = c("former", "never", "current", "never")
)

Clinical Measurements

Code

lab_data <- tibble(
  patient_id = c("P002", "P003", "P003", "P005"),
  visit_date = as.Date(c("2023-01-15", "2023-02-01", 
                         "2023-03-01", "2023-01-20")),
  sbp = c(120, 135, 140, 128),
  dbp = c(80, 85, 90, 82))

left_join(): Preserve Clinic Records

What it does:

Retains all rows from the left (first) table
Adds matching columns from the right (second) table
Fills NA where no match exists

Code

library(dplyr)
left_join(clinic_data, lab_data, by = "patient_id") %>% 
  arrange(patient_id)

# A tibble: 5 × 7
  patient_id   age   bmi smoking_status visit_date   sbp   dbp
  <chr>      <dbl> <dbl> <chr>          <date>     <dbl> <dbl>
1 P001          35  22.1 former         NA            NA    NA
2 P002          28  26.5 never          2023-01-15   120    80
3 P003          42  29.8 current        2023-02-01   135    85
4 P003          42  29.8 current        2023-03-01   140    90
5 P004          31  24.3 never          NA            NA    NA

Typical uses:

Merging patient registries with lab results
Preserving all patients from primary clinic records

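The order of the tables matters. With lab_data on the left, every lab visit is kept (including P005, who has no clinic record), while patients without a lab visit (P001 and P004) are dropped.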

Code

left_join(lab_data, clinic_data, by = "patient_id") %>% 
  arrange(patient_id)

# A tibble: 4 × 7
  patient_id visit_date   sbp   dbp   age   bmi smoking_status
  <chr>      <date>     <dbl> <dbl> <dbl> <dbl> <chr>         
1 P002       2023-01-15   120    80    28  26.5 never         
2 P003       2023-02-01   135    85    42  29.8 current       
3 P003       2023-03-01   140    90    42  29.8 current       
4 P005       2023-01-20   128    82    NA  NA   <NA>

right_join()

A right_join keeps all the data from the second (right) table and joins anything that matches from the first (left) table.
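A minimal sketch of right_join() on the two sample tables, re-created here so the block runs on its own. The `arrange()` call is added only to make the row order predictable.

```r
library(dplyr)
library(tibble)

# Sample tables, copied from the slides above
clinic_data <- tibble(
  patient_id = c("P001", "P002", "P003", "P004"),
  age = c(35, 28, 42, 31),
  bmi = c(22.1, 26.5, 29.8, 24.3),
  smoking_status = c("former", "never", "current", "never")
)
lab_data <- tibble(
  patient_id = c("P002", "P003", "P003", "P005"),
  visit_date = as.Date(c("2023-01-15", "2023-02-01",
                         "2023-03-01", "2023-01-20")),
  sbp = c(120, 135, 140, 128),
  dbp = c(80, 85, 90, 82)
)

# Keep every lab visit; clinic columns are NA where no patient record exists
right_result <- right_join(clinic_data, lab_data, by = "patient_id") |>
  arrange(patient_id)

right_result
```

The result holds the same rows as left_join(lab_data, clinic_data, by = "patient_id"); only the column order differs, because the columns of the first table always come first.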

inner_join(): Complete Cases Only

What it does:

Returns only rows with matches in both tables
Filters out non-matching records (here P001 and P004, who have no lab visit, and P005, who has no clinic record)

Code

inner_join(clinic_data, lab_data, by = "patient_id") %>% 
  knitr::kable()

patient_id	age	bmi	smoking_status	visit_date	sbp	dbp
P002	28	26.5	never	2023-01-15	120	80
P003	42	29.8	current	2023-02-01	135	85
P003	42	29.8	current	2023-03-01	140	90

inner_join() - cont…

Typical uses:

Creating analysis datasets with complete information
Identifying patients with both survey and clinical data

full_join()

What it does:

A full_join lets you join up rows in two tables while keeping all of the information from both tables.
If a row doesn’t have a match in the other table, the other table’s column values are set to NA.

Code

full_join(clinic_data, lab_data, by = "patient_id")

# A tibble: 6 × 7
  patient_id   age   bmi smoking_status visit_date   sbp   dbp
  <chr>      <dbl> <dbl> <chr>          <date>     <dbl> <dbl>
1 P001          35  22.1 former         NA            NA    NA
2 P002          28  26.5 never          2023-01-15   120    80
3 P003          42  29.8 current        2023-02-01   135    85
4 P003          42  29.8 current        2023-03-01   140    90
5 P004          31  24.3 never          NA            NA    NA
6 P005          NA  NA   <NA>           2023-01-20   128    82

bind_rows()

You can stack the rows of two tables with bind_rows().

Columns are matched by name, so they do not have to be in the same order.
Any columns that differ between the two tables will just have NA values for entries from the other table.
If a row is duplicated between the two tables, the row will also be duplicated in the resulting table.
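The rules above can be sketched on two small tables. The `site_a` and `site_b` tables are hypothetical, and the `.id` argument is an optional extra (not covered above) that records which table each row came from.

```r
library(dplyr)
library(tibble)

# Hypothetical records from two sites: column order differs,
# and each table has one column the other lacks
site_a <- tibble(patient_id = c("P001", "P002"),
                 age = c(35, 28),
                 bmi = c(22.1, 26.5))
site_b <- tibble(age = c(28, 50),
                 patient_id = c("P002", "P006"),
                 sbp = c(120, 132))

# Stack the rows; columns are matched by name and missing ones are filled with NA
combined <- bind_rows(list(a = site_a, b = site_b), .id = "site")

combined
```

P002 appears once per site in the result, because bind_rows() never removes repeated records; if that is unwanted, a follow-up step such as distinct() on the relevant columns is needed.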
