1NDMC, EPHI; 2SPH, AAU; 3DMU & C4ED
October 14 - 17, 2025
dplyr
packageThe tidyverse
is a collection of R packages designed for data science.
All packages included in tidyverse
are automatically installed when installing the tidyverse
package:
Install the complete tidyverse with:
core packages
and others called friend packages
.Core tidyverse
tibble
, for tibbles, a modern re-imagining of data framesreadr
, for data importtidyr
, for data tidyingggplot2
, for data visualizationdplyr
, for data manipulationstringr
, for stringsforcats
, for factorspurrr
, for functional programmingMore Data Import/Export Tools
readxl
, for xls and xlsx fileshaven
, for SPSS, SAS, and Stata filesjsonlite
, for JSONxml2
, for XMLhttr
, for web APIsrvest
, for web scrapingDBI
, for databasesTools for date wrangling
lubridate
and hms
, for date/timesTools for modeling
modelr
, broom
for model/tidy datadplyr
packagedplyr
is part of tidyverse
and provides a grammar (the verbs) for data manipulation.
The key operator and the essential verbs are:
Function | Description | Operates on |
---|---|---|
filter() |
pick rows matching criteria | rows |
slice() |
pick rows using indices | rows |
arrange() |
reorder rows | rows |
select() |
pick columns by name | columns |
mutate() |
add new variables | columns |
summarise() |
reduce variables to values | groups of rows |
relocate() |
to change column positions | columns |
… many more.
|>
(the native pipe) or %>%
(the magrittr pipe) : the “pipe” operator used to connect multiple verb actions together into a pipeline.
Note: Base R pipe |>
is available in R 4.1+; magrittr pipe %>%
requires the magrittr/dplyr packages.
Tools → Global Options → Code → Editing → Use Native Pipe Operator (|>
)
select()
:select()
: To extract variables
select()
\(\sim\) columns
select columns (variables)
no quotes needed around variable names
can be used to rearrange columns
uses special syntax that is flexible and has many options
Note that the column names are not quoted; you access the column name as if you are calling the name of an object or variable
select()
in dplyrMethod | Description | Example | |
---|---|---|---|
using column name | Select specific columns by their names. | select(col1, col2) |
|
By position | Select columns by their positions. | select(1, 3) |
|
Using a range | Use : to select columns |
select(col1:col5); select(2:4) |
|
Exclude columns | Use - or ! to exclude specific columns |
select(-col3); select(-(2:4)); select(!starts_with("A")) |
|
Use pattern | Use helper functions based on patterns. |
select(starts_with("prefix")) select(ends_with("suffix")) select(contains("text"))
|
|
Select by type | Use where() to select based on type or condition. |
select(where(is.numeric)) |
|
Select all columns except some | Use everything() to re-order or select all columns except specific ones. |
select(col3, everything()) select(-starts_with("temp"))
|
|
Rearrange columns | Move specific columns to the front while retaining all others. | select(col1, col3, everything()) |
Data from the CDC’s Youth Risk Behavior Surveillance System (YRBSS)
glimpse()
function from the dplyr
package.|>
)Pipes in R look like |>
and strings together commands to be performed sequentially
The pipe passes the data frame output that results from the function right before the pipe to input it as the first argument of the function right after the pipe.
This nesting is not a natural way to think about a sequence of operations.
The |>
operator allows you to string operations in a left-to-right fashion.
select(col1, col2, col3, ...)
# A tibble: 20,000 × 3
age sex grade
<chr> <chr> <chr>
1 15 years old Female 10th
2 17 years old Female 12th
3 18 years old or older Male 11th
4 15 years old Male 10th
5 14 years old Male 9th
6 17 years old Male 9th
7 16 years old Male 11th
8 17 years old Male 12th
9 18 years old or older Male 12th
10 14 years old Male 10th
# ℹ 19,990 more rows
:
:
operator selects a range of consecutive variables:# A tibble: 3 × 4
age sex grade race4
<chr> <chr> <chr> <chr>
1 15 years old Female 10th White
2 17 years old Female 12th White
3 18 years old or older Male 11th Hispanic/Latino
!
or -
The exclamation point negates a selection:
# A tibble: 2 × 7
age sex grade race4 race7 bmi stweight
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 15 years old Female 10th White White 17.2 54.4
2 17 years old Female 12th White White 20.2 57.2
To drop a range of consecutive columns, we use, for example,!age:grade
:
# A tibble: 2 × 5
record race4 race7 bmi stweight
<dbl> <chr> <chr> <dbl> <dbl>
1 931897 White White 17.2 54.4
2 333862 White White 20.2 57.2
To drop several non-consecutive columns, place them inside !c()
:
starts_with()
, ends_with()
and contains()
starts_with()
# A tibble: 2 × 3
record race4 race7
<dbl> <chr> <chr>
1 931897 White White
2 333862 White White
ends_with()
contains()
contains()
helps select columns that contain a certain string:# A tibble: 6 × 5
sex record grade race4 race7
<chr> <dbl> <chr> <chr> <chr>
1 Female 931897 10th White White
2 Female 333862 12th White White
3 Male 36253 11th Hispanic/Latino Hispanic/Latino
4 Male 1095530 10th Black or African American Black or African American
5 Male 1303997 9th All other races Multiple - Non-Hispanic
6 Male 261619 9th All other races <NA>
everything()
# A tibble: 3 × 8
bmi record age sex grade race4 race7 stweight
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 17.2 931897 15 years old Female 10th White White 54.4
2 20.2 333862 17 years old Female 12th White White 57.2
3 NA 36253 18 years old or older Male 11th Hispanic/Latino Hisp… NA
It is often useful for establishing the order of columns.
everything()
.# A tibble: 3 × 8
record race4 race7 age sex grade bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 White White 15 years old Fema… 10th 17.2 54.4
2 333862 White White 17 years old Fema… 12th 20.2 57.2
3 36253 Hispanic/Latino Hispanic/Latino 18 years ol… Male 11th NA NA
select(where(...))
.is.character
, is.double
, is.factor
, is.integer
, is.logical
, is.numeric
.Rows: 20,000
Columns: 3
$ record <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1309…
$ bmi <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.2555,…
$ stweight <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, NA,…
Summary for Select()
function
select(data, ...)
:yrb_data |> select(1, 2, 4)
oryrb_data |> select(1:2)
.yrb_data |> select(age, sex)
, or yrb_data |> select(age:race4)
.yrb_data |> select(starts_with("r"))
, or yrb_data |> select(ends_with("e"))
.yrb_data |> select(where(is.numeric))
,or yrb_data |> select(where(is.character))
.yrb_data |> select(!where(is.numeric)):
selects all non-numeric variables.yrb_data |> select(where(is.numeric) & contains("i")):
selects all numeric variables that contains ‘i’.filter()
filter()
: To extract casesThe function filter()
is used to filter the dataset to return a subset of all rows that meet one or more specific conditions.
filter(dataframe, logical statement 1, logical statement 2, ...)
filter()
in dplyrMethod | Description | Example |
---|---|---|
By specific value | Filter rows where a column equals a specific value. |
filter(col1 == "value") filter(col1 != "value")
|
By inequality | Filter rows based on inequality conditions. | filter(col1 > 10) |
Using multiple conditions | Filter rows that satisfy multiple conditions. | filter(col1 > 10, col2 == "A") |
With logical operators | Using AND (&) and using OR ( |) |
|
By range | Filter rows within a range of values using between() . |
filter(between(col1, 10, 20)) |
By missing values | Filter rows with or without missing values. |
filter(is.na(col1)) filter(!is.na(col1))
|
==
rather than the single equal sign =
.# A tibble: 3 × 2
sex grade
<chr> <chr>
1 Male 9th
2 Male 9th
3 Male 9th
# A tibble: 3 × 2
sex grade
<chr> <chr>
1 Male 11th
2 Male 10th
3 Male 9th
filter(grade != "9th")
will select everything except the grade 9 rows.%in%
operator.The %in%
operator used to deselect certain groups as well, using !%in%
.
To select all individuals with a bmi between 22 and 30, use:
The filter option also allows AND and OR style filters:
filter(condition1, condition2)
will return rows where both conditions are met.
filter(condition1 & condition2)
will also return rows where both conditions are met.
filter(condition1, !condition2)
will return all rows where condition one is true but condition 2 is not.
filter(condition1 | condition2)
will return rows where condition 1 and/or condition 2 is met.
# A tibble: 3 × 5
sex age bmi stweight grade
<chr> <chr> <dbl> <dbl> <chr>
1 Female 17 years old 20.2 57.2 12th
2 Male 15 years old 28.0 85.7 10th
3 Male 14 years old 24.5 66.7 9th
bmi
and stweight
columns from yrb_data
and filters out rows with missing bmi
valuesmutate()
Another common task is creating a new column based on values in existing columns.
The dplyr library has the following functions that can be used to add additional variables to a data frame.
mutate()
– adds new variables while retaining old variables to a data frame.
Example: add the new column called height_m
# A tibble: 3 × 9
record age sex grade race4 race7 bmi stweight height_m
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4 1.78
2 333862 17 years old Female 12th White White 20.2 57.2 1.68
3 36253 18 years old or older Male 11th Hispa… Hisp… NA NA NA
relocate()
function to put it before our bmi
column:# A tibble: 3 × 9
record age sex grade race4 race7 height_m bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 931897 15 years old Female 10th White White 1.78 17.2 54.4
2 333862 17 years old Female 12th White White 1.68 20.2 57.2
3 36253 18 years old or older Male 11th Hispa… Hisp… NA NA NA
arrange
Re-order rows by a particular column, by default in ascending order
Use desc()
for descending order.
arrange(data, variable1, desc(variable2), ...)
Example: Arrange by BMI in descending order
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 324452 16 years old Male 11th Black or Afr… Blac… 53.9 91.2
2 1310082 18 years old or older Male 11th Black or Afr… Blac… 53.5 160.
3 328160 18 years old or older Male <NA> Black or Afr… Blac… 53.4 128.
4 1315913 17 years old Female 12th Black or Afr… Blac… 53.3 142.
5 1094597 13 years old Male 9th All other ra… Asian 52.9 181.
6 1305503 15 years old Male 9th All other ra… Am I… 52.4 134.
7 770391 16 years old Female 11th All other ra… Mult… 52.4 161.
8 634138 17 years old Male 12th All other ra… Nati… 52.3 160.
9 1312697 15 years old Female 10th Black or Afr… Blac… 52.3 95.3
10 1099468 17 years old Male 9th Black or Afr… Blac… 52.0 174.
# ℹ 19,990 more rows
group_by()
and summarise()
The dplyr
verbs become especially powerful when they are are combined using the pipe operator |>
.
The following dplyr
functions allow us to split our data frame into groups on which we can perform operations individually
group_by()
: group data frame by a factor for downstream operations (usually summarise)
summarise()
: summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min()
, max()
, mean()
, etc…)
dplyr
- Split-Apply-CombineThe group_by
function is key to the Split-Apply-Combine strategy
summarize()
functionsummarize()
function is used in the R program to summarize the data frame into just one value or vector.group_by()
function.summarize()
function offers the summary that is based on the action done on grouped or ungrouped data.dplyr::summarize()
Functionsummarize()
::You can calculate multiple statistics in one summarize()
:
# A tibble: 1 × 2
mean_age median_bmi
<dbl> <dbl>
1 23.5 22.3
dplyr::group_by()
group_by()
groups data by one or more variables.
Calculate the min
, max
and mean
weights for each sex. The function n()
will count the number of rows in each group.:
# A tibble: 2 × 5
sex max_weight min_weight mean_weight n
<chr> <dbl> <dbl> <dbl> <int>
1 Female 181. 27.7 61.7 6542
2 Male 181. 35.4 73.1 6901
summarize()
Mattersgroup_by()
and summarize()
allows highly informative grouped summaries of datasets with minimal code.
group_by()
:# A tibble: 4 × 3
# Groups: sex [1]
sex grade mean_bmi
<chr> <chr> <dbl>
1 Female 10th 23.0
2 Female 11th 23.4
3 Female 12th 23.9
4 Female 9th 22.8
group_by()
. We can use the arrange()
function:# A tibble: 6 × 3
# Groups: grade [4]
grade sex mean_bmi
<chr> <chr> <dbl>
1 9th Male 22.8
2 9th Female 22.8
3 10th Female 23.0
4 <NA> Female 23.1
5 11th Female 23.4
6 10th Male 23.5
group_by()
and summarize()
, the resulting data frame may still be grouped.ungroup()
:ungroup()
Needed?dplyr
functions like select()
, filter()
, or mutate()
:# A tibble: 4 × 2
# Groups: sex [1]
sex mean_bmi
<chr> <dbl>
1 Female 23.0
2 Female 23.4
3 Female 23.9
4 Female 22.8
n()
inside summarize()
to count rows:sum()
:# A tibble: 8 × 2
race7 count_above50
<chr> <int>
1 Am Indian / Alaska Native 1
2 Asian 1
3 Black or African American 7
4 Hispanic/Latino 2
5 Multiple - Non-Hispanic 3
6 Native Hawaiian/other PI 1
7 White 2
8 <NA> 0
TRUE
equals 1, and FALSE
equals 0, making sum()
work seamlessly.To count NA
s:
# A tibble: 3 × 2
sex unknown_bmi
<chr> <int>
1 Female 2970
2 Male 3257
3 <NA> 231
To count known (non-missing) values:
dplyr::count()
count()
simplifies counting observations by group:# A tibble: 15 × 3
sex grade n
<chr> <chr> <int>
1 Female 10th 2332
2 Female 11th 2365
3 Female 12th 2277
4 Female 9th 2492
5 Female <NA> 126
6 Male 10th 2539
7 Male 11th 2496
8 Male 12th 2263
9 Male 9th 2684
10 Male <NA> 195
11 <NA> 10th 36
12 <NA> 11th 30
13 <NA> 12th 37
14 <NA> 9th 43
15 <NA> <NA> 85
across()
mean()
) to multiple columns. Instead of writing the code for each column, we can use the powerful across()
helper function inside summarize()
.The basic syntax is summarize(across(columns, function))
.
Let’s get the mean for the Wind
and Temp
columns in the airquality
dataset.
Wind Temp
1 9.957516 77.88235
How it works:
c(Wind, Temp)
tells across()
which columns to use.~mean(.x, na.rm = TRUE)
is a shorthand way to write the function to apply. The .
or .x
is a placeholder for each column selected by across()
.We can use
select()
helpers likewhere()
orstarts_with()
insideacross()
! Let’s calculate the mean for all numeric columns.
This
across()
pattern is a fundamental tool in modern data manipulation withdplyr
.
recode()
inside a mutate()
statement.Example of Recoding
# A tibble: 6 × 4
diet gender weight_start weight_change
<chr> <chr> <int> <int>
1 A Male 227 8
2 B m 226 19
3 B f 219 15
4 A F 177 -7
5 B Female 198 14
6 B M 136 17
Example Cont.
dplyr
can help!Using Excel to find all of the different ways gender
has been coded, could be hectic!
The
recode()
function insidemutate()
is perfect for this. The syntax isrecode(variable_to_fix, "old_value" = "new_value", "another_old_value" = "new_value")
.
case_when()
The case_when()
function of dplyr
can help us to do this as well.
case_when()
will be NA
unless otherwise specified.# A tibble: 12 × 4
diet gender weight_start weight_change
<chr> <chr> <int> <int>
1 A <NA> 227 8
2 B <NA> 226 19
3 B <NA> 219 15
4 A <NA> 177 -7
5 B <NA> 198 14
6 B Male 136 17
7 A <NA> 123 3
8 B Male 174 -3
9 B <NA> 182 16
10 A <NA> 232 -9
11 B <NA> 130 6
12 B <NA> 195 0
case_when()
without automatic NA
gender
to replace all values of gender
that do not meet the condition == "M"
.# A tibble: 12 × 4
diet gender weight_start weight_change
<chr> <chr> <int> <int>
1 A Male 227 8
2 B m 226 19
3 B f 219 15
4 A F 177 -7
5 B Female 198 14
6 B Male 136 17
7 A f 123 3
8 B Male 174 -3
9 B Man 182 16
10 A f 232 -9
11 B F 130 6
12 B female 195 0
# A tibble: 6 × 4
diet gender weight_start weight_change
<chr> <chr> <int> <int>
1 A Male 227 8
2 B Male 226 19
3 B Female 219 15
4 A Female 177 -7
5 B Female 198 14
6 B Male 136 17
case_when()
case_when
can do very sophisticated comparisons
# A tibble: 6 × 5
diet gender weight_start weight_change effect
<chr> <chr> <int> <int> <chr>
1 A Male 227 8 increase
2 B m 226 19 increase
3 B f 219 15 increase
4 A F 177 -7 decrease
5 B Female 198 14 increase
6 B M 136 17 increase
two levels
ifelse()
statement can be used to turn a numeric column into a discrete one.# A tibble: 6 × 5
diet gender weight_start weight_change temp_cat
<chr> <chr> <int> <int> <chr>
1 A Male 227 8 increased
2 B m 226 19 increased
3 B f 219 15 increased
4 A F 177 -7 decreased
5 B Female 198 14 increased
6 B M 136 17 increased
case_when()
improved with stringr
case_when()
improved with stringr
^
indicates the beginning of a character string
$
indicates the end
left_join()
right_join()
inner_join()
full_join()
The 2 binding join verbs:
bind_rows()
bind_cols()
semi_join()
anti_join()
intersect()
union()
and setdiff()
All the joins have this basic syntax: *_join(x, y, by = NULL, suffix = c(".x", ".y")
x =
the first (left) table
y =
the second (right) table
by =
what columns to match on. If you leave this blank, it will match on all columns with the same names in the two tables.
suffix =
if columns have the same name in the two tables, but you aren’t joining by them, they get a suffix to make them unambiguous.
This defaults to “.x” and “.y”, but you can change it to something more meaningful.
What it does:
Retains all rows from the left (first) table
Adds matching columns from the right (second) table
Fills NA where no match exists
left_join()
# A tibble: 5 × 7
patient_id age bmi smoking_status visit_date sbp dbp
<chr> <dbl> <dbl> <chr> <date> <dbl> <dbl>
1 P001 35 22.1 former NA NA NA
2 P002 28 26.5 never 2023-01-15 120 80
3 P003 42 29.8 current 2023-02-01 135 85
4 P003 42 29.8 current 2023-03-01 140 90
5 P004 31 24.3 never NA NA NA
Merging patient registries with lab results
Preserving all patients from primary clinic records
clinic_data
and lab_data
tables is different.# A tibble: 4 × 7
patient_id visit_date sbp dbp age bmi smoking_status
<chr> <date> <dbl> <dbl> <dbl> <dbl> <chr>
1 P002 2023-01-15 120 80 28 26.5 never
2 P003 2023-02-01 135 85 42 29.8 current
3 P003 2023-03-01 140 90 42 29.8 current
4 P005 2023-01-20 128 82 NA NA <NA>
right_join
keeps all the data from the second (right) table and joins anything that matches from the first (left) table.What it does:
inner_join
returns all the rows that have a match in the other table.What it does:
full_join
lets you join up rows in two tables while keeping all of the information from both tables.NA
.# A tibble: 6 × 7
patient_id age bmi smoking_status visit_date sbp dbp
<chr> <dbl> <dbl> <chr> <date> <dbl> <dbl>
1 P001 35 22.1 former NA NA NA
2 P002 28 26.5 never 2023-01-15 120 80
3 P003 42 29.8 current 2023-02-01 135 85
4 P003 42 29.8 current 2023-03-01 140 90
5 P004 31 24.3 never NA NA NA
6 P005 NA NA <NA> 2023-01-20 128 82
bind_rows
.The columns just have to have the same names, they don’t have to be in the same order.
Any columns that differ between the two tables will just have NA
values for entries from the other table.
If a row is duplicated between the two tables, the row will also be duplicated in the resulting table.