Data Manipulation with R
NDMC, EPHI
July 21 - 25, 2025
dplyr()
packageThe tidyverse
is a collection of R packages designed for data science.
All packages included in tidyverse
are automatically installed when installing the tidyverse
package:
Install the complete tidyverse with:
core packages
and others called friend packages
.Core tidyverse
tibble
, for tibbles, a modern re-imagining of data framesreadr
, for data importtidyr
, for data tidyingggplot2
, for data visualizationdplyr
, for data manipulationstringr
, for stringsforcats
, for factorspurrr
, for functional programmingFriends for data import or export (beyond readr)
readxl
, for xls and xlsx fileshaven
, for SPSS, SAS, and Stata filesjsonlite
, for JSONxml2
, for XMLhttr
, for web APIsrvest
, for web scrapingDBI
, for databasesFriends for date wrangling
lubridate
and hms
, for date/timesFriends for modeling
modelr
and broom
for model/tidy datadplyr
packagedplyr
is part of tidyverse
provides a grammar (the verbs) for data manipulation.
The key operator and the essential verbs are:
Function | Description | Operates on |
---|---|---|
filter() |
pick rows matching criteria | rows |
slice() |
pick rows using indices | rows |
arrange() |
reorder rows | rows |
select() |
pick columns by name | columns |
mutate() |
add new variables | columns |
summarise() |
reduce variables to values | groups of rows |
relocate() |
to change column positions | columns |
… many more.
%>%
or |>
: the “pipe” operator used to connect multiple verb actions together into a pipeline.
Tools → Global Options → Code → Editing → Use Native Pipe Operator (|>
)
select()
:select()
: To extract variables
select()
\(\sim\) columns
select columns (variables)
no quotes needed around variable names
can be used to rearrange columns
uses special syntax that is flexible and has many options
Note that the column names are not quoted; you access the column name as if you are calling the name of an object or variable
select()
in dplyrMethod | Description | Example | |
---|---|---|---|
using column name | Select specific columns by their names. | select(col1, col2) |
|
By position | Select columns by their positions. | select(1, 3) |
|
Using a range | Use : to select columns |
select(col1:col5); select(2:4) |
|
Exclude columns | Use - to exclude specific columns |
select(-col3); select(-(2:4)); select(!starts_with("A")) |
|
Use pattern | Use helper functions based on patterns. |
select(starts_with("prefix")) select(ends_with("suffix")) select(contains("text"))
|
|
Select by type | Use where() to select based on type or condition. |
select(where(is.numeric)) |
|
Select all columns except some | Use everything() to re-order or select all columns except specific ones. |
select(col1, everything()) select(-starts_with("temp"))
|
|
Rearrange columns | Move specific columns to the front while retaining all others. | select(col1, col3, everything()) |
Data from the CDC’s Youth Risk Behavior Surveillance System (YRBSS)
yrbss_demo.csv
are a subset of data in the R package yrbss
glimpse()
function from the dplyr
package.|>
)Pipes in R look like |>
and strings together commands to be performed sequentially
The pipe passes the data frame output that results from the function right before the pipe to input it as the first argument of the function right after the pipe.
This nesting is not a natural way to think about a sequence of operations.
The |>
operator allows you to string operations in a left-to-right fashion.
select(col1, col2, col3, ...)
# A tibble: 20,000 × 3
age sex grade
<chr> <chr> <chr>
1 15 years old Female 10th
2 17 years old Female 12th
3 18 years old or older Male 11th
4 15 years old Male 10th
5 14 years old Male 9th
6 17 years old Male 9th
7 16 years old Male 11th
8 17 years old Male 12th
9 18 years old or older Male 12th
10 14 years old Male 10th
# ℹ 19,990 more rows
:
:
operator selects a range of consecutive variables:# A tibble: 3 × 4
age sex grade race4
<chr> <chr> <chr> <chr>
1 15 years old Female 10th White
2 17 years old Female 12th White
3 18 years old or older Male 11th Hispanic/Latino
!
or -
The exclamation point negates a selection:
# A tibble: 2 × 7
age sex grade race4 race7 bmi stweight
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 15 years old Female 10th White White 17.2 54.4
2 17 years old Female 12th White White 20.2 57.2
To drop a range of consecutive columns, we use, for example,!age:grade
:
# A tibble: 2 × 5
record race4 race7 bmi stweight
<dbl> <chr> <chr> <dbl> <dbl>
1 931897 White White 17.2 54.4
2 333862 White White 20.2 57.2
To drop several non-consecutive columns, place them inside !c()
:
starts_with()
, ends_with()
and contains()
starts_with()
# A tibble: 2 × 3
record race4 race7
<dbl> <chr> <chr>
1 931897 White White
2 333862 White White
ends_with()
contains()
contains()
helps select columns that contain a certain string:# A tibble: 6 × 5
sex record grade race4 race7
<chr> <dbl> <chr> <chr> <chr>
1 Female 931897 10th White White
2 Female 333862 12th White White
3 Male 36253 11th Hispanic/Latino Hispanic/Latino
4 Male 1095530 10th Black or African American Black or African American
5 Male 1303997 9th All other races Multiple - Non-Hispanic
6 Male 261619 9th All other races <NA>
everything()
# A tibble: 3 × 8
bmi record age sex grade race4 race7 stweight
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 17.2 931897 15 years old Female 10th White White 54.4
2 20.2 333862 17 years old Female 12th White White 57.2
3 NA 36253 18 years old or older Male 11th Hispanic/Latino Hisp… NA
It is often useful for establishing the order of columns.
everything()
.# A tibble: 3 × 8
record race4 race7 age sex grade bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 White White 15 years old Fema… 10th 17.2 54.4
2 333862 White White 17 years old Fema… 12th 20.2 57.2
3 36253 Hispanic/Latino Hispanic/Latino 18 years ol… Male 11th NA NA
select_if()
.is.character
, is.double
, is.factor
, is.integer
, is.logical
, is.numeric
.Rows: 20,000
Columns: 3
$ record <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1309…
$ bmi <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.2555,…
$ stweight <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, NA,…
Summary for Select()
function
select(data, ...)
:yrb_data |> select(1, 2, 4)
oryrb_data |> select(1:2)
.yrb_data |> select(age, sex)
, or yrb_data |> select(age:race4)
.yrb_data |> select(starts_with("r"))
, or yrb_data |> select(ends_with("e"))
.yrb_data |> select(where(is.numeric))
,or yrb_data |> select(where(is.character))
.yrb_data |> select(!where(is.numeric)):
selects all non-numeric variables.yrb_data |> select(where(is.numeric) & contains("i")):
selects all numeric variables that contains ‘i’.filter()
filter()
: To extract casesThe function filter()
is used to filter the dataset to return a subset of all rows that meet one or more specific conditions.
filter(dataframe, logical statement 1, logical statement 2, ...)
filter()
in dplyrMethod | Description | Example |
---|---|---|
By specific value | Filter rows where a column equals a specific value. |
filter(col1 == "value") filter(col1 != "value")
|
By inequality | Filter rows based on inequality conditions. | filter(col1 > 10) |
Using multiple conditions | Filter rows that satisfy multiple conditions. | filter(col1 > 10, col2 == "A") |
With logical operators | Using AND (&) and using OR ( |) |
|
By range | Filter rows within a range of values using between() . |
filter(between(col1, 10, 20)) |
By missing values | Filter rows with or without missing values. |
filter(is.na(col1)) filter(!is.na(col1))
|
==
rather than the single equal sign =
.# A tibble: 3 × 2
sex grade
<chr> <chr>
1 Male 9th
2 Male 9th
3 Male 9th
# A tibble: 3 × 2
sex grade
<chr> <chr>
1 Male 11th
2 Male 10th
3 Male 9th
filter(grade != "9th")
will select everything except the grade 9 rows.%in%
operator.The %in%
operator used to deselect certain groups as well, using !%in%
.
To select all individuals with a bmi between 22 and 30, use:
The filter option also allows AND and OR style filters:
filter(condition1, condition2)
will return rows where both conditions are met.
filter(condition1 & condition2)
will also return rows where both conditions are met.
filter(condition1, !condition2)
will return all rows where condition one is true but condition 2 is not.
filter(condition1 | condition2)
will return rows where condition 1 and/or condition 2 is met.
# A tibble: 3 × 5
sex age bmi stweight grade
<chr> <chr> <dbl> <dbl> <chr>
1 Female 17 years old 20.2 57.2 12th
2 Male 15 years old 28.0 85.7 10th
3 Male 14 years old 24.5 66.7 9th
bmi
and stweight
columns from yrb_data
and filters out rows with missing bmi
valuesmutate()
Another common task is creating a new column based on values in existing columns.
The dplyr library has the following functions that can be used to add additional variables to a data frame.
mutate()
– adds new variables while retaining old variables to a data frame.
Example: add the new column called height_m
# A tibble: 3 × 9
record age sex grade race4 race7 bmi stweight height_m
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4 1.78
2 333862 17 years old Female 12th White White 20.2 57.2 1.68
3 36253 18 years old or older Male 11th Hispa… Hisp… NA NA NA
relocate()
function to put it before our bmi
column:# A tibble: 3 × 9
record age sex grade race4 race7 height_m bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 931897 15 years old Female 10th White White 1.78 17.2 54.4
2 333862 17 years old Female 12th White White 1.68 20.2 57.2
3 36253 18 years old or older Male 11th Hispa… Hisp… NA NA NA
arrange
Re-order rows by a particular column, by default in ascending order
Use desc()
for descending order.
arrange(data, variable1, desc(variable2), ...)
Example: Arrange by BMI in descending order
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 324452 16 years old Male 11th Black or Afr… Blac… 53.9 91.2
2 1310082 18 years old or older Male 11th Black or Afr… Blac… 53.5 160.
3 328160 18 years old or older Male <NA> Black or Afr… Blac… 53.4 128.
4 1315913 17 years old Female 12th Black or Afr… Blac… 53.3 142.
5 1094597 13 years old Male 9th All other ra… Asian 52.9 181.
6 1305503 15 years old Male 9th All other ra… Am I… 52.4 134.
7 770391 16 years old Female 11th All other ra… Mult… 52.4 161.
8 634138 17 years old Male 12th All other ra… Nati… 52.3 160.
9 1312697 15 years old Female 10th Black or Afr… Blac… 52.3 95.3
10 1099468 17 years old Male 9th Black or Afr… Blac… 52.0 174.
# ℹ 19,990 more rows
group_by()
and summarise()
The dplyr
verbs become especially powerful when they are are combined using the pipe operator |>
.
The following dplyr
functions allow us to split our data frame into groups on which we can perform operations individually
group_by()
: group data frame by a factor for downstream operations (usually summarise)
summarise()
: summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min()
, max()
, mean()
, etc…)
dplyr
- Split-Apply-CombineThe group_by
function is key to the Split-Apply-Combine strategy
summarize()
functionsummarize()
function is used in the R program to summarize the data frame into just one value or vector.group_by()
function.summarize()
function offers the summary that is based on the action done on grouped or ungrouped data.dplyr::summarize()
Functionsummarize()
::You can calculate multiple statistics in one summarize()
:
# A tibble: 1 × 2
mean_age median_bmi
<dbl> <dbl>
1 23.5 22.3
dplyr::group_by()
group_by()
groups data by one or more variables.
Calculate the min
, max
and mean
weights for each sex. The function n()
will count the number of rows in each group.:
# A tibble: 2 × 5
sex max_weight min_weight mean_weight n
<chr> <dbl> <dbl> <dbl> <int>
1 Female 181. 27.7 61.7 6542
2 Male 181. 35.4 73.1 6901
summarize()
Mattersgroup_by()
and summarize()
allows highly informative grouped summaries of datasets with minimal code.
group_by()
:# A tibble: 4 × 3
# Groups: sex [1]
sex grade mean_bmi
<chr> <chr> <dbl>
1 Female 10th 23.0
2 Female 11th 23.4
3 Female 12th 23.9
4 Female 9th 22.8
group_by()
. We can use the arrange()
function:# A tibble: 6 × 3
# Groups: grade [4]
grade sex mean_bmi
<chr> <chr> <dbl>
1 9th Male 22.8
2 9th Female 22.8
3 10th Female 23.0
4 <NA> Female 23.1
5 11th Female 23.4
6 10th Male 23.5
group_by()
and summarize()
, the resulting data frame may still be grouped.ungroup()
:ungroup()
Needed?dplyr
functions like select()
, filter()
, or mutate()
:# A tibble: 4 × 2
# Groups: sex [1]
sex mean_bmi
<chr> <dbl>
1 Female 23.0
2 Female 23.4
3 Female 23.9
4 Female 22.8
n()
inside summarize()
to count rows:sum()
:# A tibble: 8 × 2
race7 count_above50
<chr> <int>
1 Am Indian / Alaska Native 1
2 Asian 1
3 Black or African American 7
4 Hispanic/Latino 2
5 Multiple - Non-Hispanic 3
6 Native Hawaiian/other PI 1
7 White 2
8 <NA> 0
TRUE
equals 1, and FALSE
equals 0, making sum()
work seamlessly.To count NA
s:
# A tibble: 3 × 2
sex unknown_bmi
<chr> <int>
1 Female 2970
2 Male 3257
3 <NA> 231
To count known (non-missing) values:
dplyr::count()
count()
simplifies counting observations by group:# A tibble: 15 × 3
sex grade n
<chr> <chr> <int>
1 Female 10th 2332
2 Female 11th 2365
3 Female 12th 2277
4 Female 9th 2492
5 Female <NA> 126
6 Male 10th 2539
7 Male 11th 2496
8 Male 12th 2263
9 Male 9th 2684
10 Male <NA> 195
11 <NA> 10th 36
12 <NA> 11th 30
13 <NA> 12th 37
14 <NA> 9th 43
15 <NA> <NA> 85
summarize_all()
summarize_at()
summarize_all(action)
airquality |> summarize_all(mean)
will show the mean of all columns.It performs the action on the specific column and generates the summary based on that action.
summarize_at(vector_of_columns, action)
vector_of_columns
: The list of column names or character vector of column names.