Data Manipulation with R

Leykun Getaneh (MSc)

NDMC, EPHI

July 21 - 25, 2025

Data Manipulation and Cleaning using the `dplyr` package

What is Tidyverse?

The tidyverse is a collection of R packages designed for data science.
- All packages share an underlying design philosophy, grammar, and data structures.
- Installing the tidyverse package automatically installs all the packages it includes.
Install the complete tidyverse with:

Code

install.packages("tidyverse")

Load the core tidyverse to make it available in your current R session:

Code

library(tidyverse)

To see the packages included in the tidyverse:

Code

tidyverse_packages()

Some tidyverse packages are considered core packages; the others are called friend packages.

Core tidyverse

tibble, for tibbles, a modern re-imagining of data frames
readr, for data import
tidyr, for data tidying
ggplot2, for data visualization
dplyr, for data manipulation
stringr, for strings
forcats, for factors
purrr, for functional programming

Friends for data import or export (beyond readr)

readxl, for xls and xlsx files
haven, for SPSS, SAS, and Stata files
jsonlite, for JSON
xml2, for XML
httr, for web APIs
rvest, for web scraping
DBI, for databases

Friends for date wrangling

lubridate and hms, for date/times

Friends for modeling

modelr, for modelling within a pipeline, and broom, for turning models into tidy data

Intro to `dplyr` package

dplyr is part of the tidyverse and provides a grammar (a set of verbs) for data manipulation.

The essential verbs and the key operator are:

Function	Description	Operates on
`filter()`	pick rows matching criteria	rows
`slice()`	pick rows using indices	rows
`arrange()`	reorder rows	rows
`select()`	pick columns by name	columns
`mutate()`	add new variables	columns
`summarise()`	reduce variables to values	groups of rows
`relocate()`	change column positions	columns

… many more.

%>% or |> : the “pipe” operator used to connect multiple verb actions together into a pipeline.

In RStudio, you can make |> the default pipe via Tools → Global Options → Code → Editing → Use Native Pipe Operator (|>)

Select Columns from a Dataset `select()`:

select(): To extract variables

select() operates on columns (variables)
no quotes needed around variable names
can be used to rearrange columns
uses special syntax that is flexible and has many options

Because the column names are not quoted, you refer to a column as if it were an object or variable.

Ways to Use `select()` in dplyr

Method	Description	Example
By name	Select specific columns by their names.	`select(col1, col2)`
By position	Select columns by their positions.	`select(1, 3)`
Using a range	Use `:` to select columns	`select(col1:col5); select(2:4)`
Exclude columns	Use `-` to exclude specific columns	`select(-col3); select(-(2:4)); select(!starts_with("A"))`
Use pattern	Use helper functions based on patterns.	`select(starts_with("prefix"))` `select(ends_with("suffix"))` `select(contains("text"))`
Select by type	Use `where()` to select based on type or condition.	`select(where(is.numeric))`
Select all columns except some	Use `everything()` to re-order or select all columns except specific ones.	`select(col1, everything())` `select(-starts_with("temp"))`
Rearrange columns	Move specific columns to the front while retaining all others.	`select(col1, col3, everything())`
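The methods in the table can be tried on a small data frame. The following is a minimal sketch, assuming a made-up tibble called toy whose column names and values are purely illustrative (they are not taken from the course data).

```r
library(dplyr)

# Hypothetical toy data; names and values are illustrative only
toy <- tibble(
  id     = 1:3,
  age    = c(15, 17, 16),
  sex    = c("Female", "Male", "Male"),
  bmi    = c(17.2, 20.2, 24.5),
  weight = c(54.4, 57.2, 66.7)
)

toy |> select(id, sex)              # by name
toy |> select(2:3)                  # by position: age and sex
toy |> select(age:bmi)              # range by name: age, sex, bmi
toy |> select(-id)                  # everything except id
toy |> select(starts_with("w"))     # name pattern: weight
toy |> select(where(is.character))  # by type: sex
toy |> select(bmi, everything())    # bmi first, then all other columns
```

Each call returns a new tibble and leaves toy unchanged, so the lines can be run in any order.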

About the data

Data from the CDC’s Youth Risk Behavior Surveillance System (YRBSS)

complex survey data
national school-based survey conducted by CDC
monitors six categories of health-related behaviors
- that contribute to the leading causes of death and disability among youth and adults
- including alcohol & drug use, unhealthy & dangerous behaviors, sexuality, and physical activity
the data in yrbss_demo.csv are a subset of data in the R package yrbss

Code

library(readr)
yrb_data <- read_csv("data/yrbss.csv")

We can have a look at the data and its structure by using the glimpse() function from the dplyr package.
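As a rough illustration of what glimpse() reports, the sketch below uses a small made-up tibble (toy is a hypothetical stand-in for yrb_data, with invented values) so that it runs without the CSV file.

```r
library(dplyr)

# Hypothetical stand-in for yrb_data with a few invented rows
toy <- tibble(
  record = c(1, 2, 3),
  age    = c("15 years old", "17 years old", "14 years old"),
  bmi    = c(17.2, 20.2, NA)
)

glimpse(toy)  # one line per column: name, type, and the first few values
dim(toy)      # number of rows and number of columns
```

With the real data, the equivalent call would be glimpse(yrb_data).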

Pipe operator (`|>`)

The pipe in R is written |> and strings commands together so that they are performed sequentially.
The pipe takes the output of the function to its left and passes it as the first argument of the function to its right.

Code

third(second(first(x)))

This nesting is not a natural way to think about a sequence of operations.
The |> operator allows you to string operations in a left-to-right fashion.

Code

first(x) |>
  second() |> third()
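To make the equivalence concrete, here is a small sketch using base R functions in place of the placeholder names first(), second() and third(). It assumes R 4.1 or later, where the native pipe is available; the vector x is made up for illustration.

```r
x <- c(1, 4, 4)

nested <- sqrt(sum(x))          # read inside-out: sum() runs first, then sqrt()
piped  <- x |> sum() |> sqrt()  # read left to right: same steps, same order

identical(nested, piped)        # both approaches give the same value
```

Note that the native pipe needs a function call on its right-hand side, so the parentheses after sum and sqrt are required.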

Advantages of the pipe operator

Pipes reduce the number of intermediate steps, which can be hard to keep track of.
Less redundant code
Easy to read and write because functions are executed in the order they are written
- Code is difficult to read if too many functions are nested
Compare the following three syntaxes:

Code

# Step by step, storing an intermediate object
data1 <- filter(mydata, Age > 15)
data2 <- select(data1, Sex, Weight1, Age)

Code

# Nested calls, read from the inside out
non_piped <- select(filter(mydata, Age > 15), Sex, Weight1, Age)

Code

# Piped, read from left to right
pipeddata <- mydata |> filter(Age > 15) |> select(Sex, Weight1, Age)

Select columns by name: `select(col1, col2, col3, ...)`

Code

library(dplyr)
yrb_data1 <- yrb_data |> 
  select(age, sex, grade)
yrb_data1

# A tibble: 20,000 × 3
   age                   sex    grade
   <chr>                 <chr>  <chr>
 1 15 years old          Female 10th 
 2 17 years old          Female 12th 
 3 18 years old or older Male   11th 
 4 15 years old          Male   10th 
 5 14 years old          Male   9th  
 6 17 years old          Male   9th  
 7 16 years old          Male   11th 
 8 17 years old          Male   12th 
 9 18 years old or older Male   12th 
10 14 years old          Male   10th 
# ℹ 19,990 more rows

Selecting column ranges with `:`

The : operator selects a range of consecutive variables:

Code

yrb_data |>  select(age:race4) |>  head(3)

# A tibble: 3 × 4
  age                   sex    grade race4          
  <chr>                 <chr>  <chr> <chr>          
1 15 years old          Female 10th  White          
2 17 years old          Female 12th  White          
3 18 years old or older Male   11th  Hispanic/Latino

We can also specify a range with column numbers:

Code

yrb_data |> select(1:4) |> head(3)

# A tibble: 3 × 4
  record age                   sex    grade
   <dbl> <chr>                 <chr>  <chr>
1 931897 15 years old          Female 10th 
2 333862 17 years old          Female 12th 
3  36253 18 years old or older Male   11th

Excluding columns with `!` or `-`

The exclamation point negates a selection:

Code

yrb_data |> select(!record) |> head(2)

# A tibble: 2 × 7
  age          sex    grade race4 race7   bmi stweight
  <chr>        <chr>  <chr> <chr> <chr> <dbl>    <dbl>
1 15 years old Female 10th  White White  17.2     54.4
2 17 years old Female 12th  White White  20.2     57.2

To drop a range of consecutive columns, negate the range, for example !age:grade:

Code

yrb_data |> select(!age:grade) |> head(2)

# A tibble: 2 × 5
  record race4 race7   bmi stweight
   <dbl> <chr> <chr> <dbl>    <dbl>
1 931897 White White  17.2     54.4
2 333862 White White  20.2     57.2

To drop several non-consecutive columns, place them inside !c():

Code

yrb_data |> select(!c(race4, race7)) |> head(3)

Helper functions: `starts_with()`, `ends_with()` and `contains()`

These helpers work exactly as their names suggest!

`starts_with()`

Code

yrb_data |> select(starts_with("r")) |> head(2)

# A tibble: 2 × 3
  record race4 race7
   <dbl> <chr> <chr>
1 931897 White White
2 333862 White White

`ends_with()`

Code

yrb_data |> select(ends_with("e")) |> head(3)

# A tibble: 3 × 2
  age                   grade
  <chr>                 <chr>
1 15 years old          10th 
2 17 years old          12th 
3 18 years old or older 11th

`contains()`

contains() helps select columns that contain a certain string:

Code

yrb_data |> select(sex, contains("r")) |> head()

# A tibble: 6 × 5
  sex     record grade race4                     race7                    
  <chr>    <dbl> <chr> <chr>                     <chr>                    
1 Female  931897 10th  White                     White                    
2 Female  333862 12th  White                     White                    
3 Male     36253 11th  Hispanic/Latino           Hispanic/Latino          
4 Male   1095530 10th  Black or African American Black or African American
5 Male   1303997 9th   All other races           Multiple - Non-Hispanic  
6 Male    261619 9th   All other races           <NA>

Another helper function, `everything()`

everything() matches all variables that have not yet been selected.

Code

## First, `bmi`, then every other column.
yrb_data |> select(bmi, everything()) |> head(3)

# A tibble: 3 × 8
    bmi record age                   sex    grade race4           race7 stweight
  <dbl>  <dbl> <chr>                 <chr>  <chr> <chr>           <chr>    <dbl>
1  17.2 931897 15 years old          Female 10th  White           White     54.4
2  20.2 333862 17 years old          Female 12th  White           White     57.2
3  NA    36253 18 years old or older Male   11th  Hispanic/Latino Hisp…     NA

It is often useful for establishing the order of columns. Listing every column by hand would be painful for larger data frames; instead, we name the columns to move to the front and let everything() supply the rest.

This helper can be combined with many others.

Code

## Bring columns that start with "r" to the front of the data frame
yrb_data |> select(starts_with("r"), everything()) |> head(3)

# A tibble: 3 × 8
  record race4           race7           age          sex   grade   bmi stweight
   <dbl> <chr>           <chr>           <chr>        <chr> <chr> <dbl>    <dbl>
1 931897 White           White           15 years old Fema… 10th   17.2     54.4
2 333862 White           White           17 years old Fema… 12th   20.2     57.2
3  36253 Hispanic/Latino Hispanic/Latino 18 years ol… Male  11th   NA       NA

You can also select columns based on their data type using where(). (The older select_if() does the same job but is superseded.)
Commonly used type predicates are: is.character, is.double, is.factor, is.integer, is.logical, is.numeric.

Code

yrb_data |> select(where(is.numeric)) |>
  glimpse()  # only numeric columns are selected (here: integer or double)

Rows: 20,000
Columns: 3
$ record   <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1309…
$ bmi      <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.2555,…
$ stweight <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, NA,…

Summary of the select() function

There are five ways to select variables in select(data, ...):

By position: yrb_data |> select(1, 2, 4) or yrb_data |> select(1:2).
By name: yrb_data |> select(age, sex), or yrb_data |> select(age:race4).
By function of name: yrb_data |> select(starts_with("r")), or yrb_data |> select(ends_with("e")).
By type: yrb_data |> select(where(is.numeric)), or yrb_data |> select(where(is.character)).
By any combination of the above using the Boolean operators !, &, and |:
- yrb_data |> select(!where(is.numeric)): selects all non-numeric variables.
- yrb_data |> select(where(is.numeric) & contains("i")): selects all numeric variables whose names contain "i".

Filter cases from the dataset `filter()`

`filter()`: To extract cases

The function filter() is used to filter the dataset to return a subset of all rows that meet one or more specific conditions.

filter(dataframe, logical statement 1, logical statement 2, ...)

Ways to Use `filter()` in dplyr

Method	Description	Example
By specific value	Filter rows where a column equals a specific value.	`filter(col1 == "value")` `filter(col1 != "value")`
By inequality	Filter rows based on inequality conditions.	`filter(col1 > 10)`
Using multiple conditions	Filter rows that satisfy multiple conditions.	`filter(col1 > 10, col2 == "A")`
With logical operators	Combine conditions using AND (`&`) or OR (`|`).	`filter(col1 > 10 & col2 == "A")` `filter(col1 > 10 | col2 == "A")`
By range	Filter rows within a range of values using `between()`.	`filter(between(col1, 10, 20))`
By missing values	Filter rows with or without missing values.	`filter(is.na(col1))` `filter(!is.na(col1))`
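The rows of the table can be checked against a small example. This is a minimal sketch on a made-up tibble (toy and its values are hypothetical), including how missing values behave in each kind of condition.

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- tibble(
  grade = c("9th", "10th", "11th", "12th", NA),
  bmi   = c(18.5, 23.0, NA, 31.2, 25.0)
)

toy |> filter(grade == "9th")                  # exact match
toy |> filter(bmi > 20)                        # inequality; rows with NA bmi are dropped
toy |> filter(between(bmi, 20, 30))            # inclusive range
toy |> filter(grade %in% c("9th", "11th"))     # any of several values
toy |> filter(!(grade %in% c("9th", "11th")))  # none of those values (the NA grade is kept)
toy |> filter(is.na(bmi))                      # only rows with missing bmi
```

filter() keeps a row only when the condition is TRUE, so comparisons that evaluate to NA silently drop the row; %in% never returns NA, which is why its negation keeps the row with a missing grade.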

Filtering based on exact character variable matches

Note the use of the double equal sign == rather than the single equal sign =.

Code

yrb_data |> select(sex, grade ) |>  filter(grade=="9th")|>  head(3)

# A tibble: 3 × 2
  sex   grade
  <chr> <chr>
1 Male  9th  
2 Male  9th  
3 Male  9th

Code

yrb_data |> select(sex, grade)|> filter(sex == "Male") |> head(3)

# A tibble: 3 × 2
  sex   grade
  <chr> <chr>
1 Male  11th 
2 Male  10th 
3 Male  9th

Similarly you can use the other operators:
- filter(grade != "9th") will select everything except the grade 9 rows.

If you want to select more than one category value you can use the %in% operator.

Code

yrb_data |> 
  select(sex, age, grade ) |> 
  filter(grade %in% c("9th", "11th")) |> head(3)

The %in% operator can also be used to deselect certain groups by negating the whole condition, for example filter(!(grade %in% c("9th", "11th"))).
To select all individuals with a bmi between 22 and 30, use:

Code

yrb_data |> 
  select(sex, age, bmi) |> 
  filter(between(bmi, 22, 30))

The same filter can be written with two comparisons:

Code

yrb_data |> 
  select(sex, age, bmi) |> 
  filter(bmi >= 22, bmi <= 30)

Filtering based on multiple conditions

filter() also supports AND and OR style conditions:
filter(condition1, condition2) will return rows where both conditions are met.
filter(condition1 & condition2) will also return rows where both conditions are met.
filter(condition1, !condition2) will return all rows where condition one is true but condition 2 is not.
filter(condition1 | condition2) will return rows where condition 1 and/or condition 2 is met.

Code

yrb_data |> select(sex, age, bmi, stweight, grade) |> 
  filter(bmi > 20, (stweight > 50 | grade != "12th")) |> 
  head(3)

# A tibble: 3 × 5
  sex    age            bmi stweight grade
  <chr>  <chr>        <dbl>    <dbl> <chr>
1 Female 17 years old  20.2     57.2 12th 
2 Male   15 years old  28.0     85.7 10th 
3 Male   14 years old  24.5     66.7 9th
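The rules for combining conditions can be verified on a tiny example. The sketch below assumes a made-up tibble (toy, with invented values) and compares the comma form, the & form and the | form.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  bmi      = c(19, 21, 25, 30),
  stweight = c(45, 48, 70, 90)
)

both_comma <- toy |> filter(bmi > 20, stweight > 50)   # comma means AND
both_amp   <- toy |> filter(bmi > 20 & stweight > 50)  # explicit AND
either     <- toy |> filter(bmi > 20 | stweight > 50)  # OR: at least one condition holds

isTRUE(all.equal(both_comma, both_amp))  # the two AND forms return the same rows
nrow(both_comma)
nrow(either)
```

The OR filter returns at least as many rows as the AND filter, because every row that satisfies both conditions also satisfies at least one of them.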

The following selects the bmi and stweight columns from yrb_data and drops the rows with missing bmi values:

Code

yrb_data |>  
  select(bmi, stweight) |> 
  filter(!is.na(bmi)) |> 
  head(4)

# A tibble: 4 × 2
    bmi stweight
  <dbl>    <dbl>
1  17.2     54.4
2  20.2     57.2
3  28.0     85.7
4  24.5     66.7

Adding or Modifying columns using `mutate()`

Another common task is creating a new column based on values in existing columns.
mutate() adds new variables to a data frame while retaining the existing ones.
Example: add a new column called height_m. BMI is weight (kg) divided by height (m) squared, so, assuming stweight is in kilograms, height can be recovered as sqrt(stweight / bmi):

Code

yrb_data %>% 
  mutate(height_m = sqrt(stweight / bmi)) |>    # use = (not <- or ==) to define new variable
  head(3)

# A tibble: 3 × 9
  record age                   sex    grade race4  race7   bmi stweight height_m
   <dbl> <chr>                 <chr>  <chr> <chr>  <chr> <dbl>    <dbl>    <dbl>
1 931897 15 years old          Female 10th  White  White  17.2     54.4     1.78
2 333862 17 years old          Female 12th  White  White  20.2     57.2     1.68
3  36253 18 years old or older Male   11th  Hispa… Hisp…  NA       NA      NA

We can use the relocate() function to put it before our bmi column:

Code

yrb_data %>% 
  mutate(height_m = sqrt(stweight / bmi)) |> 
  relocate(height_m, .before = bmi) |> 
  head(3)

# A tibble: 3 × 9
  record age                   sex    grade race4  race7 height_m   bmi stweight
   <dbl> <chr>                 <chr>  <chr> <chr>  <chr>    <dbl> <dbl>    <dbl>
1 931897 15 years old          Female 10th  White  White     1.78  17.2     54.4
2 333862 17 years old          Female 12th  White  White     1.68  20.2     57.2
3  36253 18 years old or older Male   11th  Hispa… Hisp…    NA     NA       NA
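mutate() can also create several columns in one call and overwrite existing ones. The following is a minimal sketch on a made-up tibble (toy, its values and the bmi_class labels are all hypothetical), going in the opposite direction to the slide example by computing BMI from height and weight.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  height_m = c(1.60, 1.75, 1.80),
  weight   = c(50.4, 70.2, 85.1)
)

out <- toy |>
  mutate(
    bmi       = weight / height_m^2,                             # new column from existing ones
    bmi_class = if_else(bmi >= 25, "25 or above", "below 25"),  # uses bmi created just above
    weight    = round(weight)                                    # reusing a name overwrites that column
  )
out
```

The expressions are evaluated in order, so bmi is computed from the original weights before weight is rounded.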

Sort rows with `arrange()`

Re-order rows by one or more columns, in ascending order by default.

Use desc() for descending order.

arrange(data, variable1, desc(variable2), ...)

Example: Arrange by BMI in descending order

Code

# Example: Arrange by BMI in descending order
yrb_data %>%
  arrange(desc(bmi))

# A tibble: 20,000 × 8
    record age                   sex    grade race4         race7   bmi stweight
     <dbl> <chr>                 <chr>  <chr> <chr>         <chr> <dbl>    <dbl>
 1  324452 16 years old          Male   11th  Black or Afr… Blac…  53.9     91.2
 2 1310082 18 years old or older Male   11th  Black or Afr… Blac…  53.5    160. 
 3  328160 18 years old or older Male   <NA>  Black or Afr… Blac…  53.4    128. 
 4 1315913 17 years old          Female 12th  Black or Afr… Blac…  53.3    142. 
 5 1094597 13 years old          Male   9th   All other ra… Asian  52.9    181. 
 6 1305503 15 years old          Male   9th   All other ra… Am I…  52.4    134. 
 7  770391 16 years old          Female 11th  All other ra… Mult…  52.4    161. 
 8  634138 17 years old          Male   12th  All other ra… Nati…  52.3    160. 
 9 1312697 15 years old          Female 10th  Black or Afr… Blac…  52.3     95.3
10 1099468 17 years old          Male   9th   Black or Afr… Blac…  52.0    174. 
# ℹ 19,990 more rows
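Two details are easy to miss in a 20,000-row printout: how missing values are placed and how several sort keys interact. The sketch below shows both on a made-up tibble (toy and its values are hypothetical).

```r
library(dplyr)

# Hypothetical toy data with one missing bmi
toy <- tibble(
  sex = c("Male", "Female", "Male", "Female"),
  bmi = c(22.1, NA, 30.4, 19.8)
)

toy |> arrange(bmi)             # ascending; the NA row comes last
toy |> arrange(desc(bmi))       # descending; the NA row still comes last
toy |> arrange(sex, desc(bmi))  # sort by sex, then by descending bmi within each sex
```

Because missing values are always placed at the end, the first rows of a sorted result never contain NA in the sort column unless every value is missing.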

`group_by()` and `summarise()`

The dplyr verbs become especially powerful when they are combined using the pipe operator |>.
The following dplyr functions allow us to split a data frame into groups and perform operations on each group individually:
group_by(): groups a data frame by one or more variables for downstream operations (usually summarise())
summarise(): summarises values in a data frame, or in groups within the data frame, with aggregation functions (e.g. min(), max(), mean(), etc.)

`dplyr` - Split-Apply-Combine

The group_by function is key to the Split-Apply-Combine strategy
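One way to see what Split-Apply-Combine means is to carry out the three steps by hand in base R and compare with the dplyr version. This is an illustrative sketch on a made-up tibble (toy and its values are hypothetical), not a description of how dplyr works internally.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  sex      = c("Female", "Male", "Female", "Male"),
  stweight = c(55, 70, 65, 80)
)

# By hand in base R: split the weights by sex, apply mean() to each piece,
# and let sapply() combine the results into a named vector
by_hand <- sapply(split(toy$stweight, toy$sex), mean)
by_hand

# The same strategy expressed with dplyr
with_dplyr <- toy |>
  group_by(sex) |>
  summarize(mean_weight = mean(stweight))
with_dplyr
```

Both approaches give one mean per sex; the dplyr version returns a data frame, which is easier to pass on to further verbs.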

The `summarize()` function

The summarize() function collapses a data frame into a single row of summary values, or into one row per group when the data are grouped.
For grouped summaries, first group the observations by one or more categorical variables using group_by().
The summary statistics are then computed on the grouped or ungrouped data, whichever is supplied.

`dplyr::summarize()` Function

To calculate the mean bmi in base R vs. with summarize():

Code

mean(yrb_data$bmi, na.rm = TRUE)

[1] 23.49541

summarize(new_column = summary_function(column))

Code

yrb_data %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi))

# A tibble: 1 × 1
  mean_bmi
     <dbl>
1     23.5

Multiple Summary Statistics

You can calculate multiple statistics in one summarize():

Code

yrb_data %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi), 
            median_bmi = median(bmi))

# A tibble: 1 × 2
  mean_bmi median_bmi
     <dbl>      <dbl>
1     23.5       22.3

Grouped summaries with `dplyr::group_by()`

group_by() groups data by one or more variables.

Example 1: Mean weight by Sex

Code

yrb_data %>% 
  filter(!is.na(stweight)) |> 
  group_by(sex) |> 
  summarize(mean_weight = mean(stweight))

Example 2: Maximum and Minimum Weights

Calculate the min, max and mean weights for each sex. The function n() counts the number of rows in each group:

Code

yrb_data %>% 
  filter(!is.na(stweight)) %>%
  group_by(sex) %>%  
  summarize(max_weight = max(stweight), 
            min_weight = min(stweight),
            mean_weight = mean(stweight),
            n = n())

# A tibble: 2 × 5
  sex    max_weight min_weight mean_weight     n
  <chr>       <dbl>      <dbl>       <dbl> <int>
1 Female       181.       27.7        61.7  6542
2 Male         181.       35.4        73.1  6901

Why `summarize()` Matters

The combination of group_by() and summarize() allows highly informative grouped summaries of datasets with minimal code.
- Producing such summaries is an essential data analysis skill.

Grouping by Multiple Variables (Nested Grouping)

To group by more than one variable, list both in group_by():

Code

yrb_data %>% filter(!is.na(bmi)) %>% group_by(sex, grade) %>%    
  summarize(mean_bmi = mean(bmi)) %>% head(4)

# A tibble: 4 × 3
# Groups:   sex [1]
  sex    grade mean_bmi
  <chr>  <chr>    <dbl>
1 Female 10th      23.0
2 Female 11th      23.4
3 Female 12th      23.9
4 Female 9th       22.8

You can swap the order of the grouping variables in group_by(), and sort the result with the arrange() function:

Code

yrb_data |>  filter(!is.na(bmi))  |>  group_by(grade, sex) |>      
  summarize(mean_bmi = mean(bmi))  |>  arrange(mean_bmi) |> head()

# A tibble: 6 × 3
# Groups:   grade [4]
  grade sex    mean_bmi
  <chr> <chr>     <dbl>
1 9th   Male       22.8
2 9th   Female     22.8
3 10th  Female     23.0
4 <NA>  Female     23.1
5 11th  Female     23.4
6 10th  Male       23.5

Ungrouping Data

After group_by() with several variables followed by summarize(), the resulting data frame is still grouped by all but the last grouping variable.
To avoid unintended behavior in later steps, use ungroup():

Code

yrb_data %>% filter(!is.na(bmi)) |> group_by(sex, grade) %>%   
  summarize(mean_bmi = mean(bmi)) %>% ungroup() %>% head()

# A tibble: 6 × 3
  sex    grade mean_bmi
  <chr>  <chr>    <dbl>
1 Female 10th      23.0
2 Female 11th      23.4
3 Female 12th      23.9
4 Female 9th       22.8
5 Female <NA>      23.1
6 Male   10th      23.5

Why is `ungroup()` Needed?

Grouped data frames behave differently with other dplyr functions like select(), filter(), or mutate(). For example, select() always keeps the grouping columns:

Code

# Unexpected behavior when grouped 
yrb_data %>% group_by(sex, grade) %>% filter(!is.na(bmi)) %>%    
  summarize(mean_bmi = mean(bmi)) %>% select(mean_bmi) %>% head(4)

# A tibble: 4 × 2
# Groups:   sex [1]
  sex    mean_bmi
  <chr>     <dbl>
1 Female     23.0
2 Female     23.4
3 Female     23.9
4 Female     22.8

By ungrouping, we get the expected output:

Code

yrb_data %>% group_by(sex, grade) %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi)) %>%   
  ungroup() %>% select(mean_bmi) %>% head(3)

# A tibble: 3 × 1
  mean_bmi
     <dbl>
1     23.0
2     23.4
3     23.9
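As an alternative to a separate ungroup() step, summarize() accepts a .groups argument that controls the grouping of its result. The sketch below uses a made-up tibble (toy and its values are hypothetical) to compare the default with .groups = "drop".

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  sex   = c("Female", "Female", "Male", "Male"),
  grade = c("9th", "10th", "9th", "10th"),
  bmi   = c(21, 23, 22, 24)
)

still_grouped <- toy |>
  group_by(sex, grade) |>
  summarize(mean_bmi = mean(bmi))   # default: result stays grouped by sex (a message says so)

dropped <- toy |>
  group_by(sex, grade) |>
  summarize(mean_bmi = mean(bmi), .groups = "drop")  # no grouping left

group_vars(still_grouped)  # grouping variables that remain
group_vars(dropped)        # empty: the result is ungrouped
```

Whether to use ungroup() or .groups = "drop" is largely a matter of style; both leave an ungrouped data frame for the following steps.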

Counting Rows

Use n() inside summarize() to count rows:

Code

yrb_data %>%    
  group_by(sex) %>%    
  summarize(count = n())

# A tibble: 3 × 2
  sex    count
  <chr>  <int>
1 Female  9592
2 Male   10177
3 <NA>     231

You can combine counts with other summary statistics:

Code

yrb_data %>%    
  group_by(sex) %>%    
  summarize(count = n(), 
            mean_bmi = mean(bmi, na.rm = TRUE))

# A tibble: 3 × 3
  sex    count mean_bmi
  <chr>  <int>    <dbl>
1 Female  9592     23.3
2 Male   10177     23.7
3 <NA>     231    NaN

Counting Rows with Conditions

To count rows that meet specific conditions, wrap the condition in sum():

Code

yrb_data %>% group_by(race7) %>% filter(!is.na(bmi)) %>%  
  summarize(count_above50 = sum(bmi > 50))

# A tibble: 8 × 2
  race7                     count_above50
  <chr>                             <int>
1 Am Indian / Alaska Native             1
2 Asian                                 1
3 Black or African American             7
4 Hispanic/Latino                       2
5 Multiple - Non-Hispanic               3
6 Native Hawaiian/other PI              1
7 White                                 2
8 <NA>                                  0

For logical values, TRUE counts as 1 and FALSE as 0, so sum() returns the number of rows that meet the condition.
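The arithmetic behind this can be seen on a plain vector. In this small sketch the values in bmi are made up for illustration.

```r
bmi <- c(18, 27, 31, 22, 35)

over_25 <- bmi > 25   # logical vector: FALSE TRUE TRUE FALSE TRUE
sum(over_25)          # number of TRUE values
mean(over_25)         # proportion of TRUE values
```

The same idea works inside summarize(): sum(condition) gives a count and mean(condition) gives a proportion for each group.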

Counting Missing Values

To count NAs:

Code

yrb_data %>% group_by(sex) %>% 
  summarize(unknown_bmi = sum(is.na(bmi)))

# A tibble: 3 × 2
  sex    unknown_bmi
  <chr>        <int>
1 Female        2970
2 Male          3257
3 <NA>           231

To count known (non-missing) values:

Code

yrb_data %>%    group_by(sex) %>%    
  summarize(known_bmi = sum(!is.na(bmi)))

# A tibble: 3 × 2
  sex    known_bmi
  <chr>      <int>
1 Female      6622
2 Male        6920
3 <NA>           0

Using `dplyr::count()`

count() simplifies counting observations by group:

Code

yrb_data %>% count(race4)

# A tibble: 5 × 2
  race4                         n
  <chr>                     <int>
1 All other races            4713
2 Black or African American  4093
3 Hispanic/Latino            4670
4 White                      5814
5 <NA>                        710

This is equivalent to:

Code

yrb_data %>% group_by(race4) %>%    
  summarize(n = n())

# A tibble: 5 × 2
  race4                         n
  <chr>                     <int>
1 All other races            4713
2 Black or African American  4093
3 Hispanic/Latino            4670
4 White                      5814
5 <NA>                        710

You can count by multiple variables:

Code

yrb_data %>% count(sex, grade)

# A tibble: 15 × 3
   sex    grade     n
   <chr>  <chr> <int>
 1 Female 10th   2332
 2 Female 11th   2365
 3 Female 12th   2277
 4 Female 9th    2492
 5 Female <NA>    126
 6 Male   10th   2539
 7 Male   11th   2496
 8 Male   12th   2263
 9 Male   9th    2684
10 Male   <NA>    195
11 <NA>   10th     36
12 <NA>   11th     30
13 <NA>   12th     37
14 <NA>   9th      43
15 <NA>   <NA>     85
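count() also has arguments for ordering the result and naming the count column. The following is a minimal sketch on a made-up tibble (toy, its values and the column name n_students are hypothetical).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  grade = c("9th", "10th", "9th", "11th", "9th", "10th")
)

toy |> count(grade)                                     # ordered by grade
toy |> count(grade, sort = TRUE)                        # largest group first
toy |> count(grade, sort = TRUE, name = "n_students")   # custom name for the count column
```

Sorting by count is often more useful than the default ordering when a variable has many categories.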

Summarize ungrouped data

We can also summarize ungrouped data. This can be done by using two functions:
- summarize_all()
- summarize_at()

1. summarize_all()

This function applies the same summary action to every column of the data: summarize_all(action)
Example: airquality |> summarize_all(mean, na.rm = TRUE) shows the mean of every column, ignoring missing values.

Code

# Calculating the mean value of every column.
airquality |> summarize_all(mean, na.rm = TRUE)

     Ozone  Solar.R     Wind     Temp    Month      Day
1 42.12931 185.9315 9.957516 77.88235 6.993464 15.80392

2. summarize_at()

It performs the action on specific columns and generates the summary based on that action.
summarize_at(vector_of_columns, action)
vector_of_columns: a character vector of column names.

Code

airquality |> group_by(Month) |>
  summarize_at(c("Wind", "Temp"), mean)

# A tibble: 5 × 3
  Month  Wind  Temp
  <int> <dbl> <dbl>
1     5 11.6   65.5
2     6 10.3   79.1
3     7  8.94  83.9
4     8  8.79  84.0
5     9 10.2   76.9
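In current dplyr (version 1.0.0 and later), summarize_all() and summarize_at() are superseded by across() used inside summarize(). The sketch below shows one way to write the two airquality examples in that style; it assumes R 4.1 or later for the \(x) shorthand, and the object names all_means and monthly are chosen here for illustration.

```r
library(dplyr)

# Mean of every column, ignoring missing values (compare summarize_all())
all_means <- airquality |>
  summarize(across(everything(), \(x) mean(x, na.rm = TRUE)))
all_means

# Mean of selected columns within each month (compare summarize_at())
monthly <- airquality |>
  group_by(Month) |>
  summarize(across(c(Wind, Temp), mean))
monthly
```

The older functions still work, so existing code does not need to change; across() has the advantage of using the same column selection syntax as select().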
