Data Manipulation with R

Leykun Getaneh (MSc)

NDMC, EPHI


July 21 - 25, 2025

Data Manipulation and Cleaning using the dplyr package

What is Tidyverse?

  • The tidyverse is a collection of R packages designed for data science.

    • All packages share an underlying design philosophy, grammar, and data structures.
  • Installing the tidyverse package automatically installs all of the packages it includes.

  • Install the complete tidyverse with:

Code
install.packages("tidyverse")
  • Load the core tidyverse to make it available in your current R session:
Code
library(tidyverse) 
  • To see the packages included in the tidyverse
Code
tidyverse_packages() 
  • Some tidyverse packages are considered core packages; the others are called friend packages.

Core tidyverse

  • tibble, for tibbles, a modern re-imagining of data frames
  • readr, for data import
  • tidyr, for data tidying
  • ggplot2, for data visualization
  • dplyr, for data manipulation
  • stringr, for strings
  • forcats, for factors
  • purrr, for functional programming

Friends for data import or export (beyond readr)

  • readxl, for xls and xlsx files
  • haven, for SPSS, SAS, and Stata files
  • jsonlite, for JSON
  • xml2, for XML
  • httr, for web APIs
  • rvest, for web scraping
  • DBI, for databases

Friends for date wrangling

  • lubridate and hms, for date/times

Friends for modeling

  • modelr, for modeling within a pipeline, and broom, for turning models into tidy data

Intro to dplyr package

  • dplyr is part of the tidyverse and provides a grammar (a set of verbs) for data manipulation.

  • The key operator and the essential verbs are:

    Function      Description                     Operates on
    filter()      pick rows matching criteria     rows
    slice()       pick rows using indices         rows
    arrange()     reorder rows                    rows
    select()      pick columns by name            columns
    mutate()      add new variables               columns
    summarise()   reduce variables to values      groups of rows
    relocate()    change column positions         columns

    … many more.

  • %>% or |>: the “pipe” operator, used to connect multiple verb actions together into a pipeline.

In RStudio, the pipe keyboard shortcut can be set to insert the native pipe: Tools → Global Options → Code → Editing → Use Native Pipe Operator (|>)

Select Columns from a Dataset: select()

select(): To extract variables

  • select() operates on columns (variables)

  • can be used to rearrange columns

  • uses special syntax that is flexible and has many options

  • no quotes are needed around variable names: you refer to a column as if it were an object or variable

Ways to Use select() in dplyr

Method                          Description                                            Example
Using column names              Select specific columns by their names.                select(col1, col2)
By position                     Select columns by their positions.                     select(1, 3)
Using a range                   Use : to select a range of consecutive columns.        select(col1:col5); select(2:4)
Exclude columns                 Use - or ! to exclude specific columns.                select(-col3); select(-(2:4)); select(!starts_with("A"))
Use a pattern                   Use helper functions based on name patterns.           select(starts_with("prefix"))
                                                                                       select(ends_with("suffix"))
                                                                                       select(contains("text"))
Select by type                  Use where() to select based on type or condition.      select(where(is.numeric))
Select all columns except some  Use everything() to re-order, or - to drop columns.    select(col1, everything())
                                                                                       select(-starts_with("temp"))
Rearrange columns               Move specific columns to the front, keeping the rest.  select(col1, col3, everything())

About the data

Data from the CDC’s Youth Risk Behavior Surveillance System (YRBSS)

  • complex survey data
  • national school-based survey conducted by CDC
  • monitors six categories of health-related behaviors
    • that contribute to the leading causes of death and disability among youth and adults
    • including alcohol & drug use, unhealthy & dangerous behaviors, sexuality, and physical activity
  • the data in yrbss_demo.csv are a subset of data in the R package yrbss
Code
library(readr)
yrb_data <- read_csv("data/yrbss.csv")
  • We can have a look at the data and its structure by using the glimpse() function from the dplyr package.
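As a minimal sketch of what glimpse() shows, the example below uses a small made-up tibble rather than the real file, since yrb_data depends on a local CSV. The object toy_data and its values are hypothetical; only the column names mirror yrb_data.

```r
library(dplyr)

# Toy stand-in for yrb_data; the values are made up for illustration.
toy_data <- tibble(
  record = c(1, 2, 3),
  age    = c("15 years old", "17 years old", "14 years old"),
  bmi    = c(17.2, 20.2, NA)
)

# glimpse() prints one line per column: its name, its type,
# and as many of the first values as fit on the line.
glimpse(toy_data)

# dim() returns the same row and column counts that glimpse() reports.
toy_dims <- dim(toy_data)
toy_dims
```

With the real data, the same call would simply be glimpse(yrb_data).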

Pipe operator (|>)

  • The pipe in R is written |> and strings commands together so that they are performed sequentially.

  • The pipe takes the output of the function before it and passes it as the first argument of the function after it.

  • Without the pipe, a sequence of operations has to be written as nested calls:

Code
third(second(first(x)))
  • This nesting is not a natural way to think about a sequence of operations.

  • The |> operator allows you to string operations in a left-to-right fashion.

Code
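# The native pipe needs a function call on its right-hand side,
# so each function keeps its parentheses.
first(x) |>
  second() |> third()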
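The placeholder functions first(), second() and third() are not real, so here is a small runnable sketch of the same idea using base R functions. It assumes R 4.1 or later, where the native pipe |> is available; the vector x is made up.

```r
x <- c(1, 4, 4)

# Nested form: has to be read from the inside out.
nested <- round(sqrt(sum(x)), 1)

# Piped form: reads left to right, one step at a time.
piped <- x |> sum() |> sqrt() |> round(1)

# Both forms compute the same value.
identical(nested, piped)
```

Each step on the right of |> receives the previous result as its first argument, so round(1) here means round(<previous result>, 1).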

Advantages of the pipe operator

  • Pipes reduce the number of intermediate steps, which can be hard to keep track of.
  • Less redundant code
  • Easy to read and write, because functions are executed in the order they are written
    • Nested calls become difficult to read when many functions are involved
  • Compare the three syntaxes:
Code
data1 <- filter(mydata, Age > 15)
data2 <- select(data1, Sex, Weight1, Age)
Code
non_piped <- select(filter(mydata, Age > 15), Sex, Weight1, Age)
Code
pipeddata <- mydata |> filter(Age > 15) |> select(Sex, Weight1, Age)
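To check that the three styles really are interchangeable, the sketch below runs them on a tiny made-up data frame, selecting the same three columns each time. The object mydata and its values are hypothetical; only the column names come from the slide.

```r
library(dplyr)

# Hypothetical data using the column names from this slide.
mydata <- tibble(
  Sex     = c("F", "M", "F", "M"),
  Weight1 = c(50, 62, 58, 70),
  Height1 = c(150, 160, 165, 172),
  Age     = c(14, 16, 18, 15)
)

# 1. Step by step, with an intermediate object.
data1 <- filter(mydata, Age > 15)
data2 <- select(data1, Sex, Weight1, Age)

# 2. Nested calls.
non_piped <- select(filter(mydata, Age > 15), Sex, Weight1, Age)

# 3. Piped.
pipeddata <- mydata |> filter(Age > 15) |> select(Sex, Weight1, Age)

# Compare the three results.
all.equal(data2, non_piped)
all.equal(non_piped, pipeddata)
```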

Select columns by name: select(col1, col2, col3, ...)

Code
library(dplyr)
yrb_data1 <- yrb_data |> 
  select(age, sex, grade)
yrb_data1
# A tibble: 20,000 × 3
   age                   sex    grade
   <chr>                 <chr>  <chr>
 1 15 years old          Female 10th 
 2 17 years old          Female 12th 
 3 18 years old or older Male   11th 
 4 15 years old          Male   10th 
 5 14 years old          Male   9th  
 6 17 years old          Male   9th  
 7 16 years old          Male   11th 
 8 17 years old          Male   12th 
 9 18 years old or older Male   12th 
10 14 years old          Male   10th 
# ℹ 19,990 more rows

Selecting column ranges with :

  • The : operator selects a range of consecutive variables:
Code
yrb_data |>  select(age:race4) |>  head(3)
# A tibble: 3 × 4
  age                   sex    grade race4          
  <chr>                 <chr>  <chr> <chr>          
1 15 years old          Female 10th  White          
2 17 years old          Female 12th  White          
3 18 years old or older Male   11th  Hispanic/Latino
  • We can also specify a range with column numbers:
Code
yrb_data |> select(1:4) |> head(3)
# A tibble: 3 × 4
  record age                   sex    grade
   <dbl> <chr>                 <chr>  <chr>
1 931897 15 years old          Female 10th 
2 333862 17 years old          Female 12th 
3  36253 18 years old or older Male   11th 

Excluding columns with ! or -

The exclamation point negates a selection:

Code
yrb_data |> select(!record) |> head(2)
# A tibble: 2 × 7
  age          sex    grade race4 race7   bmi stweight
  <chr>        <chr>  <chr> <chr> <chr> <dbl>    <dbl>
1 15 years old Female 10th  White White  17.2     54.4
2 17 years old Female 12th  White White  20.2     57.2

To drop a range of consecutive columns, use, for example, !age:grade:

Code
yrb_data |> select(!age:grade) |> head(2)
# A tibble: 2 × 5
  record race4 race7   bmi stweight
   <dbl> <chr> <chr> <dbl>    <dbl>
1 931897 White White  17.2     54.4
2 333862 White White  20.2     57.2

To drop several non-consecutive columns, place them inside !c():

Code
yrb_data |> select(!c(race4, race7)) |> head(3)
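The heading also mentions -, which works the same way for a set of columns. The sketch below compares the two on a made-up tibble (toy_data is hypothetical; only the column names mirror yrb_data) and shows one detail to watch for when dropping a range with -.

```r
library(dplyr)

# Toy tibble; column names mirror yrb_data, values are made up.
toy_data <- tibble(
  record = 1:2,
  age    = c("15 years old", "17 years old"),
  sex    = c("Female", "Male"),
  race4  = c("White", "White"),
  race7  = c("White", "White"),
  bmi    = c(17.2, 20.2)
)

# Dropping several columns: ! and - select the same columns here.
drop_bang  <- toy_data |> select(!c(race4, race7))
drop_minus <- toy_data |> select(-c(race4, race7))
identical(names(drop_bang), names(drop_minus))

# Dropping a range with -: wrap the range in parentheses,
# because -age:sex is parsed by R as (-age):sex.
drop_range <- toy_data |> select(-(age:sex))
names(drop_range)
```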

Helper functions: starts_with(), ends_with() and contains()

  • These helpers work exactly as their names suggest!

starts_with()

Code
yrb_data |> select(starts_with("r")) |> head(2)
# A tibble: 2 × 3
  record race4 race7
   <dbl> <chr> <chr>
1 931897 White White
2 333862 White White

ends_with()

Code
yrb_data |> select(ends_with("e")) |> head(3)
# A tibble: 3 × 2
  age                   grade
  <chr>                 <chr>
1 15 years old          10th 
2 17 years old          12th 
3 18 years old or older 11th 

contains()

  • contains() helps select columns that contain a certain string:
Code
yrb_data |> select(sex, contains("r")) |> head()
# A tibble: 6 × 5
  sex     record grade race4                     race7                    
  <chr>    <dbl> <chr> <chr>                     <chr>                    
1 Female  931897 10th  White                     White                    
2 Female  333862 12th  White                     White                    
3 Male     36253 11th  Hispanic/Latino           Hispanic/Latino          
4 Male   1095530 10th  Black or African American Black or African American
5 Male   1303997 9th   All other races           Multiple - Non-Hispanic  
6 Male    261619 9th   All other races           <NA>                     

Another helper function, everything()

  • matches all variables that have not yet been selected.
Code
## First, `bmi`, then every other column.
yrb_data |> select(bmi, everything()) |> head(3)
# A tibble: 3 × 8
    bmi record age                   sex    grade race4           race7 stweight
  <dbl>  <dbl> <chr>                 <chr>  <chr> <chr>           <chr>    <dbl>
1  17.2 931897 15 years old          Female 10th  White           White     54.4
2  20.2 333862 17 years old          Female 12th  White           White     57.2
3  NA    36253 18 years old or older Male   11th  Hispanic/Latino Hisp…     NA  

everything() is often useful for establishing the order of columns.

  • Listing every column by hand would be painful for larger data frames. In such a case, we can use everything().
  • This helper can be combined with many others.
Code
## Bring columns that start with "r" to the front of the data frame
yrb_data |> select(starts_with("r"), everything()) %>% head(3)
# A tibble: 3 × 8
  record race4           race7           age          sex   grade   bmi stweight
   <dbl> <chr>           <chr>           <chr>        <chr> <chr> <dbl>    <dbl>
1 931897 White           White           15 years old Fema… 10th   17.2     54.4
2 333862 White           White           17 years old Fema… 12th   20.2     57.2
3  36253 Hispanic/Latino Hispanic/Latino 18 years ol… Male  11th   NA       NA  
  • You can also select columns based on their data type using select_if(). In current dplyr, select_if() is superseded by select(where(...)), which gives the same result.
  • Common type predicates are: is.character, is.double, is.factor, is.integer, is.logical, is.numeric.
Code
yrb_data |>  select_if(is.numeric) |>
  glimpse()  # numeric data types only selected (here: integer or double)
Rows: 20,000
Columns: 3
$ record   <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1309…
$ bmi      <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.2555,…
$ stweight <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, NA,…
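As a small sketch of the where() form of the same selection, the example below compares it with select_if() on a made-up tibble. The object toy_data is hypothetical and stands in for yrb_data.

```r
library(dplyr)

# Toy tibble with two numeric columns and one character column.
toy_data <- tibble(
  record = c(1, 2, 3),
  sex    = c("Female", "Male", "Male"),
  bmi    = c(17.2, 20.2, NA)
)

# Older scoped verb.
with_select_if <- toy_data |> select_if(is.numeric)

# Current form: a predicate wrapped in where() inside select().
with_where <- toy_data |> select(where(is.numeric))

# Both keep the same columns.
names(with_select_if)
names(with_where)
```

The where() form has the advantage that it can be mixed with names and other helpers in the same select() call.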

Summary of the select() function

  • There are five ways to select variables in select(data, ...):
  1. By position: yrb_data |> select(1, 2, 4) or yrb_data |> select(1:2).
  2. By name: yrb_data |> select(age, sex), or yrb_data |> select(age:race4).
  3. By function of name: yrb_data |> select(starts_with("r")), or yrb_data |> select(ends_with("e")).
  4. By type: yrb_data |> select(where(is.numeric)), or yrb_data |> select(where(is.character)).
  5. By any combination of the above using the Boolean operators !, &, and |:
    • yrb_data |> select(!where(is.numeric)): selects all non-numeric variables.
    • yrb_data |> select(where(is.numeric) & contains("i")): selects all numeric variables whose names contain "i".
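The Boolean combinations in point 5 can be tried on a small made-up tibble. This is only an illustrative sketch: toy_data is hypothetical, and the | example is an extra case that does not appear in the list above.

```r
library(dplyr)

toy_data <- tibble(
  record   = c(1, 2),
  sex      = c("Female", "Male"),
  bmi      = c(17.2, 20.2),
  stweight = c(54.4, 57.2)
)

# !  : every column that is not numeric
non_numeric <- toy_data |> select(!where(is.numeric)) |> names()

# &  : numeric columns whose names also contain "i"
numeric_with_i <- toy_data |>
  select(where(is.numeric) & contains("i")) |>
  names()

# |  : columns that are numeric or whose names start with "s"
numeric_or_s <- toy_data |>
  select(where(is.numeric) | starts_with("s")) |>
  names()

non_numeric
numeric_with_i
numeric_or_s
```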

Filter Cases from a Dataset: filter()

filter(): To extract cases

The function filter() is used to filter the dataset to return a subset of all rows that meet one or more specific conditions.

  • filter(dataframe, logical statement 1, logical statement 2, ...)

Ways to Use filter() in dplyr

Method                     Description                                            Example
By specific value          Filter rows where a column equals a specific value.    filter(col1 == "value")
                                                                                  filter(col1 != "value")
By inequality              Filter rows based on inequality conditions.            filter(col1 > 10)
Using multiple conditions  Filter rows that satisfy multiple conditions.          filter(col1 > 10, col2 == "A")
With logical operators     Combine conditions using AND (&) or OR (|).            filter(col1 > 10 & col2 == "A")
                                                                                  filter(col1 > 10 | col2 == "A")
By range                   Filter rows within a range of values using between().  filter(between(col1, 10, 20))
By missing values          Filter rows with or without missing values.            filter(is.na(col1))
                                                                                  filter(!is.na(col1))

Filtering based on exact character variable matches

  • Note the use of the double equal sign == rather than the single equal sign =.
Code
yrb_data |> select(sex, grade ) |>  filter(grade=="9th")|>  head(3)
# A tibble: 3 × 2
  sex   grade
  <chr> <chr>
1 Male  9th  
2 Male  9th  
3 Male  9th  
Code
yrb_data |> select(sex, grade)|> filter(sex == "Male") |> head(3)
# A tibble: 3 × 2
  sex   grade
  <chr> <chr>
1 Male  11th 
2 Male  10th 
3 Male  9th  
  • Similarly you can use the other operators:
    • filter(grade != "9th") will select everything except the grade 9 rows.
  • If you want to select more than one category value you can use the %in% operator.
Code
yrb_data |> 
  select(sex, age, grade ) |> 
  filter(grade %in% c("9th", "11th")) |> head(3)
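  • Negating %in% deselects groups instead: filter(!grade %in% c("9th", "11th")) keeps every row whose grade is neither 9th nor 11th. There is no !%in% operator; the ! goes in front of the whole condition.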
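A minimal sketch of a negated %in% filter, using a made-up tibble (toy_data is hypothetical). It also shows what happens to a missing grade: NA %in% c(...) is FALSE, so the negated filter keeps that row.

```r
library(dplyr)

toy_data <- tibble(
  sex   = c("Male", "Female", "Male", "Female"),
  grade = c("9th", "10th", "11th", NA)
)

# Keep every row whose grade is NOT 9th or 11th.
# The ! applies to the whole `grade %in% ...` condition.
not_9_or_11 <- toy_data |> filter(!grade %in% c("9th", "11th"))

# The row with a missing grade is kept as well.
not_9_or_11$grade
```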

  • To select all individuals with a bmi between 22 and 30 (inclusive), use either of the following:

Code
yrb_data |> 
  select(sex, age, bmi) |> 
  filter(between(bmi, 22, 30)) 
Code
yrb_data |> 
  select(sex, age, bmi) |> 
  filter(bmi >= 22, bmi <= 30)
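The two forms can be compared on a handful of made-up values placed around the boundaries. This is a sketch with a hypothetical toy_data; it shows that between() includes both endpoints and that rows with a missing bmi are dropped by either filter.

```r
library(dplyr)

# Values just below, on, and just above the limits, plus one NA.
toy_data <- tibble(bmi = c(21.9, 22, 25, 30, 30.1, NA))

with_between <- toy_data |> filter(between(bmi, 22, 30))
with_bounds  <- toy_data |> filter(bmi >= 22, bmi <= 30)

# Both keep the same rows.
with_between$bmi
with_bounds$bmi
```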

Filtering based on multiple conditions

  • filter() also allows AND and OR style filters:

  • filter(condition1, condition2) will return rows where both conditions are met.

  • filter(condition1 & condition2) will also return rows where both conditions are met.

  • filter(condition1, !condition2) will return all rows where condition 1 is true but condition 2 is not.

  • filter(condition1 | condition2) will return rows where condition 1 and/or condition 2 is met.

Code
yrb_data |> select(sex, age, bmi, stweight, grade) |> 
  filter(bmi > 20, (stweight > 50 | grade != "12th")) |> 
  head(3)
# A tibble: 3 × 5
  sex    age            bmi stweight grade
  <chr>  <chr>        <dbl>    <dbl> <chr>
1 Female 17 years old  20.2     57.2 12th 
2 Male   15 years old  28.0     85.7 10th 
3 Male   14 years old  24.5     66.7 9th  
  • The following code selects the bmi and stweight columns from yrb_data and removes rows with missing bmi values:
Code
yrb_data |>  
  select(bmi, stweight) |> 
  filter(!is.na(bmi)) |> 
  head(4)
# A tibble: 4 × 2
    bmi stweight
  <dbl>    <dbl>
1  17.2     54.4
2  20.2     57.2
3  28.0     85.7
4  24.5     66.7

Adding or Modifying columns using mutate()

  • Another common task is creating a new column based on values in existing columns.

  • mutate() adds new variables to a data frame while retaining the existing ones.

  • Example: add a new column called height_m. Since BMI = weight / height^2, height can be recovered as sqrt(weight / BMI); assuming stweight is in kilograms, the result is in metres.

Code
yrb_data %>% 
  mutate(height_m = sqrt(stweight / bmi)) |>    # use = (not <- or ==) to define new variable
  head(3)
# A tibble: 3 × 9
  record age                   sex    grade race4  race7   bmi stweight height_m
   <dbl> <chr>                 <chr>  <chr> <chr>  <chr> <dbl>    <dbl>    <dbl>
1 931897 15 years old          Female 10th  White  White  17.2     54.4     1.78
2 333862 17 years old          Female 12th  White  White  20.2     57.2     1.68
3  36253 18 years old or older Male   11th  Hispa… Hisp…  NA       NA      NA   
  • We can use the relocate() function to put it before our bmi column:
Code
yrb_data %>% 
  mutate(height_m = sqrt(stweight / bmi)) |> 
  relocate(height_m, .before = bmi) |> 
  head(3)
# A tibble: 3 × 9
  record age                   sex    grade race4  race7 height_m   bmi stweight
   <dbl> <chr>                 <chr>  <chr> <chr>  <chr>    <dbl> <dbl>    <dbl>
1 931897 15 years old          Female 10th  White  White     1.78  17.2     54.4
2 333862 17 years old          Female 12th  White  White     1.68  20.2     57.2
3  36253 18 years old or older Male   11th  Hispa… Hisp…    NA     NA       NA  
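The height_m formula can be sanity-checked by inverting it. The sketch below uses the first two bmi and stweight values shown in the glimpse() output earlier in this document, and assumes stweight is in kilograms; toy_data and bmi_check are names introduced here for illustration.

```r
library(dplyr)

# First two bmi / stweight pairs as printed by glimpse() earlier.
toy_data <- tibble(
  bmi      = c(17.1790, 20.2487),
  stweight = c(54.43, 57.15)
)

result <- toy_data |>
  mutate(
    height_m  = sqrt(stweight / bmi),   # invert BMI = weight / height^2
    bmi_check = stweight / height_m^2   # recomputed BMI, should equal bmi
  )

result
```

A single mutate() call can define several columns, and later ones (bmi_check) may use earlier ones (height_m).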

Sort rows with arrange()

arrange() re-orders rows by one or more columns, by default in ascending order.

Use desc() for descending order.

arrange(data, variable1, desc(variable2), ...)

Example: Arrange by BMI in descending order

Code
# Example: Arrange by BMI in descending order
yrb_data %>%
  arrange(desc(bmi)) 
# A tibble: 20,000 × 8
    record age                   sex    grade race4         race7   bmi stweight
     <dbl> <chr>                 <chr>  <chr> <chr>         <chr> <dbl>    <dbl>
 1  324452 16 years old          Male   11th  Black or Afr… Blac…  53.9     91.2
 2 1310082 18 years old or older Male   11th  Black or Afr… Blac…  53.5    160. 
 3  328160 18 years old or older Male   <NA>  Black or Afr… Blac…  53.4    128. 
 4 1315913 17 years old          Female 12th  Black or Afr… Blac…  53.3    142. 
 5 1094597 13 years old          Male   9th   All other ra… Asian  52.9    181. 
 6 1305503 15 years old          Male   9th   All other ra… Am I…  52.4    134. 
 7  770391 16 years old          Female 11th  All other ra… Mult…  52.4    161. 
 8  634138 17 years old          Male   12th  All other ra… Nati…  52.3    160. 
 9 1312697 15 years old          Female 10th  Black or Afr… Blac…  52.3     95.3
10 1099468 17 years old          Male   9th   Black or Afr… Blac…  52.0    174. 
# ℹ 19,990 more rows
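As a small sketch of two details not visible in the output above, the example below sorts a made-up tibble (toy_data is hypothetical) by more than one column and includes a missing value. arrange() places missing values at the end, also when desc() is used.

```r
library(dplyr)

toy_data <- tibble(
  sex = c("Male", "Female", "Male", "Female"),
  bmi = c(24.5, NA, 28.0, 17.2)
)

# Descending BMI; the NA row still comes last.
by_bmi_desc <- toy_data |> arrange(desc(bmi))
by_bmi_desc$bmi

# Sort by sex first, then by descending BMI within each sex.
by_sex_then_bmi <- toy_data |> arrange(sex, desc(bmi))
by_sex_then_bmi
```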

group_by() and summarise()

  • The dplyr verbs become especially powerful when they are combined using the pipe operator |>.

  • The following dplyr functions allow us to split our data frame into groups on which we can perform operations individually:

  • group_by(): groups a data frame by one or more variables for downstream operations (usually summarise())

  • summarise(): summarises values in a data frame, or in groups within the data frame, with aggregation functions (e.g. min(), max(), mean())

dplyr - Split-Apply-Combine

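The group_by() function is key to the Split-Apply-Combine strategy: split the data into groups, apply a function to each group, and combine the results into one table.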
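To make the three steps concrete, here is a sketch that performs them by hand in base R and then with group_by() and summarise(). It uses the built-in airquality data set; the object names are introduced here for illustration.

```r
library(dplyr)

# Split: one vector of Temp values per Month.
temp_by_month <- split(airquality$Temp, airquality$Month)

# Apply: compute the mean within each piece.
# Combine: sapply() collects the results into one named vector.
manual_means <- sapply(temp_by_month, mean)

# The same three steps expressed with group_by() and summarise().
dplyr_means <- airquality |>
  group_by(Month) |>
  summarise(mean_temp = mean(Temp))

manual_means
dplyr_means
```

The dplyr version returns a tibble with one row per group, which is easier to pass on to further verbs than a named vector.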

The summarize() function

  • summarize() (also spelled summarise()) reduces a data frame to a single row of summary values, or to one row per group.
  • For grouped summaries, first group the observations by one or more categorical variables using group_by().
  • summarize() then computes each summary on the grouped or ungrouped data.

dplyr::summarize() Function

  • To calculate the mean bmi in base R:
Code
mean(yrb_data$bmi, na.rm = TRUE)
[1] 23.49541
  • With summarize(), the pattern is summarize(new_column = summary_function(column)):
Code
yrb_data %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi))
# A tibble: 1 × 1
  mean_bmi
     <dbl>
1     23.5

Multiple Summary Statistics

You can calculate multiple statistics in one summarize():

Code
yrb_data %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi), 
            median_bmi = median(bmi))
# A tibble: 1 × 2
  mean_bmi median_bmi
     <dbl>      <dbl>
1     23.5       22.3

Grouped summaries with dplyr::group_by()

group_by() groups data by one or more variables.

  • Example 1: Mean weight by Sex
Code
yrb_data %>% 
  filter(!is.na(stweight)) |> 
  group_by(sex) |> 
  summarize(mean_weight = mean(stweight))
  • Example 2: Minimum, maximum and mean weights

Calculate the min, max and mean weights for each sex. The function n() counts the number of rows in each group:

Code
yrb_data %>% 
  filter(!is.na(stweight)) %>%
  group_by(sex) %>%  
  summarize(max_weight = max(stweight), 
            min_weight = min(stweight),
            mean_weight = mean(stweight),
            n = n())
# A tibble: 2 × 5
  sex    max_weight min_weight mean_weight     n
  <chr>       <dbl>      <dbl>       <dbl> <int>
1 Female       181.       27.7        61.7  6542
2 Male         181.       35.4        73.1  6901

Why summarize() Matters

  • The combination of group_by() and summarize() allows highly informative grouped summaries of datasets with minimal code.
    • Producing such summaries is an essential data analysis skill.

Grouping by Multiple Variables (Nested Grouping)

  • To group by more than one variable, list both in group_by():
Code
yrb_data %>% filter(!is.na(bmi)) %>% group_by(sex, grade) %>%    
  summarize(mean_bmi = mean(bmi)) %>% head(4)
# A tibble: 4 × 3
# Groups:   sex [1]
  sex    grade mean_bmi
  <chr>  <chr>    <dbl>
1 Female 10th      23.0
2 Female 11th      23.4
3 Female 12th      23.9
4 Female 9th       22.8
  • You can swap the column order in group_by() to change how the groups are nested, and use arrange() to sort the summary rows:
Code
yrb_data |>  filter(!is.na(bmi))  |>  group_by(grade, sex) |>      
  summarize(mean_bmi = mean(bmi))  |>  arrange(mean_bmi) |> head()
# A tibble: 6 × 3
# Groups:   grade [4]
  grade sex    mean_bmi
  <chr> <chr>     <dbl>
1 9th   Male       22.8
2 9th   Female     22.8
3 10th  Female     23.0
4 <NA>  Female     23.1
5 11th  Female     23.4
6 10th  Male       23.5

Ungrouping Data

  • After group_by() and summarize(), the resulting data frame may still be grouped, because summarize() removes only the last grouping variable by default.
  • To avoid unintended behaviors, use ungroup():
Code
yrb_data %>% filter(!is.na(bmi)) |> group_by(sex, grade) %>%   
  summarize(mean_bmi = mean(bmi)) %>% ungroup() %>% head()
# A tibble: 6 × 3
  sex    grade mean_bmi
  <chr>  <chr>    <dbl>
1 Female 10th      23.0
2 Female 11th      23.4
3 Female 12th      23.9
4 Female 9th       22.8
5 Female <NA>      23.1
6 Male   10th      23.5

Why is ungroup() Needed?

  • Grouped data frames behave differently with other dplyr functions like select(), filter(), or mutate(). For example, select() always keeps the grouping variables:
Code
# Unexpected behavior when grouped 
yrb_data %>% group_by(sex, grade) %>% filter(!is.na(bmi)) %>%    
  summarize(mean_bmi = mean(bmi)) %>% select(mean_bmi) %>% head(4)
# A tibble: 4 × 2
# Groups:   sex [1]
  sex    mean_bmi
  <chr>     <dbl>
1 Female     23.0
2 Female     23.4
3 Female     23.9
4 Female     22.8
  • By ungrouping, we get the expected output:
Code
yrb_data %>% group_by(sex, grade) %>% filter(!is.na(bmi)) %>% 
  summarize(mean_bmi = mean(bmi)) %>%   
  ungroup() %>% select(mean_bmi) %>% head(3)
# A tibble: 3 × 1
  mean_bmi
     <dbl>
1     23.0
2     23.4
3     23.9
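An alternative to a separate ungroup() step is the .groups argument of summarise(). The sketch below uses a made-up tibble (toy_data is hypothetical) and group_vars() to inspect which grouping remains after each call.

```r
library(dplyr)

toy_data <- tibble(
  sex   = c("Female", "Female", "Male", "Male"),
  grade = c("9th", "10th", "9th", "10th"),
  bmi   = c(22.8, 23.0, 22.8, 23.5)
)

# By default summarise() drops only the last grouping level,
# so this result is still grouped by sex.
still_grouped <- toy_data |>
  group_by(sex, grade) |>
  summarise(mean_bmi = mean(bmi))
group_vars(still_grouped)

# .groups = "drop" returns an ungrouped tibble in one step.
ungrouped <- toy_data |>
  group_by(sex, grade) |>
  summarise(mean_bmi = mean(bmi), .groups = "drop")
group_vars(ungrouped)

# With no groups left, select() returns only the requested column.
ungrouped |> select(mean_bmi) |> names()
```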

Counting Rows

  • Use n() inside summarize() to count rows:
Code
yrb_data %>%    
  group_by(sex) %>%    
  summarize(count = n())
# A tibble: 3 × 2
  sex    count
  <chr>  <int>
1 Female  9592
2 Male   10177
3 <NA>     231
  • You can combine counts with other summary statistics:
Code
yrb_data %>%    
  group_by(sex) %>%    
  summarize(count = n(), 
            mean_bmi = mean(bmi, na.rm=T))
# A tibble: 3 × 3
  sex    count mean_bmi
  <chr>  <int>    <dbl>
1 Female  9592     23.3
2 Male   10177     23.7
3 <NA>     231    NaN  

Counting Rows with Conditions

  • To count rows that meet specific conditions, wrap the condition in sum():
Code
yrb_data %>% group_by(race7) %>% filter(!is.na(bmi)) %>%  
  summarize(count_above50 = sum(bmi > 50))
# A tibble: 8 × 2
  race7                     count_above50
  <chr>                             <int>
1 Am Indian / Alaska Native             1
2 Asian                                 1
3 Black or African American             7
4 Hispanic/Latino                       2
5 Multiple - Non-Hispanic               3
6 Native Hawaiian/other PI              1
7 White                                 2
8 <NA>                                  0
  • Logical values are treated as numbers when summed: TRUE counts as 1 and FALSE as 0, so sum() returns the number of rows where the condition is TRUE.
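The same TRUE/FALSE arithmetic also gives proportions: the mean of a logical vector is the share of TRUE values. A minimal sketch on made-up values (toy_data and the column names are hypothetical):

```r
library(dplyr)

toy_data <- tibble(bmi = c(17.2, 28.0, 51.3, 24.5))

toy_summary <- toy_data |>
  summarise(
    n_above_25    = sum(bmi > 25),   # number of TRUE values
    prop_above_25 = mean(bmi > 25)   # share of TRUE values
  )

toy_summary
```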

Counting Missing Values

To count NAs:

Code
yrb_data %>% group_by(sex) %>% 
  summarize(unknown_bmi = sum(is.na(bmi)))
# A tibble: 3 × 2
  sex    unknown_bmi
  <chr>        <int>
1 Female        2970
2 Male          3257
3 <NA>           231

To count known (non-missing) values:

Code
yrb_data %>%    group_by(sex) %>%    
  summarize(known_bmi = sum(!is.na(bmi)))
# A tibble: 3 × 2
  sex    known_bmi
  <chr>      <int>
1 Female      6622
2 Male        6920
3 <NA>           0

Using dplyr::count()

  • count() simplifies counting observations by group:
Code
yrb_data %>% count(race4)
# A tibble: 5 × 2
  race4                         n
  <chr>                     <int>
1 All other races            4713
2 Black or African American  4093
3 Hispanic/Latino            4670
4 White                      5814
5 <NA>                        710
  • This is equivalent to:
Code
yrb_data %>% group_by(race4) %>%    
  summarize(n = n())
# A tibble: 5 × 2
  race4                         n
  <chr>                     <int>
1 All other races            4713
2 Black or African American  4093
3 Hispanic/Latino            4670
4 White                      5814
5 <NA>                        710
  • You can count by multiple variables:
Code
yrb_data %>% count(sex, grade)
# A tibble: 15 × 3
   sex    grade     n
   <chr>  <chr> <int>
 1 Female 10th   2332
 2 Female 11th   2365
 3 Female 12th   2277
 4 Female 9th    2492
 5 Female <NA>    126
 6 Male   10th   2539
 7 Male   11th   2496
 8 Male   12th   2263
 9 Male   9th    2684
10 Male   <NA>    195
11 <NA>   10th     36
12 <NA>   11th     30
13 <NA>   12th     37
14 <NA>   9th      43
15 <NA>   <NA>     85
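count() has a few arguments that save further steps. The sketch below shows two of them, sort and name, on a made-up tibble; toy_data and the column name n_students are hypothetical.

```r
library(dplyr)

toy_data <- tibble(
  race4 = c("White", "White", "Hispanic/Latino", "White", NA, "Hispanic/Latino")
)

# sort = TRUE orders the groups from most to least frequent.
sorted_counts <- toy_data |> count(race4, sort = TRUE)
sorted_counts

# name = sets the name of the count column (the default is n).
named_counts <- toy_data |> count(race4, name = "n_students")
names(named_counts)
```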

Summarize ungrouped data

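  • summarize() also has scoped variants that apply the same action to several columns at once, on ungrouped or grouped data:
    • summarize_all()
    • summarize_at()
  • In current dplyr these variants are superseded by across(), but they still work.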
1. summarize_all()

  • This function summarizes all columns of the data using the given action: summarize_all(action)
  • Example: airquality |> summarize_all(mean, na.rm = TRUE) shows the mean of every column, ignoring missing values.
Code
# Calculating the mean value of every column.
airquality |> summarize_all(mean, na.rm = TRUE)
     Ozone  Solar.R     Wind     Temp    Month      Day
1 42.12931 185.9315 9.957516 77.88235 6.993464 15.80392

2. summarize_at()

  • It performs the action on specific columns and generates the summary based on that action.

  • summarize_at(vector_of_columns, action)

  • vector_of_columns: a character vector of column names.

Code
airquality |> group_by(Month) |>
  summarize_at(c("Wind", "Temp"), mean)
# A tibble: 5 × 3
  Month  Wind  Temp
  <int> <dbl> <dbl>
1     5 11.6   65.5
2     6 10.3   79.1
3     7  8.94  83.9
4     8  8.79  84.0
5     9 10.2   76.9
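For reference, here is a sketch of how the two calls above might be written with across(), the current dplyr way of applying a function to several columns. It uses the same built-in airquality data and assumes R 4.1 or later for the \(x) shorthand; the object names are introduced here for illustration.

```r
library(dplyr)

# summarize_all(mean, na.rm = TRUE) rewritten with across():
# everything() picks all columns, the lambda passes na.rm along.
all_old <- airquality |> summarize_all(mean, na.rm = TRUE)
all_new <- airquality |>
  summarize(across(everything(), \(x) mean(x, na.rm = TRUE)))

# summarize_at(c("Wind", "Temp"), mean) rewritten with across():
# columns are given unquoted, as in select().
at_old <- airquality |>
  group_by(Month) |>
  summarize_at(c("Wind", "Temp"), mean)
at_new <- airquality |>
  group_by(Month) |>
  summarize(across(c(Wind, Temp), mean))

all_new
at_new
```

Because across() accepts the same selection helpers as select(), a call such as across(where(is.numeric), mean) also covers the selection-by-type case.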