Variable recoding

Leykun Getaneh (MSc)

NDMC, EPHI


July 21 - 25, 2025

Recoding Variables

  • Use recode() inside a mutate() call to replace specific values in a column.

Example of Recoding

Code
library(tibble)
data_diet <- tibble(diet = rep(c("A", "B", "B"), times = 4), 
                    gender = c("Male","m","Other","F","Female","M",
                               "f","O","Man","f","F","O"), 
                    weight_start = sample(100:250, size = 12),
                    weight_change = sample(-10:20, size = 12))
head(data_diet)
# A tibble: 6 × 4
  diet  gender weight_start weight_change
  <chr> <chr>         <int>         <int>
1 A     Male            236             4
2 B     m               138            -1
3 B     Other           209             8
4 A     F               124            19
5 B     Female          231            14
6 B     M               113             3
  • Say we have some data from a diet study in which gender was entered inconsistently, so it needs a lot of recoding.
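
Because sample() draws random numbers, the weights above change on every run. A minimal sketch of a reproducible version follows; the seed value 2025 and the name data_diet_seeded are arbitrary assumptions, and the numbers it produces will not match the output printed on these slides.

```r
library(tibble)

# Fix the random-number seed so sample() returns the same draws on every run.
# The seed value is arbitrary (an assumption, not taken from the slides).
set.seed(2025)
data_diet_seeded <- tibble(
  diet = rep(c("A", "B", "B"), times = 4),
  gender = c("Male", "m", "Other", "F", "Female", "M",
             "f", "O", "Man", "f", "F", "O"),
  weight_start = sample(100:250, size = 12),
  weight_change = sample(-10:20, size = 12)
)
head(data_diet_seeded)
```

Re-running the whole block, including set.seed(), should give the same table each time.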

Example Cont.

Code
library(dplyr)
data_diet |>
  count(gender)
# A tibble: 9 × 2
  gender     n
  <chr>  <int>
1 F          2
2 Female     1
3 M          1
4 Male       1
5 Man        1
6 O          2
7 Other      1
8 f          2
9 m          1

dplyr can help!

Using Excel to find all of the different ways gender has been coded could be hectic!

In dplyr you can use the recode() function (you need mutate() here too!):

Code
# General format - a template, not runnable code!
data_input |>
  mutate(variable_to_fix = recode(variable_to_fix, old_value = "new_value",
                                  another_old_value = "new_value"))
Code
data_diet |> 
  mutate(gender = recode(gender, M = "Male", m = "Male", Man = "Male",
                         O = "Other", f = "Female", F = "Female")) |>
  count(gender, diet)
# A tibble: 5 × 3
  gender diet      n
  <chr>  <chr> <int>
1 Female A         3
2 Female B         2
3 Male   A         1
4 Male   B         3
5 Other  B         3
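
When the list of old and new values gets long, it can be easier to keep it in a named vector and splice it into recode() with !!!. The sketch below assumes a hypothetical lookup vector called gender_key and rebuilds only the gender column from the example data.

```r
library(dplyr)
library(tibble)

gender_raw <- tibble(gender = c("Male", "m", "Other", "F", "Female", "M",
                                "f", "O", "Man", "f", "F", "O"))

# Hypothetical lookup vector: names are the old codes, values the clean labels.
gender_key <- c(M = "Male", m = "Male", Man = "Male",
                O = "Other", f = "Female", F = "Female")

gender_counts <- gender_raw |>
  mutate(gender = recode(gender, !!!gender_key)) |>  # !!! splices the named vector
  count(gender)
gender_counts
```

Values that do not appear in the lookup vector (here "Male", "Female" and "Other") are left unchanged.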

Or you can use case_when()

The case_when() function from dplyr can do this as well.

  • Note that any value not matched by a condition in case_when() becomes NA unless a fallback is specified.
Code
data_diet |> 
  mutate(gender = case_when(gender == "M" ~ "Male"))
# A tibble: 12 × 4
   diet  gender weight_start weight_change
   <chr> <chr>         <int>         <int>
 1 A     <NA>            236             4
 2 B     <NA>            138            -1
 3 B     <NA>            209             8
 4 A     <NA>            124            19
 5 B     <NA>            231            14
 6 B     Male            113             3
 7 A     <NA>            241             7
 8 B     <NA>            123            -7
 9 B     <NA>            145            -8
10 A     <NA>            148             9
11 B     <NA>            176            -3
12 B     <NA>            126            11
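
One way to see which original codes were left unassigned is to write the result to a new column and count the rows where it is NA. This is a sketch on the gender column only; the column name gender_clean is an assumption made for illustration.

```r
library(dplyr)
library(tibble)

gender_raw <- tibble(gender = c("Male", "m", "Other", "F", "Female", "M",
                                "f", "O", "Man", "f", "F", "O"))

unmatched <- gender_raw |>
  mutate(gender_clean = case_when(gender == "M" ~ "Male")) |>
  filter(is.na(gender_clean)) |>  # rows where no condition matched
  count(gender)                   # which original codes were missed
unmatched
```

Every code listed in the result still needs its own condition or a fallback.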

Use of case_when() without automatic NA

  • Here the final condition TRUE ~ gender acts as a catch-all: every value that does not meet the condition gender == "M" keeps its original value.
Code
data_diet |> 
  mutate(gender = case_when(gender == "M" ~ "Male", TRUE ~ gender))
# A tibble: 12 × 4
   diet  gender weight_start weight_change
   <chr> <chr>         <int>         <int>
 1 A     Male            236             4
 2 B     m               138            -1
 3 B     Other           209             8
 4 A     F               124            19
 5 B     Female          231            14
 6 B     Male            113             3
 7 A     f               241             7
 8 B     O               123            -7
 9 B     Man             145            -8
10 A     f               148             9
11 B     F               176            -3
12 B     O               126            11
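
The same fallback can also be written with the .default argument of case_when(). The sketch below assumes dplyr 1.1.0 or later, where .default is available, and compares both spellings on a shortened gender column.

```r
library(dplyr)
library(tibble)

gender_raw <- tibble(gender = c("Male", "m", "Other", "F", "Female", "M"))

recoded <- gender_raw |>
  mutate(
    # Catch-all written as a final TRUE condition, as on the slide.
    with_true = case_when(gender == "M" ~ "Male", TRUE ~ gender),
    # Same fallback through .default (assumes dplyr >= 1.1.0).
    with_default = case_when(gender == "M" ~ "Male", .default = gender)
  )
recoded
```

Both new columns should be identical; .default simply states the intent more directly.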

More complicated case_when()

Code
data_diet |> 
  mutate(gender = case_when(
    gender %in% c("M", "male", "Man", "m", "Male") ~ "Male",
    gender %in% c("F", "Female", "f", "female") ~ "Female",
    gender %in% c("O", "Other") ~ "Other")) |> head()
# A tibble: 6 × 4
  diet  gender weight_start weight_change
  <chr> <chr>         <int>         <int>
1 A     Male            236             4
2 B     Male            138            -1
3 B     Other           209             8
4 A     Female          124            19
5 B     Female          231            14
6 B     Male            113             3
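
When every condition is a test of the same column against a set of values, case_match() offers a more compact spelling. This is a sketch assuming dplyr 1.1.0 or later, where case_match() was added as the recommended replacement for recode(); the name gender_matched is illustrative.

```r
library(dplyr)
library(tibble)

gender_raw <- tibble(gender = c("Male", "m", "Other", "F", "Female", "M",
                                "f", "O", "Man", "f", "F", "O"))

gender_matched <- gender_raw |>
  mutate(gender = case_match(
    gender,  # the column is named once, then each set of old values follows
    c("M", "male", "Man", "m", "Male") ~ "Male",
    c("F", "Female", "f", "female") ~ "Female",
    c("O", "Other") ~ "Other"
  ))
count(gender_matched, gender)
```

As with case_when(), values that match none of the sets become NA unless .default is supplied.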

Another reason for case_when()

case_when() can evaluate arbitrary logical conditions, including numeric comparisons, and checks them in order.

Code
data_diet1 <- data_diet |> 
  mutate(effect = case_when(weight_change > 0 ~ "increase",
                            weight_change == 0 ~ "same",
                            weight_change < 0 ~ "decrease"))
head(data_diet1)
# A tibble: 6 × 5
  diet  gender weight_start weight_change effect  
  <chr> <chr>         <int>         <int> <chr>   
1 A     Male            236             4 increase
2 B     m               138            -1 decrease
3 B     Other           209             8 increase
4 A     F               124            19 increase
5 B     Female          231            14 increase
6 B     M               113             3 increase
Code
data_diet1 |> 
  count(diet, effect)
# A tibble: 3 × 3
  diet  effect       n
  <chr> <chr>    <int>
1 A     increase     4
2 B     decrease     4
3 B     increase     4
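
Conditions can also combine several columns, and the first condition that is TRUE wins. The sketch below uses a small hand-made table (the values and the label "large gain on A" are hypothetical, not the slide data) so that the result is deterministic.

```r
library(dplyr)
library(tibble)

# Small hand-made example (hypothetical values, not the slide data).
toy <- tibble(
  diet = c("A", "A", "B", "B", "B"),
  weight_change = c(12, 0, -4, 3, -9)
)

toy_effect <- toy |>
  mutate(effect = case_when(
    # Most specific rule first: it would be shadowed if placed after "increase".
    diet == "A" & weight_change >= 10 ~ "large gain on A",
    weight_change > 0 ~ "increase",
    weight_change == 0 ~ "same",
    weight_change < 0 ~ "decrease"
  ))
toy_effect
```

Because conditions are checked from top to bottom, the order of the rules is part of the logic.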

Creating new discrete column with two levels

  • The ifelse() function can turn a numeric column into a discrete one with two levels. Note that with the test below, a weight_change of exactly 0 is labelled "decreased".
Code
data_diet |>
  mutate(temp_cat = ifelse(weight_change > 0, "increased", "decreased")) |>
  head()
# A tibble: 6 × 5
  diet  gender weight_start weight_change temp_cat 
  <chr> <chr>         <int>         <int> <chr>    
1 A     Male            236             4 increased
2 B     m               138            -1 decreased
3 B     Other           209             8 increased
4 A     F               124            19 increased
5 B     Female          231            14 increased
6 B     M               113             3 increased
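
A short sketch of how the two-level split treats zero and missing values, and of dplyr's stricter if_else() as an alternative. The input vector and the labels "not increased" and "unknown" are illustrative assumptions.

```r
library(dplyr)

weight_change <- c(4L, 0L, -1L, NA)

# Base R ifelse(): a change of exactly 0 fails the test > 0,
# so it is labelled "decreased" along with the true losses; NA stays NA.
two_level <- ifelse(weight_change > 0, "increased", "decreased")

# dplyr::if_else() requires both outcomes to have compatible types
# and can give missing values their own label.
two_level_strict <- if_else(weight_change > 0, "increased", "not increased",
                            missing = "unknown")
two_level
two_level_strict
```

If zero needs its own category, the three-way case_when() shown earlier is the better fit.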

case_when() improved with stringr

Listing every spelling by hand with %in% works, but the list must grow with each new variant:

Code
data_diet |> 
  mutate(gender = case_when(
    gender %in% c("M", "male", "Man", "m", "Male") ~ "Male",
    gender %in% c("F", "Female", "f", "female") ~ "Female",
    gender %in% c("O", "Other") ~ "Other")) |> count(gender)
# A tibble: 3 × 2
  gender     n
  <chr>  <int>
1 Female     5
2 Male       4
3 Other      3

case_when() improved with stringr

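  • ^ indicates the beginning of a character string, so "^m|^M" matches any value that starts with m or M

  • $ indicates the end of a character string (not needed in this example)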

Code
library(stringr)
data_diet |> 
  mutate(gender = case_when(
    str_detect(string = gender, pattern = "^m|^M") ~ "Male",
    str_detect(string = gender, pattern = "^f|^F") ~ "Female",
    str_detect(string = gender, pattern = "^o|^O") ~ "Other")) |>
  count(gender)
# A tibble: 3 × 2
  gender     n
  <chr>  <int>
1 Female     5
2 Male       4
3 Other      3
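
A pattern such as "^m|^M" accepts anything that starts with m, including an unrelated entry like "missing". As a sketch of a stricter variant, the patterns below use both ^ and $ so the whole value must match, and regex(ignore_case = TRUE) from stringr to cover upper and lower case. The names gender_regex, loose and strict are illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)

gender_raw <- tibble(gender = c("Male", "m", "Other", "F", "Female", "M",
                                "f", "O", "Man", "f", "F", "O"))

gender_regex <- gender_raw |>
  mutate(gender = case_when(
    # ^ and $ together force the whole string to equal one of the alternatives.
    str_detect(gender, regex("^(m|male|man)$", ignore_case = TRUE)) ~ "Male",
    str_detect(gender, regex("^(f|female)$", ignore_case = TRUE)) ~ "Female",
    str_detect(gender, regex("^(o|other)$", ignore_case = TRUE)) ~ "Other"
  )) |>
  count(gender)
gender_regex

# The start-only pattern also matches an unrelated word; the anchored one does not.
loose <- str_detect("missing", "^m|^M")
strict <- str_detect("missing", regex("^(m|male|man)$", ignore_case = TRUE))
```

The stricter patterns trade convenience for safety: a new spelling such as "Fem" would now become NA instead of being silently assigned.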