Merging data
NDMC, EPHI
July 21 - 25, 2025
left_join()
right_join()
inner_join()
full_join()
The 2 binding join verbs:
bind_rows()
bind_cols()
semi_join()
anti_join()
intersect()
union()
and setdiff()
All the joins have this basic syntax: *_join(x, y, by = NULL, suffix = c(".x", ".y")
x =
the first (left) table
y =
the second (right) table
by =
what columns to match on. If you leave this blank, it will match on all columns with the same names in the two tables.
suffix =
if columns have the same name in the two tables, but you aren’t joining by them, they get a suffix to make them unambiguous.
This defaults to “.x” and “.y”, but you can change it to something more meaningful.
What it does:
Retains all rows from the left (first) table
Adds matching columns from the right (second) table
Fills NA where no match exists
left_join()
# A tibble: 5 × 7
patient_id age bmi smoking_status visit_date sbp dbp
<chr> <dbl> <dbl> <chr> <date> <dbl> <dbl>
1 P001 35 22.1 former NA NA NA
2 P002 28 26.5 never 2023-01-15 120 80
3 P003 42 29.8 current 2023-02-01 135 85
4 P003 42 29.8 current 2023-03-01 140 90
5 P004 31 24.3 never NA NA NA
Merging patient registries with lab results
Preserving all patients from primary clinic records
clinic_data
and lab_data
tables is different.# A tibble: 4 × 7
patient_id visit_date sbp dbp age bmi smoking_status
<chr> <date> <dbl> <dbl> <dbl> <dbl> <chr>
1 P002 2023-01-15 120 80 28 26.5 never
2 P003 2023-02-01 135 85 42 29.8 current
3 P003 2023-03-01 140 90 42 29.8 current
4 P005 2023-01-20 128 82 NA NA <NA>
right_join
keeps all the data from the second (right) table and joins anything that matches from the first (left) table.What it does:
inner_join
returns all the rows that have a match in the other table.What it does:
full_join
lets you join up rows in two tables while keeping all of the information from both tables.NA
.# A tibble: 6 × 7
patient_id age bmi smoking_status visit_date sbp dbp
<chr> <dbl> <dbl> <chr> <date> <dbl> <dbl>
1 P001 35 22.1 former NA NA NA
2 P002 28 26.5 never 2023-01-15 120 80
3 P003 42 29.8 current 2023-02-01 135 85
4 P003 42 29.8 current 2023-03-01 140 90
5 P004 31 24.3 never NA NA NA
6 P005 NA NA <NA> 2023-01-20 128 82
bind_rows
.The columns just have to have the same names, they don’t have to be in the same order.
Any columns that differ between the two tables will just have NA
values for entries from the other table.
If a row is duplicated between the two tables, the row will also be duplicated in the resulting table.