Basic Data Analysis with R

April 28 - May 1, 2026

Yebelay & Leykun

Learning Objectives

By the end of this module you will be able to:

  1. Understanding our data: load, inspect, audit missingness
  2. Summarising variables: the gtsummary workflow
  3. Graphs & plots: distributions, comparisons, what to look for
  4. Choosing what to report: mean vs median · table vs graph
  5. Statistical tests: t-test, chi-square, regression

Part 1. Know Your Data: What is YRBSS?

Before you analyse anything, understand what you have

About the survey

  • US national survey of high-school students
  • Monitors 6 categories of health-risk behaviour: injury, tobacco, alcohol/drugs, sexual behaviour, diet, physical activity
  • Run by CDC every 2 years since 1991
  • Our subset: 20,000 students × 8 variables
  • Real-world missingness: 32% of bmi

Variables in our dataset

Variable Type Example
record ID 931897
age Categorical 17 years old
sex Categorical Female / Male
grade Categorical 9th – 12th
race4 Categorical White / Hispanic…
bmi Numeric 13.2 – 53.9
stweight Numeric Weight in kg

Note

We use this dataset because it has the real problems analysts face every day: missing values, skewed distributions, mixed variable types, and large enough n that we can run real tests.

Load packages, read the data

library(tidyverse)   # data wrangling + ggplot2
library(gtsummary)   # publication-ready tables
library(janitor)     # clean_names(), tabyl()

yrbss <- read_csv("yrbss.csv")

Why this order matters:

  • tidyverse first - gives you read_csv(), mutate(), filter(), plus pipes
  • gtsummary next - for the descriptive table you’ll build later
  • janitor last - small helpers; loaded last so its tabyl() doesn’t get masked

Tip

Reproducibility rule: keep yrbss.csv next to your .qmd. Use relative paths ("yrbss.csv"), never absolute paths like "C:/Users/Me/Desktop/yrbss.csv" — they break the moment someone else opens your project.

Inspect the structure first

glimpse(yrbss)
Rows: 20,000
Columns: 11
$ record      <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1…
$ age         <fct> 15 years old, 17 years old, 18 years old or older, 15 year…
$ sex         <fct> Female, Female, Male, Male, Male, Male, Male, Male, Male, …
$ grade       <fct> 10th, 12th, 11th, 10th, 9th, 9th, 11th, 12th, 12th, 10th, …
$ race4       <fct> White, White, Hispanic/Latino, Black or African American, …
$ race7       <chr> "White", "White", "Hispanic/Latino", "Black or African Ame…
$ bmi         <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.25…
$ stweight    <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, …
$ obese       <fct> Not obese, Not obese, NA, Not obese, Not obese, NA, Not ob…
$ bmi_cat     <fct> Underweight, Normal, NA, Overweight, Normal, NA, Normal, N…
$ bmi_missing <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALS…

Why we always do this first:

glimpse() is the analyst’s first move on any new dataset. In one line you learn:

  • How many rows and columns is this what you expected?
  • Variable names and types <dbl>, <chr>, <fct>?
  • The first few values do they look sensible?

Catching a wrongly-typed variable here (e.g. age stored as text) saves hours of debugging later.

Audit missing data - never skip this

colSums(is.na(yrbss))
     record         age         sex       grade       race4       race7 
          0         205         231         406         710        1187 
        bmi    stweight       obese     bmi_cat bmi_missing 
       6458        6557        6458        6458           0 

What this tells you:

  • bmi and stweight each have ~6,400 missing (~32%) - these are students who didn’t report height/weight
  • age, sex, grade, race4 are <2% missing - safe for descriptive tables

Variable types determine your analysis

yrbss |>
  summarise(across(everything(), class)) 
# A tibble: 1 × 11
  record  age   sex   grade race4 race7 bmi   stweight obese bmi_cat bmi_missing
  <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>    <chr> <chr>   <chr>      
1 numeric fact… fact… fact… fact… char… nume… numeric  fact… factor  logical    

The four R types you need to recognise:

R type Used for Example variable
dbl / int Continuous / count bmi, stweight
chr Text - recode to factor raw age strings
fct Categorical with levels grade after factor()
lgl TRUE / FALSE bmi_missing

Why type matters: gtsummary, ggplot2, and every statistical test treat numeric and categorical variables differently. If grade is stored as <chr> instead of <fct>, your tables sort alphabetically ("10th" before "9th") - wrong.

Descriptive Statistics: The four dimensions of any variable

Summarise every variable before you model anything

Dimension What it captures Statistic
Centre Typical value Mean, median, mode
Spread Variability SD, IQR, range
Shape Symmetry & tails Skewness, kurtosis
Frequency Category counts n, %, cumulative %

What to report by variable type

  • Continuous, normal → Mean ± SD
  • Continuous, skewed → Median (IQR)
  • Categorical → n (%)
  • Binary → n (%) of the positive class

Describe all variables before any modelling. A one-line summary per variable prevents major errors downstream.

What gtsummary does for you

  • Auto-detects variable types: continuous vs categorical
  • Continuous → median (IQR) by default · Categorical → n (%)
  • add_p() automatically picks the correct statistical test
  • add_overall() adds an all-participants column
  • Output-ready for HTML, Word, and PDF via as_gt() or as_flex_table()
  • Bold or italicise labels in one pipe step

Tip

  • gtsummary is the single most time-saving package in R for descriptive tables.
  • One pipe replaces ~30 lines of manual summarise() + pivot_longer() + formatting code.

Basic tbl_summary() code

table1 <- yrbss |>
  select(sex, grade, race4, bmi) |>
  tbl_summary() |>
  bold_labels()

table1

What this does, line by line:

  1. select(): pick the variables you want to describe
  2. tbl_summary(): auto-detect types and summarise each
  3. bold_labels(): make variable names stand out

The output is a publication-ready table. To save it for Word, pipe into as_flex_table() |> flextable::save_as_docx(path = "table1.docx").

Basic tbl_summary() output

Characteristic N = 20,0001
sex
    Female 9,592 (49%)
    Male 10,177 (51%)
    Unknown 231
grade
    9th 5,219 (27%)
    10th 4,907 (25%)
    11th 4,891 (25%)
    12th 4,577 (23%)
    Unknown 406
bmi 22.3 (20.1, 25.7)
    Unknown 6,458
1 n (%); Median (Q1, Q3)

Stratified tbl_summary() code

yrbss |>
  select(sex, grade, race4, bmi) |>
  tbl_summary(by = sex) |>     # split columns by sex
  add_p() |>                   # auto-pick the right test
  add_overall() |>             # add an "Overall" column
  bold_labels() |>
  italicize_levels()

The by = argument is the magic. It splits the table into one column per group and - combined with add_p() - automatically chooses:

  • Wilcoxon for skewed continuous (like BMI)
  • Chi-square for categorical
  • Fisher’s exact if any expected cell < 5

Stratified tbl_summary() output

Characteristic Overall
N = 19,7691
Female
N = 9,5921
Male
N = 10,1771
p-value2
grade


0.12
    9th 5,176 (27%) 2,492 (26%) 2,684 (27%)
    10th 4,871 (25%) 2,332 (25%) 2,539 (25%)
    11th 4,861 (25%) 2,365 (25%) 2,496 (25%)
    12th 4,540 (23%) 2,277 (24%) 2,263 (23%)
    Unknown 321 126 195
race4


0.8
    All other races 4,696 (25%) 2,260 (24%) 2,436 (25%)
    Black or African American 4,055 (21%) 1,984 (21%) 2,071 (21%)
    Hispanic/Latino 4,625 (24%) 2,252 (24%) 2,373 (24%)
    White 5,762 (30%) 2,826 (30%) 2,936 (30%)
    Unknown 631 270 361
bmi 22.3 (20.1, 25.7) 22.1 (20.0, 25.3) 22.5 (20.3, 25.8) <0.001
    Unknown 6,227 2,970 3,257
1 n (%); Median (Q1, Q3)
2 Pearson’s Chi-squared test; Wilcoxon rank sum test

Customising labels and statistics

yrbss |>  select(sex, bmi, stweight) |>
  tbl_summary( by = sex,
    statistic = list(
      all_continuous()  ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous() ~ 1),
    label  = list(
      bmi      ~ "Body Mass Index (kg/m²)",
      stweight ~ "Body weight (kg)"),
    missing = "ifany", missing_text = "Missing, n") |>
  add_p(test = list(all_continuous() ~ "wilcox.test")) |>
  add_overall() |>  bold_labels() |>
  modify_caption("**Table 1. Participant characteristics**")

Tip

The label argument relabels variables for publication - no need to rename your data frame. {p25} / {p75} show explicit percentiles; use {mean} / {sd} for normally-distributed variables.

How to read a gtsummary table

Table 1. Participant characteristics
Characteristic Overall
N = 19,7691
Female
N = 9,5921
Male
N = 10,1771
p-value2
Body Mass Index (kg/m²) 22.3 (20.1, 25.7) 22.1 (20.0, 25.3) 22.5 (20.3, 25.8) <0.001
    Missing, n 6,227 2,970 3,257
Body weight (kg) 64.0 (55.8, 74.8) 59.0 (52.2, 68.0) 69.4 (61.2, 81.2) <0.001
    Missing, n 6,326 3,050 3,276
1 Median (Q1, Q3)
2 Wilcoxon rank sum test

Note

  • IQR = interquartile range (25th-75th percentile) - the middle 50 % of the data
  • Medians appear because BMI and weight are right-skewed
  • The Overall column is the unstratified total - useful for the abstract

Exploratory Data Analysis

  • Why we plot every variable first?

    • Visualise before you conclude

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.”
- John Tukey, 1977

Four EDA principles

Principle What to do
📊 Look first Plot before you model
📈 Univariate first One variable, then pairs
📋 Plot AND tabulate Numbers give magnitude; graphs give shape
📝 Document decisions If you drop an outlier, write down why

EDA checklist

BMI distribution (histogram)

med_bmi <- median(yrbss$bmi, na.rm = TRUE)
ggplot(yrbss, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "#0D9488",
                 colour = "white") +
  geom_vline(xintercept = med_bmi,
             colour = "#EF4444", linewidth = 1,
             linetype = "dashed") +
  annotate("text", x = med_bmi + 1, y = 1700,
           label = paste("Median =",
                         round(med_bmi, 1)),
           colour = "#EF4444", hjust = 0) +
  labs(title = "Distribution of BMI",
       x = "BMI (kg/m²)", y = "Count")
Figure 1: BMI distribution, YRBSS

What to look for in any histogram:

  • Centre: where is the bulk of the data?
  • Spread: narrow peak or wide spread?
  • Shape: symmetric (bell), skewed (long tail), or bimodal (two peaks)?
  • Outliers: values far from the bulk that need investigation

For BMI: clear right-skew (long tail to the right) → use median, not mean.

BMI density curves by sex

ggplot(yrbss |> filter(!is.na(sex)),
       aes(x = bmi, fill = sex)) +
  geom_density(alpha = 0.45, colour = NA) +
  scale_fill_manual( values = c(Female = "#1B3A6B",
               Male   = "#F59E0B")) +
  labs(title = "BMI density by sex",
       x = "BMI (kg/m²)", y = "Density", fill = NULL) +
  theme(legend.position = "top")
Figure 2: BMI density by sex

Reading two density curves side-by-side:

A density plot is a smoothed histogram - the area under each curve sums to 1. So you can compare two groups of different sizes on the same axes.

  • Both curves are right-skewed - same shape, no surprises there
  • Male curve sits to the right - males have systematically higher BMI
  • Similar widths → similar IQRs (similar spread)

This is what a single number like “mean BMI” would hide entirely.

Box plots: comparing groups

ggplot(yrbss |> filter(!is.na(grade)),
       aes(x = grade, y = bmi, fill = grade)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#1B3A6B", "#0D9488","#F59E0B","#EF4444" )) +
  labs(title = "BMI by Grade", x = NULL, 
       y = "BMI (kg/m²)") +
  theme(legend.position = "none")
Figure 3: BMI by grade (notched box plot)

Anatomy of a box plot:

  • Box = IQR (25th-75th percentile)
  • Line in box = median
  • Whiskers = 1.5 × IQR
  • Points beyond whiskers = outlier candidates

Bar chart: categorical frequency

yrbss |>
  filter(!is.na(race4)) |>
  count(race4) |> mutate(pct = n / sum(n) * 100) |>
  ggplot(aes(y = reorder(race4, n), x = pct, fill = race4)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", pct)),
    hjust = -0.1, size = 3.2) +
  scale_x_continuous(limits = c(0, 36)) +
  scale_fill_manual(values = c("#1B3A6B","#0D9488","#F59E0B","#EF4444")) +
  labs(title = "Race / ethnicity distribution",
       x = "Percentage (%)", y = NULL)
Figure 4: Race / ethnicity, YRBSS

Three rules for good bar charts:

  1. Sort by frequency with reorder(): never alphabetical (unless ordinal like grade)
  2. Use horizontal bars when category names are long - keeps labels readable
  3. Avoid 3-D and pie charts: humans can’t compare angles or volumes accurately

Add value labels with geom_text() so readers don’t have to estimate from the axis.

Stacked bars: composition within groups

yrbss |> filter(!is.na(grade), !is.na(sex)) |>
  count(grade, sex) |>
  group_by(grade) |>
  mutate(pct = n / sum(n) * 100) |>
  ggplot(aes(x = grade, y = pct, fill = sex)) +
  geom_col(position = "fill") +
  geom_text(aes(label = sprintf("%.0f%%", pct)),
    position = position_fill(vjust = 0.5),
    colour = "white", fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c("#1B3A6B","#0D9488")) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Sex distribution by grade",
       subtitle = "Each bar sums to 100%",
       x = "Grade", y = "Proportion", fill = "Sex")
Figure 5: Sex within each grade

Stacked vs dodged: pick the right one:

  • Stacked (position = "fill") - best for proportions within a group; each bar sums to 100%.
  • Dodged (position = "dodge") - best for absolute counts between groups; bars side-by-side.

Use stacked when relative composition matters (here: sex balance per grade). Use dodged when raw counts matter.

What to Report

Mean vs median

Use MEAN when

  • Data is normally distributed (bell-shaped)
  • No extreme outliers
  • Continuous: height, blood pressure
  • Parametric tests (t-test, ANOVA) will follow
  • Symmetric Likert scales

Reporting: Mean ± SD
Example: Height: 165.3 ± 8.2 cm

Use MEDIAN when

  • Data is skewed (income, BMI, lab values)
  • Outliers are present or expected
  • Ordinal data: pain scores, satisfaction
  • Non-parametric tests will follow (Wilcoxon)
  • YRBSS BMI and weight → MEDIAN

Reporting: Median (IQR)
Example: BMI: 22.3 (19.8-25.7) kg/m²

Important

Why mean fails with skewed data: a single student with BMI = 53.9 pulls the mean upward but does not affect the median. The median better represents the typical student.

Testing for normality

p1 <- ggplot(yrbss, aes(bmi)) +
  geom_histogram(bins = 30, fill = "#0D9488",
                 colour = "white") +
  labs(title = "Histogram of BMI", 
       x = "BMI", y = "Count")

p2 <- ggplot(yrbss |> 
               filter(!is.na(bmi)), 
             aes(sample = bmi)) +
  stat_qq(alpha = 0.3, size = 0.5, 
          colour = "#1B3A6B") +
  stat_qq_line(colour = "#EF4444", 
               linewidth = 0.7) +
  labs(title = "Q–Q plot of BMI", 
       x = "Theoretical", y = "Sample")
p1 + p2  # patchwork
Figure 6: Histogram + Q–Q plot of BMI

Three ways to check normality - use them together:

  1. Histogram - does it look bell-shaped?
  2. Q–Q plot - do points follow the diagonal? Curve up at the top-right = right-skew
Check Result Decision
Histogram Nearly symmetric? → Mean ± SD
Q–Q plot Points on the line? → Mean ± SD

Table or graph: when to use which

Goal Table Graph
Exact values matter Lookup by readers Loses precision
Trend over time Hard to see Line plot
Compare many groups Rows are clear Small multiples
Distribution shape Numbers hide shape ** Histogram / violin**
Two-variable relationship Hard to parse Scatter plot
Frequency / prevalence n (%) Bar chart (≤ 5 groups)
Public presentation Audiences skip tables ** Graphs are memorable**

Note

Practical rule: use a table when readers need to look up specific numbers. Use a graph when they need to understand a pattern, trend, or shape.

Quick-reference reporting card

Variable type Example Report as Test Best chart
Continuous, normal Height, SBP Mean ± SD t-test / ANOVA Histogram, box
Continuous, skewed BMI, income Median (IQR) Wilcoxon / KW Box, violin
Ordinal Pain (1–10) Median (IQR) Wilcoxon / Spearman Bar, heatmap
Nominal / Binary Sex, race n (%) Chi-square / Fisher Bar (sorted)
Count Days absent Median (IQR) Poisson / NB Histogram
Time-to-event Survival Median (95% CI) Log-rank, Cox Kaplan–Meier

Statistical Tests - three questions

From describing the data to testing hypotheses

Before any test, ask in this order:

  1. What is the OUTCOME? Continuous or categorical?
  2. How many GROUPS are you comparing? 2, 3+, or paired?
  3. Is the data NORMAL (or large enough that the mean is stable)?
Outcome Groups Distribution Test R function
Continuous 2 indep. Normal t-test t.test(bmi ~ sex)
Continuous 3+ groups Normal ANOVA aov(bmi ~ grade)
Categorical 2×2+ - Chi-square chisq.test(table(...))
Continuous continuous predictor - Linear regression lm(bmi ~ stweight)
Binary outcome any - Logistic regression glm(..., family="binomial")

Warning

The first question is the most important. Continuous outcome → t-test / regression world. Categorical outcome → chi-square / logistic world.

flowchart TD
    Start(["<b>START</b><br/>What is your OUTCOME?"])

    Start -->|Categorical| Cat["Categorical Outcome"]
    Start -->|Continuous| Cont["Continuous Outcome"]
    Start -->|Continuous predictor| Reg["<b>Linear Regression</b>"]

    Cat --> Chi["<b>Chi-square</b><br/>Expected counts ≥ 5"]
    Cat --> Fish["<b>Fisher's exact</b><br/>Expected counts < 5"]

    Cont --> Norm{"Data normally<br/>distributed?"}

    Norm -->|YES| Param["Parametric"]
    Norm -->|NO| NonParam["Non-parametric"]

    Param -->|2 groups| TTest["<b>Independent t-test</b>"]
    Param -->|3+ groups| ANOVA["<b>One-way ANOVA</b><br/>+ Tukey post-hoc"]

    NonParam -->|2 groups| Wilc["<b>Wilcoxon rank-sum</b>"]
    NonParam -->|3+ groups| Krus["<b>Kruskal–Wallis</b><br/>+ Dunn post-hoc"]

    classDef start fill:#FBF7EF,stroke:#1B3A6B,stroke-width:2px,color:#122849
    classDef test fill:#0D9488,stroke:#0a6b62,color:#FBF7EF
    classDef branch fill:#F4ECDC,stroke:#1B3A6B,color:#122849
    classDef decision fill:#F59E0B,stroke:#a76b06,color:#122849

    class Start start
    class Reg,Chi,Fish,TTest,ANOVA,Wilc,Krus test
    class Cat,Cont,Param,NonParam branch
    class Norm decision

Linear regression: BMI predicted by weight

model1 <- lm(bmi ~ stweight, data = yrbss)
summary(model1)

Call:
lm(formula = bmi ~ stweight, data = yrbss)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4075  -1.6816  -0.2079   1.4228  24.4669 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.560047   0.092984   70.55   <2e-16 ***
stweight    0.251396   0.001336  188.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.611 on 13439 degrees of freedom
  (6559 observations deleted due to missingness)
Multiple R-squared:  0.725, Adjusted R-squared:  0.7249 
F-statistic: 3.542e+04 on 1 and 13439 DF,  p-value: < 2.2e-16

Tidy regression output with broom

broom::tidy(model1, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)    6.56    0.0930       70.6       0    6.38      6.74 
2 stweight       0.251   0.00134     188.        0    0.249     0.254
broom::glance(model1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.725         0.725  2.61    35421.       0     1 -31973. 63951. 63974.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

broom turns model output into a tidy data frame easy to filter, format, and pass to gtsummary or ggplot2.

  • tidy() - one row per coefficient (estimate, SE, p-value, CI)
  • glance() - one row per model (R², AIC, n)
  • augment() - adds fitted values + residuals to your data

Multiple regression: adding sex and age

model2 <- lm(bmi ~ stweight + sex + race4, data = yrbss)
summary(model2)

Call:
lm(formula = bmi ~ stweight + sex + race4, data = yrbss)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.7772  -1.3761  -0.1224   1.1726  25.1075 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     6.115324   0.086190  70.951   <2e-16 ***
stweight                        0.279391   0.001225 228.160   <2e-16 ***
sexMale                        -2.809706   0.041274 -68.075   <2e-16 ***
race4Black or African American  0.061472   0.058664   1.048    0.295    
race4Hispanic/Latino            0.557191   0.056076   9.936   <2e-16 ***
race4White                     -0.491424   0.052357  -9.386   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.214 on 13022 degrees of freedom
  (6972 observations deleted due to missingness)
Multiple R-squared:  0.8025,    Adjusted R-squared:  0.8024 
F-statistic: 1.058e+04 on 5 and 13022 DF,  p-value: < 2.2e-16

Publication table for the regression

gtsummary::tbl_regression(model2,
  label = list(stweight ~ "Body weight (kg)",
    sex      ~ "Sex", race4      ~ "Ethinicity")) |>
  bold_labels() |>
  add_glance_source_note(
    include = c(r.squared, adj.r.squared, AIC, nobs))

Publication table for the regression

Characteristic Beta 95% CI p-value
Body weight (kg) 0.28 0.28, 0.28 <0.001
Sex


    Female
    Male -2.8 -2.9, -2.7 <0.001
Ethinicity


    All other races
    Black or African American 0.06 -0.05, 0.18 0.3
    Hispanic/Latino 0.56 0.45, 0.67 <0.001
    White -0.49 -0.59, -0.39 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.802; Adjusted R² = 0.802; AIC = 57,690; No. Obs. = 13,028

Tip

tbl_regression() automatically formats CIs, picks exponentiation for logistic models, and bolds significant p-values. One line replaces ~30 lines of manual table code.

Logistic regression: when the outcome is yes / no

So far our outcome (bmi) has been continuous. But many real questions are about a binary outcome:

  • Is the student overweight? (yes / no)
  • Did the patient survive? (yes / no)
  • Did the household adopt the intervention? (yes / no)

Why we can’t just use linear regression here:

  • A linear model can predict probabilities below 0 or above 1 - meaningless
  • The relationship between predictors and probability is S-shaped, not straight
  • The variance of a 0/1 outcome isn’t constant - violates the equal-variance assumption.

The fix: transform the probability so the linear part of the model has no upper or lower bound. That transformation is called the logit.

In R: glm(y ~ x, family = binomial) fits this model. The default link is logit. You almost never need to change it.

yrbss <- yrbss |>
  mutate(overweight = factor(ifelse(bmi >= 25, "Yes", "No")))

# Then model:
glm(overweight ~ age + sex + grade + race4,
    data = yrbss,
    family = binomial)

Call:  glm(formula = overweight ~ age + sex + grade + race4, family = binomial, 
    data = yrbss)

Coefficients:
                   (Intercept)                 age13 years old  
                      -0.97364                        -1.08840  
               age14 years old                 age15 years old  
                      -0.58921                        -0.33808  
               age16 years old                 age17 years old  
                      -0.13499                         0.06457  
      age18 years old or older                         sexMale  
                       0.29058                         0.21353  
                     grade10th                       grade11th  
                      -0.09245                        -0.11765  
                     grade12th  race4Black or African American  
                      -0.17767                         0.37905  
          race4Hispanic/Latino                      race4White  
                       0.32089                        -0.07208  

Degrees of Freedom: 12991 Total (i.e. Null);  12978 Residual
  (7008 observations deleted due to missingness)
Null Deviance:      15540 
Residual Deviance: 15280    AIC: 15310

Odds and odds ratios - the interpretation key

Odds = probability of event / probability of no event.

Probability Odds In words
0.10 0.11 (1 : 9) “Long shot”
0.50 1.00 (1 : 1) “Even money”
0.80 4.00 (4 : 1) “Strongly favoured”

Odds ratio (OR) = odds in group A / odds in group B.

OR Meaning
= 1 No association - both groups have the same odds
> 1 Group A has higher odds (stronger if larger)
< 1 Group A has lower odds (stronger if smaller)

Important

Coefficients from glm() are on the log-odds scale. To interpret them as odds ratios, you must exponentiate: exp(coef). We never report log-odds in a paper - always ORs with 95 % CI.

Step 1 - Create the binary outcome

yrbss <- yrbss |>
  mutate(overweight = factor(ifelse(bmi >= 25, "Yes", "No"),
    levels = c("No", "Yes")))   # explicit order
# Check prevalence
yrbss |>  count(overweight) |>
  mutate(pct = round(n / sum(n) * 100, 1))
# A tibble: 3 × 3
  overweight     n   pct
  <fct>      <int> <dbl>
1 No          9684  48.4
2 Yes         3858  19.3
3 <NA>        6458  32.3

Three things to notice:

  1. bmi >= 25 is the WHO cut-off for overweight using a clinically meaningful threshold matters (don’t pick cut-offs by data-snooping).
  2. levels = c("No", "Yes") makes “No” the reference category. R will model the probability of “Yes” - what we want.

Step 2 - Fit the model

model_log <- glm(overweight ~ age + sex + grade + race4,
                 data   = yrbss, family = binomial)
summary(model_log)

Call:
glm(formula = overweight ~ age + sex + grade + race4, family = binomial, 
    data = yrbss)

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -0.97364    0.50687  -1.921   0.0547 .  
age13 years old                -1.08840    0.66961  -1.625   0.1041    
age14 years old                -0.58921    0.50904  -1.157   0.2471    
age15 years old                -0.33808    0.50531  -0.669   0.5035    
age16 years old                -0.13499    0.50453  -0.268   0.7890    
age17 years old                 0.06457    0.50466   0.128   0.8982    
age18 years old or older        0.29058    0.50613   0.574   0.5659    
sexMale                         0.21353    0.03941   5.419 6.01e-08 ***
grade10th                      -0.09245    0.06773  -1.365   0.1723    
grade11th                      -0.11765    0.08165  -1.441   0.1496    
grade12th                      -0.17767    0.09343  -1.902   0.0572 .  
race4Black or African American  0.37905    0.05855   6.474 9.52e-11 ***
race4Hispanic/Latino            0.32089    0.05651   5.679 1.36e-08 ***
race4White                     -0.07208    0.05488  -1.313   0.1890    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 15542  on 12991  degrees of freedom
Residual deviance: 15278  on 12978  degrees of freedom
  (7008 observations deleted due to missingness)
AIC: 15306

Number of Fisher Scoring iterations: 4

Reading the raw summary() output

The output has four parts you should always check:

Section What it tells you
Call Confirms the formula and family - sanity check
Coefficients Estimate (log-odds), SE, z-value, p-value
Deviance Null vs residual - drop in deviance = how much variance the model explains
AIC Lower is better; used to compare models

Warning

The Estimate column is in log-odds, not odds ratios. A coefficient of 0.45 means log-odds increase by 0.45 - which is exp(0.45) ≈ 1.57, an OR of 1.57. Always exponentiate before interpreting.

Tip

Reference categories (the level not shown in the output) are the comparison group. By default it’s the first factor level - for sex that’s “Female”, so the sexMale row compares males to females.

Step 3 - Get odds ratios with broom

broom::tidy() does the exponentiation, the CIs, and tidies everything into a data frame in one call:

broom::tidy(model_log, exponentiate = TRUE,   # log-odds → ORs
            conf.int     = TRUE,   # add 95% CI
            conf.level   = 0.95)
# A tibble: 14 × 7
   term                  estimate std.error statistic p.value conf.low conf.high
   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
 1 (Intercept)              0.378     0.507    -1.92    0.055    0.13      0.986
 2 age13 years old          0.337     0.67     -1.62    0.104    0.088     1.27 
 3 age14 years old          0.555     0.509    -1.16    0.247    0.211     1.62 
 4 age15 years old          0.713     0.505    -0.669   0.503    0.274     2.07 
 5 age16 years old          0.874     0.505    -0.268   0.789    0.336     2.53 
 6 age17 years old          1.07      0.505     0.128   0.898    0.41      3.09 
 7 age18 years old or o…    1.34      0.506     0.574   0.566    0.513     3.88 
 8 sexMale                  1.24      0.039     5.42    0        1.15      1.34 
 9 grade10th                0.912     0.068    -1.36    0.172    0.798     1.04 
10 grade11th                0.889     0.082    -1.44    0.15     0.758     1.04 
11 grade12th                0.837     0.093    -1.90    0.057    0.697     1.01 
12 race4Black or Africa…    1.46      0.059     6.47    0        1.30      1.64 
13 race4Hispanic/Latino     1.38      0.057     5.68    0        1.23      1.54 
14 race4White               0.93      0.055    -1.31    0.189    0.836     1.04 

Now the estimate column is the OR, and conf.low / conf.high give the 95 % CI - ready to paste into a manuscript.

Step 4 - Publication table with gtsummary

gtsummary::tbl_regression(
  model_log,
  exponentiate = TRUE,                    # show ORs, not log-odds
  label = list(
    age   ~ "Age",
    sex   ~ "Sex",
    grade ~ "School grade",
    race4 ~ "Race / ethnicity")) |>
  bold_labels() |>
  bold_p(t = 0.05) |>
  add_glance_source_note(
    include = c(nobs, AIC))

Publication table with gtsummary

Characteristic OR 95% CI p-value
age


    12 years old or younger
    13 years old 0.34 0.09, 1.27 0.10
    14 years old 0.55 0.21, 1.62 0.2
    15 years old 0.71 0.27, 2.07 0.5
    16 years old 0.87 0.34, 2.53 0.8
    17 years old 1.07 0.41, 3.09 0.9
    18 years old or older 1.34 0.51, 3.88 0.6
Sex


    Female
    Male 1.24 1.15, 1.34 <0.001
School grade


    9th
    10th 0.91 0.80, 1.04 0.2
    11th 0.89 0.76, 1.04 0.15
    12th 0.84 0.70, 1.01 0.057
Race / ethnicity


    All other races
    Black or African American 1.46 1.30, 1.64 <0.001
    Hispanic/Latino 1.38 1.23, 1.54 <0.001
    White 0.93 0.84, 1.04 0.2
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
No. Obs. = 12,992; AIC = 15,306

Interpreting your odds ratios

Pick one row at a time and translate it into a sentence. Template:

“Compared to [reference category], [predictor level] had OR-fold the odds of being overweight (95 % CI: low-high; p = X.XX), holding all other variables constant.”

Worked example - sexMale (let’s say OR = 1.55, 95 % CI 1.45-1.66):

Male students had 1.55 times the odds of being overweight compared to female students (95 % CI: 1.45-1.66; p < 0.001), adjusting for age, grade, and race / ethnicity.

Important

Three things every interpretation must include:

  1. The reference group (“compared to females”)
  2. The direction and magnitude (“1.55 times the odds”)
  3. The adjustment set (“holding age, grade, race constant”)

Without these, an OR is meaningless.

Warning

OR is not relative risk. When the outcome is common (>10 %), the OR exaggerates the relative risk. With ~28 % overweight prevalence, an OR of 1.55 is not the same as “1.55× more likely” - be careful with that phrasing in plain-language summaries.