Basic Data Analysis with R

April 28 - May 1, 2026

Yebelay & Leykun

Learning Objectives

By the end of this module you will be able to:

Understanding our data: load, inspect, audit missingness
Summarising variables: the gtsummary workflow
Graphs & plots: distributions, comparisons, what to look for
Choosing what to report: mean vs median · table vs graph
Statistical tests: t-test, chi-square, regression

Part 1. Know Your Data: What is YRBSS?

Before you analyse anything, understand what you have

About the survey

US national survey of high-school students
Monitors 6 categories of health-risk behaviour: injury, tobacco, alcohol/drugs, sexual behaviour, diet, physical activity
Run by CDC every 2 years since 1991
Our subset: 20,000 students × 8 variables
Real-world missingness: 32% of bmi

Variables in our dataset

Variable	Type	Example
`record`	ID	931897
`age`	Categorical	17 years old
`sex`	Categorical	Female / Male
`grade`	Categorical	9th – 12th
`race4`	Categorical	White / Hispanic…
`bmi`	Numeric	13.2 – 53.9
`stweight`	Numeric	Weight in kg

Note

We use this dataset because it has the real problems analysts face every day: missing values, skewed distributions, mixed variable types, and large enough n that we can run real tests.

Load packages, read the data

library(tidyverse)   # data wrangling + ggplot2
library(gtsummary)   # publication-ready tables
library(janitor)     # clean_names(), tabyl()

yrbss <- read_csv("yrbss.csv")

Why this order matters:

tidyverse first - gives you read_csv(), mutate(), filter(), plus pipes
gtsummary next - for the descriptive table you’ll build later
janitor last - small helpers; loaded last so its tabyl() doesn’t get masked

Tip

Reproducibility rule: keep yrbss.csv next to your .qmd. Use relative paths ("yrbss.csv"), never absolute paths like "C:/Users/Me/Desktop/yrbss.csv" — they break the moment someone else opens your project.

Inspect the structure first

glimpse(yrbss)

Rows: 20,000
Columns: 11
$ record      <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1…
$ age         <fct> 15 years old, 17 years old, 18 years old or older, 15 year…
$ sex         <fct> Female, Female, Male, Male, Male, Male, Male, Male, Male, …
$ grade       <fct> 10th, 12th, 11th, 10th, 9th, 9th, 11th, 12th, 12th, 10th, …
$ race4       <fct> White, White, Hispanic/Latino, Black or African American, …
$ race7       <chr> "White", "White", "Hispanic/Latino", "Black or African Ame…
$ bmi         <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.25…
$ stweight    <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, …
$ obese       <fct> Not obese, Not obese, NA, Not obese, Not obese, NA, Not ob…
$ bmi_cat     <fct> Underweight, Normal, NA, Overweight, Normal, NA, Normal, N…
$ bmi_missing <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALS…

Why we always do this first:

glimpse() is the analyst’s first move on any new dataset. In one line you learn:

How many rows and columns is this what you expected?
Variable names and types <dbl>, <chr>, <fct>?
The first few values do they look sensible?

Catching a wrongly-typed variable here (e.g. age stored as text) saves hours of debugging later.

Audit missing data - never skip this

colSums(is.na(yrbss))

     record         age         sex       grade       race4       race7 
          0         205         231         406         710        1187 
        bmi    stweight       obese     bmi_cat bmi_missing 
       6458        6557        6458        6458           0

What this tells you:

bmi and stweight each have ~6,400 missing (~32%) - these are students who didn’t report height/weight
age, sex, grade, race4 are <2% missing - safe for descriptive tables

Variable types determine your analysis

yrbss |>
  summarise(across(everything(), class))

# A tibble: 1 × 11
  record  age   sex   grade race4 race7 bmi   stweight obese bmi_cat bmi_missing
  <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>    <chr> <chr>   <chr>      
1 numeric fact… fact… fact… fact… char… nume… numeric  fact… factor  logical

The four R types you need to recognise:

R type	Used for	Example variable
`dbl` / `int`	Continuous / count	`bmi`, `stweight`
`chr`	Text - recode to factor	raw `age` strings
`fct`	Categorical with levels	`grade` after `factor()`
`lgl`	TRUE / FALSE	`bmi_missing`

Why type matters: gtsummary, ggplot2, and every statistical test treat numeric and categorical variables differently. If grade is stored as <chr> instead of <fct>, your tables sort alphabetically ("10th" before "9th") - wrong.

Descriptive Statistics: The four dimensions of any variable

Summarise every variable before you model anything

Dimension	What it captures	Statistic
Centre	Typical value	Mean, median, mode
Spread	Variability	SD, IQR, range
Shape	Symmetry & tails	Skewness, kurtosis
Frequency	Category counts	n, %, cumulative %

What to report by variable type

Continuous, normal → Mean ± SD
Continuous, skewed → Median (IQR)
Categorical → n (%)
Binary → n (%) of the positive class

Describe all variables before any modelling. A one-line summary per variable prevents major errors downstream.

What `gtsummary` does for you

Auto-detects variable types: continuous vs categorical
Continuous → median (IQR) by default · Categorical → n (%)
add_p() automatically picks the correct statistical test
add_overall() adds an all-participants column
Output-ready for HTML, Word, and PDF via as_gt() or as_flex_table()
Bold or italicise labels in one pipe step

Tip

gtsummary is the single most time-saving package in R for descriptive tables.
One pipe replaces ~30 lines of manual summarise() + pivot_longer() + formatting code.

Basic `tbl_summary()` code

table1 <- yrbss |>
  select(sex, grade, race4, bmi) |>
  tbl_summary() |>
  bold_labels()

table1

What this does, line by line:

select(): pick the variables you want to describe
tbl_summary(): auto-detect types and summarise each
bold_labels(): make variable names stand out

The output is a publication-ready table. To save it for Word, pipe into as_flex_table() |> flextable::save_as_docx(path = "table1.docx").

Basic `tbl_summary()` output

Characteristic	N = 20,000¹
sex
Female	9,592 (49%)
Male	10,177 (51%)
Unknown	231
grade
9th	5,219 (27%)
10th	4,907 (25%)
11th	4,891 (25%)
12th	4,577 (23%)
Unknown	406
bmi	22.3 (20.1, 25.7)
Unknown	6,458
¹ n (%); Median (Q1, Q3)

Stratified `tbl_summary()` code

yrbss |>
  select(sex, grade, race4, bmi) |>
  tbl_summary(by = sex) |>     # split columns by sex
  add_p() |>                   # auto-pick the right test
  add_overall() |>             # add an "Overall" column
  bold_labels() |>
  italicize_levels()

The by = argument is the magic. It splits the table into one column per group and - combined with add_p() - automatically chooses:

Wilcoxon for skewed continuous (like BMI)
Chi-square for categorical
Fisher’s exact if any expected cell < 5

Stratified `tbl_summary()` output

Characteristic	Overall N = 19,769¹	Female N = 9,592¹	Male N = 10,177¹	p-value²
grade				0.12
9th	5,176 (27%)	2,492 (26%)	2,684 (27%)
10th	4,871 (25%)	2,332 (25%)	2,539 (25%)
11th	4,861 (25%)	2,365 (25%)	2,496 (25%)
12th	4,540 (23%)	2,277 (24%)	2,263 (23%)
Unknown	321	126	195
race4				0.8
All other races	4,696 (25%)	2,260 (24%)	2,436 (25%)
Black or African American	4,055 (21%)	1,984 (21%)	2,071 (21%)
Hispanic/Latino	4,625 (24%)	2,252 (24%)	2,373 (24%)
White	5,762 (30%)	2,826 (30%)	2,936 (30%)
Unknown	631	270	361
bmi	22.3 (20.1, 25.7)	22.1 (20.0, 25.3)	22.5 (20.3, 25.8)	<0.001
Unknown	6,227	2,970	3,257
¹ n (%); Median (Q1, Q3)
² Pearson’s Chi-squared test; Wilcoxon rank sum test

Customising labels and statistics

yrbss |>  select(sex, bmi, stweight) |>
  tbl_summary( by = sex,
    statistic = list(
      all_continuous()  ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous() ~ 1),
    label  = list(
      bmi      ~ "Body Mass Index (kg/m²)",
      stweight ~ "Body weight (kg)"),
    missing = "ifany", missing_text = "Missing, n") |>
  add_p(test = list(all_continuous() ~ "wilcox.test")) |>
  add_overall() |>  bold_labels() |>
  modify_caption("**Table 1. Participant characteristics**")

Tip

The label argument relabels variables for publication - no need to rename your data frame. {p25} / {p75} show explicit percentiles; use {mean} / {sd} for normally-distributed variables.

How to read a `gtsummary` table

**Table 1. Participant characteristics**
Characteristic	Overall N = 19,769¹	Female N = 9,592¹	Male N = 10,177¹	p-value²
Body Mass Index (kg/m²)	22.3 (20.1, 25.7)	22.1 (20.0, 25.3)	22.5 (20.3, 25.8)	<0.001
Missing, n	6,227	2,970	3,257
Body weight (kg)	64.0 (55.8, 74.8)	59.0 (52.2, 68.0)	69.4 (61.2, 81.2)	<0.001
Missing, n	6,326	3,050	3,276
¹ Median (Q1, Q3)
² Wilcoxon rank sum test

Note

IQR = interquartile range (25th-75th percentile) - the middle 50 % of the data
Medians appear because BMI and weight are right-skewed
The Overall column is the unstratified total - useful for the abstract

Exploratory Data Analysis

Why we plot every variable first?
- Visualise before you conclude

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.”
- John Tukey, 1977

Four EDA principles

Principle	What to do
📊 Look first	Plot before you model
📈 Univariate first	One variable, then pairs
📋 Plot AND tabulate	Numbers give magnitude; graphs give shape
📝 Document decisions	If you drop an outlier, write down why

EDA checklist

Distribution of every continuous variable
Frequency table of every categorical
Outlier check (visual + IQR rule)
Distributions stratified by group
Pairwise scatter plots
Correlation matrix
Missing-data patterns (naniar)

BMI distribution (histogram)

med_bmi <- median(yrbss$bmi, na.rm = TRUE)
ggplot(yrbss, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "#0D9488",
                 colour = "white") +
  geom_vline(xintercept = med_bmi,
             colour = "#EF4444", linewidth = 1,
             linetype = "dashed") +
  annotate("text", x = med_bmi + 1, y = 1700,
           label = paste("Median =",
                         round(med_bmi, 1)),
           colour = "#EF4444", hjust = 0) +
  labs(title = "Distribution of BMI",
       x = "BMI (kg/m²)", y = "Count")

What to look for in any histogram:

Centre: where is the bulk of the data?
Spread: narrow peak or wide spread?
Shape: symmetric (bell), skewed (long tail), or bimodal (two peaks)?
Outliers: values far from the bulk that need investigation

For BMI: clear right-skew (long tail to the right) → use median, not mean.

BMI density curves by sex

ggplot(yrbss |> filter(!is.na(sex)),
       aes(x = bmi, fill = sex)) +
  geom_density(alpha = 0.45, colour = NA) +
  scale_fill_manual( values = c(Female = "#1B3A6B",
               Male   = "#F59E0B")) +
  labs(title = "BMI density by sex",
       x = "BMI (kg/m²)", y = "Density", fill = NULL) +
  theme(legend.position = "top")

Reading two density curves side-by-side:

A density plot is a smoothed histogram - the area under each curve sums to 1. So you can compare two groups of different sizes on the same axes.

Both curves are right-skewed - same shape, no surprises there
Male curve sits to the right - males have systematically higher BMI
Similar widths → similar IQRs (similar spread)

This is what a single number like “mean BMI” would hide entirely.

Box plots: comparing groups

ggplot(yrbss |> filter(!is.na(grade)),
       aes(x = grade, y = bmi, fill = grade)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#1B3A6B", "#0D9488","#F59E0B","#EF4444" )) +
  labs(title = "BMI by Grade", x = NULL, 
       y = "BMI (kg/m²)") +
  theme(legend.position = "none")

Figure 3: BMI by grade (notched box plot)

Anatomy of a box plot:

Box = IQR (25th-75th percentile)
Line in box = median
Whiskers = 1.5 × IQR
Points beyond whiskers = outlier candidates

Bar chart: categorical frequency

yrbss |>
  filter(!is.na(race4)) |>
  count(race4) |> mutate(pct = n / sum(n) * 100) |>
  ggplot(aes(y = reorder(race4, n), x = pct, fill = race4)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", pct)),
    hjust = -0.1, size = 3.2) +
  scale_x_continuous(limits = c(0, 36)) +
  scale_fill_manual(values = c("#1B3A6B","#0D9488","#F59E0B","#EF4444")) +
  labs(title = "Race / ethnicity distribution",
       x = "Percentage (%)", y = NULL)

Three rules for good bar charts:

Sort by frequency with reorder(): never alphabetical (unless ordinal like grade)
Use horizontal bars when category names are long - keeps labels readable
Avoid 3-D and pie charts: humans can’t compare angles or volumes accurately

Add value labels with geom_text() so readers don’t have to estimate from the axis.

Stacked bars: composition within groups

yrbss |> filter(!is.na(grade), !is.na(sex)) |>
  count(grade, sex) |>
  group_by(grade) |>
  mutate(pct = n / sum(n) * 100) |>
  ggplot(aes(x = grade, y = pct, fill = sex)) +
  geom_col(position = "fill") +
  geom_text(aes(label = sprintf("%.0f%%", pct)),
    position = position_fill(vjust = 0.5),
    colour = "white", fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c("#1B3A6B","#0D9488")) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Sex distribution by grade",
       subtitle = "Each bar sums to 100%",
       x = "Grade", y = "Proportion", fill = "Sex")

Stacked vs dodged: pick the right one:

Stacked (position = "fill") - best for proportions within a group; each bar sums to 100%.
Dodged (position = "dodge") - best for absolute counts between groups; bars side-by-side.

Use stacked when relative composition matters (here: sex balance per grade). Use dodged when raw counts matter.

What to Report

Mean vs median

Use MEAN when

Data is normally distributed (bell-shaped)
No extreme outliers
Continuous: height, blood pressure
Parametric tests (t-test, ANOVA) will follow
Symmetric Likert scales

Reporting: Mean ± SD
Example: Height: 165.3 ± 8.2 cm

Use MEDIAN when

Data is skewed (income, BMI, lab values)
Outliers are present or expected
Ordinal data: pain scores, satisfaction
Non-parametric tests will follow (Wilcoxon)
YRBSS BMI and weight → MEDIAN

Reporting: Median (IQR)
Example: BMI: 22.3 (19.8-25.7) kg/m²

Important

Why mean fails with skewed data: a single student with BMI = 53.9 pulls the mean upward but does not affect the median. The median better represents the typical student.

Testing for normality

p1 <- ggplot(yrbss, aes(bmi)) +
  geom_histogram(bins = 30, fill = "#0D9488",
                 colour = "white") +
  labs(title = "Histogram of BMI", 
       x = "BMI", y = "Count")

p2 <- ggplot(yrbss |> 
               filter(!is.na(bmi)), 
             aes(sample = bmi)) +
  stat_qq(alpha = 0.3, size = 0.5, 
          colour = "#1B3A6B") +
  stat_qq_line(colour = "#EF4444", 
               linewidth = 0.7) +
  labs(title = "Q–Q plot of BMI", 
       x = "Theoretical", y = "Sample")
p1 + p2  # patchwork

Three ways to check normality - use them together:

Histogram - does it look bell-shaped?
Q–Q plot - do points follow the diagonal? Curve up at the top-right = right-skew

Check	Result	Decision
Histogram	Nearly symmetric?	→ Mean ± SD
Q–Q plot	Points on the line?	→ Mean ± SD

Table or graph: when to use which

Goal	Table	Graph
Exact values matter	Lookup by readers	Loses precision
Trend over time	Hard to see	Line plot
Compare many groups	Rows are clear	Small multiples
Distribution shape	Numbers hide shape	Histogram / violin
Two-variable relationship	Hard to parse	Scatter plot
Frequency / prevalence	n (%)	Bar chart (≤ 5 groups)
Public presentation	Audiences skip tables	Graphs are memorable

Note

Practical rule: use a table when readers need to look up specific numbers. Use a graph when they need to understand a pattern, trend, or shape.

Quick-reference reporting card

Variable type	Example	Report as	Test	Best chart
Continuous, normal	Height, SBP	Mean ± SD	t-test / ANOVA	Histogram, box
Continuous, skewed	BMI, income	Median (IQR)	Wilcoxon / KW	Box, violin
Ordinal	Pain (1–10)	Median (IQR)	Wilcoxon / Spearman	Bar, heatmap
Nominal / Binary	Sex, race	n (%)	Chi-square / Fisher	Bar (sorted)
Count	Days absent	Median (IQR)	Poisson / NB	Histogram
Time-to-event	Survival	Median (95% CI)	Log-rank, Cox	Kaplan–Meier

Statistical Tests - three questions

From describing the data to testing hypotheses

Before any test, ask in this order:

What is the OUTCOME? Continuous or categorical?
How many GROUPS are you comparing? 2, 3+, or paired?
Is the data NORMAL (or large enough that the mean is stable)?

Outcome	Groups	Distribution	Test	R function
Continuous	2 indep.	Normal	t-test	`t.test(bmi ~ sex)`
Continuous	3+ groups	Normal	ANOVA	`aov(bmi ~ grade)`
Categorical	2×2+	-	Chi-square	`chisq.test(table(...))`
Continuous	continuous predictor	-	Linear regression	`lm(bmi ~ stweight)`
Binary outcome	any	-	Logistic regression	`glm(..., family="binomial")`

Warning

The first question is the most important. Continuous outcome → t-test / regression world. Categorical outcome → chi-square / logistic world.

flowchart TD
    Start(["<b>START</b><br/>What is your OUTCOME?"])

    Start -->|Categorical| Cat["Categorical Outcome"]
    Start -->|Continuous| Cont["Continuous Outcome"]
    Start -->|Continuous predictor| Reg["<b>Linear Regression</b>"]

    Cat --> Chi["<b>Chi-square</b><br/>Expected counts ≥ 5"]
    Cat --> Fish["<b>Fisher's exact</b><br/>Expected counts < 5"]

    Cont --> Norm{"Data normally<br/>distributed?"}

    Norm -->|YES| Param["Parametric"]
    Norm -->|NO| NonParam["Non-parametric"]

    Param -->|2 groups| TTest["<b>Independent t-test</b>"]
    Param -->|3+ groups| ANOVA["<b>One-way ANOVA</b><br/>+ Tukey post-hoc"]

    NonParam -->|2 groups| Wilc["<b>Wilcoxon rank-sum</b>"]
    NonParam -->|3+ groups| Krus["<b>Kruskal–Wallis</b><br/>+ Dunn post-hoc"]

    classDef start fill:#FBF7EF,stroke:#1B3A6B,stroke-width:2px,color:#122849
    classDef test fill:#0D9488,stroke:#0a6b62,color:#FBF7EF
    classDef branch fill:#F4ECDC,stroke:#1B3A6B,color:#122849
    classDef decision fill:#F59E0B,stroke:#a76b06,color:#122849

    class Start start
    class Reg,Chi,Fish,TTest,ANOVA,Wilc,Krus test
    class Cat,Cont,Param,NonParam branch
    class Norm decision

Linear regression: BMI predicted by weight

model1 <- lm(bmi ~ stweight, data = yrbss)
summary(model1)


Call:
lm(formula = bmi ~ stweight, data = yrbss)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4075  -1.6816  -0.2079   1.4228  24.4669 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.560047   0.092984   70.55   <2e-16 ***
stweight    0.251396   0.001336  188.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.611 on 13439 degrees of freedom
  (6559 observations deleted due to missingness)
Multiple R-squared:  0.725, Adjusted R-squared:  0.7249 
F-statistic: 3.542e+04 on 1 and 13439 DF,  p-value: < 2.2e-16

Tidy regression output with `broom`

broom::tidy(model1, conf.int = TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)    6.56    0.0930       70.6       0    6.38      6.74 
2 stweight       0.251   0.00134     188.        0    0.249     0.254

broom::glance(model1)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.725         0.725  2.61    35421.       0     1 -31973. 63951. 63974.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

broom turns model output into a tidy data frame easy to filter, format, and pass to gtsummary or ggplot2.

tidy() - one row per coefficient (estimate, SE, p-value, CI)
glance() - one row per model (R², AIC, n)
augment() - adds fitted values + residuals to your data

Multiple regression: adding sex and age

model2 <- lm(bmi ~ stweight + sex + race4, data = yrbss)
summary(model2)


Call:
lm(formula = bmi ~ stweight + sex + race4, data = yrbss)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.7772  -1.3761  -0.1224   1.1726  25.1075 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     6.115324   0.086190  70.951   <2e-16 ***
stweight                        0.279391   0.001225 228.160   <2e-16 ***
sexMale                        -2.809706   0.041274 -68.075   <2e-16 ***
race4Black or African American  0.061472   0.058664   1.048    0.295    
race4Hispanic/Latino            0.557191   0.056076   9.936   <2e-16 ***
race4White                     -0.491424   0.052357  -9.386   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.214 on 13022 degrees of freedom
  (6972 observations deleted due to missingness)
Multiple R-squared:  0.8025,    Adjusted R-squared:  0.8024 
F-statistic: 1.058e+04 on 5 and 13022 DF,  p-value: < 2.2e-16

Publication table for the regression

gtsummary::tbl_regression(model2,
  label = list(stweight ~ "Body weight (kg)",
    sex      ~ "Sex", race4      ~ "Ethinicity")) |>
  bold_labels() |>
  add_glance_source_note(
    include = c(r.squared, adj.r.squared, AIC, nobs))

Publication table for the regression

Characteristic	Beta	95% CI	p-value
Body weight (kg)	0.28	0.28, 0.28	<0.001
Sex
Female	—	—
Male	-2.8	-2.9, -2.7	<0.001
Ethinicity
All other races	—	—
Black or African American	0.06	-0.05, 0.18	0.3
Hispanic/Latino	0.56	0.45, 0.67	<0.001
White	-0.49	-0.59, -0.39	<0.001
Abbreviation: CI = Confidence Interval
R² = 0.802; Adjusted R² = 0.802; AIC = 57,690; No. Obs. = 13,028

Tip

tbl_regression() automatically formats CIs, picks exponentiation for logistic models, and bolds significant p-values. One line replaces ~30 lines of manual table code.

Logistic regression: when the outcome is yes / no

So far our outcome (bmi) has been continuous. But many real questions are about a binary outcome:

Is the student overweight? (yes / no)
Did the patient survive? (yes / no)
Did the household adopt the intervention? (yes / no)

Why we can’t just use linear regression here:

A linear model can predict probabilities below 0 or above 1 - meaningless
The relationship between predictors and probability is S-shaped, not straight
The variance of a 0/1 outcome isn’t constant - violates the equal-variance assumption.

The fix: transform the probability so the linear part of the model has no upper or lower bound. That transformation is called the logit.

The logit link - modelling probabilities

We don’t model the probability \(p\) directly. We model its log-odds:

\[\text{logit}(p) = \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots\]

Why this works

p / (1 - p) converts probability into odds (ranges 0 to ∞)
Taking log() makes it range from −∞ to +∞
Now the right-hand side can take any value - no more impossible predictions
Inverting the logit gives back a probability between 0 and 1

In R: glm(y ~ x, family = binomial) fits this model. The default link is logit. You almost never need to change it.

yrbss <- yrbss |>
  mutate(overweight = factor(ifelse(bmi >= 25, "Yes", "No")))

# Then model:
glm(overweight ~ age + sex + grade + race4,
    data = yrbss,
    family = binomial)


Call:  glm(formula = overweight ~ age + sex + grade + race4, family = binomial, 
    data = yrbss)

Coefficients:
                   (Intercept)                 age13 years old  
                      -0.97364                        -1.08840  
               age14 years old                 age15 years old  
                      -0.58921                        -0.33808  
               age16 years old                 age17 years old  
                      -0.13499                         0.06457  
      age18 years old or older                         sexMale  
                       0.29058                         0.21353  
                     grade10th                       grade11th  
                      -0.09245                        -0.11765  
                     grade12th  race4Black or African American  
                      -0.17767                         0.37905  
          race4Hispanic/Latino                      race4White  
                       0.32089                        -0.07208  

Degrees of Freedom: 12991 Total (i.e. Null);  12978 Residual
  (7008 observations deleted due to missingness)
Null Deviance:      15540 
Residual Deviance: 15280    AIC: 15310

Odds and odds ratios - the interpretation key

Odds = probability of event / probability of no event.

Probability	Odds	In words
0.10	0.11 (1 : 9)	“Long shot”
0.50	1.00 (1 : 1)	“Even money”
0.80	4.00 (4 : 1)	“Strongly favoured”

Odds ratio (OR) = odds in group A / odds in group B.

OR	Meaning
= 1	No association - both groups have the same odds
> 1	Group A has higher odds (stronger if larger)
< 1	Group A has lower odds (stronger if smaller)

Important

Coefficients from glm() are on the log-odds scale. To interpret them as odds ratios, you must exponentiate: exp(coef). We never report log-odds in a paper - always ORs with 95 % CI.

Step 1 - Create the binary outcome

yrbss <- yrbss |>
  mutate(overweight = factor(ifelse(bmi >= 25, "Yes", "No"),
    levels = c("No", "Yes")))   # explicit order
# Check prevalence
yrbss |>  count(overweight) |>
  mutate(pct = round(n / sum(n) * 100, 1))

# A tibble: 3 × 3
  overweight     n   pct
  <fct>      <int> <dbl>
1 No          9684  48.4
2 Yes         3858  19.3
3 <NA>        6458  32.3

Three things to notice:

bmi >= 25 is the WHO cut-off for overweight using a clinically meaningful threshold matters (don’t pick cut-offs by data-snooping).
levels = c("No", "Yes") makes “No” the reference category. R will model the probability of “Yes” - what we want.

Step 2 - Fit the model

model_log <- glm(overweight ~ age + sex + grade + race4,
                 data   = yrbss, family = binomial)
summary(model_log)


Call:
glm(formula = overweight ~ age + sex + grade + race4, family = binomial, 
    data = yrbss)

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -0.97364    0.50687  -1.921   0.0547 .  
age13 years old                -1.08840    0.66961  -1.625   0.1041    
age14 years old                -0.58921    0.50904  -1.157   0.2471    
age15 years old                -0.33808    0.50531  -0.669   0.5035    
age16 years old                -0.13499    0.50453  -0.268   0.7890    
age17 years old                 0.06457    0.50466   0.128   0.8982    
age18 years old or older        0.29058    0.50613   0.574   0.5659    
sexMale                         0.21353    0.03941   5.419 6.01e-08 ***
grade10th                      -0.09245    0.06773  -1.365   0.1723    
grade11th                      -0.11765    0.08165  -1.441   0.1496    
grade12th                      -0.17767    0.09343  -1.902   0.0572 .  
race4Black or African American  0.37905    0.05855   6.474 9.52e-11 ***
race4Hispanic/Latino            0.32089    0.05651   5.679 1.36e-08 ***
race4White                     -0.07208    0.05488  -1.313   0.1890    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 15542  on 12991  degrees of freedom
Residual deviance: 15278  on 12978  degrees of freedom
  (7008 observations deleted due to missingness)
AIC: 15306

Number of Fisher Scoring iterations: 4

Reading the raw `summary()` output

The output has four parts you should always check:

Section	What it tells you
Call	Confirms the formula and family - sanity check
Coefficients	Estimate (log-odds), SE, z-value, p-value
Deviance	Null vs residual - drop in deviance = how much variance the model explains
AIC	Lower is better; used to compare models

Warning

The Estimate column is in log-odds, not odds ratios. A coefficient of 0.45 means log-odds increase by 0.45 - which is exp(0.45) ≈ 1.57, an OR of 1.57. Always exponentiate before interpreting.

Tip

Reference categories (the level not shown in the output) are the comparison group. By default it’s the first factor level - for sex that’s “Female”, so the sexMale row compares males to females.

Step 3 - Get odds ratios with `broom`

broom::tidy() does the exponentiation, the CIs, and tidies everything into a data frame in one call:

broom::tidy(model_log, exponentiate = TRUE,   # log-odds → ORs
            conf.int     = TRUE,   # add 95% CI
            conf.level   = 0.95)

# A tibble: 14 × 7
   term                  estimate std.error statistic p.value conf.low conf.high
   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
 1 (Intercept)              0.378     0.507    -1.92    0.055    0.13      0.986
 2 age13 years old          0.337     0.67     -1.62    0.104    0.088     1.27 
 3 age14 years old          0.555     0.509    -1.16    0.247    0.211     1.62 
 4 age15 years old          0.713     0.505    -0.669   0.503    0.274     2.07 
 5 age16 years old          0.874     0.505    -0.268   0.789    0.336     2.53 
 6 age17 years old          1.07      0.505     0.128   0.898    0.41      3.09 
 7 age18 years old or o…    1.34      0.506     0.574   0.566    0.513     3.88 
 8 sexMale                  1.24      0.039     5.42    0        1.15      1.34 
 9 grade10th                0.912     0.068    -1.36    0.172    0.798     1.04 
10 grade11th                0.889     0.082    -1.44    0.15     0.758     1.04 
11 grade12th                0.837     0.093    -1.90    0.057    0.697     1.01 
12 race4Black or Africa…    1.46      0.059     6.47    0        1.30      1.64 
13 race4Hispanic/Latino     1.38      0.057     5.68    0        1.23      1.54 
14 race4White               0.93      0.055    -1.31    0.189    0.836     1.04

Now the estimate column is the OR, and conf.low / conf.high give the 95 % CI - ready to paste into a manuscript.

Step 4 - Publication table with `gtsummary`

gtsummary::tbl_regression(
  model_log,
  exponentiate = TRUE,                    # show ORs, not log-odds
  label = list(
    age   ~ "Age",
    sex   ~ "Sex",
    grade ~ "School grade",
    race4 ~ "Race / ethnicity")) |>
  bold_labels() |>
  bold_p(t = 0.05) |>
  add_glance_source_note(
    include = c(nobs, AIC))

Publication table with `gtsummary`

Characteristic	OR	95% CI	p-value
age
12 years old or younger	—	—
13 years old	0.34	0.09, 1.27	0.10
14 years old	0.55	0.21, 1.62	0.2
15 years old	0.71	0.27, 2.07	0.5
16 years old	0.87	0.34, 2.53	0.8
17 years old	1.07	0.41, 3.09	0.9
18 years old or older	1.34	0.51, 3.88	0.6
Sex
Female	—	—
Male	1.24	1.15, 1.34	<0.001
School grade
9th	—	—
10th	0.91	0.80, 1.04	0.2
11th	0.89	0.76, 1.04	0.15
12th	0.84	0.70, 1.01	0.057
Race / ethnicity
All other races	—	—
Black or African American	1.46	1.30, 1.64	<0.001
Hispanic/Latino	1.38	1.23, 1.54	<0.001
White	0.93	0.84, 1.04	0.2
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
No. Obs. = 12,992; AIC = 15,306

Interpreting your odds ratios

Pick one row at a time and translate it into a sentence. Template:

“Compared to [reference category], [predictor level] had OR-fold the odds of being overweight (95 % CI: low-high; p = X.XX), holding all other variables constant.”

Worked example - sexMale (let’s say OR = 1.55, 95 % CI 1.45-1.66):

Male students had 1.55 times the odds of being overweight compared to female students (95 % CI: 1.45-1.66; p < 0.001), adjusting for age, grade, and race / ethnicity.

Important

Three things every interpretation must include:

The reference group (“compared to females”)
The direction and magnitude (“1.55 times the odds”)
The adjustment set (“holding age, grade, race constant”)

Without these, an OR is meaningless.

Warning

OR is not relative risk. When the outcome is common (>10 %), the OR exaggerates the relative risk. With ~28 % overweight prevalence, an OR of 1.55 is not the same as “1.55× more likely” - be careful with that phrasing in plain-language summaries.

Basic Data Analysis with R

Learning Objectives

Part 1. Know Your Data: What is YRBSS?

Load packages, read the data

Inspect the structure first

Audit missing data - never skip this

Variable types determine your analysis

Descriptive Statistics: The four dimensions of any variable

What gtsummary does for you

Basic tbl_summary() code

Basic tbl_summary() output

Stratified tbl_summary() code

Stratified tbl_summary() output

Customising labels and statistics

How to read a gtsummary table

Exploratory Data Analysis

BMI distribution (histogram)

BMI density curves by sex

Box plots: comparing groups

Bar chart: categorical frequency

Stacked bars: composition within groups

What to Report

Testing for normality

Table or graph: when to use which

Quick-reference reporting card

Statistical Tests - three questions

Linear regression: BMI predicted by weight

Tidy regression output with broom

Multiple regression: adding sex and age

Publication table for the regression

Publication table for the regression

Logistic regression: when the outcome is yes / no

The logit link - modelling probabilities

Odds and odds ratios - the interpretation key

Step 1 - Create the binary outcome

Step 2 - Fit the model

Reading the raw summary() output

Step 3 - Get odds ratios with broom

Step 4 - Publication table with gtsummary

Publication table with gtsummary

Interpreting your odds ratios

What `gtsummary` does for you

Basic `tbl_summary()` code

Basic `tbl_summary()` output

Stratified `tbl_summary()` code

Stratified `tbl_summary()` output

How to read a `gtsummary` table

Tidy regression output with `broom`

Reading the raw `summary()` output

Step 3 - Get odds ratios with `broom`

Step 4 - Publication table with `gtsummary`

Publication table with `gtsummary`