library(tidyverse) # data wrangling + ggplot2
library(gtsummary) # publication-ready tables
library(janitor) # clean_names(), tabyl()
yrbss <- read_csv("yrbss.csv")April 28 - May 1, 2026
By the end of this module you will be able to:
gtsummary workflowBefore you analyse anything, understand what you have
About the survey
bmiVariables in our dataset
| Variable | Type | Example |
|---|---|---|
record |
ID | 931897 |
age |
Categorical | 17 years old |
sex |
Categorical | Female / Male |
grade |
Categorical | 9th – 12th |
race4 |
Categorical | White / Hispanic… |
bmi |
Numeric | 13.2 – 53.9 |
stweight |
Numeric | Weight in kg |
Note
We use this dataset because it has the real problems analysts face every day: missing values, skewed distributions, mixed variable types, and large enough n that we can run real tests.
Why this order matters:
tidyverse first - gives you read_csv(), mutate(), filter(), plus pipesgtsummary next - for the descriptive table you’ll build laterjanitor last - small helpers; loaded last so its tabyl() doesn’t get maskedTip
Reproducibility rule: keep yrbss.csv next to your .qmd. Use relative paths ("yrbss.csv"), never absolute paths like "C:/Users/Me/Desktop/yrbss.csv" — they break the moment someone else opens your project.
Rows: 20,000
Columns: 11
$ record <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1…
$ age <fct> 15 years old, 17 years old, 18 years old or older, 15 year…
$ sex <fct> Female, Female, Male, Male, Male, Male, Male, Male, Male, …
$ grade <fct> 10th, 12th, 11th, 10th, 9th, 9th, 11th, 12th, 12th, 10th, …
$ race4 <fct> White, White, Hispanic/Latino, Black or African American, …
$ race7 <chr> "White", "White", "Hispanic/Latino", "Black or African Ame…
$ bmi <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.25…
$ stweight <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, …
$ obese <fct> Not obese, Not obese, NA, Not obese, Not obese, NA, Not ob…
$ bmi_cat <fct> Underweight, Normal, NA, Overweight, Normal, NA, Normal, N…
$ bmi_missing <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALS…
Why we always do this first:
glimpse() is the analyst’s first move on any new dataset. In one line you learn:
<dbl>, <chr>, <fct>?Catching a wrongly-typed variable here (e.g. age stored as text) saves hours of debugging later.
record age sex grade race4 race7
0 205 231 406 710 1187
bmi stweight obese bmi_cat bmi_missing
6458 6557 6458 6458 0
What this tells you:
bmi and stweight each have ~6,400 missing (~32%) - these are students who didn’t report height/weightage, sex, grade, race4 are <2% missing - safe for descriptive tables# A tibble: 1 × 11
record age sex grade race4 race7 bmi stweight obese bmi_cat bmi_missing
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 numeric fact… fact… fact… fact… char… nume… numeric fact… factor logical
The four R types you need to recognise:
| R type | Used for | Example variable |
|---|---|---|
dbl / int |
Continuous / count | bmi, stweight |
chr |
Text - recode to factor | raw age strings |
fct |
Categorical with levels | grade after factor() |
lgl |
TRUE / FALSE | bmi_missing |
Why type matters: gtsummary, ggplot2, and every statistical test treat numeric and categorical variables differently. If grade is stored as <chr> instead of <fct>, your tables sort alphabetically ("10th" before "9th") - wrong.
Summarise every variable before you model anything
| Dimension | What it captures | Statistic |
|---|---|---|
| Centre | Typical value | Mean, median, mode |
| Spread | Variability | SD, IQR, range |
| Shape | Symmetry & tails | Skewness, kurtosis |
| Frequency | Category counts | n, %, cumulative % |
What to report by variable type
Describe all variables before any modelling. A one-line summary per variable prevents major errors downstream.
gtsummary does for youadd_p() automatically picks the correct statistical testadd_overall() adds an all-participants columnas_gt() or as_flex_table()Tip
gtsummary is the single most time-saving package in R for descriptive tables.summarise() + pivot_longer() + formatting code.tbl_summary() codeWhat this does, line by line:
select(): pick the variables you want to describetbl_summary(): auto-detect types and summarise eachbold_labels(): make variable names stand outThe output is a publication-ready table. To save it for Word, pipe into as_flex_table() |> flextable::save_as_docx(path = "table1.docx").
tbl_summary() output| Characteristic | N = 20,0001 |
|---|---|
| sex | |
| Female | 9,592 (49%) |
| Male | 10,177 (51%) |
| Unknown | 231 |
| grade | |
| 9th | 5,219 (27%) |
| 10th | 4,907 (25%) |
| 11th | 4,891 (25%) |
| 12th | 4,577 (23%) |
| Unknown | 406 |
| bmi | 22.3 (20.1, 25.7) |
| Unknown | 6,458 |
| 1 n (%); Median (Q1, Q3) | |
tbl_summary() codeThe by = argument is the magic. It splits the table into one column per group and - combined with add_p() - automatically chooses:
tbl_summary() output| Characteristic | Overall N = 19,7691 |
Female N = 9,5921 |
Male N = 10,1771 |
p-value2 |
|---|---|---|---|---|
| grade | 0.12 | |||
| 9th | 5,176 (27%) | 2,492 (26%) | 2,684 (27%) | |
| 10th | 4,871 (25%) | 2,332 (25%) | 2,539 (25%) | |
| 11th | 4,861 (25%) | 2,365 (25%) | 2,496 (25%) | |
| 12th | 4,540 (23%) | 2,277 (24%) | 2,263 (23%) | |
| Unknown | 321 | 126 | 195 | |
| race4 | 0.8 | |||
| All other races | 4,696 (25%) | 2,260 (24%) | 2,436 (25%) | |
| Black or African American | 4,055 (21%) | 1,984 (21%) | 2,071 (21%) | |
| Hispanic/Latino | 4,625 (24%) | 2,252 (24%) | 2,373 (24%) | |
| White | 5,762 (30%) | 2,826 (30%) | 2,936 (30%) | |
| Unknown | 631 | 270 | 361 | |
| bmi | 22.3 (20.1, 25.7) | 22.1 (20.0, 25.3) | 22.5 (20.3, 25.8) | <0.001 |
| Unknown | 6,227 | 2,970 | 3,257 | |
| 1 n (%); Median (Q1, Q3) | ||||
| 2 Pearson’s Chi-squared test; Wilcoxon rank sum test | ||||
yrbss |> select(sex, bmi, stweight) |>
tbl_summary( by = sex,
statistic = list(
all_continuous() ~ "{median} ({p25}, {p75})",
all_categorical() ~ "{n} ({p}%)"),
digits = list(all_continuous() ~ 1),
label = list(
bmi ~ "Body Mass Index (kg/m²)",
stweight ~ "Body weight (kg)"),
missing = "ifany", missing_text = "Missing, n") |>
add_p(test = list(all_continuous() ~ "wilcox.test")) |>
add_overall() |> bold_labels() |>
modify_caption("**Table 1. Participant characteristics**")Tip
The label argument relabels variables for publication - no need to rename your data frame. {p25} / {p75} show explicit percentiles; use {mean} / {sd} for normally-distributed variables.
gtsummary table| Characteristic | Overall N = 19,7691 |
Female N = 9,5921 |
Male N = 10,1771 |
p-value2 |
|---|---|---|---|---|
| Body Mass Index (kg/m²) | 22.3 (20.1, 25.7) | 22.1 (20.0, 25.3) | 22.5 (20.3, 25.8) | <0.001 |
| Missing, n | 6,227 | 2,970 | 3,257 | |
| Body weight (kg) | 64.0 (55.8, 74.8) | 59.0 (52.2, 68.0) | 69.4 (61.2, 81.2) | <0.001 |
| Missing, n | 6,326 | 3,050 | 3,276 | |
| 1 Median (Q1, Q3) | ||||
| 2 Wilcoxon rank sum test | ||||
Note
Why we plot every variable first?
“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.”
- John Tukey, 1977
Four EDA principles
| Principle | What to do |
|---|---|
| 📊 Look first | Plot before you model |
| 📈 Univariate first | One variable, then pairs |
| 📋 Plot AND tabulate | Numbers give magnitude; graphs give shape |
| 📝 Document decisions | If you drop an outlier, write down why |
EDA checklist
med_bmi <- median(yrbss$bmi, na.rm = TRUE)
ggplot(yrbss, aes(x = bmi)) +
geom_histogram(binwidth = 2, fill = "#0D9488",
colour = "white") +
geom_vline(xintercept = med_bmi,
colour = "#EF4444", linewidth = 1,
linetype = "dashed") +
annotate("text", x = med_bmi + 1, y = 1700,
label = paste("Median =",
round(med_bmi, 1)),
colour = "#EF4444", hjust = 0) +
labs(title = "Distribution of BMI",
x = "BMI (kg/m²)", y = "Count")What to look for in any histogram:
For BMI: clear right-skew (long tail to the right) → use median, not mean.
Reading two density curves side-by-side:
A density plot is a smoothed histogram - the area under each curve sums to 1. So you can compare two groups of different sizes on the same axes.
This is what a single number like “mean BMI” would hide entirely.
Anatomy of a box plot:
yrbss |>
filter(!is.na(race4)) |>
count(race4) |> mutate(pct = n / sum(n) * 100) |>
ggplot(aes(y = reorder(race4, n), x = pct, fill = race4)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = sprintf("%.1f%%", pct)),
hjust = -0.1, size = 3.2) +
scale_x_continuous(limits = c(0, 36)) +
scale_fill_manual(values = c("#1B3A6B","#0D9488","#F59E0B","#EF4444")) +
labs(title = "Race / ethnicity distribution",
x = "Percentage (%)", y = NULL)Three rules for good bar charts:
reorder(): never alphabetical (unless ordinal like grade)Add value labels with geom_text() so readers don’t have to estimate from the axis.
yrbss |> filter(!is.na(grade), !is.na(sex)) |>
count(grade, sex) |>
group_by(grade) |>
mutate(pct = n / sum(n) * 100) |>
ggplot(aes(x = grade, y = pct, fill = sex)) +
geom_col(position = "fill") +
geom_text(aes(label = sprintf("%.0f%%", pct)),
position = position_fill(vjust = 0.5),
colour = "white", fontface = "bold", size = 3.5) +
scale_fill_manual(values = c("#1B3A6B","#0D9488")) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Sex distribution by grade",
subtitle = "Each bar sums to 100%",
x = "Grade", y = "Proportion", fill = "Sex")Stacked vs dodged: pick the right one:
position = "fill") - best for proportions within a group; each bar sums to 100%.position = "dodge") - best for absolute counts between groups; bars side-by-side.Use stacked when relative composition matters (here: sex balance per grade). Use dodged when raw counts matter.
Mean vs median
Use MEAN when
Reporting: Mean ± SD
Example: Height: 165.3 ± 8.2 cm
Use MEDIAN when
Reporting: Median (IQR)
Example: BMI: 22.3 (19.8-25.7) kg/m²
Important
Why mean fails with skewed data: a single student with BMI = 53.9 pulls the mean upward but does not affect the median. The median better represents the typical student.
p1 <- ggplot(yrbss, aes(bmi)) +
geom_histogram(bins = 30, fill = "#0D9488",
colour = "white") +
labs(title = "Histogram of BMI",
x = "BMI", y = "Count")
p2 <- ggplot(yrbss |>
filter(!is.na(bmi)),
aes(sample = bmi)) +
stat_qq(alpha = 0.3, size = 0.5,
colour = "#1B3A6B") +
stat_qq_line(colour = "#EF4444",
linewidth = 0.7) +
labs(title = "Q–Q plot of BMI",
x = "Theoretical", y = "Sample")
p1 + p2 # patchworkThree ways to check normality - use them together:
| Check | Result | Decision |
|---|---|---|
| Histogram | Nearly symmetric? | → Mean ± SD |
| Q–Q plot | Points on the line? | → Mean ± SD |
| Goal | Table | Graph |
|---|---|---|
| Exact values matter | Lookup by readers | Loses precision |
| Trend over time | Hard to see | Line plot |
| Compare many groups | Rows are clear | Small multiples |
| Distribution shape | Numbers hide shape | ** Histogram / violin** |
| Two-variable relationship | Hard to parse | Scatter plot |
| Frequency / prevalence | n (%) | Bar chart (≤ 5 groups) |
| Public presentation | Audiences skip tables | ** Graphs are memorable** |
Note
Practical rule: use a table when readers need to look up specific numbers. Use a graph when they need to understand a pattern, trend, or shape.
| Variable type | Example | Report as | Test | Best chart |
|---|---|---|---|---|
| Continuous, normal | Height, SBP | Mean ± SD | t-test / ANOVA | Histogram, box |
| Continuous, skewed | BMI, income | Median (IQR) | Wilcoxon / KW | Box, violin |
| Ordinal | Pain (1–10) | Median (IQR) | Wilcoxon / Spearman | Bar, heatmap |
| Nominal / Binary | Sex, race | n (%) | Chi-square / Fisher | Bar (sorted) |
| Count | Days absent | Median (IQR) | Poisson / NB | Histogram |
| Time-to-event | Survival | Median (95% CI) | Log-rank, Cox | Kaplan–Meier |
From describing the data to testing hypotheses
Before any test, ask in this order:
| Outcome | Groups | Distribution | Test | R function |
|---|---|---|---|---|
| Continuous | 2 indep. | Normal | t-test | t.test(bmi ~ sex) |
| Continuous | 3+ groups | Normal | ANOVA | aov(bmi ~ grade) |
| Categorical | 2×2+ | - | Chi-square | chisq.test(table(...)) |
| Continuous | continuous predictor | - | Linear regression | lm(bmi ~ stweight) |
| Binary outcome | any | - | Logistic regression | glm(..., family="binomial") |
Warning
The first question is the most important. Continuous outcome → t-test / regression world. Categorical outcome → chi-square / logistic world.
flowchart TD
Start(["<b>START</b><br/>What is your OUTCOME?"])
Start -->|Categorical| Cat["Categorical Outcome"]
Start -->|Continuous| Cont["Continuous Outcome"]
Start -->|Continuous predictor| Reg["<b>Linear Regression</b>"]
Cat --> Chi["<b>Chi-square</b><br/>Expected counts ≥ 5"]
Cat --> Fish["<b>Fisher's exact</b><br/>Expected counts < 5"]
Cont --> Norm{"Data normally<br/>distributed?"}
Norm -->|YES| Param["Parametric"]
Norm -->|NO| NonParam["Non-parametric"]
Param -->|2 groups| TTest["<b>Independent t-test</b>"]
Param -->|3+ groups| ANOVA["<b>One-way ANOVA</b><br/>+ Tukey post-hoc"]
NonParam -->|2 groups| Wilc["<b>Wilcoxon rank-sum</b>"]
NonParam -->|3+ groups| Krus["<b>Kruskal–Wallis</b><br/>+ Dunn post-hoc"]
classDef start fill:#FBF7EF,stroke:#1B3A6B,stroke-width:2px,color:#122849
classDef test fill:#0D9488,stroke:#0a6b62,color:#FBF7EF
classDef branch fill:#F4ECDC,stroke:#1B3A6B,color:#122849
classDef decision fill:#F59E0B,stroke:#a76b06,color:#122849
class Start start
class Reg,Chi,Fish,TTest,ANOVA,Wilc,Krus test
class Cat,Cont,Param,NonParam branch
class Norm decision
Call:
lm(formula = bmi ~ stweight, data = yrbss)
Residuals:
Min 1Q Median 3Q Max
-11.4075 -1.6816 -0.2079 1.4228 24.4669
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.560047 0.092984 70.55 <2e-16 ***
stweight 0.251396 0.001336 188.21 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.611 on 13439 degrees of freedom
(6559 observations deleted due to missingness)
Multiple R-squared: 0.725, Adjusted R-squared: 0.7249
F-statistic: 3.542e+04 on 1 and 13439 DF, p-value: < 2.2e-16
broom# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.56 0.0930 70.6 0 6.38 6.74
2 stweight 0.251 0.00134 188. 0 0.249 0.254
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.725 0.725 2.61 35421. 0 1 -31973. 63951. 63974.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom turns model output into a tidy data frame easy to filter, format, and pass to gtsummary or ggplot2.
tidy() - one row per coefficient (estimate, SE, p-value, CI)glance() - one row per model (R², AIC, n)augment() - adds fitted values + residuals to your data
Call:
lm(formula = bmi ~ stweight + sex + race4, data = yrbss)
Residuals:
Min 1Q Median 3Q Max
-13.7772 -1.3761 -0.1224 1.1726 25.1075
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.115324 0.086190 70.951 <2e-16 ***
stweight 0.279391 0.001225 228.160 <2e-16 ***
sexMale -2.809706 0.041274 -68.075 <2e-16 ***
race4Black or African American 0.061472 0.058664 1.048 0.295
race4Hispanic/Latino 0.557191 0.056076 9.936 <2e-16 ***
race4White -0.491424 0.052357 -9.386 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.214 on 13022 degrees of freedom
(6972 observations deleted due to missingness)
Multiple R-squared: 0.8025, Adjusted R-squared: 0.8024
F-statistic: 1.058e+04 on 5 and 13022 DF, p-value: < 2.2e-16
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| Body weight (kg) | 0.28 | 0.28, 0.28 | <0.001 |
| Sex | |||
| Female | — | — | |
| Male | -2.8 | -2.9, -2.7 | <0.001 |
| Ethinicity | |||
| All other races | — | — | |
| Black or African American | 0.06 | -0.05, 0.18 | 0.3 |
| Hispanic/Latino | 0.56 | 0.45, 0.67 | <0.001 |
| White | -0.49 | -0.59, -0.39 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.802; Adjusted R² = 0.802; AIC = 57,690; No. Obs. = 13,028 | |||
Tip
tbl_regression() automatically formats CIs, picks exponentiation for logistic models, and bolds significant p-values. One line replaces ~30 lines of manual table code.
So far our outcome (bmi) has been continuous. But many real questions are about a binary outcome:
Why we can’t just use linear regression here:
The fix: transform the probability so the linear part of the model has no upper or lower bound. That transformation is called the logit.
We don’t model the probability \(p\) directly. We model its log-odds:
\[\text{logit}(p) = \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots\]
Why this works
p / (1 - p) converts probability into odds (ranges 0 to ∞)log() makes it range from −∞ to +∞In R: glm(y ~ x, family = binomial) fits this model. The default link is logit. You almost never need to change it.
Call: glm(formula = overweight ~ age + sex + grade + race4, family = binomial,
data = yrbss)
Coefficients:
(Intercept) age13 years old
-0.97364 -1.08840
age14 years old age15 years old
-0.58921 -0.33808
age16 years old age17 years old
-0.13499 0.06457
age18 years old or older sexMale
0.29058 0.21353
grade10th grade11th
-0.09245 -0.11765
grade12th race4Black or African American
-0.17767 0.37905
race4Hispanic/Latino race4White
0.32089 -0.07208
Degrees of Freedom: 12991 Total (i.e. Null); 12978 Residual
(7008 observations deleted due to missingness)
Null Deviance: 15540
Residual Deviance: 15280 AIC: 15310
Odds = probability of event / probability of no event.
| Probability | Odds | In words |
|---|---|---|
| 0.10 | 0.11 (1 : 9) | “Long shot” |
| 0.50 | 1.00 (1 : 1) | “Even money” |
| 0.80 | 4.00 (4 : 1) | “Strongly favoured” |
Odds ratio (OR) = odds in group A / odds in group B.
| OR | Meaning |
|---|---|
| = 1 | No association - both groups have the same odds |
| > 1 | Group A has higher odds (stronger if larger) |
| < 1 | Group A has lower odds (stronger if smaller) |
Important
Coefficients from glm() are on the log-odds scale. To interpret them as odds ratios, you must exponentiate: exp(coef). We never report log-odds in a paper - always ORs with 95 % CI.
# A tibble: 3 × 3
overweight n pct
<fct> <int> <dbl>
1 No 9684 48.4
2 Yes 3858 19.3
3 <NA> 6458 32.3
Three things to notice:
bmi >= 25 is the WHO cut-off for overweight using a clinically meaningful threshold matters (don’t pick cut-offs by data-snooping).levels = c("No", "Yes") makes “No” the reference category. R will model the probability of “Yes” - what we want.
Call:
glm(formula = overweight ~ age + sex + grade + race4, family = binomial,
data = yrbss)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.97364 0.50687 -1.921 0.0547 .
age13 years old -1.08840 0.66961 -1.625 0.1041
age14 years old -0.58921 0.50904 -1.157 0.2471
age15 years old -0.33808 0.50531 -0.669 0.5035
age16 years old -0.13499 0.50453 -0.268 0.7890
age17 years old 0.06457 0.50466 0.128 0.8982
age18 years old or older 0.29058 0.50613 0.574 0.5659
sexMale 0.21353 0.03941 5.419 6.01e-08 ***
grade10th -0.09245 0.06773 -1.365 0.1723
grade11th -0.11765 0.08165 -1.441 0.1496
grade12th -0.17767 0.09343 -1.902 0.0572 .
race4Black or African American 0.37905 0.05855 6.474 9.52e-11 ***
race4Hispanic/Latino 0.32089 0.05651 5.679 1.36e-08 ***
race4White -0.07208 0.05488 -1.313 0.1890
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 15542 on 12991 degrees of freedom
Residual deviance: 15278 on 12978 degrees of freedom
(7008 observations deleted due to missingness)
AIC: 15306
Number of Fisher Scoring iterations: 4
summary() outputThe output has four parts you should always check:
| Section | What it tells you |
|---|---|
| Call | Confirms the formula and family - sanity check |
| Coefficients | Estimate (log-odds), SE, z-value, p-value |
| Deviance | Null vs residual - drop in deviance = how much variance the model explains |
| AIC | Lower is better; used to compare models |
Warning
The Estimate column is in log-odds, not odds ratios. A coefficient of 0.45 means log-odds increase by 0.45 - which is exp(0.45) ≈ 1.57, an OR of 1.57. Always exponentiate before interpreting.
Tip
Reference categories (the level not shown in the output) are the comparison group. By default it’s the first factor level - for sex that’s “Female”, so the sexMale row compares males to females.
broombroom::tidy() does the exponentiation, the CIs, and tidies everything into a data frame in one call:
# A tibble: 14 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.378 0.507 -1.92 0.055 0.13 0.986
2 age13 years old 0.337 0.67 -1.62 0.104 0.088 1.27
3 age14 years old 0.555 0.509 -1.16 0.247 0.211 1.62
4 age15 years old 0.713 0.505 -0.669 0.503 0.274 2.07
5 age16 years old 0.874 0.505 -0.268 0.789 0.336 2.53
6 age17 years old 1.07 0.505 0.128 0.898 0.41 3.09
7 age18 years old or o… 1.34 0.506 0.574 0.566 0.513 3.88
8 sexMale 1.24 0.039 5.42 0 1.15 1.34
9 grade10th 0.912 0.068 -1.36 0.172 0.798 1.04
10 grade11th 0.889 0.082 -1.44 0.15 0.758 1.04
11 grade12th 0.837 0.093 -1.90 0.057 0.697 1.01
12 race4Black or Africa… 1.46 0.059 6.47 0 1.30 1.64
13 race4Hispanic/Latino 1.38 0.057 5.68 0 1.23 1.54
14 race4White 0.93 0.055 -1.31 0.189 0.836 1.04
Now the estimate column is the OR, and conf.low / conf.high give the 95 % CI - ready to paste into a manuscript.
gtsummarygtsummary| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| age | |||
| 12 years old or younger | — | — | |
| 13 years old | 0.34 | 0.09, 1.27 | 0.10 |
| 14 years old | 0.55 | 0.21, 1.62 | 0.2 |
| 15 years old | 0.71 | 0.27, 2.07 | 0.5 |
| 16 years old | 0.87 | 0.34, 2.53 | 0.8 |
| 17 years old | 1.07 | 0.41, 3.09 | 0.9 |
| 18 years old or older | 1.34 | 0.51, 3.88 | 0.6 |
| Sex | |||
| Female | — | — | |
| Male | 1.24 | 1.15, 1.34 | <0.001 |
| School grade | |||
| 9th | — | — | |
| 10th | 0.91 | 0.80, 1.04 | 0.2 |
| 11th | 0.89 | 0.76, 1.04 | 0.15 |
| 12th | 0.84 | 0.70, 1.01 | 0.057 |
| Race / ethnicity | |||
| All other races | — | — | |
| Black or African American | 1.46 | 1.30, 1.64 | <0.001 |
| Hispanic/Latino | 1.38 | 1.23, 1.54 | <0.001 |
| White | 0.93 | 0.84, 1.04 | 0.2 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
| No. Obs. = 12,992; AIC = 15,306 | |||
Pick one row at a time and translate it into a sentence. Template:
“Compared to [reference category], [predictor level] had OR-fold the odds of being overweight (95 % CI: low-high; p = X.XX), holding all other variables constant.”
Worked example - sexMale (let’s say OR = 1.55, 95 % CI 1.45-1.66):
Male students had 1.55 times the odds of being overweight compared to female students (95 % CI: 1.45-1.66; p < 0.001), adjusting for age, grade, and race / ethnicity.
Important
Three things every interpretation must include:
Without these, an OR is meaningless.
Warning
OR is not relative risk. When the outcome is common (>10 %), the OR exaggerates the relative risk. With ~28 % overweight prevalence, an OR of 1.55 is not the same as “1.55× more likely” - be careful with that phrasing in plain-language summaries.
Basic Data Analysis