Data Visualization

Leykun (MSc)¹, Tesfamichael (MSc)² & Yebelay (MSc)³

¹NDMC, EPHI; ²SPH, AAU; ³DMU & C4ED


October 14 - 17, 2025

What is data visualization?

  • Data visualization is the presentation of data in a pictorial or graphical format.
  • A data visualization tool is the software that generates this presentation.
  • Effective data visualization gives users intuitive means to
    • interactively explore and analyze data,
    • identify interesting patterns,
    • infer correlations and possible causal relationships, and
    • support sense-making activities.
  • Good visual presentation tends to strengthen the message of the data.

Important packages to create figures

A few packages to create figures in R are

  • ggplot2: The core package for creating graphics based on the grammar of graphics.
  • cowplot for composing ggplots
  • ggtext for advanced text rendering
  • ggthemes for additional themes
  • grid for creating graphical objects
  • gridExtra for additional functions built on grid
  • patchwork for multi-panel plots
  • ggiraph for interactive visualizations
  • highcharter for interactive visualizations
  • plotly for interactive visualizations

The basic components of a plot using the ggplot2 package

  • ggplot2 is a system for declaratively creating graphics, based on the Grammar of Graphics.

  • You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

Why ggplot2?

  • A grammar of graphics is a grammar used to describe and create a wide range of statistical graphics.
  • The promise of a grammar for graphics.
  • Easy to manage, save, etc.
  • Graphs are composed of layers.
  • Easy to add stuff to existing graphs.
  • ggplot2 takes less work to produce attractive, eye-catching graphics.
  • Enables the creation of reproducible visualization patterns.
  • Publication quality & beyond

ggplot2 mechanics: the basics

A ggplot is built up from a few basic elements:

  1. Data: The raw data that you want to plot.
  2. Geometries geom_: The geometric shapes that will represent the data.
  3. Aesthetics aes(): Aesthetics of the geometric and statistical objects, such as position, color, size, shape, and transparency
  4. Scales scale_: Maps between the data and the aesthetic dimensions, such as data range to plot width or factor values to colors.
  5. Statistical transformations stat_: Statistical summaries of the data, such as quantiles, fitted curves, and sums.
  6. Coordinate system coord_: The transformation used for mapping data coordinates into the plane of the data rectangle.
  7. Facets facet_: The arrangement of the data into a grid of plots.
  8. Visual themes theme(): The overall visual defaults of a plot, such as background, grids, axes, default typeface, sizes and colors.
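
As a rough illustration of how the eight elements above combine, here is a minimal sketch. It assumes the mpg data set that ships with ggplot2 as a stand-in; the object name p_elements and the chosen palette and limits are illustrative only.

```r
library(ggplot2)

# mpg ships with ggplot2; it is used here only as a stand-in data set
p_elements <- ggplot(
  data = mpg,                                          # 1. data
  mapping = aes(x = displ, y = hwy, colour = drv)      # 3. aesthetics
) +
  geom_point(alpha = 0.6) +                            # 2. geometry
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                            # 5. statistical transformation
  scale_colour_brewer(palette = "Dark2",
                      name = "Drive") +                # 4. scale
  coord_cartesian(ylim = c(10, 45)) +                  # 6. coordinate system
  facet_wrap(~ year) +                                 # 7. facets
  theme_minimal(base_size = 12)                        # 8. theme

p_elements
```

Each line after the first adds one element, so any of them can be removed or swapped without touching the others.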

Components of the layered grammar

  • Layer
    • Data
    • Mapping
    • Statistical transformation (stat)
    • Geometric object (geom)
    • Position adjustment (position)
  • Scale
  • Coordinate system (coord)
  • Faceting (facet)
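
To make these components concrete, the sketch below writes the same scatter plot twice: once with the usual geom_point() shorthand and once with every component spelled out through layer(). The built-in mtcars data set and the object names p_short and p_full are assumptions for illustration.

```r
library(ggplot2)

# Shorthand: geom_point() fills in the default stat and position
p_short <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# The same plot with each grammar component written out explicitly
p_full <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # data and mapping
  layer(
    geom = "point",          # geometric object
    stat = "identity",       # statistical transformation (none)
    position = "identity"    # position adjustment (none)
  ) +
  scale_x_continuous() +     # scales
  scale_y_continuous() +
  coord_cartesian() +        # coordinate system
  facet_null()               # faceting (a single panel)

p_full
```

In everyday use the shorthand is preferable; the long form is only meant to show which defaults ggplot2 supplies.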

Data

  • Data defines the source of the information to be visualized.
  • Must be a data.frame (or a tibble)
  • Gets pulled into the ggplot() object

Aesthetics (aes()) (a.k.a. mapping)

  • x, y: variables mapped to the axes
  • colour: colour of points, lines, and outlines of geometries
  • fill: fill colour of geometries
  • group: grouping of observations based on the data
  • shape: shape of a point, an integer value 0 to 25, or NA
  • linetype: type of line, an integer value 0 to 6 or a string
  • size: size of elements, a non-negative numeric value
  • alpha: transparency, a numeric value 0 to 1
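
A common point of confusion is the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()). The following sketch shows both, assuming the built-in mtcars data set; the object names are illustrative.

```r
library(ggplot2)

# Mapped: colour varies with a variable, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)), size = 3, alpha = 0.7)

# Set: one fixed value for every point, so it goes outside aes()
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue", shape = 17, size = 3, alpha = 0.7)

p_mapped
p_set
```

Mapped aesthetics get a legend and a scale; fixed values do not.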

Data

Code
# data and aesthetics
ggplot(data, mapping = aes(x, y, ...))
  • Shape values: [figure: point shapes and their integer codes]
  • Line type values: [figure: line types and their codes]
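
The shape and line type codes can also be drawn directly in R. This is a minimal sketch, not the original figures: the data frames shape_df and linetype_df and the layout choices are made up for illustration.

```r
library(ggplot2)

# Point shapes 0 to 25, laid out on a small grid
shape_df <- data.frame(
  shape = 0:25,
  x = 0:25 %% 9,
  y = -(0:25 %/% 9)
)

p_shapes <- ggplot(shape_df, aes(x = x, y = y)) +
  # fill only affects the fillable shapes 21 to 25
  geom_point(aes(shape = shape), size = 5, fill = "grey70") +
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) +
  scale_shape_identity() +
  theme_void()

# Line types 0 to 6 and their equivalent string names
linetype_df <- data.frame(
  code = 0:6,
  name = c("blank", "solid", "dashed", "dotted",
           "dotdash", "longdash", "twodash")
)

p_linetypes <- ggplot(linetype_df, aes(y = code)) +
  geom_segment(aes(x = 0, xend = 1, yend = code, linetype = name)) +
  geom_text(aes(x = -0.05, label = paste(code, "=", name)),
            hjust = 1, size = 3) +
  scale_linetype_identity() +
  scale_y_reverse() +
  xlim(-0.6, 1) +
  theme_void()

p_shapes
p_linetypes
```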

Geometries (geom_*()) function

The general syntax is:

  • ggplot(data = data, mapping = aes(mappings)) + geom_function()

  • Geom Components

    Geom                     Description                                    Input
    geom_histogram           Histogram                                      Continuous x
    geom_bar                 Bar plot of frequencies                        Discrete x
    geom_point               Points / scatter plot                          Discrete or continuous x and y
    geom_boxplot             Box plot                                       Discrete x and continuous y
    geom_smooth              Smoothed conditional mean / regression line    Continuous x and y
    geom_line                Line plot                                      Discrete or continuous x and y
    geom_abline              Reference line                                 intercept and slope values
    geom_hline, geom_vline   Horizontal and vertical reference lines        yintercept or xintercept
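
Several geoms from the table can be stacked on one plot. The sketch below assumes the built-in mtcars data set; the intercept and slope passed to geom_abline() are arbitrary values chosen only to show the arguments.

```r
library(ggplot2)

p_geoms <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                            # scatter plot
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # regression line
  geom_hline(yintercept = mean(mtcars$mpg),
             linetype = "dashed") +                         # horizontal reference
  geom_vline(xintercept = mean(mtcars$wt),
             linetype = "dotted") +                         # vertical reference
  geom_abline(intercept = 37, slope = -5,
              colour = "red")                               # arbitrary reference line

p_geoms
```

Layers are drawn in the order they are added, so later geoms sit on top of earlier ones.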

Practice with ggplot2

  • dataset to practice: palmerpenguins

We will use the palmerpenguins data set:

This data set contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

Let us take a look at the variables in the penguins data set:

Code
library(palmerpenguins)
library(dplyr)   # provides glimpse()
data(penguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Practice with ggplot2

  • species, island, and sex are factor variables,
  • the bill measurements (bill length and bill depth) are numeric (double) variables,
  • flipper length, body mass, and year are integer variables.
  • Prepare data for ggplot2
  • ggplot2 requires you to prepare the data as an object of class data.frame or tibble (common in the tidyverse).

Practice with ggplot2

More complex plots in ggplot2 require the long data frame format.
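
As one possible illustration of the long format, the sketch below reshapes three measurement columns of penguins with tidyr::pivot_longer() so that a single plot can facet over them. The names penguins_long, measurement, and value_mm are assumptions introduced here.

```r
library(tidyr)
library(ggplot2)

penguins <- palmerpenguins::penguins

# Wide to long: one row per penguin and measurement
penguins_long <- pivot_longer(
  penguins,
  cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm),
  names_to = "measurement",
  values_to = "value_mm"
)

# With long data, one facet_wrap() call produces a panel per measurement
ggplot(penguins_long, aes(x = value_mm, fill = species)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ measurement, scales = "free_x") +
  theme_minimal()
```

The wide table has 344 rows; the long one has three rows per penguin, one for each measurement.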

  • Scientific questions about penguins

  • Is there a relationship between the length & the depth of bills?

  • Do bill size and flipper size vary together?

  • How are these measures distributed among the 3 penguin species?

How can we graphically address these questions with ggplot2?

ggplot() layers

Code
library(ggplot2)
ggplot(data = penguins)

Code
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm))

Code
ggplot(data = penguins,
       aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

Code
ggplot(data = penguins,
       aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +  facet_wrap(~species) 

Let us explore how some of this data is structured by species:

Code
ggplot(data = penguins,               
       aes(x = bill_length_mm, y = bill_depth_mm, col = species)) +          
  geom_point(alpha = 0.8) + geom_smooth(method = "lm") +
  theme_minimal()

Customize Our Plot

  • Here are some key aspects you can customize:

Axes, Titles and Legends

Title and axes components: changing size, colour and face
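
A minimal sketch of this kind of customization, assuming the penguins data: labs() sets the text, and theme() with element_text() changes its size, colour, and face. The specific sizes and colours below are arbitrary choices, not values from the original slides.

```r
library(ggplot2)

penguins <- palmerpenguins::penguins

p_custom <- ggplot(penguins,
                   aes(x = bill_length_mm, y = bill_depth_mm,
                       colour = species)) +
  geom_point(na.rm = TRUE) +
  labs(
    title = "Bill length vs. bill depth",
    x = "Bill length (mm)",
    y = "Bill depth (mm)",
    colour = "Species"          # legend title
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold",
                              colour = "navy", hjust = 0.5),
    axis.title = element_text(size = 12, face = "italic"),
    axis.text = element_text(size = 10, colour = "grey30"),
    legend.position = "bottom"
  )

p_custom
```

Add theme() after any complete theme such as theme_minimal(); otherwise the complete theme overwrites the custom settings.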

Visualizing Distributions

How you visualize a variable’s distribution depends on its type: categorical or numerical.

  • Categorical Variables: Use bar charts to show the frequency of each category.
  • Numerical Variables: Use histograms or density plots to show the shape, center, and spread of the data.

Part 1: Categorical Variables

The Bar Chart

The most common way to visualize a single categorical variable is with a bar chart, using geom_bar().

Code
ggplot(penguins, aes(x = species)) +
  geom_bar(aes(fill = species)) +
  theme_minimal() +
  labs(title = "Basic Bar Chart of Penguin Species") +
  guides(fill = "none") # Hide legend

Ordering Categories

By default, bars follow the order of the factor levels, which is alphabetical for species. For better comparisons, it’s often useful to order them by frequency using forcats::fct_infreq().

Code
library(forcats)  # provides fct_infreq()
ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar(aes(fill = species)) +
  theme_minimal() +
  labs(
    title = "Bar Chart Ordered by Frequency",
    x = "Species"
  ) +
  guides(fill = "none")

Example: Simple bar chart using ESPA, sick child data

  • Facility type distribution
Code
library(tidyverse)
library(haven)
library(labelled)
library(forcats)
sc_data <- read_dta("data/ESPA_2021/ETSC81FLSP.DTA") 
sc_data_fct <- to_factor(sc_data)

ggplot(sc_data_fct, aes(x = fct_infreq(cfactype))) +
  geom_bar(fill = "#1f77b4", width = 0.5) +
  geom_text(stat = 'count', aes(label = after_stat(count)), 
            vjust = -0.5, size = 4) +
  labs(
    title = "Number of Sick-Child Consultations by Facility Type",
    x = "Facility type",
    y = "Count of consultations",
    caption = "Source: ESPA 2021 — Sick child module") +
  theme_classic(base_size = 15)

Horizontal Bar Charts

Code
# Horizontal bar chart of penguin species, most frequent at the top
ggplot(penguins, aes(x = fct_rev(fct_infreq(species)), fill = species)) +
  geom_bar(show.legend = FALSE, width = 0.6) +
  coord_flip() +  # Flip coordinates for horizontal bars
  labs(title = "Penguin Species Count",
       x = "Species",
       y = "Count") +
  theme_bw(base_size = 14)

Two categorical variables

Stacked bar plots

Code
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(width = 0.7) +
  theme_bw()
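
Stacking is only the default position adjustment. As a sketch of two common alternatives, position = "dodge" places the bars side by side and position = "fill" rescales each bar to proportions; the object names and the y-axis label are illustrative.

```r
library(ggplot2)

penguins <- palmerpenguins::penguins

# Side-by-side bars: compare species counts within each island
p_dodge <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "dodge", width = 0.7) +
  theme_bw()

# Proportional bars: each island sums to 100%
p_fill <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share of penguins") +
  theme_bw()

p_dodge
p_fill
```

Dodged bars make absolute counts easier to compare, while filled bars make the species mix per island easier to compare.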

Part 2: Numerical Variables

The Histogram

Histograms are the standard way to view the distribution of a single numerical variable. They show “bins” of data to reveal the underlying frequency.

Code
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30, fill = "#2E86AB", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Penguin Body Mass")

The Density Plot

A density plot is a smoothed version of a histogram. It’s great for comparing distributions between groups.

Multiple Groups

Code
ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density(linewidth = 0.75)+
  theme_minimal()+
  labs(title = "Body Mass by Species")

Part 3: Relationships Between Variables

Numerical vs. Categorical

How does a numerical variable’s distribution change across different categories? Boxplots are perfect for this.

Code
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  theme_minimal() +
  guides(fill = "none") +
  labs(
    title = "Body Mass Distribution by Species",
    x = "Species", y = "Body Mass (g)"
  )

Two Numerical Variables

A scatter plot is the classic way to show the relationship between two numerical variables.

Code
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "blue", alpha = 0.8) +
  theme_minimal() +
  labs(
    title = "Flipper Length vs. Body Mass",
    x = "Flipper Length (mm)", y = "Body Mass (g)"
  )

Three or More Variables

We can add more variables to a plot using aesthetics like color and shape, or by using facets.

Using Aesthetics (color, shape)

Code
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island)) +
  theme_minimal() +
  labs(title = "Adding Species (Color) and Island (Shape)")

Using Facets (facet_wrap)

Facets create sub-plots for each category of a variable.

Code
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), size = 2) +
  facet_wrap(~island) +
  theme_bw() +
  labs(title = "Flipper Length vs. Body Mass, Faceted by Island")

Saving Your Plots

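ggsave() saves the most recently displayed plot by default; pass a plot object to its plot argument to save a specific one. You can specify the filename, dimensions, and resolution; the format is taken from the file extension.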

Code
# First, create the plot
penguin_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  theme_minimal()

# Now, save it to a file
ggsave(
  filename = "penguin_plot.png", 
  plot = penguin_plot,
  width = 8, 
  height = 6,
  dpi = 300 # Set resolution for high quality
)


This will save a file named penguin_plot.png in your working directory.

Part 4: Time Series Data

The Line Chart

To visualize how a numerical variable changes over time, we use a line chart with geom_line(). The economics dataset from ggplot2 is perfect for this.

Code
# The 'economics' dataset is included with ggplot2
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "blue", linewidth = 1) +
  theme_minimal()

Time series plot

Code
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "#1f77b4", linewidth = 0.6) +  # Classic blue color
  labs(
    title = "US Unemployment Over Time",
    subtitle = "Number of unemployed (in thousands)",
    x = "Year",
    y = "Unemployed (thousands)",
    caption = "Source: US Economic Time Series Data"
  ) +
  scale_x_date(date_breaks = "5 years", date_labels = "%Y") +
  scale_y_continuous(labels = scales::comma) +
  theme_bw() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "gray50"),
    # panel.grid.major = element_line(color = "gray90", linewidth = 0.2),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_line(linewidth = 0.2)
  )
p

Interactive Plots

  • Interactive plots enhance user experience with dynamic, engaging graphics

  • The plotly package easily converts ggplot2 plots into interactive versions

Code
library(plotly)
ggplotly(p)

Exercise

Task: Create a customized bar chart showing the count of penguin species.

Requirements:

  1. Order the species by frequency, with the most frequent on the right.
  2. Add the count as a bold, red label above each bar.
  3. Use the following custom colors as shown on the figure
  4. Center the plot title and make it bold.
  5. Increase the font size for the plot title and axis text.

Exercise

Task: Create a customized bar chart showing the count of facility types using the ESPA, facility data.

Requirements:

  1. Order the facility types by frequency, with the most frequent on the top.
  2. Add the count as a bold, blue label to the right of each bar.
  3. Center the plot title and make it bold.
  4. Increase the font size for the plot title and axis text.

Resources