Solution to ggplot tasks

1 Get started

First we need to load the packages needed to complete this task:

# install.packages("tidyverse")
# install.packages("palmerpenguins")
library(tidyverse)
library(palmerpenguins)
  • Have a look at the penguin data set
penguins
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 338 more rows
# ℹ 2 more variables: sex <fct>, year <int>

2 Exploratory plotting

2.1 Relationship between bill length and bill depth (scatterplot)

What is the relationship between bill length and bill depth?

First, I created a scatter plot and added a linear regression line. From the plot, It looks like bill length is decreasing with increasing bill depth.

# Bill length vs. bill depth scatterplot with regression line
ggplot(
  data = penguins,
  aes(
    x = bill_length_mm,
    y = bill_depth_mm
  )
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

# or short
ggplot(penguins, aes(bill_length_mm, bill_depth_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

If we add the color aesthetic locally to the point layer, only this layer will be affected by it. The regression line is not separated by species but still calculated and plotted for all data points together:

# Bill length vs. bill depth scatterplot with regression line
# color as aesthetic local to the point layer
ggplot(penguins, aes(bill_length_mm, bill_depth_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE)

We can see an example of the Simpson’s paradox here. If you don’t consider species, it looks like the bill depth decreases with bill length. But after separating the data by species, we see that the effect is actually the opposite.

To draw separate regression lines for the species, we need to either add the color aesthetic to the smooth layer as well, or define the color aesthetic globally in the top layer ggplot call:

# Define color aesthetic once globally
ggplot(penguins, aes(
  x = bill_length_mm,
  y = bill_depth_mm,
  color = species
)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

2.2 Difference in flipper length between species (boxplot)

Is there a difference in flipper length between the species?

First I created a simple boxplot with notches:

# Basic boxplot of flipper length with notches
ggplot(penguins, aes(species, flipper_length_mm)) +
  geom_boxplot(notch = TRUE)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

A geom_point with position = position_jitter() will add the individual data points to the plot. It’s important to set a seed here to get the same result for the point position on the x-axis every time. Otherwise your plot is not reproducible. I added width = 0.5 to make the jittering a bit narrower:

ggplot(penguins, aes(species, flipper_length_mm)) +
  geom_boxplot() +
  geom_point(position = position_jitter(
    seed = 123,
    width = 0.5
  ))

2.3 Differences between body mass of male and female penguins (boxplot)

Are male penguins heavier than female penguins? And is this different between the 3 species?

First, a basic boxplot of the body mass by sex:

#|warning: false
# Basic boxplot of body mass for penguins of different sex
ggplot(penguins, aes(x = sex, y = body_mass_g)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

I added species as color aesthetic:

#|warning: false
ggplot(penguins, aes(x = sex, 
                     y = body_mass_g)) +
  geom_boxplot(aes(color = species))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Alternatively, I could also specify the fill aesthetic:

#|warning: false
ggplot(penguins, aes(x = sex, y = body_mass_g)) +
  geom_boxplot(aes(fill = species))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Species as facets:

ggplot(penguins, aes(x = sex, y = body_mass_g)) +
  geom_boxplot() +
  facet_wrap(~species)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

I added a violin plot in the background to show the distribution of the datapoints. To make the violins visible, I changed the width of the boxplot to 0.4:

ggplot(penguins, aes(x = sex, y = body_mass_g)) +
  geom_violin() +
  geom_boxplot(width = .04) +
  facet_wrap(~species)

2.4 Distribution of flipper length between species (histogram)

Make a histogram of the the flipper length separated by species.

The default histogram is a histogram where the different groups are stacked:

ggplot(penguins, aes(
  x = flipper_length_mm,
  fill = species
)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

To unstack the groups, you have to use position = "identity". Also, it’s a good idea to make the histogram slightly transparent (alpha = 0.4) to see the overlapping areas.

ggplot(penguins, aes(
  x = flipper_length_mm,
  fill = species
)) +
  geom_histogram(
    alpha = 0.5,
    position = "identity"
  )

Separated by facets (no need to specify the position here, because there is only one group per plot).

ggplot(penguins, aes(
  x = flipper_length_mm,
  fill = species
)) +
  geom_histogram() +
  facet_wrap(~species, ncol = 1)

2.5 Penguin flipper length by species and sex (heatmap)

For this data it does not make too much sense, but a heat map would look like this:

ggplot(penguins, aes(
  x = species,
  y = sex,
  fill = flipper_length_mm
)) +
  geom_tile()

3 Beautify the plots

3.1 Beautify plots from Task 1

Here are just some examples of how to make the plots from before prettier. Of course there a many other options as well.

Example one: Boxplot of flipper length and species

ggplot(penguins, aes(species, flipper_length_mm, color = species)) +
  geom_boxplot(width = 0.3) +
  geom_point(
    alpha = 0.5,
    position = position_jitter(width = 0.1, seed = 123)
  ) +
  ggsci::scale_color_uchicago() +
  labs(x = "Species", y = "Flipper length (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

What was changed compared to the basic plot?

  • Add color for each species by setting a global color aesthetic
  • Make boxes and jitter points less wide by setting width for both layers
  • Make jitter points slightly transparent by specifying alpha = 0.5 for the jitter layer
  • Change the color to nicer colors from the ggsci package
  • Change from default theme to theme_minimal()
  • Remove the legend with legend.position = "none"
  • Change the axis labels with labs()

Example two: Reproducing the plot from the presentation

The following code is adapted from the palmerpengins package website.

ggplot(
  data = penguins,
  aes(
    x = bill_length_mm,
    y = bill_depth_mm,
    color = species,
    shape = species
  )
) +
  geom_point(size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(
    title = "Penguin bill dimensions",
    subtitle = "Bill length and depth for Adelie, Chinstrap and 
    Gentoo Penguins at Palmer Station LTER",
    x = "Bill length (mm)",
    y = "Bill depth (mm)",
    color = "Penguin species",
    shape = "Penguin species"
  ) +
  theme_minimal() +
  theme(
    legend.position = c(0.85, 0.15),
    legend.background = element_rect(fill = "white", color = NA)
  )

What was changed compared to the basic plot?

  • Make points larger and slightly transparent by setting size and alpha for the point layer
  • Change to custom color scale
  • Add title and subtitle with labs
  • Change title of x-axis, y-axis and legend for color and shape aesthetic with labs
  • Use theme_minimal() instead of default theme
  • Change legend position to bottom right corner within the plot
    • Positions are relative to the bottom left corner of the plot
    • 0.85 (85% of plot width) to the right
    • 0.15 (15% of plot height) towards the top

4 Save one of the plots on your machine

Example with one of the plots from above:

# First save the plot in a variable
flipper_box <- ggplot(penguins, aes(species, flipper_length_mm, color = species)) +
  geom_boxplot(width = 0.3) +
  geom_jitter(alpha = 0.5, position = position_jitter(width = 0.2, seed = 123)) +
  ggsci::scale_color_uchicago() +
  labs(x = "Species", y = "Flipper length (mm)") +
  theme_minimal() +
  theme(legend.position = "none")
# save as png in /img directory of the project
ggsave(filename = "img/flipper_box.png", plot = flipper_box)
# save as pdf in /img directory of the project
ggsave(filename = "img/flipper_box.pdf", plot = flipper_box)

5 Some more examples

Histogram

ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(alpha = 0.6) +
  ggsci::scale_fill_d3() +
  labs(
    y = "Frequency",
    x = "Flipper length [mm]",
    fill = "Penguin species"
  ) +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

Heat map

For the penguin data set a heat map does not make that much sense. But an example would be:

ggplot(penguins, aes(
  x = species,
  y = sex,
  fill = flipper_length_mm
)) +
  geom_tile() +
  scale_fill_viridis_c() +
  theme_classic()

5.1 The patchwork package

With the patchwork package, you can combine multiple ggplots into one plot. The package allows you to add annotations to the plot and to control the layout and appearance.

Below you find a simple example of two different penugin scatterplots. For more explanation and an overview of what is possible with the package, please have a look at the package documentation

library(patchwork)
# Collecting legends and defining a common theme
plot_1 <- ggplot(penguins, aes(
  x = bill_length_mm, y = bill_depth_mm,
  color = species
)) +
  geom_point()

plot_2 <- ggplot(penguins, aes(
  x = bill_length_mm, y = body_mass_g,
  color = species
)) +
  geom_point()

# Simple combination of 2 plots in patchwork
plot_1 + plot_2

And a more complex example where shared layers are defined for both plots:

# more complex combination with annotation and definition of shared layers
final_plot <- plot_1 + plot_2 +
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "a", tag_prefix = "(", tag_suffix = ")") &
  theme_minimal() &
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) &
  labs(
    color = "Penguin species"
  ) &
  theme(
    plot.tag.position = c(0.1, 0.95),
    plot.tag = element_text(face = "bold")
  )

final_plot

6 References

Check out the package website of the palmerpenguin package. They have more nice examples of data visualizations that you can do with ggplot.

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.