Wine Quality Analysis

1 The Data

The dataset used in this analysis comes from Cortez et al. (2009), who model wine preferences from physicochemical properties. You can load the data directly from the ecodata R package on GitHub:

library(tidyverse)
# Raw URL of the .rda file (named wine_url to avoid shadowing base::url())
wine_url <- "https://raw.githubusercontent.com/TheoreticalEcology/ecodata/master/data/wine.rda"

# Load the .rda file directly from the URL; this creates the object `wine`
load(url(wine_url))

# Convert the data to a tibble
wine <- as_tibble(wine)

# Inspect the data
wine
# A tibble: 1,599 × 12
   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
           <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
 1          12.7            0.6          0.49            2.8     0.075
 2           9.8            0.66         0.39            3.2     0.083
 3           6.5            0.88         0.03           NA       0.079
 4           8.6            0.52         0.38            1.5     0.096
 5           7.5            0.58         0.14            2.2     0.077
 6           7.6            0.5          0.29            2.3    NA    
 7          10.1            0.935        0.22            3.4     0.105
 8           6.4            0.4         NA               1.6     0.066
 9           6.1            0.58         0.23            2.5     0.044
10           6.7            0.46         0.24            1.7     0.077
# ℹ 1,589 more rows
# ℹ 7 more variables: free.sulfur.dioxide <dbl>, total.sulfur.dioxide <dbl>,
#   density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <int>

Before analyzing the data, clean the variable names and make sure the quality variable is treated as a factor.

To prepare the data:

  • Clean the variable names with janitor::clean_names().
  • Convert the quality column to a factor before plotting, using dplyr::mutate() and as.factor().
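The two preparation steps can be combined into one small pipeline. The following is a minimal sketch, not a definitive solution: the helper name `prepare_wine` is hypothetical, and the code assumes the data frame has a column called `quality`, as in the tibble printed above.

```r
library(dplyr)
library(janitor)

# Hypothetical helper (illustrative name, not part of any package):
# cleans the column names and stores quality as a factor.
prepare_wine <- function(data) {
  data %>%
    # clean_names() converts names such as `fixed.acidity` to `fixed_acidity`
    clean_names() %>%
    # A factor makes plotting functions treat quality as discrete categories
    mutate(quality = as.factor(quality))
}

# Usage, assuming `wine` was loaded as shown above:
# wine_clean <- prepare_wine(wine)
# names(wine_clean)
```

Note that after cleaning, later code has to refer to the new snake_case names (for example `fixed_acidity` instead of `fixed.acidity`).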

2 Questions

1. How do physicochemical properties relate to wine quality?

  • Try different visualizations of all the variables
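One possible starting point is a grid of boxplots, one panel per property, grouped by quality score. This is only a sketch of one visualization among many: the helper name `plot_by_quality` is hypothetical, and the code assumes that every column other than `quality` is numeric, as in the tibble printed above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper (illustrative name): one boxplot panel per
# physicochemical property, with one box per quality score.
plot_by_quality <- function(data) {
  data %>%
    mutate(quality = as.factor(quality)) %>%
    # Long format: one row per (wine, property) measurement
    pivot_longer(cols = -quality, names_to = "property", values_to = "value") %>%
    # Drop missing measurements explicitly instead of relying on ggplot2 warnings
    drop_na(value) %>%
    ggplot(aes(x = quality, y = value)) +
    geom_boxplot() +
    # Free y scales because the properties are measured on different scales
    facet_wrap(~ property, scales = "free_y") +
    labs(x = "Quality score", y = "Measured value")
}

# Usage, assuming `wine` was loaded as shown above:
# plot_by_quality(wine)
```

Density plots or scatterplots colored by quality would follow the same long-format pattern with a different geom.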

2. What is the correlation between the physicochemical properties?

  • First calculate the correlation matrix, then visualize it.
  • Tip: Use corrplot() to visualize the correlation matrix.

3. How can Principal Component Analysis (PCA) be used to understand the data?

  • Tip: Use prcomp() for PCA and visualize with fviz_pca_var().

3 Useful Functions

  • janitor::clean_names(): To clean variable names.
  • cor(): To calculate correlation matrices.
  • drop_na(): To drop rows with missing values before computing the correlation matrix or PCA.
  • corrplot() from the corrplot package: To visualize correlation matrices.
  • ggcorrplot() from the ggcorrplot package: An alternative visualization for correlation matrices.
  • prcomp(): To perform PCA.
  • fviz_pca_var(), fviz_pca_ind(), fviz_pca_biplot() from the factoextra package: For PCA visualization.

4 Example analysis

Here is an example of a correlation plot:
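A plot of this kind could be produced roughly as follows. This is a minimal sketch under stated assumptions: the helper name `plot_property_correlations` is hypothetical, `quality` is excluded so that only the physicochemical properties are correlated, rows with missing values are dropped first, and the styling arguments passed to corrplot() are one arbitrary choice.

```r
library(dplyr)
library(tidyr)
library(corrplot)

# Hypothetical helper (illustrative name): computes the Pearson correlation
# matrix of the physicochemical properties and draws it with corrplot().
plot_property_correlations <- function(data) {
  cor_mat <- data %>%
    select(-any_of("quality")) %>%   # leave out the response variable
    select(where(is.numeric)) %>%    # cor() needs numeric columns
    drop_na() %>%                    # with default settings, cor() returns NA when values are missing
    cor()                            # Pearson correlation is the default method

  corrplot(cor_mat, method = "circle", type = "upper", tl.col = "black")

  # Return the matrix invisibly so it can be inspected or reused
  invisible(cor_mat)
}

# Usage, assuming `wine` was loaded as shown above:
# cor_mat <- plot_property_correlations(wine)
# round(cor_mat, 2)
```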

Here is an example of a PCA result:
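A result of this kind could be obtained roughly as follows. This is a sketch, not the definitive analysis: the helper name `run_wine_pca` is hypothetical, `quality` is left out so that the PCA describes only the physicochemical properties, and centering and scaling are assumed to be appropriate because the variables are measured on very different scales.

```r
library(dplyr)
library(tidyr)
library(factoextra)

# Hypothetical helper (illustrative name): PCA on the physicochemical
# properties after removing rows with missing values.
run_wine_pca <- function(data) {
  data %>%
    select(-any_of("quality")) %>%   # leave out the response variable
    select(where(is.numeric)) %>%    # prcomp() needs numeric input
    drop_na() %>%                    # prcomp() fails on missing values by default
    # Center and scale so that no variable dominates because of its units
    prcomp(center = TRUE, scale. = TRUE)
}

# Usage, assuming `wine` was loaded as shown above:
# pca <- run_wine_pca(wine)
# summary(pca)        # proportion of variance explained per component
# fviz_pca_var(pca)   # directions of the variables on the first two components
```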

5 Reference

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.