Wine Quality Analysis

1 The Data

The dataset used in this analysis is from Cortez et al. (2009), who model wine preferences from physicochemical properties. You can read the data directly from the ecodata R package on GitHub:

library(tidyverse)

# Raw URL of the .rda file in the ecodata repository
# (named wine_url so it does not shadow the base function url())
wine_url <- "https://raw.githubusercontent.com/TheoreticalEcology/ecodata/master/data/wine.rda"

# Load the .rda file directly from the URL; this creates the object `wine`
load(url(wine_url))

# Convert the data to a tibble
wine <- as_tibble(wine)

# Inspect the data
wine

# A tibble: 1,599 × 12
   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
           <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
 1          12.7            0.6          0.49            2.8     0.075
 2           9.8            0.66         0.39            3.2     0.083
 3           6.5            0.88         0.03           NA       0.079
 4           8.6            0.52         0.38            1.5     0.096
 5           7.5            0.58         0.14            2.2     0.077
 6           7.6            0.5          0.29            2.3    NA    
 7          10.1            0.935        0.22            3.4     0.105
 8           6.4            0.4         NA               1.6     0.066
 9           6.1            0.58         0.23            2.5     0.044
10           6.7            0.46         0.24            1.7     0.077
# ℹ 1,589 more rows
# ℹ 7 more variables: free.sulfur.dioxide <dbl>, total.sulfur.dioxide <dbl>,
#   density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <int>

Before analyzing, clean the variable names and ensure the quality variable is treated as a factor.

To prepare the data:

Convert the quality column to a factor before plotting: use dplyr::mutate() together with as.factor().
Clean the variable names with janitor::clean_names().
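
The two preparation steps can be sketched as follows. This is a minimal illustration, not the only way to do it. To keep the snippet runnable on its own it uses a small stand-in called wine_toy (a hypothetical name): the first three rows of two columns from the printout above, plus placeholder quality scores that are not taken from the real data. In the exercise, apply the same pipeline to the wine tibble.

```r
library(dplyr)
library(janitor)

# Stand-in data: two columns copied from the printout above.
# The quality values are placeholders (assumption), not real scores.
wine_toy <- tibble::tibble(
  fixed.acidity  = c(12.7, 9.8, 6.5),
  residual.sugar = c(2.8, 3.2, NA),
  quality        = c(5L, 6L, 5L)
)

wine_clean <- wine_toy %>%
  clean_names() %>%                     # fixed.acidity becomes fixed_acidity
  mutate(quality = as.factor(quality))  # integer score becomes a factor

names(wine_clean)
levels(wine_clean$quality)
```

Treating quality as a factor makes plotting functions handle it as discrete groups rather than as a continuous number.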

2 Questions

1. How do physicochemical properties relate to wine quality?

Try different visualizations of all the variables, for example one panel per property grouped by quality.
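
One possible starting point, among many, is a boxplot of each property per quality level. The sketch below reshapes the data to long format and draws one panel per property. The data are simulated (wine_toy and its values are assumptions made only so the snippet runs without the download); with the real data you would start from the cleaned wine tibble.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in data (assumption): two properties and a quality factor
set.seed(1)
wine_toy <- tibble::tibble(
  alcohol   = rnorm(60, mean = 10, sd = 1),
  sulphates = rnorm(60, mean = 0.65, sd = 0.15),
  quality   = factor(sample(5:7, 60, replace = TRUE))
)

# Long format: one row per (wine, property) pair
wine_long <- wine_toy %>%
  pivot_longer(-quality, names_to = "property", values_to = "value")

# One panel per property; free y scales because the units differ
p <- ggplot(wine_long, aes(x = quality, y = value)) +
  geom_boxplot() +
  facet_wrap(~ property, scales = "free_y") +
  labs(x = "Quality score", y = "Measured value")

print(p)
```

Swapping geom_boxplot() for geom_violin() or geom_jitter() gives alternative views of the same relationship.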

2. What is the correlation between the physicochemical properties?

For this, first calculate the correlation matrix and then visualize it.
Tip: Use corrplot() to visualize the correlation matrix.
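
A minimal sketch of that two-step workflow is shown below. It keeps only the numeric columns, drops rows with missing values, computes the correlation matrix with cor(), and passes it to corrplot(). The data are simulated: wine_toy, its values, and the built-in negative relation between alcohol and density are assumptions for illustration, not results from the wine data.

```r
library(dplyr)
library(tidyr)
library(corrplot)

# Simulated stand-in data (assumption), with a few missing values
set.seed(1)
n <- 100
alcohol <- rnorm(n, mean = 10, sd = 1)
wine_toy <- tibble::tibble(
  alcohol   = alcohol,
  density   = 1 - 0.002 * alcohol + rnorm(n, sd = 0.001),
  sulphates = rnorm(n, mean = 0.65, sd = 0.15),
  quality   = factor(sample(5:7, n, replace = TRUE))
)
wine_toy$sulphates[c(3, 8)] <- NA

# Step 1: correlation matrix of the numeric columns, complete rows only
cor_mat <- wine_toy %>%
  select(where(is.numeric)) %>%  # leaves out the quality factor
  drop_na() %>%
  cor()

# Step 2: visualize the matrix (upper triangle only)
corrplot(cor_mat, method = "circle", type = "upper")
```

Dropping incomplete rows with drop_na() is the approach suggested in this document; cor() also has a use argument (for example use = "pairwise.complete.obs") if you prefer to keep partially observed rows.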

3. How can Principal Component Analysis (PCA) be used to understand the data?

Tip: Use prcomp() for PCA and visualize with fviz_pca_var().
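
The tip above can be sketched as follows. The example runs prcomp() on the numeric columns after removing rows with missing values, then plots the variable loadings with fviz_pca_var(). Scaling is switched on because the properties are measured in different units. As before, wine_toy and its simulated values are assumptions used only to make the snippet self-contained.

```r
library(dplyr)
library(tidyr)
library(factoextra)

# Simulated stand-in data (assumption), with a few missing values
set.seed(1)
n <- 100
alcohol <- rnorm(n, mean = 10, sd = 1)
wine_toy <- tibble::tibble(
  alcohol   = alcohol,
  density   = 1 - 0.002 * alcohol + rnorm(n, sd = 0.001),
  sulphates = rnorm(n, mean = 0.65, sd = 0.15),
  quality   = factor(sample(5:7, n, replace = TRUE))
)
wine_toy$sulphates[c(3, 8)] <- NA

# PCA on centered and scaled numeric variables, complete rows only
pca <- wine_toy %>%
  select(where(is.numeric)) %>%
  drop_na() %>%
  prcomp(center = TRUE, scale. = TRUE)

summary(pca)                     # variance explained per component
fviz_pca_var(pca, repel = TRUE)  # arrows show how variables load on PC1/PC2
```

In the variable plot, arrows pointing in similar directions indicate positively correlated variables, and arrows pointing in opposite directions indicate negatively correlated ones.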

3 Useful Functions

janitor::clean_names(): To clean variable names.
cor(): To calculate correlation matrices.
drop_na() from the tidyr package: To drop rows with missing values before computing the correlation or PCA.
corrplot() from the corrplot package: To visualize correlation matrices.
- Find the package documentation here
ggcorrplot() from the ggcorrplot package: Alternative visualization for correlation matrices.
- Find the package documentation here
prcomp(): To perform PCA.
fviz_pca_var(), fviz_pca_ind(), fviz_pca_biplot() from the factoextra package: For PCA visualization.
- Find the package documentation here

4 Example analysis

Here is an example of a correlation plot:

Here is an example of a PCA result:

5 Reference

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.