Wine Quality

The wine quality data set is a common benchmark for classification models. Here we use the DynaML Scala machine learning environment to train classifiers that distinguish 'good' wine from 'bad' wine. The UCI archive provides two files in the wine quality data set, winequality-red.csv and winequality-white.csv, and we train two separate classification models, one for red wine and one for white. A short listing of the data attributes (columns) is given below.


Attribute Information:

Inputs:

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

Output (based on sensory data):

  1. quality (score between 0 and 10)

Data Output Preprocessing

The wine quality target variable takes integer values from 0 to 10. We first convert it into a binary class variable: the quality is 'good' (encoded by the value 1) if the numerical value is greater than 6, and 'bad' (encoded by the value 0) otherwise.
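The thresholding rule above can be sketched in a few lines of Scala. This is only an illustration: the object name `QualityLabelSketch`, the function `binarize` and its `threshold` parameter are hypothetical and not part of DynaML.

```scala
object QualityLabelSketch {
  // Map a raw quality score (0 to 10) to a binary label:
  // 1 ('good') if the score is strictly greater than the threshold, 0 ('bad') otherwise.
  def binarize(quality: Int, threshold: Int = 6): Int =
    if (quality > threshold) 1 else 0

  // Apply the rule to a whole column of quality scores.
  def binarizeAll(qualities: Seq[Int], threshold: Int = 6): Seq[Int] =
    qualities.map(q => binarize(q, threshold))
}
```

Note that a score of exactly 6 falls in the 'bad' class, since the comparison is strict.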

Model

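Below are two classification models for predicting the quality label y: logit and probit. In both, \varphi(\mathbf{x}) denotes the feature map applied to the inputs, w the weight vector and b the bias.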

Logit

\begin{align}
P(y \ = 1 \ | \ \mathbf{x}) &= \sigma(w^T \varphi(\mathbf{x}) + b) \\
\sigma(z) &= \frac{1}{1 + \exp(-z)}
\end{align}
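As a minimal sketch of the logit formula, the snippet below evaluates the predicted probability for one example. It assumes the feature map has already been applied, so `features` stands for \varphi(\mathbf{x}); the names `LogitSketch`, `sigmoid` and `logitProbability` are illustrative and not DynaML API.

```scala
object LogitSketch {
  // Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // P(y = 1 | x) = sigma(w^T phi(x) + b), with `features` playing the role of phi(x)
  def logitProbability(w: Seq[Double], b: Double, features: Seq[Double]): Double = {
    require(w.length == features.length, "weights and features must have the same length")
    val score = w.zip(features).map { case (wi, fi) => wi * fi }.sum + b
    sigmoid(score)
  }
}
```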

Probit

The probit regression model is an alternative to the logit model. It replaces the sigmoid with the cumulative distribution function of the standard normal distribution:

\begin{align}
P(y \ = 1 \ | \ \mathbf{x}) &= \Phi(w^T \varphi(\mathbf{x}) + b) \\
\Phi(z) &= \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} \exp(-\frac{u^{2}}{2}) du
\end{align}
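The probit link has no closed form, so a sketch has to approximate the integral. The version below uses composite Simpson's rule on a truncated interval, which is a choice made here for illustration and not necessarily how DynaML evaluates \Phi. The names `ProbitSketch`, `normalPdf`, `normalCdf` and `probitProbability`, and the truncation point -10, are assumptions.

```scala
object ProbitSketch {
  // Standard normal density: exp(-u^2 / 2) / sqrt(2 pi)
  def normalPdf(u: Double): Double =
    math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.Pi)

  // Phi(z) approximated by composite Simpson's rule on [lower, z].
  // The probability mass below lower = -10 is negligible, so it is ignored.
  def normalCdf(z: Double, lower: Double = -10.0, n: Int = 2000): Double = {
    require(n > 0 && n % 2 == 0, "Simpson's rule needs a positive, even number of intervals")
    if (z <= lower) 0.0
    else {
      val h = (z - lower) / n
      val interior = (1 until n).map { i =>
        val weight = if (i % 2 == 1) 4.0 else 2.0
        weight * normalPdf(lower + i * h)
      }.sum
      (h / 3.0) * (normalPdf(lower) + interior + normalPdf(z))
    }
  }

  // P(y = 1 | x) = Phi(w^T phi(x) + b), with `features` playing the role of phi(x)
  def probitProbability(w: Seq[Double], b: Double, features: Seq[Double]): Double = {
    require(w.length == features.length, "weights and features must have the same length")
    normalCdf(w.zip(features).map { case (wi, fi) => wi * fi }.sum + b)
  }
}
```

Compared with the sigmoid, \Phi approaches 0 and 1 faster as |z| grows, so the probit model assigns more extreme probabilities to examples far from the decision boundary.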

Syntax

The TestLogisticWineQuality program in the examples package trains and tests logit and probit models on the wine quality data.

Parameter | Type | Default value | Notes
--------|-----------|-----------|------------|
training | Int | 100 | Number of training samples.
test | Int | 1000 | Number of test samples.
columns | List[Int] | 11, 0, ... , 10 | The columns to be selected for analysis (indexed from 0); the first one is the target column.
stepSize | Double | 0.01 | Step size chosen for GradientDescent.
maxIt | Int | 30 | Maximum number of iterations for the gradient descent update.
mini | Double | 1.0 | Fraction of training samples to sample for each batch update.
regularization | Double | 0.5 | Regularization parameter.
wineType | String | red | The type of wine: red or white.
modelType | String | logistic | The type of model: logistic or probit.
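To make the roles of stepSize, maxIt, mini and regularization concrete, here is a rough sketch of mini-batch gradient descent for a logistic model. It is not DynaML's implementation: the object `LogisticGdSketch`, the `seed` parameter, the per-sample random batch selection and the use of an L2 penalty for `regularization` are all assumptions made for illustration.

```scala
import scala.util.Random

object LogisticGdSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // data: (feature vector, label in {0, 1}); returns the learned (weights, bias).
  def train(
      data: Vector[(Vector[Double], Int)],
      stepSize: Double = 0.01,
      maxIt: Int = 30,
      mini: Double = 1.0,
      regularization: Double = 0.5,
      seed: Long = 42L): (Vector[Double], Double) = {
    require(data.nonEmpty, "training data must not be empty")
    val rng = new Random(seed)
    val dim = data.head._1.length
    var w = Vector.fill(dim)(0.0)
    var b = 0.0
    for (_ <- 0 until maxIt) {
      // mini = 1.0 uses the full training set; smaller values keep each sample
      // with probability `mini` to form the batch for this update.
      val batch = if (mini >= 1.0) data else data.filter(_ => rng.nextDouble() < mini)
      if (batch.nonEmpty) {
        // Prediction error (p - y) for each sample in the batch.
        val errors = batch.map { case (x, y) =>
          val p = sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum + b)
          (x, p - y)
        }
        // Average log-loss gradient plus the (assumed) L2 penalty gradient on the weights.
        val gradW = Vector.tabulate(dim) { j =>
          errors.map { case (x, e) => e * x(j) }.sum / batch.size + regularization * w(j)
        }
        val gradB = errors.map(_._2).sum / batch.size
        w = w.zip(gradW).map { case (wi, gi) => wi - stepSize * gi }
        b -= stepSize * gradB
      }
    }
    (w, b)
  }
}
```

In this sketch a larger stepSize or maxIt moves the weights further from their zero initialization, while a larger regularization value pulls them back toward zero.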

Red Wine

TestLogisticWineQuality(stepSize = 0.2, maxIt = 120,
mini = 1.0, training = 800,
test = 800, regularization = 0.2,
wineType = "red")
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:21:57 INFO BinaryClassificationMetrics: ============================
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Accuracy: 0.8475
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Area under ROC: 0.7968417788802267
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7493563745371187
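For readers unfamiliar with the reported metrics, the following sketch shows one common way to compute accuracy and the maximum F-measure from predicted scores and true labels. It is not the code behind BinaryClassificationMetrics; the object `MetricsSketch`, the 0.5 default threshold and the choice to scan the observed scores as candidate thresholds are assumptions.

```scala
object MetricsSketch {
  // scoresAndLabels: (predicted probability of class 1, true label in {0, 1})
  def accuracy(scoresAndLabels: Seq[(Double, Int)], threshold: Double = 0.5): Double = {
    require(scoresAndLabels.nonEmpty, "need at least one prediction")
    val correct = scoresAndLabels.count { case (s, y) => (if (s >= threshold) 1 else 0) == y }
    correct.toDouble / scoresAndLabels.size
  }

  // F-measure (F1) at a given threshold: 2 TP / (2 TP + FP + FN)
  def fMeasure(scoresAndLabels: Seq[(Double, Int)], threshold: Double): Double = {
    val tp = scoresAndLabels.count { case (s, y) => s >= threshold && y == 1 }
    val fp = scoresAndLabels.count { case (s, y) => s >= threshold && y == 0 }
    val fn = scoresAndLabels.count { case (s, y) => s < threshold && y == 1 }
    if (tp == 0) 0.0 else 2.0 * tp / (2.0 * tp + fp + fn)
  }

  // Maximum F-measure over all thresholds given by the observed scores.
  def maxFMeasure(scoresAndLabels: Seq[(Double, Int)]): Double = {
    require(scoresAndLabels.nonEmpty, "need at least one prediction")
    scoresAndLabels.map(_._1).distinct.map(t => fMeasure(scoresAndLabels, t)).max
  }
}
```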

Figure: ROC curve (red wine)

Figure: F-measure (red wine)

White Wine

TestLogisticWineQuality(stepSize = 0.26, maxIt = 300,
mini = 1.0, training = 3800,
test = 1000, regularization = 0.0,
wineType = "white")
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:27:17 INFO BinaryClassificationMetrics: ============================
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Accuracy: 0.829
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Area under ROC: 0.7184782682020251
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7182203962483446

Figure: ROC curve (white wine)

Figure: F-measure (white wine)

