

Wine Quality

The wine quality data set is a common benchmark for classification models. Here we use the DynaML Scala machine learning environment to train classifiers that distinguish 'good' wine from 'bad' wine. A short listing of the data attributes (columns) is given below. The UCI archive provides two files in the wine quality data set, winequality-red.csv and winequality-white.csv. We train two separate classification models, one for red wine and one for white.


Attribute Information:

Inputs:

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

Output (based on sensory data):

quality (score between 0 and 10)

Data Output Preprocessing

The wine quality target variable takes integer values from 0 to 10. We first convert it into a binary class variable: the quality is 'good' (encoded by the value 1) if the score is greater than 6, and 'bad' (encoded by the value 0) otherwise.
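The thresholding rule above can be sketched in plain Scala as follows. This is an illustrative helper only, not DynaML's own preprocessing code; the names `WineLabels`, `binarize` and `example` are hypothetical.

```scala
// Hypothetical helper illustrating the labelling rule; not part of DynaML.
object WineLabels {
  // Scores strictly greater than the threshold are 'good' (1), all others 'bad' (0).
  def binarize(quality: Int, threshold: Int = 6): Int =
    if (quality > threshold) 1 else 0

  // Scores 5, 6, 7 and 8 map to 0, 0, 1 and 1 respectively.
  val example: Seq[Int] = Seq(5, 6, 7, 8).map(q => binarize(q))
}
```

Note that a score of exactly 6 falls in the 'bad' class, because the comparison is strict.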

Model

Below are two classification models for predicting the quality label $y$.

Logit

$\begin{align} P(y = 1 \ | \ \mathbf{x}) &= \sigma(w^T \varphi(\mathbf{x}) + b) \\ \sigma(z) &= \frac{1}{1 + \exp(-z)} \end{align}$
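As a minimal sketch of how the logit probability is evaluated, the snippet below computes $\sigma(w^T \varphi(\mathbf{x}) + b)$ in plain Scala. It assumes an identity feature map $\varphi(\mathbf{x}) = \mathbf{x}$, which the text does not specify, and the names `LogitSketch`, `sigmoid` and `probability` are hypothetical rather than DynaML API.

```scala
// Illustrative sketch of the logit model; names are hypothetical, not DynaML API.
object LogitSketch {
  // Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z)).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // P(y = 1 | x), assuming the identity feature map phi(x) = x.
  def probability(w: Array[Double], b: Double, x: Array[Double]): Double = {
    require(w.length == x.length, "weights and features must have equal length")
    val score = w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
    sigmoid(score)
  }
}
```

A score of zero gives a probability of exactly 0.5, so the sign of $w^T \varphi(\mathbf{x}) + b$ decides which class is more likely.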

Probit

The probit regression model is an alternative to the logit model. It replaces the logistic sigmoid with the cumulative distribution function of the standard normal distribution:

$\begin{align} P(y = 1 \ | \ \mathbf{x}) &= \Phi(w^T \varphi(\mathbf{x}) + b) \\ \Phi(z) &= \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{t^{2}}{2}\right) dt \end{align}$
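The integral defining $\Phi$ has no closed form, so in code it is usually evaluated through the error function, using $\Phi(z) = \frac{1}{2}\left(1 + \mathrm{erf}(z / \sqrt{2})\right)$. The Scala standard library has no `erf`, so the sketch below substitutes the Abramowitz and Stegun polynomial approximation (formula 7.1.26, absolute error of roughly 1.5e-7). This is an assumption made for illustration and may differ from what DynaML does internally; the names `ProbitSketch`, `erf`, `phi` and `probability` are hypothetical, and the feature map is again assumed to be the identity.

```scala
// Illustrative sketch of the probit link; not DynaML's implementation.
object ProbitSketch {
  // Approximate error function (Abramowitz & Stegun 7.1.26), extended to
  // negative arguments through the odd symmetry erf(-x) = -erf(x).
  def erf(x: Double): Double = {
    val sign = if (x < 0) -1.0 else 1.0
    val ax = math.abs(x)
    val t = 1.0 / (1.0 + 0.3275911 * ax)
    val poly = t * (0.254829592 + t * (-0.284496736 +
      t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))))
    sign * (1.0 - poly * math.exp(-ax * ax))
  }

  // Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
  def phi(z: Double): Double = 0.5 * (1.0 + erf(z / math.sqrt(2.0)))

  // P(y = 1 | x), assuming the identity feature map phi(x) = x.
  def probability(w: Array[Double], b: Double, x: Array[Double]): Double = {
    require(w.length == x.length, "weights and features must have equal length")
    phi(w.zip(x).map { case (wi, xi) => wi * xi }.sum + b)
  }
}
```

Like the logistic sigmoid, $\Phi$ maps a score of zero to a probability of 0.5, but its tails approach 0 and 1 faster.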

Syntax

The TestLogisticWineQuality program in the examples package trains and tests logit and probit models on the wine quality data.

Parameter | Type | Default value | Notes
--------|-----------|-----------|------------
training | Int | 100 | Number of training samples.
test | Int | 1000 | Number of test samples.
columns | List[Int] | 11, 0, ... , 10 | The columns to be selected for analysis (indexed from 0); the first one is the target column.
stepSize | Double | 0.01 | Step size chosen for GradientDescent.
maxIt | Int | 30 | Maximum number of iterations for the gradient descent update.
mini | Double | 1.0 | Fraction of training samples to sample for each batch update.
regularization | Double | 0.5 | Regularization parameter.
wineType | String | red | The type of wine: red or white.
modelType | String | logistic | The type of model: logistic or probit.
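To give a feel for how `stepSize`, `maxIt` and `regularization` interact, here is a generic full-batch gradient descent sketch for a logit model. It is not DynaML's `GradientDescent` implementation. It assumes an L2 penalty on the weights (the source does not state the form of the regularizer), an identity feature map, and `mini = 1.0`, meaning every sample is used in each update. The names `GradientDescentSketch` and `train` are hypothetical.

```scala
// Generic sketch of regularized gradient descent for a logit model.
// Assumptions: L2 penalty, identity feature map, full batch (mini = 1.0).
object GradientDescentSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // data holds (features, label in {0, 1}); returns (weights, bias).
  def train(
      data: Seq[(Array[Double], Int)],
      stepSize: Double = 0.01,
      maxIt: Int = 30,
      regularization: Double = 0.5): (Array[Double], Double) = {
    require(data.nonEmpty, "training data must not be empty")
    val dim = data.head._1.length
    val n = data.size.toDouble
    var w = Array.fill(dim)(0.0)
    var b = 0.0
    for (_ <- 0 until maxIt) {
      // Accumulate the gradient of the negative log-likelihood over all samples.
      val gradW = Array.fill(dim)(0.0)
      var gradB = 0.0
      for ((x, y) <- data) {
        val score = w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
        val err = sigmoid(score) - y
        for (j <- 0 until dim) gradW(j) += err * x(j)
        gradB += err
      }
      // Step against the averaged gradient; the L2 term shrinks the weights.
      w = w.indices.map(j => w(j) - stepSize * (gradW(j) / n + regularization * w(j))).toArray
      b -= stepSize * gradB / n
    }
    (w, b)
  }
}
```

In this sketch a larger `stepSize` or `maxIt` moves the weights further from their zero initialization, while a larger `regularization` pulls them back towards zero.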

Red Wine

TestLogisticWineQuality(stepSize = 0.2, maxIt = 120,
mini = 1.0, training = 800,
test = 800, regularization = 0.2,
wineType = "red")

16/04/01 15:21:57 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:21:57 INFO BinaryClassificationMetrics: ============================
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Accuracy: 0.8475
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Area under ROC: 0.7968417788802267
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7493563745371187

Figures: ROC curve (red-roc) and F-measure curve (red-fmeasure) for the red wine model.

White Wine

TestLogisticWineQuality(stepSize = 0.26, maxIt = 300,
mini = 1.0, training = 3800,
test = 1000, regularization = 0.0,
wineType = "white")

16/04/01 15:27:17 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:27:17 INFO BinaryClassificationMetrics: ============================
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Accuracy: 0.829
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Area under ROC: 0.7184782682020251
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7182203962483446
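The accuracy and maximum F-measure values in the logs above can be understood through the following minimal sketch, which computes both from a list of predicted probabilities and true labels. It is not the `BinaryClassificationMetrics` implementation: the choice of candidate thresholds (the distinct observed scores) is an assumption, the area under the ROC curve is not covered, and the names `MetricsSketch`, `confusion`, `accuracy`, `fMeasure` and `maxFMeasure` are hypothetical.

```scala
// Illustrative sketch of threshold-based metrics; not DynaML's implementation.
object MetricsSketch {
  // Each pair is (predicted probability of 'good', true label in {0, 1}).
  type Scored = Seq[(Double, Int)]

  // Counts (tp, fp, fn, tn) when scores >= threshold are predicted as 1.
  def confusion(data: Scored, threshold: Double): (Int, Int, Int, Int) = {
    val tp = data.count { case (s, y) => s >= threshold && y == 1 }
    val fp = data.count { case (s, y) => s >= threshold && y == 0 }
    val fn = data.count { case (s, y) => s < threshold && y == 1 }
    val tn = data.count { case (s, y) => s < threshold && y == 0 }
    (tp, fp, fn, tn)
  }

  // Fraction of samples classified correctly at the given threshold.
  def accuracy(data: Scored, threshold: Double): Double = {
    require(data.nonEmpty, "data must not be empty")
    val (tp, _, _, tn) = confusion(data, threshold)
    (tp + tn).toDouble / data.size
  }

  // F-measure = 2 tp / (2 tp + fp + fn); taken as 0 when there are no true positives.
  def fMeasure(data: Scored, threshold: Double): Double = {
    val (tp, fp, fn, _) = confusion(data, threshold)
    if (tp == 0) 0.0 else 2.0 * tp / (2.0 * tp + fp + fn)
  }

  // Largest F-measure over all thresholds equal to an observed score.
  def maxFMeasure(data: Scored): Double = {
    require(data.nonEmpty, "data must not be empty")
    data.map(_._1).distinct.map(t => fMeasure(data, t)).max
  }
}
```

Because 'good' wines are a minority once the quality score is thresholded at 6, accuracy alone can look favourable even for a weak classifier, which is one reason to read it alongside the F-measure and the area under the ROC curve.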