# Wine Quality

The wine quality data set is a common benchmark for classification models. Here we use the DynaML Scala machine learning environment to train classifiers that distinguish 'good' wine from 'bad' wine. The UCI archive provides two files for this data set, winequality-red.csv and winequality-white.csv, and we train two separate classification models, one for red wine and one for white. A short listing of the data attributes/columns is given below.

## Attribute Information:

### Inputs:

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

### Output (based on sensory data):

1. quality (score between 0 and 10)

### Data Output Preprocessing

The wine quality target variable takes integer values from 0 to 10. We first convert it into a binary class variable: the quality is 'good' (encoded by the value 1) if the numerical value is greater than 6, and 'bad' (encoded by the value 0) otherwise.
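
The thresholding rule can be sketched in Scala as follows. This is a minimal illustration, not DynaML's own preprocessing code: the object `WineLabels` and its methods are hypothetical names, and the sketch assumes the UCI files are semicolon-delimited with the quality score in the last column.

```scala
// Hypothetical helper, not part of DynaML: binarizes the quality score.
object WineLabels {
  // 'good' (1) if the score is strictly greater than 6, 'bad' (0) otherwise.
  def binarize(quality: Int): Int = if (quality > 6) 1 else 0

  // Parses one data row (assumed semicolon-delimited, quality in the last
  // column) into a pair of (input features, binary label).
  def parseRow(line: String): (Array[Double], Int) = {
    val fields = line.split(";").map(_.trim)
    val features = fields.init.map(_.toDouble)
    val label = binarize(fields.last.toInt)
    (features, label)
  }
}
```

For example, `WineLabels.binarize(6)` gives 0 and `WineLabels.binarize(7)` gives 1, since only scores strictly above 6 count as 'good'.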

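## Model

Below are two classification models for predicting the binary quality label $y$ from the input vector $\mathbf{x}$. In both, $\varphi$ is a feature map, $w$ is the weight vector and $b$ is the bias.

### Logit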

\begin{align}
P(y = 1 \mid \mathbf{x}) &= \sigma(w^T \varphi(\mathbf{x}) + b) \\
\sigma(z) &= \frac{1}{1 + \exp(-z)}
\end{align}
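
As a rough illustration of the logit formula, the sketch below evaluates $P(y = 1 \mid \mathbf{x})$ in plain Scala. It assumes an identity feature map, $\varphi(\mathbf{x}) = \mathbf{x}$, and the object `LogitSketch` is a hypothetical name, not a DynaML class.

```scala
// Hypothetical sketch, not DynaML code.
object LogitSketch {
  // Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z)).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // P(y = 1 | x) = sigma(w^T x + b), assuming the identity feature map.
  def probability(w: Array[Double], b: Double, x: Array[Double]): Double = {
    require(w.length == x.length, "weight and feature vectors must have equal length")
    val score = w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
    sigmoid(score)
  }
}
```

A score of zero maps to a probability of 0.5, so the sign of $w^T \varphi(\mathbf{x}) + b$ decides which class is more likely.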

### Probit

The probit regression model is an alternative to the logit model. It replaces the sigmoid $\sigma$ with the cumulative distribution function $\Phi$ of the standard normal distribution:

\begin{align}
P(y = 1 \mid \mathbf{x}) &= \Phi(w^T \varphi(\mathbf{x}) + b) \\
\Phi(z) &= \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{t^{2}}{2}\right) dt
\end{align}
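
The integral defining $\Phi$ has no closed form, but it can be written as $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z / \sqrt{2})\right)$. Since the Scala standard library has no error function, the sketch below uses the Abramowitz and Stegun polynomial approximation (formula 7.1.26, absolute error below about $1.5 \times 10^{-7}$). The object `ProbitSketch` is a hypothetical name, the identity feature map is an assumption, and DynaML may evaluate $\Phi$ differently.

```scala
// Hypothetical sketch, not DynaML code.
object ProbitSketch {
  // Approximate error function (Abramowitz and Stegun 7.1.26).
  def erf(x: Double): Double = {
    val sign = if (x < 0) -1.0 else 1.0
    val ax = math.abs(x)
    val t = 1.0 / (1.0 + 0.3275911 * ax)
    // Horner evaluation of the degree-5 polynomial in t.
    val poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
      t * (-1.453152027 + t * 1.061405429))))
    sign * (1.0 - poly * math.exp(-ax * ax))
  }

  // Standard normal CDF expressed through erf.
  def phi(z: Double): Double = 0.5 * (1.0 + erf(z / math.sqrt(2.0)))

  // P(y = 1 | x) = Phi(w^T x + b), assuming the identity feature map.
  def probability(w: Array[Double], b: Double, x: Array[Double]): Double = {
    require(w.length == x.length, "weight and feature vectors must have equal length")
    phi(w.zip(x).map { case (wi, xi) => wi * xi }.sum + b)
  }
}
```

Like the sigmoid, $\Phi$ maps a zero score to 0.5, but its tails decay faster, so the probit model assigns more extreme probabilities to inputs far from the decision boundary.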

## Syntax

The TestLogisticWineQuality program in the examples package trains and tests logit and probit models on the wine quality data.

| Parameter | Type | Default value | Notes |
|--------|-----------|-----------|------------|
| training | Int | 100 | Number of training samples |
| test | Int | 1000 | Number of test samples |
| columns | List[Int] | 11, 0, ... , 10 | The columns to be selected for analysis (indexed from 0), first one is the target column. |
| stepSize | Double | 0.01 | Step size chosen for GradientDescent |
| maxIt | Int | 30 | Maximum number of iterations for gradient descent update. |
| mini | Double | 1.0 | Fraction of training samples to sample for each batch update. |
| regularization | Double | 0.5 | Regularization parameter. |
| wineType | String | red | The type of wine: red or white |
| modelType | String | logistic | The type of model: logistic or probit |
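
To make the roles of stepSize, maxIt, mini and regularization concrete, here is a minimal sketch of how such parameters typically enter mini-batch gradient descent for an L2-regularized logistic model. It is not DynaML's implementation: the object `GradientDescentSketch`, the `seed` argument, the sampling scheme and the exact form of the penalty are all assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical sketch, not DynaML code.
object GradientDescentSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // Returns the learned (weights, bias) for labels in {0, 1}.
  def train(
      data: Seq[(Array[Double], Int)],
      stepSize: Double,
      maxIt: Int,
      mini: Double,
      regularization: Double,
      seed: Long = 42L): (Array[Double], Double) = {
    require(data.nonEmpty, "training data must not be empty")
    val rng = new Random(seed)
    val dim = data.head._1.length
    var w = Array.fill(dim)(0.0)
    var b = 0.0
    for (_ <- 1 to maxIt) {
      // mini = 1.0 uses every sample; smaller values subsample each batch.
      val batch =
        if (mini >= 1.0) data else data.filter(_ => rng.nextDouble() < mini)
      if (batch.nonEmpty) {
        val gradW = Array.fill(dim)(0.0)
        var gradB = 0.0
        for ((x, y) <- batch) {
          // Gradient of the log loss with respect to the score.
          val err = sigmoid(dot(w, x) + b) - y
          for (j <- 0 until dim) gradW(j) += err * x(j)
          gradB += err
        }
        val n = batch.size.toDouble
        // Averaged gradient plus the L2 penalty term (bias left unpenalized).
        w = w.indices
          .map(j => w(j) - stepSize * (gradW(j) / n + regularization * w(j)))
          .toArray
        b -= stepSize * gradB / n
      }
    }
    (w, b)
  }
}
```

In this sketch a larger regularization value pulls the weights toward zero, while stepSize and maxIt together control how far the weights can move from their zero initialization.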

## Red Wine

TestLogisticWineQuality(stepSize = 0.2, maxIt = 120,
mini = 1.0, training = 800,
test = 800, regularization = 0.2,
wineType = "red")

16/04/01 15:21:57 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:21:57 INFO BinaryClassificationMetrics: ============================
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Accuracy: 0.8475
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Area under ROC: 0.7968417788802267
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7493563745371187


## White Wine

TestLogisticWineQuality(stepSize = 0.26, maxIt = 300,
mini = 1.0, training = 3800,
test = 1000, regularization = 0.0,
wineType = "white")

16/04/01 15:27:17 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:27:17 INFO BinaryClassificationMetrics: ============================
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Accuracy: 0.829
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Area under ROC: 0.7184782682020251
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7182203962483446
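
The three metrics reported in the logs above can be understood through the following sketch, which computes them from pairs of (predicted probability, true label). It does not reproduce DynaML's BinaryClassificationMetrics: the object `MetricsSketch` is a hypothetical name, the F measure is assumed to be the balanced F1 score, and the threshold used for the logged accuracy is not stated in the source.

```scala
// Hypothetical sketch, not DynaML code.
object MetricsSketch {
  // Fraction of samples whose thresholded prediction matches the label.
  def accuracy(sl: Seq[(Double, Int)], threshold: Double): Double = {
    val correct = sl.count { case (s, y) => (if (s >= threshold) 1 else 0) == y }
    correct.toDouble / sl.size
  }

  // F1 score at a given threshold: 2 TP / (2 TP + FP + FN).
  def fMeasure(sl: Seq[(Double, Int)], threshold: Double): Double = {
    val tp = sl.count { case (s, y) => s >= threshold && y == 1 }
    val fp = sl.count { case (s, y) => s >= threshold && y == 0 }
    val fn = sl.count { case (s, y) => s < threshold && y == 1 }
    if (tp == 0) 0.0 else 2.0 * tp / (2.0 * tp + fp + fn)
  }

  // Largest F1 over all thresholds taken from the observed scores.
  def maxFMeasure(sl: Seq[(Double, Int)]): Double =
    sl.map(_._1).distinct.map(t => fMeasure(sl, t)).max

  // AUC as the probability that a random positive outscores a random
  // negative, counting ties as one half.
  def areaUnderROC(sl: Seq[(Double, Int)]): Double = {
    val pos = sl.collect { case (s, 1) => s }
    val neg = sl.collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need both classes to compute AUC")
    val wins = for (p <- pos; n <- neg)
      yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

Accuracy depends on a single threshold, whereas the area under the ROC curve and the maximum F measure summarize behavior across all thresholds, which is why the three numbers in each log can differ noticeably.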