The Housing data set is a popular regression benchmarking data set hosted on the UCI Machine Learning Repository. It contains 506 records consisting of multivariate data attributes for various real estate zones and their housing price indices. The task is then to learn a regression model that can predict the price index or range.
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000's
Below is a GP model for predicting the MEDV
TestGPHousing() program can be run in the REPL, below is a description of each of its arguments.
Parameter | Type | Default value |Notes
CovarianceFunction | - | The kernel function driving the GP model.
CovarianceFunction | - | The additive noise that corrupts the values of the latent function.
Double | 0.75 | Fraction of the data to be used for model training and hyper-parameter selection.
List[Int] | 13, 0,.., 12 | The columns to be selected for analysis (indexed from 0), first one is the target column.
Int | 5 | The number of grid points for each hyper-parameter
Double| 0.2| The space between grid points.
String | ML | The model selection procedure
"GS", "CSA", or
Double | 0.01 | Only relevant if
globalOpt = "ML", determines step size of steepest ascent.
Int | 300 | Maximum iterations for ML model selection procedure.
TestGPHousing( kernel = new FBMKernel(0.55) + new LaplacianKernel(2.5), noise = new RBFKernel(1.5), grid = 5, step = 0.03, globalOpt = "GS", trainFraction = 0.45)
16/03/03 20:45:41 INFO GridSearch: Optimum value of energy is: 278.1603309851301 Configuration: Map(hurst -> 0.4, beta -> 2.35, bandwidth -> 1.35) 16/03/03 20:45:41 INFO SVMKernel$: Constructing kernel matrix.
16/03/03 20:45:42 INFO GPRegression: Generating error bars 16/03/03 20:45:42 INFO RegressionMetrics: Regression Model Performance: MEDV 16/03/03 20:45:42 INFO RegressionMetrics: ============================ 16/03/03 20:45:42 INFO RegressionMetrics: MAE: 5.800070254265218 16/03/03 20:45:42 INFO RegressionMetrics: RMSE: 7.739266267762397 16/03/03 20:45:42 INFO RegressionMetrics: RMSLE: 0.4150438478412412 16/03/03 20:45:42 INFO RegressionMetrics: R^2: 0.3609909626630624 16/03/03 20:45:42 INFO RegressionMetrics: Corr. Coefficient: 0.7633838930006132 16/03/03 20:45:42 INFO RegressionMetrics: Model Yield: 0.7341944950376289 16/03/03 20:45:42 INFO RegressionMetrics: Std Dev of Residuals: 6.287519509352036
Below is the example program as a github gist, to view the original program in DynaML, click here.