Global Model Selection Routines¶
These routines are also known as global optimizers; paradigms/algorithms such as genetic algorithms, Gibbs sampling, simulated annealing and evolutionary optimization fall under this category. They can be used in situations where the objective function is not "smooth".
In DynaML they are most prominently used for hyper-parameter optimization in kernel based learning methods. All global optimizers in DynaML extend the GlobalOptimizer trait, which means they provide an implementation of its optimize method.
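As a rough sketch of that contract (the name and simplified signature below are assumptions for illustration, not copied from the DynaML sources), a global optimizer looks roughly like the following.

//Illustrative sketch only: an optimizer holds the system being tuned
//and searches for a good hyper-parameter configuration.
trait GlobalOptimizerSketch[M] {
  //the model/system whose hyper-parameters are being tuned
  val system: M

  //search for a good configuration starting from initialConfig and
  //return the tuned system along with the selected configuration
  def optimize(
    initialConfig: Map[String, Double],
    options: Map[String, String]): (M, Map[String, Double])
}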
In order to use a global optimization routine on a model, the model implementation in question must extend the GloballyOptimizable trait in the dynaml.optimization package. This trait has a single method, energy, which must be implemented by all sub-classes/traits.
The energy method calculates the value of the global objective function for a particular configuration, i.e. for particular values of the model hyper-parameters. This objective function can be defined differently for each model class (marginal likelihood for Gaussian Processes, cross validation score for parametric models, etc.).
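As a purely illustrative sketch (the names and simplified signatures below are assumptions for illustration, not taken from the DynaML sources), a model exposing its hyper-parameters and an energy function might look like this.

//Illustrative sketch only: a toy "model" with a single hyper-parameter.
//Real DynaML models compute quantities such as the negative log
//marginal likelihood or a cross validation score.
trait GloballyOptimizableSketch {
  //current values of the model hyper-parameters
  var currentState: Map[String, Double]

  //value of the global objective for a given hyper-parameter configuration
  def energy(hyperParams: Map[String, Double]): Double
}

//made-up objective: penalize bandwidths far from 2.5
class ToyModel extends GloballyOptimizableSketch {
  var currentState: Map[String, Double] = Map("bandwidth" -> 1.0)

  def energy(hyperParams: Map[String, Double]): Double =
    math.pow(hyperParams("bandwidth") - 2.5, 2.0)
}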
The following model selection routines are available in DynaML so far.
Grid Search¶
The most elementary (naive) method of model selection is to evaluate the model's performance (the value returned by energy) on a fixed grid of points initialized for the model hyper-parameters.
val kernel = ...
val noise = ...
val data = ...
val model = new GPRegression(kernel, noise, data)

//number of grid points and spacing along each hyper-parameter
val grid = 5
val step = 0.2
val gs = new GridSearch[model.type](model)
  .setGridSize(grid)
  .setStepSize(step)
  .setLogScale(false)

//additional options passed to the optimize method
val opt = ...

//start the search from the current kernel and noise hyper-parameters
val startConf = kernel.state ++ noise.state
val (_, conf) = gs.optimize(startConf, opt)

//update the model with the selected configuration
model.setState(conf)
Coupled Simulated Annealing¶
Coupled Simulated Annealing (CSA) is an iterative search procedure which evaluates model performance on a grid and in each iteration perturbs the grid points in a randomized manner. Each perturbed point is accepted using a certain acceptance probability which is a function of the performance on the whole grid.
Coupled Simulated Annealing can be seen as an extension of the classical Simulated Annealing algorithm. Since the acceptance probability and perturbation function are design choices, a number of variants of CSA can be formulated. Any CSA-like algorithm must have the following components.
- An ensemble or grid of points x_i \in \Theta.
- A perturbation distribution or function P: x_i \rightarrow y_i.
- A coupling term \gamma for an ensemble.
- An acceptance probability function A_{\Theta}(\gamma, x_i \rightarrow y_i).
- An annealing schedule T_{k}^{ac}, k = 0, 1, \cdots.
The CoupledSimulatedAnnealing class has a companion object with the following available variants.
Variant | Acceptance Probability | Coupling term \gamma |
---|---|---|
SA: Classical Simulated Annealing | 1/(1 + exp(\frac{E(y) - E(x)}{T^{ac}_{k}})) | - |
MuSA: Multi-state Simulated Annealing, a direct generalization of Simulated Annealing | exp(-E(y_i))/(exp(-E(y_i)) + \gamma) | \sum_{x_j \in \Theta}{exp(-E(x_j)/T^{ac}_{k})} |
BA: Blind Acceptance CSA | 1 - exp(-E(x_i)/T_{k}^{ac})/\gamma | \sum_{x_j \in \Theta}{exp(-E(x_j)/T^{ac}_{k})} |
M: Modified CSA | exp(E(x_i)/T_{k}^{ac})/\gamma | \sum_{x_j \in \Theta}{exp(E(x_j)/T^{ac}_{k})} |
MwVC: Modified CSA with Variance Control, which employs an annealing schedule that controls the variance of the acceptance probabilities of states | exp(E(x_i)/T_{k}^{ac})/\gamma | \sum_{x_j \in \Theta}{exp(E(x_j)/T^{ac}_{k})} |
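As a standalone illustration of the table above (this helper is not part of the DynaML API), the acceptance probability of the modified CSA variant, exp(E(x_i)/T^{ac}_{k})/\gamma with \gamma = \sum_{x_j \in \Theta}{exp(E(x_j)/T^{ac}_{k})}, can be computed for an ensemble of states as follows.

//Illustrative sketch: acceptance probabilities of the modified CSA
//variant, given the energies of the ensemble and the current
//acceptance temperature.
def modifiedCSAAcceptance(
  energies: Seq[Double],
  temperature: Double): Seq[Double] = {
  //coupling term: sum of exp(E(x_j)/T) over the whole ensemble
  val gamma = energies.map(e => math.exp(e / temperature)).sum
  //acceptance probability of each state x_i
  energies.map(e => math.exp(e / temperature) / gamma)
}

//example: an ensemble of three states at acceptance temperature 1.5
val acceptanceProbs = modifiedCSAAcceptance(Seq(0.5, 1.2, 2.0), 1.5)

Using the CoupledSimulatedAnnealing class itself follows the same pattern as grid search.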
val kernel = ...
val noise = ...
val data = ...
val model = new GPRegression(kernel, noise, data)

//number of grid points and spacing along each hyper-parameter
val grid = 5
val step = 0.2

//additional options passed to the optimize method
val opt = ...

//the default variant of CSA is MwVC; here we explicitly select MuSA
val gs = new CoupledSimulatedAnnealing[model.type](model)
  .setGridSize(grid)
  .setStepSize(step)
  .setLogScale(false)
  .setVariant(CoupledSimulatedAnnealing.MuSA)

val startConf = kernel.state ++ noise.state
val (_, conf) = gs.optimize(startConf, opt)

model.setState(conf)
Gradient-based Model Selection¶
Gradient-based model selection can be used if the model fitness function implemented in the energy method has differentiability properties (e.g. the marginal likelihood in the case of stochastic process inference). The GloballyOptWithGrad trait is an extension of GloballyOptimizable and adds a method gradEnergy that should return the gradient of the fitness function with respect to each hyper-parameter, in the form of a Map[String, Double].
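As a simplified illustration (the names and signatures below are assumptions, not the actual DynaML trait), a model supplying both a fitness function and its gradient could be sketched as follows.

//Illustrative sketch: a model exposing the fitness function and its
//gradient with respect to each hyper-parameter.
trait GloballyOptWithGradSketch {
  //fitness of a hyper-parameter configuration
  def energy(hyperParams: Map[String, Double]): Double

  //gradient of the fitness, keyed by hyper-parameter name
  def gradEnergy(hyperParams: Map[String, Double]): Map[String, Double]
}

//toy example: fitness (b - 2.5)^2 in a single hyper-parameter "bandwidth"
class ToyGradModel extends GloballyOptWithGradSketch {
  def energy(hyperParams: Map[String, Double]): Double =
    math.pow(hyperParams("bandwidth") - 2.5, 2.0)

  def gradEnergy(hyperParams: Map[String, Double]): Map[String, Double] =
    Map("bandwidth" -> 2.0 * (hyperParams("bandwidth") - 2.5))
}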
Maximum Likelihood ML-II¶
In the Maximum Likelihood (ML-II) algorithm (refer to Rasmussen & Williams for more details), we aim to maximize the log marginal likelihood by calculating its gradient with respect to the hyper-parameters \theta_j in each iteration and performing steepest ascent. The calculations are summarized below.
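Following chapter 5 of Rasmussen & Williams, the log marginal likelihood of a Gaussian Process with training targets \mathbf{y}, covariance matrix K_{\theta} and hyper-parameters \theta is

\log p(\mathbf{y}|X, \theta) = -\frac{1}{2}\mathbf{y}^{T}K_{\theta}^{-1}\mathbf{y} - \frac{1}{2}\log|K_{\theta}| - \frac{n}{2}\log 2\pi

where n is the number of training points, and its gradient with respect to a hyper-parameter \theta_j, used for the steepest ascent updates, is

\frac{\partial}{\partial \theta_j} \log p(\mathbf{y}|X, \theta) = \frac{1}{2} tr\left((\alpha\alpha^{T} - K_{\theta}^{-1})\frac{\partial K_{\theta}}{\partial \theta_j}\right), \quad \alpha = K_{\theta}^{-1}\mathbf{y}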
The GPMLOptimizer[I, T, M] class implements ML-II by using the gradEnergy method of its system: M member value (which refers to a model extending GloballyOptWithGrad).
val kernel = ...
val noise = ...
val data = ...
val model = new GPRegression(kernel, noise, data)

val ml = new GPMLOptimizer[DenseVector[Double],
  Seq[(DenseVector[Double], Double)],
  GPRegression](model)

//additional options passed to the optimize method
val opt = ...

val startConf = kernel.state ++ noise.state
val (_, conf) = ml.optimize(startConf, opt)

model.setState(conf)