16 Miscellaneous Model Functions
Contents
- Yet Another k-Nearest Neighbor Function
- Partial Least Squares Discriminant Analysis
- Bagged MARS and FDA
- General Purpose Bagging
- Model Averaged Neural Networks
- Neural Networks with a Principal Component Step
- Independent Component Regression
16.1 Yet Another k-Nearest Neighbor Function
knn3 is a function for k-nearest neighbor classification. This particular implementation is a modification of the knn C code and returns the vote information for all of the classes (knn only returns the probability for the winning class). There is a formula interface via

knn3(formula, data)

or the training data can be passed directly:

## x is a matrix or data frame, y is a factor vector
knn3(x, y)

There are also print and predict methods.

For the Sonar data in the mlbench package, we can fit an 11-nearest neighbor model:
library(caret)
library(mlbench)
data(Sonar)
set.seed(808)
inTrain <- createDataPartition(Sonar$Class, p = 2/3, list = FALSE)
## Save the predictors and class in different objects
sonarTrain <- Sonar[ inTrain, -ncol(Sonar)]
sonarTest <- Sonar[-inTrain, -ncol(Sonar)]
trainClass <- Sonar[ inTrain, "Class"]
testClass <- Sonar[-inTrain, "Class"]
centerScale <- preProcess(sonarTrain)
centerScale
## Created from 139 samples and 60 variables
##
## Pre-processing:
## - centered (60)
## - ignored (0)
## - scaled (60)
training <- predict(centerScale, sonarTrain)
testing <- predict(centerScale, sonarTest)
knnFit <- knn3(training, trainClass, k = 11)
knnFit
## 11-nearest neighbor model
## Training set outcome distribution:
##
## M R
## 74 65
predict(knnFit, head(testing), type = "prob")
## M R
## [1,] 0.45454545 0.5454545
## [2,] 0.81818182 0.1818182
## [3,] 0.63636364 0.3636364
## [4,] 0.09090909 0.9090909
## [5,] 0.54545455 0.4545455
## [6,] 0.45454545 0.5454545
Similarly, caret contains a k-nearest neighbor regression function, knnreg. It returns the average outcome of the nearest neighbors.
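As a quick illustration, here is a minimal sketch of knnreg on the built-in trees data; the choice of predictors and k = 5 are arbitrary and only meant to show the interface:

## Predict timber volume from girth and height using the average of
## the 5 nearest neighbors (k = 5 is an illustrative, untuned choice)
data(trees)
knnRegFit <- knnreg(x = trees[, c("Girth", "Height")],
                    y = trees$Volume,
                    k = 5)
predict(knnRegFit, trees[1:3, c("Girth", "Height")])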
16.2 Partial Least Squares Discriminant Analysis
The plsda function is a wrapper for the plsr function in the pls package that does not require a formula interface and can take factor outcomes as arguments. The classes are broken down into dummy variables (one for each class). These 0/1 dummy variables are modeled by partial least squares.

From this model, there are two approaches to computing the class predictions and probabilities:

- The softmax technique can be used on a per-sample basis to normalize the scores so that they are more “probability like” (i.e. they sum to one and are between zero and one). For a vector of model predictions $x_1, \ldots, x_C$ (one per class), the softmax class probabilities are computed as $\hat{p}_\ell = \exp(x_\ell) / \sum_{j=1}^{C} \exp(x_j)$. The predicted class is simply the class with the largest model prediction, or equivalently, the largest class probability. This is the default behavior for plsda.
- Bayes rule can be applied to the model predictions to form posterior probabilities. Here, the model predictions for the training set are used along with the training set outcomes to create conditional distributions for each class. When new samples are predicted, the raw model predictions are run through these conditional distributions to produce a posterior probability for each class (along with the prior). Bayes rule can be used by specifying probMethod = "Bayes". An additional parameter, prior, can be used to set prior probabilities for the classes.

The advantage to using Bayes rule is that the full training set is used to directly compute the class probabilities (unlike the softmax function, which only uses the current sample’s scores). This creates more realistic probability estimates, but the disadvantage is that a separate Bayesian model must be created for each value of ncomp, which is more time consuming.
For the Sonar data, we can fit two PLS models using each technique and predict the class probabilities for the test set:
plsFit <- plsda(training, trainClass, ncomp = 20)
plsFit
## Partial least squares classification, fitted with the kernel algorithm.
## The softmax function was used to compute class probabilities.
plsBayesFit <- plsda(training, trainClass, ncomp = 20,
probMethod = "Bayes")
plsBayesFit
## Partial least squares classification, fitted with the kernel algorithm.
## Bayes rule was used to compute class probabilities.
predict(plsFit, head(testing), type = "prob")
## , , 20 comps
##
## M R
## 2 0.12860843 0.8713916
## 5 0.49074450 0.5092555
## 8 0.59582388 0.4041761
## 11 0.35693679 0.6430632
## 13 0.36360834 0.6363917
## 14 0.06626214 0.9337379
predict(plsBayesFit, head(testing), type = "prob")
## , , ncomp20
##
## M R
## 2 0.02774255 0.9722574
## 5 0.47710154 0.5228985
## 8 0.89692329 0.1030767
## 11 0.06002366 0.9399763
## 13 0.07292981 0.9270702
## 14 0.60530446 0.3946955
Similar to plsda, caret also contains a function splsda that allows for classification using sparse PLS. A dummy matrix is created for each class and used with the spls function in the spls package. The same approach to estimating class probabilities is used for plsda and splsda.
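As a sketch of the interface (assuming the spls package is installed; the K and eta values below are arbitrary choices rather than tuned settings):

## Sparse PLS discriminant analysis on the pre-processed Sonar training set
library(spls)
splsFit <- splsda(training, trainClass, K = 5, eta = 0.5)
predict(splsFit, head(testing), type = "prob")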
16.3 Bagged MARS and FDA
Multivariate adaptive regression splines (MARS) models, like classification/regression trees, are unstable predictors (Breiman, 1996). This means that small perturbations in the training data might lead to significantly different models. Bagged trees and random forests are effective ways of improving tree models by exploiting these instabilities. caret contains a function, bagEarth, that fits MARS models via the earth function. There are formula and non-formula interfaces.

Also, flexible discriminant analysis (FDA) is a generalization of linear discriminant analysis that can use non-linear features as inputs. One way of doing this is to use MARS-type features to classify samples. The function bagFDA fits FDA models on a set of bootstrap samples and aggregates the predictions to reduce noise.

This function is deprecated in favor of the bag function.
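For reference, a minimal regression sketch of bagEarth on the built-in trees data might look like the following (B = 10 bootstrap samples is kept small purely for illustration):

## Bagged MARS via the formula interface
library(earth)
bagMarsFit <- bagEarth(Volume ~ ., data = trees, B = 10)
predict(bagMarsFit, newdata = trees[1:3, ])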
16.4 Bagging
The bag function offers a general platform for bagging classification and regression models. Like rfe and sbf, it is open-ended: models are specified by declaring functions for the model fitting and prediction code (and several built-in sets of functions exist in the package). The function bagControl has options to specify these functions (more details below).

The function also has a few non-standard features:

- The argument vars can enable random sampling of the predictors at each bagging iteration. This is done to de-correlate the bagged models in the same spirit as random forests (although here the sampling is done once for the whole model). The default is to use all the predictors for each model.
- The bagControl function has a logical argument called downSample that is useful for classification models with severe class imbalance. The bootstrapped data set is reduced so that the sample sizes for the classes with larger frequencies are the same as the sample size for the minority class.
- If a parallel backend for the foreach package has been loaded and registered, the bagged models can be trained in parallel (see the sketch after this list).
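For example, one way to register a parallel backend before calling bag is shown below (a sketch using the doParallel package; the number of workers is an arbitrary choice):

library(doParallel)
cl <- makePSOCKcluster(2)   ## two worker processes, an arbitrary choice
registerDoParallel(cl)
## ... calls to bag() will now train the bagged models in parallel ...
stopCluster(cl)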
The bagControl function requires the following arguments:
16.4.1 The fit Function
Inputs:

- x: a data frame of the training set predictor data.
- y: the training set outcomes.
- ...: arguments passed from train to this function.

The output is the object corresponding to the trained model and any other objects required for prediction. A simple example for a linear discriminant analysis model from the MASS package is:
function(x, y, ...) {
  library(MASS)
  lda(x, y, ...)
}
16.4.2 The pred Function
This should be a function that produces predictions for new samples.
Inputs:

- object: the object generated by the fit module.
- x: a matrix or data frame of predictor data.

The output is either a numeric vector (for regression), a factor (or character) vector for classification, or a matrix/data frame of class probabilities. For classification, it is probably better to average class probabilities instead of using the votes of the class predictions. Using the lda example again:
## predict.lda returns the class and the class probabilities
## We will average the probabilities, so these are saved
function(object, x) predict(object, x)$posterior
16.4.3 The aggregate Function
This should be a function that takes the predictions from the constituent models and converts them to a single prediction per sample.
Inputs:

- x: a list of objects returned by the pred module.
- type: an optional string that describes the type of output (e.g. "class", "prob", etc.).

The output is either a numeric vector (for regression), a factor (or character) vector for classification, or a matrix/data frame of class probabilities. For the linear discriminant model above, we saved the matrix of class probabilities. To average them and generate a class prediction, we could use:
function(x, type = "class") {
  ## The class probabilities come in as a list of matrices
  ## For each class, we can pool them then average over them
  ## Pre-allocate space for the results
  pooled <- x[[1]] * NA
  n <- nrow(pooled)
  classes <- colnames(pooled)
  ## For each class probability, take the median across
  ## all the bagged model predictions
  for(i in 1:ncol(pooled)) {
    tmp <- lapply(x, function(y, col) y[, col], col = i)
    tmp <- do.call("rbind", tmp)
    pooled[, i] <- apply(tmp, 2, median)
  }
  ## Re-normalize to make sure they add to 1
  pooled <- apply(pooled, 1, function(x) x/sum(x))
  if(n != nrow(pooled)) pooled <- t(pooled)
  if(type == "class") {
    out <- factor(classes[apply(pooled, 1, which.max)],
                  levels = classes)
  } else out <- as.data.frame(pooled)
  out
}
For example, to bag a conditional inference tree (from the party package):
library(caret)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
set.seed(825)
baggedCT <- bag(x = training[, names(training) != "Class"],
y = training$Class,
B = 50,
bagControl = bagControl(fit = ctreeBag$fit,
predict = ctreeBag$pred,
aggregate = ctreeBag$aggregate))
summary(baggedCT)
##
## Call:
## bag.default(x = training[, names(training) != "Class"], y
## = training$Class, B = 50, bagControl = bagControl(fit =
## ctreeBag$fit, predict = ctreeBag$pred, aggregate = ctreeBag$aggregate))
##
## Out of bag statistics (B = 50):
##
## Accuracy Kappa
## 0.0% 0.4746 -0.04335
## 2.5% 0.5806 0.17971
## 25.0% 0.6681 0.32402
## 50.0% 0.7094 0.41815
## 75.0% 0.7606 0.51092
## 97.5% 0.8060 0.59901
## 100.0% 0.8077 0.61078
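To generate predictions for the held-out samples, a sketch such as the following should work (predict on a bag object returns the aggregated class predictions by default):

predict(baggedCT, head(testing[, names(testing) != "Class"]))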
16.5 Model Averaged Neural Networks
The avNNet function fits multiple neural network models to the same data set and predicts using the average of the predictions from each constituent model. The models can differ either by using different random number seeds to initialize the networks or by fitting the models on bootstrap samples of the original training set (i.e. bagging the neural network). For classification models, the class probabilities are averaged to produce the final class prediction (as opposed to voting from the individual class predictions).

As an example, the model can be fit via train:
set.seed(825)
avNnetFit <- train(x = training,
y = trainClass,
method = "avNNet",
repeats = 15,
trace = FALSE)
16.6 Neural Networks with a Principal Component Step
Neural networks can be affected by severe amounts of multicollinearity in the predictors. The function pcaNNet is a wrapper around the preProcess and nnet functions that runs principal component analysis on the predictors before using them as inputs into a neural network. The function keeps enough components to capture a pre-defined threshold of the cumulative proportion of variance (see the thresh argument). For new samples, the same transformation is applied to the new predictor values (based on the loadings from the training set). The function is available for both regression and classification.

This function is deprecated in favor of the train function using method = "nnet" and preProc = "pca".
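A minimal sketch of that replacement, reusing the Sonar predictors and classes from Section 16.1 (the other train settings are left at their defaults):

set.seed(825)
pcaNnetFit <- train(x = sonarTrain,
                    y = trainClass,
                    method = "nnet",
                    preProcess = "pca",
                    trace = FALSE)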
16.7 Independent Component Regression
The icr function can be used to fit a model analogous to principal component regression (PCR), but using independent component analysis (ICA). The predictor data are centered and projected onto the ICA components, and these components are then regressed against the outcome. The user needs to specify the number of components to keep.

The model uses the preProcess function to compute the latent variables via the fastICA package.

Like PCR, there is no guarantee that there will be any correlation between the new latent variables and the outcome.
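A minimal sketch on the built-in trees data (assuming the fastICA package is installed; n.comp = 2 is an arbitrary choice for illustration):

## Independent component regression with two ICA components
library(fastICA)
icaFit <- icr(x = trees[, c("Girth", "Height")],
              y = trees$Volume,
              n.comp = 2)
predict(icaFit, trees[1:3, c("Girth", "Height")])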