19 Feature Selection using Univariate Filters
19.1 Univariate Filters
Another approach to feature selection is to pre-screen the predictors using simple univariate statistical methods, then only use those that pass some criterion in the subsequent model steps. Similar to recursive feature elimination, cross-validation of the subsequent models will be biased because the remaining predictors have already been evaluated on the data set. Proper performance estimates via resampling should include the feature selection step.
As an example, it has been suggested that, for classification models, predictors can be filtered by conducting some sort of k-sample test (where k is the number of classes) to see if the mean of the predictor is different between the classes. Wilcoxon tests, t-tests and ANOVA models are sometimes used. Predictors that have statistically significant differences between the classes are then used for modeling.
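As a sketch of this idea (x and y here stand for a numeric predictor matrix and a factor outcome; the 0.05 cutoff is an arbitrary choice for illustration), per-predictor p-values can be computed and thresholded directly:

## illustrative pre-screening: a one-way ANOVA p-value for each predictor column
pvals <- apply(x, 2, function(pred) anova(lm(pred ~ y))[["Pr(>F)"]][1])

## keep only the predictors with small p-values
keep <- names(pvals)[pvals < 0.05]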
The caret function sbf (for selection by filter) can be used to cross-validate such feature selection schemes. Similar to rfe, functions can be passed into sbf for the computational components: univariate filtering, model fitting, prediction and performance summaries (details are given below).
The function is applied to the entire training set and also to different resampled versions of the data set. From this, generalizable estimates of performance can be computed that properly take into account the feature selection step. Also, the results of the predictor filters can be tracked over resamples to understand the uncertainty in the filtering.
19.2 Basic Syntax
Similar to the rfe function, the syntax for sbf is:
sbf(predictors, outcome, sbfControl = sbfControl(), ...)
## or
sbf(formula, data, sbfControl = sbfControl(), ...)
In this case, the details are specified using the sbfControl function. Here, the argument functions dictates what the different components should do. This argument should have elements called score, filter, fit, pred and summary.
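The built-in objects described below, such as rfSBF, follow this structure; listing their element names shows the expected components:

library(caret)

## rfSBF is a list with one function per component
names(rfSBF)
## should include "score", "filter", "fit", "pred" and "summary" (in some order)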
19.2.1 The score Function
This function takes as inputs the predictors and the outcome in objects called x and y, respectively. By default, each predictor in x is passed to the score function individually. In this case, the function should return a single score. Alternatively, all of the predictors can be exposed to the function at once using the multivariate argument to sbfControl. In this case, the output should be a named vector of scores where the names correspond to the column names of x.
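As a minimal sketch (this function is illustrative and not part of caret), a univariate score for a two-class problem could return the p-value of a t-test:

## illustrative score function: x is a single predictor column and y is a
## two-level factor outcome; the t-test p-value is used as the score
exampleScore <- function(x, y) t.test(x ~ y)$p.value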
There are two built-in functions called anovaScores and gamScores. anovaScores treats the outcome as the independent variable and the predictor as the outcome; in this way, the null hypothesis is that the mean predictor values are equal across the different classes. For regression, gamScores fits a smoothing spline in the predictor to the outcome using a generalized additive model and tests whether there is any functional relationship between the two. In each function the p-value is used as the score.
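For example, anovaScores can be applied directly to a single predictor and a factor outcome:

library(caret)

## p-value testing whether mean Sepal.Length differs across the Species classes
anovaScores(x = iris$Sepal.Length, y = iris$Species)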
19.2.2 The filter Function
This function takes as inputs the scores coming out of the score function (in an argument called score). The function also has the training set data as inputs (arguments are called x and y). The output should be a named logical vector where the names correspond to the column names of x. Columns with values of TRUE will be used in the subsequent model.
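As an illustration (the multiple-comparison adjustment here is a hypothetical choice, not a caret default), a filter could retain predictors whose Bonferroni-adjusted p-values fall below 0.05:

## illustrative filter function: score is the named vector of p-values from the
## score function; p.adjust() preserves the names, so a named logical vector is
## returned as required
exampleFilter <- function(score, x, y) p.adjust(score, method = "bonferroni") < 0.05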
19.2.3 The fit Function
This component is very similar to the rfe-specific function described above. For sbf, there are no first or last arguments. The function should have arguments x, y and .... The data within x have been filtered using the filter function described above. The output of the fit function should be a fitted model.
With some data sets, no predictors will survive the filter. In these cases, a model with predictors cannot be computed, but the lack of viable predictors should not be ignored in the final results. To account for this issue, caret contains a model function called nullModel that fits a simple model that is independent of any of the predictors. For problems where the outcome is numeric, the function predicts every sample using the simple mean of the training set outcomes. For classification, the model predicts all samples using the most prevalent class in the training data.
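For example, a null model for a numeric outcome simply memorizes the training set mean:

library(caret)

## fit a null model for a numeric outcome; no predictors are involved
nullFit <- nullModel(y = mtcars$mpg)

## every sample receives the same prediction: the mean of the training outcomes
predict(nullFit, newdata = mtcars[1:3, ])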
The nullModel function can be used in the fit component function to “error-trap” cases where no predictors are selected. For example, there are several built-in sets of functions for specific models. The object rfSBF is a set of functions that may be useful for fitting random forest models with filtering. The fit function here uses nullModel to check for cases with no predictors:
rfSBF$fit
## function (x, y, ...)
## {
## if (ncol(x) > 0) {
## loadNamespace("randomForest")
## randomForest::randomForest(x, y, ...)
## }
## else nullModel(y = y)
## }
## <bytecode: 0x7fa6cc2db540>
## <environment: namespace:caret>
19.2.4 The summary and pred Functions
The summary function is used to calculate model performance on held-out samples. The pred function is used to predict new samples using the current predictor set. The arguments and outputs for these two functions are identical to the summary and pred functions described in previous sections.
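As a sketch of how the pieces fit together (using the hypothetical exampleScore and exampleFilter functions from above and linear discriminant analysis as the model; this is an illustration, not a caret default), a complete set of components might be assembled as:

library(caret)
library(MASS)

## illustrative components for a two-class problem; exampleScore and
## exampleFilter are the hypothetical functions sketched earlier
customSBF <- list(
  score   = exampleScore,    # per-predictor t-test p-values
  filter  = exampleFilter,   # TRUE/FALSE via adjusted p-values
  fit     = function(x, y, ...) {
    ## fall back to nullModel() when no predictors survive the filter
    if (ncol(x) > 0) lda(x, grouping = y, ...) else nullModel(y = y)
  },
  pred    = function(object, x) {
    if (inherits(object, "lda")) predict(object, newdata = x)$class
    else predict(object, newdata = x)
  },
  summary = defaultSummary
)

customCtrl <- sbfControl(functions = customSBF, method = "cv", number = 10)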
19.3 The Example
Returning to the example from (Friedman, 1991), we can fit another random forest model with the predictors pre-filtered using the generalized additive model approach described previously.
filterCtrl <- sbfControl(functions = rfSBF, method = "repeatedcv", repeats = 5)
set.seed(10)
rfWithFilter <- sbf(x, y, sbfControl = filterCtrl)
rfWithFilter
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 3.407 0.5589 2.86 0.5309 0.1782 0.5361
##
## Using the training set, 6 variables were selected:
## real2, real4, real5, bogus2, bogus17...
##
## During resampling, the top 5 selected variables (out of a possible 13):
## real2 (100%), real4 (100%), real5 (100%), bogus44 (76%), bogus2 (44%)
##
## On average, 5.5 variables were selected (min = 4, max = 8)
In this case, the training set indicated that 6 predictors should be used in the random forest model, but the resampling results indicate that there is some variation in this number. Some of the informative predictors are used, but a few others are erroneously retained.
Similar to rfe, there are methods for predictors, densityplot, histogram and varImp.
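For example, with the fitted object from above:

## predictors chosen when the filter was applied to the entire training set
predictors(rfWithFilter)

## distribution of the resampled performance estimates
densityplot(rfWithFilter)

## variable importances from the final, filtered random forest model
varImp(rfWithFilter)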