# 19 Feature Selection using Univariate Filters

Contents

## 19.1 Univariate Filters

Another approach to feature selection is to pre-screen the predictors using simple univariate statistical methods then only use those that pass some criterion in the subsequent model steps. Similar to recursive selection, cross-validation of the subsequent models will be biased as the remaining predictors have already been evaluate on the data set. Proper performance estimates via resampling should include the feature selection step.

As an example, it has been suggested for classification models, that predictors can be filtered by conducting some sort of *k*-sample test (where *k* is the number of classes) to see if the mean of the predictor is different between the classes. Wilcoxon tests, *t*-tests and ANOVA models are sometimes used. Predictors that have statistically significant differences between the classes are then used for modeling.

The caret function `sbf`

(for selection by filter) can be used to cross-validate such feature selection schemes. Similar to `rfe`

, functions can be passed into `sbf`

for the computational components: univariate filtering, model fitting, prediction and performance summaries (details are given below).

The function is applied to the entire training set and also to different resampled versions of the data set. From this, generalizable estimates of performance can be computed that properly take into account the feature selection step. Also, the results of the predictor filters can be tracked over resamples to understand the uncertainty in the filtering.

## 19.2 Basic Syntax

Similar to the `rfe`

function, the syntax for `sbf`

is:

```
sbf(predictors, outcome, sbfControl = sbfControl(), ...)
## or
sbf(formula, data, sbfControl = sbfControl(), ...)
```

In this case, the details are specificed using the `sbfControl`

function. Here, the argument `functions`

dictates what the different components should do. This argument should have elements called `filter`

, `fit`

, `pred`

and `summary`

.

### 19.2.1 The `score`

Function

This function takes as inputs the predictors and the outcome in objects called `x`

and `y`

, respectively. By default, each predictor in `x`

is passed to the `score`

function individually. In this case, the function should return a single score. Alternatively, all the predictors can be exposed to the function using the `multivariate`

argument to `sbfControl`

. In this case, the output should be a named vector of scores where the names correspond to the column names of `x`

.

There are two built-in functions called `anovaScores`

and `gamScores`

. `anovaScores`

treats the outcome as the independent variable and the predictor as the outcome. In this way, the null hypothesis is that the mean predictor values are equal across the different classes. For regression, `gamScores`

fits a smoothing spline in the predictor to the outcome using a generalized additive model and tests to see if there is any functional relationship between the two. In each function the p-value is used as the score.

### 19.2.2 The `filter`

Function

This function takes as inputs the scores coming out of the `score`

function (in an argument called `score`

). The function also has the training set data as inputs (arguments are called `x`

and `y`

). The output should be a named logical vector where the names correspond to the column names of `x`

. Columns with values of `TRUE`

will be used in the subsequent model.

### 19.2.3 The `fit`

Function

The component is very similar to the `rfe`

-specific function described above. For `sbf`

, there are no `first`

or `last`

arguments. The function should have arguments `x`

, `y`

and `...`

. The data within `x`

have been filtered using the `filter`

function described above. The output of the `fit`

function should be a fitted model.

With some data sets, no predictors will survive the filter. In these cases, a model with predictors cannot be computed, but the lack of viable predictors should not be ignored in the final results. To account for this issue, `caret`

contains a model function called `nullModel`

that fits a simple model that is independent of any of the predictors. For problems where the outcome is numeric, the function predicts every sample using the simple mean of the training set outcomes. For classification, the model predicts all samples using the most prevalent class in the training data.

This function can be used in the `fit`

component function to “error-trap” cases where no predictors are selected. For example, there are several built-in functions for some models. The object `rfSBF`

is a set of functions that may be useful for fitting random forest models with filtering. The `fit`

function here uses `nullModel`

to check for cases with no predictors:

`rfSBF$fit`

```
## function (x, y, ...)
## {
## if (ncol(x) > 0) {
## loadNamespace("randomForest")
## randomForest::randomForest(x, y, ...)
## }
## else nullModel(y = y)
## }
## <bytecode: 0x7fa6cc2db540>
## <environment: namespace:caret>
```

### 19.2.4 The `summary`

and `pred`

Functions

The `summary`

function is used to
calculate model performance on held-out samples. The `pred`

function is used to predict new samples using the current predictor set. The arguments and outputs for these two functions are identical to the previously discussed `summary`

and `pred`

functions in previously described sections.

## 19.3 The Example

Returning to the example from (Friedman, 1991), we can fit another random forest model with the predictors pre-filtered using the generalized additive model approach described previously.

```
filterCtrl <- sbfControl(functions = rfSBF, method = "repeatedcv", repeats = 5)
set.seed(10)
rfWithFilter <- sbf(x, y, sbfControl = filterCtrl)
rfWithFilter
```

```
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 3.407 0.5589 2.86 0.5309 0.1782 0.5361
##
## Using the training set, 6 variables were selected:
## real2, real4, real5, bogus2, bogus17...
##
## During resampling, the top 5 selected variables (out of a possible 13):
## real2 (100%), real4 (100%), real5 (100%), bogus44 (76%), bogus2 (44%)
##
## On average, 5.5 variables were selected (min = 4, max = 8)
```

In this case, the training set indicated that 6 should be used in the random forest model, but the resampling results indicate that there is some variation in this number. Some of the informative predictors are used, but a few others are erroneous retained.

Similar to `rfe`

, there are methods for `predictors`

, `densityplot`

, `histogram`

and `varImp`

.