21 Feature Selection using Genetic Algorithms


21.1 Genetic Algorithms

Genetic algorithms (GAs) mimic Darwinian forces of natural selection to find optimal values of some function (Mitchell, 1998). An initial set of candidate solutions is created and their corresponding fitness values are calculated (where larger values are better). This set of solutions is referred to as a population and each solution as an individual. The individuals with the best fitness values are combined randomly to produce offspring that make up the next population. To do so, individuals are selected and undergo cross-over (mimicking genetic reproduction) and are also subject to random mutations. This process is repeated over many generations (i.e. iterations of the search procedure), which should produce better and better solutions.

For feature selection, the individuals are subsets of predictors that are encoded as binary; a feature is either included in the subset or not. The fitness values are some measure of model performance, such as the RMSE or classification accuracy. One issue with using GAs for feature selection is that the optimization process can be very aggressive and there is potential for the GA to overfit to the predictors (much like the previous discussion for RFE).
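To make the mechanics concrete, below is a toy GA for binary feature subsets written in base R. This is only an illustrative sketch under simple assumptions (truncation selection, single-point crossover); it is not the implementation used by caret, which relies on the GA package. Here, fitness would be any function that takes a 0/1 inclusion vector and returns, say, an accuracy or the negative of an RMSE estimate for the corresponding subset.

## A toy GA for binary feature subsets (illustrative sketch only)
ga_sketch <- function(fitness, n_features, pop_size = 20, generations = 10,
                      cross_prob = 0.8, mut_prob = 0.1) {
  ## initial population: each row is a 0/1 vector indicating inclusion
  pop <- matrix(rbinom(pop_size * n_features, 1, 0.5), nrow = pop_size)
  for (gen in seq_len(generations)) {
    fit <- apply(pop, 1, fitness)
    ## selection: the fitter half of the population become parents
    parents <- pop[order(fit, decreasing = TRUE)[1:(pop_size %/% 2)], , drop = FALSE]
    ## offspring: resample parents, then apply crossover and mutation
    n_kids <- pop_size - nrow(parents)
    kids <- parents[sample(nrow(parents), n_kids, replace = TRUE), , drop = FALSE]
    for (i in seq_len(nrow(kids))) {
      if (runif(1) < cross_prob) {
        mate <- parents[sample(nrow(parents), 1), ]
        cut  <- sample(n_features - 1, 1)
        kids[i, (cut + 1):n_features] <- mate[(cut + 1):n_features]
      }
      flip <- runif(n_features) < mut_prob
      kids[i, flip] <- 1 - kids[i, flip]
    }
    pop <- rbind(parents, kids)
  }
  ## return the best inclusion vector found
  pop[which.max(apply(pop, 1, fitness)), ]
}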

21.2 Internal and External Performance Estimates

The genetic algorithm code in caret conducts the search of the feature space repeatedly within resampling iterations. First, the training data are split by whatever resampling method was specified in the control function. For example, if 10-fold cross-validation is selected, the entire genetic algorithm is conducted 10 separate times. For the first fold, nine tenths of the data are used in the search while the remaining tenth is used to estimate the external performance since these data points were not used in the search.

During the genetic algorithm, a measure of fitness is needed to guide the search. This is the internal measure of performance. During the search, the data that are available are the instances selected by the top-level resampling (e.g. the nine tenths mentioned above). A common approach is to conduct another resampling procedure. Another option is to use a holdout set of samples to determine the internal estimate of performance (see the holdout argument of the control function). While this is faster, it is more likely to cause overfitting of the features and should only be used when a large amount of training data are available. Yet another idea is to use a penalized metric (such as the AIC statistic) but this may not exist for some metrics (e.g. the area under the ROC curve).

The internal estimates of performance will eventually overfit the subsets to the data. However, since the external estimate is not used by the search, it is able to make better assessments of overfitting. After resampling, gafs determines the optimal number of generations for the GA.

Finally, the entire data set is used in the last execution of the genetic algorithm search and the final model is built on the predictor subset that is associated with the optimal number of generations determined by resampling (although the update function can be used to manually set the number of generations).
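As a sketch of this nested structure, the self-contained example below uses a simple correlation filter as a stand-in for the GA and caret's twoClassSim to simulate data; the point is only that the subset is chosen using the analysis data and judged on the held-out fold:

## Hypothetical illustration of the internal/external split (not caret's code)
library(caret)

set.seed(1)
dat   <- twoClassSim(200)
folds <- createFolds(dat$Class, k = 10)   # external 10-fold CV indices

external_acc <- sapply(folds, function(test_idx) {
  analysis   <- dat[-test_idx, ]          # used by the "search"
  assessment <- dat[ test_idx, ]          # never seen by the search

  ## stand-in for the GA: keep the five predictors most associated with the class
  pred_cols <- setdiff(names(analysis), "Class")
  scores <- sapply(analysis[, pred_cols],
                   function(p) abs(cor(p, as.numeric(analysis$Class))))
  subset <- names(sort(scores, decreasing = TRUE))[1:5]

  ## external estimate: fit on the analysis data, score on the held-out fold
  fit <- train(analysis[, subset], analysis$Class, method = "glm")
  mean(predict(fit, assessment[, subset]) == assessment$Class)
})
mean(external_acc)   # honest estimate of the whole search-plus-model process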

21.3 Basic Syntax

The most basic usage of the function is:

obj <- gafs(x = predictors, 
            y = outcome,
            iters = 100)

where

  • x: a data frame or matrix of predictor values
  • y: a factor or numeric vector of outcomes
  • iters: the number of generations for the GA

This isn’t very specific. All of the action is in the control function, which can be used to specify the model to be fit, how predictions are made and summarized, as well as the genetic operations.

Suppose that we want to fit a linear regression model. To do this, we can use train as an interface and pass arguments to that function through gafs:

ctrl <- gafsControl(functions = caretGA)
obj <- gafs(x = predictors, 
            y = outcome,
            iters = 100,
            gafsControl = ctrl,
            ## Now pass options to `train`
            method = "lm")

Other options, such as preProcess, can be passed in as well.
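For example, to center and scale the predictors within each model fit, train's preProcess argument can be passed straight through gafs. This sketch reuses the placeholder predictors and outcome objects from above:

ctrl <- gafsControl(functions = caretGA)
obj <- gafs(x = predictors,
            y = outcome,
            iters = 100,
            gafsControl = ctrl,
            ## options passed through to `train`
            method = "lm",
            preProcess = c("center", "scale"))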

Some important options to gafsControl are:

  • method, number, repeats, index, indexOut, etc: options similar to those for train to control resampling.
  • metric: this is similar to train’s option but, in this case, the value should be a named vector with values for the internal and external metrics. If none are specified, the first value returned by the summary functions (see details below) is used and a warning is issued. A similar two-element vector for the option maximize is also required. See the last example in this chapter for an illustration.
  • holdout: this is a number in [0, 1) that can be used to hold out samples for computing the internal fitness value. Note that this is independent of the external resampling step. Suppose 10-fold CV is being used. Within a resampling iteration, holdout can be used to sample an additional proportion of the 90% resampled data to use for estimating fitness. This may not be a good idea unless you have a very large training set and want to avoid an internal resampling procedure to estimate fitness.
  • allowParallel and genParallel: these are logicals to control where parallel processing should be used (if at all). The former will parallelize the external resampling while the latter parallelizes the fitness calculations within a generation. allowParallel will almost always be more advantageous; a sketch of the parallel setup is shown after this list.
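Parallel processing is handled through the foreach framework, so a parallel backend has to be registered first. A minimal sketch, assuming the doParallel package is installed (the number of workers is arbitrary):

library(doParallel)

## register a parallel backend for foreach
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    allowParallel = TRUE,   # parallelize the external resamples
                    genParallel = FALSE)    # keep within-generation fitness serial

## ... run gafs() with this control object ...

stopCluster(cl)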

There are a few built-in sets of functions to use with gafs: caretGA, rfGA, and treebagGA. The first is a simple interface to train. When using this, as shown above, arguments can be passed to train using the ... structure and the resampling estimates of performance can be used as the internal fitness value. The functions provided by rfGA and treebagGA avoid using train and their internal estimates of fitness come from using the out-of-bag estimates generated from the model.
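Each of these objects is a named list of functions (model fitting, prediction, the internal and external fitness calculations, and the GA operators), so individual components can be inspected or replaced, which is exactly what the desirability example later in this chapter does:

library(caret)

## list the components of one of the built-in helper sets
names(rfGA)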

The GA implementation in caret uses the underlying code from the GA package (Scrucca, 2013).

21.4 Genetic Algorithm Example

Using the example from the previous page where there are five real predictors and 45 noise predictors (50 predictors in total):

library(caret)
library(mlbench)
n <- 100
p <- 40
sigma <- 1
set.seed(1)
sim <- mlbench.friedman1(n, sd = sigma)
colnames(sim$x) <- c(paste("real", 1:5, sep = ""),
                     paste("bogus", 1:5, sep = ""))
bogus <- matrix(rnorm(n * p), nrow = n)
colnames(bogus) <- paste("bogus", 5+(1:ncol(bogus)), sep = "")
x <- cbind(sim$x, bogus)
y <- sim$y
normalization <- preProcess(x)
x <- predict(normalization, x)
x <- as.data.frame(x)

We’ll fit a random forest model, use the out-of-bag RMSE estimate as the internal performance metric, and use the same repeated 10-fold cross-validation process used in the earlier searches for the external estimate. To do this, we’ll use the built-in rfGA object. The default GA operators will be used for 200 generations of the algorithm.

ga_ctrl <- gafsControl(functions = rfGA,
                       method = "repeatedcv",
                       repeats = 5)

## Use the same random number seed as the RFE process
## so that the same CV folds are used for the external
## resampling. 
set.seed(10)
rf_ga <- gafs(x = x, y = y,
              iters = 200,
              gafsControl = ga_ctrl)
rf_ga
## 
## Genetic Algorithm Feature Selection
## 
## 100 samples
## 50 predictors
## 
## Maximum generations: 200 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 50):
##     real1 (100%), real2 (100%), real4 (100%), real5 (100%), real3 (92%)
##   * on average, 9.3 variables were selected (min = 6, max = 15)
## 
## In the final search using the entire training set:
##    * 12 features selected at iteration 195 including:
##      real1, real2, real3, real4, real5 ... 
##    * external performance at this iteration is
## 
##        RMSE    Rsquared         MAE 
##      2.8056      0.7607      2.3640

With 5 repeats of 10-fold cross-validation, the GA was executed 50 times. The average external performance is calculated across resamples and these results are used to determine the optimal number of iterations for the final GA to avoid over-fitting. Across the resamples, an average of 9.3 predictors were selected at the end of each of the 50 runs.

The plot function is used to monitor the average of the internal out-of-bag RMSE estimates as well as the average of the external performance estimates calculated from the 50 out-of-sample predictions. By default, this function uses the ggplot2 package. A black and white theme can be “added” to the output object:

plot(rf_ga) + theme_bw()

Based on these results, the generation associated with the best external RMSE estimate was 195, with an estimated RMSE of 2.81.

Using the entire training set, the final GA is conducted and, at generation 195, 12 predictors were selected: real1, real2, real3, real4, real5, bogus3, bogus5, bogus7, bogus8, bogus14, bogus17, bogus29. A random forest model using these predictors is then fit to the entire training set and this is the model that is used when predict.gafs is executed.
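For example, predictions on new samples use that final model. A quick sketch, reusing a few rows of the training predictors as stand-in new data:

## predict.gafs uses the final model built on the selected subset
predict(rf_ga, newdata = head(x))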

Note: the correlation between the internal and external fitness values is somewhat atypical for most real-world problems. This is a function of the nature of the simulations (a small number of uncorrelated informative predictors) and that the OOB error estimate from random forest is a product of hundreds of trees. Your mileage may vary.

21.6 The Example Revisited

The previous GA included some of the non-informative predictors. We can cheat a little and try to bias the search to get the right solution.

To encourage the algorithm to choose fewer predictors, we can penalize the RMSE estimate. Normally, a metric like the Akaike information criterion (AIC) statistic would be used. However, with a random forest model, there is no real notion of model degrees of freedom. As an alternative, we can use desirability functions to penalize the RMSE. To do this, two functions are created that translate the number of predictors and the RMSE values to a measure of “desirability”. For the number of predictors, the most desirable property would be a single predictor and the worst situation would be if the model required all 50 predictors. That desirability function takes the value one for a single predictor and falls to zero when all 50 predictors are used.

For the RMSE, the best case would be zero. Many poor models have values around four. To give the RMSE value more weight in the overall desirability calculation, we use a scale parameter value of 2. This desirability function takes the value one at an RMSE of zero and falls to zero at an RMSE of four.
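As a rough illustration of how these building blocks behave, the snippet below evaluates the two individual desirability functions and their combination on a few arbitrary values (the scale arguments are omitted here for simplicity):

library(desirability)

## subset size: one predictor is fully desirable, all 50 predictors is not
d_size <- dMin(1, 50)
predict(d_size, c(1, 10, 25, 50))

## RMSE: zero is ideal and values of four or more are completely undesirable
d_rmse <- dMin(0, 4)
predict(d_rmse, c(0, 1, 2, 4))

## geometric-mean combination of the two individual desirabilities
overall <- dOverall(d_rmse, d_size)
predict(overall, data.frame(rmse = 2, size = 10))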

To use the overall desirability to drive the feature selection, the internal fitness function requires replacement. We make a copy of rfGA and add code using the desirability package so that the function returns the estimated RMSE and the overall desirability. The gafsControl function also needs changes: the metric argument needs to reflect that the overall desirability score should be maximized internally but the RMSE estimate should be minimized externally.

library(desirability)
rfGA2 <- rfGA
rfGA2$fitness_intern <- function (object, x, y, maximize, p) {
  RMSE <- rfStats(object)[1]
  d_RMSE <- dMin(0, 4)
  d_Size <- dMin(1, p, 2)
  overall <- dOverall(d_RMSE, d_Size)
  D <- predict(overall, data.frame(RMSE, ncol(x)))
  c(D = D, RMSE = as.vector(RMSE))
  }
ga_ctrl_d <- gafsControl(functions = rfGA2,
                         method = "repeatedcv",
                         repeats = 5,
                         metric = c(internal = "D", external = "RMSE"),
                         maximize = c(internal = TRUE, external = FALSE))

set.seed(10)
rf_ga_d <- gafs(x = x, y = y,
                iters = 150,
                gafsControl = ga_ctrl_d)

rf_ga_d
## 
## Genetic Algorithm Feature Selection
## 
## 100 samples
## 50 predictors
## 
## Maximum generations: 150 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: D, RMSE
## Subset selection driven to maximize internal D 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 50):
##     real1 (100%), real2 (100%), real4 (100%), real5 (100%), real3 (40%)
##   * on average, 5.2 variables were selected (min = 4, max = 6)
## 
## In the final search using the entire training set:
##    * 6 features selected at iteration 146 including:
##      real1, real2, real3, real4, real5 ... 
##    * external performance at this iteration is
## 
##        RMSE    Rsquared         MAE 
##      2.7046      0.7665      2.2730

Here are the RMSE values for this search:

plot(rf_ga_d) + theme_bw()

The final GA selected 6 predictors: real1, real2, real3, real4, real5, bogus43. During resampling, the average number of predictors selected was 5.2, indicating that the penalty on the number of predictors was effective.

21.7 Using Recipes

Like the other feature selection routines, gafs can take a data recipe as an input. This is advantageous when the data need preprocessing before modeling, such as:

  • creation of dummy variables from factors
  • specification of interactions
  • missing data imputation
  • more complex feature engineering methods

As with train, the recipe’s preprocessing steps are recalculated within each resample. This makes sure that the resampling statistics capture the variation and effect that the preprocessing has on the model.

As an example, the Ames housing data is used. These data contain a number of categorical predictors that require conversion to indicators as well as other variables that require processing. To load (and split) the data:

library(AmesHousing)
library(dplyr)
library(rsample)

# Create the data and remove one column that is more of 
# an outcome. 
ames <- make_ames() %>%
  select(-Overall_Qual)

ncol(ames)
## [1] 80
# How many factor variables?
sum(vapply(ames, is.factor, logical(1)))
## [1] 45
# We'll use `rsample` to make the initial split to be consistent with other
# analyses of these data. Set the seed first to make sure that you get the 
# same random numbers
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", prop = 3/4)

ames_train <- training(data_split) %>% as.data.frame()
ames_test  <- testing(data_split) %>% as.data.frame()

Here is a recipe that does different types of preprocessing on the predictor set:

library(recipes)

ames_rec <- recipe(Sale_Price ~ ., data = ames_train) %>% 
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, threshold = 0.05)  %>%
  step_dummy(all_nominal(), -Bldg_Type) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_zv(all_predictors()) %>%
  step_bs(Longitude, Latitude, options = list(df = 5))

If this were executed on the training set, it would produce 280 predictor columns out of the original 79.
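To verify the dimensions, the recipe can be estimated and applied to the training set directly; this is just a quick check using recipes' prep() and bake(), and the exact count may differ slightly across package versions:

library(recipes)

## estimate the preprocessing on the training set and apply it
ames_processed <- ames_rec %>% 
  prep(training = ames_train) %>% 
  bake(new_data = NULL)

## number of predictor columns (everything except the outcome)
ncol(ames_processed) - 1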

Let’s search this predictor set with gafs using linear models and, for the sake of computational time, only use 10 generations of the algorithm:

lm_ga_ctrl <- gafsControl(functions = caretGA, method = "cv", number = 10)

set.seed(23555)
lm_ga_search <- gafs(
  ames_rec, 
  data = ames_train,
  iters = 10, 
  gafsControl = lm_ga_ctrl,
  # now options to `train` for caretGA
  method = "lm",
  trControl = trainControl(method = "cv", allowParallel = FALSE)
) 
lm_ga_search
## 
## Genetic Algorithm Feature Selection
## 
## 2199 samples
## 273 predictors
## 
## Maximum generations: 10 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared, MAE
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (10 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 273):
##     Bldg_Type (100%), Bsmt_Exposure_No (100%), First_Flr_SF (100%), MS_Zoning_Residential_High_Density (100%), Neighborhood_Gilbert (100%)
##   * on average, 171.7 variables were selected (min = 150, max = 198)
## 
## In the final search using the entire training set:
##    * 155 features selected at iteration 9 including:
##      Lot_Frontage, Year_Built, Year_Remod_Add, BsmtFin_SF_2, Gr_Liv_Area ... 
##    * external performance at this iteration is
## 
##         RMSE     Rsquared          MAE 
##      0.06923      0.84659      0.04260