Fit a Cubist model

This function fits the rule-based model described in Quinlan (1992) (aka M5) with additional corrections based on nearest neighbors in the training set, as described in Quinlan (1993a).

# Default S3 method
cubist(x, y, committees = 1, control = cubistControl(), weights = NULL, ...)

Arguments

x: a matrix or data frame of predictor variables. Missing data are allowed but (at this time) only numeric, character and factor values are allowed. Must have column names.
y: a numeric vector of outcomes
committees: an integer: how many committee models (e.g., boosting iterations) should be used?
control: options that control details of the cubist algorithm. See cubistControl()
weights: an optional vector of case weights (the same length as y) for how much each instance should contribute to the model fit. From the RuleQuest website: "The relative weight assigned to each case is its value of this attribute divided by the average value; if the value is undefined, not applicable, or is less than or equal to zero, the case's relative weight is set to 1."
...: optional arguments to pass (not currently used)
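
The weighting rule quoted for the weights argument can be illustrated with a few lines of base R. This is only a sketch of the arithmetic in the quotation, not the code that the package or the underlying C library runs. The helper name relative_weights is hypothetical, and taking the average over only the valid (non-missing, positive) weights is an assumption.

```r
# Hypothetical helper illustrating the quoted rule; not part of the Cubist package.
relative_weights <- function(w) {
  valid <- !is.na(w) & w > 0
  # Assumption: the average is taken over the valid weights only.
  rel <- w / mean(w[valid])
  # Undefined or non-positive weights fall back to a relative weight of 1.
  rel[!valid] <- 1
  rel
}

relative_weights(c(1, 2, 3, NA, -5))
#> [1] 0.5 1.0 1.5 1.0 1.0
```

Under this reading, a case with twice the average weight counts twice as much as an average case, while invalid entries are treated as average cases.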

Value

an object of class cubist with elements:

data, names, model: character strings that correspond to their counterparts for the command-line program available from RuleQuest
output: basic cubist output captured from the C code, including the rules, their terminal models and variable usage statistics
control: a list of control parameters passed in by the user
composite, neighbors, committees: copies of the values that the user passed in for these arguments
dims: the output of dim(x)
splits: information about the variables and values used in the rule conditions
call: the function call
coefs: a data frame of regression coefficients for each rule within each committee
vars: a list with elements all and used listing the predictors passed into the function and used by any rule or model
fitted.values: a numeric vector of predictions on the training set.
usage: a data frame with the percent of models where each variable was used. See summary.cubist() for a discussion.
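
As a rough illustration of how these elements might be inspected, the sketch below fits a small model and looks at a few of the components listed above. The choice of the built-in mtcars data and the object name fit are assumptions made for the example; the printed contents depend on the fitted rules and are not shown.

```r
library(Cubist)

# mtcars ships with base R; mpg is the outcome, the other columns are predictors.
fit <- cubist(x = mtcars[, -1], y = mtcars$mpg)

fit$dims          # dimensions of x (rows, columns)
head(fit$coefs)   # regression coefficients for each rule within each committee
fit$usage         # how often each variable was used
fit$vars$used     # predictors that appear in any rule or model
```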

Details

Cubist is a prediction-oriented regression model that combines the ideas in Quinlan (1992) and Quinlan (1993a).

Although it initially creates a tree structure, it collapses each path through the tree into a rule. A regression model is fit for each rule using the data subset that the rule defines. The set of rules is then pruned or possibly combined, and the candidate variables for the linear regression models are the predictors that were used in the parts of the rule that were pruned away. This part of the algorithm is consistent with the "M5" or Model Tree approach.

Cubist generalizes this model to add boosting (when committees > 1) and instance based corrections (see predict.cubist()). The number of instances is set at prediction time by the user and is not needed for model building.
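
Because the number of instances is chosen at prediction time, the same fitted model can be used with and without the correction. The following is a minimal sketch, assuming the mtcars data and arbitrary settings (5 committees, 5 neighbors) picked only for illustration; the neighbors argument belongs to predict.cubist().

```r
library(Cubist)

# Boosted fit: committees > 1. No neighbor setting is needed here.
fit <- cubist(x = mtcars[, -1], y = mtcars$mpg, committees = 5)

# Rule-based predictions only.
p_rules <- predict(fit, newdata = mtcars[, -1])

# Same model, with predictions adjusted using 5 nearest training-set neighbors.
p_nn <- predict(fit, newdata = mtcars[, -1], neighbors = 5)

head(cbind(rules = p_rules, neighbors = p_nn))
```

Predicting on the training rows, as done here, only demonstrates the interface; it says nothing about out-of-sample accuracy.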

This function links R to the GPL version of the C code given on the RuleQuest website.

The RuleQuest code differentiates missing values from values that are not applicable. Currently, this package does not make such a distinction (all values are treated as missing). As a result, the output can differ slightly from that of the RuleQuest program.

To tune the cubist model over the number of committees and neighbors, the train() function in the caret package has bindings to find appropriate settings of these parameters.
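
A tuning run along these lines might look like the sketch below. The grid values, the 5-fold cross-validation, the seed, and the use of mtcars are all assumptions chosen to keep the example small; they are not recommended settings.

```r
library(caret)

set.seed(1)

# Candidate values for the two tuning parameters (illustrative only).
grid <- expand.grid(committees = c(1, 10), neighbors = c(0, 5))

tuned <- train(
  x = mtcars[, -1],
  y = mtcars$mpg,
  method = "cubist",
  tuneGrid = grid,
  trControl = trainControl(method = "cv", number = 5)
)

# The combination of committees and neighbors selected by resampling.
tuned$bestTune
```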

References

Quinlan. Learning with continuous classes. Proceedings of the 5th Australian Joint Conference On Artificial Intelligence (1992) pp. 343-348

Quinlan. Combining instance-based and model-based learning. Proceedings of the Tenth International Conference on Machine Learning (1993a) pp. 236-243

Quinlan. C4.5: Programs For Machine Learning (1993b) Morgan Kaufmann Publishers Inc. San Francisco, CA

Wang and Witten. Inducing model trees for continuous classes. Proceedings of the Ninth European Conference on Machine Learning (1997) pp. 128-137

http://rulequest.com/cubist-info.html

Author

R code by Max Kuhn, original C sources by R Quinlan and modifications by Steve Weston

Examples


library(mlbench)
data(BostonHousing)

## 1 committee, so just an M5 fit:
mod1 <- cubist(x = BostonHousing[, -14], y = BostonHousing$medv)
mod1
#> 
#> Call:
#> cubist.default(x = BostonHousing[, -14], y = BostonHousing$medv)
#> 
#> Number of samples: 506 
#> Number of predictors: 13 
#> 
#> Number of committees: 1 
#> Number of rules: 4 
#> 

## Now with 10 committees
mod2 <- cubist(x = BostonHousing[, -14], y = BostonHousing$medv, committees = 10)
mod2
#> 
#> Call:
#> cubist.default(x = BostonHousing[, -14], y = BostonHousing$medv, committees
#>  = 10)
#> 
#> Number of samples: 506 
#> Number of predictors: 13 
#> 
#> Number of committees: 10 
#> Number of rules per committee: 4, 6, 4, 6, 6, 7, 7, 7, 4, 5 
#>