This function produces predicted classes or confidence values from a C5.0 model.
# S3 method for C5.0
predict(
  object,
  newdata = NULL,
  trials = object$trials["Actual"],
  type = "class",
  na.action = na.pass,
  ...
)
object: an object of class C5.0
newdata: a matrix or data frame of predictors
trials: an integer for how many boosting iterations are used for prediction. See the note below.
type: either "class" for the predicted class or "prob" for model confidence values.
na.action: when using a formula for the original model fit, how should missing values be handled?
...: other options (not currently used)
When type = "class", a factor vector is returned. When type = "prob", a matrix of confidence values is returned (one column per class).
Note that the number of trials in the object may be less than what was specified originally (unless earlyStopping = FALSE was used in C5.0Control()). If the number requested is larger than the actual number available, the maximum actual is used and a warning is issued.
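The note on trials can be illustrated with a small sketch. It assumes the C50 package is installed and uses the built-in iris data purely for convenience; the model name boosted and the cap of 3 trials are illustrative choices, not part of the original documentation.

```r
library(C50)

# Fit a boosted model, asking for up to 10 boosting iterations.
boosted <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 10)

# The number of trials actually kept may be smaller than the number requested.
boosted$trials

# Use at most 3 trials for prediction, capped at what is actually available
# so that no warning is triggered.
n_use <- min(3, boosted$trials["Actual"])
pred <- predict(boosted, iris[1:5, 1:4], trials = n_use)
pred
```

Capping the requested value with min() is only one way to avoid the warning; passing a larger number would still work, falling back to the maximum available as described above.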
Model confidence values reflect the distribution of the classes in terminal nodes or within rules.
For rule-based models (i.e. not boosted), the predicted confidence value is the confidence value from the most specific, active rule. Note that C4.5 sorts the rules, and uses the first active rule for prediction. However, the default in the original sources did not normalize the confidence values. For example, for two classes it was possible to get confidence values of (0.3815, 0.8850) or (0.0000, 0.922), which do not add to one. For rules, this code divides the values by their sum. The previous values would be converted to (0.3012, 0.6988) and (0, 1). There are also cases where no rule is activated. Here, equal values are assigned to each class.
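The normalization described above can be reproduced with a few lines of base R. This is a minimal sketch, not the package's internal code: the helper normalize_conf is hypothetical, and representing "no rule activated" as a vector of zeros is an assumption made for illustration.

```r
# Hypothetical helper: rescale raw rule confidence values so they sum to one.
normalize_conf <- function(conf) {
  total <- sum(conf)
  if (total == 0) {
    # Assumed representation of "no active rule": equal values per class.
    return(rep(1 / length(conf), length(conf)))
  }
  conf / total
}

a <- round(normalize_conf(c(0.3815, 0.8850)), 4)  # 0.3012 0.6988
b <- normalize_conf(c(0.0000, 0.922))             # 0 1
d <- normalize_conf(c(0, 0))                      # 0.5 0.5
```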
For boosting, the per-class confidence values are aggregated over all of the trees created during the boosting process and these aggregate values are normalized so that the overall per-class confidence values sum to one.
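As a rough illustration of the boosting case, the sketch below aggregates made-up per-trial confidence values for a single sample and rescales them. The numbers are hypothetical, and the use of a plain sum as the aggregate is an assumption; the text above only states that values are aggregated and then normalized.

```r
# Hypothetical per-trial confidence values for one sample:
# rows are boosting trials, columns are classes.
per_trial <- rbind(
  trial1 = c(yes = 0.20, no = 0.90),
  trial2 = c(yes = 0.60, no = 0.50),
  trial3 = c(yes = 0.10, no = 0.95)
)

# Aggregate over trials (a plain sum is assumed here), then normalize
# so the per-class values sum to one.
agg <- colSums(per_trial)
conf <- agg / sum(agg)
conf
sum(conf)
```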
When the costs argument is used in the main function, class probabilities derived from the class distribution in the terminal nodes may not be consistent with the final predicted class. For this reason, requesting class probabilities from a model using unequal costs will throw an error.
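The behavior with unequal costs might look like the following sketch. It assumes the C50 package is installed; the two-class subset of iris, the cost values, and the object names are illustrative only. The error is caught with tryCatch() so that the script keeps running.

```r
library(C50)

# Two-class subset of iris, used here only as a convenient example.
two_class <- droplevels(iris[iris$Species != "setosa", ])
lv <- levels(two_class$Species)

# Illustrative unequal cost matrix, with dimnames matching the class levels.
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2, dimnames = list(lv, lv))

cost_model <- C5.0(x = two_class[, 1:4], y = two_class$Species, costs = cost_mat)

# Class predictions are still available.
cls <- predict(cost_model, two_class[1:5, 1:4])

# Requesting confidence values is expected to fail; capture the message.
msg <- tryCatch(
  predict(cost_model, two_class[1:5, 1:4], type = "prob"),
  error = function(e) conditionMessage(e)
)
msg
```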
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
library(C50)
library(modeldata)
data(mlc_churn)
treeModel <- C5.0(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
predict(treeModel, mlc_churn[3334:3350, -20])
#> [1] no no no no no no no no no no no no no no no no no
#> Levels: yes no
predict(treeModel, mlc_churn[3334:3350, -20], type = "prob")
#>           yes        no
#> 1  0.02706792 0.9729321
#> 2  0.01114727 0.9888527
#> 3  0.02488945 0.9751106
#> 4  0.02706792 0.9729321
#> 5  0.02706792 0.9729321
#> 6  0.07481390 0.9251861
#> 7  0.02706792 0.9729321
#> 8  0.02706792 0.9729321
#> 9  0.02706792 0.9729321
#> 10 0.02706792 0.9729321
#> 11 0.02706792 0.9729321
#> 12 0.02706792 0.9729321
#> 13 0.07481390 0.9251861
#> 14 0.02706792 0.9729321
#> 15 0.02706792 0.9729321
#> 16 0.02706792 0.9729321
#> 17 0.02706792 0.9729321