Fit classification tree models or rule-based models using Quinlan's C5.0 algorithm
# S3 method for default
C5.0(
x,
y,
trials = 1,
rules = FALSE,
weights = NULL,
control = C5.0Control(),
costs = NULL,
...
)
# S3 method for formula
C5.0(formula, data, weights, subset, na.action = na.pass, ...)
x: a data frame or matrix of predictors.
y: a factor vector with 2 or more levels.
trials: an integer specifying the number of boosting iterations. A value of one indicates that a single model is used.
rules: a logical: should the tree be decomposed into a rule-based model?
weights: an optional numeric vector of case weights. Note that the data used for the case weights will not be used as a splitting variable in the model (see http://www.rulequest.com/see5-win.html#CASEWEIGHT for Quinlan's notes on case weights).
control: a list of control parameters; see C5.0Control().
costs: a matrix of costs associated with the possible errors. The matrix should have C columns and rows where C is the number of class levels.
...: other options to pass into the function (not currently used with default method).
formula: a formula, with a response and at least one predictor.
data: an optional data frame in which to interpret the variables named in the formula.
subset: optional expression saying that only a subset of the rows of the data should be used in the fit.
na.action: a function which indicates what should happen when the data contain NA. The default is to include missing values since the model can accommodate them.
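The effect of the na.action default can be seen with base R alone. The sketch below uses a small made-up data frame (toy, purely hypothetical) and model.frame() to contrast na.pass with na.omit; it illustrates what the two functions do to the rows and is not a view into how C5.0 processes the data internally.

```r
# Toy data with one missing predictor value (hypothetical, for illustration only)
toy <- data.frame(
  y = factor(c("a", "b", "a", "b")),
  x = c(1.5, NA, 3.2, 4.8)
)

# na.pass (the default for the formula method) keeps every row, NA included
kept <- model.frame(y ~ x, data = toy, na.action = na.pass)

# na.omit drops any row that contains an NA
dropped <- model.frame(y ~ x, data = toy, na.action = na.omit)

nrow(kept)     # 4
nrow(dropped)  # 3
```

Supplying na.action = na.omit to the formula method would therefore discard incomplete rows before the fit, which is usually unnecessary given that the model can accommodate missing values.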
An object of class C5.0 with elements:
a parsed version of the boosting table(s) shown in the output
the function call
not currently supported.
an echo of the specifications from C5.0Control()
the text version of the cost matrix (or "")
an echo of the model argument
original dimensions of the predictor matrix or data frame
a character vector of factor levels for the outcome
a string version of the names file
a string version of the command line output
a character vector of predictor names
a logical for rules
a character version of the rules file
an integer vector of the tree/rule size (or sizes in the case of boosting).
a string version of the tree file
a named vector with elements Requested (an echo of the function call) and Actual (how many the model used)
This model extends the C4.5 classification algorithms described in Quinlan (1993). The details of the extensions are largely undocumented. The model can take the form of a full decision tree or a collection of rules (or boosted versions of either).
When using the formula method, factors and other classes are preserved (i.e. dummy variables are not automatically created). This particular model handles non-numeric data of some types (such as character, factor and ordered data).
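To make the distinction concrete, the following base R sketch contrasts a model frame, which keeps a factor as a single column, with a design matrix, which expands it into dummy variables. The data frame toy and its columns are hypothetical, and the snippet only illustrates the two representations; it does not show C5.0 internals.

```r
# Hypothetical data: one factor predictor and one numeric predictor
toy <- data.frame(
  churn = factor(c("yes", "no", "no", "yes")),
  plan  = factor(c("basic", "plus", "basic", "premium")),
  calls = c(3, 1, 0, 5)
)

# A model frame keeps `plan` as one factor column (no dummy variables)
mf <- model.frame(churn ~ ., data = toy)
is.factor(mf$plan)  # TRUE

# A design matrix expands `plan` into indicator columns such as "planplus"
mm <- model.matrix(churn ~ ., data = toy)
colnames(mm)
```

With the formula method, predictors reach the model in the first form, so a factor is treated as a single categorical attribute rather than as several 0/1 columns.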
The cost matrix should be C x C, where C is the number of classes. Diagonal elements are ignored. Columns should correspond to the true classes and rows are the predicted classes. For example, if C = 3 with classes Red, Blue and Green (in that order), a value of 5 in the (2,3) element of the matrix would indicate that the cost of predicting a Green sample as Blue is five times the usual value (of one). Note that when costs are used, class probabilities cannot be generated using predict.C5.0().
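The Red/Blue/Green example can be written out as a short base R sketch. The object names (classes, cost_mat) are hypothetical, and labelling the rows and columns with the class levels is an assumption made here to keep the orientation explicit.

```r
classes <- c("Red", "Blue", "Green")

# Start from the usual cost of one for every error
# Rows are predicted classes, columns are true classes
cost_mat <- matrix(1, nrow = 3, ncol = 3,
                   dimnames = list(classes, classes))

# Diagonal elements are ignored, so set them to zero for readability
diag(cost_mat) <- 0

# Element (2,3): predicting Blue (row 2) for a truly Green sample (column 3)
cost_mat["Blue", "Green"] <- 5

cost_mat[2, 3]  # 5
```

A matrix built this way would be supplied through the costs argument, keeping in mind that class probabilities are then unavailable from predict.C5.0().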
Internally, the code will attempt to halt boosting if it appears to be ineffective. For this reason, the value of trials may be different from what the model actually produced. There is an option to turn this off in C5.0Control().
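A minimal sketch of how the requested and actual number of boosting iterations might be compared, using the built-in iris data. It assumes that the fitted object stores the named Requested/Actual vector in an element called trials and that the relevant control option is earlyStopping; both names should be checked against the installed version of the package.

```r
library(C50)

# Request 10 boosting iterations on the built-in iris data
boosted <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 10)

# Named vector: Requested echoes the call, Actual is what the model used
# (assumed to live in the `trials` element of the fitted object)
boosted$trials

# Assumption: `earlyStopping` is the C5.0Control() option that disables
# the automatic halt
ctrl <- C5.0Control(earlyStopping = FALSE)
forced <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 10, control = ctrl)
forced$trials
```

Comparing the Actual entries of the two fits shows whether boosting was halted early for a given data set.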
The command line version currently supports more data types than the R port. Currently, numeric, factor and ordered factors are allowed as predictors.
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
library(modeldata)
data(mlc_churn)
treeModel <- C5.0(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
treeModel
#>
#> Call:
#> C5.0.default(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
#>
#> Classification Tree
#> Number of samples: 3333
#> Number of predictors: 19
#>
#> Tree size: 27
#>
#> Non-standard options: attempt to group attributes
#>
summary(treeModel)
#>
#> Call:
#> C5.0.default(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
#>
#>
#> C5.0 [Release 2.07 GPL Edition] Wed Feb 8 19:59:17 2023
#> -------------------------------
#>
#> Class specified by attribute `outcome'
#>
#> Read 3333 cases (20 attributes) from undefined.data
#>
#> Decision tree:
#>
#> total_day_minutes > 264.4:
#> :...voice_mail_plan = yes:
#> : :...international_plan = no: no (45/1)
#> : : international_plan = yes: yes (8/3)
#> : voice_mail_plan = no:
#> : :...total_eve_minutes > 187.7:
#> : :...total_night_minutes > 126.9: yes (94/1)
#> : : total_night_minutes <= 126.9:
#> : : :...total_day_minutes <= 277: no (4)
#> : : total_day_minutes > 277: yes (3)
#> : total_eve_minutes <= 187.7:
#> : :...total_eve_charge <= 12.26: no (15/1)
#> : total_eve_charge > 12.26:
#> : :...total_day_minutes <= 277:
#> : :...total_night_minutes <= 224.8: no (13)
#> : : total_night_minutes > 224.8: yes (5/1)
#> : total_day_minutes > 277:
#> : :...total_night_minutes > 151.9: yes (18)
#> : total_night_minutes <= 151.9:
#> : :...account_length <= 123: no (4)
#> : account_length > 123: yes (2)
#> total_day_minutes <= 264.4:
#> :...number_customer_service_calls > 3:
#> :...total_day_minutes <= 160.2:
#> : :...total_eve_charge <= 19.83: yes (79/3)
#> : : total_eve_charge > 19.83:
#> : : :...total_day_minutes <= 120.5: yes (10)
#> : : total_day_minutes > 120.5: no (13/3)
#> : total_day_minutes > 160.2:
#> : :...total_eve_charge > 12.05: no (130/24)
#> : total_eve_charge <= 12.05:
#> : :...total_eve_calls <= 125: yes (16/2)
#> : total_eve_calls > 125: no (3)
#> number_customer_service_calls <= 3:
#> :...international_plan = yes:
#> :...total_intl_calls <= 2: yes (51)
#> : total_intl_calls > 2:
#> : :...total_intl_minutes <= 13.1: no (173/7)
#> : total_intl_minutes > 13.1: yes (43)
#> international_plan = no:
#> :...total_day_minutes <= 223.2: no (2221/60)
#> total_day_minutes > 223.2:
#> :...total_eve_charge <= 20.5: no (295/22)
#> total_eve_charge > 20.5:
#> :...voice_mail_plan = yes: no (20)
#> voice_mail_plan = no:
#> :...total_night_minutes > 174.2: yes (50/8)
#> total_night_minutes <= 174.2:
#> :...total_day_minutes <= 246.6: no (12)
#> total_day_minutes > 246.6:
#> :...total_day_charge <= 43.33: yes (4)
#> total_day_charge > 43.33: no (2)
#>
#>
#> Evaluation on training data (3333 cases):
#>
#> Decision Tree
#> ----------------
#> Size Errors
#>
#> 27 136( 4.1%) <<
#>
#>
#> (a) (b) <-classified as
#> ---- ----
#> 365 118 (a): class yes
#> 18 2832 (b): class no
#>
#>
#> Attribute usage:
#>
#> 100.00% total_day_minutes
#> 93.67% number_customer_service_calls
#> 87.73% international_plan
#> 20.73% total_eve_charge
#> 8.97% voice_mail_plan
#> 8.01% total_intl_calls
#> 6.48% total_intl_minutes
#> 6.33% total_night_minutes
#> 4.74% total_eve_minutes
#> 0.57% total_eve_calls
#> 0.18% account_length
#> 0.18% total_day_charge
#>
#>
#> Time: 0.0 secs
#>
ruleModel <- C5.0(churn ~ ., data = mlc_churn[1:3333, ], rules = TRUE)
ruleModel
#>
#> Call:
#> C5.0.formula(formula = churn ~ ., data = mlc_churn[1:3333, ], rules = TRUE)
#>
#> Rule-Based Model
#> Number of samples: 3333
#> Number of predictors: 19
#>
#> Number of Rules: 19
#>
#> Non-standard options: attempt to group attributes
#>
summary(ruleModel)
#>
#> Call:
#> C5.0.formula(formula = churn ~ ., data = mlc_churn[1:3333, ], rules = TRUE)
#>
#>
#> C5.0 [Release 2.07 GPL Edition] Wed Feb 8 19:59:17 2023
#> -------------------------------
#>
#> Class specified by attribute `outcome'
#>
#> Read 3333 cases (20 attributes) from undefined.data
#>
#> Rules:
#>
#> Rule 1: (60, lift 6.8)
#> international_plan = yes
#> total_intl_calls <= 2
#> -> class yes [0.984]
#>
#> Rule 2: (57, lift 6.8)
#> international_plan = yes
#> total_intl_minutes > 13.1
#> -> class yes [0.983]
#>
#> Rule 3: (32, lift 6.7)
#> total_day_minutes <= 120.5
#> number_customer_service_calls > 3
#> -> class yes [0.971]
#>
#> Rule 4: (79/3, lift 6.6)
#> total_day_minutes <= 160.2
#> total_eve_charge <= 19.83
#> number_customer_service_calls > 3
#> -> class yes [0.951]
#>
#> Rule 5: (43/2, lift 6.4)
#> international_plan = no
#> voice_mail_plan = no
#> total_day_minutes > 246.6
#> total_eve_charge > 20.5
#> -> class yes [0.933]
#>
#> Rule 6: (28/2, lift 6.2)
#> total_day_minutes <= 264.4
#> total_eve_calls <= 125
#> total_eve_charge <= 12.05
#> number_customer_service_calls > 3
#> -> class yes [0.900]
#>
#> Rule 7: (78/8, lift 6.1)
#> voice_mail_plan = no
#> total_day_minutes > 223.2
#> total_eve_charge > 20.5
#> total_night_minutes > 174.2
#> -> class yes [0.888]
#>
#> Rule 8: (114/24, lift 5.4)
#> voice_mail_plan = no
#> total_day_minutes > 223.2
#> total_eve_charge > 20.5
#> -> class yes [0.784]
#>
#> Rule 9: (152/58, lift 4.3)
#> total_day_minutes > 223.2
#> total_eve_charge > 20.5
#> -> class yes [0.617]
#>
#> Rule 10: (211/84, lift 4.1)
#> total_day_minutes > 264.4
#> -> class yes [0.601]
#>
#> Rule 11: (2221/60, lift 1.1)
#> international_plan = no
#> total_day_minutes <= 223.2
#> number_customer_service_calls <= 3
#> -> class no [0.973]
#>
#> Rule 12: (768/20, lift 1.1)
#> international_plan = no
#> voice_mail_plan = yes
#> number_customer_service_calls <= 3
#> -> class no [0.973]
#>
#> Rule 13: (140/5, lift 1.1)
#> account_length <= 123
#> total_eve_minutes <= 187.7
#> total_night_minutes <= 151.9
#> -> class no [0.958]
#>
#> Rule 14: (45/1, lift 1.1)
#> international_plan = no
#> voice_mail_plan = yes
#> total_day_minutes > 264.4
#> -> class no [0.957]
#>
#> Rule 15: (1972/87, lift 1.1)
#> total_day_minutes <= 264.4
#> total_intl_minutes <= 13.1
#> total_intl_calls > 2
#> number_customer_service_calls <= 3
#> -> class no [0.955]
#>
#> Rule 16: (197/9, lift 1.1)
#> total_day_minutes > 120.5
#> total_day_minutes <= 160.2
#> total_eve_charge > 19.83
#> -> class no [0.950]
#>
#> Rule 17: (155/10, lift 1.1)
#> voice_mail_plan = no
#> total_day_minutes <= 277
#> total_night_minutes <= 126.9
#> -> class no [0.930]
#>
#> Rule 18: (1675/185, lift 1.0)
#> total_day_minutes > 160.2
#> total_day_minutes <= 264.4
#> total_eve_charge > 12.05
#> -> class no [0.889]
#>
#> Rule 19: (434/49, lift 1.0)
#> total_eve_charge <= 12.26
#> -> class no [0.885]
#>
#> Default class: no
#>
#>
#> Evaluation on training data (3333 cases):
#>
#> Rules
#> ----------------
#> No Errors
#>
#> 19 146( 4.4%) <<
#>
#>
#> (a) (b) <-classified as
#> ---- ----
#> 371 112 (a): class yes
#> 34 2816 (b): class no
#>
#>
#> Attribute usage:
#>
#> 98.23% total_day_minutes
#> 84.61% number_customer_service_calls
#> 75.73% international_plan
#> 71.83% total_eve_charge
#> 60.97% total_intl_calls
#> 60.88% total_intl_minutes
#> 31.02% voice_mail_plan
#> 10.11% total_night_minutes
#> 4.20% account_length
#> 4.20% total_eve_minutes
#> 0.84% total_eve_calls
#>
#>
#> Time: 0.0 secs
#>