CostSensitiveDecisionTreeClassifier¶
class costcla.models.CostSensitiveDecisionTreeClassifier(criterion='direct_cost', criterion_weight=False, num_pct=100, max_features=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_gain=0.001, pruned=True)[source]¶

An example-dependent cost-sensitive binary decision tree classifier.
Parameters: criterion : string, optional (default=”direct_cost”)
The function to measure the quality of a split. Supported criteria are “direct_cost” for the Direct Cost impurity measure, “pi_cost”, “gini_cost”, and “entropy_cost”.
criterion_weight : bool, optional (default=False)
Whether or not to weight the gain according to the population distribution.
num_pct : int, optional (default=100)
Number of percentiles to evaluate the splits for each feature.
splitter : string, optional (default=”best”)
The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
max_features : int, float, string or None, optional (default=None)
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Ignored if max_samples_leaf is not None.
min_samples_split : int, optional (default=2)
The minimum number of samples required to split an internal node.
min_samples_leaf : int, optional (default=1)
The minimum number of samples required to be at a leaf node.
min_gain : float, optional (default=0.001)
The minimum gain that a split must produce in order to be taken into account.
pruned : bool, optional (default=True)
Whether or not to prune the decision tree using cost-based pruning.
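As a quick illustration of how these constructor options combine, here is a hedged sketch (the parameter values are arbitrary choices for demonstration, not recommendations):

>>> from costcla.models import CostSensitiveDecisionTreeClassifier
>>> # "gini_cost" is one of the four supported criteria listed above;
>>> # pruned=False skips the cost-based pruning step after the tree is grown.
>>> f = CostSensitiveDecisionTreeClassifier(criterion='gini_cost',
...                                         criterion_weight=True,
...                                         max_depth=5,
...                                         min_gain=0.01,
...                                         pruned=False)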
See also
sklearn.tree.DecisionTreeClassifier
References
[R3] Correa Bahnsen, A., Aouada, D., & Ottersten, B. “Example-Dependent Cost-Sensitive Decision Trees”, Expert Systems with Applications, 42(19), 6609–6619, 2015. http://doi.org/10.1016/j.eswa.2015.04.042

Examples
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.cross_validation import train_test_split
>>> from costcla.datasets import load_creditscoring1
>>> from costcla.models import CostSensitiveDecisionTreeClassifier
>>> from costcla.metrics import savings_score
>>> data = load_creditscoring1()
>>> sets = train_test_split(data.data, data.target, data.cost_mat, test_size=0.33, random_state=0)
>>> X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = sets
>>> y_pred_test_rf = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
>>> f = CostSensitiveDecisionTreeClassifier()
>>> y_pred_test_csdt = f.fit(X_train, y_train, cost_mat_train).predict(X_test)
>>> # Savings using only RandomForest
>>> print(savings_score(y_test, y_pred_test_rf, cost_mat_test))
0.12454256594
>>> # Savings using CSDecisionTree
>>> print(savings_score(y_test, y_pred_test_csdt, cost_mat_test))
0.481916135529
Attributes
tree_ (Tree object) The underlying Tree object.

Methods
fit
get_params
predict
predict_proba
pruning
set_params
fit(X, y, cost_mat, check_input=False)[source]¶

Build an example-dependent cost-sensitive decision tree from the training set (X, y, cost_mat).
Parameters: X : array-like of shape = [n_samples, n_features]
The input samples.
y : array indicator matrix
Ground truth (correct) labels.
cost_mat : array-like of shape = [n_samples, 4]
Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example (a concrete layout is sketched below).
check_input : boolean, (default=False)
Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.
Returns: self : object
Returns self.
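To make the cost_mat layout concrete, here is a minimal sketch on synthetic data (the cost values are arbitrary illustrations; each row holds the four costs for the corresponding example, in the column order given above):

>>> import numpy as np
>>> from costcla.models import CostSensitiveDecisionTreeClassifier
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(100, 3)
>>> y = (X[:, 0] > 0.5).astype(int)
>>> # Column order: [false positive, false negative, true positive, true negative]
>>> cost_mat = np.zeros((100, 4))
>>> cost_mat[:, 0] = 1.0               # constant false positive cost
>>> cost_mat[:, 1] = 10.0 * X[:, 1]    # example-dependent false negative cost
>>> f = CostSensitiveDecisionTreeClassifier().fit(X, y, cost_mat)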
get_params(deep=True)¶

Get parameters for this estimator.
Parameters: deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params : mapping of string to any
Parameter names mapped to their values.
predict(X)[source]¶

Predict class of X.
The predicted class for each sample in X is returned.
Parameters: X : array-like of shape = [n_samples, n_features]
The input samples.
Returns: y : array of shape = [n_samples]
The predicted classes.
predict_proba(X)[source]¶

Predict class probabilities of the input samples X.
Parameters: X : array-like of shape = [n_samples, n_features]
The input samples.
Returns: prob : array of shape = [n_samples, 2]
The class probabilities of the input samples.
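A brief usage sketch (continuing the synthetic example under fit; the column convention of one column per class, ordered 0 then 1 as in scikit-learn, is an assumption here):

>>> prob = f.predict_proba(X)
>>> prob.shape
(100, 2)
>>> # Assuming scikit-learn's convention, prob[:, 1] is the estimated
>>> # probability of the positive class for each example.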
pruning(X, y, cost_mat)[source]¶

Prune the decision tree.
Parameters: X : array-like of shape = [n_samples, n_features]
The input samples.
y : array indicator matrix
Ground truth (correct) labels.
cost_mat : array-like of shape = [n_samples, 4]
Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example.
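A hedged usage sketch (reusing the synthetic X, y, cost_mat from the fit example; since fit with pruned=True already applies cost-based pruning, calling this method directly is mainly useful when the tree was grown unpruned):

>>> f = CostSensitiveDecisionTreeClassifier(pruned=False).fit(X, y, cost_mat)
>>> f.pruning(X, y, cost_mat)   # cost-based pruning of the grown tree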
set_params(**params)¶

Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self
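For illustration, a minimal sketch of both cases (the pipeline step name "csdt" is hypothetical):

>>> f = CostSensitiveDecisionTreeClassifier()
>>> f = f.set_params(max_depth=3)   # simple estimator: set a direct parameter
>>> # For a nested object such as a Pipeline, address the component's
>>> # parameter as <component>__<parameter>, e.g.:
>>> # pipe.set_params(csdt__max_depth=3)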