CostSensitiveDecisionTreeClassifier

class costcla.models.CostSensitiveDecisionTreeClassifier(criterion='direct_cost', criterion_weight=False, num_pct=100, max_features=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_gain=0.001, pruned=True)[source]

An example-dependent cost-sensitive binary decision tree classifier.

Parameters:

criterion : string, optional (default=”direct_cost”)

The function to measure the quality of a split. Supported criteria are “direct_cost” for the Direct Cost impurity measure, “pi_cost”, “gini_cost”, and “entropy_cost”.

criterion_weight : bool, optional (default=False)

Whether or not to weight the gain according to the population distribution.

num_pct : int, optional (default=100)

Number of percentiles to evaluate the splits for each feature.

splitter : string, optional (default=”best”)

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

max_features : int, float, string or None, optional (default=None)

The number of features to consider when looking for the best split:
  • If int, then consider max_features features at each split.
  • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
  • If “auto”, then max_features=sqrt(n_features).
  • If “sqrt”, then max_features=sqrt(n_features).
  • If “log2”, then max_features=log2(n_features).
  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
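
For illustration, a minimal doctest-style sketch of how these settings translate into a feature count, assuming a hypothetical n_features of 20 (the value is not taken from this documentation):

>>> import numpy as np
>>> n_features = 20
>>> int(0.5 * n_features)     # float: int(max_features * n_features)
10
>>> int(np.sqrt(n_features))  # "auto" / "sqrt": sqrt(n_features)
4
>>> int(np.log2(n_features))  # "log2": log2(n_features)
4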

max_depth : int or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split : int, optional (default=2)

The minimum number of samples required to split an internal node.

min_samples_leaf : int, optional (default=1)

The minimum number of samples required to be at a leaf node.

min_gain : float, optional (default=0.001)

The minimum gain that a split must produce in order to be taken into account.

pruned : bool, optional (default=True)

Whether or not to prune the decision tree using cost-based pruning.
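
For illustration, a minimal sketch of how these parameters can be combined at construction time; the specific values are arbitrary choices, not recommendations from this documentation:

>>> from costcla.models import CostSensitiveDecisionTreeClassifier
>>> f = CostSensitiveDecisionTreeClassifier(criterion="gini_cost", criterion_weight=True, num_pct=50, max_depth=5, min_samples_leaf=10, pruned=True)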

See also

sklearn.tree.DecisionTreeClassifier

References

[R3] Correa Bahnsen, A., Aouada, D., & Ottersten, B., “Example-Dependent Cost-Sensitive Decision Trees”, Expert Systems with Applications, 42(19), 6609–6619, 2015, http://doi.org/10.1016/j.eswa.2015.04.042

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>> from costcla.datasets import load_creditscoring1
>>> from costcla.models import CostSensitiveDecisionTreeClassifier
>>> from costcla.metrics import savings_score
>>> data = load_creditscoring1()
>>> sets = train_test_split(data.data, data.target, data.cost_mat, test_size=0.33, random_state=0)
>>> X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = sets
>>> y_pred_test_rf = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
>>> f = CostSensitiveDecisionTreeClassifier()
>>> y_pred_test_csdt = f.fit(X_train, y_train, cost_mat_train).predict(X_test)
>>> # Savings using only RandomForest
>>> print(savings_score(y_test, y_pred_test_rf, cost_mat_test))
0.12454256594
>>> # Savings using CSDecisionTree
>>> print(savings_score(y_test, y_pred_test_csdt, cost_mat_test))
0.481916135529

Attributes

tree_ (Tree object) The underlying Tree object.

Methods

fit
get_params
predict
predict_proba
pruning
set_param
set_params
fit(X, y, cost_mat, check_input=False)[source]

Build an example-dependent cost-sensitive decision tree from the training set (X, y, cost_mat).

Parameters:

X : array-like of shape = [n_samples, n_features]

The input samples.

y : array indicator matrix

Ground truth (correct) labels.

cost_mat : array-like of shape = [n_samples, 4]

Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives, for each example (see the sketch below).

check_input : boolean, (default=False)

Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.

Returns:

self : object

Returns self.
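
For illustration, a minimal sketch of building a cost_mat for three hypothetical examples; the per-example cost values are invented, and only the column order (false positives, false negatives, true positives, true negatives) comes from the description above:

>>> import numpy as np
>>> fp, fn = np.full(3, 4.0), np.array([10.0, 25.0, 7.5])
>>> tp, tn = np.zeros(3), np.zeros(3)
>>> cost_mat = np.column_stack([fp, fn, tp, tn])
>>> cost_mat.shape
(3, 4)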

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

predict(X)[source]

Predict class of X.

The predicted class for each sample in X is returned.

Parameters:

X : array-like of shape = [n_samples, n_features]

The input samples.

Returns:

y : array of shape = [n_samples]

The predicted classes.

predict_proba(X)[source]

Predict class probabilities of the input samples X.

Parameters:

X : array-like of shape = [n_samples, n_features]

The input samples.

Returns:

prob : array of shape = [n_samples, 2]

The class probabilities of the input samples.
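
As a usage sketch, assuming the second column holds the positive-class probability (the scikit-learn convention; not stated explicitly here), the probabilities can be thresholded manually, with f and X_test taken from the example above:

>>> prob = f.predict_proba(X_test)
>>> y_pred_manual = (prob[:, 1] >= 0.5).astype(int)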

pruning(X, y, cost_mat)[source]

Function that prunes the decision tree using cost-based pruning.

Parameters:

X : array-like of shape = [n_samples, n_features]

The input samples.

y : array indicator matrix

Ground truth (correct) labels.

cost_mat : array-like of shape = [n_samples, 4]

Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives, for each example.
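
A usage sketch, assuming the tree was grown with pruned=False and that a separate validation split is available; X_val, y_val, and cost_mat_val are hypothetical names, not from the example above:

>>> f = CostSensitiveDecisionTreeClassifier(pruned=False)
>>> f = f.fit(X_train, y_train, cost_mat_train)
>>> f.pruning(X_val, y_val, cost_mat_val)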

set_param(attribute, value)[source]
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object (see the sketch below).

Returns:

self
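
For illustration, a minimal sketch of the nested parameter form, using a hypothetical scikit-learn Pipeline that wraps this estimator under the name "csdt":

>>> from sklearn.pipeline import Pipeline
>>> from costcla.models import CostSensitiveDecisionTreeClassifier
>>> pipe = Pipeline([("csdt", CostSensitiveDecisionTreeClassifier())])
>>> pipe = pipe.set_params(csdt__max_depth=4)  # nested form: <component>__<parameter>
>>> f = CostSensitiveDecisionTreeClassifier().set_params(max_depth=4)  # simple estimator form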