CostSensitiveRandomForestClassifier

class costcla.models.CostSensitiveRandomForestClassifier(n_estimators=10, combination='majority_voting', max_features='auto', n_jobs=1, verbose=False, pruned=False)

An example-dependent cost-sensitive random forest classifier.

Parameters:

n_estimators : int, optional (default=10)

The number of base estimators in the ensemble.

combination : string, optional (default=”majority_voting”)

Which combination method to use (a brief constructor sketch follows this list):
  • If “majority_voting” then combine by majority voting
  • If “weighted_voting” then combine by weighted voting using the out of bag savings as the weight for each estimator.
  • If “stacking” then a Cost Sensitive Logistic Regression is used to learn the combination.
  • If “stacking_proba” then a Cost Sensitive Logistic Regression trained with the estimated probabilities is used to learn the combination.
  • If “stacking_bmr” then a Cost Sensitive Logistic Regression is used to learn the probabilities and a BayesMinimumRisk for the prediction.
  • If “stacking_proba_bmr” then a Cost Sensitive Logistic Regression trained with the estimated probabilities is used to learn the probabilities, and a BayesMinimumRisk for the prediction.
  • If “majority_bmr” then the BayesMinimumRisk algorithm is used to make the prediction using the predicted probabilities of majority_voting
  • If “weighted_bmr” then the BayesMinimumRisk algorithm is used to make the prediction using the predicted probabilities of weighted_voting
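
As an illustration of how these options map onto the constructor, here is a minimal sketch; X_train, y_train and cost_mat_train are placeholders for data prepared as in the Examples section below:

>>> from costcla.models import CostSensitiveRandomForestClassifier
>>> # Weight each tree's vote by its out-of-bag savings
>>> f_wv = CostSensitiveRandomForestClassifier(combination='weighted_voting')
>>> # Learn the combination with a cost-sensitive logistic regression
>>> f_st = CostSensitiveRandomForestClassifier(combination='stacking')
>>> # Majority voting followed by a Bayes minimum risk decision
>>> f_bmr = CostSensitiveRandomForestClassifier(combination='majority_bmr')
>>> f_wv = f_wv.fit(X_train, y_train, cost_mat_train)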

max_features : int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split in each tree:
  • If int, then consider max_features features at each split.
  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
  • If “auto”, then max_features=sqrt(n_features).
  • If “sqrt”, then max_features=sqrt(n_features).
  • If “log2”, then max_features=log2(n_features).
  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
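
As a worked illustration of these rules, assuming for example a dataset with 25 features:

>>> import numpy as np
>>> n_features = 25
>>> int(np.sqrt(n_features))   # "auto" or "sqrt"
5
>>> int(np.log2(n_features))   # "log2"
4
>>> int(0.5 * n_features)      # a float, e.g. max_features=0.5
12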

pruned : bool, optional (default=False)

Whether or not to prune the decision trees using cost-based pruning.

n_jobs : int, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

verbose : int or bool, optional (default=False)

Controls the verbosity of the building process.

References

[R6] Correa Bahnsen, A., Aouada, D., & Ottersten, B., “Ensemble of Example-Dependent Cost-Sensitive Decision Trees”, 2015, http://arxiv.org/abs/1505.04637.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>> from costcla.datasets import load_creditscoring1
>>> from costcla.models import CostSensitiveRandomForestClassifier
>>> from costcla.metrics import savings_score
>>> data = load_creditscoring1()
>>> sets = train_test_split(data.data, data.target, data.cost_mat, test_size=0.33, random_state=0)
>>> X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = sets
>>> y_pred_test_rf = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
>>> f = CostSensitiveRandomForestClassifier()
>>> y_pred_test_csrf = f.fit(X_train, y_train, cost_mat_train).predict(X_test)
>>> # Savings using only RandomForest
>>> print(savings_score(y_test, y_pred_test_rf, cost_mat_test))
0.12454256594
>>> # Savings using CostSensitiveRandomForestClassifier
>>> print(savings_score(y_test, y_pred_test_csrf, cost_mat_test))
0.499390945808

Attributes

base_estimator_ : estimator

The base estimator from which the ensemble is grown.

estimators_ : list of estimators

The collection of fitted base estimators.

estimators_samples_ : list of arrays

The subset of drawn samples (i.e., the in-bag samples) for each base estimator.

estimators_features_ : list of arrays

The subset of drawn features for each base estimator.

Methods

fit
get_params
predict
predict_proba
score
set_params

fit(X, y, cost_mat, sample_weight=None)

Build a Bagging ensemble of estimators from the training set (X, y).

Parameters:

X : {array-like, sparse matrix} of shape = [n_samples, n_features]

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

y : array-like, shape = [n_samples]

The target values (class labels in classification, real numbers in regression).

cost_mat : array-like of shape = [n_samples, 4]

Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives and true negatives, respectively, for each example (a construction sketch follows below).

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. Note that this is supported only if the base estimator supports sample weighting.

Returns:

self : object

Returns self.
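
A minimal sketch of assembling such a cost matrix with illustrative, fixed costs (in real applications the false negative cost is typically a per-example quantity, e.g. a loan amount; all variable names here are placeholders):

>>> import numpy as np
>>> n_samples = y_train.shape[0]
>>> fp_cost = np.full(n_samples, 1.0)   # cost of a false positive
>>> fn_cost = np.full(n_samples, 5.0)   # cost of a false negative
>>> tp_cost = np.zeros(n_samples)       # correct predictions assumed cost-free
>>> tn_cost = np.zeros(n_samples)
>>> cost_mat_train = np.column_stack([fp_cost, fn_cost, tp_cost, tn_cost])
>>> f = CostSensitiveRandomForestClassifier().fit(X_train, y_train, cost_mat_train)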

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

predict(X, cost_mat=None)

Predict class for X.

The predicted class of an input sample is computed as the class with the highest mean predicted probability. If base estimators do not implement a predict_proba method, then it resorts to voting.

Parameters:

X : {array-like, sparse matrix} of shape = [n_samples, n_features]

The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

cost_mat : array-like of shape = [n_samples, 4], optional (default=None)

Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives and true negatives, respectively, for each example (see the sketch below).

Returns:

pred : array of shape = [n_samples]

The predicted classes.
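
The cost_mat argument appears to matter only for the “*_bmr” combination methods, where the BayesMinimumRisk step needs the example-dependent costs at prediction time; a minimal sketch under that assumption, reusing the variables from the Examples section:

>>> f_bmr = CostSensitiveRandomForestClassifier(combination='majority_bmr')
>>> f_bmr = f_bmr.fit(X_train, y_train, cost_mat_train)
>>> y_pred_bmr = f_bmr.predict(X_test, cost_mat_test)   # costs passed on to the Bayes minimum risk step
>>> y_pred_mv = f.predict(X_test)                       # plain majority voting needs no cost matrix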

predict_proba(X)

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the base estimators in the ensemble. If base estimators do not implement a predict_proba method, then it resorts to voting and the predicted class probabilities of an input sample represent the proportion of estimators predicting each class.

Parameters:

X : {array-like, sparse matrix} of shape = [n_samples, n_features]

The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns:

p : array of shape = [n_samples, n_classes]

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
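
Continuing the credit-scoring example above, a minimal usage sketch (the 0.3 threshold is an arbitrary illustration, not a recommendation):

>>> prob_test = f.predict_proba(X_test)   # shape (n_samples, 2): P(y=0) and P(y=1) per example
>>> y_pred_custom = (prob_test[:, 1] >= 0.3).astype(int)   # apply a hand-picked decision threshold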

score(X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X : array-like, shape = (n_samples, n_features)

Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like, shape = [n_samples], optional

Sample weights.

Returns:

score : float

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:

self
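
A minimal usage sketch:

>>> f = CostSensitiveRandomForestClassifier()
>>> f = f.set_params(n_estimators=50, combination='weighted_voting')
>>> f.get_params()['n_estimators']
50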