Sampling

The costcla.sampling module includes methods for cost-sensitive sampling. In particular:

- costcla.sampling.cost_sampling : methods for cost-proportionate sampling
- costcla.sampling.undersampling : traditional under-sampling
- costcla.sampling.smote : SMOTE method for synthetic over-sampling
costcla.sampling.cost_sampling(X, y, cost_mat, method='RejectionSampling', oversampling_norm=0.1, max_wc=97.5)

Cost-proportionate sampling.
Parameters:

- X : array-like of shape = [n_samples, n_features]
  The input samples.
- y : array-like of shape = [n_samples]
  Ground truth (correct) labels.
- cost_mat : array-like of shape = [n_samples, 4]
  Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example.
- method : str, optional (default = 'RejectionSampling')
  Method used to perform the cost-proportionate sampling, either 'RejectionSampling' or 'OverSampling'.
- oversampling_norm : float, optional (default = 0.1)
  Normalization value for the example weights wc; the smaller the value, the larger the resulting sampled data.
- max_wc : float, optional (default = 97.5)
  Outlier adjustment for the cost.
References

[R15] B. Zadrozny, J. Langford, N. Abe, “Cost-sensitive learning by cost-proportionate example weighting”, in Proceedings of the Third IEEE International Conference on Data Mining, 435-442, 2003.
[R16] C. Elkan, “The Foundations of Cost-Sensitive Learning”, in Seventeenth International Joint Conference on Artificial Intelligence, 973-978, 2001.

Examples
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.cross_validation import train_test_split
>>> from costcla.datasets import load_creditscoring1
>>> from costcla.sampling import cost_sampling, undersampling
>>> from costcla.metrics import savings_score
>>> data = load_creditscoring1()
>>> sets = train_test_split(data.data, data.target, data.cost_mat, test_size=0.33, random_state=0)
>>> X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = sets
>>> X_cps_o, y_cps_o, cost_mat_cps_o = cost_sampling(X_train, y_train, cost_mat_train, method='OverSampling')
>>> X_cps_r, y_cps_r, cost_mat_cps_r = cost_sampling(X_train, y_train, cost_mat_train, method='RejectionSampling')
>>> X_u, y_u, cost_mat_u = undersampling(X_train, y_train, cost_mat_train)
>>> y_pred_test_rf = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
>>> y_pred_test_rf_cps_o = RandomForestClassifier(random_state=0).fit(X_cps_o, y_cps_o).predict(X_test)
>>> y_pred_test_rf_cps_r = RandomForestClassifier(random_state=0).fit(X_cps_r, y_cps_r).predict(X_test)
>>> y_pred_test_rf_u = RandomForestClassifier(random_state=0).fit(X_u, y_u).predict(X_test)
>>> # Savings using only RandomForest
>>> print(savings_score(y_test, y_pred_test_rf, cost_mat_test))
0.12454256594
>>> # Savings using RandomForest with cost-proportionate over-sampling
>>> print(savings_score(y_test, y_pred_test_rf_cps_o, cost_mat_test))
0.192480226286
>>> # Savings using RandomForest with cost-proportionate rejection-sampling
>>> print(savings_score(y_test, y_pred_test_rf_cps_r, cost_mat_test))
0.465830173459
>>> # Savings using RandomForest with under-sampling
>>> print(savings_score(y_test, y_pred_test_rf_u, cost_mat_test))
0.466630646543
>>> # Size of each training set
>>> print(X_train.shape[0], X_cps_o.shape[0], X_cps_r.shape[0], X_u.shape[0])
75653 109975 8690 10191
>>> # Percentage of positives in each training set
>>> print(y_train.mean(), y_cps_o.mean(), y_cps_r.mean(), y_u.mean())
0.0668182358928 0.358054103205 0.436939010357 0.49602590521
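For intuition, the rejection-sampling idea of [R15] can be sketched directly: each example is kept with probability proportional to its misclassification cost, so a standard classifier trained on the kept subset approximately minimizes expected cost. The sketch below is illustrative only; the helper name rejection_sample and the choice of weights read from cost_mat are assumptions, not the library's implementation.

import numpy as np

def rejection_sample(X, y, cost_mat, random_state=0):
    # Illustrative sketch of cost-proportionate rejection sampling [R15];
    # not the costcla implementation.
    rng = np.random.RandomState(random_state)
    # Per-example misclassification cost: cost of a false positive for negatives,
    # cost of a false negative for positives (cost_mat columns: FP, FN, TP, TN).
    w = np.where(y == 1, cost_mat[:, 1], cost_mat[:, 0])
    # Keep each example with probability w / max(w).
    keep = rng.uniform(size=w.shape[0]) <= w / w.max()
    return X[keep], y[keep], cost_mat[keep]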
costcla.sampling.undersampling(X, y, cost_mat=None, per=0.5)

Under-sampling.
Parameters:

- X : array-like of shape = [n_samples, n_features]
  The input samples.
- y : array-like of shape = [n_samples]
  Ground truth (correct) labels.
- cost_mat : array-like of shape = [n_samples, 4], optional (default=None)
  Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example.
- per : float, optional (default = 0.5)
  Percentage of the minority class in the under-sampled data.
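undersampling has no doctest of its own, but its use mirrors the cost_sampling example above; a brief sketch on the bundled credit-scoring data (output values omitted):

from costcla.datasets import load_creditscoring1
from costcla.sampling import undersampling

data = load_creditscoring1()
# Subsample the majority class until the minority class makes up roughly `per` of the data;
# the cost matrix rows of the kept examples are returned alongside X and y.
X_u, y_u, cost_mat_u = undersampling(data.data, data.target, data.cost_mat, per=0.5)
print(X_u.shape[0], y_u.mean())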
costcla.sampling.smote(X, y, cost_mat=None, per=0.5)

SMOTE: synthetic minority over-sampling technique.
Parameters:

- X : array-like of shape = [n_samples, n_features]
  The input samples.
- y : array-like of shape = [n_samples]
  Ground truth (correct) labels.
- cost_mat : array-like of shape = [n_samples, 4], optional (default=None)
  Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example.
- per : float, optional (default = 0.5)
  Percentage of the minority class in the over-sampled data.
References

[R17] N. Chawla, K. Bowyer, L. Hall, W. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, 16, 321-357, 2002.

Examples
>>> from costcla.datasets import load_creditscoring1
>>> from costcla.sampling import smote
>>> data = load_creditscoring1()
>>> data_smote, target_smote = smote(data.data, data.target, per=0.7)
>>> # Size of each training set
>>> print(data.data.shape[0], data_smote.shape[0])
112915 204307
>>> # Percentage of positives in each training set
>>> print(data.target.mean(), target_smote.mean())
0.0674489660364 0.484604051746
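For intuition, the core interpolation step of SMOTE [R17] generates each synthetic example between a minority example and one of its k nearest minority neighbors. The sketch below shows only that step and is illustrative rather than the costcla implementation; the helper name smote_sketch is an assumption, and scikit-learn's NearestNeighbors is assumed to be available.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, random_state=0):
    # X_min holds only the minority-class examples.
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1 because each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    base = rng.randint(0, X_min.shape[0], n_synthetic)        # pick base minority examples
    neigh = idx[base, rng.randint(1, k + 1, n_synthetic)]     # pick one of their k neighbors
    gap = rng.uniform(size=(n_synthetic, 1))                  # interpolation factor in [0, 1]
    # Synthetic point lies on the segment between the base example and its neighbor.
    return X_min[base] + gap * (X_min[neigh] - X_min[base])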