Sampling

The costcla.sampling module includes methods for cost-sensitive sampling. In particular:

- costcla.sampling.cost_sampling : methods for cost-proportionate sampling
- costcla.sampling.undersampling : traditional under-sampling
- costcla.sampling.smote : SMOTE method for synthetic over-sampling
costcla.sampling.cost_sampling(X, y, cost_mat, method='RejectionSampling', oversampling_norm=0.1, max_wc=97.5)

Cost-proportionate sampling.
Parameters:

- X : array-like of shape = [n_samples, n_features]
  The input samples.
- y : array-like of shape = [n_samples]
  Ground truth (correct) labels.
- cost_mat : array-like of shape = [n_samples, 4]
  Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example.
- method : str, optional (default = 'RejectionSampling')
  Method used to perform the cost-proportionate sampling, either 'RejectionSampling' or 'OverSampling'.
- oversampling_norm : float, optional (default = 0.1)
  Normalization value for the example weights wc; the smaller the value, the larger the resulting sampled data.
- max_wc : float, optional (default = 97.5)
  Outlier adjustment for the cost.
References

[R15] B. Zadrozny, J. Langford, N. Abe, “Cost-sensitive learning by cost-proportionate example weighting”, in Proceedings of the Third IEEE International Conference on Data Mining, 435-442, 2003.
[R16] C. Elkan, “The Foundations of Cost-Sensitive Learning”, in Seventeenth International Joint Conference on Artificial Intelligence, 973-978, 2001.

Examples
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.cross_validation import train_test_split
>>> from costcla.datasets import load_creditscoring1
>>> from costcla.sampling import cost_sampling, undersampling
>>> from costcla.metrics import savings_score
>>> data = load_creditscoring1()
>>> sets = train_test_split(data.data, data.target, data.cost_mat, test_size=0.33, random_state=0)
>>> X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = sets
>>> X_cps_o, y_cps_o, cost_mat_cps_o = cost_sampling(X_train, y_train, cost_mat_train, method='OverSampling')
>>> X_cps_r, y_cps_r, cost_mat_cps_r = cost_sampling(X_train, y_train, cost_mat_train, method='RejectionSampling')
>>> X_u, y_u, cost_mat_u = undersampling(X_train, y_train, cost_mat_train)
>>> y_pred_test_rf = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
>>> y_pred_test_rf_cps_o = RandomForestClassifier(random_state=0).fit(X_cps_o, y_cps_o).predict(X_test)
>>> y_pred_test_rf_cps_r = RandomForestClassifier(random_state=0).fit(X_cps_r, y_cps_r).predict(X_test)
>>> y_pred_test_rf_u = RandomForestClassifier(random_state=0).fit(X_u, y_u).predict(X_test)
>>> # Savings using only RandomForest
>>> print(savings_score(y_test, y_pred_test_rf, cost_mat_test))
0.12454256594
>>> # Savings using RandomForest with cost-proportionate over-sampling
>>> print(savings_score(y_test, y_pred_test_rf_cps_o, cost_mat_test))
0.192480226286
>>> # Savings using RandomForest with cost-proportionate rejection-sampling
>>> print(savings_score(y_test, y_pred_test_rf_cps_r, cost_mat_test))
0.465830173459
>>> # Savings using RandomForest with under-sampling
>>> print(savings_score(y_test, y_pred_test_rf_u, cost_mat_test))
0.466630646543
>>> # Size of each training set
>>> print(X_train.shape[0], X_cps_o.shape[0], X_cps_r.shape[0], X_u.shape[0])
75653 109975 8690 10191
>>> # Percentage of positives in each training set
>>> print(y_train.mean(), y_cps_o.mean(), y_cps_r.mean(), y_u.mean())
0.0668182358928 0.358054103205 0.436939010357 0.49602590521
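For intuition, the rejection-sampling idea of [R15] can be sketched directly: each example is kept with probability proportional to its misclassification cost, so a standard classifier trained on the kept subset approximately minimizes expected cost. The sketch below is illustrative only; the helper name rejection_sample and the choice of weights read from cost_mat are assumptions, not the library's implementation.

import numpy as np

def rejection_sample(X, y, cost_mat, random_state=0):
    # Illustrative sketch of cost-proportionate rejection sampling [R15];
    # not the costcla implementation.
    rng = np.random.RandomState(random_state)
    # Per-example misclassification cost: cost of a false positive for negatives,
    # cost of a false negative for positives (cost_mat columns: FP, FN, TP, TN).
    w = np.where(y == 1, cost_mat[:, 1], cost_mat[:, 0])
    # Keep each example with probability w / max(w).
    keep = rng.uniform(size=w.shape[0]) <= w / w.max()
    return X[keep], y[keep], cost_mat[keep]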
costcla.sampling.undersampling(X, y, cost_mat=None, per=0.5)

Under-sampling.
Parameters:

- X : array-like of shape = [n_samples, n_features]
  The input samples.
- y : array-like of shape = [n_samples]
  Ground truth (correct) labels.
- cost_mat : array-like of shape = [n_samples, 4], optional (default=None)
  Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example.
- per : float, optional (default = 0.5)
  Percentage of the minority class in the under-sampled data.
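undersampling has no doctest of its own, but its use mirrors the cost_sampling example above; a brief sketch on the bundled credit-scoring data (output values omitted):

from costcla.datasets import load_creditscoring1
from costcla.sampling import undersampling

data = load_creditscoring1()
# Subsample the majority class until the minority class makes up roughly `per` of the data;
# the cost matrix rows of the kept examples are returned alongside X and y.
X_u, y_u, cost_mat_u = undersampling(data.data, data.target, data.cost_mat, per=0.5)
print(X_u.shape[0], y_u.mean())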
costcla.sampling.smote(X, y, cost_mat=None, per=0.5)

SMOTE: synthetic minority over-sampling technique.
Parameters:

- X : array-like of shape = [n_samples, n_features]
  The input samples.
- y : array-like of shape = [n_samples]
  Ground truth (correct) labels.
- cost_mat : array-like of shape = [n_samples, 4], optional (default=None)
  Cost matrix of the classification problem, where the columns represent the costs of false positives, false negatives, true positives, and true negatives for each example.
- per : float, optional (default = 0.5)
  Percentage of the minority class in the over-sampled data.
References

[R17] N. Chawla, K. Bowyer, L. Hall, W. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, 16, 321-357, 2002.

Examples
>>> from costcla.datasets import load_creditscoring1
>>> from costcla.sampling import smote
>>> data = load_creditscoring1()
>>> data_smote, target_smote = smote(data.data, data.target, per=0.7)
>>> # Size of each training set
>>> print(data.data.shape[0], data_smote.shape[0])
112915 204307
>>> # Percentage of positives in each training set
>>> print(data.target.mean(), target_smote.mean())
0.0674489660364 0.484604051746
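For intuition, the core interpolation step of SMOTE [R17] generates each synthetic example between a minority example and one of its k nearest minority neighbors. The sketch below shows only that step and is illustrative rather than the costcla implementation; the helper name smote_sketch is an assumption, and scikit-learn's NearestNeighbors is assumed to be available.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, random_state=0):
    # X_min holds only the minority-class examples.
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1 because each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    base = rng.randint(0, X_min.shape[0], n_synthetic)        # pick base minority examples
    neigh = idx[base, rng.randint(1, k + 1, n_synthetic)]     # pick one of their k neighbors
    gap = rng.uniform(size=(n_synthetic, 1))                  # interpolation factor in [0, 1]
    # Synthetic point lies on the segment between the base example and its neighbor.
    return X_min[base] + gap * (X_min[neigh] - X_min[base])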