Datasets

The costcla.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets. It also features some artificial data generators.

costcla.datasets.load_bankmarketing(cost_mat_parameters=None)

Load and return the bank marketing dataset (classification).

The bank marketing dataset is an easily transformable example-dependent cost-sensitive classification dataset.

Parameters:

cost_mat_parameters : Dictionary-like object, optional (default=None)

If not None, must include the keys ‘per_balance’, ‘ca’, and ‘int_r’.

Returns:

data : Bunch

Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the classification labels, ‘cost_mat’, the cost matrix of each example, ‘target_names’, the meaning of the labels, ‘feature_names’, the meaning of the features, and ‘DESCR’, the full description of the dataset.

References

[R8]	A. Correa Bahnsen, A. Stojanovic, D. Aouada, B. Ottersten, “Improving Credit Card Fraud Detection with Calibrated Probabilities”, in Proceedings of the fourteenth SIAM International Conference on Data Mining, 677-685, 2014.

Examples

Let’s say you are interested in the samples 10, 25, and 319:

>>> from costcla.datasets import load_bankmarketing
>>> data = load_bankmarketing()
>>> data.target[[10, 25, 319]]
array([0, 0, 1])
>>> data.cost_mat[[10, 25, 319]]
array([[ 1.        ,  1.66274977,  1.        ,  0.        ],
       [ 1.        ,  1.63195811,  1.        ,  0.        ],
       [ 1.        ,  5.11141597,  1.        ,  0.        ]])
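
The cost matrix shown above can be turned into a total misclassification cost for a set of predictions. The following is a minimal sketch, not part of costcla itself. It assumes the four columns hold the costs of a false positive, false negative, true positive, and true negative, in that order (this section does not state the column order), and `total_cost` is a hypothetical helper. The arrays are copied from the output above so the snippet runs without loading the dataset.

```python
import numpy as np

# Values copied from the load_bankmarketing example output above.
y_true = np.array([0, 0, 1])
cost_mat = np.array([
    [1.0, 1.66274977, 1.0, 0.0],
    [1.0, 1.63195811, 1.0, 0.0],
    [1.0, 5.11141597, 1.0, 0.0],
])


def total_cost(y_true, y_pred, cost_mat):
    """Sum the example-dependent cost of each prediction.

    Assumed column order (not confirmed by this section):
    [false positive, false negative, true positive, true negative].
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    c_fp, c_fn, c_tp, c_tn = np.asarray(cost_mat, dtype=float).T
    per_example = (
        y_true * (y_pred * c_tp + (1 - y_pred) * c_fn)
        + (1 - y_true) * (y_pred * c_fp + (1 - y_pred) * c_tn)
    )
    return float(per_example.sum())


# Two trivial baselines: predict every sample as 0, or every sample as 1.
cost_all_negative = total_cost(y_true, np.zeros(3, dtype=int), cost_mat)
cost_all_positive = total_cost(y_true, np.ones(3, dtype=int), cost_mat)
print(cost_all_negative, cost_all_positive)
```

Under the assumed column order, predicting 0 everywhere only pays the false-negative cost of the third sample, while predicting 1 everywhere pays a unit cost on every sample.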

costcla.datasets.load_creditscoring1(cost_mat_parameters=None)

Load and return the credit scoring Kaggle Credit competition dataset (classification).

The credit scoring dataset is an easily transformable example-dependent cost-sensitive classification dataset.

Parameters:

cost_mat_parameters : Dictionary-like object, optional (default=None)

If not None, must include the keys ‘int_r’, ‘int_cf’, ‘cl_max’, ‘n_term’, ‘k’, and ‘lgd’.

Returns:

data : Bunch

Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the classification labels, ‘cost_mat’, the cost matrix of each example, ‘target_names’, the meaning of the labels, ‘feature_names’, the meaning of the features, and ‘DESCR’, the full description of the dataset.

References

[R9]	A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.

Examples

Let’s say you are interested in the samples 10, 17, and 400:

>>> from costcla.datasets import load_creditscoring1
>>> data = load_creditscoring1()
>>> data.target[[10, 17, 400]]
array([0, 1, 0])
>>> data.cost_mat[[10, 17, 400]]
array([[  1023.73054104,  18750.        ,      0.        ,      0.        ],
       [   717.25781516,   6749.25      ,      0.        ,      0.        ],
       [  1004.32819923,  17990.25      ,      0.        ,      0.        ]])
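
To see what "example-dependent" means in practice, the sketch below compares, for each of the three samples above, the cost of classifying it correctly with the cost of misclassifying it. This is an illustration only: it assumes the column order [false positive, false negative, true positive, true negative], which this section does not state, and it reuses the printed values instead of loading the dataset.

```python
import numpy as np

# Values copied from the load_creditscoring1 example output above.
y_true = np.array([0, 1, 0])
cost_mat = np.array([
    [1023.73054104, 18750.00, 0.0, 0.0],
    [717.25781516, 6749.25, 0.0, 0.0],
    [1004.32819923, 17990.25, 0.0, 0.0],
])

# Assumed column order: [FP, FN, TP, TN].
c_fp, c_fn, c_tp, c_tn = cost_mat.T

# A positive sample costs c_tp when predicted correctly and c_fn otherwise;
# a negative sample costs c_tn when predicted correctly and c_fp otherwise.
correct_cost = np.where(y_true == 1, c_tp, c_tn)
error_cost = np.where(y_true == 1, c_fn, c_fp)

for i, (ok, err) in enumerate(zip(correct_cost, error_cost)):
    print(f"sample {i}: correct={ok:.2f}, misclassified={err:.2f}")
```

Under that assumption, a single error costs a different amount for each sample, which is why the cost matrix has one row per example instead of one global set of costs.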

costcla.datasets.load_creditscoring2(cost_mat_parameters=None)

Load and return the credit scoring PAKDD 2009 competition dataset (classification).

The credit scoring dataset is an easily transformable example-dependent cost-sensitive classification dataset.

Parameters:

cost_mat_parameters : Dictionary-like object, optional (default=None)

If not None, must include the keys ‘int_r’, ‘int_cf’, ‘cl_max’, ‘n_term’, ‘k’, and ‘lgd’.

Returns:

data : Bunch

Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the classification labels, ‘cost_mat’, the cost matrix of each example, ‘target_names’, the meaning of the labels, ‘feature_names’, the meaning of the features, and ‘DESCR’, the full description of the dataset.

References

[R10]	A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.

Examples

Let’s say you are interested in the samples 10, 17, and 50:

>>> from costcla.datasets import load_creditscoring2
>>> data = load_creditscoring2()
>>> data.target[[10, 17, 50]]
array([1, 0, 0])
>>> data.cost_mat[[10, 17, 50]]
array([[ 209.   ,  547.965,    0.   ,    0.   ],
       [  24.   ,  274.725,    0.   ,    0.   ],
       [  89.   ,  371.25 ,    0.   ,    0.   ]])
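
Because every row of ‘cost_mat’ belongs to one sample, the three arrays ‘data’, ‘target’, and ‘cost_mat’ need to be indexed together whenever the dataset is split. The sketch below shows one way to do that with a shared index array. The arrays are randomly generated stand-ins with the same layout as the Bunch attributes described above, not the real dataset, and the split sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10

# Synthetic stand-ins for data.data, data.target, and data.cost_mat.
X = rng.normal(size=(n_samples, 3))
y = rng.integers(0, 2, size=n_samples)
cost_mat = rng.uniform(0.0, 100.0, size=(n_samples, 4))

# Shuffle the row indices once and reuse them for all three arrays,
# so each sample keeps its own label and its own cost row.
perm = rng.permutation(n_samples)
n_test = 3
test_idx, train_idx = perm[:n_test], perm[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
cost_mat_train, cost_mat_test = cost_mat[train_idx], cost_mat[test_idx]

print(X_train.shape, y_train.shape, cost_mat_train.shape)
print(X_test.shape, y_test.shape, cost_mat_test.shape)
```

Indexing with the same `train_idx` and `test_idx` keeps the rows aligned; shuffling each array separately would silently pair samples with the wrong costs.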