Datasets
The costcla.datasets module includes utilities to load datasets,
including methods to load and fetch popular reference datasets. It also
features some artificial data generators.
costcla.datasets.load_bankmarketing(cost_mat_parameters=None)
Load and return the bank marketing dataset (classification).
The bank marketing dataset is an easily transformable, example-dependent, cost-sensitive classification dataset.
Parameters: cost_mat_parameters : Dictionary-like object, optional (default=None)
If not None, must include the keys ‘per_balance’, ‘ca’, and ‘int_r’.
Returns: data : Bunch
Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the classification labels, ‘cost_mat’, the cost matrix of each example, ‘target_names’, the meaning of the labels, ‘feature_names’, the meaning of the features, and ‘DESCR’, the full description of the dataset.
References
[R8] A. Correa Bahnsen, A. Stojanovic, D. Aouada, B. Ottersten, “Improving Credit Card Fraud Detection with Calibrated Probabilities”, in Proceedings of the fourteenth SIAM International Conference on Data Mining, 677-685, 2014.
Examples
Let’s say you are interested in the samples 10, 25, and 319:
>>> from costcla.datasets import load_bankmarketing
>>> data = load_bankmarketing()
>>> data.target[[10, 25, 319]]
array([0, 0, 1])
>>> data.cost_mat[[10, 25, 319]]
array([[ 1.        ,  1.66274977,  1.        ,  0.        ],
       [ 1.        ,  1.63195811,  1.        ,  0.        ],
       [ 1.        ,  5.11141597,  1.        ,  0.        ]])
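To show how an example-dependent cost matrix like the one above might be used, the following minimal sketch sums the cost of a set of predictions. It assumes each row is ordered as false positive, false negative, true positive, true negative; that ordering is not stated in this section, so check it against the library before relying on it. The helpers example_cost and total_cost are hypothetical and are not part of costcla.

```python
# Assumed column order of each cost-matrix row (not stated in this section):
# [false positive, false negative, true positive, true negative]
FP, FN, TP, TN = 0, 1, 2, 3


def example_cost(y_true, y_pred, cost_row):
    """Return the cost of one prediction given its 4-entry cost row."""
    if y_true == 1:
        return cost_row[TP] if y_pred == 1 else cost_row[FN]
    return cost_row[FP] if y_pred == 1 else cost_row[TN]


def total_cost(y_true, y_pred, cost_mat):
    """Sum the example-dependent costs over all samples."""
    return sum(
        example_cost(t, p, row) for t, p, row in zip(y_true, y_pred, cost_mat)
    )


# Values copied from the doctest above (samples 10, 25 and 319).
target = [0, 0, 1]
cost_mat = [
    [1.0, 1.66274977, 1.0, 0.0],
    [1.0, 1.63195811, 1.0, 0.0],
    [1.0, 5.11141597, 1.0, 0.0],
]

# Two trivial classifiers: predict every sample negative, or every sample positive.
cost_all_negative = total_cost(target, [0, 0, 0], cost_mat)
cost_all_positive = total_cost(target, [1, 1, 1], cost_mat)
print(cost_all_negative, cost_all_positive)
```

Under the assumed ordering, predicting all three samples as negative incurs only the false-negative entry of sample 319, while predicting all of them as positive incurs a cost of 1 per sample.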
costcla.datasets.load_creditscoring1(cost_mat_parameters=None)
Load and return the credit scoring Kaggle Credit competition dataset (classification).
The credit scoring dataset is an easily transformable, example-dependent, cost-sensitive classification dataset.
Parameters: cost_mat_parameters : Dictionary-like object, optional (default=None)
If not None, must include the keys ‘int_r’, ‘int_cf’, ‘cl_max’, ‘n_term’, ‘k’, and ‘lgd’.
Returns: data : Bunch
Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the classification labels, ‘cost_mat’, the cost matrix of each example, ‘target_names’, the meaning of the labels, ‘feature_names’, the meaning of the features, and ‘DESCR’, the full description of the dataset.
References
[R9] A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.
Examples
Let’s say you are interested in the samples 10, 17, and 400:
>>> from costcla.datasets import load_creditscoring1
>>> data = load_creditscoring1()
>>> data.target[[10, 17, 400]]
array([0, 1, 0])
>>> data.cost_mat[[10, 17, 400]]
array([[  1023.73054104,  18750.        ,      0.        ,      0.        ],
       [   717.25781516,   6749.25      ,      0.        ,      0.        ],
       [  1004.32819923,  17990.25      ,      0.        ,      0.        ]])
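Because a custom cost_mat_parameters must contain every key listed under Parameters, it can help to check a dictionary before passing it to the loader. The sketch below is plain Python; missing_keys is a hypothetical helper, not part of costcla, and the numeric values are placeholders rather than recommended settings.

```python
# Keys listed in the Parameters entry for the credit scoring loaders.
REQUIRED_KEYS = ("int_r", "int_cf", "cl_max", "n_term", "k", "lgd")


def missing_keys(params, required=REQUIRED_KEYS):
    """Return the required keys absent from params.

    None is treated as "use the loader's defaults", so nothing is missing.
    """
    if params is None:
        return []
    return [key for key in required if key not in params]


# Hypothetical, incomplete parameter set; the values are placeholders only.
candidate = {"int_r": 0.02, "n_term": 24, "lgd": 0.75}
absent = missing_keys(candidate)
print(absent)
```

An empty result means the dictionary names every required key; the check does not say anything about whether the values themselves are sensible.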
costcla.datasets.load_creditscoring2(cost_mat_parameters=None)
Load and return the credit scoring PAKDD 2009 competition dataset (classification).
The credit scoring dataset is an easily transformable, example-dependent, cost-sensitive classification dataset.
Parameters: cost_mat_parameters : Dictionary-like object, optional (default=None)
If not None, must include the keys ‘int_r’, ‘int_cf’, ‘cl_max’, ‘n_term’, ‘k’, and ‘lgd’.
Returns: data : Bunch
Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the classification labels, ‘cost_mat’, the cost matrix of each example, ‘target_names’, the meaning of the labels, ‘feature_names’, the meaning of the features, and ‘DESCR’, the full description of the dataset.
References
[R10] A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.
Examples
Let’s say you are interested in the samples 10, 17, and 50:
>>> from costcla.datasets import load_creditscoring2
>>> data = load_creditscoring2()
>>> data.target[[10, 17, 50]]
array([1, 0, 0])
>>> data.cost_mat[[10, 17, 50]]
array([[ 209.   ,  547.965,    0.   ,    0.   ],
       [  24.   ,  274.725,    0.   ,    0.   ],
       [  89.   ,  371.25 ,    0.   ,    0.   ]])