spellbook.input#
Functions for handling and preprocessing input data
- Functions:
  - calculate_splits – Calculate the numbers of datapoints in the training, validation and test sets
  - encode_categories – Turns all string variables into categorical variables and adds columns with the corresponding numerical values
  - oversample – Oversample the data to increase the sizes of the minority classes/categories
  - separate_tfdataset_to_tftensors – Separate a tf.data.Dataset into tf.Tensors for the features and the labels
  - split_pddataframe_to_nparrays – Get separate numpy.ndarrays for training/validation/test features and labels
  - split_pddataframe_to_pddataframes – Get separate pandas.DataFrames and pandas.Series for training/validation/test features and labels
  - split_pddataframe_to_tfdatasets – Get separate tf.data.Datasets for training, validation and testing
Functions#
calculate_splits#
- spellbook.input.calculate_splits(n, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#
Calculate the numbers of datapoints in the training, validation and test sets
The size of the test dataset is calculated so as to use the remaining datapoints after filling the training and validation sets. If neither frac_val nor n_val is given, or if the training and validation sets use up all datapoints from the original full dataset, then the size of the test set will be set to zero.
- Parameters
  - n (int) – The size of the full dataset
  - frac_train (float) – Optional, the fractional size of the training set
  - frac_val (typing.Optional[float]) – Optional, the fractional size of the validation set
  - n_train (typing.Optional[int]) – Optional, the absolute size of the training set
  - n_val (typing.Optional[int]) – Optional, the absolute size of the validation set
- Return type
  typing.Tuple[int, int, int]
- Returns
  The absolute sizes of the training, validation and test sets
Examples:
Default training/validation split defined by fractional sizes
import spellbook as sb

sizes = sb.input.calculate_splits(n=1000)
print('train:', sizes[0])
print('validation:', sizes[1])
print('test:', sizes[2])
Output:
train: 700
validation: 300
test: 0
Training/validation/test split defined by fractional sizes
import spellbook as sb

sizes = sb.input.calculate_splits(n=1000, frac_train=0.7, frac_val=0.2)
print('train:', sizes[0])
print('validation:', sizes[1])
print('test:', sizes[2])
Output:
train: 700
validation: 200
test: 100
Training/validation/test split defined by absolute sizes
import spellbook as sb

sizes = sb.input.calculate_splits(n=1000, n_train=600, n_val=250)
print('train:', sizes[0])
print('validation:', sizes[1])
print('test:', sizes[2])
Output:
train: 600
validation: 250
test: 150
encode_categories#
- spellbook.input.encode_categories(data)[source]#
Turns all string variables into categorical variables and adds columns with the corresponding numerical values
- Parameters
  data (pandas.DataFrame) – The dataset
- Return type
  typing.Dict[str, typing.Dict[int, str]]
- Returns
  Dictionary of dictionaries with the encodings of each category. For each categorical variable there is a dictionary with the numerical codes as the keys and the category names/labels as the values
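The original page has no standalone example for this function; the following minimal sketch only relies on the behaviour documented above and in the oversample doctest below (the column names and values are made up for illustration):

import pandas as pd
import spellbook as sb

# toy dataset with one string column and one numerical column
data = pd.DataFrame({'colour': ['red', 'blue', 'red'], 'size': [1, 2, 3]})

# the string column is converted to a categorical column and a companion
# 'colour_codes' column with the numerical codes is added to the DataFrame
encodings = sb.input.encode_categories(data)

print(encodings)   # expected form: {'colour': {0: 'blue', 1: 'red'}}
print(data.dtypes) # 'colour' should now be of type category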
oversample#
- spellbook.input.oversample(data, target, shuffle=True)[source]#
Oversample the data to increase the sizes of the minority classes/categories
The target variable is assumed to be of type categorical, so this function should only be called after converting the target variable from type object (when the values are strings) to type categorical with something like
import pandas as pd

data[target] = pd.Categorical(data[target])
The original datapoints in the minority classes/categories are retained and oversampling is only used to fill in the missing datapoints in order to ensure that for small imbalances all the original datapoints are kept. Otherwise, because of random fluctuations, some datapoints may be sampled twice and others never, losing some of the original data.
- Parameters
  - data (pandas.DataFrame) – The imbalanced data
  - target (str) – The name of the variable that should be balanced
  - shuffle (bool) – Optional, whether or not the oversampled DataFrame should be shuffled. If set to False, then the DataFrame will be a concatenation of the different classes/categories, i.e. all datapoints belonging to the same class/category will be grouped together
- Returns
  A dataset with each class/category populated with the same number of datapoints
- Return type
  pandas.DataFrame
Example:
>>> import numpy as np
>>> import pandas as pd
>>> import spellbook as sb
>>> np.random.seed(0)  # only for the sake of reproducibility of the
>>>                    # doctests here in the documentation
>>> data_dict = {'cat': ['p']*5 + ['n']*2}
>>> data = pd.DataFrame(data_dict)
>>> sb.input.encode_categories(data)
variable 'cat' is categorical
{'cat': {0: 'n', 1: 'p'}}
>>> print('before oversampling:', data.head)
before oversampling: <bound method NDFrame.head of   cat  cat_codes
0   p          1
1   p          1
2   p          1
3   p          1
4   p          1
5   n          0
6   n          0>
>>> # Oversampling without shuffling returns a dataset sorted by category
>>> data_oversampled = sb.input.oversample(data, target='cat', shuffle=False)
>>> print('after oversampling:', data_oversampled.head)
after oversampling: <bound method NDFrame.head of   cat  cat_codes
5   n          0
6   n          0
5   n          0
6   n          0
6   n          0
0   p          1
1   p          1
2   p          1
3   p          1
4   p          1>
>>> # Therefore, shuffling is activated by default
>>> data_oversampled = sb.input.oversample(data, target='cat')
>>> print('after oversampling:', data_oversampled.head)
after oversampling: <bound method NDFrame.head of   cat  cat_codes
3   p          1
1   p          1
6   n          0
5   n          0
5   n          0
0   p          1
4   p          1
6   n          0
2   p          1
6   n          0>
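To check the balance claimed under Returns, a small follow-up continuing from the session above (not part of the original doctest):

# verify that both categories now hold the same number of datapoints
# (5 each for this toy dataset)
print(data_oversampled['cat'].value_counts())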
separate_tfdataset_to_tftensors#
- spellbook.input.separate_tfdataset_to_tftensors(dataset)[source]#
Separate a tf.data.Dataset into tf.Tensors for the features and the labels
- Parameters
  dataset (tf.data.Dataset) – The dataset to separate, may be batched or unbatched
- Returns
  A tuple holding the two separate tensors for the features and the labels: (features: tf.Tensor, labels: tf.Tensor)
- Return type
  Tuple of tf.Tensor
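There is no example for this function on the original page; the sketch below assumes only the documented signature and return value, and the dataset construction is made up for illustration:

import numpy as np
import tensorflow as tf
import spellbook as sb

# toy dataset: 10 datapoints with 2 features each, plus integer labels
features = np.arange(20, dtype=np.float32).reshape(10, 2)
labels = np.arange(10, dtype=np.int32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(4)

# separate the (batched) dataset into one tensor for the features
# and one tensor for the labels
x, y = sb.input.separate_tfdataset_to_tftensors(dataset)
print(x.shape)  # expected: (10, 2)
print(y.shape)  # expected: (10,)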
split_pddataframe_to_nparrays#
- spellbook.input.split_pddataframe_to_nparrays(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#
Get separate numpy.ndarrays for training/validation/test features and labels
If either frac_train and frac_val or n_train and n_val are given, six datasets are returned: the training features and labels, the validation features and labels and the test features and labels, with the test sets sized so as to use the remaining datapoints after the training and validation sets were filled
If no frac_val or n_val is given, the output will contain no test datasets, but rather just the training features and labels and the validation features and labels, with the validation sets sized so as to use the remaining datapoints after the training datasets were filled
Note
This function does not include shuffling and batching of the data, so these should be done separately!
- Parameters
  - data (pandas.DataFrame) – The dataset
  - target (str) – The name of the target variable containing the labels
  - features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.
  - frac_train (float) – Optional, the fractional size of the training dataset
  - frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset
  - n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set
  - n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set
- Returns
  A tuple containing the different datasets requested, separated into features and labels
  - (train_features: numpy.ndarray, train_labels: numpy.ndarray, validation_features: numpy.ndarray, validation_labels: numpy.ndarray)
  - (train_features: numpy.ndarray, train_labels: numpy.ndarray, validation_features: numpy.ndarray, validation_labels: numpy.ndarray, test_features: numpy.ndarray, test_labels: numpy.ndarray)
- Return type
  Tuple of numpy.ndarray
Example:
Training/validation split with the default relative sizes of 70%/30%:
import numpy as np
import pandas as pd
import spellbook as sb

data_dict = {
    'x': np.arange(100),
    'y': np.arange(100),
    'target': np.arange(100)
}
data = pd.DataFrame(data_dict)

train_features, train_labels, val_features, val_labels \
    = sb.input.split_pddataframe_to_nparrays(
        data, target='target', features=['x', 'y'])

print('train_features:', type(train_features), train_features.shape)
print('val_labels:', type(val_labels), val_labels.shape)
Output:
train_features: <class 'numpy.ndarray'> (70, 2)
val_labels: <class 'numpy.ndarray'> (30,)
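Continuing from the example above, a sketch of the three-way split described under Returns; frac_val=0.2 is an arbitrary choice here, and the expected shapes follow from the documented sizing of the test set:

# three-way split: 70% training, 20% validation, remaining 10% test
train_features, train_labels, val_features, val_labels, test_features, test_labels \
    = sb.input.split_pddataframe_to_nparrays(
        data, target='target', features=['x', 'y'],
        frac_train=0.7, frac_val=0.2)

print('test_features:', test_features.shape)  # expected: (10, 2)
print('test_labels:', test_labels.shape)      # expected: (10,)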
See also: split_pddataframe_to_pddataframes, split_pddataframe_to_tfdatasets
split_pddataframe_to_pddataframes#
- spellbook.input.split_pddataframe_to_pddataframes(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#
Get separate pandas.DataFrames and pandas.Series for training/validation/test features and labels
If either frac_train and frac_val or n_train and n_val are given, six datasets are returned: the training features and labels, the validation features and labels and the test features and labels, with the test sets sized so as to use the remaining datapoints after the training and validation sets were filled
If no frac_val or n_val is given, the output will contain no test datasets, but rather just the training features and labels and the validation features and labels, with the validation sets sized so as to use the remaining datapoints after the training datasets were filled
Note
This function does not include shuffling and batching of the data, so these should be done separately!
- Parameters
  - data (pandas.DataFrame) – The dataset
  - target (str) – The name of the target variable containing the labels
  - features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.
  - frac_train (float) – Optional, the fractional size of the training dataset
  - frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset
  - n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set
  - n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set
- Returns
  A tuple containing the different datasets requested, separated into features and labels
  - (train_features: pandas.DataFrame, train_labels: pandas.Series, validation_features: pandas.DataFrame, validation_labels: pandas.Series)
  - (train_features: pandas.DataFrame, train_labels: pandas.Series, validation_features: pandas.DataFrame, validation_labels: pandas.Series, test_features: pandas.DataFrame, test_labels: pandas.Series)
- Return type
  Tuple of pandas.DataFrame and pandas.Series
Example:
Training/validation split with the default relative sizes of 70%/30%:
import numpy as np
import pandas as pd
import spellbook as sb

data_dict = {
    'x': np.arange(100),
    'y': np.arange(100),
    'target': np.arange(100)
}
data = pd.DataFrame(data_dict)

train_features, train_labels, val_features, val_labels \
    = sb.input.split_pddataframe_to_pddataframes(
        data, target='target', features=['x', 'y'])

print('train_features:', type(train_features), train_features.shape)
print('val_labels:', type(val_labels), val_labels.shape)
Output:
train_features: <class 'pandas.core.frame.DataFrame'> (70, 2)
val_labels: <class 'pandas.core.series.Series'> (30,)
See also: split_pddataframe_to_nparrays, split_pddataframe_to_tfdatasets
split_pddataframe_to_tfdatasets#
- spellbook.input.split_pddataframe_to_tfdatasets(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#
Get separate tf.data.Datasets for training, validation and testing
If either frac_train and frac_val or n_train and n_val are given, three datasets are returned: the training, the validation and the test sets, with the test set sized so as to use the remaining datapoints after the training and validation sets were filled
If no frac_val or n_val is given, the output will contain no test dataset, but rather just the training and the validation sets, with the validation set sized so as to use the remaining datapoints after the training dataset was filled
Note
This function does not include shuffling and batching of the data, so these should be done separately (see the short sketch after the example below)!
- Parameters
  - data (pandas.DataFrame) – The data
  - target (str) – The name of the target variable containing the labels
  - features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.
  - frac_train (float) – Optional, the fractional size of the training dataset
  - frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset
  - n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set
  - n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set
- Returns
  A tuple containing the training, validation and possibly also the test datasets:
  - (tf.data.Dataset, tf.data.Dataset): the training and the validation set
  - (tf.data.Dataset, tf.data.Dataset, tf.data.Dataset): the training, the validation and the test set
- Return type
  Tuple of tf.data.Datasets
Example:
Training/validation split with the default relative sizes of 70%/30%:
import numpy as np
import pandas as pd
import spellbook as sb

data_dict = {
    'x': np.arange(100),
    'y': np.arange(100),
    'target': np.arange(100)
}
data = pd.DataFrame(data_dict)

train, val = sb.input.split_pddataframe_to_tfdatasets(
    data, target='target', features=['x', 'y'])

print('train:', type(train), train.cardinality().numpy())
print('val:', type(val), val.cardinality().numpy())
Output:
train: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'> 70
val: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'> 30
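As the note above says, the returned datasets are neither shuffled nor batched. Continuing from the example above, a typical follow-up with standard tf.data operations (the buffer and batch sizes are arbitrary values chosen for this sketch, not spellbook defaults):

# shuffle the training set and batch both sets before passing them to training
train = train.shuffle(buffer_size=70).batch(32)
val = val.batch(32)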
See also: split_pddataframe_to_nparrays, split_pddataframe_to_pddataframes