spellbook.input#

Functions for handling and preprocessing input data

Functions:

  • calculate_splits(n[, frac_train, frac_val, ...]) – Calculate the numbers of datapoints in the training, validation and test sets

  • encode_categories(data) – Turns all string variables into categorical variables and adds columns with the corresponding numerical values

  • oversample(data, target[, shuffle]) – Oversample the data to increase the sizes of the minority classes/categories

  • separate_tfdataset_to_tftensors(dataset) – Separate a tf.data.Dataset into tf.Tensors for the features and the labels

  • split_pddataframe_to_nparrays(data, target, ...) – Get separate numpy.ndarrays for training/validation/test features and labels

  • split_pddataframe_to_pddataframes(data, ...) – Get separate pandas.DataFrames and pandas.Series for training/validation/test features and labels

  • split_pddataframe_to_tfdatasets(data, ...[, ...]) – Get separate tf.data.Datasets for training, validation and testing

Functions#

calculate_splits#

spellbook.input.calculate_splits(n, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Calculate the numbers of datapoints in the training, validation and test sets

The size of the test set is calculated so as to use the datapoints remaining after the training and validation sets have been filled. If neither frac_val nor n_val is given, or if the training and validation sets use up all datapoints of the original full dataset, the size of the test set will be zero.

Parameters
  • n (int) – The size of the full dataset

  • frac_train (float) – Optional, the fractional size of the training set

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation set

  • n_train (typing.Optional[int]) – Optional, the absolute size of the training set

  • n_val (typing.Optional[int]) – Optional, the absolute size of the validation set

Return type

typing.Tuple[int, int, int]

Returns

The absolute sizes of the training, validation and test sets

Examples:

  • Default training/validation split defined by fractional sizes

    import spellbook as sb
    sizes = sb.input.calculate_splits(n=1000)
    print('train:', sizes[0])
    print('validation:', sizes[1])
    print('test:', sizes[2])
    

    Output:

    train: 700
    validation: 300
    test: 0
    
  • Training/validation/test split defined by fractional sizes

    import spellbook as sb
    sizes = sb.input.calculate_splits(n=1000, frac_train=0.7, frac_val=0.2)
    print('train:', sizes[0])
    print('validation:', sizes[1])
    print('test:', sizes[2])
    

    Output:

    train: 700
    validation: 200
    test: 100
    
  • Training/validation/test split defined by absolute sizes

    import spellbook as sb
    sizes = sb.input.calculate_splits(n=1000, n_train=600, n_val=250)
    print('train:', sizes[0])
    print('validation:', sizes[1])
    print('test:', sizes[2])
    

    Output:

    train: 600
    validation: 250
    test: 150
    

encode_categories#

spellbook.input.encode_categories(data)[source]#

Turns all string variables into categorical variables and adds columns with the corresponding numerical values

Parameters

data (pandas.DataFrame) – The dataset

Return type

typing.Dict[str, typing.Dict[int, str]]

Returns

Dictionary of dictionaries with the encodings of each category. For each categorical variable there is a dictionary with the numerical codes as the keys and the category names/labels as the values
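
Example:

  • A minimal sketch of encoding a single string variable, mirroring the usage in the oversample() example below; the call adds the numerical column 'cat_codes' next to the original column 'cat' and returns the encoding:

    import pandas as pd
    import spellbook as sb
    
    data = pd.DataFrame({'cat': ['p']*5 + ['n']*2})
    encodings = sb.input.encode_categories(data)  # adds the column 'cat_codes'
    print(encodings)
    

    Output:

    variable 'cat' is categorical
    {'cat': {0: 'n', 1: 'p'}}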

oversample#

spellbook.input.oversample(data, target, shuffle=True)[source]#

Oversample the data to increase the sizes of the minority classes/categories

The target variable is assumed to be of type categorical, so this function should only be called after converting the target variable from type object (when the values are strings) to type categorical with something like

import pandas as pd
data[target] = pd.Categorical(data[target])

The original datapoints in the minority classes/categories are always retained, and oversampling is only used to fill in the missing datapoints. This ensures that for small imbalances all of the original datapoints are kept: if the minority classes were instead resampled from scratch, random fluctuations could cause some datapoints to be sampled twice and others never, losing some of the original data.

Parameters
  • data (pandas.DataFrame) – The imbalanced data

  • target (str) – The name of the variable that should be balanced

  • shuffle (bool) – Optional, whether or not the oversampled DataFrame should be shuffled. If set to False, the DataFrame will be a concatenation of the different classes/categories, i.e. all datapoints belonging to the same class/category will be grouped together

Returns

A dataset with each class/category populated with the same number of datapoints

Return type

pandas.DataFrame

Example:

>>> import numpy as np
>>> import pandas as pd
>>> import spellbook as sb
>>> np.random.seed(0) # only for the sake of reproducibility of the
>>>                   # doctests here in the documentation

>>> data_dict = {'cat': ['p']*5 + ['n']*2}
>>> data = pd.DataFrame(data_dict)
>>> sb.input.encode_categories(data)
variable 'cat' is categorical
{'cat': {0: 'n', 1: 'p'}}

>>> print('before oversampling:', data.head)
before oversampling: <bound method NDFrame.head of   cat  cat_codes
0   p          1
1   p          1
2   p          1
3   p          1
4   p          1
5   n          0
6   n          0>

>>> # Oversampling without shuffling returns a dataset sorted by category
>>> data_oversampled = sb.input.oversample(data, target='cat', shuffle=False)
>>> print('after oversampling:', data_oversampled.head)
after oversampling: <bound method NDFrame.head of   cat  cat_codes
5   n          0
6   n          0
5   n          0
6   n          0
6   n          0
0   p          1
1   p          1
2   p          1
3   p          1
4   p          1>

>>> # Therefore, shuffling is activated by default
>>> data_oversampled = sb.input.oversample(data, target='cat')
>>> print('after oversampling:', data_oversampled.head)
after oversampling: <bound method NDFrame.head of   cat  cat_codes
3   p          1
1   p          1
6   n          0
5   n          0
5   n          0
0   p          1
4   p          1
6   n          0
2   p          1
6   n          0>
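
The balance of the oversampled dataset can then be checked, for instance, via the class counts. A minimal sketch (the exact formatting of the output depends on the pandas version):

print(data_oversampled['cat'].value_counts())
# both classes should now contain five datapoints each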

separate_tfdataset_to_tftensors#

spellbook.input.separate_tfdataset_to_tftensors(dataset)[source]#

Separate a tf.data.Dataset into tf.Tensors for the features and the labels

Parameters

dataset (tf.data.Dataset) – The dataset to separate, may be batched or unbatched

Returns

A tuple holding the two separate tensors for the features and the labels: (features: tf.Tensor, labels: tf.Tensor)

Return type

Tuple of tf.Tensor
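
Example:

  • A minimal sketch, assuming the dataset yields (features, labels) pairs as produced by split_pddataframe_to_tfdatasets(); the shapes indicated in the comments are illustrative and assume that all datapoints are collected into a single tensor each:

    import numpy as np
    import tensorflow as tf
    import spellbook as sb
    
    features = np.arange(20, dtype=np.float32).reshape(10, 2)
    labels = np.arange(10)
    
    # an unbatched dataset of (features, labels) pairs;
    # according to the docs, batched datasets are supported as well
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    
    features_tensor, labels_tensor = sb.input.separate_tfdataset_to_tftensors(dataset)
    print(features_tensor.shape)  # expected: (10, 2)
    print(labels_tensor.shape)    # expected: (10,)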

split_pddataframe_to_nparrays#

spellbook.input.split_pddataframe_to_nparrays(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Get separate numpy.ndarrays for training/validation/test features and labels

  • If frac_train and frac_val, or n_train and n_val, are given, six datasets are returned: the training features and labels, the validation features and labels, and the test features and labels, with the test sets sized so as to use the datapoints remaining after the training and validation sets have been filled

  • If neither frac_val nor n_val is given, the output contains no test datasets, but only the training features and labels and the validation features and labels, with the validation sets sized so as to use the datapoints remaining after the training sets have been filled

Note

This function does not include shuffling and batching of the data, so these should be done separately!

Parameters
  • data (pandas.DataFrame) – The dataset

  • target (str) – The name of the target variable containing the labels

  • features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.

  • frac_train (float) – Optional, the fractional size of the training dataset

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset

  • n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set

  • n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set

Returns

A tuple containing the different datasets requested, separated into features and labels

Return type

Tuple of numpy.ndarray

Examples:

  • Training/validation split with the default relative sizes of 70%/30%:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels \
        = sb.input.split_pddataframe_to_nparrays(
            data, target='target', features=['x', 'y'])
    
    print('train_features:', type(train_features), train_features.shape)
    print('val_labels:', type(val_labels), val_labels.shape)
    

    Output:

    train_features: <class 'numpy.ndarray'> (70, 2)
    val_labels: <class 'numpy.ndarray'> (30,)
    
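  • Training/validation/test split defined by fractional sizes. A sketch along the same lines; the shapes indicated in the comments follow from the documented 70%/20%/10% split and the return order (training, validation and test features and labels):

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels, \
        test_features, test_labels \
        = sb.input.split_pddataframe_to_nparrays(
            data, target='target', features=['x', 'y'],
            frac_train=0.7, frac_val=0.2)
    
    print('train_features:', train_features.shape)  # expected: (70, 2)
    print('val_features:', val_features.shape)      # expected: (20, 2)
    print('test_features:', test_features.shape)    # expected: (10, 2)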


split_pddataframe_to_pddataframes#

spellbook.input.split_pddataframe_to_pddataframes(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Get separate pandas.DataFrames and pandas.Series for training/validation/test features and labels

  • If frac_train and frac_val, or n_train and n_val, are given, six datasets are returned: the training features and labels, the validation features and labels, and the test features and labels, with the test sets sized so as to use the datapoints remaining after the training and validation sets have been filled

  • If neither frac_val nor n_val is given, the output contains no test datasets, but only the training features and labels and the validation features and labels, with the validation sets sized so as to use the datapoints remaining after the training sets have been filled

Note

This function does not include shuffling and batching of the data, so these should be done separately!

Parameters
  • data (pandas.DataFrame) – The dataset

  • target (str) – The name of the target variable containing the labels

  • features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.

  • frac_train (float) – Optional, the fractional size of the training dataset

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset

  • n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set

  • n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set

Returns

A tuple containing the different datasets requested, separated into features and labels

Return type

Tuple of pandas.DataFrame and pandas.Series

Examples:

  • Training/validation split with the default relative sizes of 70%/30%:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels \
        = sb.input.split_pddataframe_to_pddataframes(
            data, target='target', features=['x', 'y'])
    
    print('train_features:', type(train_features), train_features.shape)
    print('val_labels:', type(val_labels), val_labels.shape)
    

    Output:

    train_features: <class 'pandas.core.frame.DataFrame'> (70, 2)
    val_labels: <class 'pandas.core.series.Series'> (30,)
    
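  • Training/validation/test split defined by absolute sizes. A sketch along the same lines; the shapes indicated in the comments follow from the requested sizes and the return order (training, validation and test features and labels):

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels, \
        test_features, test_labels \
        = sb.input.split_pddataframe_to_pddataframes(
            data, target='target', features=['x', 'y'],
            n_train=70, n_val=20)
    
    print('train_features:', train_features.shape)  # expected: (70, 2)
    print('val_features:', val_features.shape)      # expected: (20, 2)
    print('test_labels:', test_labels.shape)        # expected: (10,)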


split_pddataframe_to_tfdatasets#

spellbook.input.split_pddataframe_to_tfdatasets(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Get separate tf.data.Datasets for training, validation and testing

  • If frac_train and frac_val, or n_train and n_val, are given, three datasets are returned: the training, the validation and the test set, with the test set sized so as to use the datapoints remaining after the training and validation sets have been filled

  • If neither frac_val nor n_val is given, the output contains no test dataset, but only the training and the validation set, with the validation set sized so as to use the datapoints remaining after the training set has been filled

Note

This function does not include shuffling and batching of the data, so these should be done separately!

Parameters
  • data (pandas.DataFrame) – The data

  • target (str) – The name of the target variable containing the labels

  • features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.

  • frac_train (float) – Optional, the fractional size of the training dataset

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset

  • n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set

  • n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set

Returns

A tuple containing the training, validation and possibly also the test datasets

Return type

Tuple of tf.data.Datasets

Example:

  • Training/validation split with the default relative sizes of 70%/30%:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train, val = sb.input.split_pddataframe_to_tfdatasets(
        data, target='target', features=['x', 'y'])
    
    print('train:', type(train), train.cardinality().numpy())
    print('val:', type(val), val.cardinality().numpy())
    

    Output:

    train: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'> 70
    val: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'> 30
    
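    As stated in the note above, the returned datasets are neither shuffled nor batched. A minimal sketch of applying both steps with standard tf.data operations, repeating the setup from the example above; the buffer and batch sizes are purely illustrative:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data = pd.DataFrame({
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    })
    
    train, val = sb.input.split_pddataframe_to_tfdatasets(
        data, target='target', features=['x', 'y'])
    
    # shuffle the training set with a buffer covering all of its datapoints,
    # then batch both sets
    train = train.shuffle(buffer_size=70).batch(32)
    val = val.batch(32)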
