spellbook.input#

Functions for handling and preprocessing input data

Functions:

  • calculate_splits(n[, frac_train, frac_val, ...]) – Calculate the numbers of datapoints in the training, validation and test sets

  • encode_categories(data) – Turns all string variables into categorical variables and adds columns with the corresponding numerical values

  • oversample(data, target[, shuffle]) – Oversample the data to increase the sizes of the minority classes/categories

  • separate_tfdataset_to_tftensors(dataset) – Separate a tf.data.Dataset into tf.Tensors for the features and the labels

  • split_pddataframe_to_nparrays(data, target, ...) – Get separate numpy.ndarrays for training/validation/test features and labels

  • split_pddataframe_to_pddataframes(data, ...) – Get separate pandas.DataFrames and pandas.Series for training/validation/test features and labels

  • split_pddataframe_to_tfdatasets(data, ...[, ...]) – Get separate tf.data.Datasets for training, validation and testing

Functions#

calculate_splits#

spellbook.input.calculate_splits(n, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Calculate the numbers of datapoints in the training, validation and test sets

The size of the test set is calculated so as to use the datapoints remaining after the training and validation sets have been filled. If neither frac_val nor n_val is given, or if the training and validation sets use up all datapoints of the original full dataset, the size of the test set will be zero.

Parameters
  • n (int) – The size of the full dataset

  • frac_train (float) – Optional, the fractional size of the training set

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation set

  • n_train (typing.Optional[int]) – Optional, the absolute size of the training set

  • n_val (typing.Optional[int]) – Optional, the absolute size of the validation set

Return type

typing.Tuple[int, int, int]

Returns

The absolute sizes of the training, validation and test sets

Examples:

  • Default training/validation split defined by fractional sizes

    import spellbook as sb
    sizes = sb.input.calculate_splits(n=1000)
    print('train:', sizes[0])
    print('validation:', sizes[1])
    print('test:', sizes[2])
    

    Output:

    train: 700
    validation: 300
    test: 0
    
  • Training/validation/test split defined by fractional sizes

    import spellbook as sb
    sizes = sb.input.calculate_splits(n=1000, frac_train=0.7, frac_val=0.2)
    print('train:', sizes[0])
    print('validation:', sizes[1])
    print('test:', sizes[2])
    

    Output:

    train: 700
    validation: 200
    test: 100
    
  • Training/validation/test split defined by absolute sizes

    import spellbook as sb
    sizes = sb.input.calculate_splits(n=1000, n_train=600, n_val=250)
    print('train:', sizes[0])
    print('validation:', sizes[1])
    print('test:', sizes[2])
    

    Output:

    train: 600
    validation: 250
    test: 150
    

encode_categories#

spellbook.input.encode_categories(data)[source]#

Turns all string variables into categorical variables and adds columns with the corresponding numerical values

Parameters

data (pandas.DataFrame) – The dataset

Return type

typing.Dict[str, typing.Dict[int, str]]

Returns

Dictionary of dictionaries with the encodings of each category. For each categorical variable there is a dictionary with the numerical codes as the keys and the category names/labels as the values
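
Example:

  • A minimal sketch of encoding a single string variable, mirroring the usage in the oversample() example below; the call adds the numerical column 'cat_codes' next to the original column 'cat' and returns the encoding:

    import pandas as pd
    import spellbook as sb
    
    data = pd.DataFrame({'cat': ['p']*5 + ['n']*2})
    encodings = sb.input.encode_categories(data)  # adds the column 'cat_codes'
    print(encodings)
    

    Output:

    variable 'cat' is categorical
    {'cat': {0: 'n', 1: 'p'}}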

oversample#

spellbook.input.oversample(data, target, shuffle=True)[source]#

Oversample the data to increase the sizes of the minority classes/categories

The target variable is assumed to be of type categorical, so this function should only be called after converting the target variable from type object (when the values are strings) to type categorical with something like

import pandas as pd
data[target] = pd.Categorical(data[target])

The original datapoints in the minority classes/categories are always retained, and oversampling is only used to fill in the missing datapoints. This ensures that for small imbalances all of the original datapoints are kept: if the minority classes were instead resampled from scratch, random fluctuations could cause some datapoints to be sampled twice and others never, losing some of the original data.

Parameters
  • data (pandas.DataFrame) – The imbalanced data

  • target (str) – The name of the variable that should be balanced

  • shuffle (bool) – Optional, whether or not the oversampled DataFrame should be shuffled. If set to False, the DataFrame will be a concatenation of the different classes/categories, i.e. all datapoints belonging to the same class/category will be grouped together

Returns

A dataset with each class/category populated with the same number of datapoints

Return type

pandas.DataFrame

Example:

>>> import numpy as np
>>> import pandas as pd
>>> import spellbook as sb
>>> np.random.seed(0) # only for the sake of reproducibility of the
>>>                   # doctests here in the documentation

>>> data_dict = {'cat': ['p']*5 + ['n']*2}
>>> data = pd.DataFrame(data_dict)
>>> sb.input.encode_categories(data)
variable 'cat' is categorical
{'cat': {0: 'n', 1: 'p'}}

>>> print('before oversampling:', data.head)
before oversampling: <bound method NDFrame.head of   cat  cat_codes
0   p          1
1   p          1
2   p          1
3   p          1
4   p          1
5   n          0
6   n          0>

>>> # Oversampling without shuffling returns a dataset sorted by category
>>> data_oversampled = sb.input.oversample(data, target='cat', shuffle=False)
>>> print('after oversampling:', data_oversampled.head)
after oversampling: <bound method NDFrame.head of   cat  cat_codes
5   n          0
6   n          0
5   n          0
6   n          0
6   n          0
0   p          1
1   p          1
2   p          1
3   p          1
4   p          1>

>>> # Therefore, shuffling is activated by default
>>> data_oversampled = sb.input.oversample(data, target='cat')
>>> print('after oversampling:', data_oversampled.head)
after oversampling: <bound method NDFrame.head of   cat  cat_codes
3   p          1
1   p          1
6   n          0
5   n          0
5   n          0
0   p          1
4   p          1
6   n          0
2   p          1
6   n          0>
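
The balance of the oversampled dataset can then be checked, for instance, via the class counts. A minimal sketch (the exact formatting of the output depends on the pandas version):

print(data_oversampled['cat'].value_counts())
# both classes should now contain five datapoints each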

separate_tfdataset_to_tftensors#

spellbook.input.separate_tfdataset_to_tftensors(dataset)[source]#

Separate a tf.data.Dataset into tf.Tensors for the features and the labels

Parameters

dataset (tf.data.Dataset) – The dataset to separate, may be batched or unbatched

Returns

A tuple holding the two separate tensors for the features and the labels: (features: tf.Tensor, labels: tf.Tensor)

Return type

Tuple of tf.Tensor
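
Example:

  • A minimal sketch, assuming the dataset yields (features, labels) pairs as produced by split_pddataframe_to_tfdatasets(); the shapes indicated in the comments are illustrative and assume that all datapoints are collected into a single tensor each:

    import numpy as np
    import tensorflow as tf
    import spellbook as sb
    
    features = np.arange(20, dtype=np.float32).reshape(10, 2)
    labels = np.arange(10)
    
    # an unbatched dataset of (features, labels) pairs;
    # according to the docs, batched datasets are supported as well
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    
    features_tensor, labels_tensor = sb.input.separate_tfdataset_to_tftensors(dataset)
    print(features_tensor.shape)  # expected: (10, 2)
    print(labels_tensor.shape)    # expected: (10,)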

split_pddataframe_to_nparrays#

spellbook.input.split_pddataframe_to_nparrays(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Get separate numpy.ndarrays for training/validation/test features and labels

  • If frac_train and frac_val, or n_train and n_val, are given, six datasets are returned: the training features and labels, the validation features and labels, and the test features and labels, with the test sets sized so as to use the datapoints remaining after the training and validation sets have been filled

  • If neither frac_val nor n_val is given, the output contains no test datasets, but only the training features and labels and the validation features and labels, with the validation sets sized so as to use the datapoints remaining after the training sets have been filled

Note

This function does not include shuffling and batching of the data, so these should be done separately!

Parameters
  • data (pandas.DataFrame) – The dataset

  • target (str) – The name of the target variable containing the labels

  • features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.

  • frac_train (float) – Optional, the fractional size of the training dataset

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset

  • n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set

  • n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set

Returns

A tuple containing the different datasets requested, separated into features and labels

Return type

Tuple of numpy.ndarray

Examples:

  • Training/validation split with the default relative sizes of 70%/30%:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels \
        = sb.input.split_pddataframe_to_nparrays(
            data, target='target', features=['x', 'y'])
    
    print('train_features:', type(train_features), train_features.shape)
    print('val_labels:', type(val_labels), val_labels.shape)
    

    Output:

    train_features: <class 'numpy.ndarray'> (70, 2)
    val_labels: <class 'numpy.ndarray'> (30,)
    
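  • Training/validation/test split defined by fractional sizes. A sketch along the same lines; the shapes indicated in the comments follow from the documented 70%/20%/10% split and the return order (training, validation and test features and labels):

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels, \
        test_features, test_labels \
        = sb.input.split_pddataframe_to_nparrays(
            data, target='target', features=['x', 'y'],
            frac_train=0.7, frac_val=0.2)
    
    print('train_features:', train_features.shape)  # expected: (70, 2)
    print('val_features:', val_features.shape)      # expected: (20, 2)
    print('test_features:', test_features.shape)    # expected: (10, 2)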


split_pddataframe_to_pddataframes#

spellbook.input.split_pddataframe_to_pddataframes(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Get separate pandas.DataFrames and pandas.Series for training/validation/test features and labels

  • If frac_train and frac_val, or n_train and n_val, are given, six datasets are returned: the training features and labels, the validation features and labels, and the test features and labels, with the test sets sized so as to use the datapoints remaining after the training and validation sets have been filled

  • If neither frac_val nor n_val is given, the output contains no test datasets, but only the training features and labels and the validation features and labels, with the validation sets sized so as to use the datapoints remaining after the training sets have been filled

Note

This function does not include shuffling and batching of the data, so these should be done separately!

Parameters
  • data (pandas.DataFrame) – The dataset

  • target (str) – The name of the target variable containing the labels

  • features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.

  • frac_train (float) – Optional, the fractional size of the training dataset

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset

  • n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set

  • n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set

Returns

A tuple containing the different datasets requested, separated into features and labels

Return type

Tuple of pandas.DataFrame and pandas.Series

Examples:

  • Training/validation split with the default relative sizes of 70%/30%:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels \
        = sb.input.split_pddataframe_to_pddataframes(
            data, target='target', features=['x', 'y'])
    
    print('train_features:', type(train_features), train_features.shape)
    print('val_labels:', type(val_labels), val_labels.shape)
    

    Output:

    train_features: <class 'pandas.core.frame.DataFrame'> (70, 2)
    val_labels: <class 'pandas.core.series.Series'> (30,)
    
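  • Training/validation/test split defined by absolute sizes. A sketch along the same lines; the shapes indicated in the comments follow from the requested sizes and the return order (training, validation and test features and labels):

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train_features, train_labels, val_features, val_labels, \
        test_features, test_labels \
        = sb.input.split_pddataframe_to_pddataframes(
            data, target='target', features=['x', 'y'],
            n_train=70, n_val=20)
    
    print('train_features:', train_features.shape)  # expected: (70, 2)
    print('val_features:', val_features.shape)      # expected: (20, 2)
    print('test_labels:', test_labels.shape)        # expected: (10,)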


split_pddataframe_to_tfdatasets#

spellbook.input.split_pddataframe_to_tfdatasets(data, target, features, frac_train=0.7, frac_val=None, n_train=None, n_val=None)[source]#

Get separate tf.data.Datasets for training, validation and testing

  • If frac_train and frac_val, or n_train and n_val, are given, three datasets are returned: the training, the validation and the test set, with the test set sized so as to use the datapoints remaining after the training and validation sets have been filled

  • If neither frac_val nor n_val is given, the output contains no test dataset, but only the training and the validation set, with the validation set sized so as to use the datapoints remaining after the training set has been filled

Note

This function does not include shuffling and batching of the data, so these should be done separately!

Parameters
  • data (pandas.DataFrame) – The data

  • target (str) – The name of the target variable containing the labels

  • features (typing.List[str]) – The names of the feature variables. Not all variables have to be extracted from the dataset, e.g. the numerical codes of a categorical variable should be extracted, whereas the variable containing the original strings should be skipped.

  • frac_train (float) – Optional, the fractional size of the training dataset

  • frac_val (typing.Optional[float]) – Optional, the fractional size of the validation dataset

  • n_train (typing.Optional[int]) – Optional, the number of datapoints in the training set

  • n_val (typing.Optional[int]) – Optional, the number of datapoints in the validation set

Returns

A tuple containing the training, validation and possibly also the test datasets

Return type

Tuple of tf.data.Datasets

Example:

  • Training/validation split with the default relative sizes of 70%/30%:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data_dict = {
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    }
    data = pd.DataFrame(data_dict)
    
    train, val = sb.input.split_pddataframe_to_tfdatasets(
        data, target='target', features=['x', 'y'])
    
    print('train:', type(train), train.cardinality().numpy())
    print('val:', type(val), val.cardinality().numpy())
    

    Output:

    train: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'> 70
    val: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'> 30
    
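    As stated in the note above, the returned datasets are neither shuffled nor batched. A minimal sketch of applying both steps with standard tf.data operations, repeating the setup from the example above; the buffer and batch sizes are purely illustrative:

    import numpy as np
    import pandas as pd
    import spellbook as sb
    
    data = pd.DataFrame({
        'x': np.arange(100),
        'y': np.arange(100),
        'target': np.arange(100)
    })
    
    train, val = sb.input.split_pddataframe_to_tfdatasets(
        data, target='target', features=['x', 'y'])
    
    # shuffle the training set with a buffer covering all of its datapoints,
    # then batch both sets
    train = train.shuffle(buffer_size=70).batch(32)
    val = val.batch(32)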
