Data Preparation and Input Pipeline#

Now let’s move on to bringing the data into a form that can be processed by the neural network!

We’ll have a look at the three scripts

  • 2-stroke-prediction-naive.py

  • 3-stroke-prediction-oversampling.py

  • 4-stroke-prediction-oversampling-norm.py

in examples/1-binary-stroke-prediction/.

Treatment of Categorical Variables#

After data loading and cleaning, we start off by encoding the categorical variables so that they can be digested by the network.

# inplace convert string category labels to numerical indices
categories = sb.inputs.encode_categories(data)

Behind the scenes, the function spellbook.inputs.encode_categories() loops over all categorical variables in the dataset and converts them to pandas.Categoricals. It returns a dictionary that contains, for each categorical variable, the mapping of the numerical indices to the category names:

categories: {
    'gender': {0: 'female', 1: 'male', 2: 'other'},
    'hypertension': {0: 'no', 1: 'yes'},
    'heart_disease': {0: 'no', 1: 'yes'},
    'ever_married': {0: 'no', 1: 'yes'},
    'work_type': {0: 'children', 1: 'govt', 2: 'never', 3: 'private', 4: 'self'},
    'residence_type': {0: 'rural', 1: 'urban'},
    'smoking_status': {0: 'formerly', 1: 'never', 2: 'smokes', 3: 'unknown'},
    'stroke': {0: 'no', 1: 'yes'}
}

Finally, for each categorical variable var, an additional column var_codes is added to the dataset, containing the numerical category codes for each datapoint. These are the columns that we will later feed into the network.

The corresponding code in spellbook.inputs.encode_categories() looks like this:

categories = {}
for var in data:
    if data[var].dtype == 'object':
        print("variable '{}' is categorical".format(var))
        data[var] = pd.Categorical(data[var])

        # taken from https://stackoverflow.com/a/51102402
        categories[var] = dict(enumerate(data[var].cat.categories))

        # use numerical values instead of strings
        data[var+'_codes'] = data[var].cat.codes
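
The returned mapping can also be used in the opposite direction, for example to translate the numerical codes back into readable category names when inspecting results later on. A small illustrative snippet (not part of the example scripts):

# look up the category name behind a numerical code
print(categories['gender'][1])   # -> 'male'

# translate a whole column of codes back into category names
print(data['gender_codes'].map(categories['gender']).head())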

Next, we shuffle the dataset, i.e. order its rows randomly. As we saw before, the original dataset is ordered such that the datapoints of patients with strokes come first. This is not what we want here: when splitting the dataset into a training and a validation set, it would mean that stroke cases end up only in the training set but not in the validation set. We do the shuffling with pandas.DataFrame.sample() and afterwards adjust our list of feature variables and the name of the target variable to point to the new columns holding the integer category indices:

# shuffle data either now in pandas or later in TensorFlow
data = data.sample(frac = 1.0)

# use new numerical columns for the features
for var in categories:
    if var == target:
        target = target + '_codes'
    else:
        features[features.index(var)] = var + '_codes'

Feeding the Dataset into TensorFlow#

Now it is time to split the data into a training set, which tf.keras.Model.fit() uses to adjust the model parameters so that they describe the datapoints, and a validation set. The latter is used to evaluate the model performance and to benchmark different models against each other when changing model hyperparameters such as the number of layers, the number of nodes per layer or the number of training epochs.

While doing the split, we are also going to convert the training and validation datasets into objects that can be fed into the network. There are at least three different ways of doing this in terms of the objects and datatypes involved:

  1. from a pandas.DataFrame to a tf.data.Dataset where features and labels are combined in one object

  2. from a pandas.DataFrame to two tf.Tensors, one for the features and one for the labels

  3. from a pandas.DataFrame to two numpy.ndarrays, one for the features and one for the labels

Since columns in a pandas.DataFrame can be accessed in much the same way as entries in a Python dict, it is also possible to feed features stored in a pandas.DataFrame and labels stored in a pandas.Series directly into TensorFlow networks. The separation of a single pandas.DataFrame into separate feature and label sets for training, validation and testing is implemented in spellbook.inputs.split_pddataframe_to_pddataframes().
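
To illustrate, such a separation might look roughly as follows; this is a simplified sketch and not the actual implementation of spellbook.inputs.split_pddataframe_to_pddataframes():

n_train = 3500

# features as a pandas.DataFrame, labels as a pandas.Series
train_features = data[features].iloc[:n_train]
train_labels = data[target].iloc[:n_train]
val_features = data[features].iloc[n_train:]
val_labels = data[target].iloc[n_train:]

# these pandas objects can later be passed directly to tf.keras.Model.fit()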

Option 1: Using TensorFlow Datasets#

This approach is taken in 2-stroke-prediction-naive.py. It is implemented in spellbook.inputs.split_pddataframe_to_tfdatasets() and can be used like this:

train, val = sb.inputs.split_pddataframe_to_tfdatasets(data, target, features, n_train=3500)
print(train.cardinality()) # print the size of the dataset
print(val.cardinality())

Using 3500 datapoints for the training set and the remaining 1409 for the validation set corresponds to reserving a fraction of 71.3% of all data for training. A typical recommendation is to use about 70% for training when dealing with datasets containing a few thousand datapoints. As the size of the total dataset increases, the fraction reserved for training can increase: when a million datapoints are available, it is perhaps sufficient to use only about 1% of them for validation. The logic behind this is to use as many datapoints as possible for training while at the same time ensuring that the validation (and possibly the testing) set has a sufficient size to provide numerically stable results for the metrics used to quantify the model performance. The smaller the validation set, the larger the statistical uncertainty on these metrics.
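
As a quick sanity check of this fraction:

n_train, n_val = 3500, 1409
print(n_train / (n_train + n_val))   # ~0.713, i.e. about 71.3% for training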

Under the hood, in spellbook.inputs.split_pddataframe_to_tfdatasets(), the pandas.DataFrame is split into two separate frames, one for the features and one for the target labels. These are converted to two numpy.ndarrays, which are then used to initialise a tf.data.Dataset via tf.data.Dataset.from_tensor_slices(). Finally, the split is applied:

n = len(data)
n_train, n_val, n_test = calculate_splits(n, frac_train, frac_val,
                                          n_train, n_val)

# separate features and labels
data_features = data[features]
data_labels = data[target]

# create a TensorFlow Dataset
dataset = tf.data.Dataset.from_tensor_slices(
    (data_features.values, data_labels.values))

# split it into train/val/test
train = dataset.take(n_train)
val = dataset.skip(n_train).take(n_val)
if n_test:
    test = dataset.skip(n_train).skip(n_val).take(n_test)
    return((train, val, test))
else:
    return((train, val))
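
The helper calculate_splits() used above resolves the absolute sizes of the subsets from either fractions or explicit counts. Its body is not shown here; a simplified sketch of this kind of logic, purely for illustration and not the actual spellbook code, could look like this:

def calculate_splits_sketch(n, frac_train=None, frac_val=None,
                            n_train=None, n_val=None):
    # resolve absolute counts from fractions where no explicit count is given
    if n_train is None:
        n_train = int(round(frac_train * n)) if frac_train else 0
    if n_val is None:
        n_val = int(round(frac_val * n)) if frac_val else n - n_train
    # whatever remains is available for a test set (possibly zero)
    n_test = n - n_train - n_val
    return n_train, n_val, n_test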

Note that spellbook.inputs.split_pddataframe_to_tfdatasets() does not shuffle or batch the resulting datasets. Shuffling may be done

  • before in pandas: data = data.sample(frac = 1.0)

  • or afterwards in TensorFlow: train = train.shuffle(buffer_size = train.cardinality())

Since we shuffled the data before converting and splitting them, we can proceed to divide them into batches in TensorFlow with

train = train.batch(batch_size = 100)
val = val.batch(batch_size = 100)

The batch size of 100 is chosen so that each batch contains at least some stroke cases.
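
A quick check of this choice is to look at the fraction of stroke cases in the data and the number of cases this implies per batch (a sanity check, not part of the example scripts):

# fraction of stroke cases ('yes' has category code 1)
stroke_fraction = (data[target] == 1).mean()

# expected number of stroke cases per batch of 100 datapoints
print(100 * stroke_fraction)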

Option 2: Using TensorFlow Tensors#

Used in 3-stroke-prediction-oversampling.py and implemented in spellbook.inputs.separate_tfdataset_to_tftensors().

Just like before, TensorFlow tf.data.Datasets for training and validation are created. This time, they are each split into two separate tf.Tensors - one for the features and one for the labels:

train, val = sb.inputs.split_pddataframe_to_tfdatasets(data, target, features, n_train=7000)
print(train.cardinality()) # print the size of the dataset
print(val.cardinality())

# separate features and labels
train_features, train_labels = sb.inputs.separate_tfdataset_to_tftensors(train)
val_features, val_labels = sb.inputs.separate_tfdataset_to_tftensors(val)

Please don’t mind the increased size of the training set for now - this script uses oversampling to deal with the imbalance in the stroke categories. We will look into this in more detail later.

The advantage of this tf.Tensor-based approach is that it yields a separate object containing just the target labels, which can then be used when evaluating the model performance, e.g. when calculating a confusion matrix or a ROC curve from comparisons of the predicted labels against the actual true target labels.
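
For example, once a model has been trained (model setup and training are covered in the following sections, so the trained model object and its single sigmoid output are assumptions here; tensorflow is assumed to be imported as tf), the true validation labels can be compared directly against the model's predictions:

# predicted probabilities for the validation set
predictions = model.predict(val_features)

# threshold at 0.5 to obtain predicted class labels
predicted_labels = tf.squeeze(tf.cast(predictions > 0.5, tf.int32))

# confusion matrix of true vs. predicted labels
print(tf.math.confusion_matrix(val_labels, predicted_labels))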

Internally, spellbook.inputs.separate_tfdataset_to_tftensors() proceeds as follows:

# unpack the feature and label tensors from the dataset
# and reshape them into two separate tuples of tensors
features, labels = zip(*dataset)

# [...]

# which are then stacked to form the features and labels tensors
features = tf.stack(features)
labels = tf.stack(labels)

Finally, since tf.Tensors cannot be batched, the batching in this approach is left to the later call to tf.keras.Model.fit() after model setup.
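
Once the model is set up, the tensors are then passed to fit() together with a batch size, roughly along these lines (a sketch; the model object and the hyperparameter values are placeholders):

model.fit(train_features, train_labels,
          validation_data=(val_features, val_labels),
          batch_size=100, epochs=10)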

Option 3: Using NumPy Arrays#

Used in 4-stroke-prediction-oversampling-norm.py and implemented in spellbook.inputs.split_pddataframe_to_nparrays().

The third option is to split the dataset into numpy.ndarrays:

train_features, train_labels, val_features, val_labels \
    = sb.inputs.split_pddataframe_to_nparrays(
        data, target, features, n_train=7000)

print(train_features.shape) # print the size of the dataset
print(train_labels.shape)
print(val_features.shape)
print(val_labels.shape)

The function spellbook.inputs.split_pddataframe_to_nparrays() works like this in principle:

# first n_train rows for training, the following n_val rows for validation
train = data.iloc[:n_train]
val = data.iloc[n_train:].iloc[:n_val]

# extract the underlying NumPy arrays for features and labels
result = [train[features].values, train[target].values,
          val[features].values, val[target].values]

# [...]
return tuple(result)

Like with the tf.Tensors in the previous approach, the batching is left to the call to tf.keras.Model.fit() later after model setup.